Linear Regression Theory and Code in Python

1-D Linear Regression Theory and Code

In this article, we’ll examine your first machine learning algorithm: linear regression. The term “machine learning” may sound mysterious and complicated, but you have probably encountered it many times without even knowing it. Our focus will be on the line of best fit, which you probably studied in physics lessons at school.

Determination of the one-dimensional linear regression model and its solution

One remark before we start. We discussed one-dimensional linear regression in a previous article, which contains general information about linear regression and regression models. If you have already read it, feel free to skip this recap, because we are going to cover the same ground.

So, let’s examine a simple example. We are all familiar with Ohm’s law:

V = IR,

where V is the voltage, I is the current through the conductor, and R is the resistance. By the way, this is the equation of a straight line, which can be represented in the form y = mx + b. Indeed, if we let y = V, x = I, m = R, and b = 0, we get Ohm’s law.


So, let’s imagine a school experiment: you have a circuit with a battery providing voltage and a load (a resistor). Imagine also that the teacher does not tell us the resistance value of the resistor, and our task is to calculate it ourselves. Fortunately, we have several batteries with different voltages, and we have an ammeter that measures the current. We will connect these batteries to the circuit one by one and record the values of the voltage and the current. We will get a series of numbers:

V1, I1; V2, I2; …; Vn, In.

Then we plot them on the coordinate plane, with the current on the x-axis and the voltage on the y-axis, and almost without surprise we find that they line up along a nearly ideal straight line. Why do we say “almost”? As you know, there is an inevitable error in any experiment. It can be caused by measurement error, or by the fact that the copper wires we use also have their own resistance. Do not forget that physical equations are only idealized models of the real world. On this occasion, people often say: “The map is not the territory.”


What will we do next? We take a ruler and draw a straight line, the line of best fit, so that it passes through as many points as possible. Roughly the same number of points will then lie above and below the line.

When we have a straight line, we can calculate its slope, which is determined, as you know, by its angular coefficient. It shows how much the voltage changes with a given change in the current: ΔV/ΔI, or, in general,

\text{slope} = \frac{\Delta y}{\Delta x}.

Since V = IR, the value of the slope that we have found is the required resistance:

R = \text{slope} = \frac{\Delta V}{\Delta I}
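To see this numerically, here is a small sketch. The resistance value, the noise level, and the use of np.polyfit to fit the line are my own choices for illustration:

```python
import numpy as np

# Hypothetical experiment: true resistance R = 4.7 ohms, with a little
# measurement noise added to each voltage reading.
rng = np.random.default_rng(0)
I = np.linspace(0.1, 1.0, 10)             # currents in amperes
V = 4.7 * I + rng.normal(0, 0.05, 10)     # noisy voltage readings

# The slope of the best-fit line estimates R; the intercept should be near 0.
slope, intercept = np.polyfit(I, V, 1)
print(slope, intercept)   # slope should land close to 4.7
```

The fit recovers the resistance even though no single measurement is exact, which is the whole point of the line of best fit.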


Of course, we can apply this method not only to Ohm’s law. It’s not necessary to plot voltage on the y-axis and current on the x-axis specifically. As we have already said, the axes can represent anything. For example, it may be the dependence of blood pressure on age, or of blood pressure on body weight, or stock quotes depending on the mood on Twitter.

Scientists know that it does not matter what exactly we measure because the methods of data processing and analysis remain the same. I like to formulate this circumstance in one phrase “all data are the same!”. Whether you are a biologist or a financier, you use the same linear regression. There is no “linear regression for biologists” or “linear regression for financiers.” There is just a simple linear regression.

Therefore, it will be better if we plot a certain Y value on the y-axis and a certain value of X on the x-axis.

Now we reformulate our problem in a more general form with X and Y.


So, as before, we have a series of points: {(x1, y1), (x2, y2), …, (xN, yN)}.

We plot them on the coordinate plane. Then we have to find the line of best fit. Of course, we live in the era of computers and computer science, and there is a much more accurate way to find it than just to draw it on the paper with a ruler in your hand. How can you find it?

Let’s start with the fact that the desired line is straight and has the form y = ax + b, so the predicted value of y (we denote it by ŷ) has the form:

\widehat y_i = a x_i + b.

We need all yi obtained during the experiment to be as close as possible to the predicted ŷi, calculated from the given xi.

What should we do?

The first idea that comes to mind is to calculate the sum of the discrepancies between the actual value yi and the predicted value ŷi for all values of i from 1 to N:

\sum_{i=1}^{N} (y_i - \widehat y_i)


Why can’t we do this?

Suppose, in one case, we have a deviation of +5 units, and in the other one the deviation is -5 units. If we simply add them, we get zero, and it will seem that the deviation is zero, although in fact it is not. We need every deviation from the predicted ŷ to be represented as a positive number. The standard way to achieve this is to square the difference:

\sum_{i=1}^{N} (y_i - \widehat y_i)^2

Now all deviations of the actual value from the predicted one will be positive values if they are not equal to zero. The sum we have obtained is called the error sum of squares.
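As a quick illustration, the error sum of squares is one line of NumPy. The data points and the candidate line y = 2x + 1 below are invented for the example:

```python
import numpy as np

# Invented data and a candidate line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
y_hat = 2 * x + 1

# Squaring makes every deviation contribute a positive amount.
sse = np.sum((y - y_hat) ** 2)
print(sse)   # residuals are 0.1, -0.1, 0.2, -0.2, so SSE = 0.1
```

Note that the +0.1 and -0.1 residuals no longer cancel; each adds 0.01 to the total.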

Now we have an expression for calculating the value of the deviations. But we need to minimize it, and differential calculus allows us to do this easily. For example, suppose we have the formula

E = \frac{1}{2}(t - 1)^2,

and we want to find the value of t that minimizes E. We take the derivative and set it equal to zero:

\frac{dE}{dt} = t - 1 = 0 \Rightarrow t = 1.


In our case, things are not quite so simple, because we have more than one unknown. But let’s rewrite the formula and see what we have:

E = \sum_{i=1}^{N} (y_i - \widehat y_i)^2

Since ŷi = axi + b, we can write:

E = \sum_{i=1}^{N} (y_i - (ax_i + b))^2

Remember that yi and xi are the data we obtained in the experiment, and we need to find the a and b that characterize the slope of the straight line and the point of its intersection with the y-axis. Unlike the first example with E, we need to minimize with respect to a and b simultaneously, so we have to use partial derivatives.

So, the first thing we need is the derivative with respect to a. Here it is:

\frac{\partial E}{\partial a} = \sum_{i=1}^{N} 2 (y_i - (ax_i + b)) (- x_i).

Stop for a minute and, on a piece of paper, work through the derivative to see how we got it. Remember that we need to set the derivative to zero and solve the resulting equation. We get:

\sum_{i=1}^{N} 2 (y_i - (ax_i + b)) (- x_i) = 0;

- \sum_{i=1}^{N} y_i x_i + a \sum_{i=1}^{N} x_i^2 + b \sum_{i=1}^{N} x_i = 0.

Stop for a minute to understand how we got this result, and try to carry out the calculations on paper. Transfer the sum of yixi to the other side of the equation to get rid of the negative term:

a \sum_{i=1}^{N} x_i^2 + b \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} y_i x_i.

So, we have one equation with two unknowns. Do not forget that xi and yi are our data obtained during the experiment, and N is also known: it is the number of experiments. But we still need to take the derivative with respect to b. Let’s do it. We proceed exactly as before: find the derivative, set it equal to zero, and simplify the resulting expression:

\frac{\partial E}{\partial b} = \sum_{i=1}^{N} 2 (y_i - (ax_i + b)) (-1) = 0;

- \sum_{i=1}^{N} y_i + a \sum_{i=1}^{N} x_i + b \sum_{i=1}^{N} 1 = 0,

a \sum_{i=1}^{N} x_i + bN = \sum_{i=1}^{N} y_i.

Make a pause and try to get this result on paper by yourself.

So, we have the following system of two equations:

a \sum_{i=1}^{N} x_i^2 + b \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} y_i x_i,

a \sum_{i=1}^{N} x_i + bN = \sum_{i=1}^{N} y_i.

I hope you remember that we need to find a and b, since xi, yi, and N are known to us. Thus, we have two equations with two unknowns, which should give us a unique solution for a and b. We simplify our equations by introducing letter symbols:

C = \sum_{i=1}^{N} x_i^2, \quad D = \sum_{i=1}^{N} x_i, \quad E = \sum_{i=1}^{N} x_i y_i, \quad F = \sum_{i=1}^{N} y_i.

We have:

aC + bD = E,

aD + bN = F.

(Here the capital letters are just shorthands for the sums above; this E is not the error E from before.)
If you feel that you can solve this system of equations by yourself, I highly recommend doing so, since this is an excellent exercise in data processing and machine learning.
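If you would rather check the system numerically first, NumPy can solve it directly. A minimal sketch with made-up data (the sample values and variable names are my own; the points are placed exactly on y = 2x + 1 so the answer is known):

```python
import numpy as np

# Invented data points lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

N = len(x)
C = np.sum(x**2)      # sum of x_i^2
D = np.sum(x)         # sum of x_i
E = np.sum(x * y)     # sum of x_i y_i
F = np.sum(y)         # sum of y_i

# The system  aC + bD = E,  aD + bN = F  in matrix form.
A = np.array([[C, D], [D, N]])
rhs = np.array([E, F])
a, b = np.linalg.solve(A, rhs)
print(a, b)   # recovers the slope 2 and intercept 1
```

Solving the same system by hand, as we do next, gives the closed-form expressions for a and b.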

We multiply the first equation by D, and the second one by C:

aCD + bD^2 = DE,

aCD + bNC = CF.

Subtract the second equation from the first, so that a cancels out, and we can find b:

b(D^2 - NC) = DE - CF \Rightarrow b = \frac{DE - CF}{D^2 - NC}.
Now we do the same for a. Multiply the first equation by N, and the second one by D, and subtract the second equation from the first:

aNC + bND = NE,

aD^2 + bND = DF,

a(NC - D^2) = NE - DF.

Here we have the solution for a:

a = \frac{NE - DF}{NC - D^2}.
So, we have found a and b. Now we rewrite the expressions for them in full, replacing the letters with the sums of xi and yi, since those represent our experimental data:

a = \frac{N \sum_{i=1}^{N} y_i x_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2},

b = \frac{\sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i x_i - \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i^2}{\left( \sum_{i=1}^{N} x_i \right)^2 - N \sum_{i=1}^{N} x_i^2}.

Note that if you flip the sign of both the numerator and the denominator in the expression for b, the two expressions get the same denominator:

a = \frac{N \sum_{i=1}^{N} y_i x_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2},

b = \frac{\sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i^2 - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i x_i}{N \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2}.

The next thing we can do to simplify the expressions is to introduce mean values. Recall that the mean of the variable x is defined as the sum of all xi divided by the number of observations N:

\overline x = \frac{1}{N} \sum_{i=1}^{N} x_i.

The same can be done if there are both x and y in the expression:

\overline{xy} = \frac{1}{N} \sum_{i=1}^{N} x_i y_i.


We rewrite our expressions for a and b and then simplify them:

a = \frac{\overline{xy} - \overline x \cdot \overline y}{\overline{x^2} - \overline x^2}; \quad b = \frac{\overline y \cdot \overline{x^2} - \overline x \cdot \overline{xy}}{\overline{x^2} - \overline x^2}.
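These mean-based formulas translate directly into NumPy. A small sketch with invented data points scattered around y = 2x + 1:

```python
import numpy as np

# Invented noisy data around y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.0, 6.9, 9.1])

# a and b expressed through means, exactly as in the formulas above.
denom = np.mean(x**2) - np.mean(x)**2
a = (np.mean(x * y) - np.mean(x) * np.mean(y)) / denom
b = (np.mean(y) * np.mean(x**2) - np.mean(x) * np.mean(x * y)) / denom
print(a, b)   # close to the true slope 2 and intercept 1
```

Notice that np.mean(x**2) (the mean of the squares) and np.mean(x)**2 (the square of the mean) are different quantities; their difference in the denominator is the variance of x.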

Try to get this result by yourself with pencil and paper. Pay attention to the difference between the mean of the square and the square of the mean. Using the NumPy library, you can also take advantage of a handy trick. The point is that this library is specially optimized for matrix and vector operations. So never write something like this:

total = 0
for i in range(N):
    total += a[i] * b[i]

Instead, use the special dot function:

total =

or, equivalently:

total =, b)

In this case, the calculations run much faster than with an ordinary loop.

Programming linear regression of a one-dimensional model in Python

In this section of the article, we will start programming. We will write the code for a one-dimensional linear regression.

Let’s start from scratch so you can write the code along with us. The first thing we have to do is create a new file for the script. We already have a script that generated all the necessary data, so we do not need to search for data; it is enough to take the csv file from the GitHub repository.

First of all, we import the necessary NumPy and Matplotlib libraries in our code. In fact, we do not need Matplotlib to solve the problem itself, but we need it to visualize the process of solving it.

NumPy is an open-source library for the Python programming language. Capabilities:

  • Support for multidimensional arrays (including matrices);
  • Support for high-level mathematical functions which are used for working with multidimensional arrays.

Matplotlib is a library for the Python programming language for visualization of 2D data (3D graphics is also supported). The resulting images can be used as illustrations in publications.

import numpy as np

import matplotlib.pyplot as plt

Next, we load the data. Create two empty lists. You could use the csv reader, but we prefer to keep everything simple. First, we convert each x value into a real number and append it to the X list; the same is done with y. For our convenience, we use spaces rather than tabs for indentation, but that is entirely up to you.



X = []
Y = []

for line in open('data_1d.csv'):
    x, y = line.split(',')

Now we transform X and Y into NumPy arrays, so it will be more convenient to work with them:

X = np.array(X)
Y = np.array(Y)

Moreover, we display our data in graphical form to understand how they look, and launch

plt.scatter(X, Y)
The first run may take some time to draw. We see that the points are located more or less along a straight line. Now we use the equations we obtained earlier to calculate the a and b coefficients. Do not forget that the two expressions share a common denominator, so there is no need to calculate it twice; we assign its value to the variable denominator.

First, we calculate the sum. Of course, we could use the for loop, but this method is inefficient. It is much better to use the NumPy library, which works with arrays, and all calculations are much faster. We will use special functions of the library.

denominator =, X) - X.mean() * X.sum()

And that’s all. Now you understand why it is so good to use the NumPy library. Now calculate the numerators of the coefficients a and b:

a = (, Y) - Y.mean() * X.sum() ) / denominator

b = ( Y.mean() *, X) - X.mean() *, Y) ) / denominator

Now calculate our predicted Y.

Yhat = a*X + b

And we display all this in graphical form:

plt.scatter(X, Y)
plt.plot(X, Yhat)

Launch it one more time. The first graph shows the set of initial data points; the second shows the data together with the calculated line of best fit. As you can see, the line passes close to all the points, so our solution looks correct.

Determination of the model quality – R-square in a one-dimensional linear regression

Let’s discuss briefly how to determine the effectiveness of our model.

We have already discussed the definition of the linear regression model. We know how to build it, how to check for linear dependence, and how to derive the solution that gives the line of best fit. We also know how to implement this solution in code to find the coefficients that determine the desired straight line: the slope and the point of intersection with the y-axis.

Now we want a quantitative indicator of whether our model is good enough. For this purpose, the coefficient of determination R2 is usually used, in linear regression and in other models as well. The determination coefficient R2 is defined in the following way:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}},

where the residual sum of squares is simply the error sum of squares,

SS_{res} = \sum_i (y_i - \widehat y_i)^2,

and the total sum of squares is

SS_{tot} = \sum_i (y_i - \overline y)^2.


So, what does the coefficient of determination show? Suppose that the residual sum of squares is close to zero: SSres ≈ 0. Then the value of the determination coefficient is close to one:

R2 ≈ 1 – 0 ≈ 1.

If the value of the determination coefficient is one, this means that we have the ideal model.

The coefficient of determination is zero

What does the value of the coefficient of determination mean if it is zero?

If R2 is 0, it means that our model predicts no better than simply taking the mean value of y. In other words, the model is not very successful.

Why can this happen? If there is no clear trend in our data, then the line of best fit is just the mean value of the data. In this case, R2 turns out to be close to zero.

The coefficient of determination is negative

And what if the value of the coefficient of determination is negative?

If R2 < 0, it means that the model you have developed predicts even worse than simple averaging. Obviously, this is not good, especially if you worked hard on the model. In that case, you should stop and think about why the model works so badly, because a genuinely working model should predict the result much better than simply returning the mean value every time. Perhaps you misread the trend and fitted the wrong line of best fit.
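A tiny sketch makes the negative case concrete. The data and the deliberately wrong line below are invented:

```python
import numpy as np

# Data with a clear upward trend.
x = np.arange(10, dtype=float)
y = 3 * x + 2

# A deliberately wrong "model": a line with the opposite slope.
y_bad = -3 * x + 2

ss_res = np.sum((y - y_bad) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)   # far below zero: worse than predicting the mean
```

Because the wrong line misses the points by more than the mean does, SSres exceeds SStot and R2 drops below zero.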

Coefficient of determination in the code

So, we will write the code for calculating the coefficient of determination R2. The numerator contains the difference between the actual and predicted values, and we introduce a special variable for it. Keep in mind that our Y is a vector, and our variable Yhat is also a vector of dimension N, so their difference is also a vector. Conveniently, we can likewise subtract a single number, the mean, from each element of the vector Y; NumPy broadcasts it across all elements.

d1 = Y - Yhat

d2 = Y - Y.mean()

The coefficient of determination equals one minus the residual sum of squares divided by the total sum of squares. We already have the element-wise differences, and the dot product of each difference vector with itself gives the required sum of squares:

r2 = 1 -, d1) /, d2)

Give the command to print the result:

print("the r-squared is:", r2)

So launch the program. The coefficient of determination turns out to be 0.99, which is very close to one. No surprise there, as I intentionally chose the data to produce this result.

Demonstration of Moore’s law in the code

Let’s examine Moore’s law in Python code. The first thing we start with is importing libraries. Let’s begin with re, the regular expressions module, because we need it for one small trick. We also import NumPy and Matplotlib.

import re

import numpy as np

import matplotlib.pyplot as plt

Next, we create empty lists for X and Y. Besides, we need to strip everything that is not a digit from the input values. If you do not know how to do this with regular expressions, do not worry; it does not play an important role in our course.

X = []
Y = []

non_decimal = re.compile(r'[^\d]+')

for line in open('moore.csv'):
    r = line.split('\t')
    x = int(non_decimal.sub('', r[2].split('[')[0]))
    y = int(non_decimal.sub('', r[1].split('[')[0]))
    X.append(x)
    Y.append(y)

Do not forget that the years are our x, and they are in the third column of the table, while the number of transistors is our y, in the second column. We also convert our data into NumPy arrays, since it is easier to work with them:

X = np.array(X)

Y = np.array(Y)

 And we display our data in a graphical form.

plt.scatter(X, Y)

Next, we will find the logarithm of the number of transistors and also display this graphically, so you can see that the points are located approximately along the straight line.

Y = np.log(Y)

plt.scatter(X, Y)

Now we can find the solution for our linear regression. We calculate the denominator, a, b, and the desired line ŷ.

denominator =, X) - X.mean() * X.sum()

a = (, Y) - Y.mean() * X.sum() ) / denominator

b = ( Y.mean() *, X) - X.mean() *, Y) ) / denominator

Yhat = a*X + b

 When we calculate the desired linear regression, of course, we want to see the result.

plt.scatter(X, Y)

plt.plot(X, Yhat)


Our next task is to determine how good our linear model is by calculating the coefficient of determination. So, let’s calculate how far we are from true values. The results of the calculations will be displayed on the screen.

d1 = Y - Yhat

d2 = Y - Y.mean()

r2 = 1 -, d1) /, d2)

print("a:", a, "b:", b)

print("the r-squared is:", r2)

Now let’s determine how long it takes for the number of transistors on a chip to double. To do this, I will perform a few more mathematical steps, but I think it will not be difficult for you to follow.

So, the logarithm of the transistor count equals a multiplied by the year plus b: ln(tc) = a * year + b; in fact, this is the linear regression we have found. Consequently, the number of transistors is tc = e^b * e^(a*year). Doubling the count means increasing a * year by ln 2, so the doubling time is ln(2)/a years.

print("time to double:", np.log(2)/a, "years")

Launch the program to see what we have. The first diagram shows the number of transistors on a chip from year to year. These are our input data. As we see, this is exponential growth. The second diagram shows the logarithm of the number of transistors, and the result we have is already very similar to the straight line. In the third diagram, we see both the logarithm of the number of transistors from year to year and the line of best fit that has been calculated.

We also see that our determination coefficient is 0.95. It is an excellent indicator, and it means that our dependence is really linear. The doubling time is approximately 1.97 years, which is very close to two years. Consequently, the doubling time of the number of transistors on a chip is really about two years.
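As a final sanity check on the doubling-time formula ln(2)/a, here is a small sketch with hypothetical coefficients (the numbers are invented for illustration, not the values fitted from our data):

```python
import numpy as np

# Hypothetical fit ln(tc) = a*t + b, where t is years since 1970
# (both coefficients are made up for this check).
a, b = 0.35, 7.8

def transistors(t):
    return np.exp(a * t + b)

dt = np.log(2) / a                       # doubling time implied by the fit
ratio = transistors(5 + dt) / transistors(5)
print(dt, ratio)   # after dt years the count is exactly twice as large
```

Whatever starting year we pick, moving forward by ln(2)/a years multiplies the exponential by exactly two, which is why the code above simply prints np.log(2)/a.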
