The role of linear regression in machine learning

Linear Regression in Python

Welcome to our introductory course on machine learning, data processing, and linear regression in Python. This series of educational articles is a good starting point if you are interested in machine learning and research on the subject but have no idea what to do first. Once you have studied linear regression, you can proceed to logistic regression.

We start with concepts we believe are familiar to everyone. If you studied maths or physics in high school, you have already encountered the line of best fit; you know how to find a derivative and what vectors and matrices are.

The information in our articles is aimed mainly at practical application. That's why we won't limit ourselves to definitions: you will put them to use in code. Besides, Python is a relatively easy language to learn. It will help if you already know loops, conditionals, and other programming essentials.

It's great if you have written some programs and have some experience with NumPy and Matplotlib, but it isn't critical. The code examples are short enough to understand at a glance.


Of course, we don't give you any marks, so you must set your own criteria for judging whether you have learned the material. For example, if you can derive an equation that was given as an example, and can write the corresponding code without looking it up again, then you have mastered the material and can move on.

So let's look at the contents of the series to see what awaits you:

The aims of the Python and linear regression series

  • First, we'll explain what problems machine learning deals with and what linear regression is.
  • Then we'll work through an example: we'll use linear regression to verify Moore's law, which states that the number of transistors on a chip doubles every two years. Because Moore's law is one of the best-known observations about computing progress, this demonstrates how broadly applicable linear regression is to problems in the world around us.
  • Once we have introduced the problem, we'll give you all the tools needed to solve it successfully.

The main idea is to find the line of best fit. We'll approach it somewhat abstractly and interpret every problem geometrically. This matters because it lets you see, visually and immediately, the issues being discussed.

Some people believe machine learning is just feeding in sets of numbers, but nothing could be further from the truth. Stating the problem in geometric form makes the issues under consideration, and their solutions, immediately comprehensible and easy to visualize.

  • Then we'll complicate the task by adding more input variables.

In Moore's law there is only one input variable: time. It's truly astonishing that this is all we need to predict technological progress! Compare that with image analysis, where we have to deal with thousands or even millions of input variables. Of course, we won't leap straight from one input variable to millions.

We'll show you how to go from a model with one variable to a model with two. Once you understand how to do that, it will be easy to build models with any number of variables.
Finally, we'll discuss some advanced machine learning concepts and give a general overview of directions you can pursue after finishing our educational articles. Above all, we want to encourage you to study machine learning in depth. It is the most popular technique today and, for creating artificial intelligence, is in fact second only to the brain.

We will provide the basic tools you need to move to the next level. In closing, it's worth adding that the Pandas library makes Python a powerful tool for data analysis and a good alternative to Excel when working with large amounts of data. The topic is far too deep to cover in detail in this article; we plan to write a separate series of articles showing its power.

Machine learning. The role of linear regression

Original source: Williams, E.J. (1959). Regression Analysis. John Wiley & Sons, New York. Page 43, Table 3.7

The first thing to discuss is the basic concepts without which you can't understand the topic.

You need the basics of differential calculus to find the minimum of a function: how to take a derivative and partial derivatives, and why you need them. You also need the basics of Python: loops, conditionals, and arrays. From a course of linear algebra you will need vectors, matrices, and the rules for operating on them. Finally, you should be familiar with probability theory, including concepts such as the Gaussian (normal) distribution, mean, and standard deviation.

So, what does machine learning deal with? Machine learning attempts to predict future events by analyzing past experience: for example, forecasting tomorrow's weather, predicting stock prices, or diagnosing a patient's illness from his or her medical history.

Machine learning deals with other problems as well, but these will be the basic ones in our educational articles.

There are two large categories, two classes of algorithms, which we are going to discuss.

  • The first class is called supervised learning. Here there is a known, direct dependency between the input data and the final result. The data under study can be represented as pairs of x and y, and each x value can be used to predict the corresponding y value.
  • The second class is unsupervised learning. In this case there is no known dependency, and all we can do is try to examine the structure of the data we have, if there is one at all. For example, imagine a large pile of documents. What can we say about them? We might simply find that most of the ones containing the word "Physics" also contain the word "Einstein".

Linear regression belongs to supervised learning. We have a set of x and y values; if we plot them on a graph, we can consider each x value as an input variable that produces the output value y.

Supervised learning is further divided into two large categories: classification and regression.

  • Classification attempts to identify the category of the input data, or the presence or absence of certain features: for example, recognizing a handwritten digit, or determining whether there is a cat in a picture.
  • Regression predicts a specific number or vector, e.g., tomorrow's temperature or the price of Google shares.

Let's look at linear regression, which belongs to supervised learning: there is a direct dependence between x and y, and the output value y is a number. Many people call regression "the line of best fit", because if you plot the data points (x, y) on a graph and draw the appropriate straight line through them, you get the line of best fit. Linear regression is one of the ways to do this.


The linear regression model is

y = f(x, b) + \varepsilon, \quad E(\varepsilon) = 0,

where b denotes the model parameters and \varepsilon the random model error. The function f is linear in the parameters:

f(x, b) = b_0 + b_1 x_1 + \dots + b_k x_k,

where b_j are the regression parameters (coefficients), x_j are the regressors (model factors), and k is the number of model factors.

Definition of a one-dimensional model and its solution

Let's discuss the formulation of the linear regression problem and derive a solution in the one-dimensional case using differential calculus.

So let's start by setting the task. We have input data x1, x2, …, xN and outputs y1, y2, …, yN. Plotting them on the coordinate plane gives a cloud of points. Perhaps they don't form a straight line at all, and you have no desire to use linear regression on them. But if you do decide to, you will need to find what is sometimes called the line of best fit, which is a straight line. You remember from school mathematics how a straight line is defined:

\hat{y} = ax + b.

Here \hat{y} denotes the predicted value of the output variable, in contrast to the ordinary y, which is given to us as the result of the experiment. So we need to find the values of the coefficients a and b.

Now we'll do what every machine learning problem requires: define the objective function. It is also called the error function or loss function, depending on the field you work in.

So, let's define the error of our model. A first thought is to sum the deviations of the actual values from the predicted ones:

\sum_{i=1}^{N} (y_i - \hat{y}_i)

But that won't work. Suppose in one case the actual value differs from the predicted one by 5 units, and in another by −5 units. The two deviations simply cancel out, and the error comes to zero, which of course it isn't. We need the error always to be positive. How can we do that? Square the difference: any squared difference is non-negative:

\sum_{i=1}^{N} (y_i - \hat{y}_i)^2

Here you may ask: why not use the absolute value instead of the square? Brilliant question! But as you'll see later, solving the problem requires a derivative, and the absolute value function is awkward to differentiate.
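As a quick illustration of why squaring works, here is a minimal sketch assuming NumPy (the function name squared_error is our own):

```python
import numpy as np

def squared_error(y, y_hat):
    """Sum of squared deviations of actual values from predicted ones."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sum((y - y_hat) ** 2)

# Deviations of +5 and -5 no longer cancel: each contributes 25.
e = squared_error([10, 10], [5, 15])
```

Without the square, those same two deviations would sum to zero and falsely report a perfect fit.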

So we have the error function, and we want to minimize it: take the derivative and set it equal to zero. Since we have two parameters a and b, we must take partial derivatives. We can be sure a minimum exists, because the function is quadratic in a and b and opens upwards.

So what do we have? We have the error function:

E = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

And we need to find the values for which the partial derivatives are zero:

\frac{\partial E}{\partial a} = 0, \quad \frac{\partial E}{\partial b} = 0.

Let's compute them. The partial derivative with respect to a is:

\frac{\partial E}{\partial a} = \sum_{i=1}^{N} 2 (y_i - \hat{y}_i)(-x_i)

Since \hat{y}_i = a x_i + b, we can rewrite the equation and set it equal to zero at once:

\sum_{i} y_i x_i = \sum_{i} \hat{y}_i x_i,

\sum_{i} y_i x_i = \sum_{i} (a x_i + b) x_i = a \sum_{i} x_i^2 + b \sum_{i} x_i.

Now the partial derivative with respect to b:

\frac{\partial E}{\partial b} = \sum_{i=1}^{N} 2 (y_i - \hat{y}_i)(-1) = 0.

Setting the whole expression to zero gives

\sum_{i} y_i = \sum_{i} \hat{y}_i = \sum_{i} (a x_i + b) = a \sum_{i} x_i + bN.

Divide this expression by the number of observations N:

\frac{\sum_i y_i}{N} = a \frac{\sum_i x_i}{N} + b.

Writing \bar{x} and \bar{y} for these means, we get

\bar{y} = a\bar{x} + b.

Then we have the final solution:

a = \frac{\sum_i x_i y_i - \bar{y} \sum_i x_i}{\sum_i x_i^2 - \bar{x} \sum_i x_i},

b = \frac{\bar{y} \sum_i x_i^2 - \bar{x} \sum_i x_i y_i}{\sum_i x_i^2 - \bar{x} \sum_i x_i}.

So we have two equations in two unknowns. You can solve the system by substitution or with matrix algebra; we prefer substitution. Solve this system yourself as an exercise.

Note that the denominators of both expressions are the same. We can use this fact when writing the code in Python.
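Assuming NumPy is available, the closed-form solution above might be sketched like this (the function name fit_line is our own); the shared denominator is computed only once:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form one-dimensional linear regression: y_hat = a*x + b."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Both a and b share the same denominator, so compute it once.
    denom = x.dot(x) - x.mean() * x.sum()
    a = (x.dot(y) - y.mean() * x.sum()) / denom
    b = (y.mean() * x.dot(x) - x.mean() * x.dot(y)) / denom
    return a, b

# Points lying exactly on y = 2x + 1 should recover a = 2, b = 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```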

We highly recommend redoing all the calculations on your own to be sure they are correct.

This topic is discussed in detail in our article.

The statement of a multidimensional problem and its solution

Let's discuss multidimensional linear regression, also called multiple linear regression. We'll start with the problem statement, define the objective function, and then derive the solution. So we have input variables

X = x1, x2, …, xN, but now each xi is not a number: it is a vector of dimension D. For example, we might model a company's share price as a function of both its Twitter reviews and its LinkedIn reviews. The whole input then appears as a table, or matrix, with N rows and D columns, that is, of dimension N×D. Each row is the result of one observation, and each element xij of the matrix is observation i in dimension j, where i = 1..N and j = 1..D.

With multiple linear regression,

ŷ = w1x1 + w2x2 + … + wDxD.

If we rewrite it in vector form, where

w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix},

we can write

ŷ = wTx.

The transposition of w is needed because matrix multiplication requires matching dimensions: multiplying a matrix of dimension 1×D by a matrix of dimension D×1 gives the required scalar value of dimension 1×1.

Recall that in simple linear regression we had an absolute term b, which marked the point where the line crosses the y axis. Here we can absorb it without any problems: we simply introduce a new variable x0 and set its value equal to one:

ŷ = w0 + w1x1 + … = w0x0 + w1x1 + … = wTx.
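As a small illustration (NumPy assumed, with made-up numbers), absorbing the intercept amounts to prepending a column of ones to the data matrix:

```python
import numpy as np

# Hypothetical design matrix: N = 4 observations, D = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])

# Absorb the intercept: prepend a column x0 = 1 to every observation,
# so that w[0] plays the role of the old absolute term b.
X_with_bias = np.column_stack([np.ones(len(X)), X])
```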

So, let’s continue finding the solution. The objective function remains the same as before. The only difference is that ŷ now looks a little different:

E = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - w^T x_i)^2.

Now everything is still the same. We must find the partial derivatives and equate them to zero. Thus, we must find D derivatives and obtain D equations with D unknowns:

\frac{\partial E}{\partial w_j} = \sum_{i=1}^{N} 2 (y_i - w^T x_i)\left(-\frac{\partial (w^T x_i)}{\partial w_j}\right).

Since w^T x_i = w_0 x_{i0} + \dots + w_j x_{ij} + \dots + w_D x_{iD}, we obtain:

\frac{\partial (w^T x_i)}{\partial w_j} = x_{ij}.


\frac{\partial E}{\partial w_j} = -\sum_{i=1}^{N} 2 (y_i - w^T x_i)\, x_{ij} = 0,

\sum_{i=1}^{N} y_i x_{ij} = \sum_{i=1}^{N} (w^T x_i)\, x_{ij}.

Both sides of this equation are scalars. Writing x_j for the j-th column of X (an N×1 vector), we can rewrite it in matrix form:

x_j^T y = w^T (X^T x_j).

It's essential to check the dimensions of the matrices since, as you know from linear algebra, you can only multiply matrices with matching inner dimensions. Let's check. On the left side, the transposed column vector x_j has dimension 1×N, and the vector y of output results has dimension N×1; consequently, their product is a scalar of dimension 1×1. On the right side, w^T has dimension 1×D, the matrix X^T has dimension D×N, and x_j has dimension N×1. Therefore, the product X^T x_j has dimension D×1, and the total product has dimension 1×1 again. So everything is correct: both the left and right sides are scalars. Stacking all D of these equations, we can write a single matrix equation:

XTy = wT(XTX).

As you remember, we have to check the dimensions, so let's do it now. On the left side, the matrix X^T has dimension D×N and y has dimension N×1, so their product has dimension D×1. On the right side, w^T is 1×D and the product in brackets is D×N times N×D, that is, D×D, so the total dimension of the right side is 1×D. That is a problem! Why did this happen? Because the left side is a column vector, whereas the right side is a row vector. For everything to be right, we must transpose the right side.

As you understand, transposing a product reverses the order of its factors (and X^TX is symmetric), so the result takes the form:

 [wT(XTX)]T = (XTX)w.

After transposing, the dimension is D×1. Now we can rewrite our equation, from which it is easy to find the required w:

XTy = (XTX)w,

w = (XTX)-1XTy.

That is the solution for multiple linear regression. Conveniently, Python's numerical libraries have special functions for solving equations of the form Ax = b; that is why in code we set A = X^TX, b = X^Ty and solve for x = w. You'll see in our other articles exactly how we implement this solution in code.
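A minimal sketch of this with NumPy and made-up data: we solve the normal equations (X^TX)w = X^Ty with np.linalg.solve rather than forming the inverse explicitly, which is both faster and numerically safer:

```python
import numpy as np

# Hypothetical data: N = 5 observations, a bias column plus D = 2 features.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 4.0, 1.0]])
true_w = np.array([1.0, 2.0, -0.5])
y = X @ true_w  # noise-free targets, so the fit recovers true_w exactly

# Solve the normal equations (X^T X) w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
```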

We want to note that this topic has been analyzed in detail in the article.

Examples of linear regression

Now let's take a couple of examples of linear regression from the world around us. One example is the relationship between the number of hours spent in the gym each week and the body mass index. Another is the dependence between the marks a student gets at school and the number of lessons he or she attends.

In these examples the dependence is easy to understand, but in general we must be careful when determining cause-and-effect relationships. In the gym example it is obvious that the more you train, the healthier you will be and the better your mass index. The only thing left to do is find the line of best fit, which lets us determine more accurately how one quantity depends on the other.

But this is not always the case. In the other example, we might find the line of best fit and discover that the two variables correlate with each other.

That does not mean one causes the other. The number of classes you attend at school and your marks may well be related, but perhaps for a different reason: for example, your parents simply made you study more at home.

Introduction to the problem of Moore’s Law

One of the most remarkable features of linear regression is that it is applicable to a wide range of questions, even ones that seem completely unrelated to it.

First of all, we need to find out whether there is a dependence between our quantities. Now we'll show how linear regression verifies one of the most famous laws of computing, Moore's law, which states that the number of transistors in an integrated circuit doubles approximately every two years:

T0 = A0; T1 = 2·A0; T2 = 2·2·A0; …; Tn = 2^n·A0.

Wait, you will say. That isn't linear regression! Isn't that exponential growth?

Great question! Indeed, doubling the value every two years is exponential growth. Computers keep becoming more powerful, and if you look closely, the power increases not at a constant pace but in an avalanche-like manner: your smartphone today is more powerful than many laptops were ten years ago!

So how will we use linear regression? The point is that we won't model the number of transistors itself; we'll model its logarithm. Taking logarithms turns the exponential growth into a straight line, log Tn = n·log 2 + log A0, and the problem into linear regression:



Plotting time on the x axis, and the logarithm of the number of transistors on the y axis, we can solve the problem.
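Here is one possible sketch, with made-up transistor counts standing in for the real Wikipedia data: fitting a straight line to log2 of the counts recovers the doubling time.

```python
import numpy as np

# Hypothetical (year, transistor count) pairs that follow a perfect
# doubling every two years; real data would come from Wikipedia.
years = np.array([1971, 1973, 1975, 1977, 1979], dtype=float)
counts = 2300 * 2.0 ** ((years - 1971) / 2)  # doubles every 2 years

# Fit log2(count) = a * year + b; the slope a is doublings per year.
a, b = np.polyfit(years, np.log2(counts), deg=1)
doubling_time = 1 / a  # years needed for the count to double
```

With real data the points won't lie exactly on a line, but the fitted doubling time should still come out close to two years.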

In our further articles you'll gain all the knowledge needed to solve this problem successfully, so don't wait for a ready-made solution if you feel strong enough to do the task yourself. Our aim is to help you see and understand the idea of linear regression visually, in geometric form, and to learn to apply it creatively to problems of the surrounding world, such as Moore's law.

If you want to work on your own:

Once you have learned to implement one-dimensional linear regression in code and to calculate the standard deviation to verify the correctness of the found line of best fit, you can solve this task entirely on your own. The data are on Wikipedia: the transistor counts are in the second column, the years in the third. The task in this example is to show, first, that there is a linear dependence between the logarithm of the number of transistors and the year and, second, that the number of transistors really does double every two years.

A Brief Overview of Significant Questions of Linear Regression and Machine Learning

First of all, thank you for reading this far. We would like to give you some motivation on your way to becoming an effective data processing specialist.

There are some issues we haven't discussed in this article, even though they are essential for using linear regression to its full capacity.

First, there is the topic of generalization error and of training and validation data sets. The essence is this: to assess a model, the available data are divided into k parts; the model is trained on k−1 parts, and the remaining part is used for validation. Repeating this for each part gives k different errors, and their mean and standard deviation serve as a measure of the model's accuracy, showing how much confidence we can place in it.
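A rough sketch of this k-fold procedure for one-dimensional regression (NumPy assumed; kfold_scores is our own name):

```python
import numpy as np

def kfold_scores(x, y, k=5):
    """k-fold sketch for 1-D linear regression: train on k-1 folds,
    measure squared error on the held-out fold, repeat for each fold."""
    idx = np.arange(len(x))
    rng = np.random.default_rng(0)
    rng.shuffle(idx)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        a, b = np.polyfit(x[train], y[train], deg=1)  # fit on k-1 folds
        pred = a * x[test] + b
        errors.append(np.mean((y[test] - pred) ** 2))
    # Mean and spread of the k errors measure the model's accuracy.
    return np.mean(errors), np.std(errors)

x = np.linspace(0, 10, 50)
y = 3 * x + 1  # noise-free line, so every fold's error is ~0
mean_err, std_err = kfold_scores(x, y)
```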

Next is standardization. Imagine that in our linear regression model one input variable ranges from 0 to 1 and another from 0 to 1000. The final error will be skewed towards the second variable. As a rule, for each incoming column of data we subtract the mean and then divide by the standard deviation, so that the transformed data have mean zero and standard deviation one. This process is also called normalization.
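In NumPy, this column-wise standardization is a one-liner (the numbers below are made up):

```python
import numpy as np

# Hypothetical two-column data: one feature in [0, 1], one in [0, 1000].
X = np.array([[0.1, 900.0],
              [0.5, 100.0],
              [0.9, 500.0],
              [0.3, 300.0]])

# Column-wise standardization: subtract each column's mean,
# then divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this, both columns are on the same scale, so neither dominates the error.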

The next thing to mention is regularization. The data we have worked with are simple. Real data are much more complex, and you can easily find yourself in a situation where most of your input values lie between −1 and 1 while a few equal a million. Such points are called outliers. In its drive to minimize the error, linear regression will pull the line of best fit towards these outliers and away from the rest of the data.

Finally, there are machine learning algorithms other than linear regression. We are at the starting point of many of them, and it is hard to choose where to continue. Start with the basics, strengthen the fundamental concepts, and only then take on more complicated regression models.

Even if you are familiar with complex machine learning algorithms, it is always worth starting with the basics to consolidate them. Remember how we discussed that excessive complication can lead to a poor final result?

Practice using the acquired knowledge

The best way to practice machine learning is to use it to solve real problems. In our opinion, people generally divide into two types on this question.

  1. The first type is interested in machine learning in general and wants to use it for data processing. For them, the theory alone is enough to stay interested: they are fascinated by the modeling algorithms and want to learn more.
  2. The second type is interested in a concrete, practical application. For example, you are a doctor and want to know how two medical test results correlate.
  3. Nevertheless, we have found that there is in fact a third type: people who want to apply the theory in practice but do not know where to start. If you have not found anything interesting for yourself, try watching the news, going to the library, or doing something else that stimulates your creativity.

Free access to databases


  • First, collect data in the field you would like to investigate and measure the correlation.
  • Second, download the data, write the code, and don't forget to split the data into training and test sets.
  • Third, decide whether some variables can be used to predict others.
  • Fourth, estimate the accuracy of the model using the training and validation data.
  • Fifth, plot the dependent variable against the argument and perform a simple one-dimensional linear regression. Which model gives the best coefficient of determination?
  • Sixth, determine how many variables you really need to get the best coefficient of determination on the test data. Remember to add variables one at a time, and only as long as each addition improves the coefficient of determination on the training set.
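The sixth step can be sketched as a greedy forward-selection loop (NumPy assumed; the data are synthetic and all names are our own):

```python
import numpy as np

def train_r2(X, y):
    """Training-set coefficient of determination for a least-squares fit
    (with a bias column) on the given feature subset."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ w
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic data: y depends on columns 0 and 2; column 1 is irrelevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=100)

# Greedy forward selection: at each step add the single variable that
# most improves the training R^2, stopping when nothing improves it.
chosen, best = [], 0.0
for _ in range(X.shape[1]):
    candidates = [j for j in range(X.shape[1]) if j not in chosen]
    scores = [train_r2(X[:, chosen + [j]], y) for j in candidates]
    best_idx = int(np.argmax(scores))
    if scores[best_idx] <= best:
        break
    chosen.append(candidates[best_idx])
    best = scores[best_idx]
```

In a real project you would then compare the selected subsets on the held-out test data, since training R^2 alone always favors adding more variables.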


You will read about most of the above in the Linear Regression in Python series, so the most independent of you, I believe, are already doing it.

Now we will list places where you can get free data for analysis, and perhaps find something new for yourself:

The link below contains data specifically for linear regression.

There are data on systolic blood pressure, crime rates and films.

Government websites also have a lot of data, including financial data, census data, data on public health, agriculture, and much more.

You can even collect your own data from the Internet if you have your own ideas on the topic. In our opinion, Python and the BeautifulSoup library will be very useful for this.

Do not forget that in addition to regression and classification, there is also supervised and unsupervised learning.

Good luck in the field of data processing!
