# How is multiple linear regression used in machine learning?

## Multiple linear regression and polynomial regression

In this article, we will examine the case where we have several input dependent variables. By the way, it is the most realistic and typical case. In fact, as a rule, several factors influence the final result at the same time. For example, it is better to predict blood pressure taking into account not only the age of a person but also his/her weight. Or, for instance, the price of housing can depend on both its net floor area and the area where it is situated. It means several input variables affect the output quantity, and it is often important to take all of them into account to achieve better forecast accuracy.

We`d like to remind that we started with a one-dimensional linear regression, where we had input data {(x1, y1), (x2, y2), …, (xN, yN)}.

## Defining the multidimensional problem and deriving the solution

Now, since several factors are being considered at once, so our xi becomes a vector. It is also called a feature vector. The number of elements of a vector is called dimension and, as a rule, it is represented as  D. Our model has the form

ŷ=wTx+b.

Since we need to multiply x by the transposed vector of parameters w, the vector w must also have the dimension D. Notice that we can always include the free term b in the parameter vector w. This is because we can rename b to w0 and add x0 to it, and set x0 = 1:

ŷ = b + w1x1 + … + wDxD,

ŷ = w0 + w1x1 + … + wDxD,

ŷ = w0x0 + w1x1 + … + wDxD = w’Tx’, x0 = 1.

It is equivalent to adding a column of ones to our data matrix X that has the original dimension NxD).

Why does our data matrix have the dimension NxD? Do not forget N is the number of experiments (observations), and D is the number of input factors. If we take one line from X, it means that we have taken the result of only one observation. So we get a column vector xi of dimension 1xD, which is а feature vector. But since vectors are usually considered as column vectors of dimension Dx1 in linear algebra, in the case of one experiment we have to transform the formula as follows:

ŷ=wTx.

If we want to calculate the value of the output variable ŷ for all observations of N simultaneously, we have to write It is because the column vector w has the dimension Dx1, and the data matrix X has the dimension NxD, and as you know, if you want to multiply the matrices correctly, it is necessary that their respective dimensions coincide. Then, as a result, we obtain the correct column vector of dimension Nx1. Of course, at first, it looks a bit strange, as our w is now on the right.

To clarify this idea, let’s take a simple example. Let N = 4 it means the number of observations we have made is four, and D = 3, that is the number of input factors being studied is three. Thus, our data matrix X has a dimension of 4×3. Let w be  vector with three elements 1, 2, 3: If we want to calculate the value of the predicted output variable y1, we need to calculate the scalar product of the matrices: Since the scalar product has a symmetry property, we can rewrite our product, leaving the parameter vector w and transposing the vector of signs x1 instead. Thus, we obtain the same result. Note that in order to get such result, we must initially have a column vector on the left, so that after transposition we can multiply the matrices.

But such calculations are ineffective. We would like to use the capabilities of NumPy as much as possible and do few operations at the same time. Do not forget, NumPy is specially optimized for numerical calculations, so we cannot calculate y1, y2, y3, and y4 separately. We want to calculate them all at once and that all four y are in the same array of dimension four. We can do this by multiplying the entire matrix X by w. It may seem confusing at first. But do not forget that all our xi are horizontally located in one line in the data matrix X. If we consider xi separately, it is taken as a column vector, so the data go vertically. Just always remember the “golden rule”: if we multiply matrices, their internal dimensions must coincide.

So, let`s work on the problem statement. How do we calculate w? Here we face with already familiar calculations. The error function has not changed, and the variables y and ŷ still remain scalar quantities. But since the task has become more complicated, now this expression has a slightly different form: But we can still take derivatives of all w elements. Let’s call them wjj=1..D. If you have not immediately understood how we get this solution, it may be easier for you to rewrite the equation in the scalar form. In any case, stop for a moment to check whether you have understood the solution. Our X is a matrix, so its elements can be numbered by two indices xij, where i indicates the number of the observation, and j is the number of the factor.

As you remember, the next action is to equate the derivative to zero. Since j has the dimension D and runs through all values from 1 to D, then we have D equations with D unknowns. Let me remind you that we need to find w. If you are familiar with the matrix calculus, you do not need to do all this work, but in this lecture, I suppose that it is not so.

The first thing we need to do is to multiply the terms and transfer the negative part to the other part of the equation to obtain only positive quantities.  It is a kind of result, but we still need to allocate w. First of all, we can take w out of the sum, since the summation is over i, and w does not depend on i. The next thing we need to do is to understand that this solution is only for one value j. We write out the equations for all values of j:   We would like to remind the definition of a scalar product: Therefore,we can write it in the matrix form

wT(XTX) = yTX

Stop here and try to get the same result on the paper.

Since we only need to obtain the w value but not the transposed one, we transpose both sides of the equation.

wT(XTX) = yTX

[wT(XTX)] T = [yTX] T

(XTX)w = XTy.

Note that, for example, if there is a linear dependence Ax = b, the solution is equal to the product of b by the A inverse:

Ax = b ⇒ x = A-1b.

Accordingly, all we need to do is to perform the same operation to allocate w.

(XTX)w = XTy

w = (XTX)-1XTy.

This is the solution for multiple linear regression.

As in the case of a one-dimensional linear regression, NumPy allows you to accelerate calculations in comparison to direct calculations. The operations for finding a linear regression solution are so common that for this purpose there is a special function solve in NumPy. Therefore, all we need is to trigger it in the code and set the values of the coefficients A and b.

Ax = b ⇒ x = np.linalg.solve(A, b)

w = (XTX)-1XTy = ⇒ w = np.linalg.solve(XTX, XTy).

I often update my courses, and in this case I also want to make some improvements compared to my initial lecture on the multiple regression. So if you have understood the material of this lecture, you can skip the next one, because there you can find the same information. However, if you want to revise the material and see how I derive the equations manually, do not hesitate to look through the next one.

## Solution of multiple linear regression using matrices

I suggest solving the problem of linear regression using only matrices. We will get a solution that is more “familiar” to us and learn how to work with matrices, as they are continuously used in data processing and in-depth training. Therefore, now it is essential to get used to working with them.

To begin with, let’s take our cost function, which is now defined in a scalar form, and transform it into a matrix function: where t is the target matrix of Nx1, y is the matrix of predicted data of dimension Nx1.

y = Xw,

where X is a matrix of dimension NxD, w is a vector of dimension Dx1, so when we multiply them we get, the result of the dimension Nx1. Now we need to calculate the derivative.

Note that you can use the matrix calculus course book for your reference here.

At first, we derive the expression with J.

J = (t-y)T(t-y) = tTt-tTy-yTt+yTy = tTt-tTXw-(Xw)Tt+(Xw)T(Xw).

Next, we find the derivative and equate it to zero. XTXw = XTt

And find our solution.

w = (XTX)-1XTt.

## Multiple Regression Solution in the code of Python

In this section of the article, we will write a code for solving the multidimensional linear regression.

We have already had some data. Thus, our matrix X has dimension two. The number of observations is 100. Therefore, we have D = 2, N = 100.

Create a new file and call it lr_2d.py. Then we import the necessary libraries. Note that we also need Mplot3d to build a 3D chart in addition to the already familiar NumPy and Matplotlib,

Matplotlib is a library in the Python programming language for visualizing 2D (2D) graphic data (3D graphics is also supported). The images people get can be used as illustrations in publications.

import numpy as np

from mpl_toolkits.mplot3d import Axes3D

import matplotlib.pyplot as plt

First of all, we load the data as before. Create empty lists X and Y and load the data from the file. As you remember, we also have a free term that characterizes the point of intersection with the ordinate axis, so we will also create our x0 and set its value as 1.

X=[]

Y=[]

for line in open(‘data_2d.csv’):

x1, x2, y = line.split(‘,’)

X.append([float(x1), float(x2), 1])

Y.append(float(y))

You’ve already seen how useful the NumPy library is, so we’ll convert our X and Y into NumPy arrays.

X = np.array(X)

Y = np.array(Y)

Now we depict our data graphically to see how they look like. This is a bit more complicated than the previous two-dimensional images because now we have three dimensions. Be careful! If you work under Mac, this may not work. I’ll show you how it looks like under Linux.

fig = plt.figure()

ax = fig.add_subplot(111, projection = ‘3d’)

ax.scatter(X[:,0], X[:, 1], Y)

plt.show()

So, let’s continue. We use the equations we have derived to calculate the coefficients of our model. If you remember, we do not transpose the X matrix manually, but use the special NumPy function to solve problems in linear algebra.

w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, Y)

Yhat = np.dot(X, w)

Now we calculate the coefficient of determination R2 to see how good our model is. The equation is the same in this case since R2 depends only on y and ŷ, and the changes concern only the calculation of the value of ŷ.

d1 = Y – Yhat

d2 = Y – Y.mean()

r2 = 1 – d1.dot(d1) / d2.dot(d2)

print ‘’r-squared:’’, r2

And run the program. R2, which we have got, is very good. It is 0.998. This is because we have selected the data in such way.

## Polynomial regression is an extension of linear regression with the code in Python

Let’s talk about polynomial regression, which in the future will help us to understanding the concept of generalization and retraining in the article.

So, here are some short explanations before we start writing a code. Even if we have a square or cubic polynomial or some other functions of a curve, we can still use linear regression, despite its name. The fact is that the linearity of the regression means that the coefficients of the model are linear, but it does not say anything about how x should be. If you take a quadratic function or a function of the form ax2 + bx + c, you will see that the result of y linearly depends on the coefficients a, b and c, and therefore linear regression is also applicable to them.

Now you have the idea how to apply linear regression to a polynomial, and why it is so important to structure the data correctly before building a model.

So, let’s create a new file called lr_poly.py.

As before, we import the NumPy and Matplotlib libraries

import numpy as np

import matplotlib.pyplot as plt

If you look at the data in the data in _poly.csv file, you will see only two data columns, the same as in the data_1d.csv file. So, we load our data.

X=[]

Y=[]
for line in open(‘data_poly.csv’):

x, y = line.split(‘,’)

The difference is that now we have x, which is simply a number, and X, which also includes x2 in addition to x.

x = float(x)

X.append([1, x, x*x])

Y.append(float(y))

As before, we transform arrays into NumPy:

X = np.array(X)

Y = np.array(Y)

Now, let’s depict it in a graphical form.

plt.scatter(X[:, 1], Y)

plt.show

Now calculate our coefficients. Do not forget this is the common multiple regression, the only difference is how exactly we enter our X.

w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, Y)

Yhat = np.dot(X, w)

And again, we will graph it.

plt.scatter(X[:, 1], Y)

plt.plot(X[:, 1], Yhat)

plt.show()

Now we have something incomprehensible. It sometimes happens when X is unsorted, then each pair of points is connected by a line. Let’s fix this by rewriting the last part of the code.

plt.scatter(X[:, 1], Y)

plt.plot(sorted(X[:, 1]), sorted(Yhat))

plt.show()

Run the program. Now we can see our quadratic curve passing through most points.

Do not forget to calculate the coefficient of determination R2, the formula of which has not changed.

d1 = Y – Yhat

d2 = Y – Y.mean()

r2 = 1 – d1.dot(d1) / d2.dot(d2)

print ‘’r-squared:’’, r2

The coefficient of determination is 0.999. It is an excellent value.

## Prediction of systolic blood pressure by age and weight

Now we will write a sample code for multiple linear regression. We will use this data. You can find them here

Just copy and paste this address into your browser and download the Excel file. We will use the pandas library to read Access files, but we need to load the external library. You need to run

sudo pip install xlrd

It will allow you to use the pd dot function to read an Excel file.

Having loaded the data, you will see that we have three variables X1, X2, and X3 for each patient. The first column of X1 is systolic blood pressure. It will be our Y. We will predict its value. X2 is our input factor that characterizes age in years. We will analyze blood pressure as a function of age. Besides, we want to analyze the dependence of pressure on weight in pounds. It is our X3. We will see that there is a linear dependence between all these data.

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

We also use the function of reading an Excel file from the pandas library. The file is called mlr02.xls.

You can also take this file in Riehle.

The first thing we do is to transform the data into a metric system. It is also nice to display them graphically at once. On the x-axis, we will plot the years a person has already lived, and on the y-axis, we will plot the arterial pressure.

X = df.as_matrix()

plt.scatter(X[:, 1], X[:, 0])

plt.show()

We do the same with the second factor, but now we plot the weight on the x-axis and the pressure on the y axis.

plt.scatter(X[:, 2], X[:, 0])

plt.show()

We will construct three linear regressions. The first is the dependence of the output variable Y only on the factor X2, the second one is the dependence only on the factor X3, and the third one is the dependence on both factors, and we also calculate the determination coefficient R2. We use the loop operator for this purpose.

df[‘ones’] = 1

Y = df[‘X1’]

X = df[[‘X2’, ‘X3’, ‘ones’]]

X2only = df[[‘X2’, ‘ones’]]

X3only = df[[‘X3’, ‘ones’]]

def get_r2(X, Y):

w = np.linalg.solve( X.T.dot(X), X.T.dot(Y) )

Yhat = X.dot(w)

d1 = Y – Yhat

d2 = Y – Y.mean()

r2 = 1 – d1.dot(d1) / d2.dot(d2)

return r2

print ‘’r2 for x2 only:’’, get_r2(X2only, Y)

print ‘’r2 for x3 only:’’, get_r2(X3only, Y)

print ‘’r2 for both:’’, get_r2(X, Y)

Let’s see what has happened. On the graph, the dependence of blood pressure on age is similar to linear. The graph of pressure versus weight also shows something that looks like a linear dependence. Therefore, the use of linear regression for both factors is justified.

The coefficient of determination is very good for the dependence on both factors separately, but it is even better to use both factors. 