## Generalization Error, Training and Test Sets


In this lecture, we discuss generalization in detail, building on the theory and code of linear regression that we have already covered.

One thing from our previous articles may have puzzled you: if we can take x² as input data, why can't we also take x³, x⁴, or even x⁵, and reduce the error that way?

Let's recall from higher mathematics that a Taylor series with enough terms can approximate a very wide class of functions. So how far can we go in adding these powers of x?

Indeed, if we set aside the computational cost that makes the code run much slower, how many terms can we add before the error drops all the way to zero?
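For instance, the Taylor series of the sine function (a standard calculus result, relevant because we use a sinusoid as our example below) is:

```latex
\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots
       = \sum_{k=0}^{\infty} \frac{(-1)^k \, x^{2k+1}}{(2k+1)!}
```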

However, let's remember the real purpose of machine learning. The idea is not to construct the line of best fit for events that have already happened.

The real challenge is to predict future values: for example, tomorrow's weather, tomorrow's share price, or even whether a signature on a document is fake.

When we talk about the quality of a model, what we have in mind is how well it generalizes to data that were not available before; the error on such data is called the generalization error. How can we estimate it? We simulate a situation that requires the model to generalize.

Well, for example, we can split the data: 80% becomes the training set used to construct the model, and the remaining 20% becomes the test set.
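As a quick sketch (the array names and data here are illustrative, not taken from the lecture's code), such a split might look like:

```python
import numpy as np

# A minimal sketch of an 80/20 train/test split (illustrative data).
N = 100
X = np.linspace(0, 1, N)
Y = 2.0 * X + 1.0

idx = np.random.permutation(N)        # shuffle the observation indices
n_train = int(0.8 * N)                # 80% for training
train_idx, test_idx = idx[:n_train], idx[n_train:]

Xtrain, Ytrain = X[train_idx], Y[train_idx]
Xtest, Ytest = X[test_idx], Y[test_idx]

print(len(Xtrain), len(Xtest))        # 80 20
```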

You can see the typical curves of training and test errors, which may look familiar if you have studied data processing before. This is the typical picture when you start feeding real data into the model.

So, what happens when you complicate the model so much that its results on the training data keep getting better and better? At some point, the test error shows that the model begins to deviate from the actual values. This is the moment when you have to say to yourself: “Stop! My model is already quite complicated.” Personally, I advise you to check your model on held-out data to see which curves you actually get. Otherwise, your error estimate will be overly optimistic.

**Demonstration of generalization and overfitting in code**

Now let's examine code that will strengthen our understanding of generalization and overfitting.

We will just go over the code without writing it out, since what matters most for us is to run it.

First, we create the input data: a sinusoid from 0 to 6π. In case you do not know what a sine wave looks like, we plot it.

```python
import numpy as np
import matplotlib.pyplot as plt

N = 100
X = np.linspace(0, 6*np.pi, N)
Y = np.sin(X)

plt.plot(X, Y)
plt.show()
```

After this, we introduce the *make_poly* function to simplify our work with polynomial regression. We have already discussed polynomial regression, and you know how to work with it: all you need to do is add the polynomial terms to the input data, and you get polynomial regression. This function creates a column of ones and then appends all powers of X from the first up to the degree specified by the input argument.

```python
def make_poly(X, deg):
    # build the columns [1, X, X**2, ..., X**deg]
    n = len(X)
    data = [np.ones(n)]
    for d in range(deg):
        data.append(X**(d + 1))
    return np.vstack(data).T
```

Next is the *fit* function, which is simply the linear regression solution for the coefficients in terms of the input and output data.

```python
def fit(X, Y):
    return np.linalg.solve(X.T.dot(X), X.T.dot(Y))
```

The next function is *fit_and_display*. As arguments, it receives the data X and Y, as well as sample and deg. The sample argument gives the number of observations to draw from X and Y to form the training set. We use these data to find the best coefficients for a polynomial of the degree given by deg. Then we plot this polynomial on top of the original sinusoid and the training data, so we get an idea of how well the polynomial fits the training data and how well it reproduces the sinusoid.

```python
def fit_and_display(X, Y, sample, deg):
    N = len(X)
    train_idx = np.random.choice(N, sample)
    Xtrain = X[train_idx]
    Ytrain = Y[train_idx]

    plt.scatter(Xtrain, Ytrain)
    plt.show()

    Xtrain_poly = make_poly(Xtrain, deg)
    w = fit(Xtrain_poly, Ytrain)

    X_poly = make_poly(X, deg)
    Y_hat = X_poly.dot(w)

    plt.plot(X, Y)
    plt.plot(X, Y_hat)
    plt.scatter(Xtrain, Ytrain)
    plt.title("deg = %d" % deg)
    plt.show()
```

Then we test the function with polynomials of the fifth, sixth, seventh, eighth and ninth degree.

```python
for deg in (5, 6, 7, 8, 9):
    fit_and_display(X, Y, 10, deg)
```

The following function calculates the mean squared error, which we will compute for both the training and the test set.

```python
def get_mse(Y, Yhat):
    d = Y - Yhat
    return d.dot(d) / len(d)

```

You can see that the test error tracks the training error for a while, but in the end they diverge abruptly, because the model can no longer adequately describe new input data.

The next function, *plot_train_vs_test_curves*, shows exactly that. It takes the data X and Y, draws a random training set from them, and makes up a test set from the remaining data. Then it fits polynomials with degrees from 1 to 20 and draws the error curves for the training and test sets. We also plot the training curve separately, so that you can see how its error tends toward zero and stays there.

```python
def plot_train_vs_test_curves(X, Y, sample=20, max_deg=20):
    N = len(X)
    train_idx = np.random.choice(N, sample)
    Xtrain = X[train_idx]
    Ytrain = Y[train_idx]

    test_idx = [idx for idx in range(N) if idx not in train_idx]
    Xtest = X[test_idx]
    Ytest = Y[test_idx]

    mse_trains = []
    mse_tests = []
    for deg in range(1, max_deg + 1):
        Xtrain_poly = make_poly(Xtrain, deg)
        w = fit(Xtrain_poly, Ytrain)
        Yhat_train = Xtrain_poly.dot(w)
        mse_trains.append(get_mse(Ytrain, Yhat_train))

        # (the rest of the function is reconstructed from the description above)
        Xtest_poly = make_poly(Xtest, deg)
        Yhat_test = Xtest_poly.dot(w)
        mse_tests.append(get_mse(Ytest, Yhat_test))

    plt.plot(mse_trains, label='train mse')
    plt.plot(mse_tests, label='test mse')
    plt.legend()
    plt.show()

    plt.plot(mse_trains, label='train mse')
    plt.legend()
    plt.show()

plot_train_vs_test_curves(X, Y)
```

So, let’s run the code and see what we have:

Then we have a set of random data of 10 points:

Then we have the polynomial of the fifth degree:

You see that it does not pass through all the points.

Then we have another random set of 10 points:

Also, the polynomial of the sixth degree.

Note that it passes through all the points, but it looks simply awful in the areas where there are no data points. If we zoom in, we can see that the result is terrible in some regions, even though the graph passes exactly through all the points:

Here we see one more random set, and again the polynomial passes through all of our points:

So, we have areas where the generalization turns out to be very good, but the curves are very divergent in those areas where there are no points:

Here we have the next data set and the polynomial of the eighth degree. You can see that the fit is very good on the right side, where we have a lot of points, but unsatisfactory on the left, where we do not have many:

The next 10 points:

And the polynomial of the ninth degree:

We have a perfect coincidence even where there are few points, but only in the area between the extreme points of the data. For example, there are no points on the left side, so the graph simply goes to negative infinity.

Next, we see the mean squared error for the training and test sets:

As we see, the error curve is much closer to zero for the training set.

On the separate plot of the training error, we can easily see that the curve always drops toward zero and stays there:

We come to the following interesting conclusions.

**How can we build a successful model?**

We recommend that you run the program several times to repeat the experiment. Try it yourself with a different number of data points and various degrees of the polynomial. The more runs you make, the clearer the picture becomes. You will then notice that a large degree of the polynomial does not guarantee a successful model.

In fact, everything depends on the training data. If the sample used to train the model is truly representative of the rest of the data, the fit will be very good in any case.

Applied to our example with the sinusoid: if the training data are distributed so that we get an evenly spread set of points across all values of x, then our polynomial describes the curve well. This is because our training set is representative.

If our training points are clustered in a few places, you can see that the polynomial diverges abruptly from the curve in the areas with no input points. This is because the training set is unrepresentative.

What is the moral?

We need to collect more data. The more data we have, the better they represent reality, and the more accurate our model will be. Moreover, you can now see that the usual curve of best fit shows only one side of the coin.

**Probabilistic interpretation of the squared error**

We continue examining the topic of errors in machine learning using linear regression. Let's examine the squared error that we used before.

I'd like to remind you that we define it as the sum of squared differences between the actual values y_{i} and the predicted ones ŷ_{i}:
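Writing this out explicitly (a reconstruction of the formula the text describes):

```latex
E = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```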

In this lecture, we will show you that linear regression gives the most plausible solution for finding the line of best fit.

What do we mean when we talk about maximum likelihood?

Let's assume that we have measured the height of all the students in our school, and that the heights follow a Gaussian distribution. This means that if we build a histogram, we get a bell-shaped curve.

After the measurements, we have the heights of all the students: {x1, x2, …, xN}. Let's assume we want to know the average height. We intuitively understand how to calculate it: sum all the x_{i} values and divide the result by the number of students N.

However, there is a more methodologically sound way to arrive at this answer, which you may not even know about.

Let's assume that we want to find the true mean parameter of the Gaussian distribution. I'd like to remind you that a Gaussian is defined by two parameters: the mean and the variance. Since all our x_{i} are distributed according to the Gaussian, the probability of each x_{i} is given by the Gaussian probability density function.

Since the students' heights are independent of each other and identically distributed, we can write down their joint probability as the product of these individual values.

We can write down this product as the likelihood function: the probability of observing the data X given the parameter μ.
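Written out (a reconstruction of the formula the text refers to), for independent Gaussian observations the likelihood is:

```latex
L(\mu) = p(X \mid \mu)
       = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
         \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
```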

Don't forget that it is the parameter μ we are after. More precisely, we need to find the value of μ that makes the likelihood maximal; in other words, the value of μ under whose distribution the data we collected are most probable.

So how can we find **μ**?

We'll use the differential calculus that is familiar to us. One way is to take the derivative of the likelihood function directly, but in general this is challenging: the resulting expression is complicated to solve. It is better to take the logarithm of the likelihood function first and only then take the derivative. This works because the logarithm is monotonically increasing: if A > B, then log A > log B, so the μ that maximizes log L also maximizes L. If you do not believe us, you can check it with a calculator.

So, first of all, we find the logarithm of the likelihood function (we denote it by l).

We know that the logarithm of a product is equal to the sum of the logarithms. This allows us to simplify the expression.

Next, we set the derivative with respect to μ equal to zero.

When we solve this equation for μ, we find something familiar: the sum of the x_{i} divided by the number of observations N.
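Spelling out the derivation just sketched (reconstructed from the description):

```latex
\ell(\mu) = \log L(\mu)
          = -\frac{N}{2}\log\!\left(2\pi\sigma^2\right)
            - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2

\frac{d\ell}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (x_i - \mu) = 0
\quad\Rightarrow\quad
\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i
```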

The key point in this entire exercise is the form of the logarithm of the likelihood function.

Note that the first part of the expression drops out, since it is a constant with respect to μ and its derivative is zero. Similarly, once the derivative is set equal to zero, σ^{2} and the factor of one half cancel. It is essential to keep the minus sign: if you remove it, you will find the minimum of the function instead of the maximum.

If we simplify the expression by removing everything we do not need, we are left with the problem of maximizing the negative sum of squared differences between the x_{i} and μ.

Here we meet something familiar. Indeed, this looks very much like the squared-error function we discussed in the article on linear regression.

The opposite signs in the formulas are explained by the fact that earlier we looked for the minimum of the function, whereas now we look for its maximum.

The moral of the story is that minimizing the squared error in linear regression is equivalent to maximizing the logarithm of the likelihood under a Gaussian model. We can write this model in the form

y ~ N(w^{T}x, σ^{2}).

We can also write it in the form

y = w^{T}x + ε, ε ~ N(0, σ^{2}),

where ε is Gaussian noise with zero mean and some variance σ^{2}.
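A small numeric sketch of this model (the particular weights and noise level are made up for illustration): we generate data as y = w^{T}x + ε and check that the least-squares solution recovers w.

```python
import numpy as np

# Sketch of the model y = w^T x + eps, eps ~ N(0, sigma^2); the weights
# and sigma here are invented. Least squares should recover w.
np.random.seed(0)
N = 1000
true_w = np.array([1.0, 2.0, -0.5])

X = np.column_stack([np.ones(N), np.random.randn(N, 2)])  # bias + 2 features
eps = 0.1 * np.random.randn(N)                            # Gaussian noise
Y = X.dot(true_w) + eps

w_hat = np.linalg.solve(X.T.dot(X), X.T.dot(Y))  # least-squares solution
print(w_hat)  # close to [1.0, 2.0, -0.5]
```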

## The Dummy Variable Trap

Earlier in the article, we examined direct (one-hot) encoding as one of the ways to work with categorical variables. This method makes the result convenient to interpret, since each weight directly shows how strongly the corresponding category value affects the result. In general, if a category has K different values, then X must have K columns to represent it.

Direct encoding has an alternative called K−1 encoding. The idea is that one of the values of the categorical variable can be encoded in X by all zeros, so only K−1 columns are needed.
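A minimal sketch of the two encodings (the category values 'a', 'b', 'c' are invented for illustration):

```python
import numpy as np

# Illustrative sketch: encoding a category with K = 3 values ('a', 'b', 'c').
cats = np.array(['a', 'b', 'c', 'a'])
values = ['a', 'b', 'c']

# Direct (one-hot) encoding: K columns, one per value.
direct = np.array([[1.0 if c == v else 0.0 for v in values] for c in cats])

# K-1 encoding: drop the last value's column, so 'c' is encoded as all zeros.
k_minus_1 = direct[:, :-1]

print(direct.shape, k_minus_1.shape)  # (4, 3) (4, 2)
```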

However, since one of the values of the variable is encoded by zeros, we cannot directly see its effect on the final result.

In many cases, K−1 encoding is nevertheless recommended, and there is an important reason for this. It may seem unobvious until you try to find the model weights yourself.

The problem is that the solution requires computing (X^{T}X)^{-1}, and with direct encoding this is impossible, because X^{T}X is singular (it has no inverse). The reason is that X contains a bias column consisting of ones, and with direct encoding the category columns sum to exactly that column of ones, so the columns of X are linearly dependent.

We will not go deeper into the linear algebra to explain this in detail, but the fact remains that X^{T}X cannot be inverted. This situation is sometimes called the dummy variable trap.
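A small numeric illustration (a made-up 4-row example): with a bias column of ones plus full one-hot columns, X^{T}X loses full rank.

```python
import numpy as np

# Made-up 4-row example: a bias column of ones plus full one-hot columns.
# The one-hot columns sum to the bias column, so X^T X is singular.
ones = np.ones((4, 1))
onehot = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)
X = np.hstack([ones, onehot])

XtX = X.T.dot(X)
print(np.linalg.matrix_rank(XtX))  # 3 -- rank-deficient, so not invertible
```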

There are several ways to get around this trap.

- The simplest way is, of course, to use K−1 encoding instead of direct encoding, but that is not very interesting.
- The second way is to remove the bias column of ones, so that no linear combination of the remaining columns equals another column.
- You can also use the previously discussed L2-regularization. As we can see from its solution for the weights, we no longer invert X^{T}X itself, because we add one more term to it, namely λI; this frees us from trying to invert a singular matrix.
- Finally, there is the method we are going to discuss next: gradient descent, in connection with L1-regularization. It is the most remarkable method, though the other approaches are also important. Gradient descent is used in every area of deep learning, and since linear regression is about the only model whose weights have a closed-form solution, it is useful to introduce gradient descent right now. Moreover, it is the most general method and can be used for all further models.

Note that inverting a singular matrix is, in matrix calculus, the analogue of dividing by zero. Adding λI to the matrix is equivalent to adding a small number λ to the denominator so that it is no longer zero.
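A sketch of this fix (the same kind of made-up 4-row design matrix as before, with a bias column and full one-hot columns, so that X^{T}X is singular): adding λI restores invertibility.

```python
import numpy as np

# Made-up example: X^T X is singular because the one-hot columns sum to
# the bias column, but X^T X + lambda*I has full rank and can be inverted.
ones = np.ones((4, 1))
onehot = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)
X = np.hstack([ones, onehot])
XtX = X.T.dot(X)

lam = 0.1
regularized = XtX + lam * np.eye(4)
print(np.linalg.matrix_rank(XtX), np.linalg.matrix_rank(regularized))  # 3 4
```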

The problem where one column is a linear combination of several others is called multicollinearity. In the simplest case, two variables correlate perfectly, and one column is simply a multiple of the other. We prefer a general solution to this problem, since in the general case the data can be anything.

We have used the dummy variable trap as an example, but in general you cannot guarantee that no two columns of the input variables correlate with each other. Images in particular show strong correlations, so it is preferable to use the most general solution methods, such as gradient descent. Besides, gradient descent is used constantly in deep learning.