# Theory and code in L1 and L2-regularizations

In this lecture, we look at how L1- and L2-regularization differ and show where those differences come from mathematically. We cover both the theory and the code for each type. We start with the differences and then explain why they arise.

We have already seen that L1-regularization encourages sparsity: only a few weights end up non-zero. L2-regularization encourages small weights but does not drive them exactly to zero. In this lecture, we discuss why this happens.

Note that both methods help generalization and reduce test error, because they keep the model from overfitting to noise in the data.

L1-regularization does this by keeping only the features that affect the result the most. For simplicity, you can assume that features with little influence on the final result only “help” you predict the noise in the training set. L2-regularization prevents overfitting by penalizing disproportionately large weights.

First, let’s go through the explanations that are usually given. They are worth mentioning, although I believe they are inadequate and distracting. One of them is the contour plot of the negative logarithms, which shows the difference between the two regularization types. It is not very useful for understanding the essence, but it is still worth studying on your own.

It is much more useful to imagine a model with a single, one-dimensional weight w. For L2-regularization, the additional term is a quadratic function, w²; for L1-regularization, it is the absolute value, |w|. What matters here is the derivative of the penalty, because each gradient descent step moves the weight in the direction opposite to the derivative.

With the quadratic term, the closer you are to zero, the smaller the derivative becomes, until it also approaches zero. Therefore, once w is already small, further gradient descent with L2-regularization does not change it much. For the absolute value, the derivative is a constant with magnitude one. It is not formally defined at zero, but we take it to be zero there. Therefore, with L1-regularization, gradient descent pushes w toward zero at a constant speed, and once w reaches zero, it stays there. As a result, L2-regularization favors small weights, while L1-regularization favors weights that are exactly zero, which produces sparsity.
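To make the one-dimensional argument concrete, here is a minimal sketch that runs plain gradient descent on each penalty by itself, with no data term. The helper `shrink`, the starting point 1.0 and the step size 0.125 are choices made for this illustration only. The step size is picked so that the L1 path lands exactly on zero; with an arbitrary fixed step it would typically end up hopping around zero instead.

```python
import numpy as np

def shrink(w0, grad, learning_rate=0.125, steps=20):
    """Run gradient descent on a 1-D penalty and return every visited value of w."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return np.array(path)

# Derivative of w**2 is 2*w: it fades as w approaches zero.
l2_path = shrink(1.0, lambda w: 2 * w)
# Derivative of |w| is sign(w): magnitude one, taken as 0 at w = 0.
l1_path = shrink(1.0, lambda w: np.sign(w))

print("L2 path, last value:", l2_path[-1])   # small but still positive
print("L1 path, last value:", l1_path[-1])   # exactly zero
print("L1 first reaches zero at step:", int(np.argmax(l1_path == 0.0)))
```

The L2 weight is multiplied by the same factor at every step, so it keeps getting smaller without ever reaching zero, while the L1 weight loses a fixed amount per step and then stops.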

Finally, note that you can include both L1- and L2-regularization in your model. This combination even has a special name: ElasticNet. It may sound exotic, but in fact it just adds both the L1 and the L2 penalty to your cost function.

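$$J_{\text{RIDGE}} = J + \lambda_2 \lVert w \rVert_2^2,$$

$$J_{\text{LASSO}} = J + \lambda_1 \lVert w \rVert_1,$$

$$J_{\text{ELASTICNET}} = J + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2.$$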
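As a rough sketch, the three cost functions differ only in which penalty is added to the base cost J. The function names and the sample numbers below are made up for illustration; they are not part of the lecture code.

```python
import numpy as np

def ridge_penalty(w, lam2):
    """L2 penalty: lam2 times the sum of squared weights."""
    return lam2 * np.sum(w ** 2)

def lasso_penalty(w, lam1):
    """L1 penalty: lam1 times the sum of absolute weights."""
    return lam1 * np.sum(np.abs(w))

def elastic_net_cost(base_cost, w, lam1, lam2):
    """Base cost J plus both penalties."""
    return base_cost + lasso_penalty(w, lam1) + ridge_penalty(w, lam2)

w = np.array([1.0, -2.0, 0.0])
base_cost = 0.5  # stands in for J, e.g. a cross-entropy value

print("ridge:      ", base_cost + ridge_penalty(w, lam2=0.01))
print("lasso:      ", base_cost + lasso_penalty(w, lam1=0.1))
print("elastic net:", elastic_net_cost(base_cost, w, lam1=0.1, lam2=0.01))
```

Setting `lam1` to zero in `elastic_net_cost` recovers the ridge cost, and setting `lam2` to zero recovers the lasso cost.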

## L1-regularization. Theory

We have discussed generalization and overfitting in earlier lectures. We have seen that even when we add a column of random noise to our data, the training score can still improve. In general, we want the dimensionality D of X to be much smaller than the number of observations N. The problem arises when, for a particular data set, the dimensionality is instead much larger than the number of observations.

One way to picture this is through the shape of the matrix X. We want X to be tall and thin: N is large and D is small. In the opposite case, when X is wide (N is small and D is large), things do not work well, and we need to take action to avoid potential problems.
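One way to see why a wide X causes trouble is a standard linear-algebra fact rather than anything specific to this lecture: the rank of X cannot exceed N, so when D is larger than N there are weight directions that do not change the predictions on the training set at all. The sizes and variable names below are arbitrary choices for a small sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 20                       # few samples, many features: a wide X
X = rng.normal(size=(N, D))
w = rng.normal(size=D)

rank = np.linalg.matrix_rank(X)    # at most N

# Full SVD: the last rows of Vt span the directions that X maps to zero.
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]

# Moving w along such a direction leaves every training prediction unchanged.
same_predictions = np.allclose(X.dot(w), X.dot(w + 10 * null_direction))
print("rank:", rank, " same predictions:", same_predictions)
```

The training data cannot distinguish between these weight vectors, which is why some extra criterion, such as a penalty on the weights, is needed to pick one.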

In this situation, we can select only a small number of the most important factors that set the trend from the whole array of factors, and remove all the others that are just noise. It is called sparsity, since most of the factors will be zero, and only a small number will not be zero. In this lecture, we discuss L1-regularization, which allows us to achieve sparsity.

The basis of L1-regularization is a relatively simple idea. As with L2-regularization, we merely add a penalty to the original cost function. Where L2-regularization penalizes the L2 norm of the weights, L1-regularization penalizes their L1 norm. L2-regularization is also called ridge regression, and L1-regularization is called lasso regression.

Before we tackle the problem, let’s consider the probability distributions involved. You know that the exponential of a negative square is the Gaussian distribution, so with L2-regularization, we had a Gaussian likelihood and a Gaussian prior on w. With L1-regularization, the prior on w is no longer Gaussian.

Which distribution has a negative absolute value in the exponent? The Laplace distribution. Thus, for L1-regularization, we put a Laplace prior on the weights and solve for the posterior of w under that prior.

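In the case of logistic regression, you already know how to solve this problem: we find the gradient of the cost function and take steps against it. With L1-regularization, the cost is

$$J = -\sum_{n=1}^{N}\left[t_n \log y_n + (1 - t_n)\log(1 - y_n)\right] + \lambda \lVert w \rVert_1,$$

where $y_n$ is the model output and $t_n$ is the target, and its gradient is

$$\frac{\partial J}{\partial w} = X^T (Y - T) + \lambda\,\operatorname{sign}(w).$$

You already know how to find the gradient of the first part. The second part is λ multiplied by the sign function applied to each weight: sign(x) returns 1 if x > 0, -1 if x < 0, and 0 if x = 0.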
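The gradient described above (the usual cross-entropy gradient plus λ times the sign of each weight) can be checked numerically. The sketch below compares it with a finite-difference estimate on a small random problem. The function names, the data and the value of `lam` are assumptions for this illustration, and the test point keeps every weight away from zero, where the absolute value is not differentiable.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def l1_cost(w, X, T, lam):
    """Cross-entropy error plus lam times the L1 norm of w."""
    Y = sigmoid(X.dot(w))
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y)) + lam * np.sum(np.abs(w))

def l1_gradient(w, X, T, lam):
    """Gradient of l1_cost, with sign(w) as the derivative of |w|."""
    Y = sigmoid(X.dot(w))
    return X.T.dot(Y - T) + lam * np.sign(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
T = (rng.random(20) < 0.5).astype(float)
w = np.array([0.3, -0.2, 0.1])     # no weight is exactly zero
lam = 0.5

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (l1_cost(w + eps * e, X, T, lam) - l1_cost(w - eps * e, X, T, lam)) / (2 * eps)
    for e in np.eye(3)
])
max_error = np.max(np.abs(numeric - l1_gradient(w, X, T, lam)))
print("largest difference:", max_error)
```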

## L1-regularization. The Code

I suggest writing the code together to demonstrate the use of L1-regularization.

The plan is to generate data in which the input is a wide matrix, while Y depends on only a few of the features and the rest are just noise. Then we use L1-regularization to find sparse weights that pick out the useful dimensions of X.

If you do not want to write the code yourself, but just run it, the corresponding file is in the repository called l1_regularization.py.

```python
import numpy as np
import matplotlib.pyplot as plt
```

We also define our sigmoid function, which you already know:

```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```

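Set N = 50 and D = 50, so that X has as many columns as rows, which is already far wider than we would like. The values of X are uniformly distributed in the range from -5 to +5.

```python
N = 50
D = 50

# uniform values in [-5, +5)
X = (np.random.random((N, D)) - 0.5)*10
```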

The true weights, stored in the variable true_w, are 1, 0.5 and -0.5 for the first three dimensions and zero for the remaining 47, so those 47 dimensions do not affect the output in any way.

```python
true_w = np.array([1, 0.5, -0.5] + [0]*(D - 3))
```

Now define our Y. It is the sigmoid of X times true_w, with some random noise added inside the sigmoid, rounded to 0 or 1.

```python
Y = np.round(sigmoid(X.dot(true_w) + np.random.randn(N)*0.5))
```

Next comes gradient descent. The learning rate is set to 0.001, and the L1 penalty is set to 2. I recommend experimenting with different values of the penalty to see what happens. The number of iterations is set to 5000. We also record the value of the cost function at every step, which we already know how to do.

```python
costs = []  # cost at every iteration, for plotting
w = np.random.randn(D) / np.sqrt(D)  # random initial weights
learning_rate = 0.001
l1 = 2.0  # L1 penalty; try other values
for t in range(5000):
    Yhat = sigmoid(X.dot(w))
    delta = Yhat - Y
    # cross-entropy gradient plus the L1 term l1 * sign(w)
    w = w - learning_rate*(X.T.dot(delta) + l1*np.sign(w))
    # mean cross-entropy plus the L1 penalty on the updated weights
    cost = -(Y*np.log(Yhat) + (1 - Y)*np.log(1 - Yhat)).mean() + l1*np.abs(w).mean()
    costs.append(cost)

plt.plot(costs)
plt.show()
```

Finally, we plot the true weights together with the estimated ones:

```python
plt.plot(true_w, label='true w')
plt.plot(w, label='w map')
plt.legend()
plt.show()
```

Run the program and see what happens. As you can see, the cost function converges quite quickly.

The estimated weights are good, but they do not coincide exactly with the true ones. Let’s try again with the L1 penalty set to 10:

```python
l1 = 10.0
```

We get a new cost curve and new weights. The weights are now much closer to zero, because the stronger regularization pulls them there. So a very large penalty is not the goal: try smaller values, and keep the smallest penalty that still gives you a good result.
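To compare several penalties in one run, the training loop can be wrapped in a function. This is a sketch under a few assumptions: the name `fit_l1`, the fixed random seeds, the list of penalties and the 0.05 threshold are all choices made here, not part of the lecture's l1_regularization.py. Because the sign-based update with a fixed step rarely lands on exactly zero, the sketch counts weights whose magnitude falls below the threshold rather than exact zeros.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_l1(X, Y, l1, learning_rate=0.001, steps=5000, seed=0):
    """Gradient descent for logistic regression with an L1 penalty."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    w = rng.normal(size=D) / np.sqrt(D)
    for _ in range(steps):
        Yhat = sigmoid(X.dot(w))
        w = w - learning_rate * (X.T.dot(Yhat - Y) + l1 * np.sign(w))
    return w

# Same kind of data as in the lecture: only the first three features matter.
rng = np.random.default_rng(1)
N, D = 50, 50
X = (rng.random((N, D)) - 0.5) * 10
true_w = np.array([1, 0.5, -0.5] + [0] * (D - 3))
Y = np.round(sigmoid(X.dot(true_w) + rng.normal(size=N) * 0.5))

for l1 in [0.0, 2.0, 10.0]:
    w = fit_l1(X, Y, l1)
    near_zero = int((np.abs(w) < 0.05).sum())
    print(f"l1={l1:5.1f}  sum|w|={np.abs(w).sum():8.3f}  weights with |w| < 0.05: {near_zero}")
```

Printing the sum of absolute weights next to the count of near-zero weights gives a quick feel for how strongly each penalty shrinks the solution.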

## L2-regularization. Theory

Let’s discuss one of the most popular topics in machine learning: regularization. We will look at it from several points of view so that you fully understand the reasons for using regularization and its theoretical consequences. First of all, let us consider the problem of overfitting. Let’s return to our Gaussian clouds, one centered at the point (2; 2) and the other at the point (-2; -2). This is the same problem for which we found the Bayes classifier solution. It is easy to depict graphically: two circular clouds separated by a straight line. As you know, the Bayes classifier solution we found is w = [0, 4, 4], where 0 is the bias term.

Which line corresponds to this solution? As we know from school mathematics, the equation of a straight line has the form

y = mx + b.

Note that in this equation, y represents the ordinate on the coordinate plane, and not the output of logistic regression.

We can rewrite our solution in this form:

0 + 4x + 4y = 0, hence y = -x.

Consequently, the slope of the straight line is -1, and the intercept, where the line crosses the y-axis, is zero.

This may make you wonder why the Bayes classifier solution is (4, 4). Why isn’t it (1, 1) or (10, 10)? After all, these solutions all describe the same straight line! This is the first clue as to why we might need regularization.
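A quick sketch confirms that scaling the weights does not move the decision boundary: the three weight vectors assign every point to the same class. The random test points and the variable names are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-3, 3, size=(1000, 2))
Xb = np.column_stack([np.ones(len(points)), points])   # prepend the bias column

candidates = [
    np.array([0.0, 1.0, 1.0]),
    np.array([0.0, 4.0, 4.0]),
    np.array([0.0, 10.0, 10.0]),
]

# Class 1 when the activation is positive, i.e. when the point lies above y = -x.
labels = [(Xb.dot(w) > 0).astype(int) for w in candidates]
all_agree = all(np.array_equal(labels[0], lab) for lab in labels)
print("all three weight vectors give the same classes:", all_agree)
```

What changes with the scale is not the predicted class but how confident the sigmoid output is, which is where the objective function comes in.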

Consider the objective function

$$J = t \log y + (1 - t)\log(1 - y).$$

Note that here y does mean the output of logistic regression, and t is the target class.

We take a point (x1; x2) with coordinates (1; 1). We know that it belongs to the same set of points that are concentrated at the point with coordinates (2; 2). So, it belongs to class 1. If we substitute the value of our point in the formula of logistic regression with the given values of the weighting coefficients w = [0, 4, 4], we get the value of the sigmoid

σ(0 + 4 + 4) = σ(8) = 0.99966.

Only an output of exactly one would be better than this. For y = σ(8), the objective function is approximately -0.0003.

And what happens if the weights are (0, 1, 1)? In this case, the objective function is about -0.13, which is worse. If we set the weights to (0, 10, 10), then the objective function is approximately $-2.0 \times 10^{-9}$.

The moral of the whole story is that the best weighting coefficients for the model are (0, infinity, infinity). Of course, the computer cannot process such numbers and will give an error.
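The numbers above can be reproduced with a few lines. This sketch evaluates the objective for the class 1 point (1; 1), where it reduces to the logarithm of the sigmoid output. The helper `log_sigmoid` and the extra weight vector (0, 100, 100) are additions for illustration; `log1p` is used so that values very close to zero are not lost to rounding.

```python
import numpy as np

def log_sigmoid(z):
    """log(sigmoid(z)) computed as -log(1 + exp(-z))."""
    return -np.log1p(np.exp(-z))

x = np.array([1.0, 1.0, 1.0])      # bias input 1, then the point (1, 1)
weight_choices = [[0, 1, 1], [0, 4, 4], [0, 10, 10], [0, 100, 100]]

values = []
for w in weight_choices:
    z = np.dot(w, x)               # activation: 2, 8, 20, 200
    values.append(log_sigmoid(z))
    print(w, values[-1])
```

Each larger scale gives a value closer to zero, and no finite scale ever reaches it, which is exactly why the unregularized optimum runs off to infinity.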

People often understand regularization in terms of regression, as in my previous course on linear regression. But here we are not doing regression, so that picture does not carry over directly. In fact, if our data cover the entire range of possible values of the input variables, no overfitting occurs, even if our model is very complicated. This is the reason why we should always have as much data as possible. Overfitting happens when the model attempts to “guess” the result in a region where it has seen no data. If your training set covers all possible values, then you have the opportunity to train the model well. In other words, if your training dataset looks exactly like the test dataset, and the model shows a good result on the training dataset, then it will show the same good result on the test one.

Unlike the previous case, now even if our data are well balanced and include all possible variants of input variables, the logistic regression still tries to give a solution that the computer cannot calculate.

The solution to this problem is regularization. Regularization imposes a penalty on very large weights. If the original cross-entropy error is

$$J = -\sum_{n=1}^{N}\left[t_n \log y_n + (1 - t_n)\log(1 - y_n)\right],$$

then we can add a term that increases the error when the weights are too large:

$$J_{\text{reg}} = J + \frac{\lambda}{2}\lVert w \rVert_2^2.$$

The factor 1/2 is a convention that makes the gradient of the penalty come out as λw. This term pushes the weights toward zero, and we no longer get weights like (0, 10, 10), since for them the error function would be very large.

The term λ is called the smoothing parameter. It balances the cross-entropy error against the regularization penalty. If λ is large, the weights are pushed toward zero; if λ is small or zero, the weights mostly just minimize the cross-entropy error. As a rule, λ is set somewhere between 0.1 and 1, but the best value depends on the specific data. You need to try different values and observe the behavior of the cost function and the final result. There is no universal method for choosing λ. Or rather, one may exist, but it is too complicated for our classes.

So, how does this affect the weighting coefficients?

Do not forget that with gradient descent, all we need to do is find the gradient of the objective function and take steps along it (against it, when we are minimizing an error). Since the objective is now a sum of two terms, the gradient of each term can be found separately and the results added.

Let us consider the gradient of the regularization penalty in more detail. In scalar form, the penalty is

$$\frac{\lambda}{2}\sum_{i} w_i^2,$$

so the derivative with respect to any given $w_i$ is

$$\frac{\partial}{\partial w_i}\,\frac{\lambda}{2}\sum_{j} w_j^2 = \lambda w_i.$$

In vector form, the gradient of the regularized error is

$$\frac{\partial}{\partial w}\left(J + \frac{\lambda}{2}\lVert w \rVert_2^2\right) = X^T (Y - T) + \lambda w,$$

where Y is the vector of outputs and T is the vector of targets.

Let us consider another interpretation of regularization. As you remember, we also explain our model in terms of probability theory: finding the minimum of the cross-entropy error is the same as finding the maximum of the likelihood. Let’s return to our objective as the function we want to maximize. Then

$$J = \sum_{n=1}^{N}\left[t_n \log y_n + (1 - t_n)\log(1 - y_n)\right] - \frac{\lambda}{2}\lVert w \rVert_2^2.$$

We now take the exponential:

$$\exp(J) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1 - t_n}\;\exp\!\left(-\frac{\lambda}{2}\lVert w \rVert_2^2\right).$$

The first factor is the binomial (Bernoulli) likelihood. The second factor, the exponential of a negative quadratic, is just a Gaussian distribution up to its normalizing constant. All this is an introduction to the Bayesian view of machine learning.
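The vector form of the gradient can be verified numerically. The sketch below assumes the penalty is written as (λ/2)‖w‖², so that its gradient is λw, which matches the `0.1*w` term in the lecture's code. The function names, the random data and the test point are made up for this check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def l2_cost(w, Xb, T, lam):
    """Cross-entropy error plus (lam / 2) times the squared L2 norm of w."""
    Y = sigmoid(Xb.dot(w))
    cross_entropy = -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))
    return cross_entropy + 0.5 * lam * np.sum(w ** 2)

def l2_gradient(w, Xb, T, lam):
    """Gradient of l2_cost: the data term plus lam * w."""
    Y = sigmoid(Xb.dot(w))
    return Xb.T.dot(Y - T) + lam * w

rng = np.random.default_rng(0)
Xb = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])  # bias column first
T = (rng.random(30) < 0.5).astype(float)
w = np.array([0.1, -0.4, 0.7])
lam = 0.1

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (l2_cost(w + eps * e, Xb, T, lam) - l2_cost(w - eps * e, Xb, T, lam)) / (2 * eps)
    for e in np.eye(3)
])
max_error = np.max(np.abs(numeric - l2_gradient(w, Xb, T, lam)))
print("largest difference:", max_error)
```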

The Gaussian distribution over the weights is called the prior: it represents our a priori expectations about the weight values. In particular, they should be small and centered around zero, and the variance of this Gaussian distribution is 1/λ. The parameter λ is also called the precision; it is the inverse of the variance. We will meet it again and again throughout Bayesian machine learning.

Bayes’ rule says that the posterior is proportional to the likelihood multiplied by the prior:

$$P(w \mid X, T) \propto P(T \mid X, w)\,P(w).$$

Without regularization, we look for the maximum of the likelihood. With regularization, we no longer maximize the likelihood; we maximize the posterior. This method is called maximum a posteriori estimation, or MAP for short.
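The link between the prior and the penalty can be checked directly. The sketch below writes out the negative log of a zero-mean Gaussian density with variance 1/λ in every dimension and compares it with the penalty (λ/2)‖w‖². The two differ only by a constant that does not depend on w, so maximizing the posterior and minimizing the penalized error pick the same weights. The function names and test vectors are assumptions for this illustration.

```python
import numpy as np

def neg_log_gaussian_prior(w, lam):
    """-log N(w | 0, (1/lam) I), written out explicitly."""
    D = w.size
    return 0.5 * lam * np.sum(w ** 2) + 0.5 * D * np.log(2 * np.pi / lam)

def l2_penalty(w, lam):
    return 0.5 * lam * np.sum(w ** 2)

lam = 0.1
w_a = np.array([0.0, 4.0, 4.0])
w_b = np.array([0.0, 10.0, 10.0])

# The gap between the two quantities is the same for every w.
offset_a = neg_log_gaussian_prior(w_a, lam) - l2_penalty(w_a, lam)
offset_b = neg_log_gaussian_prior(w_b, lam) - l2_penalty(w_b, lam)
print(offset_a, offset_b)
```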

## L2-regularization. The Code

As you may have guessed, we now move on to practice and see how to apply regularization in code.

We reuse the code from previous lessons, because it is very similar, and just add the regularization term: the constant λ, set to 0.1, multiplied by the weights.

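```python
# Xb, T, Y, w and cross_entropy come from the previous lessons
learning_rate = 0.1
for i in range(100):
    if i % 10 == 0:
        print(cross_entropy(T, Y))

    # gradient step with the L2 term: lambda = 0.1 times the weights
    w += learning_rate*(np.dot((T - Y).T, Xb) - 0.1*w)
    Y = sigmoid(Xb.dot(w))
```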

Run the program. As you can see, the weights are now much smaller. This is the effect of regularization. Since the prior assumes that the weights are distributed around zero, the weights we get are much closer to zero, and they no longer tend to grow to infinity.
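The snippet above relies on variables from earlier lessons. For readers who want to run something on its own, here is a self-contained sketch of the same update on two Gaussian clouds centered at (2; 2) and (-2; -2). The function name `train`, the zero initialization, the random seed and the sample size are assumptions made here and may differ from the course code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(Xb, T, lam, learning_rate=0.1, steps=100):
    """Gradient ascent on the log-likelihood minus (lam / 2) * ||w||^2."""
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        Y = sigmoid(Xb.dot(w))
        w += learning_rate * (Xb.T.dot(T - Y) - lam * w)
    return w

rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 2))
X[:50] += 2                      # class 1 cloud centered at (2, 2)
X[50:] -= 2                      # class 0 cloud centered at (-2, -2)
T = np.array([1.0] * 50 + [0.0] * 50)
Xb = np.column_stack([np.ones(N), X])   # bias column first

w_plain = train(Xb, T, lam=0.0)
w_reg = train(Xb, T, lam=0.1)
print("no penalty: ", w_plain, " norm:", np.linalg.norm(w_plain))
print("lambda=0.1: ", w_reg, " norm:", np.linalg.norm(w_reg))
```

Comparing the two norms shows the effect of the penalty directly; trying other values of `lam` shows how the balance between fit and weight size shifts.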