The application of L1 and L2-regularization in machine learning

L1 vs L2 Regularization

We discussed generalization and overfitting in our previous articles. We have seen that even when we add a column of random noise to our data, we can still improve the apparent fit of the model on the training data. In general, we want the dimensionality D of X to be much smaller than the number of observations N. For certain data sets, however, the dimensionality is much larger than the number of observations, and this is a problem.

What is L1 – regularization? Theory

One way to picture this is through the shape of the matrix X. We want X to be thin (tall): N is large and D is small. In the opposite case X is wide: N is small and D is large. This does not work well, and we need to take action to avoid potential problems.
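One concrete reason a wide X causes trouble can be sketched as follows. This is only an illustration on random data; the sizes N = 5 and D = 20 and the variable names are assumptions, not values from the article. When D > N, the D x D matrix X^T X has rank at most N, so it is singular and ordinary least squares has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 5, 20                      # illustrative sizes: far fewer rows than columns
X_wide = rng.standard_normal((N, D))

gram = X_wide.T @ X_wide          # D x D matrix that least squares would need to invert
rank = np.linalg.matrix_rank(gram)

print("shape of X^T X:", gram.shape)
print("rank of X^T X:", rank)     # at most N, so the D x D matrix is singular
```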

In this situation, we can select only a small number of the most important factors, those that set the trend, from the whole set and remove all the others, which are just noise. This is called sparsity, since most of the weights are zero and only a few are nonzero. In this section, we discuss L1-regularization, which allows us to achieve sparsity.

The basis of L1-regularization is a fairly simple idea. As in the case of L2-regularization, we simply add a penalty to the initial cost function. Just as L2-regularization penalizes the L2 norm of the weights, L1-regularization penalizes their L1 norm. L2-regularization is also called Ridge regression, and L1-regularization is called Lasso regression.

J_{RIDGE} = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 + \lambda \|w\|_2^2,

J_{LASSO} = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 + \lambda \|w\|_1.
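The two cost functions can be written out directly in NumPy. The sketch below is a minimal illustration; the function names `ridge_cost` and `lasso_cost` and the tiny data set are hypothetical and chosen so that the result can be checked by hand.

```python
import numpy as np

def ridge_cost(X, Y, w, lam):
    """Squared error plus lam times the squared L2 norm of w."""
    residual = Y - X @ w
    return residual @ residual + lam * np.sum(w ** 2)

def lasso_cost(X, Y, w, lam):
    """Squared error plus lam times the L1 norm of w."""
    residual = Y - X @ w
    return residual @ residual + lam * np.sum(np.abs(w))

# Tiny hand-checkable example: the fit is perfect, so only the penalty remains.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([3.0, -4.0])
Y = X @ w
print(ridge_cost(X, Y, w, lam=2.0))  # 2 * (9 + 16) = 50.0
print(lasso_cost(X, Y, w, lam=2.0))  # 2 * (3 + 4) = 14.0
```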

Before we solve the problem, let’s consider the probabilistic interpretation. The exponential of a negative square gives a Gaussian distribution, so in L2-regularization we had a Gaussian likelihood and a Gaussian prior on w. With L1-regularization, we no longer have a Gaussian prior on w.

Which distribution has a negative absolute value in the exponent?

It is the Laplace distribution.

p(w) = \frac{\lambda}{2} \exp(-\lambda |w|).

Thus, L1-regularization corresponds to assuming that the weights follow a Laplace prior distribution.
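The link between the Laplace prior and the L1 penalty can be checked numerically. In this small sketch (the value of `lam` and the test points are arbitrary assumptions), the negative logarithm of the Laplace density differs from lam * |w| only by a constant that does not depend on w.

```python
import numpy as np

def laplace_pdf(w, lam):
    """Zero-mean Laplace density p(w) = (lam / 2) * exp(-lam * |w|)."""
    return 0.5 * lam * np.exp(-lam * np.abs(w))

lam = 3.0
w = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

neg_log_prior = -np.log(laplace_pdf(w, lam))
l1_penalty = lam * np.abs(w)

# The two differ only by the constant -log(lam / 2), which does not depend on w.
offset = neg_log_prior - l1_penalty
print(offset)
```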

Let’s decide how to solve the problem. As usual, we take the derivative of the cost function, set it to zero, and solve for w. Let’s do the same here.

J = (Y - Xw)^T (Y - Xw) + \lambda |w|,

J = Y^T Y - 2Y^T Xw + w^T X^T Xw + \lambda |w|,

\frac{\partial J}{\partial w} = -2X^T Y + 2X^T Xw + \lambda \, \mathrm{sign}(w) = 0.

Let us examine the result. The sign(x) function returns one if x > 0, minus one if x < 0, and zero if x = 0. Since sign is not invertible, we cannot solve the equation for w in closed form. To solve the problem, we need gradient descent. Fortunately, we know how to apply it.

There is nothing more we can do analytically here.
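The derivative above can be sanity-checked against finite differences. The sketch below uses random data, and the names `lasso_cost` and `lasso_subgradient` are hypothetical. The check is done at a point where no weight is zero, because the absolute value is not differentiable at zero.

```python
import numpy as np

def lasso_cost(X, Y, w, lam):
    residual = Y - X @ w
    return residual @ residual + lam * np.sum(np.abs(w))

def lasso_subgradient(X, Y, w, lam):
    """-2 X^T Y + 2 X^T X w + lam * sign(w), with sign(0) taken as 0."""
    return -2 * X.T @ Y + 2 * X.T @ X @ w + lam * np.sign(w)

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
Y = rng.standard_normal(8)
w = np.array([0.7, -1.2, 0.4])    # no zero entries, so the cost is differentiable here
lam = 0.5

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = np.array([
    (lasso_cost(X, Y, w + eps * e, lam) - lasso_cost(X, Y, w - eps * e, lam)) / (2 * eps)
    for e in np.eye(3)
])
analytic = lasso_subgradient(X, Y, w, lam)
print(np.max(np.abs(numeric - analytic)))
```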

L1 – regularization. The Code

Now we are going to write code that demonstrates the use of L1-regularization!

The idea is to generate data such that the input matrix X is “wide” and Y depends on only a few of the factors, while the rest are just noise. We then use L1-regularization to find a sparse set of weights that identifies the useful dimensions of X.

We set the number of observations to 50 and the dimensionality to 50 as well, so X is square rather than thin, which for our purposes counts as wide. The entries of X are uniformly distributed, centered at zero, with a range from -5 to +5. We set the true weights to [1, 0.5, -0.5] for the first three dimensions and to zero for the remaining 47, so those dimensions do not affect the result. Y = Xw plus some random noise. The noise is Gaussian, in line with the assumptions of linear regression.

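import numpy as np
import matplotlib.pyplot as plt

N = 50
D = 50

# X is uniformly distributed on [-5, 5)
X = (np.random.random((N, D)) - 0.5)*10
true_w = np.array([1, 0.5, -0.5] + [0]*(D-3))

# Y = Xw plus Gaussian noise
Y = X.dot(true_w) + np.random.randn(N)*0.5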

Now we perform gradient descent. We set the learning rate to 0.001 and the L1 penalty to 10, and we run 500 iterations.

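costs = []
w = np.random.randn(D) / np.sqrt(D)  # random initial weights
learning_rate = 0.001
l1 = 10.0

for t in range(500):
    Yhat = X.dot(w)
    delta = Yhat - Y
    # gradient step; the factor of 2 from the derivative is absorbed into the learning rate
    w = w - learning_rate*(X.T.dot(delta) + l1*np.sign(w))
    mse = delta.dot(delta) / N
    costs.append(mse)

plt.plot(costs)
plt.show()

print("final w:", w)

plt.plot(true_w, label='true w')
plt.plot(w, label='w_map')
plt.legend()
plt.show()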

Run the program. We see that the cost drops quickly toward zero, so apparently there is no need to run all 500 iterations.


We also see that the calculated w is very close to the actual one.


What is L2 regularization in machine learning? Theory

Now we are going to discuss a technique known as L2-regularization, which helps to solve the problem of overfitting a model. The idea is that excessively large weights pull our line of best fit, built by minimizing the squared error, away from the main trend.

On the graph, you can see how the line of best fit built with L2-regularization describes the main trend better than the line built by minimizing the squared error alone.


How does L2-regularization work?

The idea is to modify our original cost function so that it penalizes large weights. To do this, we add a constant multiplied by the squared norm of w:

J = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 + \lambda |w|^2,

|w|^2 = w^T w = w_1^2 + w_2^2 + \dots + w_D^2.

So far, it is pretty easy.

Let us now look at this problem from a probabilistic point of view. Previously, we maximized the likelihood function, but now that we have changed our cost function this interpretation no longer applies directly. Recall that minimizing the squared error is equivalent to minimizing the negative log-likelihood, which is equivalent to maximizing the likelihood. So let us exponentiate the negative of our new cost function.

\exp(-J) = \left[ \prod_{n=1}^{N} \exp\left[ -(y_n - w^T x_n)^2 \right] \right] \exp\left[ -\lambda w^T w \right].

P(Y|X, w) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (y_n - w^T x_n)^2 \right].

We see two Gaussians. The first is the same as before. The second is a new Gaussian in which w is a random variable with mean zero and variance proportional to 1/λ. This second Gaussian is called the prior. It describes w independently of the data: it clearly does not depend on X or Y.

J_{OLD} \sim -\ln P(Y|X,w),

J_{NEW} \sim -\ln P(Y|X,w) - \ln P(w),

P(w|Y,X) = \frac{P(Y|X,w) P(w)}{P(Y|X)},

P(w|Y,X) \propto P(Y|X,w) P(w).

You may recognize Bayes' rule here. P(w | Y, X) is called the posterior, and this method is called Maximum A Posteriori estimation, abbreviated MAP. It means that we now maximize the posterior rather than the likelihood, as before.
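To make the correspondence with the penalty explicit, the short derivation below takes the negative logarithm of the posterior. It is a sketch under two assumptions that go beyond the text: the prior is written as a zero-mean Gaussian with variance \tau^2 (a symbol introduced here only for illustration), and the noise variance \sigma^2 is treated as known.

```latex
-\ln P(w \mid Y, X)
  = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^T x_n)^2
  + \frac{1}{2\tau^2} w^T w + \text{const},
\qquad\Rightarrow\qquad
\hat{w}_{MAP} = \arg\min_w \sum_{n=1}^{N} (y_n - w^T x_n)^2 + \lambda\, w^T w,
\quad \lambda = \frac{\sigma^2}{\tau^2}.
```

Multiplying through by 2\sigma^2 does not change the minimizer, which is why the penalty constant ends up as the ratio of the two variances: a narrow prior (small \tau^2) means a strong penalty.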

So, we have covered the theoretical basis of L2-regularization. How do we find w? In the same way as before: we take the derivative of our new cost function, set it to zero, and solve for w. First, we write out the cost function in full matrix form.

J = (Y - Xw)^T (Y - Xw) + \lambda w^T w,

J = Y^T Y - 2Y^T Xw + w^T X^T Xw + \lambda w^T w.

Then we take the derivative and equate it to zero.

\frac{\partial J}{\partial w} = -2X^T Y + 2X^T Xw + 2\lambda w = 0.

Solving this equation for w, we get:

(\lambda I + X^T X) w = X^T Y,

w = (\lambda I + X^T X)^{-1} X^T Y.

Our result is the same as before, except for the presence of λ.
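A quick way to convince yourself of this solution is to plug it back into the derivative and confirm that it vanishes. The sketch below does this on random data; the helper name `ridge_solve`, the data sizes and the value of `lam` are assumptions made for illustration.

```python
import numpy as np

def ridge_solve(X, Y, lam):
    """Solve (lam * I + X^T X) w = X^T Y for w."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ Y)

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))
Y = rng.standard_normal(30)
lam = 5.0

w = ridge_solve(X, Y, lam)

# Plug w back into the derivative from the text; it should vanish.
gradient = -2 * X.T @ Y + 2 * X.T @ X @ w + 2 * lam * w
print(np.max(np.abs(gradient)))
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common choice because it is cheaper and numerically more stable.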

Let’s sum up. L2-regularization, also known as Ridge regression, is one way to avoid excessive model complexity and overfitting. It works as follows. We add the squared norm of the weights, multiplied by a constant λ, to our previous squared-error function. We do this because large weights are an indication of overfitting. Next, we solve for w by taking the derivative and setting it to zero. This is the Maximum A Posteriori (MAP) method, since we find the w that maximizes the posterior given the data.

L2 – regularization. The Code

Now we demonstrate L2-regularization in code.

Let’s start with importing the NumPy and Matplotlib libraries.

import numpy as np

import matplotlib.pyplot as plt

Set the number of data points to 50. Generate the data so that we have 50 points evenly spaced between 0 and 10. Set Y = 0.5x plus some random noise.

N = 50
X = np.linspace(0, 10, N)
Y = 0.5*X + np.random.randn(N)

Now we manually create a couple of outliers (“hills”): we make the last point and the second-to-last point each 30 larger than they would otherwise be.

Y[-1] += 30

Y[-2] += 30

Next, we display our data graphically. You know how to do this.

plt.scatter(X, Y)
plt.show()


Let’s solve for the weights. First, we add a bias term, that is, a column of ones.

X = np.vstack([np.ones(N), X]).T

Now we find the maximum-likelihood solution and call it w_ml. You know how to calculate it. We then plot the original data together with the maximum-likelihood fit.

# maximum-likelihood solution: solve (X^T X) w = X^T Y
w_ml = np.linalg.solve(X.T.dot(X), X.T.dot(Y))
Yhat_ml = X.dot(w_ml)
plt.scatter(X[:,1], Y)
plt.plot(X[:,1], Yhat_ml)
plt.show()

Now we find the solution with L2-regularization. We set the penalty to 1000 and plot the result again.

l2 = 1000.0
# MAP solution: solve (l2*I + X^T X) w = X^T Y
w_map = np.linalg.solve(l2*np.eye(2) + X.T.dot(X), X.T.dot(Y))
Yhat_map = X.dot(w_map)
plt.scatter(X[:,1], Y)
plt.plot(X[:,1], Yhat_ml, label='maximum likelihood')
plt.plot(X[:,1], Yhat_map, label='map')
plt.legend()
plt.show()


 Now run the program and see what we have.

The first plot shows our input data with the outliers.


The second plot shows the maximum-likelihood solution.


Note that the fit is not very good because it is pulled toward the outliers. Now let’s look at the MAP solution alongside the maximum-likelihood solution. Note that the MAP solution follows the trend much more closely than the maximum-likelihood solution.
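The same comparison can be made with numbers instead of plots by looking at the fitted slopes, since the data were generated with a true slope of 0.5. The sketch below repeats the setup in a self-contained form; the fixed random seed and the name `x` for the raw inputs are assumptions added here for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 50
x = np.linspace(0, 10, N)
Y = 0.5 * x + rng.standard_normal(N)
Y[-1] += 30                       # same two artificial outliers as in the article
Y[-2] += 30

X = np.vstack([np.ones(N), x]).T  # column of ones for the bias term

l2 = 1000.0
w_ml = np.linalg.solve(X.T @ X, X.T @ Y)
w_map = np.linalg.solve(l2 * np.eye(2) + X.T @ X, X.T @ Y)

print("true slope: 0.5")
print("maximum likelihood slope:", w_ml[1])
print("MAP slope:", w_map[1])
```

Note that this penalizes the bias term along with the slope, exactly as the code above does. In practice the bias is often left unpenalized, but that is a separate design choice.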

Differences between L1 and L2 – regularizations

Now we consider the differences between L1 and L2-regularization and show how these differences arise from the mathematics. We start with the differences and then explain why they appear.

We have already seen that L1-regularization encourages sparsity, where only a few weights are nonzero. L2-regularization encourages small weights, but it does not drive them exactly to zero.

Let us discuss why this happens.

Note that both methods help improve generalization and test error, since they keep the model from overfitting to noise in the data.

  • L1-regularization achieves this by selecting the most important factors, those that affect the result the most. You can think of the factors that have little influence on the final result as ones that only “help” you predict the noise in the training data.
  • L2-regularization prevents the model from overfitting by discouraging disproportionately large weights.

Outline of Differences between L1 vs L2 Regularization

First, let’s look at the explanations that are commonly given. They are worth mentioning because we believe they are inadequate and distracting. A contour plot of the negative log-prior for each type of regularization shows the difference between them. It is not very useful for understanding the essence, although it is still worth studying.

It is more useful to imagine a model with a single one-dimensional weight. For L2-regularization, the additional term is a quadratic function. For L1-regularization, it is an absolute value. What matters here is the derivative of the penalty, since gradient descent moves along the direction given by the derivative.

With a quadratic term, the closer you are to zero, the smaller the derivative becomes, until it also approaches zero. Therefore, once w is already small under L2-regularization, further gradient descent does not change it much. With an absolute value, the derivative is a constant with magnitude 1. It is not formally defined at zero, but by convention we take it to be zero there.

Therefore, under L1-regularization gradient descent moves toward zero at a constant speed, and once the weight reaches zero, it stays there.

As a consequence, L2-regularization encourages small values of the weights, while L1-regularization drives them to exactly zero, thus producing sparsity.
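This one-dimensional picture is easy to simulate. The sketch below applies gradient steps to the penalty term alone, with no data term, for a single weight. The function names, starting value, penalty and learning rate are all illustrative assumptions. One further assumption is spelled out in the code: when an L1 step would reach or overshoot zero, the weight is set to zero, which is how "it remains there" is realized with a finite step size.

```python
def l1_step(w, lam, lr):
    """One gradient step on lam * |w|; the weight is clamped at zero once it gets there."""
    sign = (w > 0) - (w < 0)          # sign(w), with sign(0) = 0
    w_new = w - lr * lam * sign
    # If the step reaches or overshoots zero, stop at zero (an assumption of this sketch).
    return 0.0 if w_new * w <= 0 else w_new

def l2_step(w, lam, lr):
    """One gradient step on lam * w**2; the step shrinks in proportion to w."""
    return w - lr * 2 * lam * w

w_l1 = w_l2 = 1.0
lam, lr = 1.0, 0.1
for _ in range(20):
    w_l1 = l1_step(w_l1, lam, lr)
    w_l2 = l2_step(w_l2, lam, lr)

print("L1 penalty only:", w_l1)   # exactly 0.0
print("L2 penalty only:", w_l2)   # small but still positive
```

Without the clamping, a plain step with a constant magnitude would hop back and forth across zero rather than settle on it, which is why practical L1 solvers use some form of thresholding.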

Finally, note that you can include both L1 and L2-regularization in your model.

This model even has a special name: ElasticNet. The name sounds fancy, but in fact it just adds both the L1 and L2 penalties to your cost function.

J_{RIDGE} = J + \lambda_2 |w|^2,

J_{LASSO} = J + \lambda_1 |w|,

J_{ELASTICNET} = J + \lambda_1 |w| + \lambda_2 |w|^2.
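As a minimal sketch, the combined cost and its derivative can be written as follows. The function names and the toy data are hypothetical; setting one of the two constants to zero recovers the Lasso or the Ridge penalty.

```python
import numpy as np

def elastic_net_cost(X, Y, w, lam1, lam2):
    """Squared error + lam1 * |w|_1 + lam2 * |w|_2^2."""
    residual = Y - X @ w
    return residual @ residual + lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

def elastic_net_subgradient(X, Y, w, lam1, lam2):
    """Derivative of the cost above, using sign(0) = 0 for the L1 term."""
    return 2 * X.T @ (X @ w - Y) + lam1 * np.sign(w) + 2 * lam2 * w

X = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([3.0, -4.0])
Y = X @ w                          # perfect fit, so only the penalties contribute

print(elastic_net_cost(X, Y, w, lam1=2.0, lam2=0.0))  # pure lasso penalty: 14.0
print(elastic_net_cost(X, Y, w, lam1=0.0, lam2=2.0))  # pure ridge penalty: 50.0
print(elastic_net_cost(X, Y, w, lam1=2.0, lam2=2.0))  # both together: 64.0
```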

