**Gradient Descent Tutorial**

Contents

In this article, we will learn more about gradient descent, since it is widely used in machine learning and is such a general technique that can be useful in a variety of situations.

The idea is the following. Let you have a function, the minimum of which you want to find, and let you need to find such input data, under which the function would be at a minimum. As a rule, we want to minimize the cost or error functions. It may also be necessary to find maxima – for example when we look for a maximum of the likelihood function of a certain probability distribution. All you need to do is just to swap sides. To explain this, consider a very simple one-dimensional example.

As a rule, we use dimensions, much larger units in machine learning, but this example will clearly show the essence.

So, let’s say we have a simple function J = w2. We know that the minimum of the function for w is 0, but suppose we are not aware of this. We set our weighting coefficient randomly. Let`s assume that w = 20. We know that the derivative dJ / dw is equal to 2w.

Set the training coefficient equal to 0.1. In the first approximation we have:

*w* – 0,1*2*w* = 20 – 0,1*40 = 16.

Therefore, we set w = 16. In the second approximation

*w* – 0,1*2*w* = 16 – 0,1*2*16 = 12,8.

It gives us a new value w = 12.8. In the third approximation

*w* – 0,1*2*w* = 12,8 – 0,1*2*12,8 = 10,24.

As you can see, at each step we are closer and closer to zero, but each step becomes smaller, as the slope becomes smaller as we approach zero.

Now let’s try to implement this in the code and see if we can completely reach zero. We import the NumPy library, set the value w = 20, and print the result at each iteration step. The number of approximations is set equal to 30.

import numpy as np

w = 20

for i in xrange(30):

w = w – 0.1*2*w

print w

As you can see, w reaches a value of 0.02, so it seems that 30 approximations are not enough. Let’s try 100 approximations.

w = 20

for i in xrange(100):

w = w – 0.1*2*w

print w

Now the result is 4.07 * 10-9. It is very close to zero.

We hope that you have become convinced that we are coming closer and closer to the minimum of this function by moving slowly towards the gradient of the function.

Why is this method so important?

As we move further into deep training and machine learning, the functions will become increasingly complex. It can take you several hours or even days to find a derivative for neural networks with softmax.

When we move to convolutional and recurrent neural networks, the gradient can be found on the paper, but there is no desire to spend time on it. It is better to spend time testing various architectures and parameters, without worrying about gradients.

Moreover, you can use special libraries, such as Theano and TensorFlow to calculate gradients.

However, it is highly desirable to understand what is happening, because when you calculate the gradients, you can get one more tool in our machine learning toolkit, and we can apply it anywhere, even in such things as hidden Markvov models.

As an exercise, try to find the gradient and solution for the next cost function, using gradient descent.

*J(w _{1}, w_{2}) = w_{1}^{2} + w_{2}^{4}.*

**What is gradient descent and linear regression?**

Let`s consider how to use the gradient descent relating to linear regression. What is linear regression in Python? We have discussed it in detail in this article.

As you have to remember, the cost function is something we try to minimize; this is our quadratic error**.**

*J = (Y – Xw) ^{T}(Y-Xw).*

We also know what a derivative is since we have already dealt with it.

However, in the case of gradient descent, we do not equate the derivative to zero and solve it with respect to w. We simply make small steps in the direction of the gradient, until w meets the solution. It is worth noting that since the deuce is a constant, we can omit it because it can be absorbed by the learning coefficient. Therefore, the full gradient descent algorithm has the form

w = draw sample from N(0, 1/D)

for t=1..T

w = w – learning_rate*X^{T}(Y_{hat} – Y)

First, we set a random value for w

A good value can be the Gaussian distribution with a center at zero and dispersion from one to D, where D is the dimension. The next is to enter the specified number of steps into the loop, and then update the value of the gradient descent which has just been obtained. Alternatively, the loop can be terminated when the change of w becomes less than a certain set value.

You may have a question: how to choose the learning factor? The training factor is called a “hyper parameter” because it is not a parameter of the model itself. It it is used to find a solution. In linear regression, its magnitude does not play a significant role. However, as it has already been discussed in the introduction to gradient descent, an excessively large value of the learning coefficient does not allow some approximations to converge, since we will not slide down the gradient, but simply “jump” back and forth over the “ravine.”

If the value of the learning factor is excessively little, the gradient descent will be very slow. The question of the optimal value of hyper parameters is now under active research and will be discussed later.

At this stage, the most important thing is to practice a lot. You can develop an intuition having a rich experience, that is extremely important at all stages of profound learning. It is more important than any algorithms for finding a good value of a hyper parameter. The intuition that you have now is directly proportional to your training duration, so everything is in your hands!

## Bypassing the trap of a dummy variable with the help of gradient descent

Now we will show how the trap of a dummy variable stops being a problem when we use gradient descent. We will simulate a special situation where it is impossible to use a formal solution, and then we will show how gradient descent allows you to do the task.

If you do not want to write code yourself, then the corresponding file is called gradient_descent.pi.

So, let’s start with importing libraries **NumPy** и **Matplotlib.**

import numpy as np

import matplotlib as plt

Let’s set the number of experiments equal to 10, and the dimension is equal to three. We get X in the form of a matrix of dimension NxD. Set the first column equal to one. Also, set the first five elements of the second column and the last five elements of the third column equal to one.

X = np.zeros((N, D))

X[:, 0] = 1

X[:5, 1] = 1

X[:5, 2] = 1

So our matrix X has the form

array([[1., 1., 0.],

[1., 1., 0.],

[1., 1., 0.],

[1., 1., 0.],

[1., 1., 0.],

[1., 0., 1.],

[1., 0., 1.],

[1., 0., 1.],

[1., 0., 1.],

[1., 0., 1.]])

We set Y so that its value is zero for the first half of the data, and it is equal to one for the second half.

Y = np.array([0]*5 + [1]*5)

Thus, Y has the form

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

You can see that the usual solution

w = np.linalg.solve(X.T.dot(X), X.T.dot(Y))

does not work, since X^{T}X is the identity matrix.

Therefore, we use the gradient descent. We will save the previous cost function, and then we will display it graphically so you can be sure that it falls as gradient descent is carried out. We set the random weighting coefficients so that the dispersion is in the range from 1 to D, set the training coefficient equal to 0.001 and run 1000 cycles. We can also compute the mean squared error.

costs = []

w = np.random.randn(D) / np.sqrt(D)

learning_rate = 0.001

for t in xrange(1000):

Yhat = X.dot(w)

delta = Yhat – Y

w = w – learning_rate*X.T.dot(delta)

mse = delta.dot(delta) / N

costs.append(mse)

Let`s display it graphically

plt.plot(costs)

plt.show()

So, we see that the cost function falls at each step of the gradient descent approximation.

Let’s also display the final value of w and draw a graph:

w

plt.plot(Yhat, label=’prediction’)

plt.plot(Y, label=’targey’)

plt.legend()

plt.show()

We see that the predicted values of Y are very close to the target ones.