So far, we have only learned to find a solution for the Bayes classifier, and only when the data follow a Gaussian distribution with equal covariance for each class. This is generally not true, so we need a method that works for all types of data.

**Modification of weighting coefficients by means of gradient descent. Theory**

What did we do in the case of linear regression? We formulated a quadratic error function, took its derivative with respect to the weighting coefficients, set it to zero, and solved for the weighting coefficients. With the cross-entropy error function we can still take the derivative, but we cannot solve for the weighting coefficients by setting the derivative to zero. If in doubt, you can verify this yourself.

What can we do then?

Suppose we have a curve of the error function.

We initialize the weighting coefficient with a random value. Suppose, for example, that the weighting coefficient is -2 at this point and the derivative there is -1. The idea is to move step by step in the direction opposite to the derivative, since that is the direction in which the error decreases.

The step size is what we call the learning rate. In our example, we set it equal to one. Then the updated value of the weighting coefficient is

w = -2 - 1 · (-1) = -1.

So, now the value of the weighting coefficient is -1, which is closer to the optimal value. As we continue in this way, we get closer and closer to the minimum. Do not forget that the derivative is zero at the minimum, so once we reach it, the update no longer changes the weighting coefficient.
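The update rule can be tried on a concrete curve. The sketch below is only an illustration: it assumes a hypothetical error function J(w) = w²/4 (not taken from the text), chosen because its derivative at w = -2 is -1, which matches the numbers in the example. The names `dJ` and `gradient_descent` are made up for this sketch.

```python
# Hypothetical error curve J(w) = w**2 / 4, chosen so that dJ/dw = -1 at w = -2.
def dJ(w):
    return 0.5 * w


def gradient_descent(w, learning_rate, steps):
    """Repeatedly step against the derivative and record every value of w."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * dJ(w)
        history.append(w)
    return history


history = gradient_descent(w=-2.0, learning_rate=1.0, steps=5)
print(history)  # the first step moves w from -2.0 to -1.0, as in the text
```

For this particular curve each step halves the distance to the minimum at w = 0, so the steps shrink as the derivative shrinks.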

What is remarkable about this method is that it can be used for any differentiable objective function, not just in the case of logistic regression. As a result, you can make your models as complex as you like, and as long as you can calculate the derivative, you can get a solution.

If you do not like differential calculus, or maths in general, then I understand your disappointment. But remember that differential calculus underlies deep learning and machine learning, as well as statistics. Therefore, if you really want to succeed, you need to refresh your knowledge of differential calculus.

People sometimes ask me how to choose the learning rate.

Unfortunately, there is no scientifically grounded method for determining the learning rate. You just need to try different values and see how each one works.
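Trying different values can be as simple as a loop over candidates. The sketch below reuses a hypothetical error curve J(w) = w²/4 (an assumption made for illustration, not something from the text) and runs the same number of steps with three learning rates; the helper `run` is likewise made up.

```python
# Hypothetical error curve J(w) = w**2 / 4 with its minimum at w = 0.
def dJ(w):
    return 0.5 * w


def run(learning_rate, steps=20, w=-2.0):
    """Run a fixed number of gradient descent steps and return the final w."""
    for _ in range(steps):
        w = w - learning_rate * dJ(w)
    return w


# Try a small, a moderate and a large learning rate on the same problem.
results = {lr: run(lr) for lr in (0.1, 1.0, 5.0)}
for lr, w_final in results.items():
    print("learning rate %.1f -> final w = %g" % (lr, w_final))
```

On this curve the small rate is still far from the minimum after 20 steps, the moderate rate gets very close, and the large rate overshoots and moves away from the minimum. Other error functions will behave differently, which is why trial runs are needed.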

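So, how can we use gradient descent with logistic regression? Let's start with the objective function, the cross-entropy error:

$$J = -\sum_{n=1}^{N} \left[ t_n \log y_n + (1 - t_n) \log (1 - y_n) \right]$$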

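We need to find the derivative. Differential calculus gives us the chain rule, so we do not have to do everything at once. First, we find the derivative of J with respect to y, then the derivative of y with respect to the argument of the sigmoid (we denote it by a), then the derivative of a with respect to w, and then we multiply them to obtain the final answer:

$$\frac{\partial J}{\partial w_i} = \sum_{n=1}^{N} \frac{\partial J}{\partial y_n} \, \frac{\partial y_n}{\partial a_n} \, \frac{\partial a_n}{\partial w_i}, \qquad y_n = \sigma(a_n), \quad a_n = w^T x_n$$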

So, let's do it. The first derivative is

$$\frac{\partial J}{\partial y_n} = -\frac{t_n}{y_n} + \frac{1 - t_n}{1 - y_n}$$

This is because the derivative of the logarithm is one divided by the argument of the logarithm.

Next, we find the derivative of the sigmoid. We get:

$$\frac{\partial y_n}{\partial a_n} = y_n (1 - y_n)$$
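The sigmoid derivative has the well-known closed form y(1 - y), where y is the sigmoid output. As a quick sanity check, the sketch below compares that expression with a central finite-difference approximation; the grid of test points and the step `eps` are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1 / (1 + np.exp(-a))


a = np.linspace(-5, 5, 11)  # arbitrary test points
eps = 1e-6                  # finite-difference step

# Central finite-difference approximation of d(sigmoid)/da.
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)

# Closed-form expression y * (1 - y).
analytic = sigmoid(a) * (1 - sigmoid(a))

max_diff = np.max(np.abs(numeric - analytic))
print(max_diff)  # should be tiny compared with the derivative values
```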

Next is the derivative of *a_{n}* with respect to *w_{i}*. Since *a_{n}* is a linear function of w, the answer is *x_{ni}*:

$$\frac{\partial a_n}{\partial w_i} = x_{ni}$$

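Now we find the final solution. Note that in the first term of the derivative *y_{n}* cancels, and in the second term (1 - *y_{n}*) cancels:

$$\frac{\partial J}{\partial w_i} = \sum_{n=1}^{N} \left[ -\frac{t_n}{y_n} + \frac{1 - t_n}{1 - y_n} \right] y_n (1 - y_n) \, x_{ni} = \sum_{n=1}^{N} \left[ -t_n (1 - y_n) + (1 - t_n) \, y_n \right] x_{ni} = \sum_{n=1}^{N} (y_n - t_n) \, x_{ni}$$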

Since each element with index i on the left-hand side depends only on elements with index i on the right-hand side, we can rewrite the equation in vector form:

$$\frac{\partial J}{\partial w} = \sum_{n=1}^{N} (y_n - t_n) \, x_n$$

Thus, the derivative with respect to w is the sum over n of (*y_{n}* - *t_{n}*) multiplied by *x_{n}*, where *x_{n}* is a vector. This is much more convenient, because NumPy's vectorized operations are much more efficient than an ordinary Python loop.

When using the vector form, remember that the scalar product is the sum of the products of the elements with matching indices:

$$a^T b = \sum_{i} a_i b_i$$

We can rewrite our equation in a more explicit matrix form. Do not forget that X is a matrix of dimension N×D, and Y and T have dimension N×1. W has dimension D×1. We achieve this by transposing the matrix X and multiplying it by the expression Y - T:

$$\frac{\partial J}{\partial w} = X^T (Y - T)$$

We obtain the product of matrices with dimensions D×N and N×1. The result is a matrix of dimension D×1, because the inner dimension N is summed over and the outer dimensions are D and 1.
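To see that the matrix form and the per-sample sum agree, here is a small sketch on made-up data. The sizes, the seed and the variable names are assumptions for illustration, and 1-D arrays of length N are used in place of N×1 matrices, which NumPy handles the same way in this product.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
N, D = 6, 3

X = rng.randn(N, D)                  # N x D inputs
T = rng.randint(0, 2, size=N)        # targets, 0 or 1
w = rng.randn(D)                     # weights
Y = 1 / (1 + np.exp(-X.dot(w)))      # sigmoid outputs

# Loop form: sum over samples of (y_n - t_n) * x_n.
grad_loop = np.zeros(D)
for n in range(N):
    grad_loop += (Y[n] - T[n]) * X[n]

# Matrix form: X^T (Y - T), a (D x N) by (N,) product giving D values.
grad_matrix = X.T.dot(Y - T)

print(np.allclose(grad_loop, grad_matrix))
```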

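And what about the bias term? In logistic regression, it is easily brought into the same equation by adding a column of ones to X, so that *x_{n0}* = 1:

$$\frac{\partial J}{\partial w_0} = \sum_{n=1}^{N} (y_n - t_n) \, x_{n0} = \sum_{n=1}^{N} (y_n - t_n)$$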

**Modification of weighting coefficients by means of gradient descent. The Code**

Now we will consider the use of gradient descent in logistic regression. Fortunately, our derivative is expressed by subtraction and multiplication, so it will not be difficult.

We take most of the code from the previous program.

```python
import numpy as np

N = 100
D = 2

X = np.random.randn(N, D)

# center the first 50 points at (-2, -2) and the last 50 at (+2, +2)
X[:50, :] = X[:50, :] - 2 * np.ones((50, D))
X[50:, :] = X[50:, :] + 2 * np.ones((50, D))

# targets: class 0 for the first 50 points, class 1 for the last 50
T = np.array([0] * 50 + [1] * 50)

# add a column of ones for the bias term
ones = np.array([[1] * N]).T
Xb = np.concatenate((ones, X), axis=1)

# randomly initialize the weights
w = np.random.randn(D + 1)

z = Xb.dot(w)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Y = sigmoid(z)

def cross_entropy(T, Y):
    E = 0
    for i in range(N):
        if T[i] == 1:
            E -= np.log(Y[i])
        else:
            E -= np.log(1 - Y[i])
    return E

print(cross_entropy(T, Y))
```
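The loop in the cross-entropy function can also be written with NumPy array operations, in the same spirit as the vectorized gradient. The sketch below is one possible version, not the program's own code: `cross_entropy_loop`, `cross_entropy_vectorized`, `T_demo` and `Y_demo` are names and data made up for the comparison.

```python
import numpy as np


def cross_entropy_loop(T, Y):
    """Loop version: add -log(y) for class 1 and -log(1 - y) for class 0."""
    E = 0
    for i in range(len(T)):
        if T[i] == 1:
            E -= np.log(Y[i])
        else:
            E -= np.log(1 - Y[i])
    return E


def cross_entropy_vectorized(T, Y):
    """Same sum written with array operations instead of a Python loop."""
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))


# Small made-up example to compare the two versions.
T_demo = np.array([0, 0, 1, 1])
Y_demo = np.array([0.1, 0.4, 0.8, 0.6])

loop_value = cross_entropy_loop(T_demo, Y_demo)
vectorized_value = cross_entropy_vectorized(T_demo, Y_demo)
print(loop_value, vectorized_value)
```

Neither version guards against outputs that are exactly 0 or 1, where the logarithm is undefined.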

So, we have two normally distributed classes, one centered at the point (-2, -2) and the other centered at the point (+2, +2). We remove the closed-form solution for the Bayes classifier that the previous program used. We set the learning rate to 0.1, run 100 iterations, and display the result.

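```python
learning_rate = 0.1

for i in range(100):
    # print the error every 10 iterations
    if i % 10 == 0:
        print(cross_entropy(T, Y))

    # gradient descent step: the gradient is Xb^T (Y - T), so we add Xb^T (T - Y)
    w += learning_rate * np.dot((T - Y).T, Xb)
    Y = sigmoid(Xb.dot(w))

print("Final w:", w)
```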

When we run the program, we see that the error decreases sharply after about 30 iterations. The weighting coefficients are found: one is near zero and the other two are near 14. Although this generalizes the special-case solution, the values we obtained seem too large.

We will discuss this in our next article.

**Maximization of the Likelihood**

In connection with logistic regression, we will first consider how to calculate the likelihood for a biased coin.

So, suppose we have a coin, and when we flip it, the probability of landing on heads is p: *p(H) = p*. This probability is unknown to us. Naturally, the probability of landing on tails is *1 - p*: *p(T) = 1 - p*.

Let's run an experiment to determine the value of p. Suppose we toss the coin 10 times (N = 10), and it lands on heads seven times and on tails three times.

How can we calculate the total likelihood, that is, the probability of obtaining the result that we observed? It is the probability of heads taken once for each head, multiplied by the probability of tails taken once for each tail:

$$L = p^7 (1 - p)^3$$

We can write it this way because each flip is independent, so we can multiply the probabilities of the individual coin tosses.

We need to maximize L with respect to p. More precisely, we need to find the p for which L is largest. For this, of course, we use differential calculus. We take the logarithm of the likelihood function to simplify the work. This is legitimate because the logarithm is monotonically increasing, so the point at which the likelihood function reaches its maximum is also the point at which the log-likelihood function reaches its maximum.

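Let's do this. We denote the logarithm of the likelihood function by l:

$$l = \log L = 7 \log p + 3 \log (1 - p)$$

Set the derivative to zero:

$$\frac{dl}{dp} = \frac{7}{p} - \frac{3}{1 - p} = 0$$

And solve for p:

$$7 (1 - p) = 3 p \quad \Rightarrow \quad p = \frac{7}{10}$$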

So we have p = 7/10, which is the observed fraction of heads, exactly as we expected.
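The same answer can be checked numerically without calculus. The sketch below evaluates the log-likelihood 7 log p + 3 log(1 - p) on a grid of candidate values of p and picks the largest one; the grid spacing of 0.01 is an arbitrary choice for illustration.

```python
import numpy as np

heads, tails = 7, 3

# Candidate values of p, avoiding 0 and 1 where the logarithm is undefined.
p_grid = np.linspace(0.01, 0.99, 99)

# Log-likelihood of 7 heads and 3 tails for every candidate p.
log_likelihood = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

p_best = p_grid[np.argmax(log_likelihood)]
print(p_best)  # the grid point closest to 7/10
```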

Now we use a similar idea for logistic regression.

We have a probability *P(y = 1 | x)*. We can think of it as the probability of getting heads when we flip a coin. It is equal to the sigmoid of the dot product of w and x. We denote it by y:

$$y = \sigma(w^T x)$$

The likelihood function for N samples is equal to

$$L = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$

It follows that if the target variable *t_{n}* is equal to one, then the n-th factor of the likelihood is *y_{n}*, and if the target variable is zero, then the factor is 1 - *y_{n}*.

Interestingly, if we take the logarithm of the likelihood function, we get the expression for the cross-entropy error function that we discussed earlier:

$$l = \log L = \sum_{n=1}^{N} \left[ t_n \log y_n + (1 - t_n) \log (1 - y_n) \right]$$

The only difference is the absence of a minus sign before the formula. Thus, the maximum of the log-likelihood function is the same as the minimum of the cross-entropy error function.
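This relationship is easy to confirm numerically. The sketch below uses made-up targets and predicted probabilities (assumptions for illustration only), computes the likelihood as a product of *y_{n}* for targets equal to one and 1 - *y_{n}* for targets equal to zero, and compares its logarithm with the negated cross-entropy.

```python
import numpy as np

# Made-up targets and predicted probabilities, for illustration only.
T = np.array([1, 0, 1, 1, 0])
Y = np.array([0.9, 0.2, 0.7, 0.6, 0.4])

# Likelihood: y_n where t_n = 1, and (1 - y_n) where t_n = 0, all multiplied.
likelihood = np.prod(Y ** T * (1 - Y) ** (1 - T))
log_likelihood = np.log(likelihood)

# Cross-entropy error: the same sum of logarithms with a minus sign in front.
cross_entropy = -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

print(log_likelihood, -cross_entropy)  # the two values should agree
```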