# Training the logistic model using a closed-form solution for the Bayes classifier

We continue our series on how the logistic model is trained. Let us consider a special case of logistic regression in which the problem has a closed-form solution. No such solution exists in the general case, where we have to use gradient descent, but it can be useful if your data are distributed in a particular way.

## A closed-form solution for the Bayes classifier

The problem is as follows. Suppose we have data from two classes, and the data in each class follow a Gaussian distribution. The two classes have the same covariance but different means, as you can see in the picture. For this example, you need to be familiar with the multivariate Gaussian distribution. So let's look at Bayes' rule.

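It says that the posterior p(Y | x) for an input vector x is equal to the likelihood p(x | Y) multiplied by the prior p(Y) and divided by p(x):

$$p(Y \mid x) = \frac{p(x \mid Y)\, p(Y)}{p(x)}$$

The likelihood term is the Gaussian which we have discussed. We can estimate it by taking the data from each class and calculating their mean and covariance. The prior can be estimated by maximum likelihood. So, for example, p(Y = 1) is simply the number of times an object of class 1 has appeared, divided by the total number of cases:

$$p(Y = 1) = \frac{\text{number of samples of class 1}}{\text{total number of samples}}$$

The next thing we can do is to write Bayes' rule for the positive class and expand the denominator to bring it to the same form as the numerator:

$$p(Y = 1 \mid x) = \frac{p(x \mid Y = 1)\, p(Y = 1)}{p(x \mid Y = 1)\, p(Y = 1) + p(x \mid Y = 0)\, p(Y = 0)}$$

Next, we divide the numerator and the denominator by the numerator:

$$p(Y = 1 \mid x) = \frac{1}{1 + \dfrac{p(x \mid Y = 0)\, p(Y = 0)}{p(x \mid Y = 1)\, p(Y = 1)}}$$

As you can see, our equation now looks like the sigmoid of logistic regression. If we write them side by side, it is easier to see:

$$p(Y = 1 \mid x) = \frac{1}{1 + \exp\left(-(w^T x + b)\right)}$$

Thus, the negative of the weights multiplied by x (plus the bias) is equal to the logarithm of the ratio of the probabilities from the previous equation:

$$-(w^T x + b) = \ln \frac{p(x \mid Y = 0)\, p(Y = 0)}{p(x \mid Y = 1)\, p(Y = 1)}$$

Let us now denote p(Y = 1) = α and p(Y = 0) = 1 - α. It will simplify our expression. We also expand the logarithm, since it contains only multiplication and division:

$$-(w^T x + b) = \ln p(x \mid Y = 0) - \ln p(x \mid Y = 1) + \ln(1 - \alpha) - \ln \alpha$$

Next, we use the probability density formula for the Gaussian, where D is the dimension of x and μ_k is the mean of class k:

$$p(x \mid Y = k) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$

We immediately see the benefit of using logarithms, since they cancel the exponentials. The normalizing factors under the square root also cancel each other, because both classes share the same Σ:

$$-(w^T x + b) = -\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \ln \frac{1 - \alpha}{\alpha}$$

Multiplying out the brackets turns the expression into a sum of products:

$$-(w^T x + b) = -\frac{1}{2}\left(x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0\right) + \frac{1}{2}\left(x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_1 - \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1\right) + \ln \frac{1 - \alpha}{\alpha}$$

Note that the quadratic terms $x^T \Sigma^{-1} x$ cancel as well, and that each cross term occurs twice, once as $x^T \Sigma^{-1} \mu$ and once as $\mu^T \Sigma^{-1} x$.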

Here you may have a question: is $x^T \Sigma^{-1} \mu$ equal to $\mu^T \Sigma^{-1} x$? Should you be careful with such substitutions? Yes, they are equal: each expression is a scalar, a scalar equals its own transpose, and the covariance and its inverse are symmetric. I strongly recommend that you work it out on paper to be sure.
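
If you would rather check this numerically than on paper, here is a minimal sketch. The matrix, the vectors, and the random seed are made up for illustration; any symmetric positive-definite matrix would serve as a covariance.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, chosen arbitrarily

# Build an arbitrary symmetric positive-definite matrix to act as a covariance
A = rng.randn(3, 3)
cov = A.dot(A.T) + 3 * np.eye(3)
cov_inv = np.linalg.inv(cov)

x = rng.randn(3)
mu = rng.randn(3)

lhs = x.dot(cov_inv).dot(mu)   # x^T Sigma^-1 mu
rhs = mu.dot(cov_inv).dot(x)   # mu^T Sigma^-1 x

print(np.isclose(lhs, rhs))
```

The check relies on the inverse covariance being symmetric; with a non-symmetric matrix in the middle the two products would generally differ.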

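The factor 1/2 cancels against the doubled cross terms. If we group the terms that depend on x and those that do not, we get the following form of the equation:

$$-(w^T x + b) = (\mu_0^T - \mu_1^T)\, \Sigma^{-1} x - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln \frac{1 - \alpha}{\alpha}$$

Here μ_0 and μ_1 are the class means, Σ is the shared covariance, and α = p(Y = 1). If we examine the right-hand side more carefully, we find that it has the same linear form as our classifier, $-(w^T x + b)$. Note that we have included a bias term because it is also part of the classifier.

So we have one term that depends on x: it is our $w^T$. We also have terms that do not depend on x: together they are our b. Matching the two sides gives the following equations:

$$w^T = (\mu_1^T - \mu_0^T)\, \Sigma^{-1}$$

$$b = \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 - \ln \frac{1 - \alpha}{\alpha}$$

Spend a minute studying them. You can always pause the video if this is confusing or if you just need more time to understand.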

In our code examples, we use two Gaussians. One is centered at (-2, -2) and the other at (+2, +2). The variance of each dimension is equal to one and the dimensions are independent, so all off-diagonal elements of the covariance are zero. As an exercise, verify for yourself that the solution is w = (4, 4), b = 0. We assume an equal number of examples of both classes, which means that α = 0.5. Also, make sure that you can solve it yourself without going back to the video.
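
Once you have done the exercise by hand, you can compare your answer with a small numerical sketch. It assumes that class 0 is the Gaussian centered at (-2, -2) and class 1 the one centered at (+2, +2), which is the assignment consistent with w = (4, 4). The function names `closed_form_weights` and `gaussian_pdf` and the test point are made up for this illustration.

```python
import numpy as np

def closed_form_weights(mu0, mu1, cov, alpha):
    """Return (w, b) for two Gaussian classes that share the covariance cov.

    mu0, mu1 are the class means and alpha is the prior p(Y = 1).
    """
    cov_inv = np.linalg.inv(cov)
    w = cov_inv.dot(mu1 - mu0)
    b = (0.5 * mu0.dot(cov_inv).dot(mu0)
         - 0.5 * mu1.dot(cov_inv).dot(mu1)
         + np.log(alpha / (1 - alpha)))
    return w, b

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian at the point x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff.dot(np.linalg.inv(cov)).dot(diff)) / norm

mu0 = np.array([-2.0, -2.0])   # assumed center of class 0
mu1 = np.array([2.0, 2.0])     # assumed center of class 1
cov = np.eye(2)                # unit variance, independent dimensions
alpha = 0.5                    # equal number of examples in both classes

w, b = closed_form_weights(mu0, mu1, cov, alpha)
print(w, b)

# Compare the sigmoid with the posterior computed directly from Bayes' rule
x = np.array([0.3, -0.1])      # arbitrary test point
p1 = gaussian_pdf(x, mu1, cov) * alpha
p0 = gaussian_pdf(x, mu0, cov) * (1 - alpha)
posterior_bayes = p1 / (p1 + p0)
posterior_sigmoid = 1 / (1 + np.exp(-(w.dot(x) + b)))
print(posterior_bayes, posterior_sigmoid)
```

The two printed posteriors should agree, which is exactly what the derivation claims for Gaussian classes with a shared covariance.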

And a couple of short remarks. The method described is often called linear discriminant analysis (LDA). If the covariance is a diagonal matrix, as in the numerical example above, it is also an example of the so-called naive Bayes classifier. Make sure you understand why.
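
One way to convince yourself of the naive Bayes connection is to check that, with a diagonal covariance, the joint Gaussian density is just the product of independent one-dimensional densities. The sketch below does this for made-up numbers; the helper names are hypothetical.

```python
import numpy as np

def gaussian_pdf_1d(x, mean, var):
    """Density of a one-dimensional Gaussian (works elementwise on arrays)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian at the point x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff.dot(np.linalg.inv(cov)).dot(diff)) / norm

x = np.array([0.5, -1.0])          # arbitrary test point
mu = np.array([2.0, 2.0])          # class mean
variances = np.array([1.0, 2.5])   # per-feature variances (made up)

joint = gaussian_pdf(x, mu, np.diag(variances))
product = np.prod(gaussian_pdf_1d(x, mu, variances))

print(np.isclose(joint, product))
```

Treating the features as independent given the class is precisely the "naive" assumption.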

If the covariances are different, be careful! The quadratic terms will not cancel. As an exercise, you can try to work through the quadratic equation that results. This case is called quadratic discriminant analysis (QDA). Also, try to write code for quadratic discriminant analysis and compare the result with the linear one.
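
As a starting point for that exercise, here is one possible sketch of the log-odds with and without a shared covariance. It is not the only way to write QDA; the function names, the test point, and the second covariance are assumptions made for the illustration.

```python
import numpy as np

def qda_log_odds(x, mu0, cov0, mu1, cov1, alpha):
    """ln[p(Y=1 | x) / p(Y=0 | x)] for Gaussian classes with separate covariances."""
    d0 = x - mu0
    d1 = x - mu1
    quad0 = d0.dot(np.linalg.inv(cov0)).dot(d0)
    quad1 = d1.dot(np.linalg.inv(cov1)).dot(d1)
    # slogdet returns (sign, log|det|); the determinants no longer cancel
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return (0.5 * (quad0 - quad1)
            + 0.5 * (logdet0 - logdet1)
            + np.log(alpha / (1 - alpha)))

def lda_log_odds(x, mu0, mu1, cov, alpha):
    """Same quantity for a shared covariance: the linear form w^T x + b."""
    cov_inv = np.linalg.inv(cov)
    w = cov_inv.dot(mu1 - mu0)
    b = (0.5 * mu0.dot(cov_inv).dot(mu0)
         - 0.5 * mu1.dot(cov_inv).dot(mu1)
         + np.log(alpha / (1 - alpha)))
    return w.dot(x) + b

mu0 = np.array([-2.0, -2.0])
mu1 = np.array([2.0, 2.0])
x = np.array([-10.0, -10.0])   # a point far out on the class 0 side

linear = lda_log_odds(x, mu0, mu1, np.eye(2), 0.5)
same_cov = qda_log_odds(x, mu0, np.eye(2), mu1, np.eye(2), 0.5)
diff_cov = qda_log_odds(x, mu0, np.eye(2), mu1, 4 * np.eye(2), 0.5)

print(linear, same_cov, diff_cov)
```

With equal covariances the quadratic version reduces to the linear one. When class 1 is given a wider covariance, the sign of the log-odds at this far-away point flips, so the decision boundary is no longer a straight line.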

Finally, note that this solution is optimal only if our assumptions about the distribution of the data are correct. In general they do not hold, and then you have to use gradient descent.

## What do all these symbols X, Y, N, D, L, J, etc. mean?

We need to clarify the notation for the variables that we use, and not only in our articles. Apparently, part of the confusion arises because different people use different notation, and sometimes the same symbol ends up meaning completely different things. This section of the article is devoted to clearing that up.

The first thing we want to explain is the notation for the input data. Usually, we use N for the number of samples (observations) and D for the dimensionality, that is, the number of features in each observation. For example, if I measure the height of ten classmates to calculate their average height, then N = 10. If I measure their height, weight, and girth to estimate their body fat percentage, then D = 3, because I have three features: height, weight, and girth.

Taking all of the above into account, the data matrix X has dimension NxD: each row of the matrix is one observation, and each column holds the values of one feature across all observations. However, this is just a convention. Matrices of dimension DxN are used in some fields, such as natural language processing.

As you know, when we train a model such as logistic regression, each input x has a corresponding target y, so with N observations we have N values of y. If we collect all of them, we get a matrix of dimension Nx1. In particular, in binary classification this Nx1 matrix consists of zeros and ones.
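
To make the shapes concrete, here is a small NumPy sketch with randomly generated (made-up) data. Note one assumption: in NumPy code the targets are often stored as a one-dimensional array of length N rather than as an explicit Nx1 matrix, and that is what the sketch does.

```python
import numpy as np

N, D = 10, 3                       # 10 observations, 3 features each
rng = np.random.RandomState(0)     # fixed seed, chosen arbitrarily

X = rng.randn(N, D)                # data matrix: one row per observation
T = rng.randint(0, 2, size=N)      # binary targets: zeros and ones

print(X.shape)        # shape of the whole data matrix
print(X[0].shape)     # one observation (a row) has D values
print(X[:, 0].shape)  # one feature (a column) has N values
print(T.shape)        # one target per observation
```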

In our articles, we sometimes denote the target variable by t, which makes sense because the word "target" begins with the letter t. We write t instead of y because y is then used for a slightly different purpose: the output of the entire logistic regression model.

Do not forget that the true goal of logistic regression is to find the probability P(Y = 1 | X). Writing this expression out every time would be tedious, so we abbreviate it as y. As you see, this can be confusing, because y is used for two entirely different things.

The next thing that can cause confusion is the cost function, because it has many synonyms. As you know, in machine learning we formulate a cost function and then minimize it to find the optimal parameters of the model. Sometimes we call the cost function an error function or an objective function. An error function, like a cost function, is something we want to minimize, which makes sense even semantically. An objective function, on the other hand, is sometimes maximized and sometimes minimized. This is a routine reformulation of the problem: for example, finding the minimum of $x^2$ is equivalent to finding the maximum of $-x^2$. In linear classification, we first seek the maximum of the likelihood function, which is the probability of the data under the model. This is equivalent to maximizing the logarithm of the likelihood function, denoted by L.

We sometimes denote the objective function by J. This is standard notation in the literature. But we use the same symbol both for the log-likelihood, which we need to maximize, and for the negative log-likelihood, which characterizes the error and which we need to minimize.
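
The equivalence between these formulations can be seen on a toy problem. The sketch below scans a grid of candidate weights for a one-dimensional model without a bias term and finds the best one three ways: by maximizing the likelihood, by maximizing the log-likelihood L, and by minimizing J = -L. The data and the grid are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Made-up one-dimensional data; the classes overlap, so the optimum is finite
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
t = np.array([0, 0, 1, 0, 1, 1])

w_grid = np.linspace(-5, 5, 201)   # candidate weights

likelihood = np.zeros(len(w_grid))
log_likelihood = np.zeros(len(w_grid))
for k, w in enumerate(w_grid):
    y = sigmoid(w * x)                                     # model output P(Y = 1 | x)
    likelihood[k] = np.prod(y ** t * (1 - y) ** (1 - t))   # probability of the data
    log_likelihood[k] = np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

J = -log_likelihood                # negative log-likelihood, the error to minimize

best_by_likelihood = w_grid[np.argmax(likelihood)]
best_by_L = w_grid[np.argmax(log_likelihood)]
best_by_J = w_grid[np.argmin(J)]

print(best_by_likelihood, best_by_L, best_by_J)
```

All three searches pick the same weight, because the logarithm is monotonic and negation merely turns a maximum into a minimum.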

If you found this article through a search, you should first read the introductory article about data processing using logistic regression in Python.

## The online store project: training the logistic model

And now we continue our project and train the logistic model.

We import the necessary libraries and, of course, load our data:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.utils import shuffle
    from process import get_binary_data

    # load the data and shuffle samples and targets together
    X, Y = get_binary_data()
    X, Y = shuffle(X, Y)

After that, we can create the training and test sets, because we do not want to evaluate the model on the same data it was trained on. We hold out the last 100 samples as the test set and train on the rest.

    # train on everything except the last 100 samples
    Xtrain = X[:-100]
    Ytrain = Y[:-100]
    # hold out the last 100 samples for testing
    Xtest = X[-100:]
    Ytest = Y[-100:]
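
If negative slicing is unfamiliar, the following toy sketch shows how the slices `[:-n]` and `[-n:]` divide an array into two non-overlapping parts. The array and the test-set size of 3 are made up so that the result is easy to read.

```python
import numpy as np

X = np.arange(10).reshape(10, 1)   # ten made-up samples with one feature each
n_test = 3                         # hypothetical test-set size for this toy example

Xtrain = X[:-n_test]   # everything except the last n_test rows
Xtest = X[-n_test:]    # only the last n_test rows

print(Xtrain.ravel())
print(Xtest.ravel())
```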

After this, we again initialize the weights with random values.

    D = X.shape[1]  # number of features
    W = np.random.randn(D)
    b = 0

Next, we create the sigmoid, forward, and classification_rate functions. We have already done that.

    def sigmoid(a):
        return 1 / (1 + np.exp(-a))

    def forward(X, W, b):
        return sigmoid(X.dot(W) + b)

    def classification_rate(Y, P):
        return np.mean(Y == P)

Let's create the cross_entropy function.

    def cross_entropy(T, pY):
        return -np.mean(T*np.log(pY) + (1 - T)*np.log(1 - pY))
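
To get a feeling for what these functions return, here is a small self-contained sketch that applies them to hand-made targets and predictions. The numbers are invented for illustration and have nothing to do with the online store data.

```python
import numpy as np

def classification_rate(Y, P):
    return np.mean(Y == P)

def cross_entropy(T, pY):
    return -np.mean(T * np.log(pY) + (1 - T) * np.log(1 - pY))

T = np.array([1, 0, 1, 0])                    # made-up targets

good = np.array([0.9, 0.1, 0.8, 0.2])         # confident and correct predictions
bad = np.array([0.1, 0.9, 0.2, 0.8])          # confident and wrong predictions

print(cross_entropy(T, good), classification_rate(T, np.round(good)))
print(cross_entropy(T, bad), classification_rate(T, np.round(bad)))
```

The cost is small when the predicted probabilities agree with the targets and large when they confidently disagree, while the classification rate only looks at the rounded predictions.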

Now let's proceed to the main training loop. We calculate the cost on the training and test sets on each pass of the loop. The learning rate is set to 0.001, and the number of iterations is 10,000.

    train_costs = []
    test_costs = []
    learning_rate = 0.001
    for i in range(10000):
        # model outputs for both sets with the current weights
        pYtrain = forward(Xtrain, W, b)
        pYtest = forward(Xtest, W, b)

        # record the cost on both sets
        ctrain = cross_entropy(Ytrain, pYtrain)
        ctest = cross_entropy(Ytest, pYtest)
        train_costs.append(ctrain)
        test_costs.append(ctest)

Now we are ready to implement gradient descent, still inside the loop. To do this, we use the vector form of the equations.

        # gradient descent step (still inside the for loop)
        W -= learning_rate*Xtrain.T.dot(pYtrain - Ytrain)
        b -= learning_rate*(pYtrain - Ytrain).sum()
        if i % 1000 == 0:
            print(i, ctrain, ctest)
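
If you want to convince yourself that these update rules really are the gradient, one option is to compare them with finite differences. The sketch below does this on random made-up data for the summed cross-entropy; the function name `total_cross_entropy`, the data, and the step size are assumptions for the illustration. Note that cross_entropy in our program takes the mean rather than the sum, so its gradient is smaller by the constant factor N, which only rescales the effective learning rate.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def total_cross_entropy(X, T, W, b):
    """Summed (not averaged) cross-entropy over all samples."""
    pY = sigmoid(X.dot(W) + b)
    return -np.sum(T * np.log(pY) + (1 - T) * np.log(1 - pY))

rng = np.random.RandomState(0)     # fixed seed, chosen arbitrarily
X = rng.randn(20, 3)               # made-up inputs
T = rng.randint(0, 2, size=20)     # made-up binary targets
W = rng.randn(3)
b = 0.1

# Analytic gradients in vector form, as used in the update rules
pY = sigmoid(X.dot(W) + b)
grad_W = X.T.dot(pY - T)
grad_b = (pY - T).sum()

# Numerical gradients by central differences
eps = 1e-6
num_grad_W = np.zeros_like(W)
for j in range(len(W)):
    step = np.zeros_like(W)
    step[j] = eps
    num_grad_W[j] = (total_cross_entropy(X, T, W + step, b)
                     - total_cross_entropy(X, T, W - step, b)) / (2 * eps)
num_grad_b = (total_cross_entropy(X, T, W, b + eps)
              - total_cross_entropy(X, T, W, b - eps)) / (2 * eps)

print(np.allclose(grad_W, num_grad_W, atol=1e-4))
print(np.isclose(grad_b, num_grad_b, atol=1e-4))
```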

Finally, after all this, we print the classification rates for the training and test sets:

    print("Final train classification_rate:", classification_rate(Ytrain, np.round(pYtrain)))
    print("Final test classification_rate:", classification_rate(Ytest, np.round(pYtest)))

Another interesting thing we can do is plot the cost curves for the training and test sets:

    legend1, = plt.plot(train_costs, label='train_costs')
    legend2, = plt.plot(test_costs, label='test_costs')
    plt.legend([legend1, legend2])
    plt.show()

Run the program. We see that the cost curves for the training and test sets are almost the same. Our final classification rate is 96% for the training set and 92% for the test set.