# Logistic Regression: Fundamentals of Machine Learning in Python

In this article, we will get acquainted with logistic regression, which is a cornerstone in the construction of neural networks and deep learning, and is therefore necessary for understanding more complex machine learning models.

If you haven’t studied linear regression yet, follow the link.

This material is for people who want more than ordinary programming and want to become data scientists. That is why data science is mentioned in the title of the article.

If you want to take the next step and understand how logistic regression works, its subtleties, and also how neural networks and deep belief networks work, then this information is for you.

Yes, machine learning requires a lot of maths. In fact, all your future programming is just the implementation of mathematical models, so in reality everything is maths; you cannot avoid it. If you do try to avoid it, you go back to being an ordinary programmer who uses only the interface. Only you decide how far you are ready to go, but keep in mind that mathematical grounding is what makes a data scientist a real specialist.

So, let’s examine a series of educational articles on the topic of data science by means of logistic regression:

• First, we will talk about classification and consider it in the broader context of machine learning.
• Next, we will discuss linear classifiers, of which logistic regression is a particular case.
• Then we will talk about the influence of biology on logistic regression, namely the neuron, and consider one of the older models of the neuron, the perceptron.
• After this, we will examine how to represent logistic regression schematically and how we can extend these schemes to create different types of models.
• Next, we will discuss the prediction mechanism of logistic regression and its probabilistic interpretation.
• Next, we will consider the cross-entropy error function, then maximum likelihood, and talk about the broader concept of the objective function. In fact, all tasks of machine learning come down to optimizing an objective function. The only questions are what the objective function should be and how to optimize it.
• In the next article, we will discuss the use of gradient descent for the optimization problem of logistic regression.
• The next three articles will be devoted to more practical problems and examples of logistic regression: regularization, the donut problem, and the XOR problem.

## Overview of the logistic regression classification problem

Let’s examine the classification problem. As you remember, machine learning can be divided into supervised and unsupervised learning. In unsupervised learning the data, as a rule, have no labels, so we try to split them into groups and find structure, whereas in supervised learning the data already come with labels.

Supervised learning, in turn, can be divided into classification and regression. In classification, we try to split the data into categories; in regression, we try to predict the value of a function. Logistic regression, despite its name, is used for classification. One example of its application is the MNIST database, where we can create a classifier that determines which digit a person tried to write. Another example is image classification: an image can carry different labels, such as urban landscape, nature, snow and winter, nightlife, people, inverted images, and indoor scenes. You can also create a binary classifier for images. For example, if we only have images of cats and dogs, we classify each input image as either a cat or a dog.

Next, we will discuss linear classifiers in general and the influence of biology on logistic regression. We will find out how logistic regression approximates the action of a neuron and how all this becomes a building block for neural networks.

## Introduction to e-commerce project

Imagine that you are a data scientist at an online store and want to be able to predict the behavior of visitors on your site. This has a direct bearing on profitability: if you can predict when people are about to leave the site, you can show them a pop-up window and thereby encourage them to stay and do something else instead of leaving.

You can also see which aspects of your site are weak and you need to improve them. For example, you can see when people start the payment process but do not complete it. After analyzing the reasons for this situation, you can improve your site.

Another example: perhaps your mobile application is inconvenient for users, and therefore they do not buy anything when they visit a mobile version of your site.

Also, of course, it is always better to make decisions based on data and to use scientific methods to improve user-facing applications.

Now let’s look at our data. It is in CSV format, which, I hope, you are already familiar with: just a table in which the elements in each line are separated by commas.

It is essential to be able to use real data, which can come in the form of Excel spreadsheets, SQL tables, or simply log files. It is also important to be able to format the data so that it is suitable for use in your deep learning algorithm.
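As a quick illustration, here is how such a file can be read with Python’s built-in `csv` module. This is a minimal sketch: the column names follow the article, but the sample rows and their values are made up.

```python
import csv
import io

# A tiny in-memory sample standing in for the real CSV file;
# the column names follow the article, the values are invented.
csv_text = """Is_mobile,N_products_viewed,Visit_duration,Is_returning_visitor,Time_of_day,User_action
1,0,0.7,0,3,bounce
0,3,12.5,1,1,begin_checkout
"""

# DictReader uses the header line to name the fields in each row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(len(rows))                # number of observations
print(rows[0]["User_action"])   # label of the first observation
```

With a real file you would pass `open("data.csv")` instead of the `StringIO` object; everything else stays the same.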

If you are not used to such things, this project will definitely help you.

So, let’s proceed to the problem statement.

The first line of our file is a header showing what each column means.

The first is the Is_mobile column. It shows whether a user visits our site from a mobile device. It is, of course, binary: it equals one if the user visits from a mobile device, and zero otherwise.

The next column is N_products_viewed. It shows the number of products the user looked through during the session in which the user’s action (the label) occurred. This is numerical data consisting of integers that can only take values greater than or equal to zero. Later we will figure out what to do with it.

The next is Visit_duration. It shows how much time (in minutes) the visitor spent on our site. This is also numerical data taking only values greater than or equal to zero, but not necessarily integer ones.

The next column is Is_returning_visitor. It is another binary variable that takes a value of zero if it is a new visitor, and one if the user has already visited the site and returned.

The next column is Time_of_day. Time is a number, but we process it as a category. In general, time is awkward to represent with a single number, because it goes around in a circle. For example, 11 pm comes 23 hours after the previous midnight, but it is also only one hour before the next midnight. When we work with geometric concepts, as in machine learning, this matters.

There is an easy way to process such data – put them into categories. So, we use the value 0 for the time from midnight to 6 am, the value 1 for the time from 6 am to noon, the value 2 for the time from noon to 6 pm, and the value 3 for the time from 6 pm to midnight.
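This binning rule is straightforward to express in code. A minimal sketch in Python (the function name is our own):

```python
def time_of_day_category(hour):
    """Map an hour of the day (0-23) to the four categories described above:
    0 = midnight-6am, 1 = 6am-noon, 2 = noon-6pm, 3 = 6pm-midnight."""
    if 0 <= hour < 6:
        return 0
    elif hour < 12:
        return 1
    elif hour < 18:
        return 2
    else:
        return 3

print(time_of_day_category(23))  # 3: 11 pm falls in the 6pm-midnight bin
print(time_of_day_category(7))   # 1: 7 am falls in the 6am-noon bin
```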

Users are assumed to behave in the same way within the same category.

The last column, User_action, is the label. It takes four values: bounce, add_to_cart, begin_checkout and finish_checkout. Bounce means that the user simply left the site. Add_to_cart means that the user added goods to the shopping cart but did not start the payment. Begin_checkout means that the user started the payment process but did not finish it. Finish_checkout means that the user paid for the goods and you successfully received the order.

This is clearly a classification problem, so we can use logistic regression and neural networks. In this series of articles on logistic regression we will show only binary classification, whereas in the articles on deep learning we will demonstrate how to classify an arbitrary number of classes, because the mathematical calculations there are somewhat more complicated.

This means that in this case we will put the last two label classes aside and learn to predict only the data with the bounce and add_to_cart labels. Of course, you can go further and create multiple binary classifiers. I always welcome that: practice is great.
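Keeping only the first two label classes amounts to a simple filter. A minimal sketch in plain Python (the list of actions is a made-up sample):

```python
# Hypothetical sample of labels; we keep only the two classes
# used for binary classification: bounce and add_to_cart.
actions = ["bounce", "add_to_cart", "begin_checkout", "finish_checkout", "bounce"]
binary_classes = {"bounce", "add_to_cart"}

kept = [a for a in actions if a in binary_classes]
print(kept)  # the begin_checkout and finish_checkout rows are dropped
```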

Let’s talk now about data processing. Of course, you cannot feed categories directly into your logistic model or neural network, because they work with numeric vectors. Perhaps you have already heard about the solution to this problem in the linear regression article, but we will assume that you have not. The solution is one-hot encoding.

The bottom line is that if we have, say, four different categories, we use four different columns to represent them. In each observation we set the column corresponding to that observation’s category equal to one. For example, we have four categories for time: 0, 1, 2 and 3, which means four columns. If the category value is 3, we write a one in the fourth column; if the category value is 2, we write a one in the third column, and so on. Naturally, in any observation exactly one of these four columns contains a one. So we use one-hot encoding for the time indicator.
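One-hot encoding of the four time categories can be sketched in a few lines of NumPy (the sample category values below are made up):

```python
import numpy as np

# Hypothetical Time_of_day categories for five observations.
time_of_day = np.array([0, 1, 2, 3, 1])
n_categories = 4

# One row per observation, one column per category.
one_hot = np.zeros((len(time_of_day), n_categories))
# Set the column matching each observation's category to one.
one_hot[np.arange(len(time_of_day)), time_of_day] = 1

print(one_hot[3])  # category 3: the one is in the fourth column
```

Each row contains exactly one 1, matching the rule described above.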

So what about Is_mobile and Is_returning_visitor? Aren’t they categories too? Yes, they are, and technically we could convert each of them into two columns, but this is entirely optional. Do not forget that each weighting coefficient shows us the effect of its variable. If we have a column for the zero value of a variable, the weighting coefficient of that dimension indicates the effect of the zero value. If we do not have such a column, the effect of the zero value is simply absorbed into the bias term.

Finally, let’s talk about the numeric columns N_products_viewed and Visit_duration. We know that in both columns the numbers must be zero or more. N_products_viewed takes integer values, so technically we could treat it as a category. However, it behaves more like a number on a numeric scale, where magnitude matters: 0 is closer to 1 than to 2, and we would expect 1.5 to behave somewhere between 1 and 2. For example, if all users who viewed three or fewer products are not converted, and all users who viewed more than three products are converted, then we would expect a hypothetical user who viewed 2.5 or 0.1 products also not to be converted. If we treated the counts as unordered categories instead, a value like 0.5 would be an unseen category with no meaning at all, and this ordering would be lost. Therefore, the range and magnitude are significant, and we treat these columns as numbers.

One of the simplest ways of processing numbers is to normalize them first: subtract the mean and divide by the standard deviation. This centers the data around zero with a standard deviation equal to one. Keeping the values in a small range matters because functions like the sigmoid saturate for large inputs, for example:

σ(10) ≈ σ(11) ≈ σ(12) ≈ 1.

The model then can hardly distinguish between such inputs, so it is better to normalize the data first.
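The normalization step can be sketched with NumPy as follows (the sample values are made up, standing in for, say, visit durations in minutes):

```python
import numpy as np

# Hypothetical raw values, e.g. Visit_duration in minutes.
x = np.array([1.0, 4.0, 12.5, 0.7, 3.2])

# Subtract the mean and divide by the standard deviation.
x_norm = (x - x.mean()) / x.std()

# The result is centered at zero with unit standard deviation.
print(x_norm.mean())  # approximately 0
print(x_norm.std())   # approximately 1
```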

### What tasks does logistic regression solve in commercial data science projects?

As a rule, each supervised machine learning model has two main tasks. The first is prediction: we take the input data and try to compute the value of the output variable. The second is training: we compute the values of the weighting coefficients so that the model can compute the output variable as accurately as possible. You will see that each task has a corresponding section in the materials posted on our website.

First, we will analyze prediction, in other words, how to get output values from the model. As soon as we cover the theoretical basis, we will proceed to our online store project, create a model and obtain its predictions.

The next section will be devoted to training: how you can train the model to give accurate predictions, and how accurate they will be. After studying the theory of learning, we will return to our online store project, train the model and see how accurate it is.

### Linear classification

In this section, we will discuss linear classification as a whole. So, let’s consider the simplest case of linear classification, the two-dimensional one. Suppose we have a group of points X on the left and a group of points O on the right, and we want to separate them. Of course, we can separate them with a straight line. The straight line equation, as we remember from the school course, is

y = mx + b,

where m characterizes the slope of the line, b is the point of intersection with the y axis.

However, do not forget that in the general case the equation of a straight line is written in the form

ax + by + c = 0,

which, of course, can easily be transformed into the form y = mx + b. It is easy to see that if our X and O are separated by a straight line with a slope of 45 degrees that crosses the y axis at zero, then a = 1, b = -1, c = 0, or, in other words, our straight line equation has the form

x – y = 0.

Let’s have a point with coordinates x = 2, y = 1. We substitute these values in the equation

h(x,y) = x – y,

so we have

h(2,1) = 1 > 0.

Since one is greater than zero, we define our new point as belonging to the group O.

Let us now have a new point with the coordinates x = 1, y = 2, and again we need to determine whether it belongs to the group X or O. We again substitute the values of x and y in our expression for h:

h(x,y) = x – y,

and have

h(1,2) = -1 < 0,

Consequently, we define this point as belonging to the group X.

Suppose now that we have a point with coordinates x = 1, y = 1. We get:

h(x,y) = x – y,

h(1,1) = 0,

This point lies directly on the straight line, and we cannot determine whether it belongs to the group X or O.
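The three checks above can be collected into a tiny classifier. A minimal sketch in Python (the function names are our own):

```python
def h(x, y):
    """Decision function for the straight line x - y = 0."""
    return x - y

def classify(x, y):
    """Assign a point to group O (h > 0), group X (h < 0),
    or report that it lies exactly on the line (h = 0)."""
    value = h(x, y)
    if value > 0:
        return "O"
    elif value < 0:
        return "X"
    return "on the line"

print(classify(2, 1))  # O:  h(2, 1) =  1 > 0
print(classify(1, 2))  # X:  h(1, 2) = -1 < 0
print(classify(1, 1))  # on the line: h(1, 1) = 0
```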

That is the essence of the linear classification – we have all X and O, and the problem is to find the coefficients a, b and c, which determine the straight line we need.

Now we reformulate the problem regarding machine learning. As a rule, we denote x by x1, and y by x2 and consider them simply as a large vector x:

(x, y) = (x1, x2) = x.

The coefficients a, b, c of the straight line are denoted by wi, and the free coefficient, which determines the point of intersection with the y-axis, becomes the bias term, denoted by w0. We also usually create a dummy variable x0 whose value is always set to one. Then we can rewrite our hypothesis function h in terms of the vectors x and w:

h(x) = w0 + w1x1 + w2x2.

Thus, we can say that the function h is a linear combination of the x-elements. It can be written in the vector form

h(x) = wᵀx,

which is the scalar product of the vectors w and x.
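The same computation in vector form, with the dummy x0 = 1 prepended, can be sketched with NumPy, reusing the example line x - y = 0 and the point (2, 1) from above:

```python
import numpy as np

# Weights for the line x - y = 0: w0 = c = 0, w1 = a = 1, w2 = b = -1.
w = np.array([0.0, 1.0, -1.0])

# The point (x1, x2) = (2, 1) with the dummy x0 = 1 prepended.
x = np.array([1.0, 2.0, 1.0])

# The scalar product w^T x is the value of the decision function.
h = w @ x
print(h)  # 1.0 > 0, so the point belongs to group O
```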

So what happens if we increase the number of dimensions, adding x3, x4 and so on? In three dimensions our straight line becomes a plane, and in more than three dimensions it becomes a hyperplane. If we have nonlinearity, so that the function separating our data is not linear, we call the separating surface a hypersurface.

That is all for now :). Ask your questions in the comments, and do not forget to share this article on social networks; that will support us in our endeavor.

See you soon! 