## Introduction and course outline


Hello and welcome to our first deep learning course, **“Data Science: Deep Learning in Python, Part 1”**.

We hope you are as excited about this topic as we are. The study of neural networks and deep learning is what first got us interested in machine learning and data science. It’s easy to see why so many people are drawn to this topic and why so many companies now use machine learning in their research and development projects. This course is devoted entirely to the basics of building neural networks. Our goal is to help you build a solid foundation, so that when you move on to convolutional and recurrent neural networks or natural language processing, you will not run into difficulties but will instead generate new ideas based on the knowledge you already have.

So, why do we need deep learning?

In short, deep learning, like all other machine learning, is needed primarily for prediction. It turns out that deep learning works better than most other techniques, and here are some examples of what it can do. I’m sure most of you have already read how Google’s AlphaGo defeated the world champion at the game of Go, even though experts had predicted this would not happen for another ten years. Deep learning is used in self-driving cars, and it now powers parts of the Google search engine. Of course, deep learning is also used in more familiar applications, such as forecasting stock markets and recognizing faces in images. You can even use deep learning to predict who will win the next presidential election! I hope you are now convinced that this course is for you.

In general, among machine learning algorithms, we find again and again that deep learning is the best tool available today, and leading scientists around the world work on improving it every day.

Now let’s talk about the course itself. We will learn by example through what we call the course project. Imagine that you are an engineer at an online store and want to predict the actions of users based on data already collected about their previous activity. During the course, you will get acquainted with a methodology for solving this problem, and by the end of the course, you will be able to solve it yourself.

**The plan is the following:**

First of all, we will get acquainted with the softmax function. Until now, we have considered binary classification, which can only distinguish between two classes. Now we will learn to recognize any number of classes.
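As a preview, the softmax function can be sketched in a few lines of NumPy (an illustrative implementation, not the course’s exact code; subtracting the maximum is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(a):
    """Convert a vector of scores into probabilities that sum to 1."""
    e = np.exp(a - np.max(a))  # shift by the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())  # the largest score gets the largest probability; sum is 1
```

Unlike the sigmoid used in binary classification, softmax produces one probability per class, which is what lets us move beyond two classes.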

Next, we will learn how to train a neural network. To do this, we will use a very popular method called backpropagation (backward propagation of errors). I will show you that this is not a new invention, but an extension of what we did when studying logistic regression.

Then we will revisit the XOR problem and the “donut problem” that we already encountered in the logistic regression course. Back then, we had to construct features by hand so that the logistic regression model could solve these problems. I will show you that neural networks can learn the distinguishing features automatically.
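To preview both ideas, here is a minimal NumPy sketch of a one-hidden-layer network trained by backpropagation on the XOR data; the hidden-layer size, learning rate, and iteration count are arbitrary choices for illustration, not the course’s exact code:

```python
import numpy as np

np.random.seed(0)

# The XOR data set: not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# One hidden layer of 4 units (an arbitrary choice)
W1 = np.random.randn(2, 4); b1 = np.zeros(4)
W2 = np.random.randn(4, 1); b2 = np.zeros(1)

lr = 1.0
losses = []
for _ in range(2000):
    # Forward pass
    Z = sigmoid(X @ W1 + b1)        # hidden-layer activations
    Y = sigmoid(Z @ W2 + b2)        # output probabilities

    # Cross-entropy cost
    losses.append(-np.mean(T * np.log(Y) + (1 - T) * np.log(1 - Y)))

    # Backward pass: propagate the error one layer back at a time
    dY = Y - T                      # error at the output layer
    dZ = (dY @ W2.T) * Z * (1 - Z)  # error pushed back to the hidden layer
    W2 -= lr * Z.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dZ; b1 -= lr * dZ.sum(axis=0)

print(losses[0], losses[-1])  # the loss should fall as training proceeds
```

No hand-made features appear anywhere here: the hidden layer learns its own non-linear representation of the inputs.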

While building neural networks, we will first work with the NumPy library, so that you can create a neural network yourself, get some practice, and appreciate the theory it is based on. Then I will show you how to use the TensorFlow library, so you can create a simple plug-and-play script for deep learning. We will also look at TensorFlow Playground, a tool that lets you watch how a neural network is trained.

At the end of the course, we will work on one more project: facial expression recognition. You can keep in touch with us at any time.

I hope you will enjoy this course!

## How does this course fit into deep learning?

Since we already have a whole library of machine learning and deep learning courses, we can answer one frequently asked question: “Where should I start?” I hope this article answers that question.

First, let’s discuss where we are now. This course is an excellent place to start if you have enough background to understand what is happening, or if you are willing to make the effort to catch up on the missing material yourself.

So, what are the prerequisites for successfully completing this course? First of all, there is the course we like to call “Linear Regression: Deep Learning in Python, Part 1”. In that course, we consider the traditional linear model from statistics, based on an equation of the form

y = Wx + b.

This model appears everywhere, all the way up to convolutional and recurrent neural networks, so you should be well acquainted with it. In that course, we train the model by solving an equation directly, and this is the only case where the problem can be solved in closed (analytical) form. That course also introduces basic data pre-processing methods, in particular one-hot encoding.
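As an illustration, the closed-form solution for y = Wx + b with one input variable might look like this in NumPy (the synthetic data are made up for the example; a column of ones lets the bias b be learned together with the weight):

```python
import numpy as np

# Synthetic data generated from y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.1, size=100)

# Design matrix with a bias column of ones appended
X = np.column_stack([x, np.ones_like(x)])

# Closed-form (analytical) solution: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [2.0, 1.0]
```

This is the one case where no iterative training is needed; every model after linear regression requires gradient descent instead.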

The next course, which we like to call “Logistic Regression: Deep Learning, Part 0”, is devoted to another traditional machine learning method, but one used for classification rather than regression. In that course, you learn about gradient descent, a technique for training a model when no closed-form solution exists. You also learn basic machine learning concepts such as regularization, and how the logistic unit models a biological neuron.
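For illustration, gradient descent for logistic regression can be sketched like this in NumPy (the synthetic data and hyperparameters are made up for the example, not taken from the course):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian clouds: class 0 around (-2, -2), class 1 around (+2, +2)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
T = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
b = 0.0
lr = 0.1

for _ in range(1000):
    y = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probabilities
    # Gradient of the cross-entropy cost with respect to w and b
    w -= lr * X.T @ (y - T) / len(T)
    b -= lr * (y - T).mean()

accuracy = np.mean((y > 0.5) == T)
print(accuracy)  # the clouds are well separated, so accuracy should be high
```

There is no equation to solve here; we simply step downhill on the cost until the weights stop improving.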

These two courses lead us to the course “Deep Learning in Python, Part 1”, which focuses on the transition from binary to multiclass classification, as well as on a learning algorithm known as backpropagation. At this stage, I want to give you a solid foundation, because all subsequent deep learning topics build directly on it.

You will also practice the basics of this course on classic problems such as the XOR problem and the “donut problem,” and learn how a model can automatically learn non-linear features while solving practical problems such as the online store project and the facial expression recognition project.

So, where do we start?

In this course, you will become familiar with the basics of backpropagation. There are also ways to improve it, such as momentum, adaptive learning rates, and the use of GPUs on Amazon Web Services; this is the content of Part 2 of the deep learning course. Since you will need libraries like Theano and TensorFlow to take advantage of the GPU, in that course you will also learn how to use them from scratch. One remarkable thing about these libraries is that they perform automatic differentiation, which lets you build deeper and more complex neural networks, including convolutional and recurrent ones, without excessive effort. Of course, all of this requires knowledge of backpropagation, and that is why we cover it in Part 1.
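As a small taste of the improvements covered in Part 2, the momentum update can be sketched on a toy quadratic cost (a hypothetical minimal example; the cost, learning rate, and momentum coefficient are made up for illustration):

```python
def grad(w):
    # Gradient of the toy cost J(w) = w^2, which is minimized at w = 0
    return 2 * w

w, v = 5.0, 0.0
lr, mu = 0.1, 0.9   # learning rate and momentum coefficient

for _ in range(100):
    v = mu * v - lr * grad(w)  # the velocity accumulates past gradients
    w = w + v                  # step by the velocity, not the raw gradient

print(w)  # close to the minimum at 0
```

The velocity term smooths out the steps, which often lets training settle faster than plain gradient descent.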

In the third part, you will get acquainted with convolutional neural networks, which emulate the visual cortex of the animal brain and are excellent at classifying images.

In the fourth part, you will learn about unsupervised deep learning. It may not sound very interesting by itself, but I will show you how models such as the autoencoder and the restricted Boltzmann machine can be used to improve the training of neural networks and to overcome problems of plain backpropagation, such as the vanishing gradient problem. Note that you must already know backpropagation for this part.

In the following courses, you will get acquainted with recurrent neural networks, which are well suited to processing time series and sequences. You will also learn about deep natural language processing, which is used to model and understand language, especially for sentiment analysis. This is the field that gives us tools such as word2vec, which can correctly handle word analogies like “king” + “woman” = “queen”.

Finally, you will learn about reinforcement learning, which lets you apply deep learning to games; the most famous application of this kind is probably Google’s AlphaGo.

I want to remind you that my goal is to give you practical skills and a solid foundation. Deep learning, unlike many machine learning libraries, is not just “plug and play”; you need to understand what is “under the hood” to apply it effectively. If all you want to know is how to install a library and run these three lines of code:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, Y)
```

then you will not succeed at deep learning, and no serious company will employ you.

Why do I emphasize this? Sometimes people tell me that they tried to master deep learning, but there was too much math in it, or too much programming. But answer this question: would you complain that there is too much cooking at cooking school? Exactly! Everyone would think you were crazy. Deep learning is a branch of computer science, and computer science is mathematics plus programming. Look at this as an exciting challenge, not a heavy burden, and you will soon be rewarded!

## Testing your readiness for deep learning

In this article, we will check whether you are ready for this course.

Unfortunately, we had to write this article because some people are very surprised to discover that deep learning is not as simple as building an HTML site.

Nowadays many people want to master deep learning without the necessary background. We have found that people often arrive here as programmers of various specializations, for example web developers or Android developers. If this describes you and you want to take this course without knowing anything about machine learning, you should take a few steps back and plan your path more carefully.

We created the lecture “How does this course fit into deep learning?” specifically to emphasize two points. First, there are things you need to know before taking this course. Second, this course does not cover everything there is to know about deep learning. You will have to work hard to learn the material of this course, and the subsequent courses will require a lot of effort too. So if you expected to master everything given here in a month, multiply that time by 10 or even 100.

The rest of this lecture is a readiness test for the course. Note that if you cannot answer all of the questions, or do not feel confident working with them, the course will not be easy for you. So, let’s start.

Is the following statement true or false: “Only machine learning specialists need to know the math. It is enough for everyone else to know how to use the scikit-learn API or some other API.”

The statement is false. Scikit-learn is a wonderful library, but it is intended for people who already understand how the algorithms work and who want a convenient, well-designed software implementation. Deep learning is not built on the “plug and play” principle; what exactly happens inside the algorithm is what makes deep learning interesting. Also, note that the following three lines of code use the same API, yet they give different accuracy depending on the data set:

```python
logisticRegression.fit(X, Y)
neuralNetwork.fit(X, Y)
convNet.fit(X, Y)
```

The next question. **What equation is this?** I’ll give you a few seconds to think.

This is the equation of binary logistic regression, y = 1 / (1 + e^(−(wᵀx + b))). It is sometimes called a “neuron.”

The next question. **What kind of equation is this?** You have a few seconds to think.

This is the cross-entropy cost function, J = −Σₙ [tₙ log yₙ + (1 − tₙ) log(1 − yₙ)]. It is also the negative log-likelihood of the model’s output variable.

The next question. **What expression is this?**

This is the gradient descent update used to train the logistic regression model: w ← w − η ∂J/∂w.

The next question. **What kind of equation is this?**

This is what we get when we take the gradient of J with respect to the weights in logistic regression: ∂J/∂w = Xᵀ(Y − T). Note that the expression is written in vectorized form, because the NumPy library can then speed up the calculation.

And the last point. Let’s talk about where you can learn the fundamentals needed for any work in machine learning, not just for this course. You should master this material even for my linear regression course. There are two main areas to concentrate on: mathematics and programming.

As for mathematics, you should know differential calculus, linear algebra, and probability theory at an undergraduate level. If you believe you can learn all of this material in a day, you are entirely wrong. If you think you acquired superpowers just by passing a course like “Learning How to Learn,” or that you can absorb all the world’s knowledge with speed-reading techniques, you are also wrong. Personally, I would budget at least a few months.

If you want to learn the math needed to complete this course, you can find plenty of completely free material on the Internet.

As for the programming prerequisites, I have a free course called “Numpy Stack in Python,” which is easy to find using search. Note that it is not for beginners; you must already be a confident programmer. By the time you study the NumPy course, you should already have a thorough understanding of the math topics I mentioned above, since the NumPy stack is mostly used to speed up computations based on those subjects.

## Neural networks without mathematics

Let’s talk about neural networks entirely without mathematics. I think it is an excellent exercise that will give you an intuitive understanding of how neural networks work without getting caught up in complicated theory.

From the most general point of view, neural networks are just another method of supervised learning. All models, including logistic regression, the k-nearest neighbours algorithm, the naive Bayes classifier, support vector machines, decision trees, and neural networks, perform the same two essential functions. The first is training, when the model parameters are fitted on a training data set, usually denoted X and Y. The second is prediction, when we try to produce an accurate forecast using the parameters we learned on the training set. As a rule, this takes X as input and outputs ŷ.

Neural networks, in fact, do the same thing as logistic regression, but with a more complex architecture.

Neural networks resemble graphs: as you can see in the picture, they have vertices (nodes) and edges. In this picture we have one layer of input nodes, one layer of output nodes, and one layer of hidden nodes. Deep neural networks usually have more than one layer of hidden nodes.

The bottom line is that we have a set of input data, numbers that are fed to the input nodes. We can think of these nodes as sensory receptors, like the tactile receptors on your fingertips or the visual receptors in your eyes. The inputs receive different degrees of importance depending on their weights. These weights are like the strength of synapses in the brain: a larger weight means a stronger connection between one node and another. For neurons, this means the signal from one neuron is passed to the next with less attenuation. Note that each node in one layer is connected to every node in the next layer. You can even have feedback connections, where the nodes of the output layer connect back to the nodes of the input layer, but we will not discuss that topic here.

Thus, the way a neural network works is that an input signal travels through the nodes, layer by layer, until it reaches the output. That is how a prediction is made. Training the network means determining the values of the weights on each connection.
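The forward flow just described can be sketched in a few lines of NumPy (the layer sizes, random weights, and tanh activation are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 input nodes -> 4 hidden nodes -> 2 output nodes
x = rng.normal(size=3)          # the input signal (think: sensor readings)
W1 = rng.normal(size=(3, 4))    # weights: input layer -> hidden layer
W2 = rng.normal(size=(4, 2))    # weights: hidden layer -> output layer

hidden = np.tanh(x @ W1)        # each hidden node combines all inputs
output = hidden @ W2            # each output node combines all hidden nodes
print(output.shape)             # (2,)
```

Every node in one layer feeds every node in the next, which is exactly what the matrix multiplications express.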

Another view of a neural network, one that does not require graphs, is the one we studied in the logistic regression course.

As you remember, logistic units share many properties with biological neurons, so we sometimes call them neurons. Thus, a neural network is simply a group of neurons connected to each other. The picture highlights one of the logistic units; try to find the other two.

In neural networks, the training algorithm is called backpropagation (backward propagation of errors).

What does that mean? Imagine that we have more than one hidden layer. Errors then propagate backwards one layer at a time, so the second set of weights changes depending on the errors in the output layer, and the first set of weights changes depending on the errors in the middle layer. Many neural networks are trained with this method, so once you figure out backpropagation, you can be sure the main part of your study is behind you.

## Introduction to an online store project

In this lecture, we will get acquainted with our course project. You will also get acquainted with the data we will use later. I will tell you where the data come from, what we will do with them, and what value this project has for business. I hope the project will help you fully understand the theory and the code and learn how to apply them in real life. It is essential to be able to work with real data, which may come as Excel spreadsheets, SQL tables, or simply log files, and to be able to format them so they are suitable for your deep learning algorithm. If you are not used to such things, this project will help you.

So, let’s proceed to the statement of the problem.

Imagine that you are a data scientist at an online store and want to predict the behavior of visitors on your site. This has direct relevance to profitability: if you can predict when people are about to leave the site, you can show them a pop-up window and thereby encourage them to stay and do something else rather than leave.

You can also see which aspects of your site are weak and need improvement. For example, you might see that people start the checkout process but do not complete it; by analyzing the reasons, you can fix your site. As another example, perhaps your mobile application is inconvenient for users, and that is why they do not buy anything when they visit the mobile version of your site.

And, of course, it is always better to make decisions based on data and to use scientific methods to improve user-facing applications.

Now let’s look at our data. They are in CSV format, which I hope you are already familiar with: simply a table in which the elements of each line are separated by commas.

The first line of our file is a header showing what each column means.

The first column is Is_mobile. It shows whether the user visited our site from a mobile device. It is binary, of course: it takes the value one if the user visited from a mobile device, and zero otherwise.

The next column is N_products_viewed. It shows the number of products the user viewed during the session in which the user’s action (the label) occurred. This is numeric data consisting of integers greater than or equal to zero. We will discuss later what to do with it.

Next is Visit_duration. It shows how much time (in minutes) the visitor spent on our site. This is also numeric data taking values greater than or equal to zero, but not necessarily integers.

The next column is Is_returning_visitor. This is another binary variable: it takes the value zero for a new visitor, and one if the user has visited the site before and returned.

The next column is Time_of_day. Time is a number, but we will treat it as a category. In general, time is awkward to represent with plain numbers because it wraps around in a circle. For example, 11 pm is 23 hours after the previous midnight, but it is also only one hour before the next midnight. When we work with geometric notions, as we do in machine learning, this matters.

There is an easy way to handle such data: divide it into categories. We use the value 0 for the time from midnight to 6 am, 1 for 6 am to noon, 2 for noon to 6 pm, and 3 for 6 pm to midnight. The assumption is that users in the same category behave similarly.
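That binning rule can be sketched with a small helper function (the function name is hypothetical, chosen just for this example):

```python
def time_of_day_category(hour):
    """Map an hour (0-23) to one of four six-hour bins."""
    if 0 <= hour < 6:
        return 0   # midnight to 6 am
    elif hour < 12:
        return 1   # 6 am to noon
    elif hour < 18:
        return 2   # noon to 6 pm
    else:
        return 3   # 6 pm to midnight

print([time_of_day_category(h) for h in (3, 9, 15, 23)])  # [0, 1, 2, 3]
```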

The last column, User_action, is the label. It takes four values: bounce, add_to_cart, begin_checkout, and finish_checkout. Bounce means the user simply left the site. Add_to_cart means the user added goods to the shopping cart but did not start checkout. Begin_checkout means the user started the checkout process but did not finish it. Finish_checkout means the user paid for the goods and you successfully received the order.

This is a classification task, so we can use logistic regression and neural networks. In the logistic regression course, I show only binary classification; in this deep learning course, I demonstrate how to classify an arbitrary number of classes, because the mathematics is more involved there. This means that in the logistic regression course we drop the last two label classes and learn to predict only the data with the bounce and add_to_cart labels. Of course, you can go further and build multiple binary classifiers; we always welcome this, since practice is good.

Now let’s talk about data pre-processing. You cannot use categories directly in your logistic model or neural network, because they work with numeric vectors. You may have already heard how to solve this problem in my linear regression course, but we will assume you have not. The solution is one-hot encoding.

The bottom line is that if we have, say, three different categories, we use three different columns to represent them. Then, for each observation, we simply set the column corresponding to its category to one.

For example, we have four categories for time: 0, 1, 2, and 3. This means we have four columns. If the category value is 3, we write a one in the fourth column. If the category value is 2, we write a one in the third column, and so on. Naturally, in any observation the 1 appears in only one of these four columns. That is one-hot encoding applied to the time indicator.
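In NumPy, one-hot encoding the four time categories might look like this (a sketch with made-up observations):

```python
import numpy as np

time_of_day = np.array([3, 2, 0, 1, 3])  # one category per observation

# One column per category; a 1 marks the category of each row
one_hot = np.zeros((len(time_of_day), 4))
one_hot[np.arange(len(time_of_day)), time_of_day] = 1
print(one_hot)
```

Each row sums to exactly one, reflecting that every observation belongs to exactly one category.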

What about Is_mobile and Is_returning_visitor? Aren’t those categories too? Yes, they are, and technically we could convert each of them into two columns, but it is entirely optional. Remember that each weight tells us the effect of one variable. If we have a column for the zero value of the variable, the weight of that dimension indicates the impact of the zero value; if we do not have such a column, the impact of the zero value is simply absorbed into the bias term.

Finally, let’s talk about the numeric columns, N_products_viewed and Visit_duration. We know that both columns contain numbers that are zero or more. N_products_viewed actually holds integer values, which means we could technically treat it as a category, but it is better to treat it as a number, because it lives on a numeric scale that makes sense: 0 is closer to 1 than it is to 2, and we would expect 1.5 to behave like something between 1 and 2. Suppose, for example, that all users who viewed three or fewer products are considered uninterested, and all users who viewed more than three products are considered interested. Then a user who somehow viewed 2.5 products, or 0.1 of a product, would also be considered uninterested; by the same rule, you would not expect a user who viewed 0.5 of a product to suddenly become interested, because there is nothing special about 0.5. Scale and magnitude are therefore meaningful here.

One of the simplest ways to process numbers is to normalize them first. This means subtracting the mean and dividing by the standard deviation: z = (x − μ) / σ.

This centers the data around zero with a standard deviation of one. Keeping the data in a small range matters because functions like the sigmoid saturate on large values: the output gets pinned near 0 or 1 and barely responds to changes in the input. That is not good, so it is better to normalize the data first.
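Normalizing a numeric column such as Visit_duration might look like this in NumPy (a sketch with made-up values):

```python
import numpy as np

visit_duration = np.array([0.5, 3.2, 12.0, 1.1, 7.4])

# Subtract the mean and divide by the standard deviation
normalized = (visit_duration - visit_duration.mean()) / visit_duration.std()

print(normalized.mean())  # approximately 0
print(normalized.std())   # approximately 1
```

Note that the mean and standard deviation should be computed on the training data and then reused to transform any new data.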

So, how will we include this project in our course?

As a rule, each supervised machine learning model has two main tasks. The first is prediction: we take input data and try to compute the value of the output variable. The second is training: we compute the values of the weights so that the model can predict the output variable as accurately as possible. As you will see, this course has a corresponding section for each task.

First, we will analyze prediction, in other words, how to obtain output values from the model. As soon as we have covered the relevant theory, we will return to our online store project, build a model, and get a forecast from it.

The next section is devoted to training: how you can train the model to give accurate predictions, and how accurate those predictions will be. After studying the theory of learning, we will return to the online store project, train the model, and see how accurate it is.