In logistic regression, as you remember, each input variable has a corresponding weighting coefficient. To compute the output, we first multiply the value of each of the D input variables by its weighting coefficient, sum the products, and add a free term (the bias):

$$a = w_{1} x_{1} + w_{2} x_{2} + \dots + w_{D} x_{D} + b$$

Then we substitute this sum, denoted a, into the sigmoid:

$$p(y = 1 \mid x) = \sigma(a) = \frac{1}{1 + e^{-a}}$$

This gives us the probability that y = 1. To make a prediction, we simply round the probability to 0 or 1.
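
The three steps above (weighted sum, sigmoid, rounding) can be sketched in NumPy as follows. The input values, weights, and free term are made up purely for illustration, and the names `x`, `w`, `b`, `a`, and `p` are assumptions rather than notation fixed by this lecture.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical values, chosen only for illustration.
x = np.array([2.0, -1.0])   # input variables
w = np.array([0.5, 1.5])    # one weighting coefficient per input variable
b = 0.25                    # free term (bias)

a = w.dot(x) + b               # weighted sum plus the free term
p = sigmoid(a)                 # probability that y = 1
prediction = int(np.round(p))  # round the probability to get the class
```

With these particular numbers the weighted sum is negative, so the probability falls below 0.5 and the predicted class is 0.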

Let us extend this idea to neural networks. In fact, all we need to do is add more layers of logistic regression. In this course, we will mainly work with one additional layer, but you can add an arbitrary number of layers. In recent years, researchers have made great strides with the help of deep networks, hence the term "deep learning", but as a first step it is enough to add one layer.

As you can see, the calculations are the same as before: we multiply each input variable by its weighting coefficient, add a free term, and substitute the result into the sigmoid. This is how we get the values of the nodes z:

$$z_{j} = \sigma\Big(\sum_{i} W_{ij} x_{i} + b_{j}\Big)$$

Note that W is now a matrix, since we need a weighting coefficient for each pair of an input variable and a node z. For example, with two input variables and three nodes z (the outputs of this layer), we have six weighting coefficients in total. Note also that each node z_{j} has its own free term b_{j}.
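
As a rough sketch of this layer, assuming two input variables and three nodes z as in the example above, the weights can be stored as a 2 x 3 NumPy array. The random initial values and the convention that `W[i, j]` connects input i to node j are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element by element."""
    return 1.0 / (1.0 + np.exp(-a))

D, M = 2, 3                       # two input variables, three nodes z
rng = np.random.default_rng(0)    # fixed seed so that runs are repeatable
W = rng.standard_normal((D, M))   # W[i, j] connects input i to node j
b = np.zeros(M)                   # one free term b_j per node z_j

x = np.array([1.0, 0.0])          # a single input example
z = sigmoid(x.dot(W) + b)         # values of the three nodes z
```

Here `W.size` is 6, matching the six weighting coefficients counted above, and `z` holds one value per node.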

Now let’s discuss nonlinearity. This is exactly what makes neural networks so powerful: the nonlinear functions applied between layers turn them into nonlinear classifiers.

We are already familiar with the sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

In short, it is an S-shaped curve whose values lie between 0 and 1.

Another popular nonlinear function is the hyperbolic tangent:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Its values lie between -1 and 1.

Note that this function has the same S-shape as the sigmoid. As an exercise, try to find the mathematical relationship between the hyperbolic tangent and the sigmoid. You will see that the hyperbolic tangent is just the same sigmoid, rescaled horizontally and vertically and shifted so that it is centered at zero.
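
If you want to check your answer to the exercise numerically, one possible sketch is shown below. It compares `np.tanh` with a rescaled sigmoid on a grid of points; the grid itself is an arbitrary choice, and the identity tested is tanh(x) = 2 * sigmoid(2x) - 1.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element by element."""
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-5.0, 5.0, 101)                  # arbitrary grid of test points
rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0  # candidate expression for tanh(x)
same = np.allclose(np.tanh(x), rescaled_sigmoid) # True if the two curves agree
```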

Finally, we have the nonlinear function relu(x). It returns zero for all argument values less than zero, and the argument itself in all other cases:

$$\mathrm{relu}(x) = \max(0, x)$$

Although this function looks much simpler than the other nonlinear functions, it has been found to work remarkably well in computer vision problems.
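
For comparison, here is a minimal sketch of the three nonlinear functions in NumPy, evaluated at a few sample points. The helper names `sigmoid` and `relu` are our own, while `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: values between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    """Zero for negative arguments, the argument itself otherwise."""
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 2.0])   # sample points, chosen for illustration
s = sigmoid(a)                   # stays within (0, 1), equals 0.5 at a = 0
t = np.tanh(a)                   # stays within (-1, 1), equals 0 at a = 0
r = relu(a)                      # negative input is clipped to zero
```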

Let’s work through a numerical example. Let the input variables have the values x1 = 1, x2 = 0, or, in vector form, x = (1, 0). We set all the weighting coefficients to 1 and all the free terms to 0. First, we need to calculate the values of z. We get:

$$z_{1} = z_{2} = \sigma(1 \cdot 1 + 1 \cdot 0 + 0) = \sigma(1) \approx 0.73$$

Both values of z are equal because each node receives the same input values, 1 and 0, with the same weighting coefficients.

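Next, we calculate Y, which gives us a value of about 0.8:

$$Y = \sigma(1 \cdot z_{1} + 1 \cdot z_{2} + 0) = \sigma(2\,\sigma(1)) \approx \sigma(1.46) \approx 0.81$$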
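
The same numerical example can be reproduced in a few lines of NumPy. This is a sketch that assumes two nodes z, all weighting coefficients equal to 1, and all free terms equal to 0; the names `v` and `c` for the output weights and the output free term are assumptions.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element by element."""
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 0.0])   # input variables x1 = 1, x2 = 0
W = np.ones((2, 2))        # input-to-z weighting coefficients, all 1
b = np.zeros(2)            # free terms of the nodes z, all 0
v = np.ones(2)             # output weighting coefficients, all 1
c = 0.0                    # output free term

z = sigmoid(x.dot(W) + b)  # both entries equal sigmoid(1), about 0.73
y = sigmoid(z.dot(v) + c)  # about 0.81
```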

Later in the course, we will find out how to choose the values of the weighting coefficients and free terms so that the neural network works correctly.

Do not forget that when we use the NumPy library, it is generally much faster to use the built-in vector and matrix operations than ordinary Python loops. We can treat our x as a vector of dimension D and our z as a vector of dimension M. In this case, to calculate the output of the neural network, we write everything in vector form, where W is a D x M matrix, v is the vector of output weighting coefficients, and c is the output free term:

$$z = \sigma(W^{T} x + b), \qquad Y = \sigma(v^{T} z + c)$$
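
To illustrate the difference, the sketch below computes the nodes z for one input vector twice: once with explicit Python loops and once with a single matrix-vector product. The sizes and random values are arbitrary assumptions, and no timings are claimed here; the point is only that both forms give the same result.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, works on scalars and arrays."""
    return 1.0 / (1.0 + np.exp(-a))

D, M = 4, 3                       # arbitrary sizes for illustration
rng = np.random.default_rng(1)    # fixed seed so that runs are repeatable
x = rng.standard_normal(D)        # input vector of dimension D
W = rng.standard_normal((D, M))   # W[i, j] connects input i to node j
b = rng.standard_normal(M)        # one free term per node z

# Version 1: explicit Python loops over nodes and inputs.
z_loop = np.zeros(M)
for j in range(M):
    a = b[j]
    for i in range(D):
        a += W[i, j] * x[i]
    z_loop[j] = sigmoid(a)

# Version 2: the same calculation in vector form.
z_vec = sigmoid(W.T.dot(x) + b)
```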

We can go even further. Do not forget that we do not expect just one input example but many of them, and we want the neural network to process all of them at the same time. Therefore, we can vectorize our calculations further by using the full matrix of input data. It is an N x D matrix, where N is the number of examples (observations) and D is the dimension of the input data. Since we compute everything at once, z is a matrix of dimension N x M, and our output Y is a matrix of dimension N x 1. In the case of multiclass classification with K classes, Y is a matrix of dimension N x K. Since all terms must be multiplied according to the rules of matrix multiplication, their dimensions have to match. Thus, W must be a D x M matrix, the first free term b is a vector of dimension M (it is added to every row of the N x M matrix), the output weighting coefficient v has dimension M x 1, and the output free term is a scalar.
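
A minimal sketch of this fully vectorized calculation, with the dimensions written out as comments, might look as follows. The function name `forward`, the name `c` for the output free term, and the random data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element by element."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W, b, v, c):
    """Compute the nodes Z and the output Y for all N examples at once."""
    Z = sigmoid(X.dot(W) + b)   # (N, D) times (D, M) gives (N, M); b is added to each row
    Y = sigmoid(Z.dot(v) + c)   # (N, M) times (M, 1) gives (N, 1)
    return Z, Y

N, D, M = 5, 2, 3                 # arbitrary sizes for illustration
rng = np.random.default_rng(2)    # fixed seed so that runs are repeatable
X = rng.standard_normal((N, D))   # one example per row
W = rng.standard_normal((D, M))
b = np.zeros(M)
v = rng.standard_normal((M, 1))
c = 0.0

Z, Y = forward(X, W, b, v, c)
predictions = np.round(Y)         # one 0/1 prediction per example
```

Storing b as a one-dimensional array of length M lets NumPy broadcasting add it to every row of the N x M matrix, which is why it is not stored as an M x 1 column here.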

Since you are already familiar with binary classification and the sigmoid, this lecture has been devoted to introducing the architecture of neural networks. In the next section, we will discuss how to extend the neural network so that it can classify more than two classes.