“Machine learning has already helped marketers use customer data more efficiently, and it complements what they have always had: intuition and experience,” said Jeff Hardison, vice president of Lytics, a platform for collecting customer data.

“Our data are 80–90% accurate in predicting what will happen in the next 30 days. In three days we can predict what the dynamics of development will be over 30 days,” said Saif Adjani, the head of Keyhole, an analytics company that uses Google's TensorFlow machine learning tools.

“Machine learning does a great job with large sets of data and helps us solve problems such as classification. It also helps us identify common elements of content that is becoming popular. In general, we see the benefits of machine learning for processing large data in translation, image recognition, and spam protection,” said Steve Rayson, the director of BuzzSumo, a tool that lets users evaluate the popularity of content.

Are you impressed after reading these? We certainly have been inspired!

So let's come back to our online store project, for which we are collecting data for analysis.

**The online store project. Data preparation**

Let's continue our project for the online store, or rather, look at processing the data.

I have already imported the NumPy and Pandas libraries.

import numpy as np

import pandas as pd

We use the pd.read_csv function to load the data from the file ecommerce_data.csv.

df = pd.read_csv('ecommerce_data.csv')

If you want to see what is in the file, use the command

df.head()

It shows the first five rows of the file.

So, let's get out of the shell and start working on the processing file. If you do not want to write the code yourself but want to view it right away, go to GitHub; the corresponding file is called process.py. First of all, we load the NumPy and Pandas libraries.

import numpy as np

import pandas as pd

Next, we write the get_data function. First, it reads the data from the file, as we did earlier; second, it converts the data into a NumPy matrix, because that form is easier to work with.

def get_data():
    df = pd.read_csv('ecommerce_data.csv')
    data = df.to_numpy()  # df.as_matrix() was removed in recent versions of pandas

Next, we need to separate our X and Y. Y is the last column, so X is all the other columns except the last one.

X = data[:, :-1]

Y = data[:, -1]

Then, as we have said, it is necessary to normalize the data: take x₁ minus its mean, divided by the standard deviation of x₁. The same goes for x₂.

X[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()

X[:,2] = (X[:,2] - X[:,2].mean()) / X[:,2].std()
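The two normalization lines can also be written as a single slice operation. A minimal sketch with made-up numbers, since the real values come from ecommerce_data.csv:

```python
import numpy as np

# toy stand-in for the data matrix; the values here are invented,
# only the column positions (1 and 2 are numeric) match the tutorial
X = np.array([[1., 10., 200., 0.],
              [1., 20., 100., 2.],
              [1., 30., 300., 1.]])

# normalize columns 1 and 2 in one slice instead of one line per column
X[:, 1:3] = (X[:, 1:3] - X[:, 1:3].mean(axis=0)) / X[:, 1:3].std(axis=0)

print(X[:, 1:3].mean(axis=0))  # ~[0. 0.]
print(X[:, 1:3].std(axis=0))   # ~[1. 1.]
```

After the transformation each normalized column has mean 0 and standard deviation 1, which is exactly what the per-column version above produces.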

Now let's work on the categorical column, which is the time of day. To do this, take the shape of the original X and create a new matrix X2 of dimension N×(D+3) on its basis, since we have four categories: the single time-of-day column becomes four indicator columns.

N, D = X.shape

X2 = np.zeros((N, D+3))

X2[:,0:(D-1)] = X[:,0:(D-1)]

Now let's write the one-hot encoding for the remaining columns. First, we do it the simple way: for each observation, we read the time-of-day value, which, as you remember, takes the values 0, 1, 2, and 3, and set the corresponding indicator in X2.

for n in range(N):
    t = int(X[n,D-1])
    X2[n,t+D-1] = 1

There is another way: we can create a new N×4 matrix for the four columns and then fill it with direct (fancy) indexing.

Z = np.zeros((N, 4))

Z[np.arange(N), X[:,D-1].astype(np.int32)] = 1

In this case, you will need to assign Z to the last four columns; an assert can then confirm that both methods give the same result.

X2[:,-4:] = Z

assert(np.abs(X2[:,-4:] - Z).sum() < 10e-10)
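To see what the fancy-indexing assignment does, here is a tiny, hypothetical example with three samples whose category codes are 0, 2, and 3:

```python
import numpy as np

codes = np.array([0, 2, 3])   # made-up time-of-day codes for 3 samples
Z = np.zeros((3, 4))          # one column per category
Z[np.arange(3), codes] = 1    # row i gets a 1 in column codes[i]
print(Z)
```

Each row ends up with exactly one 1, in the column matching its category code, which is the one-hot encoding the loop builds element by element.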

And that is the end of the function.

return X2, Y

For our logistic regression classes, we need only binary data, not the complete set, so we write the get_binary_data function, which calls get_data and then filters its result, selecting only the classes 0 and 1.

def get_binary_data():
    X, Y = get_data()
    X2 = X[Y <= 1]
    Y2 = Y[Y <= 1]
    return X2, Y2

That is all.
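Putting the fragments above together, process.py looks roughly like this. This is a sketch assembled from the pieces in this section; the actual file on GitHub may differ in small details:

```python
import numpy as np
import pandas as pd

def get_data():
    df = pd.read_csv('ecommerce_data.csv')
    data = df.to_numpy()

    # split off the targets: Y is the last column, X is everything else
    X = data[:, :-1]
    Y = data[:, -1]

    # normalize the two numeric columns
    X[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()
    X[:, 2] = (X[:, 2] - X[:, 2].mean()) / X[:, 2].std()

    # one-hot encode the time-of-day column (4 categories)
    N, D = X.shape
    X2 = np.zeros((N, D + 3))
    X2[:, 0:(D - 1)] = X[:, 0:(D - 1)]
    for n in range(N):
        t = int(X[n, D - 1])
        X2[n, t + D - 1] = 1
    return X2, Y

def get_binary_data():
    # keep only the samples belonging to classes 0 and 1
    X, Y = get_data()
    X2 = X[Y <= 1]
    Y2 = Y[Y <= 1]
    return X2, Y2
```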

**The online shop project. Creating predictions**

So, first of all, we need to load our data. Run the already written function and assign the values to our X and Y.

import numpy as np

from process import get_binary_data

X, Y = get_binary_data()

After that, we can set the dimension and random weights for our model, with the bias term set to zero.

D = X.shape[1]

W = np.random.randn(D)

b = 0

We need to write a few more functions. First, a function for computing the sigmoid, and then a forward function that returns the sigmoid of the expression WX + b.

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(X, W, b):
    return sigmoid(X.dot(W) + b)
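A quick way to sanity-check these two helpers before trusting the predictions: with all-zero inputs and a zero bias, the model has no information, so every output should sit at exactly sigmoid(0) = 0.5.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(X, W, b):
    return sigmoid(X.dot(W) + b)

# zero inputs and zero bias: X.dot(W) + b is 0 for every sample,
# so the sigmoid returns 0.5 regardless of the random weights
X = np.zeros((2, 3))
W = np.random.randn(3)
print(forward(X, W, 0))  # [0.5 0.5]
```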

Then we create two variables, P_Y_given_X and predictions.

P_Y_given_X = forward(X, W, b)

predictions = np.round(P_Y_given_X)
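np.round turns each predicted probability into a class label: values above 0.5 become 1, values below become 0 (NumPy rounds exact halves to the nearest even value, so 0.5 itself becomes 0). A small illustration with made-up probabilities:

```python
import numpy as np

# hypothetical sigmoid outputs for three samples
p = np.array([0.1, 0.6, 0.9])
print(np.round(p))  # [0. 1. 1.]
```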

And there is one more function, classification_rate, which accepts targets and predictions as arguments and returns the proportion of correct answers.

def classification_rate(Y, P):

return np.mean(Y == P)
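For example, with hypothetical targets and predictions where 3 of the 4 labels match, the function returns 0.75:

```python
import numpy as np

def classification_rate(Y, P):
    # Y == P gives a boolean array; its mean is the fraction of matches
    return np.mean(Y == P)

Y = np.array([0, 1, 1, 0])  # made-up targets
P = np.array([0, 1, 0, 0])  # made-up predictions: 3 of 4 correct
print(classification_rate(Y, P))  # 0.75
```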

And, finally, print the result.

print("Score:", classification_rate(Y, predictions))

Run the program. As you can see, with randomly initialized weights we do not get a good result: only about 32% accuracy.

Next, we will examine how to set the weights to obtain a more accurate result.