Boosting conversion rates on ecommerce using deep learning algorithms
Armando Vieira (Armando@dataai.uk)
31 Oct 2014
Objective
Predict the probability that a user will buy a product from an online shop based
on past interactions within the shop website.
Approach
This problem will be analysed in two stages: first using off-the-shelf classification
algorithms, and second using a stacked auto-encoder to reduce the
dimensionality of the problem.
Data description
Data consists of one week of records of user interactions with an ecommerce site.
Events have a userId, a timestamp, an event type (5 categories: pageview,
basketview, buy, adclick and adview) and a productId (around 25 000 categories).
In the case of a buy or basketview we also have information on the price. We ignore
adview and adclick events.
Only about 1% of products (around 250) have a full category
identification. However, these correspond to about 85% of pageviews and 92%
of buys. In this section we only consider interactions with these products and
exclude the others.
The data is about 10 GB and cannot be loaded into my laptop memory, so
we first took a subsample of the first 100 000 events just to have a snapshot of
the interactions. We found:
78 360 pageview events (~78.4% of total events) from 13 342 unique
users.
16 409 basketview events (~16.4%) from 3 091 unique users.
2 430 sales events (~2.5%) from 2 014 unique users (around 1.2 sales per
user).
If we restrict to the 257 labeled product categories, we find 39 561 pageviews
from 7 469 distinct users, which is about half of the population.
We found an average of 6 interactions per user, but the distribution is very
skewed, following a power law (see next figure): most users make a
single interaction while very few engage in a very large number of interactions.
In terms of interactions with products, we also found that a few products receive
a very large number of interactions (pageviews) while others receive just a few, see next
figure:
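To make this exploration concrete, the following is a minimal sketch of the subsampling and counting step in Python with pandas. The file name and column layout are assumptions; only the column meanings (userId, timestamp, event type, productId, price) come from the data description above.

    import pandas as pd

    # assumed CSV layout; only the column meanings come from the data description
    cols = ["userId", "timestamp", "event", "productId", "price"]
    events = pd.read_csv("events_week1.csv", names=cols, nrows=100_000)

    # drop ad events, as described above
    events = events[~events["event"].isin(["adview", "adclick"])]

    # events and unique users per event type (pageview, basketview, buy)
    summary = events.groupby("event").agg(n_events=("userId", "size"),
                                          n_users=("userId", "nunique"))
    print(summary)

    # interactions per user: the distribution is heavily skewed (power-law-like)
    interactions_per_user = events.groupby("userId").size()
    print(interactions_per_user.describe())

    # interactions per product: a few products concentrate most pageviews
    pageviews_per_product = (events[events["event"] == "pageview"]
                             .groupby("productId").size())
    print(pageviews_per_product.sort_values(ascending=False).head(10))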
Data for training the classifiers
To build the data set we restrict, for the moment, to the set of 257 product
categories (which account for half of the pageviews); we will deal with all
categories later (see last section). Data was aggregated per
product category at the week level and at the semi-week level (two time buckets). In this first iteration we
do not add basketview events, as most of them happen in the same
session/day as the sale events and the objective is to predict sales with at least one
day of delay. We will consider them in the next iteration.
All data sets were balanced: the same number of sales events and non-sales
events. Due to the large size of the data, we essentially study the importance of
sample size. We excluded pageview events from the same day as, or the day before, the
sale event; a sketch of this construction is given after the table below.
The next table describes the various tests done with the 5 datasets considered:

Data set   Size     Comments
Data 1     3 000    Only page views; 257 categories; weekly aggregation
Data 2     10 000   Same as Data 1 but more data
Data 3     30 000   Same as Data 1 but more data
Data 4     10 000   Same as Data 2 but semi-week aggregation
Data 5     3 000    Same as Data 1 but including top 2000 categories
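Continuing the pandas sketch above, a minimal illustration of how such a training set could be assembled follows. It is simplified: it ignores the exclusion of pageviews on the day of, or the day before, the sale, and the semi-week bucketing.

    # one row per user, one column per product category, counting weekly pageviews
    pageviews = events[events["event"] == "pageview"]
    features = (pageviews.groupby(["userId", "productId"]).size()
                         .unstack(fill_value=0))

    # label: 1 if the user produced at least one buy event during the week
    buyers = set(events.loc[events["event"] == "buy", "userId"])
    labels = features.index.to_series().isin(buyers).astype(int)

    # balance the classes: keep all buyers, sample an equal number of non-buyers
    pos = features[labels == 1]
    neg = features[labels == 0].sample(n=len(pos), random_state=0)
    X = pd.concat([pos, neg])
    y = labels.loc[X.index]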
Feature selection with Non-Negative Matrix Factorization (NMF)
In order to test the impact of not including all product categories, we considered
a new data set (Data 5) containing the top 2000 most visited product categories.
Since this is a very high-dimensional search space, we applied Non-Negative Matrix
Factorization (NMF) to reduce dimensionality.
Non-negative Matrix Factorization (NMF) is a class of unsupervised
learning algorithms, like Principal Components Analysis (PCA) or learning
vector quantization (LVQ), that factorizes a data matrix subject to constraints.
Although PCA is a widely used algorithm, it has some drawbacks, like its linearity
and the poor interpretability of its factors; furthermore, it only enforces a weak orthogonality
constraint. LVQ uses a winner-take-all constraint that clusters the
data into mutually exclusive prototypes, but it performs poorly on high-dimensional
correlated data.
Non-negativity is a more robust constraint for matrix factorization [5].
Given a non-negative matrix V (containing the training data), NMF finds non-negative
matrix factors W and H such that V ≅ WH.
Each data vector in V (data entry) can be approximated by a linear
combination of the columns of W, weighted by the pattern matrix H. Therefore,
W can be regarded as containing a basis that is optimized for the linear
approximation of the data in V. Since relatively few basis vectors are used to
represent many data vectors, a good approximation can only be achieved if the
basis vectors discover the structure that is latent in the data.
NMF has been successfully applied to high-dimensional problems with sparse
data, like image recognition and text analysis. In our case we used NMF to
compress the data into a 10-feature subset. The major issue with NMF is the lack of
an optimal method to compute the factor matrices and of a stopping criterion to find
the ideal number of features to be selected.
In our case we applied NMF to reduce the dimensionality of the search
space to 100, 200 and 300 features (Data 5 – 100, 200 and 300).
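As an illustration, the reduction to 100 features could be done with scikit-learn's NMF implementation; the matrix below is a random placeholder standing in for the non-negative users × 2000-category pageview-count matrix of Data 5.

    import numpy as np
    from sklearn.decomposition import NMF

    # placeholder for the non-negative users x 2000 pageview-count matrix (Data 5)
    X_wide = np.random.poisson(0.2, size=(3000, 2000)).astype(float)

    nmf = NMF(n_components=100, init="nndsvd", max_iter=500, random_state=0)
    X_reduced = nmf.fit_transform(X_wide)   # users x 100 latent features
    basis = nmf.components_                 # 100 x 2000 factor matrix
    print(X_reduced.shape, basis.shape)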
Running the Classifiers
Based on these data sets, we test the performance of two classifiers: Logistic
Regression and Random Forest. The first is an industry standard and serves as a
baseline; the second is more robust and in general produces better results, with
the disadvantage that its predictions are not easy to interpret (a black box).
We used the algorithms without any optimization of the parameters (number of
trees, number of variables to consider in each split, split level, etc.).
As a KPI to measure performance we use the standard Area Under the ROC
Curve (AUC): an AUC of 0.5 corresponds to a random (useless) classifier and 1 to a perfect
one. For all runs we used 10-fold cross-validation (a sketch of this evaluation protocol
is given after the table). The results are presented in the next table:
Data set       Logistic Regression   Random Forest
Data 1         0.67                  0.71
Data 2         0.69                  0.76
Data 3         0.70                  0.80
Data 4         0.68                  0.82
Data 5 - 100   0.62                  0.67
Data 5 - 200   0.64                  0.69
Data 5 - 300   0.64                  0.72
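The sketch below illustrates the evaluation protocol with scikit-learn defaults, assuming the X, y matrices built earlier; it is a minimal illustration of 10-fold cross-validated AUC, not the exact code used for the table.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    models = {"Logistic Regression": LogisticRegression(max_iter=1000),
              "Random Forest": RandomForestClassifier(random_state=0)}

    for name, model in models.items():
        # AUC averaged over 10 folds, as reported in the table above
        scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
        print(f"{name}: AUC = {scores.mean():.2f} (+/- {scores.std():.2f})")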
We conclude that sample size is an important factor in the performance of
the classifiers, though Logistic Regression does not show the same gains as the
Random Forest (RF) algorithm. Clearly RF has a much better performance than
logistic regression.
From data set 4 we also conclude that the timing of events is an important
factor to take into account: although we increase the dimensionality of the
search space, we still have a net gain even while using fewer training examples.
From applying the algorithms to data set 5, we conclude that the NMF
algorithm does compress the data, but not in a very efficient way
(only the version with 300 features improved accuracy over the initial
subset of products). In the next section we suggest using auto-encoders to reduce
the dimensionality of the data for all 25 000 categories.
The polarity of the variables is presented in appendix 1. The most important
variables are the ones corresponding to products with the highest purchase
rate, which makes sense, as they correspond to the categories where most
buys are made.
Table 1: Confusion matrix for dataset 1 with classifier .

       1      0
1      0.89   0.11
0      0.07   0.93
Confusion Matrix, ROC curves, variable importance and polarity: To Be Delivered
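As an illustration of how the confusion matrix and ROC curve could be produced on a held-out split, a minimal sketch follows; the choice of Random Forest here is an assumption, since the report does not state which classifier Table 1 refers to.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

    # row-normalized confusion matrix, as in Table 1
    cm = confusion_matrix(y_te, clf.predict(X_te), normalize="true")
    # ROC curve and AUC from the predicted probabilities
    proba = clf.predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, proba)
    print(cm, roc_auc_score(y_te, proba))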
Work to be performed
Stacked auto-encoders
Auto-encoders are unsupervised feature-learning and classification neural
networks that belong to the category of so-called deep learning
neural networks. They are especially suited for hard problems involving very high-dimensional
data where we have a large number of training examples but most of
them are unlabeled, as in text analysis or bioinformatics.
In its simplest form, an auto-encoder can be seen as a special neural
network with three layers: the input layer, the latent (hidden) layer, and the
reconstruction layer (as shown in Figure 1 below). An auto-encoder contains two
parts: (1) the encoder maps an input x0 to the latent representation
(feature) x1 via a deterministic mapping fe:

x1 = fe(x0) = se(W1^T x0 + b1)
Figure 1: schematic representation of an auto-encoder. The blue points correspond to raw data and
the red ones to labeled data used for fine-tuning supervision.
where se is the activation function of the encoder, whose input is called the activation
of the latent layer, and {W1, b1} is the parameter set with a weight matrix W1
and a bias vector b1. (2) The decoder maps the latent representation x1 back to a
reconstruction x2 via another mapping function fd:

x2 = fd(x1) = sd(W2^T x1 + b2)
The input of sd is called the activation of the reconstruction layer. The parameters are
learned through back-propagation by minimizing the loss function L(x0, x2):

L(x0, x2) = Lr(x0, x2) + 0.5 λ (||W1||_2^2 + ||W2||_2^2)

which consists of the reconstruction error Lr(x0, x2) and the L2 regularization of W1
and W2. By minimizing the reconstruction error, we require that the latent features
be able to reconstruct the original input as much as possible. In this way, the latent
features preserve regularities of the original data. The squared Euclidean distance is
often used for Lr(x0, x2); other loss functions, such as negative log-likelihood and
cross-entropy, are also used. The L2 regularization term is a weight decay which is
added to the objective function to penalize large weights and reduce over-fitting. The
term λ is the weight-decay cost, which is usually a small number.
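A minimal sketch of such an auto-encoder in Keras follows; the library choice, layer sizes and hyper-parameters are assumptions, and the loss is mean squared error plus L2 weight decay, as described above.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    n_inputs = 257                       # e.g. one input per labeled product category
    x0 = keras.Input(shape=(n_inputs,))
    # encoder: x1 = se(W1^T x0 + b1), with L2 weight decay on W1
    x1 = layers.Dense(100, activation="sigmoid",
                      kernel_regularizer=regularizers.l2(1e-4))(x0)
    # decoder: x2 = sd(W2^T x1 + b2), with L2 weight decay on W2
    x2 = layers.Dense(n_inputs, activation="sigmoid",
                      kernel_regularizer=regularizers.l2(1e-4))(x1)

    autoencoder = keras.Model(x0, x2)
    autoencoder.compile(optimizer="adam", loss="mse")   # squared reconstruction error

    X_unlabelled = np.random.rand(1000, n_inputs)       # placeholder interaction data
    autoencoder.fit(X_unlabelled, X_unlabelled, epochs=10, batch_size=32, verbose=0)

    # the trained encoder maps raw interactions to the latent features x1
    encoder = keras.Model(x0, x1)
    latent = encoder.predict(X_unlabelled)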
The stacked auto-encoder (SAE) is a neural network with multiple layers of auto-encoders.
It has been widely used as a deep learning method for dimensionality
reduction and feature learning.
Figure 2: schematic representation of a stacked auto-encoder.
As illustrated in Figure 2, there are h auto-encoders which are trained in a bottom-up
and layer-wise manner. The input vectors (blue color in the figure) are fed to the
bottom auto-encoder. After finishing training the bottom auto-encoder, the output
latent representations are propagated to the higher layer. The sigmoid function or tanh
function is typically used for the activation functions of se and sd.
The same procedure is repeated until all the auto-encoders are trained. After
such a pre-training stage, the whole neural network is fine-tuned based on a pre-
defined objective. The latent layer of the top auto-encoder is the output of the stacked
auto-encoders, which can be further fed into other applications, such as SVM for
classification. The unsupervised pre-training can automatically exploit large amounts
of unlabeled data to obtain a good weight initialization for the neural network than
traditional random initialization.
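A sketch of this greedy layer-wise pre-training followed by supervised fine-tuning, again in Keras, is given below; layer sizes and hyper-parameters are illustrative (see the Results section for the architectures actually tested), and the data are random placeholders.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    def pretrain_layer(data, n_hidden, epochs=10):
        # train one auto-encoder on `data`; return its encoder layer and latent output
        inp = keras.Input(shape=(data.shape[1],))
        enc = layers.Dense(n_hidden, activation="sigmoid")
        hidden = enc(inp)
        recon = layers.Dense(data.shape[1], activation="sigmoid")(hidden)
        ae = keras.Model(inp, recon)
        ae.compile(optimizer="adam", loss="mse")
        ae.fit(data, data, epochs=epochs, batch_size=32, verbose=0)
        return enc, keras.Model(inp, hidden).predict(data)

    X = np.random.rand(1000, 257)       # placeholder (mostly unlabeled) interactions
    y = np.random.randint(0, 2, 1000)   # placeholder buy / no-buy labels

    # bottom-up pre-training: each auto-encoder is trained on the previous latent output
    enc1, h1 = pretrain_layer(X, 100)
    enc2, _ = pretrain_layer(h1, 200)

    # stack the pre-trained encoders and fine-tune on the labelled examples
    inp = keras.Input(shape=(X.shape[1],))
    out = layers.Dense(1, activation="sigmoid")(enc2(enc1(inp)))
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    model.fit(X, y, epochs=10, batch_size=32, verbose=0)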
Stacked auto-encoders have been used in problems with very sparse, high-dimensional
data of up to 100 000 input variables and billions of rows. Contrary to
shallow learning machines, like support vector machines (SVM) and traditional neural
networks, these architectures can take advantage of very large quantities of data and
continuously improve performance as new training examples are added.
The only downside is the large computational effort needed to train
them (typically tens of hours or days on regular computers); in some cases of the
order of 100 million parameters have to be learned. This can be alleviated
by using GPUs or a cluster of machines (like the Amazon
cloud), which can reduce the training time to a couple of hours or minutes.
Results
We used two approaches: Stacked Auto-Encoders and Deep Belief Networks (DBN).
The DBN was tested with several architectures, with N inputs and M outputs (in this case M = 1).
Stopping criteria and learning rates: to be detailed.
Data set   Architecture    AUC
1          N-100-200-M     0.88
1          N-200-100-M     0.85
2          N-100-200-M     0.91