Boosting conversion rates on ecommerce using deep learning algorithms

Armando Vieira (Armando@dataai.uk), 31 Oct 2014

Objective

Predict the probability that a user will buy a product from an online shop, based on the user's past interactions within the shop website.

Approach

The problem is analysed in two stages: first with off-the-shelf classification algorithms, and then with a stacked auto-encoder to reduce the dimensionality of the problem.

Data description

The data consist of one week of records of user interactions with an ecommerce site. Each event has a userId, a timestamp, an event type (5 categories: pageview, basketview, buy, adclick and adview) and a productId (around 25,000 categories). For buy and basketview events we also have the price. We ignore adview and adclick events. Only about 1% of products (around 250) have a full category identification; however, these correspond to about 85% of pageviews and 92% of buys. In this section we only consider interactions with these products and exclude the others.

The data is about 10 GB and cannot be loaded into my laptop's memory, so we first took a subsample of the first 100,000 events to get a snapshot of the interactions. We found:

- 78,360 pageview events (~78.4% of total events) from 13,342 unique users.
- 16,409 basketview events (~16.4%) from 3,091 unique users.
- 2,430 sales events (~2.5%) from 2,014 unique users (around 1.2 sales per user).

If we restrict to the 257 labelled product categories, we find 39,561 pageviews from 7,469 distinct users, which is about half of the population. Users average about 6 interactions each, but the distribution is very skewed, following a power law (see next figure): most users have a single interaction, while a few have very many.
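The subsampling and per-event counts above can be reproduced with a few lines of pandas. The sketch below is illustrative only: the report does not give the log format, so the file name and the column names (userId, timestamp, eventType, productId) are assumptions.

```python
# Minimal sketch of the subsampling step described above.
# File name and column names are assumptions, not from the report.
import pandas as pd

# Take only the first 100,000 events, since the full 10 GB log does not fit in memory.
events = pd.read_csv("events.csv", nrows=100_000,
                     usecols=["userId", "timestamp", "eventType", "productId"])

# Drop the ad events, which the analysis ignores.
events = events[~events["eventType"].isin(["adview", "adclick"])]

# Event counts and unique users per event type (pageview, basketview, buy).
summary = events.groupby("eventType").agg(
    n_events=("userId", "size"),
    n_users=("userId", "nunique"),
)
print(summary)

# Interactions per user: heavily skewed, roughly power-law distributed.
interactions_per_user = events.groupby("userId").size()
print(interactions_per_user.describe())
```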
Looking at interactions with products, we also found that a few products receive a very large number of interactions (pageviews) while others receive only a few (see next figure).

Data for training the classifiers

To build the data set we restrict ourselves, for the moment, to the set of 257 product categories (which account for half of the pageviews); all categories will be handled later (see the last section). Data was aggregated per product category at the week level and at the semi-week level (two time buckets). In this first iteration we do not add basketview events, since most of them occur in the same session/day as the sale event and the objective is to predict sales with at least one day of delay; they will be considered in the next iteration. All data sets were balanced: the same number of sales and non-sales events. Given the large size of the data, we essentially study the importance of sample size. We excluded pageview events from the same day as, or the day before, the sale event. The next table describes the tests done with the five data sets considered:

Data set | Size   | Comments
Data 1   | 3,000  | Only pageviews; 257 categories; weekly aggregation
Data 2   | 10,000 | Same as Data 1 but more data
Data 3   | 30,000 | Same as Data 1 but more data
Data 4   | 10,000 | Same as Data 2 but semi-week aggregation
Data 5   | 3,000  | Same as Data 1 but including the top 2,000 categories
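As a rough illustration of this feature construction (not the author's actual code), the sketch below counts pageviews per user and product category in weekly or semi-week buckets and balances the classes. The column names, the numeric (Unix epoch) timestamp and the `buyers` set of userIds with a buy event are assumptions.

```python
# Minimal sketch: per-user, per-category pageview counts in time buckets,
# with a balanced sample of buyers and non-buyers.
import pandas as pd

def build_training_set(events: pd.DataFrame, buyers: set, n_buckets: int = 1) -> pd.DataFrame:
    """One row per user; one column per (category, time bucket) pageview count."""
    pageviews = events[events["eventType"] == "pageview"].copy()

    # Split the week into n_buckets time buckets (1 = weekly, 2 = semi-week).
    # timestamp is assumed to be numeric (Unix epoch seconds).
    pageviews["bucket"] = pd.cut(pageviews["timestamp"], bins=n_buckets, labels=False)

    counts = (pageviews
              .groupby(["userId", "category", "bucket"])
              .size()
              .unstack(["category", "bucket"], fill_value=0))
    # Flatten the (category, bucket) column index into plain column names.
    counts.columns = [f"{cat}_b{b}" for cat, b in counts.columns]

    # Target: 1 if the user made a buy, 0 otherwise.
    counts["target"] = counts.index.isin(buyers).astype(int)

    # Balance the classes: as many non-sales rows as sales rows
    # (assumes there are more non-buyers than buyers).
    pos = counts[counts["target"] == 1]
    neg = counts[counts["target"] == 0].sample(len(pos), random_state=0)
    return pd.concat([pos, neg]).sample(frac=1, random_state=0)
```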
Feature selection with Non-Negative Matrix Factorization (NMF)

To test the impact of not including all product categories, we built a new data set (Data 5) containing the top 2,000 most-visited product categories. Since this is a very high-dimensional search space, we applied Non-Negative Matrix Factorization (NMF) to reduce the dimensionality.

NMF is a class of unsupervised learning algorithms, like Principal Component Analysis (PCA) or learning vector quantization (LVQ), that factorizes a data matrix subject to constraints. Although PCA is widely used, it has some drawbacks, such as its linearity and poor performance on factors, and it enforces only a weak orthogonality constraint. LVQ uses a winner-take-all constraint that clusters the data into mutually exclusive prototypes, but it performs poorly on high-dimensional correlated data. Non-negativity is a more robust constraint for matrix factorization [5]. Given a non-negative matrix V (containing the training data), NMF finds non-negative matrix factors W and H such that:

V ≈ WH

Each data vector in V can be approximated by a linear combination of the columns of W, weighted by the pattern matrix H. W can therefore be regarded as a basis optimized for the linear approximation of the data in V. Since relatively few basis vectors are used to represent many data vectors, a good approximation can only be achieved if the basis vectors discover the structure that is latent in the data. NMF has been applied successfully to high-dimensional problems with sparse data, such as image recognition and text analysis; in our case we used it to compress the data into a reduced feature set. The main issues with NMF are the lack of an optimal method to compute the factor matrices and the lack of a stopping criterion for choosing the ideal number of features. Here we applied NMF to reduce the dimensionality of the search space to 100, 200 and 300 features (Data 5 - 100, 200 and 300).

Running the classifiers

On these data sets we tested the performance of two classifiers: Logistic Regression and Random Forest. The first is an industry standard and serves as a baseline; the second is more robust and in general produces better results, with the disadvantage that its predictions are not easy to interpret (a black box). We used both algorithms without any parameter optimization (number of trees, number of variables considered at each split, split level, etc.). As a KPI we use the standard Area Under the ROC Curve (AUC): AUC = 0.5 corresponds to a random (useless) classifier and AUC = 1 to a perfect one. All runs used 10-fold cross validation. The results are presented in the next table:

Data set     | Logistic | Random Forest
Data 1       | 0.67     | 0.71
Data 2       | 0.69     | 0.76
Data 3       | 0.70     | 0.80
Data 4       | 0.68     | 0.82
Data 5 - 100 | 0.62     | 0.67
Data 5 - 200 | 0.64     | 0.69
Data 5 - 300 | 0.64     | 0.72
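A minimal sketch of the two steps just described, assuming scikit-learn and a non-negative user-by-category count matrix V with buy/no-buy labels y (names are illustrative): NMF compresses V to k features, then Logistic Regression and Random Forest are scored with 10-fold cross-validated AUC.

```python
# Minimal sketch: NMF compression followed by the two baseline classifiers.
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate(V, y, k=100):
    # V ~= W H: keep W (one row per user) as the compressed k-feature representation.
    # V must be non-negative, which holds for pageview counts.
    W = NMF(n_components=k, max_iter=500).fit_transform(V)

    # Both classifiers are used with default parameters, as in the report.
    for name, clf in [("Logistic", LogisticRegression(max_iter=1000)),
                      ("Random Forest", RandomForestClassifier())]:
        auc = cross_val_score(clf, W, y, cv=10, scoring="roc_auc").mean()
        print(f"{name}: AUC = {auc:.2f}")
```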
We conclude that sample size is an important factor in classifier performance, although Logistic Regression does not benefit from more data as much as the Random Forest (RF) does; RF clearly performs much better than Logistic Regression. From Data 4 we also conclude that the timing of events is an important factor: even though the semi-week aggregation increases the dimensionality of the search space, it still gives a net gain while using fewer training examples. From Data 5 we conclude that NMF does compress the data, but not very efficiently: only the 300-feature version improved accuracy over the initial subset of products. In the next section we propose using auto-encoders to reduce the dimensionality of the data across all 25,000 categories.

The polarity of the variables is presented in Appendix 1. The most important variables correspond to the products with the highest purchase rates, which makes sense, as these are the categories where most buys are made.

Table 1: Confusion matrix for Data 1.

   | 1    | 0
1  | 0.89 | 0.11
0  | 0.07 | 0.93

Confusion matrices, ROC curves, variable importance and polarity: to be delivered.
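The deliverables listed above (confusion matrix, variable importance and polarity) could be produced along the following lines, reusing the compressed matrix W and labels y from the previous sketch; the hold-out split and its size are assumptions, not part of the report.

```python
# Minimal sketch: row-normalised confusion matrix and feature importances
# from a Random Forest fitted on a held-out split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(W, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier().fit(X_train, y_train)

# Row-normalised confusion matrix, comparable to Table 1.
print(confusion_matrix(y_test, rf.predict(X_test), normalize="true"))

# Feature importances: the categories that drive the predictions.
print(sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])[:10])
```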
Work to be performed

Stacked auto-encoders

Auto-encoders are unsupervised feature-learning neural networks that belong to the family now called deep learning. They are especially suited to hard problems with very high-dimensional data where there are many training examples but most of them are unlabelled, as in text analysis or bioinformatics. In its simplest form, an auto-encoder is a neural network with three layers: the input layer, the latent (hidden) layer, and the reconstruction layer (see Figure 1). An auto-encoder has two parts. The encoder maps an input x0 to the latent representation (feature) x1 via a deterministic mapping fe:

x1 = fe(x0) = se(W1^T x0 + b1)

Figure 1: schematic representation of an auto-encoder. The blue points correspond to raw data and the red points to labelled data used for supervised fine-tuning.

where se is the activation function of the encoder, whose input is called the activation of the latent layer, and {W1, b1} is the parameter set, with weight matrix W1 and bias vector b1. The decoder maps the latent representation x1 back to a reconstruction x2 via another mapping function fd:

x2 = fd(x1) = sd(W2^T x1 + b2)

The input of sd is called the activation of the reconstruction layer. The parameters are learned through back-propagation by minimizing the loss function L(x0, x2):

L(x0, x2) = Lr(x0, x2) + (λ/2) (||W1||_2^2 + ||W2||_2^2)

which consists of the reconstruction error Lr(x0, x2) and the L2 regularization of W1 and W2. By minimizing the reconstruction error, we require the latent features to reconstruct the original input as closely as possible; in this way, the latent features preserve regularities of the original data. The squared Euclidean distance is often used for Lr(x0, x2); other loss functions, such as negative log-likelihood and cross-entropy, are also used. The L2 regularization term is a weight decay added to the objective function to penalize large weights and reduce over-fitting; the weight-decay coefficient λ is usually a small number.
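The equations above translate directly into code. The sketch below uses PyTorch (the report does not name a framework) with sigmoid activations and the squared-error-plus-weight-decay loss; the weight-decay value is an illustrative assumption.

```python
# Minimal sketch of a single auto-encoder: encoder fe, decoder fd,
# squared reconstruction error plus L2 weight decay.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_inputs: int, n_latent: int):
        super().__init__()
        self.encoder = nn.Linear(n_inputs, n_latent)   # W1, b1
        self.decoder = nn.Linear(n_latent, n_inputs)   # W2, b2
        self.act = nn.Sigmoid()                        # se = sd = sigmoid

    def forward(self, x0):
        x1 = self.act(self.encoder(x0))   # latent representation
        x2 = self.act(self.decoder(x1))   # reconstruction
        return x1, x2

def loss_fn(x0, x2, model, weight_decay=1e-4):
    # L(x0, x2) = Lr(x0, x2) + (lambda/2) * (||W1||^2 + ||W2||^2)
    reconstruction = ((x0 - x2) ** 2).sum(dim=1).mean()
    l2 = model.encoder.weight.pow(2).sum() + model.decoder.weight.pow(2).sum()
    return reconstruction + 0.5 * weight_decay * l2
```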
A stacked auto-encoder (SAE) is a neural network built from multiple layers of auto-encoders. It has been widely used as a deep learning method for dimensionality reduction and feature learning.

Figure 2: schematic representation of a stacked auto-encoder.

As illustrated in Figure 2, there are h auto-encoders, which are trained bottom-up and layer-wise. The input vectors (blue in the figure) are fed to the bottom auto-encoder. Once the bottom auto-encoder has been trained, its latent representations are propagated to the next layer up; the sigmoid or tanh function is typically used for the activations se and sd. The same procedure is repeated until all the auto-encoders are trained. After this pre-training stage, the whole network is fine-tuned with respect to a pre-defined objective. The latent layer of the top auto-encoder is the output of the stacked auto-encoder, and it can be fed into other applications, such as an SVM for classification. The unsupervised pre-training automatically exploits large amounts of unlabelled data to obtain a better weight initialization for the network than traditional random initialization.

Stacked auto-encoders have been used on problems with very sparse, high-dimensional data with up to 100,000 input variables and billions of rows. Contrary to shallow learning machines, like support vector machines (SVM) and traditional neural networks, these architectures can take advantage of very large quantities of data and keep improving as new training examples are added. Their only downside is the large computational effort needed to train them (typically tens of hours or days on regular computers); in some cases there are on the order of 100 million parameters to learn. This can be alleviated by distributing the computation over the CPUs of a cluster of machines (for example on the Amazon cloud), which can reduce the training time to a couple of hours or minutes.
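A sketch of the greedy layer-wise pre-training just described, reusing the AutoEncoder and loss_fn from the previous sketch; the layer sizes, epoch count and optimizer settings are illustrative assumptions, not values from the report.

```python
# Minimal sketch of bottom-up, layer-wise pre-training of a stack of auto-encoders.
import torch

def pretrain_stack(data, layer_sizes, epochs=10, lr=1e-3):
    """Train auto-encoders bottom-up; return them and the top-level latent codes.

    data: float tensor of shape (n_rows, layer_sizes[0]).
    layer_sizes: e.g. [n_inputs, 100, 200] for an N-100-200 stack.
    """
    encoders, x = [], data
    for n_in, n_latent in zip(layer_sizes[:-1], layer_sizes[1:]):
        ae = AutoEncoder(n_in, n_latent)
        opt = torch.optim.SGD(ae.parameters(), lr=lr)
        for _ in range(epochs):
            x1, x2 = ae(x)
            loss = loss_fn(x, x2, ae)
            opt.zero_grad()
            loss.backward()
            opt.step()
        encoders.append(ae)
        # Propagate the latent representation up to the next auto-encoder.
        x = ae(x)[0].detach()
    return encoders, x   # x: output of the top latent layer, e.g. input to an SVM
```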
Results

We used two approaches: stacked auto-encoders and Deep Belief Networks (DBN). The DBNs were tested with several architectures, with N inputs and M outputs (in this case M = 1); the stopping criteria and learning rates are still to be specified.

Data set | Architecture | AUC
1        | N-100-200-M  | 0.88
1        | N-200-100-M  | 0.85
2        | N-100-200-M  | 0.91
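The report does not detail how these networks were trained. As a rough analogue built on the stacked auto-encoder sketched earlier (rather than a DBN), a pre-trained N-100-200-M network with M = 1 could be fine-tuned on the buy/no-buy labels and scored with AUC as below; the supervised head, loss and training loop are assumptions.

```python
# Minimal sketch: supervised fine-tuning of the pre-trained stack and AUC scoring.
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

def fine_tune(encoders, X, y, epochs=20, lr=1e-3):
    # Stack the pre-trained encoder layers and add a single output unit (M = 1).
    layers = []
    for ae in encoders:
        layers += [ae.encoder, nn.Sigmoid()]
    layers.append(nn.Linear(encoders[-1].encoder.out_features, 1))
    model = nn.Sequential(*layers)

    opt = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()      # y: float tensor of 0/1 labels
    for _ in range(epochs):
        logits = model(X).squeeze(1)
        loss = criterion(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        scores = torch.sigmoid(model(X)).squeeze(1).numpy()
    return model, roc_auc_score(y.numpy(), scores)
```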
