Slides for three presentations at Coolblue's Behind the Scenes Data Science event on 2018-03-22
Speakers:
- Andres Martinez (Data Science @ Coolblue)
- Matthias Schuurmans (forecasts)
- Daan Marechal (recommendations)
Behind The Scenes Data Science Coolblue 2018-03-22
1. Andres Martinez | Manager Data Science | a.martinez@coolblue.nl | 22-03-2018
2. Agenda
● What Data Scientist means at Coolblue.
● Delivering data science solutions in an agile, data-driven company.
● Organization.
Data Science at Coolblue
3. Descriptive:
What is happening now based on incoming data.
Diagnostic:
What happened and why.
Predictive:
An analysis of likely scenarios of what might happen. The deliverables
are usually a predictive forecast.
Prescriptive:
This type of analysis reveals what actions should be taken.
Analytics outputs
4. ● … should contain the underlying dynamics of the process we want to predict.
● … is accurate enough to create scenarios and anticipate actions.
Where we focus our efforts
A good predictive model...
5. ● … should contain the underlying dynamics of the process we want to predict.
Drivers impact
A good predictive model...
Diagnosis
Prediction
8. ● … is accurate enough to create scenarios and anticipate actions.
Future scenarios and actions
A good predictive model...
today
Prediction
Prescription
n-people required
9. Our definition of Data Scientist
It is about implementation:
● Statistical analysis, model estimation, ...
● Industrialization: managing models' lifecycle at scale
10. Power is nothing without control
We take care...
● A model is only valid for a bounded period.
● Continuous monitoring and
adjustment.
22. The strength is in the team
Boosting performance!
● Appropriate tasks and responsibilities.
● A single individual is not enough: team
really matters!
● Knowledge sharing.
● There is not a single recipe.
23. The three components:
Build technical solutions
when there is value in it!
Flexible & agile organization
Data Science across
Coolblue through close
cooperation
Validation Implementation
Production
Exploration and validation: work in
domains/knowledge centers in close
cooperation with Business Analysis.
Collaboration
24. Problem understanding
Hypothesis creation
Data gathering
Feature engineering
Model selection and/or estimation
Model evaluation
Generalization
Implementation in production
Full stack DS vs. Pioneers
Core team
Data
Scientist
Satellite
Data
Scientist
25. Domain-a
Head of Tech
Manager
Data Science
Data Science Satellite 1
Team Lead Tech
Scrum team
Team Lead Tech
Scrum team
Data Science Satellite 2
Data Science Satellite m
Organization
Domain-b
Domain-a
Domain-b
Domain-c
Tech principles & scrum methodology Research & PoC
26. Andres Martinez | Manager Data Science | a.martinez@coolblue.nl | 22-03-2018
36. Shipments forecast
● Just enough people in the warehouses
● 3 warehouses: Parcel, XL and Whitegoods
● 3 horizons: 7 days, 14 days and 364 days
● Nice data
Context
47. Shipments forecast evaluation
● Cross Validation and KPIs
○ Percentage below 10% error
○ Root Mean Squared Error
○ Mean Absolute Percentage Error
● Extra attention for special cases
○ Christmas
● Interpretability / transparency
○ Effects of features
● Stability
Good forecast?
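The three KPIs listed above are straightforward to compute. A minimal sketch, where the function name and the example numbers are illustrative rather than Coolblue's actual implementation:

```python
import numpy as np

def forecast_kpis(actual, predicted):
    """Compute the three evaluation KPIs for one forecast series."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ape = np.abs((actual - predicted) / actual)  # absolute percentage error per day
    return {
        "pct_below_10pct_error": np.mean(ape < 0.10),         # share of days within 10% error
        "rmse": np.sqrt(np.mean((actual - predicted) ** 2)),  # Root Mean Squared Error
        "mape": np.mean(ape),                                 # Mean Absolute Percentage Error
    }

# Example: a week of shipment counts vs. a forecast (made-up numbers)
kpis = forecast_kpis([100, 120, 90, 110, 105, 95, 130],
                     [102, 115, 95, 108, 100, 98, 140])
```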
63. Portfolio Demand forecast
Features Models
Trend Regularized Regression
Seasonality Neural Networks
Holidays Support Vector Regressors
Events Weighted Average
Lag targets MARS
Polynomials Decision Trees
Dummified Feature subsets
64. Good forecast?
Dealt with automatically:
○ Cross Validation and KPIs
○ Stability
Ability to investigate manually:
○ Extra attention for special cases
○ Interpretability / transparency
66. Forecasting
● Forecasting is very important for planning
● Pick the best model
○ Smart feature engineering
○ Relevant models and parameters
○ Grid search and decide based on error metrics, stability, transparency
● Calculate using best model every day
● Use cloud when appropriate
● Automate and monitor everything!
Summary
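The "pick a best model" step above can be sketched as a small holdout comparison: fit each candidate on a training window, score on the held-out tail, keep the lowest error. The two candidates here are hypothetical stand-ins for the real grid (regularized regression, neural networks, MARS, ...):

```python
import numpy as np

# Hypothetical candidates: each maps a name to a function that takes a
# training series and a horizon and returns a forecast of that length.
candidates = {
    "naive_last": lambda train, h: np.repeat(train[-1], h),
    # Weighted average of the last 7 points (assumes len(train) >= 7)
    "weighted_avg": lambda train, h: np.repeat(
        np.average(train[-7:], weights=range(1, 8)), h),
}

def pick_best_model(series, horizon):
    """Hold out the last `horizon` points and score each candidate by RMSE."""
    train, holdout = series[:-horizon], series[-horizon:]
    scores = {name: np.sqrt(np.mean((fit(train, horizon) - holdout) ** 2))
              for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores

series = np.array([100, 105, 98, 110, 112, 108, 115, 118, 116, 120], dtype=float)
best, scores = pick_best_model(series, horizon=3)
```

In the real pipeline the same selection would also weigh stability and transparency, not just the error metric.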
69. Recommender Systems are software tools and techniques providing suggestions
for items to be of use to a user. The suggestions provided are aimed at supporting
their users in various decision-making processes.
Increase satisfaction and
boost sales
Pretty well known…
Recommender systems
78. What should we use!?
Nearest Neighbors
Decision Trees
Rule-based Classifiers
Bayesian Classifiers
Artificial Neural Networks
Support Vector Machines
Ensembles of Classifiers
Most popular and fundamental techniques used
Collaborative filtering, content-based filtering, data mining methods and context-aware
methods.
K-means
Other alternatives to K-means
Association Rule Mining
Other ad-hoc methods
Classification
Cluster analysis
Others
79. These are the typical features...
So, what do we have here?
● Gender
● Region
● Specified interests
● Purchase history
● etc.
82. Talking to our customers!?
A product sequence is like a phrase
83. This helps in deciding the model
● Several thousands products to be recommended
● It does not seem to depend on gender, region, etc.
● Each customer views a very personal set of products
● Try to respond with a new personal set of products
Brief summary after some analysis:
84. Recurrent Neural Network
Nearest Neighbors
Decision Trees
Rule-based Classifiers
Bayesian Classifiers
Artificial Neural Networks
Support Vector Machines
Ensembles of Classifiers
Among the possibilities
K-means
Other alternatives to K-means
Association Rule Mining
Other ad-hoc methods
Classification
Cluster analysis
Others
85. Let’s see what the literature says
● Not many papers about RNN and
recommender systems
● All papers are very recent: 2016, 2017
● Results are very promising, but there are no figures from real tests yet (only offline experiments).
86. We set up the benchmark
Still, we believe it's worth a try!
We are in the research phase… we could try a quick PoC.
For more information:
Cole MacLean, Barbara Garza, and Suren Oganesian. A recurrent neural network based subreddit recommendation
system. 2017.
87. Evaluation
Mean Average Precision @ k
● Average Precision @ k looks at a ranked set of k recommended items
● Checks whether the relevant item is in the recommended set
● Relevant item in position 5 of the top 5: AP@5 = 0.20
● Relevant item in position 1 of the top 5: AP@5 = 1
● Mean Average Precision @ k is the mean of all AP@k’s
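With a single relevant item per sequence, as in this setup, AP@k reduces to the reciprocal of that item's rank within the top k. A minimal sketch (function names and example items are illustrative):

```python
def average_precision_at_k(recommended, relevant_item, k=5):
    """AP@k with one relevant item: 1/position if it is in the top k, else 0."""
    top_k = recommended[:k]
    if relevant_item in top_k:
        return 1.0 / (top_k.index(relevant_item) + 1)
    return 0.0

def map_at_k(recommendation_lists, relevant_items, k=5):
    """Mean of AP@k over all (recommendation list, relevant item) pairs."""
    return sum(average_precision_at_k(recs, rel, k)
               for recs, rel in zip(recommendation_lists, relevant_items)) / len(relevant_items)

# The two cases from the slide: relevant item in position 5, and in position 1
ap_last = average_precision_at_k(["a", "b", "c", "d", "e"], "e")   # AP@5 = 0.20
ap_first = average_precision_at_k(["e", "b", "c", "d", "a"], "e")  # AP@5 = 1.0
```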
89. Computationally expensive
● 4 CPUs Locally: ~120 hrs (estimated)
● 16 vCPUs in the cloud: ~36 hours
● GPU (NVIDIA Tesla k80) in the cloud: ~8 hrs
● GPU (NVIDIA Tesla P100) in the cloud: ~4.5 hrs
About the timing.
92. TensorFlow & Google Cloud Platform
Some notes about the training
● Python
● For the PoC we have used Jupyter notebooks
● TensorFlow for the Neural Network
● All models have been trained in GCP Compute
Engine
93. ● Start from the beginning: how are we going to measure success?
● Understand your data
● What is the current status? Literature? Benchmark
● Offline test and fine tuning: PoC
● A/B testing
● Future steps for industrializing it: Core Team
Wrap-up
Summary
Create your own Fact/slogan here: https://coolblueblauwdruk.nl/en/huisstijl/feit-slogan-generator
Run Azkaban UAT STS
BQ:
SELECT
creation_timestamp,
warehouse_id,
DATE(forecast.timestamp) AS forecast_date,
forecast.value
FROM
[coolblue-bi-platform-uat:forecasts.short_term_shipments_with_reallocations]
WHERE
creation_timestamp > timestamp('2018-03-22T16:00:00')
ORDER BY
forecast.timestamp,
warehouse_id
Quick show of Data Science landscape dashboard, click through to short term shipments
Quick show of performance and stability tabs in Shiny
Quick show of the GFM design file, run, show output
Run DF calculate for 1k products, prepare well! Show table before, cluster upping, CPU usage, cluster downing, table after
BQ:
SELECT
creation_datetime,
forecast_start_date,
product_id,
value,
model_queue_id
FROM
[coolblue-bi-platform-dev:demand_forecast.forecast]
WHERE
creation_datetime > DATETIME('2018-03-22T14:00:00')
ORDER BY
product_id
Check Shiny individual product 638470 while cluster is upping
Run DF optimize for 5 products, prepare well! Show table before, cluster upping, CPU usage, cluster downing, table after
BQ:
SELECT
mq.product_id,
mq.model_queue_id,
mq.insert_datetime,
m.model_description
FROM
[coolblue-bi-platform-dev:demand_forecast.model_queue] mq
INNER JOIN
[coolblue-bi-platform-dev:demand_forecast.model] m
ON
m.model_id = mq.model_id
WHERE
mq.insert_datetime > DATETIME('2018-03-22T14:00:00')
Satellite: meaning that I work in the domains and explore the available data and models that we could use, depending on the goal. I make proof of concepts and once we can show what the added value of a model is, the Core team is going to productionize the model and automate all the processes needed. Today I am going to show you a recent project that I did, which is about creating a recommender system.
Suggestions → not necessarily personal
Research → personalized leads to higher customer satisfaction/loyalty → boost in sales
Increasing datasets → hot topics
Recommender systems have been a hot topic for a few decades now, and because of the ever-increasing datasets it is a very interesting problem for data scientists to solve. At Coolblue we have lots of data, so it is exciting to use this data to improve the customer experience. Once we create a good recommender system, we can make a big impact by targeting each and every customer individually!
At Coolblue we have 45,000 different products, which makes it difficult for customers to find exactly what they are looking for. There is a great variety of products to choose from, but customers only have limited time available to browse through all the options. It is therefore very important for us to show customers relevant products as early as possible in their customer journey. We have to think about solutions that help our customers find interesting products. For this reason we investigated the possibility of improving the current logic behind our personal recommendations.
At Coolblue we already have personalized recommendations. These recommendations are made by looking at recent behavior and purchases of customers. The recommendations that we are going to generate need to be better than the current recommendations. But.. what is better? How can we measure the performance of personal recommendations?
We can measure this by A/B testing. A/B testing is a useful tool to measure the performance of two different variations of webpages or e-mails. We have set up an A/B test within our weekly newsletter, sent by email marketing. For us right now it is FASTER to obtain results when testing it in the email domain. For a proof of concept, implementing the obtained personal recommendations on the website would be too difficult and would take up too much time. In the email A/B test, we send half of the customers the current recommendations and the other half our new personal recommendations. The main metric that we are looking at is the product click through rate. This is essentially the share of customers that click on a personal recommendation. We seek to increase this metric, meaning that the engagement / interaction with the e-mail will be higher. When this metric increases it also means that more customers will land on our website and therefore are more likely to buy a product.
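The CTR comparison described here is a standard two-proportion test. A sketch with made-up counts (not the real campaign numbers), using a normal-approximation z-test:

```python
from math import sqrt, erf

def ctr_z_test(clicks_a, sends_a, clicks_b, sends_b):
    """Two-proportion z-test on click-through rates (normal approximation)."""
    p_a, p_b = clicks_a / sends_a, clicks_b / sends_b
    p_pool = (clicks_a + clicks_b) / (sends_a + sends_b)   # pooled CTR under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / sends_a + 1 / sends_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Illustrative numbers only: current recommendations vs. new recommendations
z, p = ctr_z_test(clicks_a=500, sends_a=50_000, clicks_b=800, sends_b=50_000)
```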
So now I have discussed how we are going to measure the performance of personal recommendations. Let me jump forward in time and show you the A/B test itself...
This is the e-mail that we sent, as you can see, the only difference in the e-mail is the products that are recommended. The other parts of the e-mail are exactly the same in both variations.
As you can see, the results are extremely good! The performance of our recommendations is significantly better than that of the current logic. The CTR on the products increases by 60%, which is huge. Moreover, the people that click on the products also have a 5% higher probability of buying something when they land on the Coolblue website. This means that we provide high-quality recommendations that not only raise awareness of products, but also create desire: people actually want to buy the products that we recommend. With this A/B test we have proved the value of productionizing the model that we built.
Ok, so how did we do this?
OVERWHELMED First of all, we did some deep research. There are many well-known methods to recommend products to customers. The main techniques are collaborative filtering, content-based filtering and context-aware methods. I will not go into detail on these techniques, but I do want to mention that we can use classification techniques to predict the most relevant product for each customer. Next to that, we can use clustering techniques to group products in order to find similar products.
DO THESE FEATURES DRIVE RECOMMENDATIONS? Usually, recommender systems look at customer features. For instance, when looking at gender, the model is basically looking for products that are sold more often to females than to males and will then boost these products to female customers, and vice versa. The same happens for region, etc. It basically segments the customers into different groups, and each customer in a group gets recommended the same products. But are these features really good drivers to predict the right products for the right customer? Let's say we know that, according to her specifics, a customer is not likely to buy an Apple MacBook, maybe because we know that she always buys the less expensive products in a category. What happens if she is looking for laptops and it turns out that she is constantly looking at MacBooks? When we look at her features, we will not recommend her a MacBook, because she is not going to buy expensive products, right? But why would she look at the MacBooks then? She is telling us what she is interested in through her browsing behavior: a MacBook!
So, we have come up with a product sequence. The customer is interacting with us through the products that she sees on our website. This sequence is ordered so it means that we should be able to extract patterns and very insightful information from these sequences.
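Turning such ordered sessions into supervised examples can be sketched as "products seen so far → next product"; the session data below is hypothetical:

```python
def make_training_pairs(sessions):
    """Turn each ordered session into (prefix, next-product) training examples."""
    pairs = []
    for session in sessions:
        for i in range(1, len(session)):
            pairs.append((session[:i], session[i]))  # products seen so far -> next product
    return pairs

# Hypothetical sessions of product IDs
sessions = [[101, 102, 103], [200, 201]]
pairs = make_training_pairs(sessions)
# e.g. ([101, 102], 103) says: after viewing 101 and then 102, the customer viewed 103
```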
And then we have this. The model would generate a new sequence of products, which we can recommend.
This means that, in a very abstract way, we are trying to create a chatbot. The customers interact with this chatbot implicitly and the model produces new sentences, containing products. This is where natural language processing comes in. We can think of these sequences as sentences, but instead of words the sentences consist of product IDs. If we find a model from NLP, we should be able to input our sentences and the model would be able to extract relationships between words, which in our case are products. Models from NLP are also able to generate new sequences following the same patterns and logic based on the input data. Next to this, when we put in an unfinished sequence, it is able to finish the sequence!
So, to summarize: we have thousands of products that we can recommend in order to increase customer satisfaction and to create awareness of new products. After analyzing our data we noticed that the customer features did not drive which product is going to be bought in the future. So we need another way to personalize the content, which we do using the set of products seen within each session. These sequences are extremely personal and give us valuable information about the customer. They tell us which products are likely to be seen together, and by using millions of these sequences we are able to extract patterns and recreate customer flows.
Notice that we are dealing with a multiclass classification problem with as many classes as we have products, which makes artificial neural networks a natural candidate.
Next to that, by following the intuition I mentioned before, we are aiming for a model that can handle sequential data. This means that we should be able to use recurrent neural networks, which originate in NLP. Recurrent Neural Networks are extremely effective in modelling sequential data, which is what we need. They are capable of generating sequences following the same patterns.
Pattern recognition
Sequence modelling
Multi class classification
As many classes as products
Ok, so let's research what has already been done in recommender systems using Recurrent Neural Networks. It turns out they are actually not used a lot. We found a few papers discussing the use of recurrent neural networks in recommender systems, but only one unpublished article follows the same intuition as ours. The theoretical performance of the proposed model is promising, but it is only measured using offline experiments. This means the recommended products are not tested on real customers; they are just evaluated on a test set in the data.
The theoretical performance in this paper is based on a metric called Mean Average Precision@k.
This is a metric that measures the average precision of a set of predicted items. The set of predicted items is cut off at k, so it is only looking at the top k products in the prediction. Okay but what is average precision? It basically compares the set of recommended products with the actual relevant item. Notice that the model is trained to predict the next item in the sequence, meaning that there is only one relevant item per sequence. If this relevant product is in the recommended set of products the precision goes up by a certain amount. This amount is based on the position of this relevant product in the recommended set of products. This means that it matters in which order the recommended set of products is presented. Basically, when the relevant item is in the first position, the average precision is 1, when it is in the 5th position the average precision is 1/5. Then, the Mean Average Precision is the average of the average precisions of all sequences in the test set. We are going to compare our results with the MAP@5 in this paper since we are not going to recommend more than 5 products.
So, let me show you what a recurrent neural network looks like. This architecture reveals why recurrent neural networks are so effective at modelling sequences: they involve timesteps. In our case, we can input a product at each timestep and then predict the last product in the sequence. As mentioned before, this means that there is only one correct product. After we input a product at the first timestep, the recurrent layer computes what to let through to the next timestep. The next timestep then receives a new product, plus the valuable information from the previous timestep. This means that the network can remember long-term patterns. The output is a probability vector over the products, so during training we can compare the output vector with the actual next product. Using this feedback, the model learns to alter the parameters in the recurrent layers in order to improve its predictions.
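Since the deck names TensorFlow, the architecture described here (embedding, a recurrent layer carrying state across timesteps, and a softmax over all products) might be sketched as below; the layer sizes and sequence length are illustrative assumptions, not the talk's actual configuration.

```python
import tensorflow as tf

NUM_PRODUCTS = 45_000   # one output class per product
SEQ_LEN = 10            # padded session length (assumption)

# Embedding -> recurrent layer over timesteps -> softmax over all products
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=NUM_PRODUCTS, output_dim=64),
    tf.keras.layers.LSTM(128),                        # carries information across timesteps
    tf.keras.layers.Dense(NUM_PRODUCTS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A forward pass on a batch of two padded product-ID sequences yields one
# probability vector over all products per sequence.
probs = model(tf.zeros((2, SEQ_LEN), dtype=tf.int32))
```

Training would then feed padded prefixes of product IDs with the next product ID as the label.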
Okay, now that we have chosen a model and searched for a benchmark, it's time to train our own recurrent neural network. We train our model using 1 million sequences, which makes it computationally expensive. We started by experimenting locally on 4 CPUs, but soon found out that training a model long enough to become a good classifier would take approximately 120 hours. This is without fine-tuning parameters, so if we changed some parameters, we would have to wait 120 hours before seeing the effects. As you can understand, this is not the way to go. So we started computing in the cloud, and this significantly reduced the training time: on a GPU in the cloud it was reduced to 4.5 hours.
And this means we can fine-tune our models and see the effects way faster! As a result, this led to better models.
After experimenting and finetuning the model parameters we have obtained an MAP@5 of 0.091, which is very close to the paper discussed before. We do have to keep in mind however that our estimation is based on 20% more items than the paper, which makes it significantly harder to obtain the same MAP@5!
We have trained all models for the proof of concept using Jupyter notebooks. For the neural networks we used the TensorFlow package for Python. As mentioned before we trained the model using the Google Cloud Compute Engine.