As a leading fashion e-commerce company in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team develops machine-learning projects for this purpose based on large-scale product and customer data. A major topic for the team is ranking products: if a visitor enters a search phrase, which products best fit that phrase, and in what order should they be shown? Ranking products is also important when a visitor opens a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used across the whole pipeline: retrieving and processing search phrases and their results, building click models, creating feature sets, training and evaluating ranking models, pushing the models to production with Elasticsearch, and building Tableau dashboards. In this talk, we demonstrate how we use Spark to build the whole product-ranking pipeline and the challenges we faced along the way.
6. Our history: where we come from
1952 - first advertisement
1955 - first catalog
1995 - first steps online
2010 - completely online
2018 - mobile first
2019 - a great shop experience
7. Over 2,000 brands
C&A // Vingino // Hunkemöller // Mango // Tommy Hilfiger // Scotch & Soda // ONLY // HK Living // House Doctor // Woood // Bloomingville // Zuiver // whkmp’s own
Our categories: Fashion // Home & Garden // Electronics // Entertainment // Household // Sports & Leisure // Beauty & Health
Key figures:
- >400,000 products
- >500,000 daily visitors
- 661 million sales 18/19
- 11 million packages
- >950 colleagues
- 60% of customers shopping mobile
- 72% of our customers are female
8. Our journey
• We work(ed) with a traditional corporate data warehouse
• Need: ML, flexibility, speed, enabling, etc.
• 2 years ago: pilot of Spark on Databricks
– Challenges: training people, getting data into the cloud
• Today:
– Transformation to Databricks / cloud (S3)
– Lots of new (ML) products, prototypes, and colleagues on the Databricks platform
9. Machine learning @ wehkamp
Recommenders // Forecasting // Image classification // Search // Personalisation // Product ranking // Fraud detection // and a lot more
12. Ranking problem for ecommerce
A user searches for ‘jeans’. We return 4,401 products. Relevant?
13. Ranking problem for ecommerce
A user navigates to the ‘ladies jeans’ overview page. We return 2,176 products. Relevant?
14. Ranking problem for ecommerce
● Consider a visit to a ‘product overview page’ (e.g. ‘ladies jeans’) as a user query
● Main problem: given a user query, order the returned products to maximize relevance
15. Ranking problem for ecommerce
● How good is this list?
● Suppose we know how relevant each item is; can we define an overall score for the relevancy of the list?
● Yes we can: the answer is NDCG (Normalized Discounted Cumulative Gain), defined below
https://en.wikipedia.org/wiki/Discounted_cumulative_gain
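For reference, the standard definitions behind the score (consistent with the Wikipedia page above), where rel_i is the relevance of the item at position i:

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

IDCG@k is the DCG of the same items in the ideal (descending-relevance) order, so NDCG always lies between 0 and 1.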
16. Ranking problem for ecommerce
● Suppose we know the relevancy scores; let's rank them
● Add a correction for position via log2(i+1)
● Divide and sum to get a score: the discounted cumulative gain (7.84)
● Do the same for this list in perfect order to get an ideal DCG (IDCG): 9.00
● Divide our DCG / IDCG = normalized discounted cumulative gain (0.87)

i | relevance | log2(i+1) | relevance / log2(i+1)
1 | 2         | 1.00      | 2.00
2 | 3         | 1.58      | 1.89
3 | 4         | 2.00      | 2.00
4 | 1         | 2.32      | 0.43
5 | 3         | 2.58      | 1.16
6 | 1         | 2.81      | 0.36
Sum: 7.84
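A minimal Python sketch (not from the deck) that reproduces the numbers above:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each relevance divided by log2(position + 1).
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalize by the DCG of the same scores in perfect (descending) order.
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

scores = [2, 3, 4, 1, 3, 1]
print(round(dcg(scores), 2))                        # 7.84
print(round(dcg(sorted(scores, reverse=True)), 2))  # 9.0
print(round(ndcg(scores), 2))                       # 0.87
```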
17. Ranking problem for ecommerce
● Relevancy scores: explain the scores with features of the products (title match, article match, reviews, seasonality, price, …)
● Maximize the NDCG by giving weight to the features
21. Efforts
• Initial effort of building the pipeline: 2 data scientists and 1 data engineer (for search and the product overview pages), for a couple of months
• New click/ranking model: 1 data scientist can train, test, and push a new ranking model to production within 1 hour
22. Data collection
● Source: raw Google Analytics feed (daily)
● Per product list (i.e. a search results page or an overview page), we collect (see the sketch below):
○ ProductID
○ Position / page
○ Impression / click
● Challenges:
○ tagging differs between web and app
○ devices have different display formats
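As an illustration, a hedged PySpark sketch of the aggregation this step produces. The input layout (one row per product impression, with a 0/1 clicked flag) and all column and path names are assumptions for the sketch, not the actual Wehkamp schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical flattened Google Analytics feed: one row per product impression
# in a product list (search result or overview page).
ga = spark.read.parquet("s3://<bucket>/ga_feed/")  # illustrative path

# Aggregate to impressions and clicks per product, position and page.
impressions = (ga
    .groupBy("list_id", "product_id", "position", "page")
    .agg(F.count("*").alias("impressions"),
         F.sum("clicked").alias("clicks")))
```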
23. Click model
Reality: we don't know the relevancy scores, so we use a click model.
Goal: determine the relevance of the products in each SOP/POP.
Approach: predict the relevance of products based on their impressions and clicks, given their position.
• Clicks over Expected Clicks (COEC), corrected for small search queries
– In our case: better results, easier to train & explain (sketched below)
• DBN click model (https://github.com/varepsilon/clickmodels)
– Paper: Chapelle, O. and Zhang, Y. 2009. A dynamic Bayesian network click model for web search ranking. WWW (2009)
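A minimal sketch of the COEC idea, reusing the hypothetical `impressions` table from the data-collection sketch above; a simplification for illustration, not the production model:

```python
from pyspark.sql import functions as F

# 1. Global click-through rate per display position, across all lists.
pos_ctr = (impressions
    .groupBy("position")
    .agg((F.sum("clicks") / F.sum("impressions")).alias("pos_ctr")))

# 2. COEC per product in a list: observed clicks divided by the clicks one
#    would expect given the positions the product was shown at. COEC > 1 means
#    the product attracted more clicks than its positions alone would explain.
coec = (impressions
    .join(pos_ctr, on="position")
    .groupBy("list_id", "product_id")
    .agg(F.sum("clicks").alias("clicks"),
         F.sum(F.col("impressions") * F.col("pos_ctr")).alias("expected_clicks"))
    .withColumn("coec", F.col("clicks") / F.col("expected_clicks")))
```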
27. Feature generation
Try to explain and predict which attributes (i.e. features) of a product (w.r.t. the user query) contribute to its relevance score.

Feature examples:
- Title match
- Description match
- Tf-idf
- Popularity
- Discount / promo
- Seasonality
- Reviews
- Days online
- Brand
- …

● Limit the number of features to < 100 (latency issues)
● For POP features we did not use one-hot encoding (OHE), but a Bayesian encoder to limit the number of features (sketched below)
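As an illustration of the Bayesian-encoder idea, a smoothed target-encoding sketch in pandas; the smoothing scheme, column names, and prior weight are assumptions, not the exact encoder used at Wehkamp:

```python
import pandas as pd

def bayesian_encode(df, col, target, prior_weight=100):
    # Smoothed target encoding: blend each category's mean target value with
    # the global mean, weighted by how often the category occurs. One numeric
    # column replaces the many columns one-hot encoding would create.
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    smoothed = ((stats["count"] * stats["mean"] + prior_weight * global_mean)
                / (stats["count"] + prior_weight))
    return df[col].map(smoothed).fillna(global_mean)

# Hypothetical usage: encode 'brand' against a click-model relevance label.
# df["brand_encoded"] = bayesian_encode(df, "brand", "relevance")
```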
32. Ranking model
• Many machine-learning techniques to choose from
• The Elasticsearch LTR plugin supports XGBoost
• XGBoost → eXtreme Gradient Boosting
– A variant of the gradient boosting technique (tree-based model)
– Non-linearity
– Good results (e.g. Kaggle competitions)
– Easy to use, tune, and evaluate
– Fast (parallel computation on a single machine, but also cluster support, e.g. Spark)
• XGBoost has lots of parameters to tune; we use Hyperopt to help (see the sketch below)
https://hyperopt.github.io/hyperopt/
• XGBoost offers rank:ndcg as an objective
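A hedged sketch of what such a tuning loop can look like; the data variables (X_train, y_train, group_train) and the NDCG evaluation helper are hypothetical placeholders, not the production pipeline:

```python
import xgboost as xgb
from hyperopt import fmin, tpe, hp, STATUS_OK

def objective(params):
    model = xgb.XGBRanker(
        objective="rank:ndcg",          # optimize ranking quality directly
        n_estimators=200,
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
    )
    # X_train / y_train: features and click-model relevance labels;
    # group_train: number of products per query (required by rankers).
    model.fit(X_train, y_train, group=group_train)
    score = validation_ndcg(model)      # hypothetical evaluation helper
    return {"loss": -score, "status": STATUS_OK}  # Hyperopt minimizes loss

space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, -1),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
```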
41. Evaluation
• Use A/B testing to check whether the ranking models outperform the standard implementation. Configuration of the tests is done with PlanOut (minimal sketch below)
https://github.com/facebook/planout
• An automated Tableau report shows the results of each A/B test
• We report quite a few metrics, but most importantly look at:
- Click-through rate
- Revenue per session
- PaulScore (https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary)
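A minimal sketch of what a PlanOut assignment for such a test might look like; the experiment and arm names are illustrative, not the actual configuration:

```python
from planout.experiment import SimpleExperiment
from planout.ops.random import UniformChoice

class RankingExperiment(SimpleExperiment):
    # Deterministically assigns each visitor to a ranking-model arm.
    def assign(self, params, visitor_id):
        params.model = UniformChoice(
            choices=["baseline", "xgboost_ranker"],
            unit=visitor_id)

# The same visitor_id always maps to the same arm, and exposures are logged,
# which is what an automated report can be built on.
experiment = RankingExperiment(visitor_id="visitor-123")
print(experiment.get("model"))
```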
43. Our journey ahead
• For search: build multiple models for multiple categories, based on search-phrase classification
• Add more product-specific attributes
• Test with personalisation
44. Wrap up
Automating a learning-to-rank pipeline requires a lot of different parts working together:
- Google Analytics
- Databricks / Spark
- Elasticsearch
- S3
- XGBoost
- Hyperopt / SHAP
- MLflow
- Planout
- Tableau