Many ecommerce companies have extensive logs of user behavior such as clicks and conversions. However, if supervised learning is applied naively, systems can suffer from poor performance due to bias and feedback loops. Using techniques from counterfactual learning, we can leverage log data in a principled manner to model user behavior and build personalized recommender systems. At Grubhub, a user journey begins with recommendations, and the vast majority of conversions are powered by them. Our recommender policies can drive user behavior to increase orders and/or profit, so the ability to rapidly iterate and experiment is very important. With our GPU workflows we iterate 200% faster than with counterpart CPU workflows: developers prototype ideas in GPU-backed notebooks, hyperparameter spaces are explored up to 8x faster with multi-GPU Ray clusters, and solutions ship from notebook to production in half the time with nbdev. With these accelerated DS workflows and deep learning on GPUs, we delivered a +12.6% conversion boost in just a few months. In this talk we present modern techniques for industrial recommender systems powered by GPU workflows: first a short background on counterfactual learning techniques, followed by practical information and data from our industrial application.
By Alex Egg; accepted to the NVIDIA GTC 2021 conference
6. Candidate Selection (Recall)
Motivation: we can't rank the whole catalog within the serving SLA; we need a fast, high-recall candidate set << the catalogue
● Metadata-based filters: e.g. select items in the user's favorite cuisines or genres
● Item co-occurrences: e.g. clusters containing your past items
● k-nearest neighbors: e.g. find similar items in ℝⁿ space (see ANN later; a sketch follows below)
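A minimal sketch of the k-NN recall step, assuming hypothetical 32-dimensional embeddings and using scikit-learn's exact NearestNeighbors as a stand-in for the approximate (ANN) index discussed later:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embeddings: a 10k-item catalogue and one user, both in R^32.
rng = np.random.default_rng(0)
item_vectors = rng.normal(size=(10_000, 32)).astype("float32")
user_vector = rng.normal(size=(1, 32)).astype("float32")

# Exact index for illustration; production would swap in an approximate
# index (e.g. HNSW/faiss) to stay inside the serving SLA.
index = NearestNeighbors(n_neighbors=500, metric="cosine").fit(item_vectors)

# Recall step: shrink the full catalogue to a small, high-recall candidate set.
_, candidate_ids = index.kneighbors(user_vector)
print(candidate_ids.shape)  # (1, 500): candidates << catalogue
```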
7. Ranking (Precision)
Rank Candidates w/ high precision using Supervised Learning
● Classification
○ Binomial: P( click | u, i )
○ Multinomial: P( I | u ) → Autoencoders
● Ranking
○ Pointwise, pairwise, listwise
* Choice of approach is a product of your supervision labels: binary feedback or graded relevance labels (a sketch of the pointwise vs. pairwise objectives follows below)
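A small sketch (hypothetical scores and labels) of how the supervision labels shape the objective: pointwise binary cross-entropy fits P(click | u, i) directly, while a pairwise (BPR-style) loss only asks that a clicked item outrank an unclicked one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pointwise: treat each (user, item) pair independently, P(click | u, i),
# and fit with binary cross-entropy against 0/1 click labels.
def pointwise_loss(scores, clicks):
    p = sigmoid(scores)
    return -np.mean(clicks * np.log(p) + (1 - clicks) * np.log(1 - p))

# Pairwise (BPR-style): only require that a clicked item scores higher
# than an unclicked item shown in the same session.
def pairwise_loss(pos_scores, neg_scores):
    return -np.mean(np.log(sigmoid(pos_scores - neg_scores)))

scores = np.array([2.0, -1.0, 0.5])   # hypothetical model scores
clicks = np.array([1.0, 0.0, 1.0])    # hypothetical click labels
print(pointwise_loss(scores, clicks), pairwise_loss(scores[:1], scores[1:2]))
```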
8. Supervised Learning Task w/ sparse categorical variables:
f(X) = y,  D = (X, y),  X = (U, R),  {U, R} ∈ ℝ¹,  y ∈ {0, 1}
Linear Model: P(y|X) = σ(Xw) = σ(u₁w₁ + r₁w₂)
U(is_french) = {1, -1}
R(is_french) = {1, -1}
X=[1, 1] ← french lover + french rest
X=[1, -1] ← french lover + non-french rest
X=[-1, 1] ← french hater + french rest
X=[-1,-1] ← french hater + non-french rest
Personalization (Modeling Interactions)
Feature Crosses (2nd-order):
σ(ɸ(X)w) = σ(u₁w₁ + r₁w₂ + u₁r₁w₃)
X=[1, 1, 1] ← french lover + french rest
X=[1, -1, -1] ← french lover + non-french rest
X=[-1, 1, -1] ← french hater + french rest
X=[-1,-1, 1] ← french hater + non-french rest
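A tiny numeric check of why the cross term is needed, using the four examples above: the labels form an XOR pattern that no linear weighting of u and r can separate, while the single crossed feature u·r recovers them exactly:

```python
import numpy as np

# The four (user, restaurant) combinations above; a taste match
# (lover+french, hater+non-french) should get y=1, otherwise y=0.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, 0, 0, 1], dtype=float)   # XOR-like: linearly inseparable

crossed = X[:, 0] * X[:, 1]               # ɸ(X): the 2nd-order cross u*r
print((crossed + 1) / 2)                  # [1. 0. 0. 1.] -> recovers y exactly
```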
Go Deep!: 2-layer MLP
How to model nth-Order Interactions?
● Explicit & implicit feature crosses (very sparse feature space, expensive)
● Combinations of explicit and implicit (wide & deep)!
9. Deep & Cross Network
Are multiplicative crosses enough? FMs → MLPs
Recent studies [1, 2] found that DNNs are inefficient at even approximately modeling 2nd- or 3rd-order feature crosses.
● What is the advantage of DCN?
○ Efficient explicit feature crosses (sketched after the references below)
1: Latent Cross: Making Use of Context in Recurrent Recommender Systems. Beutel et al. WSDM '18
2: Deep & Cross Network for Ad Click Predictions. Wang et al. ADKDD '17
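A minimal numpy sketch of the cross layer from [2], x_{l+1} = x₀(xₗᵀwₗ) + bₗ + xₗ: each stacked layer adds one explicit degree of polynomial feature crossing at only O(d) parameters:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl.
    xl @ w is a scalar, so each layer costs only O(d) parameters
    (w, b are d-vectors) yet crosses every input dimension explicitly."""
    return x0 * (xl @ w) + b + xl

d = 8
rng = np.random.default_rng(0)
x0 = rng.normal(size=d)                # embedded input features (hypothetical)
w1, b1 = rng.normal(size=d), np.zeros(d)
w2, b2 = rng.normal(size=d), np.zeros(d)

x1 = cross_layer(x0, x0, w1, b1)       # up to 2nd-order crosses
x2 = cross_layer(x0, x1, w2, b2)       # up to 3rd-order crosses
print(x2.shape)                        # (8,): same width as the input
```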
13. Biased Data P( y | X )
Feedback
● Organic/Full-Feedback
○ Common in Academia, Rare in industry (ADS-16, MSLR-WEB30k)
● Bandit/partially-observed-Feedback
○ Click logs from industrial applications
Analogy: what is your favourite color, red or black? ⇒ red
P(y=red | X=🧐) = 1 ← is this actually true?
“Missing Not At Random” (MNAR)
Apply this analogy to any recsys you use: Netflix, Spotify, Amazon, Grubhub
14. Evaluation (Thought Experiment)
● Classic train/test split: predict the test set accurately...
● Dataset of production system logs: D = (x, y, r)...
● What is the value of predicting the test set accurately?
● Is the test set a reflection of organic user behavior? (No) Or a reflection of the logging policy!? (Yes)
● There is a difference between a prediction and a recommendation (a recommendation is an intervention)
● Bandit feedback is the product of a logging policy
● The logging policy is the previous-generation recommender, i.e. it generated the dataset (logs)
                     Goal
Supervised Learning  Predict the test set
Actual               Predict user behavior

⇒ Test set ≠ user behavior
16. Causal Embeddings for Recommendation. Bonner, Vasile. Recsys ‘18
Selection Bias (Randomization)
Bias from Feedback loops
Add Exploration → Stochasticity (Randomization)
● Random Exploration w/ ϵ-Greedy Bandit
● Causal Embeddings: jointly factorize unbiased and greedy embeddings
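A minimal sketch of the ϵ-greedy slate policy (hypothetical scores): the occasional uniformly random slate supplies the unbiased feedback that debiasing methods such as CausE consume:

```python
import numpy as np

def epsilon_greedy_slate(scores, slate_size, epsilon=0.1, rng=None):
    """With prob. 1-epsilon serve the greedy (top-scored) slate; with prob.
    epsilon serve a uniformly random slate. The random slates produce the
    unbiased feedback used to correct selection bias downstream."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return rng.choice(len(scores), size=slate_size, replace=False)
    return np.argsort(scores)[::-1][:slate_size]

scores = np.array([0.9, 0.2, 0.5, 0.7, 0.1])   # hypothetical model scores
print(epsilon_greedy_slate(scores, slate_size=3, epsilon=0.1,
                           rng=np.random.default_rng(7)))
```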
17. Position Bias (Randomization)
Bias from item position in the device UI
Inverse Propensity Scoring
Compute inverse propensities 1/bᵢ across ranks from the random bucket.
Offset the loss: L = Σᵢ ℓ(yᵢ, ŷᵢ) / bᵢ
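A sketch of that IPS-offset objective, assuming hypothetical per-rank propensities bᵢ estimated from the random bucket: each example's cross-entropy term is weighted by 1/bᵢ, so impressions the logging policy over-exposed are down-weighted:

```python
import numpy as np

def ips_weighted_bce(y, p_hat, propensities):
    """Binary cross-entropy where each logged example is re-weighted by the
    inverse of its logging propensity b_i (e.g. the empirical CTR of its
    rank, estimated from the random bucket), correcting position bias."""
    w = 1.0 / propensities
    ll = y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)
    return -np.mean(w * ll)

y = np.array([1.0, 0.0, 1.0])        # hypothetical click labels
p_hat = np.array([0.8, 0.3, 0.6])    # hypothetical model predictions
b = np.array([0.5, 0.2, 0.1])        # hypothetical rank propensities
print(ips_weighted_bce(y, p_hat, b))
```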
25. Offline Evaluation
Metric  Corr. w/ Conversion
MRR     .55 📈
AUC     .51
NDCG    .48
loss    .02
Surrogate metric: can we get directional estimates of online metrics, offline?
Can we design a metric that tracks conversion rate?
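For concreteness, a small sketch of MRR, the offline metric that tracked conversion best above, computed over hypothetical logged slates:

```python
import numpy as np

def mrr(ranked_slates, clicked_items):
    """Mean reciprocal rank of the clicked item across logged slates;
    per the table above, the best offline surrogate for conversion."""
    rr = []
    for slate, clicked in zip(ranked_slates, clicked_items):
        rank = list(slate).index(clicked) + 1   # 1-indexed rank of the click
        rr.append(1.0 / rank)
    return float(np.mean(rr))

slates = [[3, 1, 2], [2, 3, 1]]   # hypothetical ranked item ids
clicks = [1, 2]                   # item clicked in each slate
print(mrr(slates, clicks))        # (1/2 + 1/1) / 2 = 0.75
```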
26. On-Policy Evaluation
● Most-popular policy vs Personalization Policy
● Personalization policy +20% & improves over time
● Diversity: ~5 cuisines/slate, 60% unique
27. On-Policy Evaluation: Fairness
Inequality (Lorenz Curve + Gini Index)
Quantifies inequality (e.g. impressions across merchants, or wealth across a population)
The explore-exploit (EE) variant is more equitable than the most-popular (MP) variant.
Policy    Gini
Baseline  .60
MP        .59
EE        .49 🏅
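A short sketch of the Gini index over a merchant impression distribution (0 = perfectly equal exposure, 1 = maximally concentrated), via the standard sorted-data Lorenz-curve formula:

```python
import numpy as np

def gini(impressions):
    """Gini index of an impression distribution across merchants:
    0 = perfectly equal exposure, 1 = all impressions to one merchant."""
    x = np.sort(np.asarray(impressions, dtype=float))   # ascending
    n = len(x)
    # Standard discrete formula derived from the Lorenz curve.
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * np.sum(x)) - (n + 1) / n

print(gini([100, 100, 100, 100]))   # 0.0  -> perfectly equal exposure
print(gini([0, 0, 0, 400]))         # 0.75 -> highly concentrated
```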
28. Experiment: Causal Embeddings
Hypothesis: if we use the uniform data in a principled manner we can increase performance by overcoming selection bias.
Experiment:
● Random
● Biased
● Random ∪ Biased
● CausE
Results: Principled use of uniform was beneficial
Causal Embeddings for Recommendation. Bonner, Vasile. Recsys ‘18
Training data        AUC
Random (small data)  .56
Biased               .72
Random ∪ Biased      .73
CausE                .74
29. Experiment: Interaction Modeling
Hypothesis: MLPs are universal function approximators?
Experiment: evaluate an MLP against explicit feature crosses
Results: the MLP does not capture the full interactions
        NDCG           MRR            AUC
        (unintentful)  (unintentful)  (unintentful)
Random  .511           .216           .500
UMP     .615           .582           .653
MLP     .627           .586           .689
DCN     .657 (+4.7%)   .617 (+5.2%)   .695 (+0.8%)
31. Experiment: Global vs Local Models (Markets)
If your recommender operates in markets of varying sizes with distinct cultural/taste patterns, it's important that your recs are high-quality in all markets.
● Operational Pain
● Market Sparsity
        NDCG           MRR            AUC
        (unintentful)  (unintentful)  (unintentful)
Local   .617           .557           .635
Global  .749 (+21%)    .709 (+27.2%)  .736 (+12%)
32. Experiment: GPU Data Pipelines
● IO → CPU → RAM → GPU RAM → GPU
● IO Bound
○ Sequential data access: LibSVM → TFRecords; GPU utilization: 4% → 90%
○ tf.data pipelines run on the CPU only
○ Vectorized map: batch before map
○ Prefetch (to GPU memory); see the sketch below
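A sketch of the pattern above in tf.data, assuming a hypothetical file glob and feature schema: batching before mapping vectorizes parsing, and prefetching (optionally straight into GPU memory) keeps the accelerator fed:

```python
import tensorflow as tf

def parse_batch(serialized):
    # Hypothetical schema: one scalar label + one dense feature vector.
    spec = {"label": tf.io.FixedLenFeature([], tf.float32),
            "feats": tf.io.FixedLenFeature([32], tf.float32)}
    ex = tf.io.parse_example(serialized, spec)  # vectorized over the batch
    return ex["feats"], ex["label"]

files = tf.data.Dataset.list_files("data/train-*.tfrecord")  # hypothetical path
ds = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
      .batch(448)                                            # batch BEFORE map...
      .map(parse_batch, num_parallel_calls=tf.data.AUTOTUNE) # ...so parsing is vectorized
      .prefetch(tf.data.AUTOTUNE))            # overlap CPU prep with GPU steps

# Stage batches straight into GPU memory so the accelerator never starves.
ds = ds.apply(tf.data.experimental.prefetch_to_device("/gpu:0"))
```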
Device              Steps/s     Train Time (h)  Batch Size
CPU (64 cores)      0.58        47              448
K80 (4992 cores)    3.2         10              127
V100* (5120 cores)  8.5 (2-3x)  3.2 (3x)        448
* Not using Tensor-Cores/FP16