Many ecommerce companies have extensive logs of user behavior such as clicks and conversions. However, if supervised learning is applied naively, systems can suffer from poor performance due to bias and feedback loops. Using techniques from counterfactual learning, we can leverage log data in a principled manner to model user behavior and build personalized recommender systems. At Grubhub, a user journey begins with recommendations, and the vast majority of conversions are powered by them. Our recommender policies can drive user behavior to increase orders and/or profit, so the ability to rapidly iterate and experiment is very important. With our GPU workflows we iterate 200% faster than with counterpart CPU workflows: developers prototype ideas in GPU-backed notebooks, hyperparameter spaces are explored up to 8x faster with multi-GPU Ray clusters, and solutions ship from notebook to production in half the time with nbdev. With these accelerated DS workflows and deep learning on GPUs, we delivered a +12.6% conversion boost in just a few months. In this talk we present modern techniques for industrial recommender systems powered by GPU workflows: first a short background on counterfactual learning techniques, followed by practical information and data from our industrial application.
By Alex Egg; accepted to the NVIDIA GTC 2021 conference
6. Candidate Selection (Recall)
Motivation: we can't rank the whole catalog within the serving SLA; we need a fast, high-recall candidate set << the catalogue
● Metadata-based filters: e.g. select items in the user's favorite cuisines or genres
● Item co-occurrences: e.g. clusters containing your past items
● k-nearest neighbors: e.g. find similar items in ℝⁿ space (see ANN later; a sketch follows below)
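A minimal sketch of the k-NN recall step, assuming hypothetical 32-dimensional embeddings and using scikit-learn's exact NearestNeighbors as a stand-in for the approximate (ANN) index discussed later:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embeddings: a 10k-item catalogue and one user, both in R^32.
rng = np.random.default_rng(0)
item_vectors = rng.normal(size=(10_000, 32)).astype("float32")
user_vector = rng.normal(size=(1, 32)).astype("float32")

# Exact index for illustration; production would swap in an approximate
# index (e.g. HNSW/faiss) to stay inside the serving SLA.
index = NearestNeighbors(n_neighbors=500, metric="cosine").fit(item_vectors)

# Recall step: shrink the full catalogue to a small, high-recall candidate set.
_, candidate_ids = index.kneighbors(user_vector)
print(candidate_ids.shape)  # (1, 500): candidates << catalogue
```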
7. Ranking (Precision)
Rank Candidates w/ high precision using Supervised Learning
● Classification
○ Binomial: P( click | u, i )
○ Multinomial: P( I | u ) → Autoencoders
● Ranking
○ Pointwise, pairwise, listwise
* Choice of approach is a product of your supervision labels: binary feedback or graded relevance labels (a sketch of the pointwise vs. pairwise objectives follows below)
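A small sketch (hypothetical scores and labels) of how the supervision labels shape the objective: pointwise binary cross-entropy fits P(click | u, i) directly, while a pairwise (BPR-style) loss only asks that a clicked item outrank an unclicked one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pointwise: treat each (user, item) pair independently, P(click | u, i),
# and fit with binary cross-entropy against 0/1 click labels.
def pointwise_loss(scores, clicks):
    p = sigmoid(scores)
    return -np.mean(clicks * np.log(p) + (1 - clicks) * np.log(1 - p))

# Pairwise (BPR-style): only require that a clicked item scores higher
# than an unclicked item shown in the same session.
def pairwise_loss(pos_scores, neg_scores):
    return -np.mean(np.log(sigmoid(pos_scores - neg_scores)))

scores = np.array([2.0, -1.0, 0.5])   # hypothetical model scores
clicks = np.array([1.0, 0.0, 1.0])    # hypothetical click labels
print(pointwise_loss(scores, clicks), pairwise_loss(scores[:1], scores[1:2]))
```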
8. Supervised Learning Task w/ sparse categorical variables:
f(X) = y,  D = (X, y),  X = (U, R),  {U, R} ∈ ℝ¹,  y ∈ {0, 1}
Linear Model: P(y|X) = σ(Xw) = σ(u₁w₁ + r₁w₂)
U(is_french) = {1, -1}
R(is_french) = {1, -1}
X=[1, 1] ← french lover + french rest
X=[1, -1] ← french lover + non-french rest
X=[-1, 1] ← french hater + french rest
X=[-1,-1] ← french hater + non-french rest
Personalization (Modeling Interactions)
Feature Crosses (2nd-order):
σ(ɸ(X)w) = σ(u₁w₁ + r₁w₂ + u₁r₁w₃)
X=[1, 1, 1] ← french lover + french rest
X=[1, -1, -1] ← french lover + non-french rest
X=[-1, 1, -1] ← french hater + french rest
X=[-1,-1, 1] ← french hater + non-french rest
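A tiny numeric check of why the cross term is needed, using the four examples above: the labels form an XOR pattern that no linear weighting of u and r can separate, while the single crossed feature u·r recovers them exactly:

```python
import numpy as np

# The four (user, restaurant) combinations above; a taste match
# (lover+french, hater+non-french) should get y=1, otherwise y=0.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, 0, 0, 1], dtype=float)   # XOR-like: linearly inseparable

crossed = X[:, 0] * X[:, 1]               # ɸ(X): the 2nd-order cross u*r
print((crossed + 1) / 2)                  # [1. 0. 0. 1.] -> recovers y exactly
```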
Go Deep!: 2-layer MLP
How to model nth-Order Interactions?
● Explicit & implicit feature crosses (very sparse feature space, expensive)
● Combinations of explicit and implicit (wide & deep)!
9. Deep & Cross Network
Are multiplicative crosses enough? FMs → MLPs
Recent studies [1, 2] found that DNNs are inefficient at even approximately modeling 2nd- or 3rd-order feature crosses.
● What is the advantage of DCN?
○ Efficient explicit feature crosses (sketched after the references below)
1: Latent Cross: Making Use of Context in Recurrent Recommender Systems. Beutel et al. WSDM '18
2: Deep & Cross Network for Ad Click Predictions. Wang et al. ADKDD '17
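A minimal numpy sketch of the cross layer from [2], x_{l+1} = x₀(xₗᵀwₗ) + bₗ + xₗ: each stacked layer adds one explicit degree of polynomial feature crossing at only O(d) parameters:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl.
    xl @ w is a scalar, so each layer costs only O(d) parameters
    (w, b are d-vectors) yet crosses every input dimension explicitly."""
    return x0 * (xl @ w) + b + xl

d = 8
rng = np.random.default_rng(0)
x0 = rng.normal(size=d)                # embedded input features (hypothetical)
w1, b1 = rng.normal(size=d), np.zeros(d)
w2, b2 = rng.normal(size=d), np.zeros(d)

x1 = cross_layer(x0, x0, w1, b1)       # up to 2nd-order crosses
x2 = cross_layer(x0, x1, w2, b2)       # up to 3rd-order crosses
print(x2.shape)                        # (8,): same width as the input
```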
13. Biased Data P( y | X )
Feedback
● Organic/Full-Feedback
○ Common in Academia, Rare in industry (ADS-16, MSLR-WEB30k)
● Bandit/partially-observed-Feedback
○ Click logs from industrial applications
Analogy: what is your favourite color, red or black? ⇒ red
P(y=red | X=🧐) = 1 ← is this actually true?
“Missing Not At Random” (MNAR)
Apply this analogy to any recsys you use: Netflix, Spotify, Amazon, Grubhub
14. Evaluation (Thought Experiment)
● Classic train/test split: predict the test set accurately...
● Dataset of production system logs: D = (x, y, r)...
● What is the value of predicting the test set accurately?
● Is the test set a reflection of organic user behavior? (No) Or a reflection of the logging policy!? (Yes)
● There is a difference between a prediction and a recommendation (a recommendation is an intervention)
● Bandit feedback is the product of a logging policy
● The logging policy is the previous-generation recommender, i.e. it generated the dataset (logs)
                     Goal
Supervised Learning  Predict the test set
Actual               Predict user behavior

⇒ Test set ≠ user behavior
16. Causal Embeddings for Recommendation. Bonner, Vasile. Recsys ‘18
Selection Bias (Randomization)
Bias from Feedback loops
Add Exploration → Stochasticity (Randomization)
● Random Exploration w/ ϵ-Greedy Bandit
● Causal Embeddings: jointly factorize unbiased and greedy embeddings
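A minimal sketch of the ϵ-greedy slate policy (hypothetical scores): the occasional uniformly random slate supplies the unbiased feedback that debiasing methods such as CausE consume:

```python
import numpy as np

def epsilon_greedy_slate(scores, slate_size, epsilon=0.1, rng=None):
    """With prob. 1-epsilon serve the greedy (top-scored) slate; with prob.
    epsilon serve a uniformly random slate. The random slates produce the
    unbiased feedback used to correct selection bias downstream."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return rng.choice(len(scores), size=slate_size, replace=False)
    return np.argsort(scores)[::-1][:slate_size]

scores = np.array([0.9, 0.2, 0.5, 0.7, 0.1])   # hypothetical model scores
print(epsilon_greedy_slate(scores, slate_size=3, epsilon=0.1,
                           rng=np.random.default_rng(7)))
```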
17. Position Bias (Randomization)
Bias from item position in the device UI
Inverse Propensity Scoring
Compute inverse propensities 1/bᵢ across ranks from the random bucket.
Offset the loss: L = Σᵢ ℓ(yᵢ, ŷᵢ) / bᵢ
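A sketch of that IPS-offset objective, assuming hypothetical per-rank propensities bᵢ estimated from the random bucket: each example's cross-entropy term is weighted by 1/bᵢ, so impressions the logging policy over-exposed are down-weighted:

```python
import numpy as np

def ips_weighted_bce(y, p_hat, propensities):
    """Binary cross-entropy where each logged example is re-weighted by the
    inverse of its logging propensity b_i (e.g. the empirical CTR of its
    rank, estimated from the random bucket), correcting position bias."""
    w = 1.0 / propensities
    ll = y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)
    return -np.mean(w * ll)

y = np.array([1.0, 0.0, 1.0])        # hypothetical click labels
p_hat = np.array([0.8, 0.3, 0.6])    # hypothetical model predictions
b = np.array([0.5, 0.2, 0.1])        # hypothetical rank propensities
print(ips_weighted_bce(y, p_hat, b))
```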
25. Offline Evaluation
Metric  Corr. w/ Conversion
MRR     .55 📈
AUC     .51
NDCG    .48
loss    .02
Surrogate metric: can we get directional estimates of online metrics, offline?
Can we design a metric that tracks conversion rate?
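For concreteness, a small sketch of MRR, the offline metric that tracked conversion best above, computed over hypothetical logged slates:

```python
import numpy as np

def mrr(ranked_slates, clicked_items):
    """Mean reciprocal rank of the clicked item across logged slates;
    per the table above, the best offline surrogate for conversion."""
    rr = []
    for slate, clicked in zip(ranked_slates, clicked_items):
        rank = list(slate).index(clicked) + 1   # 1-indexed rank of the click
        rr.append(1.0 / rank)
    return float(np.mean(rr))

slates = [[3, 1, 2], [2, 3, 1]]   # hypothetical ranked item ids
clicks = [1, 2]                   # item clicked in each slate
print(mrr(slates, clicks))        # (1/2 + 1/1) / 2 = 0.75
```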
26. On-Policy Evaluation
● Most-popular policy vs Personalization Policy
● Personalization policy +20% & improves over time
● Diversity: ~5 cuisines/slate, 60% unique
27. On-Policy Evaluation: Fairness
Inequality (Lorenz Curve + Gini Index)
Quantifies inequality (e.g. impressions across merchants, or wealth across a population)
The explore-exploit (EE) variant is more equitable than the most-popular (MP) variant.
Policy    Gini
Baseline  .60
MP        .59
EE        .49 🏅
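A short sketch of the Gini index over a merchant impression distribution (0 = perfectly equal exposure, 1 = maximally concentrated), via the standard sorted-data Lorenz-curve formula:

```python
import numpy as np

def gini(impressions):
    """Gini index of an impression distribution across merchants:
    0 = perfectly equal exposure, 1 = all impressions to one merchant."""
    x = np.sort(np.asarray(impressions, dtype=float))   # ascending
    n = len(x)
    # Standard discrete formula derived from the Lorenz curve.
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * np.sum(x)) - (n + 1) / n

print(gini([100, 100, 100, 100]))   # 0.0  -> perfectly equal exposure
print(gini([0, 0, 0, 400]))         # 0.75 -> highly concentrated
```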
28. Experiment: Causal Embeddings
Hypothesis: if we use the uniform data in a principled manner we can increase performance by overcoming selection bias.
Experiment:
● Random
● Biased
● Random ∪ Biased
● CausE
Results: Principled use of uniform was beneficial
Causal Embeddings for Recommendation. Bonner, Vasile. Recsys ‘18
Training data        AUC
Random (small data)  .56
Biased               .72
Random ∪ Biased      .73
CausE                .74
29. Experiment: Interaction Modeling
Hypothesis: MLPs are universal function approximators?
Experiment: evaluate an MLP against explicit feature crosses
Results: the MLP does not capture the full interactions
        NDCG           MRR            AUC
        (unintentful)  (unintentful)  (unintentful)
Random  .511           .216           .500
UMP     .615           .582           .653
MLP     .627           .586           .689
DCN     .657 (+4.7%)   .617 (+5.2%)   .695 (+0.8%)
31. Experiment: Global vs Local Models (Markets)
If your recommender operates in markets of varying sizes with distinct cultural/taste patterns, it's important that your recs are high-quality in all markets.
● Operational Pain
● Market Sparsity
        NDCG           MRR            AUC
        (unintentful)  (unintentful)  (unintentful)
Local   .617           .557           .635
Global  .749 (+21%)    .709 (+27.2%)  .736 (+12%)
32. Experiment: GPU Data Pipelines
● IO → CPU → RAM → GPU RAM → GPU
● IO Bound
○ Sequential data access: LibSVM → TFRecords; GPU utilization: 4% → 90%
○ tf.data pipelines run on the CPU only
○ Vectorized map: batch before map
○ Prefetch (to GPU memory); see the sketch below
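A sketch of the pattern above in tf.data, assuming a hypothetical file glob and feature schema: batching before mapping vectorizes parsing, and prefetching (optionally straight into GPU memory) keeps the accelerator fed:

```python
import tensorflow as tf

def parse_batch(serialized):
    # Hypothetical schema: one scalar label + one dense feature vector.
    spec = {"label": tf.io.FixedLenFeature([], tf.float32),
            "feats": tf.io.FixedLenFeature([32], tf.float32)}
    ex = tf.io.parse_example(serialized, spec)  # vectorized over the batch
    return ex["feats"], ex["label"]

files = tf.data.Dataset.list_files("data/train-*.tfrecord")  # hypothetical path
ds = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
      .batch(448)                                            # batch BEFORE map...
      .map(parse_batch, num_parallel_calls=tf.data.AUTOTUNE) # ...so parsing is vectorized
      .prefetch(tf.data.AUTOTUNE))            # overlap CPU prep with GPU steps

# Stage batches straight into GPU memory so the accelerator never starves.
ds = ds.apply(tf.data.experimental.prefetch_to_device("/gpu:0"))
```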
Device              Steps/s     Train Time (h)  Batch Size
CPU (64 cores)      0.58        47              448
K80 (4992 cores)    3.2         10              127
V100* (5120 cores)  8.5 (2-3x)  3.2 (3x)        448
* Not using Tensor-Cores/FP16