Counterfactual Learning to Rank:
Personalized Recommendations in Ecommerce
Alex Egg | @eggie5
Outline
● 2 Stage IR System
● Candidate Selection
● Ranking
● Personalization (Modeling Interactions)
● Features (for recommenders)
● Log Feedback
● Biased Data (Counterfactuals & Reinforcement)
● Training
● Tuning
● Deployment
● Evaluation
● Ops
● Restaurant Recommendations
● Menu/Dish Recommendations
● Rest/Dish/Cuisine Search
● Cuisine Recommendations
Two-Stage Information Retrieval System
2 Stages:
● Candidates
● Rankings
Candidate Selection (Recall)
Motivation: We can’t rank the whole catalog within the latency SLA
Fast, high-recall candidate set << Catalogue
● metadata-based filters: eg select items in user’s fav cuisines or genre
● Item co-occurrences: eg clusters that belong to your past items
● k-nearest neighbors: eg find similar items in ℝ^n space (see ANN later)
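The k-nearest-neighbors option can be sketched as a brute-force cosine search over item embeddings. This is only an illustration: the embeddings, the `knn_candidates` helper, and the sizes are assumptions, and a production system would likely swap the exact search for an approximate (ANN) index.

```python
import numpy as np

def knn_candidates(query_vec, item_vecs, k):
    """Return indices of the k items most similar to query_vec (cosine)."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    sims = items @ q
    # argpartition finds the top-k without sorting the whole catalogue.
    top_k = np.argpartition(-sims, k - 1)[:k]
    # Sort only the k survivors by descending similarity.
    return top_k[np.argsort(-sims[top_k])]

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(1000, 16))                 # hypothetical item embeddings
user_vec = item_vecs[42] + 0.01 * rng.normal(size=16)   # a user very close to item 42
candidates = knn_candidates(user_vec, item_vecs, k=10)
```

The 10 returned indices are the recall set that the ranking stage then scores with a heavier model.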
Ranking (Precision)
Rank Candidates w/ high precision using Supervised Learning
● Classification
○ Binomial: P( click | u, i )
○ Multinomial: P( I | u ) → Autoencoders
● Ranking
○ Pointwise, pairwise, listwise
* Choice of approach is a product of your supervision labels: binary feedback or relevance labels
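To make the pointwise / pairwise / listwise distinction concrete, here is a minimal sketch of one common loss from each family. The specific choices (binary cross-entropy, a RankNet-style pairwise logistic loss, softmax cross-entropy) are assumptions for illustration, not necessarily the losses used in the talk.

```python
import numpy as np

def pointwise_loss(scores, labels):
    # Binary cross-entropy on logits: each item is judged independently.
    return np.mean(np.logaddexp(0.0, scores) - labels * scores)

def pairwise_loss(scores, labels):
    # Logistic loss on score differences for every pair (i, j) where
    # item i is more relevant than item j (RankNet-style).
    diff = scores[:, None] - scores[None, :]
    mask = labels[:, None] > labels[None, :]
    return np.mean(np.logaddexp(0.0, -diff[mask]))

def listwise_loss(scores, labels):
    # Softmax cross-entropy over the whole list, with labels normalized
    # into a target distribution.
    log_softmax = scores - np.logaddexp.reduce(scores)
    return -np.sum(labels * log_softmax) / np.sum(labels)

labels = np.array([1.0, 0.0, 0.0, 1.0])      # hypothetical binary feedback
good = np.array([2.0, -1.0, -2.0, 1.5])      # scores that agree with the labels
bad = -good                                  # scores that invert the ordering
```

All three losses are lower for `good` than for `bad`; they differ in what they compare (one item, a pair, or the full list).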
Supervised Learning Task w/ sparse categorical variables:
f(X) = y, D = (X, y), X = (U, R), {U, R} ∈ ℝ^1, y ∈ {0,1}
Linear Model: P(y|X) = σ(Xw) = σ(u_1 w_1 + r_1 w_2)
U(is_french) = {1, -1}
R(is_french) = {1, -1}
X=[1, 1] ← french lover + french rest
X=[1, -1] ← french lover + non-french rest
X=[-1, 1] ← french hater + french rest
X=[-1,-1] ← french hater + non-french rest
Personalization (Modeling Interactions)
Feature Crosses: (2nd-order)
σ(ɸ(X)w) = σ(u_1 w_1 + r_1 w_2 + u_1 r_1 w_3)
X=[1, 1, 1] ← french lover + french rest
X=[1, -1, -1] ← french lover + non-french rest
X=[-1, 1, -1] ← french hater + french rest
X=[-1,-1, 1] ← french hater + non-french rest
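A small numerical check of why the cross term matters. The labels are an assumption (a click only when the user's taste and the restaurant match), and `fit_logistic` is a hypothetical helper doing plain gradient descent without a bias term.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=500):
    """Plain gradient descent on the logistic loss, no bias term."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(X, w):
    return 1.0 / (1.0 + np.exp(-X @ w))

# Rows: [user is_french, restaurant is_french], same order as the slide.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
# Assumed labels: positive only when user taste and restaurant match.
y = np.array([1, 0, 0, 1], dtype=float)

# First-order model: user and restaurant features only.
p_linear = predict(X, fit_logistic(X, y))

# Second-order model: append the user x restaurant cross as a third feature.
X_cross = np.column_stack([X, X[:, 0] * X[:, 1]])
p_cross = predict(X_cross, fit_logistic(X_cross, y))
```

Under these labels the first-order model stays at 0.5 for every row (the gradient is zero at w = 0), while the crossed model separates matching from non-matching pairs.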
Go Deep!: 2-layer MLP
How to model nth-Order Interactions?
● Explicit & implicit feature crosses (very sparse feature
space, expensive)
● Combinations of explicit and implicit (wide & deep)!
Deep & Cross Network
Are multiplicative crosses enough? FMs → MLPs
Recent studies [1, 2] found that DNNs are inefficient to
even approximately model 2nd or 3rd-order feature
crosses.
● What is advantage of DCN?
○ Efficient Explicit Feature Crosses
1: Latent cross: Making use of context in recurrent recommender systems.
2: Deep & Cross Network for Ad Click Predictions
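For reference, the cross layer described in the Deep & Cross Network paper [2] computes x_{l+1} = x_0 (x_l · w_l) + b_l + x_l. A minimal NumPy sketch, with random weights and a hypothetical 4-dimensional input, looks like this:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    # (xl @ w) is one scalar per example; scaling x0 by it creates explicit
    # feature crosses while adding only 2*d parameters per layer.
    return x0 * (xl @ w)[:, None] + b + xl

rng = np.random.default_rng(0)
d = 4
x0 = rng.normal(size=(3, d))            # batch of 3 hypothetical input vectors
w1, b1 = rng.normal(size=d), np.zeros(d)
w2, b2 = rng.normal(size=d), np.zeros(d)

x1 = cross_layer(x0, x0, w1, b1)        # contains 2nd-order crosses
x2 = cross_layer(x0, x1, w2, b2)        # contains up to 3rd-order crosses
```

Each stacked layer raises the maximum interaction order by one, which is the "efficient explicit feature crosses" advantage listed above.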
Features
Sparse categorical variables → embeddings
Examples:
● User
● Item
● Context
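As a rough sketch of the "sparse categorical → embedding" step: each id indexes a row of a lookup table, and the rows are concatenated into the dense input of the ranker. Vocabulary sizes, the embedding dimension, and the `embed` helper are hypothetical; in a real model the tables are trained rather than random.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8

# Hypothetical vocabularies: one lookup table per categorical feature.
vocab = {"user": 1000, "item": 500, "context": 24}
tables = {name: rng.normal(scale=0.1, size=(size, EMBED_DIM))
          for name, size in vocab.items()}

def embed(user_id, item_id, context_id):
    """Look up one dense vector per categorical feature and concatenate."""
    return np.concatenate([
        tables["user"][user_id],
        tables["item"][item_id],
        tables["context"][context_id],
    ])

x = embed(user_id=17, item_id=42, context_id=13)   # dense input for the ranker
```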
Log Feedback
Full-feedback → Partial-feedback (logs)
● Log Feedback
● Biased Data
● Evaluation Paradox
Log Feedback
D = (x, y, r)
● x: context (user)
● y: action (item/ranking)
● r: reward (feedback click/order)
Feedback
● Explicit Feedback: like/stars
● Implicit Feedback: watch/click/order
Tradeoff: quantity/quality
Biased Data P( y | X )
Feedback
● Organic/Full-Feedback
○ Common in Academia, Rare in industry (ADS-16, MSLR-WEB30k)
● Bandit/partially-observed-Feedback
○ Click logs from industrial applications
Analogy: What is your favourite color: red or black? => red
P(y=red | X=🧐) = 1 ← is this actually true?
“Missing, not at random”
Apply this analogy to any recsys you use: netflix, spotify, amzn, grubhub
Evaluation (Thought Experiment)
● Classic train/test split to predict the test set accurately...
● Dataset of production system logs D=(x,y,r) ...
● What is the value of predicting the test set accurately?
● Is the test-set a reflection of organic user behavior? (No) Or a reflection of the logging
policy!? (Yes)
● There is a difference between a prediction and a recommendation (A recommendation is an
intervention)
● Bandit feedback is the product of a logging policy
● The logging policy is the previous-generation recommender, ie the system that produced the dataset (logs)
Goal
● Supervised Learning: Predict test-set
● Actual: Predict user behavior
Test-set != user behavior
Counterfactual Learning
● Selection Bias
● Position Bias
Randomization and stochasticity
Causal Embeddings for Recommendation. Bonner, Vasile. Recsys ‘18
Selection Bias (Randomization)
Bias from Feedback loops
Add Exploration → Stochasticity (Randomization)
● Random Exploration w/ ϵ-Greedy Bandit
● Causal Embeddings: Jointly factorize unbiased
and greedy embeddings
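A minimal sketch of ϵ-greedy exploration that also records the propensity of each chosen item, so the logs can later be reweighted. The scores, ϵ = 0.1, and the `epsilon_greedy` helper are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(scores, epsilon, rng):
    """Pick an item and return (item, propensity of picking that item)."""
    n = len(scores)
    greedy = int(np.argmax(scores))
    # With probability epsilon explore uniformly, otherwise exploit.
    if rng.random() < epsilon:
        item = int(rng.integers(n))
    else:
        item = greedy
    # Every item gets epsilon / n; the greedy item also gets the 1 - epsilon mass.
    propensity = epsilon / n + (1.0 - epsilon) * (item == greedy)
    return item, propensity

rng = np.random.default_rng(0)
scores = np.array([0.1, 0.7, 0.2])     # hypothetical ranker scores for 3 items
log = [epsilon_greedy(scores, epsilon=0.1, rng=rng) for _ in range(10_000)]
```

Because every item keeps a non-zero propensity, no action is entirely missing from the logs, which is what breaks the feedback loop.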
Position Bias (Randomization)
Bias from devices
Inverse Propensity Scoring
Compute inverse propensities 1/b_i across ranks for random bucket
Offset loss:
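One way inverse propensities might enter a loss, as a sketch only: estimate b_i from click-through rates per rank in the randomized bucket, then up-weight clicked examples by 1/b_i in a log loss. The normalization to the top rank, the choice to weight only clicks, and the toy log are all assumptions; the exact form of the offset loss is not given here.

```python
import numpy as np

def position_bias(ranks, clicks, n_ranks):
    """Estimate b_i as the click-through rate at rank i in the random
    bucket, normalized so that the top rank has b_0 = 1."""
    ctr = np.array([clicks[ranks == i].mean() for i in range(n_ranks)])
    return ctr / ctr[0]

def ips_weighted_log_loss(p_click, clicks, ranks, bias):
    """Log loss where each clicked example is up-weighted by 1 / b_rank."""
    weights = np.where(clicks == 1, 1.0 / bias[ranks], 1.0)
    losses = -(clicks * np.log(p_click) + (1 - clicks) * np.log(1 - p_click))
    return np.mean(weights * losses)

# Hypothetical random-bucket log: items were shuffled, so click-rate
# differences across ranks reflect position rather than relevance.
ranks = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
clicks = np.array([1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0])
bias = position_bias(ranks, clicks, n_ranks=3)

loss = ips_weighted_log_loss(np.full(12, 0.5), clicks, ranks, bias)
```

A click at a low-visibility rank counts for more than a click at the top, which compensates for items that were rarely examined.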
Counterfactual Evaluation
● Partial Information
● Full Information
● Partial Information w/ bandit feedback
Medical Analogy
Patient  Bypass  Stent  Drugs
1        0       -      -
2        -       1      -
3        -       -      1
4        -       0      -
5        -       -      1
6        1       -      -
7        -       -      1
8        -       -      0
9        0       -      -
10       -       1      -
11       1       -      -
Partial Information Setting
Counterfactual Thinking
Treating Heart Attacks
● Treatments: Y: [bypass, stent, drugs]
● Outcomes δ_i: 5 year survival (0/1)
Which treatment is best?
● Drugs 3/4🏅
● Stent 2/3
● Bypass 2/4
Really? 🤔
Patient Bypass Stent Drugs
1 0 1 0
2 1 1 0
3 0 0 1
4 0 0 0
5 0 1 1
6 1 0 0
7 1 0 1
8 0 1 0
9 0 1 0
10 1 1 0
11 1 1 1
Full Information Setting
Treatment Effects
Example:
● Bypass = 5/11 = .45
● Stent = 7/11 = .64🏅
● Drugs = 4/11 = .36
Bypass Stent Drugs
0 1 0
1 1 0
0 0 1
0 0 0
0 1 1
1 0 0
1 0 1
0 1 0
0 1 0
1 1 0
1 1 1
Patient  P_B  P_S  P_D
1        .3   .6   .1
2        .4   .5   .1
3        .1   .1   .8
4        .6   .3   .1
5        .2   .1   .7
6        .4   .2   .4
7        .1   .1   .8
8        .1   .8   .1
9        .3   .3   .4
10       .3   .2   .1
11       .4   .4   .2
Partial Information Setting w/ Bandit Feedback
Assignment
R̂_IPS(y) = (1/n) ∑_i [𝐈(y_i = y) / p_i] δ(x_i, y_i)
● Bypass = 1/11 (0/.3 + 1/.4 + 0/.3 + 1/.4) = .45
● Stent = 1/11 (1/.5 + 0/.3 + 1/.2) = .64🏅
● Drugs = 1/11 (1/.8 + 1/.7 + 1/.8 + 0/.1) = .36
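The three sums above can be reproduced in a few lines. The per-patient assignments below are read off from the terms of those sums matched against the propensity table, so treat them as a reconstruction rather than data taken directly from the slide.

```python
import numpy as np

treatments = ["bypass", "stent", "drugs"]
# One entry per patient (1..11): assigned treatment, observed 5-year
# survival, and the propensity with which that treatment was assigned.
assigned = np.array(["bypass", "stent", "drugs", "stent", "drugs", "bypass",
                     "drugs", "drugs", "bypass", "stent", "bypass"])
survived = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1], dtype=float)
propensity = np.array([.3, .5, .8, .3, .7, .4, .8, .1, .3, .2, .4])

def naive_estimate(y):
    # Average outcome among only the patients who received treatment y.
    mask = assigned == y
    return survived[mask].mean()

def ips_estimate(y):
    # Inverse-propensity-weighted outcomes, averaged over ALL n patients.
    mask = assigned == y
    return np.sum(survived[mask] / propensity[mask]) / len(assigned)

naive = {y: naive_estimate(y) for y in treatments}
ips = {y: ips_estimate(y) for y in treatments}
```

The naive averages favor drugs (3/4), while the IPS estimates recover the full-information ordering with stent on top.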
Off-policy Evaluation, eg 🎉 Offline AB Testing🎉
1. Policy: Deterministic y = f(x) → Stochastic p ~ π(x)
2. Log propensities: D=(x,y,r,p)
3. Build IPS Estimator
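The three steps can be sketched as a small estimator over logged (x, y, r, p) tuples. The log entries, the two candidate policies, and the `ips_value` helper are hypothetical; the logging policy here is assumed to be uniform over two items.

```python
import numpy as np

def ips_value(logs, target_policy):
    """Estimate the average reward target_policy would have earned.

    logs: iterable of (x, y, r, p) tuples written by the logging policy.
    target_policy(x): dict mapping each action to its probability under
    the new policy (missing actions have probability 0).
    """
    weighted = [target_policy(x).get(y, 0.0) / p * r for x, y, r, p in logs]
    return float(np.mean(weighted))

# Hypothetical logs from a uniform-random logging policy over two items.
logs = [
    ("u1", "pizza", 1, 0.5),
    ("u1", "sushi", 0, 0.5),
    ("u2", "pizza", 0, 0.5),
    ("u2", "sushi", 1, 0.5),
]

def always_pizza(x):
    return {"pizza": 1.0}

def personalized(x):
    return {"pizza": 1.0} if x == "u1" else {"sushi": 1.0}

value_popular = ips_value(logs, always_pizza)
value_personalized = ips_value(logs, personalized)
```

Both candidate policies are compared on the same logs without serving either one, which is the sense in which this acts as an offline A/B test.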
Counterfactual Evaluation
Experiments & Results
Personalized Policy
● Evaluation
● Interaction Modeling
● Multi-relevance Feedback
● Congregated Search
● Market Generalization
● GPU Workflows
Evaluation
Offline Metrics
● NDCG
● MRR
● AUC
● Gini-lorenz
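For reference, a minimal sketch of two of the offline metrics, MRR and NDCG, computed on binary relevance labels listed in ranked order. The gain and discount choices (2^rel − 1 and log2) are one common convention and are an assumption here.

```python
import numpy as np

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item in one ranked list (0 if none)."""
    hits = np.flatnonzero(relevance)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

def mrr(ranked_lists):
    """Mean reciprocal rank over several ranked lists."""
    return float(np.mean([reciprocal_rank(r) for r in ranked_lists]))

def dcg(relevance):
    relevance = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, relevance.size + 2))
    return np.sum((2.0 ** relevance - 1.0) / discounts)

def ndcg(relevance):
    """DCG of the list divided by the DCG of its ideal ordering."""
    ideal = dcg(np.sort(relevance)[::-1])
    return dcg(relevance) / ideal if ideal > 0 else 0.0

ranked = np.array([0, 1, 0, 1])   # hypothetical labels, best-ranked item first
perfect = np.array([1, 1, 0, 0])
```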
Hyperparameter Tuning
● Vertical Scaling across GPUs (p3) w/
● Exhaustive Search over 66 combinations; w/o concurrency would take ~200h.
Online Metrics
● Conversion Rate (Conv)
● Orders/Visitor (OPV)
● Revenue
● Diversity
● Fairness
Offline Evaluation
Metric  Corr w/ Conv
MRR     .55 📈
AUC     .51
NDCG    .48
loss    .02
Surrogate Metric: Can we get directional estimates of online metrics, offline?
Can we design a metric that tracks conversion rate?
On-Policy Evaluation
● Most-popular policy vs Personalization Policy
● Personalization policy +20% & improves over time
● Diversity: ~5 cuisines/slate, 60% unique
On-Policy Evaluation: Fairness
Inequality (Lorenz Curve + Gini Index)
Quantifies inequality (ie impressions across merchants
or wealth across populations)
The EE variant is more equitable than the MP variant.
Gini
Baseline .60
MP .59
EE .49 🏅
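A short sketch of how a Gini index over merchant impressions can be computed, using the standard sorted-rank formula. The impression counts are made up for illustration.

```python
import numpy as np

def gini(values):
    """Gini index of non-negative values: 0 is perfectly equal,
    (n - 1) / n means a single entry holds everything."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1.0) / n

equal = gini([10, 10, 10, 10])    # impressions spread evenly across merchants
skewed = gini([0, 0, 0, 40])      # one merchant receives every impression
```

Lower is more equitable, which is how the Gini values in the table above should be read.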
Experiment: Causal Embeddings
Hypothesis: If we use the uniform data in a
principled manner we can increase performance
by overcoming selection bias.
Experiment:
● Random
● Biased
● Random ∪ Biased
● CausE
Results: Principled use of uniform was beneficial
Causal Embeddings for Recommendation. Bonner, Vasile. Recsys ‘18
                     AUC
Random (small data)  .56
Biased               .72
Random ∪ Biased      .73
Causal               .74
Experiment: Interaction Modeling
Hypothesis: MLPs are universal function approximators?
Experiment: Evaluate MLP against feature crosses
Results: MLP does not capture full interactions
        NDCG (unintentful)  MRR (unintentful)  AUC (unintentful)
Random  .511                .216               .500
UMP     .615                .582               .653
MLP     .627                .586               .689
DCN     .657 (+4.7%)        .617 (+5.2%)       .695 (+0.8%)
Experiment: Multi-relevance Feedback
Sources of Feedback:
● Impressions
● Clicks
● Orders
Metric Disagreement
Online Eval or Off-policy Eval
                 NDCG (unintentful)  MRR (unintentful)  AUC (unintentful)
Orders & Clicks  .668                .633               .675
Clicks           .665 (0%)           .600 (-5%)         .757 (+12%)
Experiment: Global vs Local Models
(Markets)
If your recommender operates in markets
of varying sizes with distinct
cultural/taste patterns, it’s important that
your recs are high-quality in all markets.
● Operational Pain
● Market Sparsity
        NDCG (unintentful)  MRR (unintentful)  AUC (unintentful)
Local   .617                0.557              0.635
Global  .749 (+21%)         0.709 (+27.2%)     0.736 (+12%)
Experiment: GPU Data Pipelines
● IO → CPU → RAM → GPU RAM → GPU
● IO Bound
○ Sequential Data Access: Libsvm → TFRecords, GPU: 4% → 90%
○ tf.data pipelines are CPU only
○ Vmap: batch → map
○ Prefetch (to GPU memory)
                    Step/s      Train Time (h)  Batch Size
CPU (64 cores)      0.58        47              448
K80 (4992 cores)    3.2         10              127
V100* (5120 cores)  8.5 (2-3x)  3.2 (3x)        448
* Not using Tensor-Cores/FP16
Thank You
Alex Egg | @eggie5

 

GTC 2021: Counterfactual Learning to Rank in E-commerce

  • 1. Counterfactual Learning to Rank: Personalized Recommendations in Ecommerce. Alex Egg | @eggie5
  • 2. Outline ● 2 Stage IR System ● Candidate Selection ● Ranking ● Personalization (Modeling Interactions) ● Features (for recommenders) ● Log Feedback ● Biased Data (Counterfactuals & Reinforcement) ● Training ● Tuning ● Deployment ● Evaluation ● Ops
  • 3.
  • 5. Two-Stage Information Retrieval System 2 Stages: ● Candidates ● Rankings
  • 6. Candidate Selection (Recall) Motivation: We can’t rank the whole catalog within the SLA. Fast/high-recall set << catalog ● Metadata-based filters: e.g. select items in the user’s favorite cuisines or genres ● Item co-occurrences: e.g. clusters that belong to your past items ● k-nearest neighbors: e.g. find similar items in $\mathbb{R}^n$ space (see ANN later)
  • 7. Ranking (Precision) Rank Candidates w/ high precision using Supervised Learning ● Classification ○ Binomial: P( click | u, i ) ○ Multinomial: P( I | u ) → Autoencoders ● Ranking ○ Pointwise, pairwise, listwise * Choice of approach is a product of your supervision labels Binary feedback or relevance labels
  • 8. Supervised learning task w/ sparse categorical variables: $f(X) = y$, $D = (X, y)$, $X = (U, R)$, $U, R \in \mathbb{R}^1$, $y \in \{0, 1\}$
    Linear model: $P(y \mid X) = \sigma(Xw) = \sigma(u_1 w_1 + r_1 w_2)$, with U(is_french) ∈ {1, -1} and R(is_french) ∈ {1, -1}
      X=[1, 1] ← french lover + french rest
      X=[1, -1] ← french lover + non-french rest
      X=[-1, 1] ← french hater + french rest
      X=[-1,-1] ← french hater + non-french rest
    Personalization (Modeling Interactions). Feature crosses (2nd-order): $\sigma(\phi(X)w) = \sigma(u_1 w_1 + r_1 w_2 + u_1 r_1 w_3)$
      X=[1, 1, 1] ← french lover + french rest
      X=[1, -1, -1] ← french lover + non-french rest
      X=[-1, 1, -1] ← french hater + french rest
      X=[-1,-1, 1] ← french hater + non-french rest
    Go Deep!: 2-layer MLP. How to model nth-order interactions?
    ● Explicit & implicit feature crosses (very sparse feature space, expensive)
    ● Combinations of explicit and implicit (wide & deep)!
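A minimal sketch of why the cross term matters, using the four user/restaurant combinations from this slide. The weight values are hypothetical and chosen only for illustration. Without the cross term (and with no bias, as on the slide), the logits for [1, 1] and [-1, -1] are negatives of each other, so a purely linear model cannot score both matching pairs above 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Rows are (u1, r1): is_french for the user and for the restaurant.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

# phi(X): append the 2nd-order cross u1 * r1 as a third feature.
phi_X = np.column_stack([X, X[:, 0] * X[:, 1]])

# Hypothetical weights (w1, w2, w3): only the cross term carries signal.
w = np.array([0.0, 0.0, 2.0])
scores = sigmoid(phi_X @ w)

# Linear-only logits for the two matching pairs always sum to zero.
w_linear = np.array([0.7, -0.3])  # arbitrary illustrative values
matching_logits = X[[0, 3]] @ w_linear
```

With these weights both matching pairs (lover + french, hater + non-french) score above 0.5 and both mismatched pairs score below it.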
  • 9. Deep & Cross Network. Are multiplicative crosses enough? FMs → MLPs. Recent studies [1, 2] found that DNNs are inefficient at even approximately modeling 2nd- or 3rd-order feature crosses. ● What is the advantage of DCN? ○ Efficient explicit feature crosses. [1] Latent Cross: Making Use of Context in Recurrent Recommender Systems. [2] Deep & Cross Network for Ad Click Predictions.
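For reference, a sketch of a single cross layer as defined in the Deep & Cross Network paper cited on this slide, $x_{l+1} = x_0 (x_l^\top w_l) + b_l + x_l$. The input and weight values are hypothetical, and this is not necessarily the exact variant used in the talk.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    return x0 * (xl @ w) + b + xl

# Hypothetical 2-d input (u, r) and layer parameters.
x0 = np.array([2.0, 3.0])
w = np.array([1.0, 0.0])
b = np.zeros(2)

x1 = cross_layer(x0, x0, w, b)
# x1[1] = r * u + r, so the explicit u * r cross appears after one layer.
```

Each layer adds only a weight vector and a bias of the input dimension, which is where the "efficient explicit feature crosses" claim comes from.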
  • 10. Features Sparse categorical variables → embeddings Examples: ● User ● Item ● Context
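As a rough illustration of "sparse categorical variables → embeddings": each ID indexes a row of a dense table, and the rows are concatenated into the model input. Table sizes, the embedding dimension, and the `featurize` helper are all hypothetical; in practice the tables are learned jointly with the ranker rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 4  # hypothetical embedding size

# One dense table per categorical variable (random here, learned in practice).
user_table = rng.normal(size=(1000, EMBED_DIM))
item_table = rng.normal(size=(500, EMBED_DIM))
context_table = rng.normal(size=(24, EMBED_DIM))  # e.g. hour of day

def featurize(user_id, item_id, hour):
    """Look up each ID and concatenate the embeddings into one dense vector."""
    return np.concatenate(
        [user_table[user_id], item_table[item_id], context_table[hour]]
    )

x = featurize(user_id=42, item_id=7, hour=18)
```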
  • 11. Log Feedback Full-feedback → Partial-feedback (logs) ● Log Feedback ● Biased Data ● Evaluation Paradox
  • 12. Log Feedback D = (x, y, r) ● x: context (user) ● y: action (item/ranking) ● r: reward (feedback click/order) Feedback ● Explicit Feedback: like/stars ● Implicit Feedback: watch/click/order Tradeoff: quantity/quality
  • 13. Biased Data P( y | X ) Feedback ● Organic/Full-Feedback ○ Common in Academia, Rare in industry (ADS-16, MSLR-WEB30k) ● Bandit/partially-observed-Feedback ○ Click logs from industrial applications Analogy: What is your favourite color: red or black? => red P(y=red | X=🧐) = 1 ← is this actually true? “Missing, not at random” Apply this analogy to any recsys you use: netflix, spotify, amzn, grubhub
  • 14. Evaluation (Thought Experiment) ● Classic train/test split to predict the test set accurately... ● Dataset of production system logs D=(x,y,r) ... ● What is the value of predicting the test set accurately? ● Is the test set a reflection of organic user behavior? (No) Or a reflection of the logging policy? (Yes) ● There is a difference between a prediction and a recommendation (a recommendation is an intervention) ● Bandit feedback is the product of a logging policy ● The logging policy is the previous-generation recommender, i.e. the one that produced the dataset (logs)
    Goal of supervised learning: predict the test set. Actual goal: predict user behavior. Test set != user behavior.
  • 15. Counterfactual Learning ● Selection Bias ● Position Bias Randomization and stochasticity
  • 16. Selection Bias (Randomization). Bias from feedback loops. Add exploration → stochasticity (randomization) ● Random exploration w/ ϵ-greedy bandit ● Causal embeddings: jointly factorize unbiased and greedy embeddings (Causal Embeddings for Recommendation. Bonner, Vasile. RecSys ’18)
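A minimal sketch of ϵ-greedy exploration that also returns the propensity of the chosen action, so it can be logged for later counterfactual use. The function names, the scores, and ϵ = 0.3 are illustrative assumptions, not values from the talk.

```python
import numpy as np

def epsilon_greedy_probs(scores, epsilon):
    """Action distribution: epsilon spread uniformly, the rest on the argmax."""
    scores = np.asarray(scores, dtype=float)
    probs = np.full(scores.size, epsilon / scores.size)
    probs[int(np.argmax(scores))] += 1.0 - epsilon
    return probs

def sample_action(scores, epsilon, rng):
    """Sample an action and return it together with its propensity."""
    probs = epsilon_greedy_probs(scores, epsilon)
    action = int(rng.choice(probs.size, p=probs))
    return action, float(probs[action])

rng = np.random.default_rng(0)
probs = epsilon_greedy_probs([0.1, 0.9, 0.3], epsilon=0.3)
action, propensity = sample_action([0.1, 0.9, 0.3], epsilon=0.3, rng=rng)
```

Because every action has probability of at least ϵ/n, every logged propensity is nonzero, which inverse propensity scoring requires.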
  • 17. Position Bias (Randomization). Bias from devices. Inverse Propensity Scoring: compute inverse propensities $1/b_i$ across ranks for the random bucket. Offset loss:
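One common way to estimate position propensities from a randomized bucket is to normalize the click-through rate at each rank by the rate at rank 1; this sketch assumes that approach, which the slide does not spell out. The click and impression counts are made up, and how the weights enter the loss is not recoverable from the slide.

```python
import numpy as np

def position_propensities(clicks_per_rank, impressions_per_rank):
    """b_i: CTR at rank i relative to rank 1, measured on randomized traffic."""
    ctr = np.asarray(clicks_per_rank, float) / np.asarray(impressions_per_rank, float)
    return ctr / ctr[0]

# Hypothetical counts from the random bucket, ranks 1..3.
b = position_propensities([100, 50, 25], [1000, 1000, 1000])
inverse_propensities = 1.0 / b  # per-rank example weights
```

A click observed at rank 3 is then weighted four times as heavily as a click at rank 1, compensating for how rarely lower ranks are examined.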
  • 18. Counterfactual Evaluation ● Partial Information ● Full Information ● Partial Information w/ bandit feedback Medical Analogy
  • 19. Partial Information Setting. Counterfactual Thinking: Treating Heart Attacks ● Treatments: Y: [bypass, stent, drugs] ● Outcomes $\delta_i$: 5-year survival (0/1)
    Observed outcome under the treatment each patient received (- = not observed):
    Patient  Bypass  Stent  Drugs
    1        0       -      -
    2        -       1      -
    3        -       -      1
    4        -       0      -
    5        -       -      1
    6        1       -      -
    7        -       -      1
    8        -       -      0
    9        0       -      -
    10       -       1      -
    11       1       -      -
    Which treatment is best? ● Drugs 3/4 🏅 ● Stent 2/3 ● Bypass 2/4. Really? 🤔
  • 20. Full Information Setting. Treatment Effects
    Patient  Bypass  Stent  Drugs
    1        0       1      0
    2        1       1      0
    3        0       0      1
    4        0       0      0
    5        0       1      1
    6        1       0      0
    7        1       0      1
    8        0       1      0
    9        0       1      0
    10       1       1      0
    11       1       1      1
    Example: ● Bypass = 5/11 = .45 ● Stent = 7/11 = .63 🏅 ● Drugs = 4/11 = .36
  • 21. Partial Information Setting w/ Bandit Feedback. Outcomes per treatment, with assignment propensities P_B, P_S, P_D:
    Patient  Bypass  Stent  Drugs  P_B  P_S  P_D
    1        0       1      0      .3   .6   .1
    2        1       1      0      .4   .5   .1
    3        0       0      1      .1   .1   .8
    4        0       0      0      .6   .3   .1
    5        0       1      1      .2   .1   .7
    6        1       0      0      .4   .2   .4
    7        1       0      1      .1   .1   .8
    8        0       1      0      .1   .8   .1
    9        0       1      0      .3   .3   .4
    10       1       1      0      .3   .2   .1
    11       1       1      1      .4   .4   .2
    $\hat{R}_{IPS}(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{I}(y_i = y)}{p_i}\, \delta(x_i, y_i)$
    ● Bypass = 1/11 (0/.3 + 1/.4 + 0/.3 + 1/.4) = .45
    ● Stent = 1/11 (1/.5 + 0/.3 + 1/.2) = .63 🏅
    ● Drugs = 1/11 (1/.8 + 1/.7 + 1/.8 + 0/.1) = .36
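The IPS sums on this slide can be reproduced in a few lines. The `assigned` vector below is an assumption: it is inferred by matching the propensities and outcomes that appear in each sum against the tables, since the deck's transcript does not list the assignments explicitly.

```python
import numpy as np

TREATMENTS = ["bypass", "stent", "drugs"]

# Treatment received by patients 1..11 (inferred, see above): index into TREATMENTS.
assigned = np.array([0, 1, 2, 1, 2, 0, 2, 2, 0, 1, 0])
# Observed 5-year survival under the assigned treatment.
outcome = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1], dtype=float)
# Assignment propensities (P_B, P_S, P_D) per patient, as given on the slide.
propensity = np.array([
    [.3, .6, .1], [.4, .5, .1], [.1, .1, .8], [.6, .3, .1],
    [.2, .1, .7], [.4, .2, .4], [.1, .1, .8], [.1, .8, .1],
    [.3, .3, .4], [.3, .2, .1], [.4, .4, .2],
])

n = len(assigned)
p_logged = propensity[np.arange(n), assigned]  # p_i of the treatment actually given

# Naive per-treatment averages vs. the IPS estimate (1/n) * sum I(y_i = y) / p_i * delta_i.
naive = np.array([outcome[assigned == y].mean() for y in range(3)])
ips = np.array([np.mean((assigned == y) / p_logged * outcome) for y in range(3)])

for name, nv, iv in zip(TREATMENTS, naive, ips):
    print(f"{name}: naive={nv:.2f} ips={iv:.3f}")
```

The naive averages (.50, .67, .75) rank drugs first, while the IPS estimates (about .455, .636, .357) match the full-information values 5/11 and 7/11 for bypass and stent, come close to 4/11 for drugs, and rank stent first.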
  • 22. Counterfactual Evaluation: Off-policy Evaluation, e.g. 🎉 Offline A/B Testing 🎉 1. Policy: Deterministic y = f(x) → Stochastic p ~ π(x) 2. Log propensities: D=(x,y,r,p) 3. Build IPS estimator
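Step 3 can be sketched as a standard IPS value estimator: each logged reward is reweighted by the ratio of the new policy's probability for the logged action to the logged propensity. The function name and the toy log are hypothetical.

```python
import numpy as np

def ips_value(rewards, logged_propensities, target_probs):
    """Estimate a new policy's average reward from logs D = (x, y, r, p).

    target_probs[i] is the probability the new policy assigns to the
    logged action y_i in context x_i.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, float) / np.asarray(logged_propensities, float)
    return float(np.mean(weights * rewards))

# Toy log: rewards r and logged propensities p for four impressions.
r = [1, 0, 1, 0]
p = [0.5, 0.5, 0.25, 0.25]

same_policy = ips_value(r, p, target_probs=p)                     # equals the mean logged reward
new_policy = ips_value(r, p, target_probs=[0.25, 0.75, 0.5, 0.5])
```

The estimate is unbiased only if the logging policy gave nonzero probability to every action the new policy might take, which is why the deterministic policy has to be made stochastic first.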
  • 23. Experiments & Results Personalized Policy ● Evaluation ● Interaction Modeling ● Multi-relevance Feedback ● Congregated Search ● Market Generalization ● GPU Workflows
  • 24. Evaluation
    Offline Metrics ● NDCG ● MRR ● AUC ● Gini-Lorenz
    Hyperparameter Tuning ● Vertical scaling across GPUs (p3) w/ ● Exhaustive search over 66 combinations; w/o concurrency it would take ~200h.
    Online Metrics ● Conversion Rate (Conv) ● Orders/Visitor (OPV) ● Revenue ● Diversity ● Fairness
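For readers unfamiliar with the two ranking metrics, a compact sketch using their textbook definitions (exponential gain and log2 discount for NDCG). The relevance lists are made-up examples, and the talk's exact metric variants are not specified.

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr(ranked_lists):
    """Mean reciprocal rank of the first relevant item in each list."""
    reciprocal_ranks = []
    for rel in ranked_lists:
        hits = np.flatnonzero(np.asarray(rel) > 0)
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))

example_ndcg = ndcg([0, 1, 0, 1])
example_mrr = mrr([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
```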
  • 25. Offline Evaluation. Surrogate metric: Can we get directional estimates of online metrics offline? Can we design a metric that tracks conversion rate?
    Metric  Corr w/ Conv
    MRR     .55 📈
    AUC     .51
    NDCG    .48
    loss    .02
  • 26. On-Policy Evaluation ● Most-popular policy vs Personalization Policy ● Personalization policy +20% & improves over time ● Diversity: ~5 cuisines/slate, 60% unique
  • 27. On-Policy Evaluation: Fairness. Inequality (Lorenz Curve + Gini Index): quantifies inequality (e.g. impressions across merchants or wealth across populations). The EE variant is more equitable than the MP variant.
    Variant   Gini
    Baseline  .60
    MP        .59
    EE        .49 🏅
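A sketch of the Gini index computed over per-merchant impression counts, using the standard sorted-rank formula. The impression vectors are invented for illustration; lower values mean impressions are spread more evenly.

```python
import numpy as np

def gini(values):
    """Gini index of non-negative values: 0 = perfectly equal, (n-1)/n = maximally unequal."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1.0) / n)

equal_exposure = gini([5, 5, 5, 5])       # every merchant gets the same impressions
winner_takes_all = gini([0, 0, 0, 10])    # one merchant gets everything
```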
  • 28. Experiment: Causal Embeddings. Hypothesis: If we use the uniform data in a principled manner, we can increase performance by overcoming selection bias. Experiment: ● Random ● Biased ● Random ∪ Biased ● CausE. Results: Principled use of uniform data was beneficial. (Causal Embeddings for Recommendation. Bonner, Vasile. RecSys ’18)
    Variant              AUC
    Random (small data)  .56
    Biased               .72
    Random ∪ Biased      .73
    Causal               .74
  • 29. Experiment: Interaction Modeling. Hypothesis: MLPs are universal function approximators? Experiment: Evaluate MLP against feature crosses. Results: MLP does not capture full interactions.
    Model   NDCG (unintentful)  MRR (unintentful)  AUC (unintentful)
    Random  .511                .216               .500
    UMP     .615                .582               .653
    MLP     .627                .586               .689
    DCN     .657 (+4.7%)        .617 (+5.2%)       .695 (+0.8%)
  • 30. Experiment: Multi-relevance Feedback. Sources of feedback: ● Impressions ● Clicks ● Orders. Metric disagreement: Online Eval or Off-policy Eval
    Feedback         NDCG (unintentful)  MRR (unintentful)  AUC (unintentful)
    Orders & Clicks  .668                .633               .675
    Clicks           .665 (0%)           .600 (-5%)         .757 (+12%)
  • 31. Experiment: Global vs Local Models (Markets). If your recommender operates in markets of varying sizes with distinct cultural/taste patterns, it’s important that your recs are high-quality in all markets. ● Operational Pain ● Market Sparsity
    Model   NDCG (unintentful)  MRR (unintentful)  AUC (unintentful)
    Local   .617                .557               .635
    Global  .749 (+21%)         .709 (+27.2%)      .736 (+12%)
  • 32. Experiment: GPU Data Pipelines ● IO → CPU → RAM → GPU RAM → GPU ● IO Bound ○ Sequential data access: Libsvm → TFRecords, GPU: 4% → 90% ○ tf.data pipelines are CPU only ○ Vmap: batch → map ○ Prefetch (to GPU memory)
    Device              Step/s      Train Time (h)  Batch Size
    CPU (64 cores)      0.58        47              448
    K80 (4992 cores)    3.2         10              127
    V100* (5120 cores)  8.5 (2-3x)  3.2 (3x)        448
    * Not using Tensor Cores/FP16
  • 33.
  • 34. Thank You Alex Egg | @eggie5