Challenges in Evaluating
Exploration Effectiveness in
Recommender Systems
REVEAL workshop, RecSys 2020
Inbar Naor Algorithms Team Lead, Taboola
1
Recommendations in Taboola
● Content + Ads recommendation
● Multi-stakeholders: publishers, advertisers, users
● My team: new advertisers success
2
Exploration in Recommender Systems is Challenging
● Huge state-action space
● The real world is not stationary:
○ New items are added all the time
○ State space is always changing
○ Users and market dynamics adapt to changes
● Not sufficient to bound regret - we might want to bound the probability of
catastrophic failures
3
[Figure: illustration from Ou 2018]
4
Defining Regret is Hard
● We don't know the reward of actions we did not take
● We can't assume stationarity in reward or state dynamics
● Different costs in multi-stakeholder systems:
○ Example: ad display:
■ Opportunity cost - how much the system missed by showing this ad instead of something else
■ Billing cost - how much the advertiser lost by having their ad shown in this place
● Hard to account for long-term effects (e.g. churn)
5
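The two cost notions above can be made concrete. A minimal sketch with hypothetical reward and pricing values (not Taboola's actual accounting):

```python
def opportunity_cost(shown_reward, best_alternative_reward):
    """System's view: value missed by showing this ad instead of the
    best-known alternative for the slot."""
    return max(0.0, best_alternative_reward - shown_reward)

def billing_cost(advertiser_spend, conversions, target_cpa):
    """Advertiser's view: spend beyond what the observed conversions were
    worth at the advertiser's (hypothetical) target cost-per-action."""
    return max(0.0, advertiser_spend - conversions * target_cpa)

# An exploratory ad earned 0.25 expected clicks where the exploit choice
# would have earned 0.5; the advertiser paid $10 for 1 conversion at a
# target CPA of $8. (All numbers are made up.)
print(opportunity_cost(0.25, 0.5))  # 0.25
print(billing_cost(10.0, 1, 8.0))   # 2.0
```

Note that the two costs need not agree: an impression can be cheap for the system and expensive for the advertiser, which is exactly the multi-stakeholder tension.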
Exploration - Different Points of View
● High level system impact (main metric)
● Success of the target population (e.g. new items, new advertisers)
● Fairness for new items
6
➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
7
High level system impact
What is the overall effect of exploration on our system?
Use main metric(s) - e.g. user engagement
● Compare traffic with / without exploration (fixing model training)
○ Does not account for the exploratory data made available in training
[Diagram: traffic split - A: Exploitation 90%, B: Exploitation 5%, C: Exploration 5%]
● Compare a model trained on A + B to a model trained on A + C
[Chen 2019]
8
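The A + B vs A + C comparison can be sketched with simulated logs. Everything here is hypothetical (item names, CTRs, bucket sizes); the point is only that the exploitation-only model never observes new items:

```python
import random
random.seed(0)

# Hypothetical logged impressions (item_id, clicked). Exploitation buckets
# only serve established items; the exploration bucket also serves new ones.
established, new_items = ["e1", "e2"], ["n1", "n2"]

def log_bucket(items, n_impressions, ctr=0.1):
    return [(random.choice(items), random.random() < ctr)
            for _ in range(n_impressions)]

A = log_bucket(established, 900)              # exploitation, 90% of traffic
B = log_bucket(established, 50)               # exploitation, 5%
C = log_bucket(established + new_items, 50)   # exploration, 5%

def train(data):
    """Stand-in 'model': per-item empirical CTR."""
    counts = {}
    for item, click in data:
        n, c = counts.get(item, (0, 0))
        counts[item] = (n + 1, c + click)
    return {item: c / n for item, (n, c) in counts.items()}

model_ab = train(A + B)  # no exploratory data in training
model_ac = train(A + C)  # exploratory data included
print(sorted(model_ab))  # ['e1', 'e2'] - new items are invisible to this model
print(sorted(model_ac))  # also covers explored new items (data permitting)
```

Both comparisons serve the same 95%/5% split online, so the difference between them isolates the value of the exploratory data in training rather than the value of exploratory serving.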
➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
9
Effect on Target Population
Exploration as a way to improve performance for new items / advertisers
Goal: define short-term metrics that are A/B-testable and indicative of long-term
success
Examples: Clicks, Conversions, Engagement, CTR, Cost Per Action
10
Why would you have Separate Traffic for Exploration?
Too much exploration can have bad business effects:
● It costs too much money
● It can make the system confusing or unsatisfactory to users
In many situations we want to limit our exploration traffic
11
Metrics Hierarchy
● Business KPI - long-term metrics
● Online metrics - for experimentation
● Optimization metrics - offline metrics, for training ML algorithms
12
What do we Want From a Metric?
● Indicative of long-term gain in Business KPI
● Sensitive to changes, converges fast
● Not too noisy
Also:
● Fast and cheap to compute
● Hard to game, incentivizes the right action
[Gupta et al 2019]
13
Indicative of Long Term Gain in Business KPI
● Data set of past experiments
● Correlation between the short-term metric and the business KPI
● Predictive power of the short-term metric
Causality:
● Surrogacy assumption: the long-term outcome is independent of treatment conditional on the surrogate
● Surrogate index - combining multiple surrogates [Athey 2019]
14
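A surrogate index in the spirit of [Athey 2019] can be sketched as a regression of the long-term KPI on the short-term surrogates, fit on past experiments. Synthetic data, illustrative only:

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical data from past experiments: two short-term surrogates
# (say, clicks and conversions) and the long-term KPI we actually care about.
n = 200
surrogates = rng.normal(size=(n, 2))
long_term = surrogates @ np.array([0.7, 1.3]) + rng.normal(scale=0.1, size=n)

# Surrogate index: predicted long-term outcome given the surrogates
# (a least-squares fit here; the surrogacy assumption above is what
# licenses using this prediction in place of the long-term KPI).
X = np.column_stack([np.ones(n), surrogates])
beta, *_ = np.linalg.lstsq(X, long_term, rcond=None)
surrogate_index = X @ beta

# On this synthetic data the combined index tracks the KPI far better
# than either surrogate alone.
print(np.corrcoef(surrogate_index, long_term)[0, 1])
```

In a real experiment the index would be fit on historical experiments where the long-term outcome is observed, then evaluated on the surrogates of a new, short experiment.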
Sensitivity & Noise
● Past experiments
● Degradation test
● Run an A/A test
15
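An A/A test can be sketched as repeatedly splitting the same traffic in two and counting how often the metric falsely flags a difference. A toy version with a naive z-test (hypothetical engagement samples):

```python
import random
random.seed(1)

def aa_test(samples, n_splits=300, z_crit=1.96):
    """Split the SAME traffic into two random halves many times and count
    how often a z-test flags a 'significant' difference. A healthy
    metric/test pair should fire at roughly the nominal rate (~5%)."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    flags = 0
    for _ in range(n_splits):
        shuffled = random.sample(samples, len(samples))
        half = len(shuffled) // 2
        a, b = shuffled[:half], shuffled[half:]
        se = (var(a) / len(a) + var(b) / len(b)) ** 0.5
        z = abs(sum(a) / len(a) - sum(b) / len(b)) / se
        flags += z > z_crit
    return flags / n_splits

# Hypothetical engagement samples; there is no real treatment difference,
# so the flag rate should land near 5%.
samples = [random.gauss(1.0, 0.3) for _ in range(1000)]
print(aa_test(samples))
```

A flag rate far above 5% suggests the metric (or its variance estimate) is too noisy or mis-specified to trust in a real A/B test.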
➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
16
Fairness Metrics
Exploration - a means to improve fairness between new items and existing items
Individual Fairness: similar individuals should get similar results
Group Fairness: the group with the protected attribute should get similar results
to the group without the protected attribute
17
Fairness Metrics for Classification
Group Fairness:
● Demographic Parity:
Members of both groups are predicted to belong to the positive class at the same
rate.
● Equalized Odds:
The true positive rate (TPR) and (separately) the false positive rate (FPR) are the
same across groups.
● Equality of opportunity:
The true positive rate (TPR) is the same across groups (equalized odds restricted
to the positive class).
18
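These group-fairness definitions are straightforward to compute from logged predictions and labels. A toy sketch (hypothetical data; function names are illustrative):

```python
def demographic_parity_gap(preds_a, preds_b):
    """Difference in positive-prediction rate between the two groups."""
    return abs(sum(preds_a) / len(preds_a) - sum(preds_b) / len(preds_b))

def tpr(preds, labels):
    pos = [p for p, y in zip(preds, labels) if y == 1]
    return sum(pos) / len(pos)

def fpr(preds, labels):
    neg = [p for p, y in zip(preds, labels) if y == 0]
    return sum(neg) / len(neg)

def equalized_odds_gap(preds_a, labels_a, preds_b, labels_b):
    """Max of the TPR gap and FPR gap between groups (0 = equalized odds)."""
    return max(abs(tpr(preds_a, labels_a) - tpr(preds_b, labels_b)),
               abs(fpr(preds_a, labels_a) - fpr(preds_b, labels_b)))

# Toy example: binary predictions and true labels for groups A and B.
preds_a, labels_a = [1, 1, 0, 0], [1, 0, 1, 0]
preds_b, labels_b = [1, 0, 0, 0], [1, 0, 1, 0]
print(demographic_parity_gap(preds_a, preds_b))                    # 0.25
print(equalized_odds_gap(preds_a, labels_a, preds_b, labels_b))    # 0.5
```

Note the toy example already shows the metrics can disagree: here the groups have equal TPR but different FPR, so equalized odds is violated even where equality of opportunity holds.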
Fairness Metrics - continued
● Calibration: if we look at all the examples that got a score s, they
have the same probability of belonging to the positive class,
regardless of group membership.
● Balance for Positive Class: the average score for members of
the positive class is equal, regardless of group membership.
19
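Per-group calibration can be checked directly from logged scores. A toy sketch (hypothetical scores; bucketing to one decimal is an arbitrary choice for illustration):

```python
def calibration_by_group(scores, labels, groups):
    """Empirical P(y = 1 | score bucket, group), bucketing scores to one
    decimal. Calibration asks that these rates match across groups
    within each bucket."""
    stats = {}
    for s, y, g in zip(scores, labels, groups):
        key = (round(s, 1), g)
        n, pos = stats.get(key, (0, 0))
        stats[key] = (n + 1, pos + y)
    return {k: pos / n for k, (n, pos) in stats.items()}

# Toy example: items from groups A and B that all scored 0.8, with
# different realized positive rates.
scores = [0.8, 0.8, 0.8, 0.8]
labels = [1, 1, 1, 0]
groups = ["A", "A", "B", "B"]
rates = calibration_by_group(scores, labels, groups)
print(rates[(0.8, "A")], rates[(0.8, "B")])  # 1.0 0.5 - miscalibrated across groups
```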
Fairness in RL
These metrics don't consider the effects of short-term actions on long-term rewards:
● Applying a fairness constraint may cause more harm to the group it tries to
protect at the next iteration [Liu 2018]
They can take a long time to converge:
● A fairness-constrained algorithm may take time exponential in the number of
states to converge to the optimal policy [Jabbari 2017]
20
Fairness in Ranking
● Small changes in relevance can lead to large changes in exposure
● Also the other way around: it doesn't matter if the model predicted 0.4 or 0.6
if it doesn't impact the ranking
● Work on adapting fairness constraints to ranking:
○ Parity: restricting the fraction of items from each group in different positions in the
ranking [Zehlike 2017, Yang 2017, Celis 2017]
○ Different fairness constraints [Singh & Joachims 2018]
○ Pairwise comparisons within the same query [Beutel 2019]
21
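A parity-style constraint for ranking can be sketched as a greedy re-rank that reserves slots for new items. This is a deliberately crude simplification of FA*IR-style approaches, with hypothetical item data:

```python
def rerank_with_parity(items, k, min_new=2):
    """Greedy top-k re-rank: take items by score, but guarantee at least
    `min_new` slots go to new items (a crude parity-style constraint)."""
    items = sorted(items, key=lambda it: -it["score"])
    new = [it for it in items if it["new"]]
    old = [it for it in items if not it["new"]]
    ranked = []
    for pos in range(k):
        slots_left = k - pos
        new_needed = min_new - sum(it["new"] for it in ranked)
        if new and (not old or slots_left <= new_needed):
            ranked.append(new.pop(0))   # must take a new item to meet the quota
        elif old and (not new or old[0]["score"] >= new[0]["score"]):
            ranked.append(old.pop(0))
        else:
            ranked.append(new.pop(0))
    return ranked

# Hypothetical items: established items score higher than new ones,
# so an unconstrained top-4 would exclude the new items entirely.
items = [
    {"id": "a", "score": 0.9, "new": False},
    {"id": "b", "score": 0.8, "new": False},
    {"id": "c", "score": 0.7, "new": False},
    {"id": "d", "score": 0.4, "new": True},
    {"id": "e", "score": 0.3, "new": True},
]
top = rerank_with_parity(items, k=4, min_new=2)
print([it["id"] for it in top])  # ['a', 'b', 'd', 'e']
```

The cost of the constraint is visible in the toy output: item "c" loses its slot, which is exactly the exposure/relevance trade-off the slide describes.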
Summary
● Unique challenges of exploration in recommender systems
● We should go beyond the usual reward maximization perspective
● Three different points of view for exploration:
○ Maximizing overall utility
○ Improving the chances of new items to succeed in the system
○ Fairness for new items
Questions? inbar.n (@) taboola.com
22
References
Chen, Minmin, et al. "Top-k off-policy correction for a REINFORCE recommender system." Proceedings of the Twelfth ACM International
Conference on Web Search and Data Mining. 2019.
Gupta, Somit, et al. "Top challenges from the first practical online controlled experiments summit." ACM SIGKDD Explorations Newsletter 21.1
(2019): 20-35.
Athey, Susan, et al. "Estimating treatment effects using multiple surrogates: The role of the surrogate score and the surrogate index." arXiv
preprint arXiv:1603.09326 (2016).
Liu, Lydia T., et al. "Delayed impact of fair machine learning." arXiv preprint arXiv:1803.04383 (2018).
Jabbari, Shahin, et al. "Fairness in reinforcement learning." International Conference on Machine Learning. PMLR, 2017.
Zehlike, Meike, et al. "FA*IR: A fair top-k ranking algorithm." Proceedings of the 2017 ACM Conference on Information and Knowledge
Management. 2017.
Yang, Ke, and Julia Stoyanovich. "Measuring fairness in ranked outputs." Proceedings of the 29th International Conference on Scientific and
Statistical Database Management. 2017.
Celis, L. Elisa, Damian Straszak, and Nisheeth K. Vishnoi. "Ranking with fairness constraints." arXiv preprint arXiv:1704.06840 (2017).
Singh, Ashudeep, and Thorsten Joachims. "Fairness of exposure in rankings." Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. 2018.
Beutel, Alex, et al. "Fairness in recommendation ranking through pairwise comparisons." Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. 2019.
23
