Challenges in Evaluating Exploration Effectiveness in Recommender Systems
REVEAL workshop, RecSys 2020
Inbar Naor, Algorithms Team Lead, Taboola
Recommendations in Taboola
● Content + Ads recommendation
● Multiple stakeholders: publishers, advertisers, users
● My team: new advertisers' success
Exploration in Recommender Systems is Challenging
● Huge state-action space
● The real world is not stationary:
○ New items are added all the time
○ State space is always changing
○ Users and market dynamics adapt to changes
● Bounding regret is not sufficient - we may also want to bound the probability of catastrophic failures
[Ou 2018]
Defining Regret is Hard
● Don’t know the reward of actions we did not take
● We can't assume stationarity in reward or state dynamics
● Different costs in multi-stakeholder systems:
○ Example: ad display:
■ Opportunity cost - how much the system missed by showing this ad and not something else
■ Billing cost - how much the advertiser lost by having their ad shown in this slot
● Hard to account for long-term effects (e.g., churn)
Exploration - Different Points of View
● High level system impact (main metric)
● Success of the target population (e.g., new items, new advertisers)
● Fairness for new items
➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
High level system impact
What is the overall effect of exploration on our system?
Use main metric(s), e.g. user engagement
● Compare traffic with / without exploration (keeping model training fixed)
○ Does not account for exploratory data available in training
Traffic split: A - exploitation (90%), B - exploitation (5%), C - exploration (5%)
● Compare a model trained on A + B to a model trained on A + C [Chen 2019]
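A rough sketch of the two comparisons above, assuming a hypothetical deterministic traffic split and per-user engagement logs (this is not Taboola's actual setup; bucket names, thresholds, and schema are illustrative):

```python
# Hypothetical traffic split and evaluation sketch; names and schema are illustrative.
import hashlib
import numpy as np
from scipy import stats

def assign_bucket(user_id: str) -> str:
    """Deterministic split: A = 90% exploitation, B = 5% exploitation, C = 5% exploration."""
    h = int(hashlib.md5(f"traffic-split-v1:{user_id}".encode()).hexdigest(), 16) % 100
    return "A" if h < 90 else ("B" if h < 95 else "C")

def compare_buckets(engagement_b: np.ndarray, engagement_c: np.ndarray) -> dict:
    """Compare exploitation (B) vs. exploration (C) traffic on a main metric."""
    _, p = stats.ttest_ind(engagement_b, engagement_c, equal_var=False)
    return {"delta": engagement_c.mean() - engagement_b.mean(), "p_value": p}

# The second comparison from the slide (pseudocode): train one model on A + B
# and another on A + C, then evaluate both on the same held-out traffic.
# model_ab = train(logs_a + logs_b); model_ac = train(logs_a + logs_c)
# compare(evaluate(model_ab, holdout), evaluate(model_ac, holdout))
```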
➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
Effect on Target Population
Exploration as a way to improve performance for new items / advertisers
Goal: define short-term metrics that are A/B-testable and indicative of long-term success
Examples: Clicks, Conversions, Engagement, CTR, Cost Per Action
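A minimal sketch of computing these short-term metrics over the target population only (the event schema, column names, and the 7-day "new item" cutoff below are assumptions for illustration):

```python
# Illustrative only: per-impression event log with hypothetical columns.
import pandas as pd

def new_item_metrics(events: pd.DataFrame, max_item_age_days: int = 7) -> pd.Series:
    """Short-term, A/B-testable metrics restricted to new items."""
    new_items = events[events["item_age_days"] <= max_item_age_days]
    impressions = len(new_items)
    clicks = int(new_items["clicked"].sum())
    conversions = int(new_items["converted"].sum())
    return pd.Series({
        "impressions": impressions,
        "clicks": clicks,
        "conversions": conversions,
        "ctr": clicks / max(impressions, 1),
        "cpa": float(new_items["spend"].sum()) / max(conversions, 1),  # cost per action
    })
```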
Why Would You Have Separate Traffic for Exploration?
Too much exploration can have bad business effects:
● It costs too much money
● It can make the system confusing or unsatisfactory to users
In many situations we want to limit our exploration traffic
Metrics Hierarchy
● Business KPI - long-term metrics
● Online metrics - for experimentation
● Optimization metrics - offline metrics, for training ML algorithms
What Do We Want From a Metric?
● Indicative of long-term gain in Business KPI
● Sensitive to changes, converges fast
● Not too noisy
Also:
● Fast and cheap to compute
● Hard to game, incentivizes the right action
[Gupta et al 2019]
Indicative of Long Term Gain in Business KPI
● Data set of past experiments
● Correlation between the short-term metric and the business KPI
● Predictive power of the short-term metric
Causality:
● Surrogacy assumption: the long-term outcome is independent of treatment conditional on the surrogate
● Surrogate index - combining multiple surrogates [Athey 2019]
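A rough sketch of both ideas on made-up per-experiment data (the metric names, numbers, and the two-surrogate regression are illustrative, not the full method of [Athey 2019]):

```python
import numpy as np

# One row per past experiment: treatment-vs-control deltas (made-up numbers).
ctr_delta        = np.array([0.8, -0.2, 1.5, 0.1, 0.6, -0.9])   # short-term surrogate 1
conversion_delta = np.array([0.4, -0.3, 1.0, 0.2, 0.5, -0.6])   # short-term surrogate 2
kpi_delta        = np.array([0.5, -0.1, 1.1, 0.0, 0.4, -0.7])   # long-term business KPI

# Correlation / predictive power of a single short-term metric.
print("corr(CTR delta, KPI delta) =", round(np.corrcoef(ctr_delta, kpi_delta)[0, 1], 2))

# A heavily simplified surrogate index: regress the long-term outcome on several
# short-term surrogates using past experiments, then use the fitted combination
# as one proxy metric for new experiments.
X = np.column_stack([np.ones_like(ctr_delta), ctr_delta, conversion_delta])
coef, *_ = np.linalg.lstsq(X, kpi_delta, rcond=None)
surrogate_index = X @ coef
r2 = 1 - np.var(kpi_delta - surrogate_index) / np.var(kpi_delta)
print("surrogate-index R^2 on past experiments =", round(r2, 2))
```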
Sensitivity & Noise
● Past experiments
● Degradation test
● Run an A/A test
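A minimal A/A-test sketch on synthetic data, illustrating how metric noise and the false-positive rate can be estimated (the distribution, sample size, and test choice are all assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
engagement = rng.exponential(scale=1.0, size=100_000)  # per-user metric, synthetic

# Randomly split the same traffic into two "identical" arms many times and count
# how often the metric looks significantly different.
n_trials, false_positives = 1000, 0
for _ in range(n_trials):
    arm = rng.random(engagement.size) < 0.5
    _, p = stats.ttest_ind(engagement[arm], engagement[~arm], equal_var=False)
    false_positives += p < 0.05

# With a well-behaved metric and test this should be close to the nominal 5%.
print("A/A false-positive rate:", false_positives / n_trials)
```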
➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
Fairness Metrics
Exploration - a means to improve fairness between new items and existing items
Individual Fairness: similar individuals should get similar results
Group Fairness: the group with the protected attribute should get results similar to the group without the protected attribute
Fairness Metrics for Classification
Group Fairness:
● Demographic Parity:
Members of both groups are predicted to belong to the positive class at the same
rate.
● Equalized Odds:
The true positive rate (TPR) and (separately) the false positive rate (FPR) are the
same across groups.
● Equality of Opportunity:
The true positive rate (TPR) is the same across groups.
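A hedged sketch of these group metrics for a binary classifier, computed directly from arrays (illustrative code, not any particular fairness library's API):

```python
import numpy as np

def _rate(mask: np.ndarray, values: np.ndarray) -> float:
    return float(values[mask].mean()) if mask.any() else float("nan")

def group_fairness_report(y_true: np.ndarray, y_pred: np.ndarray, group: np.ndarray) -> dict:
    """y_true, y_pred are 0/1 arrays; group is 0/1 for the protected attribute."""
    report = {}
    for g in (0, 1):
        in_g = group == g
        report[f"positive_rate_g{g}"] = _rate(in_g, y_pred)            # demographic parity
        report[f"tpr_g{g}"] = _rate(in_g & (y_true == 1), y_pred)      # equality of opportunity
        report[f"fpr_g{g}"] = _rate(in_g & (y_true == 0), y_pred)      # with TPR -> equalized odds
    for name in ("positive_rate", "tpr", "fpr"):
        report[f"{name}_gap"] = abs(report[f"{name}_g1"] - report[f"{name}_g0"])
    return report
```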
Fairness Metrics - continued
● Calibration: if we look at all the examples that got a score s, they
have the same probability of belonging to the positive class,
regardless of group membership.
● Balance for the Positive Class: the average score for members of the positive class is the same, regardless of group membership.
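A small sketch of checking calibration per group: bucket examples by score and compare the observed positive rate across groups within each bucket (the binning scheme and group encoding are assumptions):

```python
import numpy as np

def calibration_by_group(scores: np.ndarray, y_true: np.ndarray, group: np.ndarray, n_bins: int = 10) -> dict:
    """For each score bin, the empirical positive rate in each group (scores in [0, 1])."""
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    table = {}
    for b in range(n_bins):
        per_group = {}
        for g in (0, 1):
            mask = (bins == b) & (group == g)
            per_group[g] = float(y_true[mask].mean()) if mask.any() else float("nan")
        table[f"score~{(b + 0.5) / n_bins:.2f}"] = per_group
    return table
```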
Fairness in RL
These criteria don't consider the effects of short-term actions on long-term rewards:
● Applying a fairness constraint may cause more harm to the group it tries to protect at the next iteration [Liu 2018]
Fairness-constrained algorithms can take a long time to converge:
● A fairness-constrained algorithm may take time exponential in the number of states to converge to the optimal policy [Jabbari 2017]
Fairness in Ranking
● Small changes in relevance can lead to large changes in exposure (see the sketch below)
● Also the other way around: it doesn't matter if the model predicted 0.4 or 0.6 if it doesn't impact the ranking
● Work on adapting fairness constraints to ranking:
○ Parity: restricting the fraction of items from each group in different positions of the ranking [Zehlike 2017, Yang 2017, Celis 2017]
○ Different fairness constraints [Singh & Joachims 2018]
○ Pairwise comparisons within the same query [Beutel 2019]
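A minimal sketch of how ranking position drives exposure, using a 1/log2(rank + 1) position-bias weight (a common modeling assumption, in the spirit of [Singh & Joachims 2018]); the groups and rankings below are made up:

```python
import math

def exposure_by_group(ranked_groups):
    """ranked_groups[i] is the group label of the item shown at rank i (0-based)."""
    exposure = {}
    for rank, g in enumerate(ranked_groups):
        exposure[g] = exposure.get(g, 0.0) + 1.0 / math.log2(rank + 2)
    return exposure

# Two rankings that differ only in the top slot: a tiny relevance difference
# flips which group receives most of the exposure.
print(exposure_by_group(["new", "old", "old", "old", "old"]))
print(exposure_by_group(["old", "new", "old", "old", "old"]))
```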
Summary
● Unique challenges of exploration in recommender systems
● We should go beyond the usual reward maximization perspective
● Three different points of view for exploration:
○ Maximizing overall utility
○ Improving the chances of new items to succeed in the system
○ Fairness for new items
Questions? inbar.n (@) taboola.com
References
Chen, Minmin, et al. "Top-k off-policy correction for a REINFORCE recommender system." Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 2019.
Gupta, Somit, et al. "Top challenges from the first practical online controlled experiments summit." ACM SIGKDD Explorations Newsletter 21.1 (2019): 20-35.
Athey, Susan, et al. "Estimating treatment effects using multiple surrogates: The role of the surrogate score and the surrogate index." arXiv preprint arXiv:1603.09326 (2016).
Liu, Lydia T., et al. "Delayed impact of fair machine learning." arXiv preprint arXiv:1803.04383 (2018).
Jabbari, Shahin, et al. "Fairness in reinforcement learning." International Conference on Machine Learning. PMLR, 2017.
Zehlike, Meike, et al. "FA*IR: A fair top-k ranking algorithm." Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017.
Yang, Ke, and Julia Stoyanovich. "Measuring fairness in ranked outputs." Proceedings of the 29th International Conference on Scientific and Statistical Database Management. 2017.
Celis, L. Elisa, Damian Straszak, and Nisheeth K. Vishnoi. "Ranking with fairness constraints." arXiv preprint arXiv:1704.06840 (2017).
Singh, Ashudeep, and Thorsten Joachims. "Fairness of exposure in rankings." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
Beutel, Alex, et al. "Fairness in recommendation ranking through pairwise comparisons." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.
