Exploration is an important pillar of many interactive systems. In recommender systems in particular, exploration provides the ability not only to serve well-established items, but also to try new items that might have unfulfilled potential.
Measuring the effectiveness of our exploration mechanisms is a challenging task: while many theoretical results rely on regret bounds or measure regret via simulations, defining and measuring regret in real-world scenarios is not trivial. Measuring the gain from a given exploration method is even less so, as there are many objectives and constraints we may want to consider.
In this talk we will focus on the challenges of defining exploration metrics, present different approaches for doing so, discuss the key properties we may want from such metrics, and examine methods to validate them.
1. Challenges in Evaluating
Exploration Effectiveness in
Recommender Systems
REVEAL workshop, RecSys 2020
Inbar Naor Algorithms Team Lead, Taboola
2. Recommendations in Taboola
● Content + Ads recommendation
● Multi-stakeholder system: publishers, advertisers, users
● My team: new advertisers' success
3. Exploration in Recommender Systems is Challenging
● Huge state-action space
● The real world is not stationary:
○ New items are added all the time
○ The state space is always changing
○ Users and market dynamics adapt to changes
● Bounding regret is not sufficient - we might want to bound the probability of catastrophic failures
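The last point can be made concrete with a small simulation (an illustrative sketch, not from the talk: the CTRs, the epsilon, and the failure threshold below are all made-up assumptions). An epsilon-greedy bandit can have low average regret while individual runs still land in the tail that a catastrophic-failure bound would care about:

```python
import random

def run_epsilon_greedy(true_ctrs, epsilon, steps, rng):
    """One simulated run; returns total regret vs. always playing the best arm."""
    n = len(true_ctrs)
    best = max(true_ctrs)
    counts = [0] * n
    values = [0.0] * n
    regret = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                           # explore
        else:
            arm = max(range(n), key=lambda i: values[i])     # exploit
        reward = 1.0 if rng.random() < true_ctrs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        regret += best - true_ctrs[arm]                      # expected regret of this pull
    return regret

rng = random.Random(0)
true_ctrs = [0.05, 0.04, 0.01]   # hypothetical item CTRs
regrets = [run_epsilon_greedy(true_ctrs, 0.1, 2000, rng) for _ in range(200)]
mean_regret = sum(regrets) / len(regrets)
# Arbitrary illustrative threshold: a run is "catastrophic" if its regret is
# more than twice the average.
p_catastrophe = sum(r > 2 * mean_regret for r in regrets) / len(regrets)
print(mean_regret, p_catastrophe)
```

The distribution of `regrets`, not just its mean, is what a bound on catastrophic failures would constrain.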
5. Defining Regret is Hard
● We don't know the reward of actions we did not take
● We can't assume stationarity in reward or state dynamics
● Different costs in multi-stakeholder systems:
○ Example - ads display:
■ Opportunity cost - how much the system missed by showing this ad and not something else
■ Billing cost - how much the advertiser lost by having their ad shown in this place
● Hard to account for long-term effects (e.g. churn)
6. Exploration - Different Points of View
● High level system impact (main metric)
● Success of the target population (e.g. new items, new advertisers)
● Fairness for new items
7. ➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
8. High level system impact
What is the overall effect of exploration on our system?
Use main metric(s) - e.g. user engagement
● Compare traffic with / without exploration (fixing model training)
○ Does not account for exploratory data available in training
● Compare a model trained on A + B to a model trained on A + C
[Traffic-split diagram: A - exploitation, 90%; B - exploitation, 5%; C - exploration, 5%]
[Chen 2019]
9. ➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
10. Effect on Target Population
Exploration as a way to improve performance for new items / advertisers
Goal: define short-term metrics that are A/B-testable and indicative of long-term success
Examples: clicks, conversions, engagement, CTR, cost per action
11. Why Would You Have Separate Traffic for Exploration?
Too much exploration can have a bad business effect:
● It costs too much money
● It can make the system confusing or unsatisfactory to users
In many situations we want to limit our exploration traffic
12. Metrics Hierarchy
● Business KPI
● Online metrics for experimentation
● Optimization metrics - for training ML algorithms
● Offline metrics
● Long-term metrics
13. What do we Want From a Metric?
● Indicative of long-term gain in the business KPI
● Sensitive to changes, converges fast
● Not too noisy
Also:
● Fast and cheap to compute
● Hard to game, incentivizes the right action
[Gupta et al. 2019]
14. Indicative of Long-Term Gain in the Business KPI
● Build a data set of past experiments
● Measure the correlation between the short-term metric and the business KPI
● Measure the predictive power of the short-term metric
Causality:
● Surrogacy assumption: the long-term outcome is independent of treatment conditional on the surrogate
● Surrogate index - combining multiple surrogates [Athey 2016]
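A minimal sketch of the surrogate-index idea on synthetic data (the surrogates, weights, and noise level are invented for illustration; the actual estimator in [Athey 2016] is more involved): regress the long-term KPI on several short-term surrogates over past experiments, then score a new experiment from its short-term measurements alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic history: 3 short-term surrogates per experiment
# (think clicks, conversions, engagement) and an observed long-term KPI.
n_experiments = 200
surrogates = rng.normal(size=(n_experiments, 3))
true_weights = np.array([0.5, 0.3, 0.2])   # unknown in practice
long_term_kpi = surrogates @ true_weights + rng.normal(scale=0.1, size=n_experiments)

# Fit the surrogate index: one score combining all surrogates.
X = np.column_stack([surrogates, np.ones(n_experiments)])  # add an intercept
coef, *_ = np.linalg.lstsq(X, long_term_kpi, rcond=None)

# Score a new experiment using only its short-term measurements.
new_experiment = np.array([0.2, -0.1, 0.4])
surrogate_index = np.append(new_experiment, 1.0) @ coef
print(coef[:3], surrogate_index)
```

The fitted index can then be A/B-tested as a short-term proxy for the long-term KPI, which is exactly the property the surrogacy assumption is meant to justify.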
16. ➢ High level system impact
➢ Success of the target population
➢ Fairness for new items
17. Fairness Metrics
Exploration as a means to improve fairness between new items and existing items
Individual fairness: similar individuals should get similar results
Group fairness: the group with the protected attribute should get similar results to the group without the protected attribute
18. Fairness Metrics for Classification
Group Fairness:
● Demographic Parity: members of both groups are predicted to belong to the positive class at the same rate.
● Equalized Odds: the true positive rate (TPR) and (separately) the false positive rate (FPR) are the same across groups.
● Equality of Opportunity: the true positive rate is the same across groups (equalized odds restricted to the positive class).
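These definitions translate directly into gap metrics. A self-contained toy sketch (the labels, predictions, and grouping below are invented; `group` could mark, say, new vs. established items):

```python
def group_fairness_metrics(y_true, y_pred, group):
    """Compute demographic-parity and equalized-odds gaps between groups 0 and 1."""
    def rates(g):
        idx = [i for i, gi in enumerate(group) if gi == g]
        pos_rate = sum(y_pred[i] for i in idx) / len(idx)
        tp = sum(1 for i in idx if y_pred[i] and y_true[i])
        fp = sum(1 for i in idx if y_pred[i] and not y_true[i])
        p = sum(1 for i in idx if y_true[i])
        n = len(idx) - p
        tpr = tp / p if p else 0.0
        fpr = fp / n if n else 0.0
        return pos_rate, tpr, fpr

    (pr_a, tpr_a, fpr_a), (pr_b, tpr_b, fpr_b) = rates(0), rates(1)
    return {
        "demographic_parity_gap": abs(pr_a - pr_b),  # equal positive-prediction rates
        "tpr_gap": abs(tpr_a - tpr_b),               # equalized odds, part 1
        "fpr_gap": abs(fpr_a - fpr_b),               # equalized odds, part 2
    }

# Toy example: group 1 (say, new items) gets positive predictions less often.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(group_fairness_metrics(y_true, y_pred, group))
```

Demographic parity asks for a zero `demographic_parity_gap`; equalized odds asks for both the TPR and FPR gaps to be zero.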
19. Fairness Metrics - continued
● Calibration: if we look at all the examples that got a score s, they have the same probability of belonging to the positive class, regardless of group membership.
● Balance for the Positive Class: the average score for members of the positive class is equal, regardless of group membership.
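Calibration by group can be checked directly by comparing empirical positive rates at the same score across groups. A toy sketch with invented scores and labels:

```python
from collections import defaultdict

def calibration_by_group(scores, y_true, group):
    """Empirical positive rate for each (score, group) cell; calibration asks
    that rates match across groups at the same score."""
    stats = defaultdict(lambda: [0, 0])  # (score, group) -> [positives, total]
    for s, y, g in zip(scores, y_true, group):
        stats[(s, g)][0] += y
        stats[(s, g)][1] += 1
    return {k: pos / tot for k, (pos, tot) in stats.items()}

scores = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
group  = [0, 0, 1, 1, 0, 0, 1, 1]
rates = calibration_by_group(scores, y_true, group)
# A group-calibrated model would have rates[(0.8, 0)] == rates[(0.8, 1)], etc.
print(rates)
```

In this toy data the score 0.8 means a 100% positive rate for group 0 but only 50% for group 1, so the model is miscalibrated with respect to the groups.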
20. Fairness in RL
Fairness constraints don't consider the effects of short-term actions on long-term rewards:
● Applying a fairness constraint may cause more harm to the group it tries to protect at the next iteration [Liu 2018]
They can take a long time to converge:
● A fairness-constrained algorithm may take exponential time (in the number of states) to converge to the optimal policy [Jabbari 2017]
21. Fairness in Ranking
● Small changes in relevance can lead to large changes in exposure
● And the other way around: it doesn't matter if the model predicted 0.4 or 0.6 if it doesn't impact the ranking
● Work on adapting fairness constraints to ranking:
○ Parity: restricting the fraction of items from each group in different positions of the ranking [Zehlike 2017, Yang 2017, Celis 2017]
○ Different fairness constraints [Singh & Joachims 2018]
○ Pairwise comparisons within the same query [Beutel 2019]
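The first bullet can be illustrated with the common 1 / log2(1 + rank) position-bias model (the model and the numbers are assumptions for illustration, not something the talk specifies): two items with nearly identical relevance scores end up with very different exposure.

```python
import math

def exposure_by_group(scores, groups):
    """Exposure of each group when items are ranked by score, using the
    standard 1 / log2(1 + rank) position-bias model."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    exposure = {}
    for rank, i in enumerate(order, start=1):
        exposure[groups[i]] = exposure.get(groups[i], 0.0) + 1.0 / math.log2(1 + rank)
    return exposure

# Two nearly identical items: a tiny score difference decides who is ranked
# first, and the exposure gap between positions 1 and 2 is large.
groups = ["new", "old"]
print(exposure_by_group([0.51, 0.49], groups))  # new item ranked first
print(exposure_by_group([0.49, 0.51], groups))  # old item ranked first
```

Swapping a 0.51 for a 0.49 flips the ranking and moves roughly a third of the top position's exposure from one group to the other, which is exactly the relevance-to-exposure sensitivity the slide describes.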
22. Summary
● Unique challenges of exploration in recommender systems
● We should go beyond the usual reward maximization perspective
● Three different points of view for exploration:
○ Maximizing overall utility
○ Improving the chances of new items to succeed in the system
○ Fairness for new items
Questions? inbar.n (@) taboola.com
23. References
Chen, Minmin, et al. "Top-k off-policy correction for a REINFORCE recommender system." Proceedings of the Twelfth ACM International
Conference on Web Search and Data Mining. 2019.
Gupta, Somit, et al. "Top challenges from the first practical online controlled experiments summit." ACM SIGKDD Explorations Newsletter 21.1
(2019): 20-35.
Athey, Susan, et al. "Estimating treatment effects using multiple surrogates: The role of the surrogate score and the surrogate index." arXiv
preprint arXiv:1603.09326 (2016).
Liu, Lydia T., et al. "Delayed impact of fair machine learning." arXiv preprint arXiv:1803.04383 (2018).
Jabbari, Shahin, et al. "Fairness in reinforcement learning." International Conference on Machine Learning. PMLR, 2017.
Zehlike, Meike, et al. "FA*IR: A fair top-k ranking algorithm." Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2017.
Yang, Ke, and Julia Stoyanovich. "Measuring fairness in ranked outputs." Proceedings of the 29th International Conference on Scientific and
Statistical Database Management. 2017.
Celis, L. Elisa, Damian Straszak, and Nisheeth K. Vishnoi. "Ranking with fairness constraints." arXiv preprint arXiv:1704.06840 (2017).
Singh, Ashudeep, and Thorsten Joachims. "Fairness of exposure in rankings." Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. 2018.
Beutel, Alex, et al. "Fairness in recommendation ranking through pairwise comparisons." Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. 2019.