The paper is available here: https://drive.google.com/open?id=1oaM5Fu2bJ0GzMC09yyqjA7eZD9axzSKb


- 1. Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Bandits with Large Action Spaces
  Ajinkya More, Linas Baltrunas, Nikos Vlassis, Justin Basilico
  REVEAL Workshop at RecSys 2019, Copenhagen, Denmark, Sept 20, 2019
- 2. Introduction
  - Contextual bandits are a common modeling tool in modern personalization systems (e.g. artwork personalization at Netflix, playlist ordering on Spotify, ad ranking in web search)
  - Evaluating bandit algorithms via online A/B testing is expensive
  - Off-policy evaluation is a preferred alternative
  - Goal: a sensitive, efficient estimator that aligns with online metrics
- 3. Challenge
  Several metrics proposed in the literature have limitations in personalization settings:
  - Replay, IPS, SNIPS: high variance; few matches when there are many arms; require very large amounts of data
  - Direct Method, Doubly Robust: require a reward model, which is then used to evaluate other models; this introduces modeling bias (e.g. which features should the reward model use?)
- 4. Overview
  This paper:
  - Introduces a new off-policy evaluation metric: Recap
  - Trades off a small amount of bias for significantly lower variance in off-policy evaluation
  - Is model-free: no need to define a reward model
  - Studies the estimator's properties via simulated data
  - Investigates alignment with an online metric
- 5. Notation
  Let A = {1, 2, ..., K} be the set of actions. In trial t ∈ {1, 2, ..., n}, the environment chooses a context x_t and in response the agent chooses an arm a_t ∈ A. The environment then reveals a reward r_{a,t} ∈ [0, 1]. Denote the probability of the logging policy taking action a for context x as π(x, a), and the logged action at trial t as a_π. We want to estimate the expected reward E[r_a] of a deterministic target policy τ, given data from a potentially different logging policy. We assume the target policy is mediated by a scorer or ranker that assigns a real-valued score s(x, a) to each arm and selects the arm argmax_{a ∈ A} s(x, a).
- 6. IPS, SNIPS
  Inverse Propensity Scoring (IPS):
  \hat{V}^{\tau}_{IPS} = \frac{1}{n} \sum_{t=1}^{n} r_{a_\pi} \frac{I(a_\tau = a_\pi)}{\pi(x_t, a_\pi)}   (1)
  Self-Normalized Inverse Propensity Scoring (SNIPS):
  \hat{V}^{\tau}_{SNIPS} = \frac{\sum_{t=1}^{n} r_{a_\pi} \frac{I(a_\tau = a_\pi)}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{I(a_\tau = a_\pi)}{\pi(x_t, a_\pi)}}   (2)
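As a sketch of how equations (1) and (2) compute (illustrative NumPy code, not the authors' implementation; the array and function names are assumptions):

```python
import numpy as np

def ips(rewards, logged_actions, target_actions, propensities):
    # IPS: reward times the match indicator I(a_tau = a_pi),
    # importance-weighted by the logging propensity, averaged over n trials.
    match = (target_actions == logged_actions).astype(float)
    return np.mean(rewards * match / propensities)

def snips(rewards, logged_actions, target_actions, propensities):
    # SNIPS: same numerator as IPS, but normalized by the sum of the
    # importance weights rather than by n.
    match = (target_actions == logged_actions).astype(float)
    w = match / propensities
    return np.sum(rewards * w) / np.sum(w)
```

The self-normalization in SNIPS makes it invariant to a constant rescaling of the propensities, at the cost of a small bias.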
- 7. Recap
  \hat{V}^{\tau}_{Recap} = \frac{\sum_{t=1}^{n} r_{a_\pi} \frac{RR}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{RR}{\pi(x_t, a_\pi)}}   (3)
  where RR = \frac{1}{\sum_{a \in A} I(s(x, a) \ge s(x, a_\pi))} is simply the reciprocal rank of the logged action according to the scores assigned by the target policy. It assigns "partial credit" when the logged action and the target policy's action differ (which may be seen as "stochastifying" a deterministic policy).
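A minimal NumPy sketch of equation (3), assuming `scores` is an n×K matrix of target-policy scores s(x_t, a) (variable names are illustrative, not from the paper):

```python
import numpy as np

def recap(rewards, logged_actions, scores, propensities):
    # Reciprocal rank of each logged action under the target scores:
    # rank = number of arms scored at least as high as the logged arm.
    n = len(rewards)
    logged_scores = scores[np.arange(n), logged_actions]
    ranks = np.sum(scores >= logged_scores[:, None], axis=1)
    rr = 1.0 / ranks
    # Self-normalized weighted average, as in SNIPS but with RR
    # replacing the hard 0/1 match indicator.
    w = rr / propensities
    return np.sum(rewards * w) / np.sum(w)
```

Unlike IPS/SNIPS, every logged trial contributes a nonzero weight, since RR ≥ 1/K even when the logged action ranks last under the target policy.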
- 8. Recap(m)
  More generally, for m ∈ N (or even m ∈ R_+), we define
  \hat{V}^{\tau}_{Recap(m)} = \frac{\sum_{t=1}^{n} r_{a_\pi} \frac{RR^m}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{RR^m}{\pi(x_t, a_\pi)}}   (4)
  - \hat{V}^{\tau}_{Recap(1)} = \hat{V}^{\tau}_{Recap}
  - As m → ∞, \hat{V}^{\tau}_{Recap(m)} → \hat{V}^{\tau}_{SNIPS}
  Thus m allows us to smoothly trade off bias for variance, though we find m = 1 to work well in practice.
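The generalization in equation (4) just raises RR to the m-th power; a sketch along the same lines (again with assumed variable names):

```python
import numpy as np

def recap_m(rewards, logged_actions, scores, propensities, m=1.0):
    # Recap(m): RR^m interpolates between Recap (m = 1) and SNIPS
    # (m -> infinity, where only rank-1 matches retain any weight).
    n = len(rewards)
    logged_scores = scores[np.arange(n), logged_actions]
    ranks = np.sum(scores >= logged_scores[:, None], axis=1)
    w = (1.0 / ranks) ** m / propensities
    return np.sum(rewards * w) / np.sum(w)
```

For large m, the weight of any trial whose logged action is not the target policy's top-ranked arm decays geometrically, recovering the SNIPS match indicator in the limit.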
- 9. Simulations: Set-up
  1. The rewards for arm a are drawn from a Bernoulli distribution with parameter θ_a = 1/a.
  2. The arm scores s(x, a) are drawn from a normal distribution N(µ, σ), where σ = 0.1 and µ is drawn from a uniform distribution with support [0, 0.2].
  3. In each trial, the logging policy samples the action from a multinomial distribution over the arms.
  4. The target policy is argmax_{a ∈ A} s(x, a).
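The four steps above might be coded like this (K, n, the seed, and the Dirichlet-drawn logging probabilities are illustrative choices; the slide does not specify the multinomial weights):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 10, 5000                        # arms, trials (assumed sizes)

theta = 1.0 / np.arange(1, K + 1)      # 1. Bernoulli reward parameter per arm
mu = rng.uniform(0.0, 0.2, size=K)     # 2. per-arm score means

# 3. logging policy: a fixed multinomial over the arms (weights assumed)
logging_probs = rng.dirichlet(np.ones(K))
logged = rng.choice(K, size=n, p=logging_probs)
rewards = rng.binomial(1, theta[logged]).astype(float)
propensities = logging_probs[logged]

# 4. target policy: argmax of per-trial Gaussian scores N(mu, 0.1)
scores = rng.normal(mu, 0.1, size=(n, K))
target = np.argmax(scores, axis=1)
```

With `logged`, `rewards`, `propensities`, `scores`, and `target` in hand, the IPS, SNIPS, and Recap estimates can be computed and compared against the simulator's known expected reward under the target policy.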
- 10. Behavior in small action spaces [figure]
- 11. Behavior in large action spaces [figure]
- 12. Variance: Small number of arms [figure]
- 13. Variance: Medium number of arms [figure]
- 14. Variance: Large number of arms [figure]
- 15. Risk [figure]
- 16. Online–Offline correlation [figure]
- 17. Summary
  - Recap: an off-policy estimator for bandits based on MRR (mean reciprocal rank)
  - Trades off a small increase in bias (compared with IPS) for significantly reduced variance compared to both IPS and SNIPS
  - Recap uses all the available data, making it more efficient with small logged datasets or large numbers of arms
  - Doesn't require a reward model, unlike the Direct Method and Doubly Robust
  - Shows signs of alignment with online metrics
- 18. Ongoing Work
  - Analyze theoretical properties of the estimator:
    - Bias/variance/risk of the estimator
    - Closed-form expressions
    - Asymptotic properties: is the estimator consistent?
    - Deviation bounds
  - Training models to optimize Recap:
    - Connection to Learning-to-Rank approaches
- 19. Thank you! Questions?
