
Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Bandits with Large Action Spaces

Talk from the REVEAL workshop at RecSys 2019, held on 2019-09-20 in Copenhagen, Denmark. The slides were made primarily by Ajinkya More; the paper is joint work with Linas Baltrunas and Nikos Vlassis.

The paper is available here: https://drive.google.com/open?id=1oaM5Fu2bJ0GzMC09yyqjA7eZD9axzSKb


  1. Recap: Designing a more Efficient Estimator for Off-policy Evaluation in Bandits with Large Action Spaces
     Ajinkya More, Linas Baltrunas, Nikos Vlassis, Justin Basilico
     REVEAL Workshop at RecSys 2019, Copenhagen, Denmark, Sept 20, 2019
  2. Introduction
     • Contextual bandits are a common modeling tool in modern personalization systems (e.g. artwork personalization at Netflix, playlist ordering on Spotify, ad ranking in web search)
     • Evaluating bandit algorithms via online A/B testing is expensive
     • Off-policy evaluation is a preferred alternative
     • Goal: a sensitive, efficient estimator that aligns with online metrics
  3. Challenge
     Several metrics proposed in the literature have limitations in personalization settings:
     • Replay, IPS, SNIPS: high variance. With many arms there are few matches between the logged and target actions, so these estimators require very large amounts of data.
     • Direct Method, Doubly Robust: require a reward model. This model is used to evaluate other models and thus introduces modeling bias (e.g. what features should the reward model use?).
  4. Overview
     This paper:
     • Introduces a new off-policy evaluation metric: Recap
     • Trades off bias for significantly lower variance in off-policy evaluation
     • Model-free: no need to define a reward model
     • Studies its properties via simulated data
     • Investigates alignment with an online metric
  5. Notation
     Let A = {1, 2, ..., K} be the set of actions. In trial t ∈ {1, 2, ..., n}, the environment chooses a context x_t and in response the agent chooses an arm a_t ∈ A. The environment then reveals a reward r_{a,t} ∈ [0, 1]. Denote the probability of the logging policy taking action a in context x by π(x, a), and the logged action at trial t by a_π.
     We want to estimate the expected reward E[r_a] of a deterministic target policy τ given data from a potentially different logging policy. We assume the target policy is mediated by a scorer or ranker that assigns a real-valued score s(x, a) to each arm and selects the arm argmax_{a ∈ A} s(x, a).
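For concreteness, a minimal sketch (ours, not from the slides) of how the deterministic target policy acts in this notation, assuming the scores for one context are held in a numpy array:

```python
import numpy as np

def target_action(scores_t: np.ndarray) -> int:
    """Deterministic target policy tau: given scores s(x_t, a) for all
    K arms in context x_t, pick a_tau = argmax_a s(x_t, a)."""
    return int(np.argmax(scores_t))
```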
  6. IPS, SNIPS
     Inverse Propensity Scoring (IPS):
         \hat{V}^{\tau}_{IPS} = \frac{1}{n} \sum_{t=1}^{n} r_{a_\pi,t} \, \frac{I(a_\tau = a_\pi)}{\pi(x_t, a_\pi)}    (1)
     Self-Normalized Inverse Propensity Scoring (SNIPS):
         \hat{V}^{\tau}_{SNIPS} = \frac{\sum_{t=1}^{n} r_{a_\pi,t} \, \frac{I(a_\tau = a_\pi)}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{I(a_\tau = a_\pi)}{\pi(x_t, a_\pi)}}    (2)
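A minimal numpy sketch of equations (1) and (2), assuming per-trial arrays for the logged rewards, match indicators, and propensities (the array names are ours):

```python
import numpy as np

# rewards[t]    = r_{a_pi, t}, reward of the logged action at trial t
# match[t]      = I(a_tau = a_pi), 1 if the target action equals the logged one
# propensity[t] = pi(x_t, a_pi), logging probability of the logged action

def ips(rewards, match, propensity):
    w = match / propensity                  # importance weights
    return np.mean(rewards * w)             # eq. (1)

def snips(rewards, match, propensity):
    w = match / propensity
    return np.sum(rewards * w) / np.sum(w)  # eq. (2); undefined if no trial matches
```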
  7. Recap
         \hat{V}^{\tau}_{Recap} = \frac{\sum_{t=1}^{n} r_{a_\pi,t} \, \frac{RR_t}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{RR_t}{\pi(x_t, a_\pi)}}    (3)
     where RR_t = 1 / \sum_{a \in A} I(s(x_t, a) \ge s(x_t, a_\pi)) is simply the reciprocal rank of the logged action according to the scores assigned by the target policy. Recap assigns "partial credit" when the logged action and the target policy's action differ (which may be seen as "stochastifying" a deterministic policy).
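A sketch of equation (3) under the same assumed array layout, with reciprocal_rank implementing the RR_t definition above (scores[t] holds the score vector s(x_t, ·)):

```python
import numpy as np

def reciprocal_rank(scores_t, a_logged):
    # RR_t = 1 / #{a : s(x_t, a) >= s(x_t, a_pi)}; rank 1 means full credit
    rank = np.sum(scores_t >= scores_t[a_logged])
    return 1.0 / rank

def recap(rewards, scores, logged, propensity):
    rr = np.array([reciprocal_rank(s, a) for s, a in zip(scores, logged)])
    w = rr / propensity
    return np.sum(rewards * w) / np.sum(w)  # eq. (3)
```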
  8. Recap(m)
     More generally, for m ∈ N (or even m ∈ R+), we define
         \hat{V}^{\tau}_{Recap(m)} = \frac{\sum_{t=1}^{n} r_{a_\pi,t} \, \frac{RR_t^m}{\pi(x_t, a_\pi)}}{\sum_{t=1}^{n} \frac{RR_t^m}{\pi(x_t, a_\pi)}}    (4)
     • \hat{V}^{\tau}_{Recap(1)} = \hat{V}^{\tau}_{Recap}
     • As m → ∞, \hat{V}^{\tau}_{Recap(m)} → \hat{V}^{\tau}_{SNIPS}
     Thus m allows us to smoothly trade off bias for variance, though we find m = 1 to work well in practice.
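The generalization in equation (4) is a one-line change to the sketch above; the comment notes why large m suppresses partial credit:

```python
def recap_m(rewards, scores, logged, propensity, m=1.0):
    # eq. (4): RR^m in place of RR. m = 1 recovers recap(); as m grows,
    # RR^m -> 0 for any RR < 1, so only exact matches (RR = 1) keep weight
    # and the estimate approaches SNIPS.
    rr = np.array([reciprocal_rank(s, a) ** m
                   for s, a in zip(scores, logged)])
    w = rr / propensity
    return np.sum(rewards * w) / np.sum(w)
```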
  9. Simulations: Set up
     1. The rewards for arm a are drawn from a Bernoulli distribution with parameter θ_a = 1/a.
     2. The arm scores s(x, a) are drawn from a normal distribution N(µ, σ), where σ = 0.1 and µ is drawn from a uniform distribution with support [0, 0.2].
     3. In each trial, the logging policy samples the action from a multinomial distribution over the arms.
     4. The target policy is argmax_{a ∈ A} s(x, a).
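A sketch of this setup, reusing recap() from above. The slide leaves some details unspecified, so K, n, and the choice of a fixed Dirichlet-drawn multinomial logging distribution are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 10, 10_000                           # arms and trials (our choice)

theta = 1.0 / np.arange(1, K + 1)           # (1) Bernoulli parameter theta_a = 1/a
mu = rng.uniform(0.0, 0.2, size=K)          # (2) per-arm score means on [0, 0.2]
scores = rng.normal(mu, 0.1, size=(n, K))   # (2) s(x_t, a) ~ N(mu_a, 0.1)

logging_probs = rng.dirichlet(np.ones(K))   # (3) assumed fixed multinomial logging policy
logged = rng.choice(K, size=n, p=logging_probs)
propensity = logging_probs[logged]          # pi(x_t, a_pi)
rewards = rng.binomial(1, theta[logged])    # (1) reward of the logged arm

# (4) the target policy argmax_a s(x_t, a) is applied inside recap()
print("Recap estimate:", recap(rewards, scores, logged, propensity))
```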
  10. Behavior in small action spaces [figure]
  11. Behavior in large action spaces [figure]
  12. Variance: Small number of arms [figure]
  13. Variance: Medium number of arms [figure]
  14. Variance: Large number of arms [figure]
  15. Risk [figure]
  16. Online–Offline correlation [figure]
  17. Summary
     • Recap: an off-policy estimator for bandits based on MRR (mean reciprocal rank)
     • Trades a small increase in bias (compared with IPS) for significantly reduced variance compared to both IPS and SNIPS
     • Uses all of the available data, making it more efficient with small logged datasets or large numbers of arms
     • Requires no reward model, unlike the Direct Method and Doubly Robust
     • Shows signs of alignment with online metrics
  18. Ongoing Work
     • Analyze theoretical properties of the estimator: bias, variance, and risk; closed-form expressions; asymptotic properties (is the estimator consistent?); deviation bounds
     • Train models to optimize Recap
     • Explore the connection to learning-to-rank approaches
  19. Thank you! Questions?
