- 1. Learning with Exploration. Alina Beygelzimer, Yahoo Labs, New York (based on work by many).
- 2. Interactive Learning. Repeatedly: (1) a user comes to Yahoo; (2) Yahoo chooses content to present (URLs, ads, news stories); (3) the user reacts to the presented information (clicks on something). Making good content decisions requires learning from user feedback.
- 3. Abstracting the Setting. For $t = 1, \ldots, T$: (1) the world produces some context $x \in X$; (2) the learner chooses an action $a \in A$; (3) the world reacts with reward $r(a, x)$. Goal: learn a good policy for choosing actions given context.
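The three-step protocol on this slide can be sketched as a small simulation loop. This is only an illustration: the function names, the two contexts, the two actions, and the deterministic toy reward are assumptions made for the example, not part of the talk.

```python
import random

CONTEXTS = ["NY", "Chicago"]      # hypothetical context set X
ACTIONS = ["bagels", "pizza"]     # hypothetical action set A

def toy_reward(a, x):
    # Hypothetical deterministic reward r(a, x), chosen only for illustration.
    preferred = {"NY": "pizza", "Chicago": "bagels"}
    return 1.0 if preferred[x] == a else 0.0

def run_interaction(policy, reward_fn, T, seed=0):
    """Run T rounds and return the average reward collected by `policy`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(T):
        x = rng.choice(CONTEXTS)   # 1. the world produces a context x
        a = policy(x)              # 2. the learner chooses an action a
        total += reward_fn(a, x)   # 3. the world reacts with reward r(a, x)
    return total / T

def always_bagels(x):
    return "bagels"

def context_aware(x):
    return {"NY": "pizza", "Chicago": "bagels"}[x]

avg_fixed = run_interaction(always_bagels, toy_reward, T=1000)
avg_aware = run_interaction(context_aware, toy_reward, T=1000)
```

In this toy world a policy that looks at the context collects more reward than one that ignores it, which is the sense in which the goal is a good policy rather than a single good action.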
- 4. Dominant Solution. (1) Deploy some initial system. (2) Collect data using this system. (3) Use machine learning to build a reward predictor $\hat{r}(a, x)$ from the collected data. (4) Evaluate the new system, $\arg\max_a \hat{r}(a, x)$, by offline evaluation on past data or by a bucket test. (5) If metrics improve, switch to this new system and repeat.
- 7. Example: Bagels vs. Pizza for New York and Chicago users. Initial system: NY gets bagels, Chicago gets pizza. Observed CTR: New York: pizza ?, bagels 0.6. Chicago: pizza 0.4, bagels ?.
- 8. Example: Bagels vs. Pizza for New York and Chicago users. Initial system: NY gets bagels, Chicago gets pizza. Observed CTR / estimated CTR: New York: pizza ?/0.5, bagels 0.6/0.6. Chicago: pizza 0.4/0.4, bagels ?/0.5.
- 9. Example: Bagels vs. Pizza for New York and Chicago users. Initial system: NY gets bagels, Chicago gets pizza. Observed CTR / estimated CTR: New York: pizza ?/0.5, bagels 0.6/0.6. Chicago: pizza 0.4/0.4, bagels ?/0.5. Bagels win. Switch to serving bagels for all and update the model based on new data.
- 10. Example: Bagels vs. Pizza for New York and Chicago users. Initial system: NY gets bagels, Chicago gets pizza. Observed CTR / estimated CTR: New York: pizza ?/0.5, bagels 0.6/0.6. Chicago: pizza 0.4/0.4, bagels 0.7/0.5. Bagels win. Switch to serving bagels for all and update the model based on new data.
- 11. Example: Bagels vs. Pizza for New York and Chicago users. Initial system: NY gets bagels, Chicago gets pizza. Observed CTR / estimated CTR: New York: pizza ?/0.4595, bagels 0.6/0.6. Chicago: pizza 0.4/0.4, bagels 0.7/0.7. Bagels win. Switch to serving bagels for all and update the model based on new data.
- 12. Example: Bagels vs. Pizza for New York and Chicago users. Initial system: NY gets bagels, Chicago gets pizza. Observed CTR / estimated CTR / true CTR: New York: pizza ?/0.4595/1, bagels 0.6/0.6/0.6. Chicago: pizza 0.4/0.4/0.4, bagels 0.7/0.7/0.7. Yikes! Missed out big in NY!
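The final table of this example can be checked with a few lines of code. The sketch below assumes the first column of the slide table is pizza and the second is bagels (inferred from "NY gets bagels, Chicago gets pizza"); the helper names are hypothetical.

```python
# CTR values copied from the last table of the example; keys are (city, food).
estimated = {("NY", "pizza"): 0.4595, ("NY", "bagels"): 0.6,
             ("Chicago", "pizza"): 0.4, ("Chicago", "bagels"): 0.7}
true_ctr = {("NY", "pizza"): 1.0, ("NY", "bagels"): 0.6,
            ("Chicago", "pizza"): 0.4, ("Chicago", "bagels"): 0.7}
FOODS = ["pizza", "bagels"]

def greedy_choice(ctr, city):
    """Food with the highest CTR for `city` according to the table `ctr`."""
    return max(FOODS, key=lambda food: ctr[(city, food)])

def per_city_regret(city):
    """True CTR lost in `city` by acting greedily on the estimated CTRs."""
    chosen = greedy_choice(estimated, city)
    best = greedy_choice(true_ctr, city)
    return true_ctr[(city, best)] - true_ctr[(city, chosen)]

for city in ["NY", "Chicago"]:
    print(city, greedy_choice(estimated, city), per_city_regret(city))
```

Acting greedily on the estimates keeps serving bagels in New York, and because pizza is never shown there, its estimate is never corrected by new data.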
- 16. Basic Observations. (1) Standard machine learning is not enough: the model fits the collected data perfectly. (2) More data doesn't help: observed = true wherever data was collected. (3) Better data helps! Exploration is required. (4) Prediction errors are not a proxy for controlled exploration.
- 19. Attempt to fix. New policy: bagels in the morning, pizza at night, for both cities. This will overestimate the CTR for both! Solution: the deployed system should be randomized, with the action probabilities recorded.
- 20. Offline Evaluation. Evaluating a new system on data collected by the deployed system can be badly misleading. Observed / estimated / true CTR: New York: pizza ?/1/1, bagels 0.6/0.6/0.5. Chicago: pizza 0.4/0.4/0.4, bagels 0.7/0.7/0.7. The new system appears worse than the deployed system on the collected data, although its true loss may be much lower.
- 22. The Evaluation Problem. Given a new policy, how do we evaluate it? One possibility: deploy it in the world. Very expensive! This needs a bucket for every candidate policy.
- 40. A/B testing for evaluating two policies. Policy 1: use the "pizza for New York, bagels for Chicago" rule. Policy 2: use the "bagels for everyone" rule. Segment users randomly into Policy 1 and Policy 2 groups. Example sequence of users: Policy 2 (no click), Policy 1 (NY user, no click), Policy 2 (click), Policy 1 (Chicago user, no click), ... Two weeks later, evaluate which is better.
- 54. Instead, randomize every transaction (at least the transactions you plan to use for learning and/or evaluation). Simplest strategy: $\epsilon$-greedy. Go with the empirically best policy, but always choose a random action with probability $\epsilon > 0$. Logged events: no click $(x, b, 0, p_b)$; no click $(x, p, 0, p_p)$; click $(x, p, 1, p_p)$; no click $(x, b, 0, p_b)$; ... Offline evaluation: later, evaluate any policy using the same events. Each evaluation is cheap and immediate.
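A minimal sketch of how an $\epsilon$-greedy system might record the probability of each action it takes. It assumes that the "random action" is drawn uniformly from all actions, including the greedy one; the function name and the two-action setup are hypothetical.

```python
import random

def epsilon_greedy_log(x, actions, greedy_action, epsilon, rng):
    """Choose an action epsilon-greedily and return (x, a, p_a)."""
    if rng.random() < epsilon:
        a = rng.choice(actions)    # explore: uniform over all actions (assumption)
    else:
        a = greedy_action          # exploit: the empirically best action
    # Probability of having chosen `a`. The uniform draw can also land on
    # the greedy action, so that action gets (1 - epsilon) + epsilon / |A|.
    p_a = epsilon / len(actions)
    if a == greedy_action:
        p_a += 1.0 - epsilon
    return x, a, p_a

rng = random.Random(0)
actions = ["b", "p"]               # bagels, pizza
log = [epsilon_greedy_log("x", actions, "b", 0.2, rng) for _ in range(5)]
```

Once the reward `r` is observed, each entry is stored as `(x, a, r, p_a)`. Recording `p_a` at decision time matters because the greedy action, and therefore the probabilities, change as the system is updated.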
- 56. The Importance Weighting Trick. Let $\pi : X \to A$ be a policy. How do we evaluate it? Collect exploration samples of the form $(x, a, r_a, p_a)$, where $x$ = context, $a$ = action, $r_a$ = reward for action $a$, and $p_a$ = probability of action $a$; then evaluate $\mathrm{Value}(\pi) = \mathrm{Average}\left( \frac{r_a \, \mathbf{1}(\pi(x) = a)}{p_a} \right)$.
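The estimator on this slide translates almost directly into code. The sketch below is an illustration only: the function name is hypothetical, and the four samples use made-up probabilities of 0.5 for each action.

```python
def ips_value(policy, samples):
    """Average of r_a * 1(policy(x) == a) / p_a over samples (x, a, r_a, p_a)."""
    total = 0.0
    for x, a, r, p in samples:
        if policy(x) == a:         # the indicator 1(pi(x) = a)
            total += r / p         # importance-weighted reward
    return total / len(samples)

# Hypothetical exploration log: (context, action, reward, probability).
samples = [("x", "b", 0, 0.5), ("x", "p", 0, 0.5),
           ("x", "p", 1, 0.5), ("x", "b", 0, 0.5)]

def always_pizza(x):
    return "p"

def always_bagels(x):
    return "b"

print(ips_value(always_pizza, samples))
print(ips_value(always_bagels, samples))
```

The same log is reused for both policies; only the indicator changes, which is why each additional evaluation is cheap.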
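- 59. The Importance Weighting Trick. Theorem: $\mathrm{Value}(\pi)$ is an unbiased estimate of the expected reward of $\pi$: $\mathbb{E}_{(x,r) \sim D}\left[ r_{\pi(x)} \right] = \mathbb{E}\left[ \mathrm{Value}(\pi) \right]$, with deviations bounded by $O\left( \frac{1}{\sqrt{T \min_x p_{\pi(x)}}} \right)$. Example: action 1 has reward 0.5 and probability 1/4; action 2 has reward 1 and probability 3/4. The per-action estimates are (2, 0) when action 1 is sampled and (0, 4/3) when action 2 is sampled.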
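The numbers in this example can be verified exactly with rational arithmetic. The helper below is a hypothetical illustration of the importance-weighted estimate for a single sample.

```python
from fractions import Fraction

# Rewards and sampling probabilities from the slide's example.
rewards = {1: Fraction(1, 2), 2: Fraction(1)}
probs = {1: Fraction(1, 4), 2: Fraction(3, 4)}

def estimate(target, sampled):
    """Estimated reward of `target` when action `sampled` was drawn."""
    if target != sampled:
        return Fraction(0)
    return rewards[sampled] / probs[sampled]

# Expectation of each estimate over which action gets sampled.
expected = {t: sum(probs[s] * estimate(t, s) for s in probs) for t in rewards}
```

Averaged over the sampling distribution, the estimates 2 and 4/3 come back to the true rewards 1/2 and 1, which is the unbiasedness claim of the theorem in this small case.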
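- 64. Can we do better? Suppose we have a (possibly bad) reward estimator $\hat{r}(a, x)$. How can we use it? $\mathrm{Value}'(\pi) = \mathrm{Average}\left( \frac{(r_a - \hat{r}(a, x)) \, \mathbf{1}(\pi(x) = a)}{p_a} + \hat{r}(\pi(x), x) \right)$. Why does this work? $\mathbb{E}_{a \sim p}\left[ \frac{\hat{r}(a, x) \, \mathbf{1}(\pi(x) = a)}{p_a} \right] = \hat{r}(\pi(x), x)$, which keeps the estimate unbiased. It helps because a small $r_a - \hat{r}(a, x)$ reduces variance.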
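A sketch of this estimator in code, under the same hypothetical log format `(x, a, r_a, p_a)`. The reward estimators `r_hat_zero` and `r_hat_half` are invented for the example and are not from the talk.

```python
def dr_value(policy, samples, r_hat):
    """Average of (r_a - r_hat(a, x)) * 1(pi(x) == a) / p_a + r_hat(pi(x), x)."""
    total = 0.0
    for x, a, r, p in samples:
        chosen = policy(x)
        # Importance-weighted correction of the estimator's error,
        # applied only when the logged action matches the policy's action.
        correction = (r - r_hat(a, x)) / p if chosen == a else 0.0
        total += correction + r_hat(chosen, x)
    return total / len(samples)

# Hypothetical exploration log: (context, action, reward, probability).
samples = [("x", "b", 0, 0.5), ("x", "p", 0, 0.5),
           ("x", "p", 1, 0.5), ("x", "b", 0, 0.5)]

def always_pizza(x):
    return "p"

def r_hat_zero(a, x):
    return 0.0                          # with r_hat = 0 this is plain importance weighting

def r_hat_half(a, x):
    return 0.5 if a == "p" else 0.0     # a rough guess at the CTRs

v_zero = dr_value(always_pizza, samples, r_hat_zero)
v_half = dr_value(always_pizza, samples, r_hat_half)
```

On this small log both estimators give the same value for the always-pizza policy. With the second estimator the per-sample terms range over -0.5 to 1.5 rather than 0 to 2, a small instance of the variance reduction the slide refers to.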
- 65. How do you directly optimize based on past exploration data? (1) Learn $\hat{r}(a, x)$. (2) Compute, for each $x$ and each $a' \in A$: $\frac{(r_a - \hat{r}(a, x)) \, \mathbf{1}(a' = a)}{p_a} + \hat{r}(a', x)$. (3) Learn $\pi$ using a cost-sensitive multiclass classifier.
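Steps 2 and 3 can be sketched as follows. Two assumptions are made loudly here: costs are taken to be negated reward estimates, and a simple per-context lookup table stands in for a real cost-sensitive multiclass classifier. All names and the toy log are hypothetical.

```python
from collections import defaultdict

def cost_sensitive_example(x, a, r, p, r_hat, actions):
    """Step 2: one estimated reward per action a', returned as costs."""
    costs = {}
    for a_prime in actions:
        correction = (r - r_hat(a, x)) / p if a_prime == a else 0.0
        costs[a_prime] = -(correction + r_hat(a_prime, x))  # cost = -reward (assumption)
    return x, costs

def learn_tabular_policy(examples, actions):
    """Step 3 stand-in: per context, pick the action with the lowest total cost."""
    totals = defaultdict(lambda: {a: 0.0 for a in actions})
    for x, costs in examples:
        for a in actions:
            totals[x][a] += costs[a]
    table = {x: min(actions, key=lambda a: c[a]) for x, c in totals.items()}
    return lambda x: table[x]

ACTIONS = ["bagels", "pizza"]

def r_hat(a, x):
    return 0.5                     # deliberately crude reward estimator

# Hypothetical exploration log: (context, action, reward, probability).
log = [("NY", "pizza", 1, 0.5), ("NY", "bagels", 0, 0.5),
       ("Chicago", "bagels", 1, 0.5), ("Chicago", "pizza", 0, 0.5)]

examples = [cost_sensitive_example(x, a, r, p, r_hat, ACTIONS) for x, a, r, p in log]
policy = learn_tabular_policy(examples, ACTIONS)
```

A lookup table only works when contexts repeat exactly; with real features the table would be replaced by a classifier that generalizes across contexts.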
- 66. Take-home summary. Using exploration data: (1) There are techniques for using past exploration data to evaluate any policy. (2) You can reliably measure performance offline, and hence experiment much faster, shifting from guess-and-check (A/B testing) to direct optimization. Doing exploration: (1) There has been much recent progress on practical regret-optimal algorithms. (2) $\epsilon$-greedy has suboptimal regret but is a reasonable choice in practice.
- 67. Comparison of Approaches. Supervised: full feedback, regret $O\left( \sqrt{\frac{\ln(|\Pi|/\delta)}{T}} \right)$, running time $O(T)$. $\epsilon$-greedy: bandit feedback, regret $O\left( \sqrt[3]{\frac{|A| \ln(|\Pi|/\delta)}{T}} \right)$, running time $O(T)$. Optimal CB algorithms: bandit feedback, regret $O\left( \sqrt{\frac{|A| \ln(|\Pi|/\delta)}{T}} \right)$, running time $O(T^{1.5})$. References: A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire, "Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits", 2014. M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang, "Efficient Optimal Learning for Contextual Bandits", 2011. A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire, "Contextual Bandit Algorithms with Supervised Learning Guarantees", 2011.
