Alina Beygelzimer, Senior Research Scientist, Yahoo Labs at MLconf NYC


Learning through exploration: I will talk about interactive learning applied to several core problems at Yahoo. Solving these problems well requires learning from user feedback. The difficulty is that only the feedback for what is actually shown to the user is observed. The need for exploration makes these problems fundamentally different from standard supervised learning problems—if a choice is not explored, we can’t optimize for it. Through examples, I will discuss the importance of gathering the right data. I will then discuss how to reuse data collected by production systems for offline evaluation and direct optimization. Being able to reliably measure performance offline allows for much faster experimentation, shifting from guess-and-check with A/B testing to direct optimization.



  1. Learning with Exploration
     Alina Beygelzimer, Yahoo Labs, New York (based on work by many)
  2. Interactive Learning
     Repeatedly:
     1. A user comes to Yahoo
     2. Yahoo chooses content to present (urls, ads, news stories)
     3. The user reacts to the presented information (clicks on something)
     Making good content decisions requires learning from user feedback.
  3. Abstracting the Setting
     For t = 1, . . . , T:
     1. The world produces some context x ∈ X
     2. The learner chooses an action a ∈ A
     3. The world reacts with reward r(a, x)
     Goal: Learn a good policy for choosing actions given context.
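     As a minimal sketch of this protocol (not from the talk; the toy world, CTR table, and function names below are illustrative assumptions), the loop looks like this in Python, with the key constraint that only the reward of the chosen action is ever observed:

     import random

     # Hypothetical toy world: contexts are cities, actions are foods, and only
     # the reward of the action actually chosen is revealed to the learner.
     ACTIONS = ["pizza", "bagel"]
     TRUE_CTR = {("NY", "pizza"): 1.0, ("NY", "bagel"): 0.6,
                 ("Chicago", "pizza"): 0.4, ("Chicago", "bagel"): 0.7}

     def draw_context():
         return random.choice(["NY", "Chicago"])

     def reward(x, a):
         # Bernoulli click drawn from the true CTR, which the learner never sees directly.
         return 1 if random.random() < TRUE_CTR[(x, a)] else 0

     def run(policy, T=1000):
         """The interactive protocol: for t = 1..T, see x, pick a, observe r(a, x) only."""
         total = 0
         for _ in range(T):
             x = draw_context()       # 1. world produces context x
             a = policy(x)            # 2. learner chooses action a
             total += reward(x, a)    # 3. world reacts with reward r(a, x)
         return total / T

     # Example: average reward of the deployed rule from the slides that follow.
     print(run(lambda x: "bagel" if x == "NY" else "pizza"))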
  4. Dominant Solution
     1. Deploy some initial system
     2. Collect data using this system
     3. Use machine learning to build a reward predictor r̂(a, x) from collected data
     4. Evaluate new system = arg max_a r̂(a, x): offline evaluation on past data, bucket test
     5. If metrics improve, switch to this new system and repeat
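     A sketch of that pipeline in the same toy setup (assuming logged (x, a, r) triples from the deployed system; the averaging predictor and the 0.5 default for unexplored cells are illustrative, chosen to mirror the estimates on the next slides):

     from collections import defaultdict

     def fit_reward_predictor(logged):
         """Step 3: fit r_hat(a, x) by averaging observed rewards per (context, action) cell."""
         sums, counts = defaultdict(float), defaultdict(int)
         for x, a, r in logged:
             sums[(x, a)] += r
             counts[(x, a)] += 1
         prior = 0.5  # default guess for cells the deployed system never tried
         return lambda a, x: sums[(x, a)] / counts[(x, a)] if counts[(x, a)] else prior

     def argmax_policy(r_hat, actions=("pizza", "bagel")):
         """Step 4: the new system picks the action with the highest predicted reward."""
         return lambda x: max(actions, key=lambda a: r_hat(a, x))

     # Data only from "NY -> bagels, Chicago -> pizza", matching the observed CTRs 0.6 and 0.4.
     logged = ([("NY", "bagel", r) for r in (1, 1, 1, 0, 0)] +
               [("Chicago", "pizza", r) for r in (1, 1, 0, 0, 0)])
     policy = argmax_policy(fit_reward_predictor(logged))
     print(policy("NY"), policy("Chicago"))   # bagel bagel: "bagels win" everywhere

     The unexplored cells never get better than the 0.5 default, which is exactly how this pipeline misses NY's pizza.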
  5-12. Example: Bagels vs. Pizza for New York and Chicago users
     Initial system: NY gets bagels, Chicago gets pizza.
     Observed CTR (pizza column, bagels column):
        New York     ?       0.6
        Chicago      0.4     ?
     A model fit to this data fills the unobserved cells with 0.5:
     Observed CTR / Estimated CTR:
        New York     ?/0.5       0.6/0.6
        Chicago      0.4/0.4     ?/0.5
     Bagels win. Switch to serving bagels for all and update the model based on new data.
     Chicago bagels come in at an observed CTR of 0.7, and the refit model matches the data it sees:
        New York     ?/0.4595    0.6/0.6
        Chicago      0.4/0.4     0.7/0.7
     Observed CTR / Estimated CTR / True CTR:
        New York     ?/0.4595/1      0.6/0.6/0.6
        Chicago      0.4/0.4/0.4     0.7/0.7/0.7
     Yikes! Missed out big in NY (pizza's true CTR there was 1, but it was never shown)!
  13-16. Basic Observations
     1. Standard machine learning is not enough. The model fits the collected data perfectly.
     2. More data doesn't help: Observed = True where data was collected.
     3. Better data helps! Exploration is required.
     4. Prediction errors are not a proxy for controlled exploration.
  17-19. Attempt to fix
     New policy: bagels in the morning, pizza at night, for both cities.
     This will overestimate the CTR for both!
     Solution: The deployed system should be randomized, with the probabilities recorded.
  20. Offline Evaluation
     Evaluating a new system on data collected by the deployed system may mislead badly:
        New York     ?/1/1           0.6/0.6/0.5
        Chicago      0.4/0.4/0.4     0.7/0.7/0.7
     The new system appears worse than the deployed system on the collected data, although its true loss may be much lower.
  21-22. The Evaluation Problem
     Given a new policy, how do we evaluate it?
     One possibility: Deploy it in the world. Very expensive! Need a bucket for every candidate policy.
  23-40. A/B testing for evaluating two policies
     Policy 1: use the "pizza for New York, bagels for Chicago" rule.
     Policy 2: use the "bagels for everyone" rule.
     Segment users randomly into Policy 1 and Policy 2 groups. (The slides animate a stream of arriving users, each served by their assigned policy, with the resulting click / no click recorded.)
     . . . Two weeks later, evaluate which is better.
  41-54. Instead randomize every transaction
     (at least for transactions you plan to use for learning and/or evaluation)
     Simplest strategy: ε-greedy. Go with the empirically best policy, but always choose a random action with probability ε > 0.
     Each interaction is logged as an exploration tuple (x, a, r, p_a), e.g.
        (x, b, 0, p_b), (x, p, 0, p_p), (x, p, 1, p_p), (x, b, 0, p_b), · · ·
     Offline evaluation: Later evaluate any policy using the same events. Each evaluation is cheap and immediate.
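     A sketch of ε-greedy logging in the same toy setup (the tuple layout (x, a, r, p_a) follows the slides; the function names and the ε value are illustrative):

     import random

     ACTIONS = ["pizza", "bagel"]

     def epsilon_greedy(best_action, epsilon=0.1):
         """Wrap the empirically best policy so it explores and reports the propensity."""
         def choose(x):
             greedy = best_action(x)
             a = random.choice(ACTIONS) if random.random() < epsilon else greedy
             # Probability that this particular action would have been chosen.
             p = (1 - epsilon) + epsilon / len(ACTIONS) if a == greedy else epsilon / len(ACTIONS)
             return a, p
         return choose

     def log_interaction(x, choose, observe_reward):
         a, p_a = choose(x)
         r = observe_reward(x, a)
         return (x, a, r, p_a)   # one exploration sample (x, a, r_a, p_a)

     # Example: one logged event while exploring around the "bagels for NY, pizza for Chicago" rule.
     choose = epsilon_greedy(lambda x: "bagel" if x == "NY" else "pizza")
     print(log_interaction("NY", choose, lambda x, a: 1 if random.random() < 0.6 else 0))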
  55-56. The Importance Weighting Trick
     Let π : X → A be a policy. How do we evaluate it?
     Collect exploration samples of the form (x, a, r_a, p_a), where
        x   = context
        a   = action
        r_a = reward for action
        p_a = probability of action a,
     then evaluate
        Value(π) = Average[ r_a · 1(π(x) = a) / p_a ]
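     A sketch of that estimator over logged (x, a, r_a, p_a) tuples (a direct transcription of the formula; the four-event log below is made up and far too small for a reliable estimate):

     def ips_value(policy, logged):
         """Importance-weighted value: average of r_a * 1(pi(x) = a) / p_a."""
         return sum(r * (1 if policy(x) == a else 0) / p for x, a, r, p in logged) / len(logged)

     # Hypothetical exploration log: (context, action shown, reward, probability it was shown).
     logged = [
         ("NY", "bagel", 1, 0.5), ("NY", "pizza", 1, 0.5),
         ("Chicago", "pizza", 0, 0.5), ("Chicago", "bagel", 1, 0.5),
     ]

     # Evaluate the "bagels for everyone" rule offline, without deploying it.
     print(ips_value(lambda x: "bagel", logged))   # 1.0 on this tiny log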
  57-59. The Importance Weighting Trick
     Theorem: Value(π) is an unbiased estimate of the expected reward of π:
        E_{(x,r)∼D}[ r_{π(x)} ] = E[ Value(π) ],
     with deviations bounded by O( 1 / (√T · min_x p_{π(x)}) ).
     Example (one context, two actions):
        Action         1        2
        Reward         0.5      1
        Probability    1/4      3/4
        Estimate       2 | 0    0 | 4/3
     (Estimate row: the value estimate for each column's action when action 1 is the logged action | when action 2 is. In expectation these recover the true rewards: (1/4)·2 = 0.5 and (3/4)·(4/3) = 1.)
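     The example can be checked numerically (a quick sketch, not from the talk):

     # Rewards and logging probabilities from the example slide.
     reward = {1: 0.5, 2: 1.0}
     prob = {1: 0.25, 2: 0.75}

     for target in (1, 2):   # the action the evaluated policy would pick
         # E[ r_a * 1(pi(x) = a) / p_a ] under the logging distribution:
         # r_target / p_target with probability p_target, and 0 otherwise.
         estimate = sum(prob[a] * (reward[a] / prob[a] if a == target else 0.0) for a in (1, 2))
         print(target, estimate)   # prints "1 0.5" and "2 1.0": unbiased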
  60-64. Can we do better?
     Suppose we have a (possibly bad) reward estimator r̂(a, x). How can we use it?
        Value′(π) = Average[ (r_a − r̂(a, x)) · 1(π(x) = a) / p_a + r̂(π(x), x) ]
     Why does this work?
        E_{a∼p}[ r̂(a, x) · 1(π(x) = a) / p_a ] = r̂(π(x), x)
     Keeps the estimate unbiased. It helps because a small r_a − r̂(a, x) reduces variance.
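     A sketch of this corrected estimator (often called the doubly robust estimator in the contextual bandit literature); r_hat stands for the possibly-bad reward estimator, and the log and the constant estimator below are illustrative:

     def dr_value(policy, r_hat, logged):
         """Average of (r_a - r_hat(a, x)) * 1(pi(x) = a) / p_a + r_hat(pi(x), x)."""
         total = 0.0
         for x, a, r, p in logged:
             match = 1.0 if policy(x) == a else 0.0
             total += (r - r_hat(a, x)) * match / p + r_hat(policy(x), x)
         return total / len(logged)

     # Even a crude constant estimator keeps the estimate unbiased; a better r_hat
     # makes r_a - r_hat(a, x) small, which is what shrinks the variance.
     crude_r_hat = lambda a, x: 0.5
     logged = [("NY", "bagel", 1, 0.5), ("NY", "pizza", 1, 0.5),
               ("Chicago", "pizza", 0, 0.5), ("Chicago", "bagel", 1, 0.5)]
     print(dr_value(lambda x: "bagel", crude_r_hat, logged))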
  65. How do you directly optimize based on past exploration data?
     1. Learn r̂(a, x).
     2. Compute for each x and each a′ ∈ A:
           (r_a − r̂(a, x)) · 1(a′ = a) / p_a + r̂(a′, x)
     3. Learn π using a cost-sensitive multiclass classifier.
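     A sketch of step 2, turning each logged event into per-action reward targets that a cost-sensitive multiclass learner could then be trained on (the stand-in r_hat and the tiny log are illustrative):

     def per_action_rewards(x, a_logged, r, p, r_hat, actions):
         """Doubly-robust reward estimate for every candidate action a' at context x."""
         return {a_prime: (r - r_hat(a_logged, x)) * (1.0 if a_prime == a_logged else 0.0) / p
                          + r_hat(a_prime, x)
                 for a_prime in actions}

     ACTIONS = ["pizza", "bagel"]
     r_hat = lambda a, x: 0.5   # stand-in for the learned reward estimator of step 1
     logged = [("NY", "bagel", 1, 0.5), ("Chicago", "pizza", 0, 0.5)]

     # Step 3 would feed these (context, per-action reward) pairs to a cost-sensitive
     # multiclass classifier; choosing the highest-reward action per context is the target.
     for x, a, r, p in logged:
         print(x, per_action_rewards(x, a, r, p, r_hat, ACTIONS))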
  66. Take home summary
     Using exploration data:
     1. There are techniques for using past exploration data to evaluate any policy.
     2. You can reliably measure performance offline, and hence experiment much faster, shifting from guess-and-check (A/B testing) to direct optimization.
     Doing exploration:
     1. There has been much recent progress on practical regret-optimal algorithms.
     2. ε-greedy has suboptimal regret but is a reasonable choice in practice.
  67. Comparison of Approaches
                       Supervised                ε-greedy                          Optimal CB algorithms
     Feedback          full                      bandit                            bandit
     Regret            O(√(ln(|Π|/δ) / T))       O((|A| ln(|Π|/δ) / T)^(1/3))      O(√(|A| ln(|Π|/δ) / T))
     Running time      O(T)                      O(T)                              O(T^1.5)
     References:
     A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, 2014.
     M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang. Efficient Optimal Learning for Contextual Bandits, 2011.
     A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2011.
