
Reinforcement Learning 2. Multi-armed Bandits

A summary of Chapter 2: Multi-armed Bandits of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html

  1. Chapter 2: Multi-armed Bandits (Seungjae Ryan Lee)
  2. One-armed Bandit ● Slot machine ● Each spin (action) is independent
  3. Multi-armed Bandit problem ● Multiple slot machines to choose from ● Simplified setting that avoids the complexities of the full RL problem ○ No observation ○ Actions have no delayed effect
  4. 10-armed Testbed ● 10 actions, 10 reward distributions ● Rewards chosen from stationary probability distributions
  5. Expected Reward ● The true value of an action is its expected reward: q*(a) = E[R_t | A_t = a] ● Knowing q*(a) trivializes the problem ● Estimate q*(a) with a learned value Q_t(a)
  6. Sample-average ● Estimate Q_t(a) by averaging the rewards received for action a ● Use a default value (e.g. 0) if the action was never selected ● Q_t(a) converges to q*(a) as the denominator (the number of times a was selected) goes to infinity
  7. Greedy method ● Always select greedily: A_t = argmax_a Q_t(a) ● No exploration ● Often gets stuck in suboptimal actions ● (Illustration: the agent eats the usual cereal, i.e. picks the familiar action with probability 1)
  8. ε-greedy method ● With probability ε select a random action; otherwise select greedily ● All Q_t(a) converge to q*(a) as the action counts go to infinity ● (Illustration: with ε = 0.1 the agent tries a new cereal with probability 0.1 and eats the usual cereal with probability 0.9; see the ε-greedy sketch after this list)
  9. Greedy vs. ε-greedy
  10. Incremental Implementation ● Don't store the reward for each step ● Compute the average incrementally: Q_{n+1} = Q_n + (1/n)(R_n - Q_n) (see the incremental-update sketch after this list)
  11. Nonstationary problem ● q*(a) changes over time ● Want to give new experience more weight
  12. Exponentially weighted average ● Constant step-size parameter α: Q_{n+1} = Q_n + α(R_n - Q_n) ● Gives more weight to recent rewards (see the constant step-size sketch after this list)
  13. Weighted average vs. sample-average ● Weighted average (constant α): never completely converges; desirable in nonstationary problems ● Sample-average: guaranteed convergence, but converges slowly, needs tuning, and is seldom used in applications
  14. Optimistic Initial Values ● Set initial action values optimistically (e.g. +5) ● Temporarily encourages exploration ● Doesn't work in nonstationary problems ● Example (step size 0.1): Q = +5, then R = 0 → +4.5, R = 0.1 → +4.06, R = -0.1 → +3.64, R = 0 → +3.28
  15. Optimistic Greedy vs. Realistic ε-greedy
  16. Upper Confidence Bound (UCB) ● Take into account each action's potential to be optimal: A_t = argmax_a [Q_t(a) + c√(ln t / N_t(a))] ● Selected less → more potential ● Difficult to extend beyond multi-armed bandits (see the UCB sketch after this list)
  17. UCB vs. ε-greedy
  18. Gradient Bandit Algorithms ● Learn a numerical preference H_t(a) for each action ● Convert preferences to probabilities with softmax: π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)}
  19. Gradient Bandit: Stochastic Gradient Descent ● Update preferences with SGD: H_{t+1}(a) = H_t(a) + α(R_t - R̄_t)(1{a = A_t} - π_t(a)) ● Baseline R̄_t: average of all rewards so far ● Increase the probability of an action whose reward is above the baseline ● Decrease it when the reward is below the baseline (see the gradient-bandit sketch after this list)
  20. Gradient Bandit: Results
  21. Parameter Study ● Check each algorithm's performance at its best setting ● Check sensitivity to its hyperparameters
  22. Associative Search (Contextual Bandit) ● Observe some context that can help the decision ● Intermediate between the multi-armed bandit and the full RL problem ○ Need to learn a policy that associates observations with actions ○ Each action affects only the immediate reward
  23. Thank you! Original content from ● Reinforcement Learning: An Introduction by Sutton and Barto You can find more content at ● github.com/seungjaeryanlee ● www.endtoend.ai
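
The sketches below are not part of the original slides; they are minimal Python illustrations of the techniques summarized above, with all function and variable names chosen for illustration only. First, a sample-average estimate with ε-greedy action selection (slides 6-8), assuming each arm is a callable that returns a sampled reward:

    import random

    # Illustrative sketch (my own names) of sample-average estimation with
    # epsilon-greedy action selection (slides 6-8).
    def epsilon_greedy_bandit(arms, steps=1000, epsilon=0.1, default=0.0):
        """arms: list of callables; calling arms[a]() returns a sampled reward."""
        k = len(arms)
        rewards = [[] for _ in range(k)]        # rewards received per action

        def estimate(a):
            # sample-average estimate; default value if the action was never selected
            return sum(rewards[a]) / len(rewards[a]) if rewards[a] else default

        for _ in range(steps):
            if random.random() < epsilon:       # explore: random action
                a = random.randrange(k)
            else:                               # exploit: greedy action
                a = max(range(k), key=estimate)
            rewards[a].append(arms[a]())        # pull the chosen arm
        return [estimate(a) for a in range(k)]

    # Example use on a toy 10-armed testbed with Gaussian rewards:
    means = [random.gauss(0, 1) for _ in range(10)]
    arms = [lambda m=m: random.gauss(m, 1) for m in means]
    print(epsilon_greedy_bandit(arms, steps=2000))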
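
A sketch of the incremental implementation from slide 10, using the identity Q_{n+1} = Q_n + (1/n)(R_n - Q_n); the class name is my own:

    # Keep only the running estimate Q and the count N, instead of storing
    # every reward for each step (slide 10).
    class IncrementalAverage:
        def __init__(self, default=0.0):
            self.q = default      # current estimate
            self.n = 0            # number of rewards averaged so far

        def update(self, reward):
            self.n += 1
            self.q += (reward - self.q) / self.n    # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)
            return self.q

    avg = IncrementalAverage()
    for r in [1.0, 0.0, 2.0]:
        avg.update(r)
    assert abs(avg.q - 1.0) < 1e-9    # matches the plain average (1 + 0 + 2) / 3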
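
A sketch of the constant step-size update behind the exponentially weighted average of slide 12; the step size 0.5 and the reward sequence in the example are arbitrary:

    # With a constant step size alpha, the estimate is an exponentially
    # weighted average of past rewards, so recent rewards count more
    # (useful when q*(a) drifts in a nonstationary problem).
    def constant_step_update(q, reward, alpha=0.1):
        # Q_{n+1} = Q_n + alpha * (R_n - Q_n)
        return q + alpha * (reward - q)

    q = 0.0
    for r in [1.0, 1.0, 1.0, 5.0, 5.0]:      # the arm's reward shifts upward...
        q = constant_step_update(q, r, alpha=0.5)
    print(q)                                  # ...and the estimate follows quickly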
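
A sketch of UCB action selection from slide 16, assuming the score Q(a) + c·sqrt(ln t / N(a)) with an arbitrarily chosen c = 2:

    import math

    # Prefer actions whose estimate is high OR whose uncertainty is high
    # (selected less -> more potential to be optimal).
    def ucb_select(Q, N, t, c=2.0):
        def score(a):
            if N[a] == 0:
                return float("inf")           # untried actions are tried first
            return Q[a] + c * math.sqrt(math.log(t) / N[a])
        return max(range(len(Q)), key=score)

    # Example: action 1 has a lower estimate but was tried far less often.
    print(ucb_select(Q=[0.6, 0.5], N=[100, 2], t=102))    # prints 1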
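
A sketch of one gradient-bandit step from slides 18-19; the exact baseline bookkeeping, the step size, and all names here are my own choices rather than anything specified by the slides:

    import math, random

    # Preferences H are mapped to probabilities with softmax; the chosen
    # action's preference rises when the reward beats the baseline (average
    # of all rewards so far) and falls otherwise.
    def softmax(H):
        exps = [math.exp(h - max(H)) for h in H]      # shift by max for stability
        total = sum(exps)
        return [e / total for e in exps]

    def gradient_bandit_step(H, arms, baseline, t, alpha=0.1):
        """t: number of rewards seen so far, including this step's reward."""
        pi = softmax(H)
        a = random.choices(range(len(H)), weights=pi)[0]
        r = arms[a]()
        baseline += (r - baseline) / t                # incremental reward average
        for b in range(len(H)):
            indicator = 1.0 if b == a else 0.0
            # H_{t+1}(b) = H_t(b) + alpha * (R_t - baseline) * (1{b = A_t} - pi(b))
            H[b] += alpha * (r - baseline) * (indicator - pi[b])
        return H, baseline

    # Example use on a 3-armed bandit:
    arms = [lambda mu=mu: random.gauss(mu, 1) for mu in (0.0, 1.0, 2.0)]
    H, baseline = [0.0, 0.0, 0.0], 0.0
    for t in range(1, 2001):
        H, baseline = gradient_bandit_step(H, arms, baseline, t)
    print(softmax(H))      # probability mass should concentrate on the best arm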
