# Reinforcement Learning 2. Multi-armed Bandits

A summary of Chapter 2: Multi-armed Bandits of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html

### Reinforcement Learning 2. Multi-armed Bandits

1. **Chapter 2: Multi-armed Bandits** (Seungjae Ryan Lee)
2. **One-armed Bandit**
    - A slot machine
    - Each spin (action) is independent
3. **Multi-armed Bandit problem**
    - Multiple slot machines to choose from
    - A simplified setting that avoids the complexities of full RL problems:
        - No observations
        - Actions have no delayed effects
4. **10-armed Testbed**
    - 10 actions, 10 reward distributions
    - Rewards are drawn from stationary probability distributions
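    A minimal sketch of this testbed in Python (assuming NumPy; the class and method names are illustrative, not from the slides):

    ```python
    import numpy as np

    class Testbed:
        """k-armed testbed: true values q*(a) ~ N(0, 1), rewards ~ N(q*(a), 1)."""
        def __init__(self, k=10, seed=None):
            self.rng = np.random.default_rng(seed)
            self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

        def step(self, action):
            # Each pull returns a reward from a stationary normal distribution
            return self.rng.normal(self.q_star[action], 1.0)
    ```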
5. **Expected Reward**
    - The expected reward of an action is its true value: $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$
    - Knowing the expected reward trivializes the problem
    - Instead, estimate $q_*(a)$ with $Q_t(a)$
6. **Sample-average method**
    - Estimate by averaging the received rewards: $Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}$
    - Use a default value (e.g. 0) if the action was never selected
    - $Q_t(a)$ converges to $q_*(a)$ as the denominator goes to infinity
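    A short sketch of the sample-average estimate (a hypothetical helper, assuming NumPy):

    ```python
    import numpy as np

    def sample_average(actions, rewards, k=10, default=0.0):
        """Q(a) = mean reward observed for action a; `default` if never selected."""
        totals, counts = np.zeros(k), np.zeros(k)
        for a, r in zip(actions, rewards):
            totals[a] += r
            counts[a] += 1
        return np.where(counts > 0, totals / np.maximum(counts, 1), default)
    ```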
7. **Greedy method**
    - Always select greedily: $A_t = \arg\max_a Q_t(a)$
    - No exploration
    - Often gets stuck in suboptimal actions
    - (Slide illustration: the agent eats the usual cereal with probability 1)
8. **ε-greedy method**
    - Select a random action with probability ε; otherwise select greedily
    - All $Q_t(a)$ converge to $q_*(a)$ as the denominators go to infinity
    - (Slide illustration: with ε = 0.1, the agent tries a new cereal with probability 0.1 and eats the usual cereal with probability 0.9)
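    A minimal sketch of ε-greedy selection (assuming NumPy; `Q` is the current array of value estimates):

    ```python
    import numpy as np

    def epsilon_greedy(Q, epsilon, rng):
        """With probability epsilon explore uniformly, else exploit argmax Q."""
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))  # explore: random action
        return int(np.argmax(Q))              # exploit: greedy action
    ```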
9. **Greedy vs. ε-greedy**
10. **Incremental Implementation**
    - Don't store the reward from every step
    - Compute the average incrementally: $Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$
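    The incremental update as code (a minimal sketch; the class name is illustrative):

    ```python
    class IncrementalAverage:
        """Running sample average: Q <- Q + (1/n)(R - Q); no reward history kept."""
        def __init__(self, q0=0.0):
            self.q, self.n = q0, 0

        def update(self, reward):
            self.n += 1
            self.q += (reward - self.q) / self.n  # NewEstimate = Old + StepSize * Error
            return self.q
    ```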
11. **Nonstationary problem**
    - $q_*(a)$ changes over time
    - Want to give new experience more weight
12. **Exponentially weighted average**
    - Use a constant step-size parameter α: $Q_{n+1} = Q_n + \alpha\left(R_n - Q_n\right)$
    - Gives more weight to recent rewards
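    The constant step-size update as a one-line sketch (α = 0.1 is an assumed, typical value):

    ```python
    def constant_step_update(q, reward, alpha=0.1):
        """Exponential recency-weighted average: Q <- Q + alpha * (R - Q).

        Unrolling the recursion gives reward R_i a weight of
        alpha * (1 - alpha) ** (n - i), so recent rewards dominate.
        """
        return q + alpha * (reward - q)
    ```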
13. **Weighted average vs. Sample-average**
    - Weighted average: never completely converges; desirable in nonstationary problems
    - Sample-average: guaranteed convergence, but converges slowly, needs tuning, and is seldom used in applications
14. **Optimistic Initial Values**
    - Set the initial action values optimistically (e.g. +5)
    - Temporarily encourages exploration
    - Doesn't work in nonstationary problems
    - Example trace (constant step size α = 0.1): Q = +5, then after rewards 0, 0.1, −0.1, 0 the estimate falls to +4.5, +4.06, +3.64, +3.28
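    A sketch of an optimistic greedy agent (assumes the `Testbed` sketch above; the step size and initial value follow the slide's example):

    ```python
    import numpy as np

    def optimistic_greedy(testbed, steps=1000, k=10, q0=5.0, alpha=0.1):
        """Purely greedy agent whose optimistic start forces early exploration."""
        Q = np.full(k, q0)              # optimistic initial values
        for _ in range(steps):
            a = int(np.argmax(Q))       # greedy selection, no epsilon
            r = testbed.step(a)
            Q[a] += alpha * (r - Q[a])  # constant step-size update
        return Q
    ```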
15. **Optimistic Greedy vs. Realistic ε-greedy**
16. **Upper Confidence Bound (UCB)**
    - Take into account each action's potential to be optimal: $A_t = \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]$
    - The less an action has been selected, the more potential it has
    - Difficult to extend beyond multi-armed bandits
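    A minimal sketch of UCB selection (assuming NumPy; `N` counts how often each action has been taken, and c = 2.0 is an assumed exploration constant):

    ```python
    import numpy as np

    def ucb_action(Q, N, t, c=2.0):
        """argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried actions go first."""
        if (N == 0).any():
            return int(np.argmax(N == 0))   # treat N(a) = 0 as maximally uncertain
        return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
    ```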
17. **UCB vs. ε-greedy**
18. **Gradient Bandit Algorithms**
    - Learn a numerical preference $H_t(a)$ for each action
    - Convert preferences to probabilities with the softmax: $\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
19. **Gradient Bandit: Stochastic Gradient Descent**
    - Update the preferences by stochastic gradient ascent: $H_{t+1}(a) = H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}[a = A_t] - \pi_t(a)\right)$
    - Baseline $\bar{R}_t$: the average of all rewards received so far
    - Increase the probability of $A_t$ if its reward is above the baseline
    - Decrease it if the reward is below the baseline
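    A compact sketch of the full gradient bandit loop (assumes the `Testbed` sketch above; hyperparameters are illustrative):

    ```python
    import numpy as np

    def gradient_bandit(testbed, steps=1000, k=10, alpha=0.1, seed=0):
        """Softmax action selection over preferences H, with a reward baseline."""
        rng = np.random.default_rng(seed)
        H = np.zeros(k)                       # numerical preferences
        baseline = 0.0                        # incremental average of all rewards
        for t in range(1, steps + 1):
            pi = np.exp(H - H.max())
            pi /= pi.sum()                    # softmax probabilities
            a = int(rng.choice(k, p=pi))
            r = testbed.step(a)
            baseline += (r - baseline) / t
            # One update covers both cases: (1[a = A_t] - pi) is positive only for A_t
            H += alpha * (r - baseline) * (np.eye(k)[a] - pi)
        return H
    ```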