Multi-armed Bandits
Dongmin Lee
RLI Study
Reference
- Reinforcement Learning : An Introduction,
Richard S. Sutton & Andrew G. Barto
(Link)
- Multi-Armed Bandits (Korean blog post), Chris Song
(https://brunch.co.kr/@chris-song/62)
Index
1. Why do you need to know Multi-armed Bandits?
2. A k-armed Bandit Problem
3. Simple-average Action-value Methods
4. A simple Bandit Algorithm
5. Gradient Bandit Algorithm
6. Summary
1. Why do you need to know
Multi-armed Bandits(MAB)?
1. Why do you need to know MAB?
- Reinforcement Learning (RL) uses training information that
'evaluates' (rather than 'instructs') the actions taken
- Evaluative feedback indicates 'How good the action taken was'
- For simplicity, we study a 'nonassociative' setting: acting in only one situation
- This is the setting of most prior work involving evaluative feedback
- 'Nonassociative' + 'Evaluative feedback' -> MAB
- MAB introduces basic learning methods that are extended in later chapters
1. Why do you need to know MAB?
In my opinion,
- I don't think we can understand RL without knowing MAB.
- MAB deals with 'Exploitation & Exploration', one of the core ideas in RL.
- The ideas from MAB reappear throughout the full reinforcement learning problem.
- MAB is very useful in many practical fields.
2. A k-armed Bandit Problem
2. A k-armed Bandit Problem
Do you know what MAB is?
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
- Slot machine -> Bandit
- Slot machine's lever -> Arm
- N slot machines -> Multi-armed bandits
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
Among the various slot machines,
which slot machine
should I put my money on
and pull the lever?
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
How can you make the best return
on your investment?
Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
MAB is an algorithm created to
optimize investment across slot machines
2. A k-armed Bandit Problem
A K-armed Bandit Problem
A K-armed Bandit Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$t$ – discrete time step or play number
$A_t$ – action at time $t$
$R_t$ – reward at time $t$
$q_*(a)$ – true value (expected reward) of action $a$
A K-armed Bandit Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
In our k-armed bandit problem, each of the k actions has
an expected or mean reward given that that action is selected;
let us call this the value of that action.
3. Simple-average Action-value Methods
3. Simple-average Action-value Methods
Simple-average Method
Simple-average Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
๐‘„๐‘ก(๐‘Ž) converges to ๐‘žโˆ—(๐‘Ž)
3. Simple-average Action-value Methods
Action-value Methods
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- ε-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- ε-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
Greedy Action Selection Method
๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘Ž๐‘ฅ ๐‘Ž ๐‘“(๐‘Ž) - a value of ๐‘Ž at which ๐‘“(๐‘Ž) takes its maximal value
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Greedy Action Selection Method
Greedy action selection always exploits
current knowledge to maximize immediate reward
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Greedy Action Selection Method
Greedy Action Selection Methodโ€™s disadvantage
Is it a good idea to always select the greedy action,
exploiting current knowledge
to maximize the current immediate reward?
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- ε-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
ε-greedy Action Selection Method
Exploitation is the right thing to do to maximize
the expected reward on the one step,
but Exploration may produce the greater total reward in the long run.
ε-greedy Action Selection Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
ε – probability of taking a random action in an ε-greedy policy
ε-greedy Action Selection Method
Exploitation
Exploration
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
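As a sketch of the rule on this slide: with probability 1 − ε take the greedy action $\arg\max_a Q(a)$ (exploitation), and with probability ε take a uniformly random action (exploration). Setting ε = 0 recovers pure greedy selection. A minimal Python version (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon explore (uniform random action);
    otherwise exploit: take the greedy action argmax_a Q(a).
    epsilon = 0 recovers pure greedy selection."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # exploration
    return int(np.argmax(q))               # exploitation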
Action-value Methods
Action-value Methods
- Greedy Action Selection Method
- ε-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
Upper-Confidence-Bound(UCB) Action Selection Method
ln ๐‘ก - natural logarithm of ๐‘ก
๐‘๐‘ก(๐‘Ž) โ€“ the number of times that action ๐‘Ž has been selected prior to time ๐‘ก
๐‘ โ€“ the number ๐‘ > 0 controls the degree of exploration
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
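The rule on this slide is $A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$, with an action whose $N_t(a) = 0$ treated as maximizing. A minimal sketch (the helper name and the default $c$ are illustrative; the book's testbed example uses $c = 2$):

```python
import numpy as np

def ucb_select(q, n, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]  (t >= 1).
    An action with N_t(a) = 0 is treated as maximizing, so it is tried first."""
    n = np.asarray(n, dtype=float)
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```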
Upper-Confidence-Bound(UCB) Action Selection Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The idea of this UCB action selection is that
the square-root term is a measure of
the uncertainty (or potential) in the estimate of $a$'s value:
the probability that the slot machine
may still be the optimal slot machine
Upper-Confidence-Bound(UCB) Action Selection Method
UCB Action Selection Methodโ€™s disadvantage
UCB is more difficult than ε-greedy to extend beyond bandits
to the more general reinforcement learning settings
One difficulty is in dealing with nonstationary problems
Another difficulty is dealing with large state spaces
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
4. A simple Bandit Algorithm
4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
Incremental Implementation
Let $Q_n$ denote the estimate of an action's value after it has been
selected $n-1$ times, i.e., the average of its first $n-1$ rewards $R_1, \dots, R_{n-1}$
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
๐‘„ ๐‘›+1 = ๐‘„ ๐‘› +
1
๐‘›
๐‘… ๐‘› โˆ’ ๐‘„ ๐‘›
holds even for ๐‘› = 1,
obtaining ๐‘„2 = ๐‘…1 for arbitrary ๐‘„1
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The step size $\frac{1}{n}$ varies from step to step (it is not constant),
so this sample-average method is suitable only for stationary problems
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The expression $\left[ Target - OldEstimate \right]$ is an error in the estimate.
The target is presumed to indicate a desirable direction
in which to move, though it may be noisy.
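Putting the pieces together, the boxed pseudocode these slides reference (the book's simple bandit algorithm) combines ε-greedy selection with the incremental update. A runnable sketch, with a hypothetical Gaussian testbed standing in for the k slot machines:

```python
import numpy as np

rng = np.random.default_rng(1)

def run_simple_bandit(k=10, steps=1000, epsilon=0.1):
    """epsilon-greedy selection + incremental sample-average updates."""
    true_values = rng.normal(size=k)            # hypothetical q_*(a) for a testbed
    q = np.zeros(k)                             # Q(a) <- 0
    n = np.zeros(k)                             # N(a) <- 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))            # explore
        else:
            a = int(np.argmax(q))               # exploit
        r = rng.normal(true_values[a], 1.0)     # R <- bandit(A)
        n[a] += 1                               # N(A) <- N(A) + 1
        q[a] += (r - q[a]) / n[a]               # Q(A) <- Q(A) + (1/N(A))[R - Q(A)]
    return q, true_values

q_est, q_true = run_simple_bandit()
```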
4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Why do you think it should be changed from $\frac{1}{n}$ to $\alpha$?
Tracking a Nonstationary Problem
Why do you think it should be changed from $\frac{1}{n}$ to $\alpha$?
We often encounter RL problems that are effectively nonstationary.
In such cases it makes sense to give more weight
to recent rewards than to long-past rewards.
One of the most popular ways of doing this is to use
a constant step-size parameter.
The step-size parameter $\alpha \in (0,1]$ is constant.
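In code, this is a one-line change to the incremental rule above (a sketch; the name is illustrative):

```python
def constant_alpha_update(q_n, r_n, alpha=0.1):
    """Q_{n+1} = Q_n + alpha * (R_n - Q_n) with constant alpha in (0, 1]:
    recent rewards get exponentially more weight, so the estimate
    can track a nonstationary q_*(a)."""
    return q_n + alpha * (r_n - q_n)
```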
Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
$Q_{n+1} = Q_n + \alpha \left[ R_n - Q_n \right]$
$= Q_n + \alpha R_n - \alpha Q_n$
$= \alpha R_n + (1 - \alpha) Q_n$
$\therefore Q_n = \alpha R_{n-1} + (1 - \alpha) Q_{n-1}$
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Tracking a Nonstationary Problem
โˆด ๐‘„ ๐‘›โˆ’1 = ๐›ผ ๐‘… ๐‘›โˆ’2 + 1 โˆ’ ๐›ผ ๐‘… ๐‘›โˆ’3 + โ‹ฏ + 1 โˆ’ ๐›ผ ๐‘›โˆ’3 ๐‘…1 + 1 โˆ’ ๐›ผ ๐‘›โˆ’2 ๐‘„1
1 โˆ’ ๐›ผ 2
๐‘„ ๐‘›โˆ’1 = 1 โˆ’ ๐›ผ 2
๐›ผ๐‘… ๐‘›โˆ’2 + 1 โˆ’ ๐›ผ 3
๐›ผ๐‘… ๐‘›โˆ’3 + โ‹ฏ
+ 1 โˆ’ ๐›ผ ๐‘›โˆ’1
๐›ผ๐‘…1 + 1 โˆ’ ๐›ผ ๐‘›
๐‘„1 ?
= 1 โˆ’ ๐›ผ 2{๐›ผ๐‘… ๐‘›โˆ’2 + 1 โˆ’ ๐›ผ ๐›ผ๐‘… ๐‘›โˆ’3 + โ‹ฏ
+ 1 โˆ’ ๐›ผ ๐‘›โˆ’3 ๐›ผ๐‘…1 + 1 โˆ’ ๐›ผ ๐‘›โˆ’2 ๐‘„1}
Tracking a Nonstationary Problem
Sequences of step-size parameters often converge
very slowly or need considerable tuning
in order to obtain a satisfactory convergence rate.
Thus, step-size parameters should be tuned effectively.
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
5. Gradient Bandit Algorithm
5. Gradient Bandit Algorithm
In addition to the simple bandit algorithm,
there is another approach: a gradient method used as a bandit algorithm
5. Gradient Bandit Algorithm
We consider learning a numerical preference
for each action $a$, which we denote $H_t(a)$.
The larger the preference, the more often that action is taken,
but the preference has no interpretation in terms of reward.
In other words, just because the preference $H_t(a)$ is large,
the reward is not necessarily large.
However, if the reward is large, it can affect the preference $H_t(a)$
5. Gradient Bandit Algorithm
The action probabilities are determined according to
a soft-max distribution (i.e., Gibbs or Boltzmann distribution):
$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
5. Gradient Bandit Algorithm
๐œ‹ ๐‘ก(๐‘Ž) โ€“ Probability of selecting action ๐‘Ž at time ๐‘ก
Initially all action preferences are the same
so that all actions have an equal probability of being selected.
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
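A small Python helper for this soft-max (the max-shift is a standard numerical-stability trick, not something from the slides):

```python
import numpy as np

def softmax_probs(h):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b)).
    Subtracting max(h) leaves the result unchanged but avoids overflow."""
    z = np.exp(h - np.max(h))
    return z / z.sum()

print(softmax_probs(np.zeros(4)))   # equal preferences -> [0.25 0.25 0.25 0.25]
```

The printed line illustrates the slide's point: equal preferences give every action an equal probability of being selected.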
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
There is a natural learning algorithm for this setting
based on the idea of stochastic gradient ascent.
On each step, after selecting action $A_t$ and receiving the reward $R_t$,
the action preferences are updated:
Selected action $A_t$: $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))$
Non-selected actions $a \neq A_t$: $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\, \pi_t(a)$
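A minimal sketch of this preference update, assuming the soft-max from the previous slide (the function name is illustrative):

```python
import numpy as np

def gradient_bandit_update(h, a_t, r_t, r_bar, alpha=0.1):
    """One preference update:
      H_{t+1}(A_t) = H_t(A_t) + alpha * (R_t - Rbar_t) * (1 - pi_t(A_t))
      H_{t+1}(a)   = H_t(a)   - alpha * (R_t - Rbar_t) * pi_t(a),  a != A_t
    """
    z = np.exp(h - np.max(h))
    pi = z / z.sum()                     # pi_t from the soft-max
    h = h - alpha * (r_t - r_bar) * pi   # non-selected-action rule, applied to all
    h[a_t] += alpha * (r_t - r_bar)      # adding alpha*(R - Rbar) to the selected
                                         # action yields the (1 - pi_t(A_t)) factor
    return h
```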
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
1) What does $\bar{R}_t$ mean?
2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
1) What does $\bar{R}_t$ mean?
2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
?
What does ๐‘…๐‘ก mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$\bar{R}_t \in \mathbb{R}$ is the average of all the rewards up through time $t$.
The $\bar{R}_t$ term serves as a baseline.
If the reward is higher than the baseline,
then the probability of taking $A_t$ in the future is increased,
and if the reward is below the baseline, then the probability is decreased.
The non-selected actions move in the opposite direction.
5. Gradient Bandit Algorithm
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
1) What does $\bar{R}_t$ mean?
2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
?
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
- Stochastic approximation to gradient ascent
in the gradient bandit algorithm
- Expected reward: $E[R_t] = \sum_x \pi_t(x)\, q_*(x)$
- Expected reward by the law of total expectation: $E[R_t] = E\left[\, E[R_t \mid A_t] \,\right]$
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
?
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
?
The gradient sums to zero over all the actions: $\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$
– as $H_t(a)$ is changed, some actions' probabilities go up
and some go down, but the sum of the changes must be
zero because the sum of the probabilities is always one.
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
?
E[๐‘…๐‘ก] = E[ E ๐‘…๐‘ก ๐ด ๐‘ก ]
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Please refer to page 40 of the book linked on the Reference slide
What does (๐‘…๐‘ก โˆ’ ๐‘…๐‘ก)(1 โˆ’ ๐œ‹ ๐‘ก ๐ด ๐‘ก ) mean?
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$\therefore$ We can substitute a sample of the expectation above
for the performance gradient in
$H_{t+1}(a) \doteq H_t(a) + \alpha\, \frac{\partial E[R_t]}{\partial H_t(a)}$
6. Summary
6. Summary
In this chapter, โ€˜Exploitation & Explorationโ€˜ is the core idea.
6. Summary
Action-value Methods
- Greedy Action Selection Method
- ε-greedy Action Selection Method
- Upper-Confidence-Bound(UCB) Action Selection Method
6. Summary
A simple bandit algorithm : Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
A simple bandit algorithm : Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
A simple bandit algorithm : Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
A simple bandit algorithm : Tracking a Nonstationary Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
6. Summary
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Gradient Bandit Algorithm
6. Summary
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
A parameter study of the various bandit algorithms
Reinforcement Learning is LOVE ♥
Thank you

More Related Content

What's hot

Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
Chandra Meena
ย 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
Khaled Saleh
ย 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
Kuppusamy P
ย 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Shahan Ali Memon
ย 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
Slideshare
ย 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo Methods
Seung Jae Lee
ย 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
Omar Enayet
ย 
Lecture 9 Markov decision process
Lecture 9 Markov decision processLecture 9 Markov decision process
Lecture 9 Markov decision process
VARUN KUMAR
ย 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Ding Li
ย 
Introduction to Multi-armed Bandits
Introduction to Multi-armed BanditsIntroduction to Multi-armed Bandits
Introduction to Multi-armed Bandits
Yan Xu
ย 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
MeetupDataScienceRoma
ย 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
Dheerendra k
ย 
Reinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesReinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision Processes
Seung Jae Lee
ย 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
Hamed Abdi
ย 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)
Dong Guo
ย 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
NAVER Engineering
ย 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
Nikolay Pavlov
ย 
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular MethodsReinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Seung Jae Lee
ย 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
DongHyun Kwak
ย 

What's hot (20)

Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
ย 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
ย 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
ย 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
ย 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
ย 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo Methods
ย 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
ย 
Lecture 9 Markov decision process
Lecture 9 Markov decision processLecture 9 Markov decision process
Lecture 9 Markov decision process
ย 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
ย 
Introduction to Multi-armed Bandits
Introduction to Multi-armed BanditsIntroduction to Multi-armed Bandits
Introduction to Multi-armed Bandits
ย 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
ย 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
ย 
Reinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesReinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision Processes
ย 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
ย 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)
ย 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
ย 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
ย 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
ย 
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular MethodsReinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular Methods
ย 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
ย 

Similar to Multi-armed Bandits

Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular Methods
Dongmin Lee
ย 
Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search
Efficient Selectivity and Backup Operators in Monte-Carlo Tree SearchEfficient Selectivity and Backup Operators in Monte-Carlo Tree Search
Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search
Melvin Zhang
ย 
Aprendizaje reforzado con swift
Aprendizaje reforzado con swiftAprendizaje reforzado con swift
Aprendizaje reforzado con swift
NSCoder Mexico
ย 
Multi-Armed Bandit: an algorithmic perspective
Multi-Armed Bandit: an algorithmic perspectiveMulti-Armed Bandit: an algorithmic perspective
Multi-Armed Bandit: an algorithmic perspective
Gabriele Sottocornola
ย 
Sequential and Diverse Recommendation with Long Tail
Sequential and Diverse Recommendation with Long TailSequential and Diverse Recommendation with Long Tail
Sequential and Diverse Recommendation with Long Tail
Yejin Kim
ย 
Fast Distributed Online Classification
Fast Distributed Online Classification Fast Distributed Online Classification
Fast Distributed Online Classification
DataWorks Summit/Hadoop Summit
ย 
Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...
Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...
Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...
Daniel Katz
ย 
ๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…ง
ๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…งๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…ง
ๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…ง
Frank Fang Kuo Yu
ย 
Introduction to Few shot learning
Introduction to Few shot learningIntroduction to Few shot learning
Introduction to Few shot learning
Ridge-i, Inc.
ย 
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement LearningMulti-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning
Seolhokim
ย 
Fast Distributed Online Classification
Fast Distributed Online ClassificationFast Distributed Online Classification
Fast Distributed Online Classification
Prasad Chalasani
ย 
How DeepMind Mastered The Game Of Go
How DeepMind Mastered The Game Of GoHow DeepMind Mastered The Game Of Go
How DeepMind Mastered The Game Of Go
Tim Riser
ย 
Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...
Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...
Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...
Marko Lohert
ย 
How to Use Agile to Move the Earth
How to Use Agile to Move the EarthHow to Use Agile to Move the Earth
How to Use Agile to Move the Earth
Ryan Martens
ย 
Google ai based personal assistant and i woring
Google ai based personal assistant and i woringGoogle ai based personal assistant and i woring
Google ai based personal assistant and i woring
IdcIdk1
ย 
Summer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar UniversitySummer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar University
Rishabh Misra
ย 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Sergey Karayev
ย 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
VaishnavGhadge1
ย 
Reinforcement Learning in Practice: Contextual Bandits
Reinforcement Learning in Practice: Contextual BanditsReinforcement Learning in Practice: Contextual Bandits
Reinforcement Learning in Practice: Contextual Bandits
Max Pagels
ย 
Say "Hi!" to Your New Boss
Say "Hi!" to Your New BossSay "Hi!" to Your New Boss
Say "Hi!" to Your New Boss
Andreas Dewes
ย 

Similar to Multi-armed Bandits (20)

Planning and Learning with Tabular Methods
Planning and Learning with Tabular MethodsPlanning and Learning with Tabular Methods
Planning and Learning with Tabular Methods
ย 
Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search
Efficient Selectivity and Backup Operators in Monte-Carlo Tree SearchEfficient Selectivity and Backup Operators in Monte-Carlo Tree Search
Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search
ย 
Aprendizaje reforzado con swift
Aprendizaje reforzado con swiftAprendizaje reforzado con swift
Aprendizaje reforzado con swift
ย 
Multi-Armed Bandit: an algorithmic perspective
Multi-Armed Bandit: an algorithmic perspectiveMulti-Armed Bandit: an algorithmic perspective
Multi-Armed Bandit: an algorithmic perspective
ย 
Sequential and Diverse Recommendation with Long Tail
Sequential and Diverse Recommendation with Long TailSequential and Diverse Recommendation with Long Tail
Sequential and Diverse Recommendation with Long Tail
ย 
Fast Distributed Online Classification
Fast Distributed Online Classification Fast Distributed Online Classification
Fast Distributed Online Classification
ย 
Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...
Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...
Legal Analytics Course - Class 8 - Introduction to Random Forests and Ensembl...
ย 
ๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…ง
ๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…งๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…ง
ๅพž Atari/AlphaGo/ChatGPT ่ซ‡ๆทฑๅบฆๅผทๅŒ–ๅญธ็ฟ’ๅŠ้€š็”จไบบๅทฅๆ™บๆ…ง
ย 
Introduction to Few shot learning
Introduction to Few shot learningIntroduction to Few shot learning
Introduction to Few shot learning
ย 
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement LearningMulti-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning
ย 
Fast Distributed Online Classification
Fast Distributed Online ClassificationFast Distributed Online Classification
Fast Distributed Online Classification
ย 
How DeepMind Mastered The Game Of Go
How DeepMind Mastered The Game Of GoHow DeepMind Mastered The Game Of Go
How DeepMind Mastered The Game Of Go
ย 
Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...
Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...
Reinforcement Learning โ€“ a Rewards Based Approach to Machine Learning - Marko...
ย 
How to Use Agile to Move the Earth
How to Use Agile to Move the EarthHow to Use Agile to Move the Earth
How to Use Agile to Move the Earth
ย 
Google ai based personal assistant and i woring
Google ai based personal assistant and i woringGoogle ai based personal assistant and i woring
Google ai based personal assistant and i woring
ย 
Summer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar UniversitySummer internship 2014 report by Rishabh Misra, Thapar University
Summer internship 2014 report by Rishabh Misra, Thapar University
ย 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
ย 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
ย 
Reinforcement Learning in Practice: Contextual Bandits
Reinforcement Learning in Practice: Contextual BanditsReinforcement Learning in Practice: Contextual Bandits
Reinforcement Learning in Practice: Contextual Bandits
ย 
Say "Hi!" to Your New Boss
Say "Hi!" to Your New BossSay "Hi!" to Your New Boss
Say "Hi!" to Your New Boss
ย 

More from Dongmin Lee

Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation Learning
Dongmin Lee
ย 
Character Controllers using Motion VAEs
Character Controllers using Motion VAEsCharacter Controllers using Motion VAEs
Character Controllers using Motion VAEs
Dongmin Lee
ย 
Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation Learning
Dongmin Lee
ย 
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Dongmin Lee
ย 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
Dongmin Lee
ย 
Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement Learning
Dongmin Lee
ย 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
Dongmin Lee
ย 
Let's do Inverse RL
Let's do Inverse RLLet's do Inverse RL
Let's do Inverse RL
Dongmin Lee
ย 
๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ
๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ
๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ
Dongmin Lee
ย 
Safe Reinforcement Learning
Safe Reinforcement LearningSafe Reinforcement Learning
Safe Reinforcement Learning
Dongmin Lee
ย 
์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!
์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!
์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!
Dongmin Lee
ย 
๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2
๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2
๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2
Dongmin Lee
ย 
๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1
๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1
๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1
Dongmin Lee
ย 
๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”
๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”
๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”
Dongmin Lee
ย 

More from Dongmin Lee (14)

Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation Learning
ย 
Character Controllers using Motion VAEs
Character Controllers using Motion VAEsCharacter Controllers using Motion VAEs
Character Controllers using Motion VAEs
ย 
Causal Confusion in Imitation Learning
Causal Confusion in Imitation LearningCausal Confusion in Imitation Learning
Causal Confusion in Imitation Learning
ย 
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
ย 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
ย 
Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement Learning
ย 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
ย 
Let's do Inverse RL
Let's do Inverse RLLet's do Inverse RL
Let's do Inverse RL
ย 
๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ
๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ
๋ชจ๋‘๋ฅผ ์œ„ํ•œ PG์—ฌํ–‰ ๊ฐ€์ด๋“œ
ย 
Safe Reinforcement Learning
Safe Reinforcement LearningSafe Reinforcement Learning
Safe Reinforcement Learning
ย 
์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!
์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!
์•ˆ.์ „.์ œ.์ผ. ๊ฐ•ํ™”ํ•™์Šต!
ย 
๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2
๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2
๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ๋ฆ„๋„ Part 2
ย 
๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1
๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1
๊ฐ•ํ™”ํ•™์Šต์˜ ํ๋ฆ„๋„ Part 1
ย 
๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”
๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”
๊ฐ•ํ™”ํ•™์Šต์˜ ๊ฐœ์š”
ย 

Recently uploaded

cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
sandertein
ย 
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
Sรฉrgio Sacani
ย 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
Areesha Ahmad
ย 
BIOTRANSFORMATION MECHANISM FOR OF STEROID
BIOTRANSFORMATION MECHANISM FOR OF STEROIDBIOTRANSFORMATION MECHANISM FOR OF STEROID
BIOTRANSFORMATION MECHANISM FOR OF STEROID
ShibsekharRoy1
ย 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
PsychoTech Services
ย 
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Creative-Biolabs
ย 
23PH301 - Optics - Unit 1 - Optical Lenses
23PH301 - Optics  -  Unit 1 - Optical Lenses23PH301 - Optics  -  Unit 1 - Optical Lenses
23PH301 - Optics - Unit 1 - Optical Lenses
RDhivya6
ย 
Methods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdfMethods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdf
PirithiRaju
ย 
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
ABHISHEK SONI NIMT INSTITUTE OF MEDICAL AND PARAMEDCIAL SCIENCES , GOVT PG COLLEGE NOIDA
ย 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
vadgavevedant86
ย 
Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...
frank0071
ย 
Introduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptxIntroduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptx
QusayMaghayerh
ย 
Microbiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdfMicrobiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdf
sammy700571
ย 
gastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptxgastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptx
Shekar Boddu
ย 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
Scintica Instrumentation
ย 
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptxBIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
goluk9330
ย 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
ย 
ๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ท
ๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ทๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ท
ๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ท
vluwdy49
ย 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR
ย 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at ๐ณ = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  ๐ณ = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  ๐ณ = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at ๐ณ = 2.9 wi...
Sรฉrgio Sacani
ย 

Recently uploaded (20)

cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
ย 
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
ย 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
ย 
BIOTRANSFORMATION MECHANISM FOR OF STEROID
BIOTRANSFORMATION MECHANISM FOR OF STEROIDBIOTRANSFORMATION MECHANISM FOR OF STEROID
BIOTRANSFORMATION MECHANISM FOR OF STEROID
ย 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
ย 
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
ย 
23PH301 - Optics - Unit 1 - Optical Lenses
23PH301 - Optics  -  Unit 1 - Optical Lenses23PH301 - Optics  -  Unit 1 - Optical Lenses
23PH301 - Optics - Unit 1 - Optical Lenses
ย 
Methods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdfMethods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdf
ย 
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
ย 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
ย 
Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura espaรฑola a la civi...
ย 
Introduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptxIntroduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptx
ย 
Microbiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdfMicrobiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdf
ย 
gastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptxgastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptx
ย 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
ย 
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptxBIRDS  DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx
ย 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
ย 
ๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ท
ๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ทๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ท
ๅœจ็บฟๅŠž็†(salforๆฏ•ไธš่ฏไนฆ)็ดขๅฐ”็ฆๅพทๅคงๅญฆๆฏ•ไธš่ฏๆฏ•ไธšๅฎŒๆˆไฟกไธ€ๆจกไธ€ๆ ท
ย 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
ย 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at ๐ณ = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  ๐ณ = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  ๐ณ = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at ๐ณ = 2.9 wi...
ย 

Multi-armed Bandits

  • 2. Reference - Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto (Link) - ๋ฉ€ํ‹ฐ ์•”๋“œ ๋ฐด๋”ง(Multi-Armed Bandits), ์†กํ˜ธ์—ฐ (https://brunch.co.kr/@chris-song/62)
  • 3. Index 1. Why do you need to know Multi-armed Bandits? 2. A k-armed Bandit Problem 3. Simple-average Action-value Methods 4. A simple Bandit Algorithm 5. Gradient Bandit Algorithm 6. Summary
  • 4. 1. Why do you need to know Multi-armed Bandits(MAB)?
  • 5. 1. Why do you need to know MAB? - Reinforcement Learning(RL) uses training information that โ€˜Evaluateโ€™(โ€˜Instructโ€™ X) the actions - Evaluative feedback indicates โ€˜How good the action taken wasโ€™ - Because of simplicity, โ€˜Nonassociativeโ€™ one situation - Most prior work involving evaluative feedback - โ€˜Nonassociativeโ€™, โ€˜Evaluative feedbackโ€™ -> MAB - In order to Introduce basic learning methods in later chapters
  • 6. 1. Why do you need to know MAB? In my opinion, - I think we canโ€™t seem to know RL without knowing MAB. - MAB deal with โ€˜Exploitation & Explorationโ€™ of the core ideas in RL. - In the full reinforcement learning problem, MAB is always used. - In every profession, MAB is very useful.
  • 7. 2. A k-armed Bandit Problem
  • 8. 2. A k-armed Bandit Problem Do you know what MAB is?
  • 9. Do you know what MAB is? Source : Multi-Armed Bandit Image source : https://brunch.co.kr/@chris-song/62 - Slot Machine -> Bandit - Slot Machineโ€™s lever -> Armed - N slot Machine ๏ƒ  Multi-armed Bandits
  • 10. Do you know what MAB is? Source : Multi-Armed Bandit Image source : https://brunch.co.kr/@chris-song/62 Among the various slot machines, which slot machine should I put my money on and lower the lever?
  • 11. Do you know what MAB is? Source : Multi-Armed Bandit Image source : https://brunch.co.kr/@chris-song/62 How can you make the best return on your investment?
  • 12. Do you know what MAB is? Source : Multi-Armed Bandit Image source : https://brunch.co.kr/@chris-song/62 MAB is a algorithm created to optimize investment in slot machines
  • 13. 2. A k-armed Bandit Problem A K-armed Bandit Problem
  • 14. A K-armed Bandit Problem Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view t โ€“ Discrete time step or play number ๐ด ๐‘ก - Action at time t ๐‘…๐‘ก - Reward at time t ๐‘žโˆ— ๐‘Ž โ€“ True value (expected reward) of action a
  • 15. A K-armed Bandit Problem Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view In our k-armed bandit problem, each of the k actions has an expected or mean reward given that that action is selected; let us call this the value of that action.
  • 17. 3. Simple-average Action-value Methods Simple-average Method
  • 18. Simple-average Method Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ๐‘„๐‘ก(๐‘Ž) converges to ๐‘žโˆ—(๐‘Ž)
  • 19. 3. Simple-average Action-value Methods Action-value Methods
  • 20. Action-value Methods Action-value Methods - Greedy Action Selection Method - ๐œ€-greedy Action Selection Method - Upper-Confidence-Bound(UCB) Action Selection Method
  • 21. Action-value Methods Action-value Methods - Greedy Action Selection Method - ๐œ€-greedy Action Selection Method - Upper-Confidence-Bound(UCB) Action Selection Method
  • 22. Greedy Action Selection Method ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘Ž๐‘ฅ ๐‘Ž ๐‘“(๐‘Ž) - a value of ๐‘Ž at which ๐‘“(๐‘Ž) takes its maximal value Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 23. Greedy Action Selection Method Greedy action selection always exploits current knowledge to maximize immediate reward Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 24. Greedy Action Selection Method Greedy Action Selection Methodโ€™s disadvantage Is it a good idea to select greedy action, exploit that action selection and maximize the current immediate reward?
  • 25. Action-value Methods Action-value Methods - Greedy Action Selection Method - ๐œ€-greedy Action Selection Method - Upper-Confidence-Bound(UCB) Action Selection Method
  • 26. ๐œ€-greedy Action Selection Method Exploitation is the right thing to do to maximize the expected reward on the one step, but Exploration may produce the greater total reward in the long run.
  • 27. ๐œ€-greedy Action Selection Method Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ๐œ€ โ€“ probability of taking a random action in an ๐œ€-greedy policy
  • 28. ๐œ€-greedy Action Selection Method Exploitation Exploration Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 29. Action-value Methods Action-value Methods - Greedy Action Selection Method - ๐œ€-greedy Action Selection Method - Upper-Confidence-Bound(UCB) Action Selection Method
  • 30. Upper-Confidence-Bound(UCB) Action Selection Method ln ๐‘ก - natural logarithm of ๐‘ก ๐‘๐‘ก(๐‘Ž) โ€“ the number of times that action ๐‘Ž has been selected prior to time ๐‘ก ๐‘ โ€“ the number ๐‘ > 0 controls the degree of exploration Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 31. Upper-Confidence-Bound(UCB) Action Selection Method Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view The idea of this UCB action selection is that The square-root term is a measure of the uncertainty(or potential) in the estimate of ๐‘Žโ€™s value The probability that the slot machine may be the optimal slot machine
  • 32. Upper-Confidence-Bound(UCB) Action Selection Method UCB Action Selection Methodโ€™s disadvantage UCB is more difficult than ๐œ€-greedy to extend beyond bandits to the more general reinforcement learning settings One difficulty is in dealing with nonstationary problems Another difficulty is dealing with large state spaces Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 33. 4. A simple Bandit Algorithm
  • 34. 4. A simple Bandit Algorithm - Incremental Implementation - Tracking a Nonstationary Problem
  • 35. 4. A simple Bandit Algorithm - Incremental Implementation - Tracking a Nonstationary Problem
  • 36. Incremental Implementation ๐‘„ ๐‘› denote the estimate of ๐‘… ๐‘›โˆ’1โ€˜s action value after ๐‘… ๐‘›โˆ’1 has been selected n โˆ’ 1 times Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 37. Incremental Implementation Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 38. Incremental Implementation Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view ๐‘„ ๐‘›+1 = ๐‘„ ๐‘› + 1 ๐‘› ๐‘… ๐‘› โˆ’ ๐‘„ ๐‘› holds even for ๐‘› = 1, obtaining ๐‘„2 = ๐‘…1 for arbitrary ๐‘„1
  • 39. Incremental Implementation $Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\left(R_n + (n-1)\,Q_n\right) = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$
  • 40. Incremental Implementation This update computes each new average from the previous average and the latest reward, so it requires memory only for $Q_n$ and $n$, and only a small computation per new reward.
  • 41. Incremental Implementation The step size $\frac{1}{n}$ changes from step to step (as opposed to being constant), and this sample-average method is appropriate only for stationary problems.
  • 42. Incremental Implementation The update occurs in a general form: NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
  • 43. Incremental Implementation The expression [Target − OldEstimate] is an error in the estimate. The target is presumed to indicate a desirable direction in which to move, though it may be noisy.
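In code, the incremental sample-average update is one line per reward; a minimal sketch (names illustrative, `Q` and `N` mutated in place), with the error term $R_n - Q_n$ made explicit:

```python
def incremental_update(Q, N, action, reward):
    """NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate],
    with StepSize = 1/n, which yields the sample average."""
    N[action] += 1                  # n = number of times action taken
    step_size = 1.0 / N[action]
    error = reward - Q[action]      # [Target - OldEstimate]
    Q[action] += step_size * error
```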
  • 44. 4. A simple Bandit Algorithm - Incremental Implementation - Tracking a Nonstationary Problem
  • 45. Tracking a Nonstationary Problem
  • 46. Tracking a Nonstationary Problem Why do you think the step size should be changed from $\frac{1}{n}$ to $\alpha$?
  • 47. Tracking a Nonstationary Problem Why change from $\frac{1}{n}$ to $\alpha$? We often encounter RL problems that are effectively nonstationary. In such cases it makes sense to give more weight to recent rewards than to long-past rewards. One of the most popular ways of doing this is to use a constant step-size parameter $\alpha \in (0, 1]$.
  • 48. Tracking a Nonstationary Problem $Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$, where the step-size parameter $\alpha \in (0, 1]$ is constant.
  • 49. Tracking a Nonstationary Problem $Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right] = \alpha R_n + (1-\alpha)\,Q_n$, and applying the same rule one step earlier, $Q_n = \alpha R_{n-1} + (1-\alpha)\,Q_{n-1}$.
  • 50. Tracking a Nonstationary Problem
  • 51. Tracking a Nonstationary Problem Expanding the recursion all the way down gives an exponential recency-weighted average: $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{n-i} R_i$, so the weight given to reward $R_i$ decays exponentially with the number of intervening rewards.
  • 52. Tracking a Nonstationary Problem Sequences of step-size parameters that satisfy the classical convergence conditions often converge very slowly or need considerable tuning in order to obtain a satisfactory convergence rate. Thus, in practice, step-size parameters must be tuned carefully.
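Tracking a nonstationary target then requires only replacing the decaying step size $\frac{1}{n}$ with a constant $\alpha$; a minimal sketch (α = 0.1 is just an illustrative value):

```python
def constant_alpha_update(Q, action, reward, alpha=0.1):
    """Q_{n+1} = Q_n + alpha * [R_n - Q_n]: an exponential
    recency-weighted average; a reward i steps in the past
    carries weight alpha * (1 - alpha)**i."""
    Q[action] += alpha * (reward - Q[action])
```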
  • 53. 5. Gradient Bandit Algorithm
  • 54. 5. Gradient Bandit Algorithm In addition to the simple bandit algorithm above, there is another approach: using a gradient method as the bandit algorithm.
  • 55. 5. Gradient Bandit Algorithm We consider learning a numerical preference for each action $a$, which we denote $H_t(a)$. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. In other words, a large preference $H_t(a)$ does not by itself imply a large reward; however, a large reward can increase the preference $H_t(a)$.
  • 56. 5. Gradient Bandit Algorithm The action probabilities are determined according to a soft-max distribution (i.e., Gibbs or Boltzmann distribution): $\pi_t(a) \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
  • 57. 5. Gradient Bandit Algorithm $\pi_t(a)$ – the probability of selecting action $a$ at time $t$. Initially all action preferences are the same (e.g., $H_1(a) = 0$ for all $a$) so that all actions have an equal probability of being selected.
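A minimal sketch of the soft-max policy over preferences (the array name `H` is illustrative); subtracting `max(H)` before exponentiating is a standard numerical-stability trick and does not change the resulting probabilities:

```python
import numpy as np

def softmax_policy(H):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    z = np.exp(H - np.max(H))   # shift for numerical stability
    return z / z.sum()

print(softmax_policy(np.zeros(4)))  # equal preferences -> [0.25 0.25 0.25 0.25]
```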
  • 58. 5. Gradient Bandit Algorithm There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting action $A_t$ and receiving the reward $R_t$, the action preferences are updated: $H_{t+1}(A_t) = H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,(1 - \pi_t(A_t))$ for the selected action $A_t$, and $H_{t+1}(a) = H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a)$ for all non-selected actions $a \neq A_t$.
  • 59. 5. Gradient Bandit Algorithm 1) What does $\bar{R}_t$ mean? 2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
  • 60. 5. Gradient Bandit Algorithm 1) What does $\bar{R}_t$ mean? 2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
  • 61. What does $\bar{R}_t$ mean? $\bar{R}_t \in \mathbb{R}$ is the average of all the rewards. The $\bar{R}_t$ term serves as a baseline. If the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased, and if the reward is below the baseline, then the probability is decreased. The non-selected actions move in the opposite direction.
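Putting the update rule and the baseline together, here is a minimal sketch of one gradient-bandit step (`H` is a NumPy float array of preferences; names such as `avg_reward` for $\bar{R}_t$ are illustrative, and maintaining the running average itself is left to the caller):

```python
import numpy as np

def gradient_bandit_step(H, action, reward, avg_reward, alpha=0.1):
    """H(A_t) += alpha*(R_t - Rbar)*(1 - pi(A_t));
       H(a)   -= alpha*(R_t - Rbar)*pi(a)  for a != A_t."""
    pi = np.exp(H - np.max(H))
    pi /= pi.sum()                 # soft-max action probabilities
    delta = reward - avg_reward    # reward relative to the baseline
    H -= alpha * delta * pi        # push every action down by pi(a)...
    H[action] += alpha * delta     # ...net effect on A_t: +(1 - pi(A_t))
```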
  • 62. 5. Gradient Bandit Algorithm 1) What does $\bar{R}_t$ mean? 2) What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
  • 63. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean? - Stochastic approximation to gradient ascent in the gradient bandit algorithm - Expected reward - Expected reward by the law of total expectation: $\mathbb{E}[R_t] = \mathbb{E}\big[\,\mathbb{E}[R_t \mid A_t]\,\big]$
  • 64. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
  • 65. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
  • 66. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean? The gradient sums to zero over all the actions, $\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$ – as $H_t(a)$ is changed, some actions' probabilities go up and some go down, but the sum of the changes must be zero because the sum of the probabilities is always one.
  • 67. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean?
  • 68. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean? $\mathbb{E}[R_t] = \mathbb{E}\big[\,\mathbb{E}[R_t \mid A_t]\,\big]$
  • 69. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean? Please refer to page 40 in the linked reference slides.
  • 70. What does $(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ mean? ∴ We can substitute a sample of the expectation above for the performance gradient in $H_{t+1}(a) \doteq H_t(a) + \alpha\,\frac{\partial\,\mathbb{E}[R_t]}{\partial H_t(a)}$
  • 72. 6. Summary In this chapter, 'Exploitation & Exploration' is the core idea.
  • 73. 6. Summary Action-value Methods - Greedy Action Selection Method - ε-greedy Action Selection Method - Upper-Confidence-Bound(UCB) Action Selection Method
  • 74. 6. Summary A simple bandit algorithm : Incremental Implementation
  • 76. 6. Summary A simple bandit algorithm : Tracking a Nonstationary Problem
  • 78. 6. Summary Gradient Bandit Algorithm
  • 79. 6. Summary A parameter study of the various bandit algorithms