Introduction to Reinforcement Learning, part II: Basic tabular methods
This is the second presentation in a three-part series covering the basics of Reinforcement Learning (RL).
In this presentation, we introduce some more building blocks, such as generalized policy iteration, bandits and exploration, epsilon-greedy policies, and temporal-difference methods.
We introduce basic model-free methods that use a tabular value representation: on- and off-policy Monte Carlo, Sarsa, Expected Sarsa, and Q-learning.
The algorithms are illustrated using simplified blackjack as the environment.
2. Agenda
• Last time: Part I
• Intro: Reinforcement learning as an ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: value iteration, policy iteration
3. Agenda
• This time: Part II
• Some more building blocks: GPI, bandits, exploration, TD updates,…
• Basic model-free methods using tabular value representation
• …illustrated on Blackjack: Monte Carlo on- vs off-policy; Sarsa, Expected Sarsa, Q-learning
• Next time: Part III
• Value function approximation-based methods
• Semi-gradient descent with Sarsa and different linear representations: polynomial, tile coding, Fourier cosine basis
• Batch updates: LSPI-LSTDQ
4. Recap – we’ll be briefly revisiting the following concepts
• RL problem setting; Agent and environment
• Markov Decision Process, MDP
• Policy
• Policy Iteration
• Discounted return, Utility
• Value functions: state-value function, action-value function
• Bellman equations, update rules (backups)
5. RL problem setting: Agent and environment
[Figure: the agent–environment interaction loop. The agent performs an action; it then observes the environment state* and a reward.]
*) This would be a fully observable environment
6. RL problem setting: Agent and environment
[Figure: the agent–environment interaction loop, annotated with what the agent maintains internally.]
• The agent models the environment as a Markov Decision Process
• The agent creates an internal representation of state
• The agent maintains a policy that defines what action to take when in a state
• The agent approximates the value function of each state and action
7. Markov Decision Process (MDP)
A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, with
• States $\mathcal{S}$
• Actions $\mathcal{A}$
• Transition probabilities $\mathcal{P}(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}$
• Rewards: reward function $\mathcal{R}(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
• Discount factor $\gamma$, with $0 \le \gamma \le 1$
• When we know all of these, we have a fully defined MDP
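To make the tuple concrete, here is a minimal sketch (the dictionary layout and names are my own, not from the slides) of how a fully defined tabular MDP could be stored:

```python
# Minimal sketch of a fully defined tabular MDP (illustrative names, not from the slides).
# Transitions map (state, action) to a list of (probability, next_state, reward) triples.
mdp = {
    "states": ["s0", "s1", "goal"],
    "actions": ["left", "right"],
    "transitions": {
        ("s0", "left"):  [(1.0, "s0", 0.0)],
        ("s0", "right"): [(0.8, "s1", 0.0), (0.2, "s0", 0.0)],
        ("s1", "left"):  [(1.0, "s0", 0.0)],
        ("s1", "right"): [(1.0, "goal", 1.0)],
        # "goal" is terminal: no outgoing transitions.
    },
    "gamma": 0.9,
}
```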
8. From fully defined MDPs to model-free methods
Methods for fully defined MDPs:
• We know all of the states, actions, transition probabilities, rewards, and discount factor
• We can use dynamic programming algorithms, such as policy iteration, to find the optimal value function and policy
• Algorithms typically use full sweeps of the state space to update values
Model-based methods:
• We know states, actions, and discount factor (from the problem definition)
• A model gives a prediction of the next state and reward when taking an action in a state
• So, we use a model to estimate transition probabilities and rewards
• We can use the model-augmented MDP with DP methods
• Or learn our model from experience, as in model-free methods
• Or create simulated experience based on our model, for use with model-free methods
Model-free methods:
• We know states, actions, and discount factor (from the problem definition)
• We don’t know transition probabilities or rewards
• We don’t have a model, and don’t need one
• We employ an agent to explore the environment, and we use the agent’s direct experience to update our estimates of the value function and policy
• We can use episodes or individual action steps to update values
• We need to have some approach to selecting which states and actions the agent explores
9. We discussed Policy iteration – a DP algorithm for fully defined MDPs
• Perform policy evaluation - evaluate
state values under current policy*
• Improve the policy by determining the
best action in each state using the
state-value function determined
during policy evaluation step
• Stop when policy no longer changes
• Each improvement step improves the
value function and when improvement
stops, the optimal policy and value
function have been found
*) using either iterative policy evaluation or solving the linear system
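As a reminder of how the two steps interact, here is a compact sketch of policy iteration (assuming the dictionary-based MDP layout sketched earlier; iterative policy evaluation with a fixed tolerance):

```python
def policy_iteration(mdp, tol=1e-8):
    """Policy iteration for a fully defined tabular MDP (illustrative sketch)."""
    states, actions, gamma = mdp["states"], mdp["actions"], mdp["gamma"]
    P = mdp["transitions"]
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}

    def q(s, a):
        # One-step lookahead: expected reward plus discounted value of successor states.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P.get((s, a), []))

    while True:
        # Policy evaluation: repeat Bellman expectation updates under the current policy.
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated value function.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                     # policy no longer changes: optimal policy found
            return policy, V
```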
10. Towards Generalized Policy Iteration
source: David Silver: UCL Course on RL
The process of making a new policy that improves on
an original policy, by making it greedy with respect to
the value function of the original policy, is called
policy improvement
Policy improvement theorem: if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, then $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$
11. Blackjack as an MDP
• States: current set of cards for dealer and
player
• Actions: HIT – more cards, STAND – no
more cards
• Transition probabilities: A stochastic
environment due to randomly drawing
cards – exact probabilities difficult to
determine, though
• Rewards: At the end of episode: +1 for
winning, -1 for losing, 0 for a draw; and 0
otherwise
• Discounting: not used, $\gamma = 1$
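For reference, a small sketch (my own naming, not from the slides) of the state triple and the end-of-episode reward convention described above:

```python
# Illustrative sketch: a blackjack state on these slides is the triple
# (dealer_showing, player_sum, usable_ace), e.g. (5, 13, False).
example_state = (5, 13, False)

def terminal_reward(player_sum, dealer_sum):
    """Reward given only at the end of an episode: +1 win, -1 loss, 0 draw."""
    if player_sum > 21:
        return -1                                   # player busts
    if dealer_sum > 21 or player_sum > dealer_sum:
        return +1                                   # dealer busts, or player is closer to 21
    if player_sum == dealer_sum:
        return 0                                    # draw
    return -1                                       # dealer is closer to 21
```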
12. Blackjack as an MDP - example
[Figure: the dealer's and player's cards for the situation "dealer: 5, player: 13, no ace", with the chosen action HIT.]
13. Blackjack as an MDP
[Figure: the same situation "dealer: 5, player: 13, no ace" now written as the state tuple (5, 13, no ace), with the action HIT.]
14. Blackjack as an MDP - example
[Figure: example transitions. From state (5, 13, no ace), HIT gives reward 0 and leads here to state (5, 20, no ace); from (5, 20, no ace), STAND ends the episode with the dealer going bust, giving reward +1.]
• If in the previous state the player's sum is 13, then after HIT there are 8 possible states where the player's sum is 14, 15, …, 21
• If in the previous state the player's sum is 20, then after HIT the only remaining state is 21; all other cards lead to a loss
15. Policy
• Policy defines how the agent behaves
in an MDP environment
• Policy is a mapping from each state
to an action
• A deterministic policy always
returns the same action for a state
• A stochastic policy gives a
probability for an action in a state
[Figure: one possible deterministic policy for the maze]
16. Flipsism (Höpsismi) as a policy
• A random policy
• For MDPs with two actions in each
state
• Equal probability of 0.5 for choosing either action
17. Multi-armed bandits
– a slightly more formal approach to stochastic policies
• We can choose from four actions;
a, b, c or d
• Whenever we choose an action,
we receive a reward with an
unknown probability distribution
• We have now chosen an action
six times, a and b twice, c and d
once
• We have received the rewards
shown
• We want to maximize the reward
we receive over time
• What action would you select
next, why?
This would be a 4-armed bandit
18. Multi-armed bandits
• Now we have selected each
action six times and the
reward situation is as
shown
• How would you continue
from here? Why?
19. Exploration vs exploitation
• Exploitation: we exploit the
information we already have to
maximize reward
• Maintain estimates of the
values of actions
• Select the action whose
estimated value is greatest
• This is called the greedy
action
• Exploration: we choose some other action than the greedy one, to gain information and to improve our estimates
We have now chosen an action 40 000 times: 10 000 times each for a, b, c, and d.
We can estimate that we have lost about 85 000 in value compared to the optimal strategy of choosing b every time.
20. Epsilon-greedy policy
• An ε-greedy policy is a strategy to balance exploration and exploitation
• We choose the greedy action with probability $1-\varepsilon$, and a random action with probability $\varepsilon$
• For this to work in theory, all states are to be visited infinitely often and $\varepsilon$ needs to decrease towards zero, so that the policy converges to the greedy policy*
• In practice, it might be enough to decrease epsilon so that the policy approaches the greedy policy
• A simple, often proposed strategy is to decrease epsilon as $\varepsilon_k = 1/k$, but this might be a bit fast in practice
*) GLIE: Greedy in the Limit with Infinite Exploration
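As a small sketch of what this looks like in code (the dictionary-based Q table and function names are my own assumptions, not from the slides):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Choose a random action with probability epsilon, otherwise the greedy action.
    Q is a dict mapping (state, action) pairs to estimated values."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# One simple decay schedule of the kind mentioned above: epsilon_k = 1 / k,
# e.g. with k the visit count of the current state (often a bit fast in practice).
def epsilon_schedule(k):
    return 1.0 / max(k, 1)
```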
21. ε-greedy method for bandits
• It was said that ”we maintain estimates of the values of actions”
• For this, we use incrementally computed sample averages:
  $N(A) \leftarrow N(A) + 1, \qquad Q(A) \leftarrow Q(A) + \frac{1}{N(A)}\big(R - Q(A)\big)$
• And use an ε-greedy policy for selecting an action
Source: Sutton-Barto 2nd ed
For calculating incremental mean, we maintain two parameters:
N, the current visit count for each action (selecting a bandit) and
Q, the current estimated value for the action
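In code, the incremental sample-average update could look like this (a sketch; the dictionary-based N and Q follow the two parameters named above):

```python
def bandit_update(Q, N, action, reward):
    """Incremental sample-average update for one bandit arm.
    N[a]: how many times action a has been selected; Q[a]: its current value estimate."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]   # Q <- Q + (1/N) * (reward - Q)

# Usage: start from zero estimates and update after each pull.
Q = {a: 0.0 for a in "abcd"}
N = {a: 0 for a in "abcd"}
bandit_update(Q, N, "b", 2.5)
```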
22. General update rule for RL
• Note the format of the update rule in the method on the previous slide
• We can consider the form
  $\text{NewEstimate} \leftarrow \text{OldEstimate} + \alpha\,\big(\text{Target} - \text{OldEstimate}\big)$
  as a general update rule, where Target represents our current target value, $(\text{Target} - \text{OldEstimate})$ is the error of our current estimate, and $\alpha$ is a decreasing step-size or learning-rate parameter
• Expect to see more of these soon…
23. Discounted return, utility
• An agent exploring the MDP environment would observe a sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$
• Discounted return, or utility, from time step $t$ onwards is the sum of discounted rewards received:
  $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
24. The state-value function
• If the agent was following a policy $\pi$, then in each state $s$, the agent would select the action defined by that policy
• The state-value function of a state $s$ under policy $\pi$, denoted $v_\pi(s)$, is the expected discounted return when following the policy from state $s$ onwards:
  $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$
• The recursive relationship between the value of a state and its successor states is
called the Bellman expectation equation for state-value function
25. The action-value function
• The action-value function $q_\pi(s, a)$ for policy $\pi$ defines the expected utility when starting in state $s$, performing action $a$, and following the policy thereafter:
  $q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$
26. State-action value function
[Figure: the example blackjack states annotated with learned action values.]
• State-action value when the state is (5, 13, no ace) and the action is HIT: $Q(S, A) \approx -0.255$
• State-action value when the state is (5, 20, no ace) and the action is STAND: $Q(S, A) \approx 0.669$
27. Greedy policy from action-value function
• To derive the policy from the state-value function $v(s)$, we need to know the transition probabilities and rewards:
  $\pi(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,\big[r(s, a, s') + \gamma\, v(s')\big]$
• But we can extract the policy directly from the action-value function $q(s, a)$:
  $\pi(s) = \arg\max_a q(s, a)$
• So, working with $q(s, a)$ enables us to be model-free
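A one-line illustration of this point (a sketch, assuming the dictionary-based Q table used in the other examples):

```python
def greedy_policy_from_q(Q, states, actions):
    """Extract a greedy policy directly from a tabular action-value function.
    No transition probabilities or rewards are needed, i.e. this is model-free."""
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}
```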
28. First RL algorithm: Monte Carlo
• Sample a full episode from the MDP using an ε-greedy policy
• For each state-action pair, estimate its value using average sample returns
• Maintain visit counts for each state-action pair
• Update value estimates based on an incremental average of the observed returns
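A sketch of the update for one sampled episode (every-visit variant for brevity; the episode format and dictionary-based tables are my own assumptions):

```python
def mc_control_update(Q, N, episode, gamma=1.0):
    """Every-visit Monte Carlo update for one episode.
    `episode` is a list of (state, action, reward) tuples generated with an
    epsilon-greedy policy derived from Q; N counts visits to each (state, action)."""
    G = 0.0
    for state, action, reward in reversed(episode):          # accumulate returns backwards
        G = gamma * G + reward
        N[(state, action)] = N.get((state, action), 0) + 1
        q = Q.get((state, action), 0.0)
        Q[(state, action)] = q + (G - q) / N[(state, action)]   # incremental average
```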
29. One more concept: on-policy vs off-policy
• On-policy learning: apply a policy to choose actions and learn the value-function
for that policy
• Monte Carlo algorithm presented in the previous slide is an on-policy method
• In practice, we start with a stochastic policy to sample all possible state-action
pairs
• and gradually adjust the policy towards a deterministic optimal policy (GLIE?)
• Off-policy learning: apply one policy to choose actions, but learn the value function for some other policy
• Typically in off-policy learning, we apply a behavior policy that allows for exploration, and learn about an optimal target policy
30. Towards Off-policy Monte Carlo
• To use returns generated by a behavior policy $b$ to evaluate a target policy $\pi$, we apply importance sampling, a technique to estimate expected values for one distribution using samples from another
• The probability of observing a sequence of actions and states from time $t$ onwards under policy $\pi$ is
  $\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$
• We form importance-sampling ratios, the ratio of the probabilities of the sequence under the target and behavior policies; the unknown transition probabilities cancel out:
  $\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$
• And apply those to weight our observed returns
31. Off-policy Monte Carlo
[The slide shows the Sutton-Barto off-policy Monte Carlo control pseudocode, annotated as follows:]
• Generate an episode with the behavior policy
• Iterate backwards over the episode
• Accumulate the discounted returns
• MC update, now weighted with the importance-sampling ratio
• Policy improvement: greedy with respect to the value function
• Incremental update of the importance-sampling weight
Source: Sutton-Barto 2nd ed
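A sketch of the same structure in code (weighted importance sampling; the episode format and helper names are assumptions, not from the slides):

```python
def off_policy_mc_update(Q, C, target_policy, episode, behavior_prob, actions, gamma=1.0):
    """Off-policy Monte Carlo control update for one episode with weighted importance
    sampling, following the overall structure of the Sutton-Barto pseudocode.
    `episode`: list of (state, action, reward) generated by the behavior policy.
    `behavior_prob(s, a)`: probability the behavior policy assigns to taking a in s.
    C accumulates the importance-sampling weights per (state, action)."""
    G, W = 0.0, 1.0
    for state, action, reward in reversed(episode):            # iterate backwards
        G = gamma * G + reward                                 # accumulate discounted return
        C[(state, action)] = C.get((state, action), 0.0) + W
        q = Q.get((state, action), 0.0)
        Q[(state, action)] = q + (W / C[(state, action)]) * (G - q)   # weighted IS update
        target_policy[state] = max(actions, key=lambda a: Q.get((state, a), 0.0))  # greedy improvement
        if action != target_policy[state]:
            break                                              # remaining ratios are zero
        W /= behavior_prob(state, action)                      # incremental weight update
```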
32. Temporal-difference methods
• Recall our general update rule from a couple of slides back
• Monte Carlo methods use the return from a full episode as the learning target
• In temporal-difference methods, we instead use a one-step sample return that bootstraps: the immediate reward plus the discounted value estimate of the next state
• We can therefore apply temporal-difference methods with incomplete sequences, or when we don’t have terminating episodes
If one had to identify one idea as central and novel to reinforcement learning, it would
undoubtedly be temporal-difference (TD) learning
- Sutton and Barto
34. First TD algorithm: Sarsa
• Generate samples from the MDP using an ε-greedy policy
• For each step, update the state-action value towards the one-step TD target:
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$
• TD target: $R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1})$
• TD error: $R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$
• Learning-rate parameter: $\alpha$
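Put together, one Sarsa learning episode could look like this (a sketch; the `env_reset`/`env_step` hooks are assumptions standing in for the blackjack environment):

```python
import random

def sarsa_episode(env_reset, env_step, Q, actions, alpha, epsilon, gamma=1.0):
    """Run one episode of Sarsa. `env_reset()` returns an initial state;
    `env_step(state, action)` returns (next_state, reward, done)."""
    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    state = env_reset()
    action = eps_greedy(state)
    done = False
    while not done:
        next_state, reward, done = env_step(state, action)
        next_action = eps_greedy(next_state)
        target = reward if done else reward + gamma * Q.get((next_state, next_action), 0.0)  # TD target
        td_error = target - Q.get((state, action), 0.0)                                      # TD error
        Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error                  # step size alpha
        state, action = next_state, next_action
```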
35. Three TD algorithms in just one slide
• Sarsa samples the next action from the current (ε-greedy) policy:
  target $= R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1})$
• Q-learning samples the environment but bootstraps on the greedy action in the next state:
  target $= R_{t+1} + \gamma\, \max_a Q(S_{t+1}, a)$
• Expected Sarsa samples the environment but uses the expectation over the policy’s action probabilities:
  target $= R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$
36. Q-learning again
• Considered as “one of the early
breakthroughs in RL”
• published by Watkins in 1989
• It is an off-policy algorithm that directly
approximates the optimal action-value
function
• State-action pairs are selected for evaluation by an ε-greedy behavior policy
• But the next action, and thus the next state-action value used in the update, is replaced by the greedy action for the next state
Source: Sutton-Barto 2nd ed
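For comparison with the Sarsa sketch above, the corresponding Q-learning step could look like this (same assumed dictionary-based Q table):

```python
def q_learning_update(Q, state, action, reward, next_state, done, actions, alpha, gamma=1.0):
    """One Q-learning update: the action actually taken came from the (epsilon-greedy)
    behavior policy, but the bootstrap uses the greedy value of the next state."""
    best_next = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next                         # off-policy TD target
    td_error = target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
```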
37. Simulation experiments: Reference result
[Figure panels: greedy policy; action-value function; difference in value between actions]
Monte Carlo off-policy; 100 000 000 episodes; random behavior policy; no discounting
38. Monte Carlo
On-policy
• 100 000 learning episodes
• Decreasing epsilon according to state-action visit count
• Initial epsilon
39. Learning results: Action value function
So, this illustrates Monte Carlo on-policy
after 100 000 learning episodes
40. Battle of TD-agents
• Participating agents:
• Monte Carlo on-policy as the episodic reference, decreasing epsilon
• Sarsa, on-policy, decreasing epsilon
• Expected Sarsa, run on-policy, decreasing epsilon
• Expected Sarsa, run off-policy, random behavior policy
• Q-learning, random behavior policy
• 100 000 learning episodes for each
• Schedule for alpha: exponential decay from an initial 0.2 towards a target of 0.01, reached at about 90 000 rounds
• Schedule for epsilon: scaled by state-action visit count
41. MSE and wrong action calls*
*) When compared to reference case
43. So…
• We have covered basic model-free RL
algorithms
• Algorithms that learn from episodes or
from TD-updates
• That apply GPI: they work with value functions, in particular the state-action value function, and derive the corresponding policy from that
• That store the values of individual state-action pairs, i.e. use a tabular value representation