Introduction to Reinforcement Learning, part III: Basic approximate methods
This is the final presentation in a three-part series covering the basics of Reinforcement Learning (RL).
In this presentation, we introduce value function approximation and cover three different approaches to generating features for linear models.
We then take a sidestep to cover stochastic gradient descent in some detail before returning to introduce semi-gradient descent for RL. We also briefly cover a batch method as an alternative to episodic methods.
We also discuss the implementation of the RL algorithms. For further details, we refer to GitHub repositories with the source code of the implementation as well as Jupyter notebooks visualizing the simulation results.
2. Agenda
• First time: Part I
• Intro: Reinforcement learning as a ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman
equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: Value iteration, Policy
iteration
• Last time: Part II
• Some more building blocks: GPI, bandits, exploration, TD updates,…
• Basic model-free methods using tabular value representation
• …illustrated on Blackjack: Monte Carlo on- vs off-policy; Sarsa, Expected Sarsa, Q-learning
3. Agenda
• This time: Part III
• Methods based on value function approximation
• Examples of linear representations; polynomial, tile coding, Fourier cosine basis
• Stochastic gradient descent and semi-gradient descent
• Batch updates LSPI-LSTDQ
• Python implementation
• Results from simulation experiments
4. Recap: Markov Decision Process (MDP)
• A Markov Decision Process is a tuple (S, A, P, R, γ), with states S, actions A, transition probabilities P, rewards R and a discount factor γ
• If we know all of these, we have a fully defined MDP and can apply Dynamic Programming (DP) (as covered in Part I)
• If we don't, we can use a model to augment the MDP, in practice the transition probabilities and rewards; this would be a model-based approach
• Or, we apply the agent's direct experience with the environment to update our estimates of the value function; this would be a model-free approach
5. Recap: Discounted return, utility
• An agent exploring the MDP environment would observe a sequence S_0, A_0, R_1, S_1, A_1, R_2, S_2, …
• Discounted return, or utility, from time step t onwards is the sum of discounted rewards received:
G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^∞ γ^k · R_{t+k+1}
6. Recap: The state-value function
• If the agent is following a policy π, then in each state s the agent selects the action a = π(s) defined by that policy
• The state-value function of a state s under policy π, denoted v_π(s), is the expected discounted return when following the policy from state s onwards:
v_π(s) = E_π[ G_t | S_t = s ] = E_π[ R_{t+1} + γ·v_π(S_{t+1}) | S_t = s ]
• Here we apply the Bellman expectation equation
7. Recap: The action-value function
• The action-value function for policy π defines the expected utility when starting in state s, performing action a and following the policy thereafter:
q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
8. Recap: Greedy policy from q
• To derive the greedy policy from the state-value function v, we need to know the transition probabilities and rewards:
π(s) = argmax_a Σ_{s'} P(s'|s, a) · [ R(s, a, s') + γ·v(s') ]
• But we can extract the policy directly from the action-value function q:
π(s) = argmax_a q(s, a)
• So, working with q enables us to be model-free (see the sketch below)
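To make the distinction concrete, here is a minimal Python sketch assuming hypothetical dict-based structures Q, V, P and R (illustrative only, not our actual implementation): extracting a greedy action from q needs no model, while doing so from v requires the transition probabilities and rewards.

```python
def greedy_action_from_q(Q, state, actions):
    """Greedy policy directly from action-values: no model needed."""
    return max(actions, key=lambda a: Q[(state, a)])

def greedy_action_from_v(V, state, actions, P, R, gamma):
    """Greedy policy from state-values: requires the MDP model (P and R)."""
    def expected_value(a):
        return sum(P.get((state, a, s2), 0.0) * (R.get((state, a, s2), 0.0) + gamma * V[s2])
                   for s2 in V)
    return max(actions, key=expected_value)
```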
9. Recap: Monte Carlo – full episodes
• Sample a full episode from the MDP using an ε-greedy policy
• For each state-action pair, estimate the value using average sample returns
• Maintain visit counts for each state-action pair
• Update value estimates using an incremental average of the observed returns (sketched below)
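A minimal sketch of the incremental-average update for one observed return G, assuming dict-based value and visit-count tables (names are ours, not the actual implementation):

```python
from collections import defaultdict

Q = defaultdict(float)   # action-value estimates
N = defaultdict(int)     # visit counts per (state, action) pair

def mc_update(state, action, G):
    """Incremental average of observed returns for one (state, action) pair."""
    N[(state, action)] += 1
    Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
```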
10. Recap: Sarsa – Temporal-difference TD(0)
• Generate samples from the MDP using an ε-greedy policy
• For each sample, update the state-action value using the discounted sample return:
Q(S_t, A_t) ← Q(S_t, A_t) + α·[ R_{t+1} + γ·Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) ]
• TD-target: R_{t+1} + γ·Q(S_{t+1}, A_{t+1})
• TD-error: the difference between the TD-target and the current estimate Q(S_t, A_t)
• α is the learning-rate parameter
12. Tabular methods
• Last time we discussed tabular methods, where action-values are stored in a lookup table structure
• These algorithms read current values from, and update values to, table storage; e.g. the Sarsa update reads the current and next state-action values and then updates the current value (see the sketch below)
• Good: simple, fast, can be used to store other values as well (e.g. visit counts)
• Limitations: large state-action spaces, continuous values, does not generalize
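A minimal sketch of the tabular Sarsa update, assuming a dict-based table Q (illustrative only, not the actual implementation):

```python
from collections import defaultdict

Q = defaultdict(float)   # tabular action-value storage

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """Read the current and next state-action values, then update the current one."""
    td_target = r + gamma * Q[(s_next, a_next)]   # TD-target
    td_error = td_target - Q[(s, a)]              # TD-error
    Q[(s, a)] += alpha * td_error                 # learning-rate-weighted update
```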
13. Value function approximation
• The action-value function is represented with a parametric approximation q̂(s, a, w) ≈ q_π(s, a)
• where w is a d-dimensional vector of weights
• We use x(s, a) to denote the features representing a state-action pair
• If the approximate function is a linear function of the weight vector w, we have q̂(s, a, w) = wᵀ·x(s, a) (sketched below)
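With a linear model, evaluating the approximation is just a dot product between the weight vector and the feature vector; a minimal sketch with a toy feature vector (names are ours):

```python
import numpy as np

def q_hat(x, w):
    """Linear approximation: q_hat(s, a, w) = w . x(s, a)."""
    return np.dot(w, x)

x = np.array([1.0, 0.5, 0.25])   # toy feature vector x(s, a)
w = np.zeros_like(x)             # d-dimensional weight vector
print(q_hat(x, w))               # 0.0 before any learning
```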
14. RL problem setting and value function approximation
• The agent performs an action in the environment and receives a raw signal back
• The agent pre-processes the received observation and creates features to represent the observed state
• The value function of each state and action is approximated from these features
15. Sidenote: Domain knowledge?
• "Demonstrate superhuman performance without human domain knowledge"
• prior knowledge == bad
• "Make money at the casino playing RL-guided Blackjack"
• maximally exploit prior knowledge
Source: Silver et al. 2017, Mastering the game of Go without human knowledge
16. For our case of simple Blackjack
• We represent the state-action pair with four variables:
• total sum of dealer's cards (2–11)
• total sum of player's cards (4–21)
• does the player have a soft, usable ace (True, False)
• action to take (HIT, STAND)
• This gives 560 state-action pairs for a tabular lookup table*
• Or, we can derive feature vectors x(s, a) for approximate methods using these four variables (see the sketch below)
Example hand: dealer 5, player 20, no usable ace
*) Compare this to Backgammon with roughly 10^20 states and Go with roughly 10^170 states
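One possible way to express this representation in code, assuming the variable ranges listed above (names and scaling are illustrative, not the actual implementation); the scaled pair (d, p) is what the feature constructions on the following slides consume:

```python
from typing import NamedTuple

class BlackjackState(NamedTuple):
    dealer_sum: int    # 2..11
    player_sum: int    # 4..21
    usable_ace: bool

HIT, STAND = 0, 1      # the two available actions

def to_unit_interval(state: BlackjackState) -> tuple[float, float]:
    """Scale the dealer and player sums to [0, 1] for the feature constructions."""
    d = (state.dealer_sum - 2) / (11 - 2)
    p = (state.player_sum - 4) / (21 - 4)
    return d, p
```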
17. How to approximate this?
• Each subplot on the right shows the estimated action-values by one of the four variables: dealer's sum, player's sum, soft ace, action
• Now, we need to define a mapping from each state-action pair, i.e. from the values (dealer's sum, player's sum, soft ace, action), to features x(s, a)
• And then approximate the action-value function with a suitable linear model of parameter vector w
Figure: estimated action-values for each of the 560 state-action pairs for simple blackjack
Note: shown values represent our reference result, a test run of 100 000 000 episodes of tabular off-policy Monte Carlo
18. How to approximate this? Start from here…?
• Subplots on the right show the action-values for each of the 560 state-action pairs
• Values are grouped into four groups based on the (ace, action) combination
• The x-axis shows values for the player's cards
• Equal values for the dealer's card (from 2 to 11) are shown connected
Note: shown values represent our reference result, a test run of 100 000 000 episodes of tabular off-policy Monte Carlo
19. Three different value representations
• In the following we will briefly cover three different approaches to generating
features for linear models
• Polynomial approximation
• Fourier-cosine basis
• Tile Coding
• Notable omissions:
• Radial Basis Functions, i.e. Gaussians with fixed means and variances, appropriately positioned (we have not tried them, but yes, they would make sense)
• Neural Networks (as they are non-linear)
20. First: Polynomial approximation
• We want to approximate the action-value function using a linear model
• using the variables we have available (from preprocessing the input signal): dealer's sum d, player's sum p, soft ace indicator s, action indicator a
• where the features are terms of an nth-degree polynomial, of the form d^i · p^j · s^k · a^l
• and the exponents i, j are from the set {0, 1, …, n} while the indicator exponents k, l are either 0 or 1
• For order n = 4, this would give us 5 · 5 · 2 · 2 = 100 terms (generating the full set is sketched below). How to select a meaningful subset?
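A minimal sketch of generating the full 100-term feature vector before any subset selection (function name and argument order are ours):

```python
import numpy as np
from itertools import product

def polynomial_features(d, p, s, a, n=4):
    """All terms d^i * p^j * s^k * a^l with i, j in {0..n} and k, l in {0, 1}."""
    return np.array([d**i * p**j * s**k * a**l
                     for i, j, k, l in product(range(n + 1), range(n + 1), (0, 1), (0, 1))])

x = polynomial_features(d=0.33, p=0.76, s=1, a=0)
print(x.shape)   # (100,) for n = 4, before AIC-based subset selection
```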
21. Experiment: Model selection for polynomial features
• Model search against our reference result obtained using a tabular method
• Forward-backward search applying the Akaike information criterion (AIC):
• start with a constant
• test adding the remaining terms one at a time
• add the one that improves AIC the most
• test if removing terms improves AIC
• repeat and stop when AIC no longer improves
• Using the resulting polynomial model, we apply approximate RL and compare the result to the least-squares (LS) fit
Diagram: the model search (AIC) over polynomial model candidates against the reference result yields the selected polynomial features; these feed the RL method, and the resulting RL model is compared against the LS fit
22. Selected polynomial: from c to d⁴p³sa
• The AIC search gives us a polynomial with 27 terms as shown below
• Subplots on the right show the least-squares fit of this polynomial model against the reference result
• Reference values are shown as gray dotted lines, colored lines show the estimated action-values (dealer's card from 2 to 11)
23. Second: Fourier cosine basis
• A Fourier series approximates a periodic function as a weighted sum of sine and cosine basis functions of different frequencies
• Within a half-interval, the sine terms can be dropped, and the function can be represented as closely as desired with just the cosine terms
• So, we select our period as 2 and restrict the features to the half-interval [0, 1]
• For the one-dimensional case, we get the feature terms x_i(s) = cos(i·π·s), i = 0, 1, …, n
• For example, the figure on the right shows the features for the first few orders
24. Fourier cosine basis in 2D
• When we represent each state-action pair with more than one dimension, the basis formula becomes x_i(s) = cos(π·s·cⁱ)
• where s = (s₁, …, s_k) is the scaled input vector and cⁱ = (c₁ⁱ, …, c_kⁱ), with each c_jⁱ ∈ {0, …, n}
• In our blackjack case, we use a different model for each of the four (ace, action) combinations, so the input is s = (dealer's sum, player's sum), scaled to [0, 1]
• The figure on the right shows the resulting cosine basis functions
• For RL testing, we use n = 6, giving (n + 1)² = 49 features for each (ace, action)-defined plane (sketched below)
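A minimal sketch of the 2-D Fourier cosine basis for one (ace, action) plane, assuming the dealer and player sums have already been scaled to [0, 1] (names are ours):

```python
import numpy as np
from itertools import product

def fourier_features(d, p, n=6):
    """2-D Fourier cosine basis: cos(pi * (c1*d + c2*p)) for c1, c2 in {0..n}."""
    return np.array([np.cos(np.pi * (c1 * d + c2 * p))
                     for c1, c2 in product(range(n + 1), repeat=2)])

x = fourier_features(d=0.33, p=0.76)
print(x.shape)   # (49,) for n = 6
```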
25. Third: Tile coding
• In tile coding (a form of coarse coding) the feature space is divided into partitions, or tiles
• We get binary features: for each input, a feature has either the value 1 (input within the tile) or 0 (outside the tile)
• For this to work, the tilings have to overlap, so that multiple tiles are active for each input
• For testing, we use 160 tiles, creating 3x3 areas with four different tiles active
• With this, we get a different set of tiles to cover each of the (ace, action) combinations (a generic sketch follows below)
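A generic sketch of binary tile-coding features over the unit square, with several offset tilings so that multiple tiles are active for each input; this illustrates the idea only and is not the exact 160-tile configuration used in our tests:

```python
import numpy as np

def tile_features(d, p, n_tilings=4, tiles_per_dim=4):
    """Binary features: each offset tiling contributes exactly one active tile."""
    x = np.zeros(n_tilings * tiles_per_dim * tiles_per_dim)
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)          # shift each tiling slightly
        di = min(int((d + offset) * tiles_per_dim), tiles_per_dim - 1)
        pi = min(int((p + offset) * tiles_per_dim), tiles_per_dim - 1)
        x[t * tiles_per_dim**2 + di * tiles_per_dim + pi] = 1.0
    return x

print(tile_features(0.33, 0.76).sum())   # 4.0 -- one active tile per tiling
```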
26. Stochastic Gradient Descent
• So, we want to represent the action-value function with a parametric model with weights w
• We can use different approaches to generate features for our model, such as the Fourier cosine basis…
• To determine values for the weights, we apply gradient descent… or a variant thereof
27. Backgrounder: Stochastic Gradient Descent
• We have an error function we want to minimize, in this case the MSE between the actual action-values and the estimated values given by our function approximation: E(w) = [ q_π(s, a) − q̂(s, a, w) ]²
• The gradient of a function shows the direction of steepest ascent
• For this error function, the (negative half) gradient is given as −½·∇E(w) = [ q_π(s, a) − q̂(s, a, w) ]·∇q̂(s, a, w)
28. Stochastic Gradient Descent (SGD)
• Stochastic gradient descent approximates the gradient using a single sample
• To reduce the error, we update the weights towards the negative gradient: w ← w + α·[ q_π(s, a) − q̂(s, a, w) ]·∇q̂(s, a, w)
• where α is a learning-rate parameter
• Gradient descent converges to a local minimum of the error function if the learning rate is decreased over the iterations
29. SGD with linear model
• For a linear model of the action-value function, q̂(s, a, w) = wᵀ·x(s, a)
• the gradient is given by ∇q̂(s, a, w) = x(s, a)
• and the update rule reduces to w ← w + α·[ q_π(s, a) − q̂(s, a, w) ]·x(s, a)
30. Simple SGD example
• We want to find the best fitting line ŷ = a·x + b to the observations
• Our weight vector is w = (a, b), and the partial derivatives for the gradient are ∂ŷ/∂a = x and ∂ŷ/∂b = 1
• The SGD updates are a ← a + α·(y − ŷ)·x and b ← b + α·(y − ŷ) (see the sketch below)
• The light blue line is the (unknown) relation, from which the observations are generated by adding N(0, 1) noise
• Blue dots are the known observations
• The dark blue line is given by the least-squares estimates of a and b
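The whole example fits in a few lines of Python; this is a self-contained sketch with made-up data (slope 2, intercept 1, N(0, 1) noise), not the exact data behind the figures:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=100)   # noisy observations of a line

a, b = 0.0, 0.0          # weight vector w = (a, b)
alpha = 0.01             # learning rate

for _ in range(50):                       # passes over the data
    for xi, yi in zip(x, y):              # one SGD update per observation
        error = yi - (a * xi + b)
        a += alpha * error * xi           # a <- a + alpha * (y - y_hat) * x
        b += alpha * error                # b <- b + alpha * (y - y_hat)

print(a, b)   # should approach the least-squares estimates
```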
32. Sample gradients
• Above: the error surface calculated using all of the observations
• The images on the right show the gradients calculated from single samples for the first eight iterations
33. SGD example
• Figure on the right shows how the line
changes as slope and intercept are
updated during gradient descent
• Light green line shows the final line,
orange is the line obtained applying
least-squares
34. RL: Semi-gradient descent
• In RL, we do not have a target value for q_π(s, a) that we could use in a supervised-learning-type update
• So, for model-free RL, we substitute an approximate target* that we get by interacting with the environment
• and we live with the fact that the target also depends on our current weights w. Not being a true gradient descent method, this is called semi-gradient descent
*) such targets are called bootstrapping targets or bootstrapping estimates in the literature
35. RL: Semi-gradient descent
• For full-episode Monte Carlo, we use the observed return G_t as the target
• to get the semi-gradient update w ← w + α·[ G_t − q̂(S_t, A_t, w) ]·∇q̂(S_t, A_t, w)
• For temporal-difference Sarsa, we use the sample return R_{t+1} + γ·q̂(S_{t+1}, A_{t+1}, w) as the target instead
36. Semi-gradient Sarsa TD(0)
• Actions are selected with an ε-greedy policy using the estimated action-values
• The algorithm learns from each episode, iterating forward one step at a time
• Semi-gradient weight updates are applied at each step, as sketched below
Source: Sutton-Barto 2nd ed
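A condensed sketch of semi-gradient Sarsa TD(0) with a linear approximator, assuming a hypothetical environment with reset()/step() methods and a features(s, a) function such as those shown earlier (the structure mirrors the Sutton-Barto pseudocode, but the interface and names are ours):

```python
import numpy as np

def semi_gradient_sarsa_episode(env, features, actions, w, alpha=0.01, gamma=1.0, eps=0.1):
    """Run one episode and update the weight vector w in place."""
    def eps_greedy(s):
        if np.random.rand() < eps:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: w @ features(s, a))

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)           # assumed interface: (state, reward, done)
        x = features(s, a)
        if done:
            target = r                           # no bootstrapping from a terminal state
        else:
            a_next = eps_greedy(s_next)
            target = r + gamma * (w @ features(s_next, a_next))
        w += alpha * (target - w @ x) * x        # semi-gradient weight update
        if not done:
            s, a = s_next, a_next
    return w
```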
37. Convergence and the deadly triad
• Three elements make up the deadly triad:
• Function approximation: generalizing from the state space (e.g., linear function approximation or ANNs)
• Bootstrapping: update targets that include existing estimates rather than relying exclusively on actual rewards and complete returns
• Off-policy training: training on a distribution other than that of the target policy
• Any two of these can be handled; all three together lead to instability and divergence
Source: Sutton-Barto 2nd ed; table source: David Silver, UCL Course on RL
38. Batch method: LSPI-LSTDQ
• So far, we have considered episodic methods, using either full-episode returns or TD sample returns as targets
• Batch methods apply a different approach: the agent generates experience, a batch of samples from the environment, which is then used for learning
• Samples can be reused or replayed in a different order
• LSPI-LSTDQ is an example of a batch policy improvement algorithm, using a batch of samples to solve for the weights using least squares
• Solving the linear system involves a matrix (pseudo-)inversion at each iteration of the algorithm, as sketched below
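A minimal sketch of the LSTDQ solve inside LSPI: given a batch of samples and the current policy, form the linear system and solve it with a pseudo-inverse (interfaces and names are ours, not the actual implementation):

```python
import numpy as np

def lstdq(samples, features, policy, n_features, gamma=1.0):
    """Solve A w = b from a batch of (s, a, r, s_next, done) samples.

    A = sum of x (x - gamma * x')^T and b = sum of r * x, where x' are the
    features of the action the current policy would select in s_next.
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next, done in samples:
        x = features(s, a)
        x_next = np.zeros(n_features) if done else features(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.pinv(A) @ b   # (pseudo-)inversion, once per LSPI iteration

# LSPI alternates: w = lstdq(...), then policy = greedy w.r.t. w, until the policy is stable.
```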
40. RL algorithm implementation
• Learning goals were to:
• improve Python skills beyond simple scripting and notebook use
• gain experience in modeling complex real-world (?) entities (such as agents, policies and environments) in Python
• gain understanding of how the RL algorithms actually work
• develop ideas for real RL use cases to tackle later
• Common toy problems that enable fast iterations and straightforward visualization of both the data and the results
• From-scratch ideology
• Downside: slow and painful, while excellent libraries with ready implementations are readily available
Tech stack
WSL
Ubuntu
Conda
VS Code
Python 3.10
Mypy
Numpy
Pandas
Jupyter
Matplotlib
Seaborn
43. Simulation experiments: Reference result
Panels: greedy policy | action-value function | difference in value between actions
Monte Carlo off-policy; 100 000 000 episodes; random behavior policy; no discounting
44. So…
• We have covered basic model-free RL
algorithms
• Algorithms that learn from episodes or
from TD-updates, and a single example of
a batch algorithm
• That apply GPI; they work with the state-action value function and derive the corresponding policy from it
• They either store the values of state-action pairs, i.e. use a tabular value representation,
• or use linear function approximation to represent the values of state-action pairs, and apply a variant of gradient descent to learn from experience
45. What was not covered
• Introductions to RL typically cover both prediction, i.e. working with the state-value function, and control, i.e. working with the action-value function. We pretty much skipped the prediction part (as we do not find it interesting or useful if control is feasible)
• Between the single-step TD(0) and full-episode Monte Carlo approaches presented, there exists a set of methods that use several return steps in their updates, TD(λ) or n-step methods. These were not covered
• We have covered only action-value GPI methods that work with value-function and derive a corresponding policy from the action-values. A
family of methods that work directly with parametrized policy, policy gradient methods such as REINFORCE, were not covered
• We covered dynamic programming in the first part for fully defined cases. For situations where the MDP is not known, we concentrated on model-free methods only and did not cover model-based approaches
• We limited the discussion to linear function approximation methods, even hinting that applying non-linear methods would lead to issues
like divergence. This may be true. Yet, most of the groundbreaking results in RL have been obtained applying Deep Neural Networks. For
instance, DQN (Deep NNs and Q-learning) for Atari games is standard textbook material
• We did not discuss tree-search algorithms, such as MCTS
• There is a huge number of interesting recent algorithms and variants that were not covered. The discussion pretty much follows the
standard textbook level of knowledge from some years back, as covered in Sutton-Barto and David Silver’s lectures from 2015
• And notably, we omitted almost all of the convergence proofs that are diligently covered in most lecture materials and books. We are sorry
about that