Introduction to Reinforcement Learning, part I: Dynamic programming
This is the first presentation in a three-part series covering the basics of Reinforcement Learning (RL).
In this presentation, we introduce reinforcement learning as a machine learning approach. We cover the terminology and building blocks needed, such as agents and environments, policies and value functions, and Markov decision processes.
We introduce two basic dynamic programming algorithms, Value iteration and Policy iteration, and illustrate them using a simple (canonical) maze as an example.
1. Introduction to Reinforcement Learning
Part I: Dynamic programming
Mikko Mäkipää 18.11.2021
https://mmakipaa.github.io/dp/
2. Agenda
• Today: Part I
• Intro: Reinforcement learning as a ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: Value iteration, Policy iteration
• Later: Parts II and III
• Some more building blocks (likely): exploration, TD updates, semi-gradient descent, “model” as in “model-free”, the deadly triad, …
• Value function approximation: tabular state representation vs. linear approximation using semi-gradient Sarsa; polynomial, tile coding, and Fourier cosine bases
• Basic value-based reinforcement learning algorithms illustrated on Blackjack: on- vs. off-policy episodic Monte Carlo; Sarsa, Expected Sarsa, and Q-learning using TD; batch updates with LSPI-LSTDQ
4. Evolution of Learning Approach
• Lie on couch
• Lie on couch, read Nature article
• Lie on couch, watch video lectures
• Lie on couch, watch lectures, read textbook
• Watch lectures, read textbook, code
• Watch lectures, read textbook, code, document
5. Learning objectives (and self-assessment)
• Understand what reinforcement learning is
• Understand what is behind the spectacular results
• Improve Python skills, especially for modeling object structures
• Get ideas for implementing RL with real tools in real problems
7. RL problem setting: Agent and environment
• The agent performs an action; it then observes the resulting environment state and the reward*
*) This would be a fully observable environment
8. RL problem setting: Agent and environment
• The agent performs an action and observes the resulting environment state and reward
• The agent models the environment as a Markov decision process
• The agent maintains a policy that defines what action to take when in a state
• The agent approximates the value function of each state and action
• The agent creates an internal representation of state
9. Dynamic programming (Bellman, 1950s)
“Dynamic programming refers to simplifying a complicated problem by breaking it
down into simpler sub-problems in a recursive manner”
- Wikipedia
“A collection of algorithms that can be used to compute optimal policies given a
perfect model of the environment as a Markov decision process (MDP)”
- Sutton and Barto, Reinforcement Learning: An Introduction
10. Markov decision process (MDP)
• Reinforcement learning problems are framed as Markov decision processes (MDPs)
• Formally, a Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where
  $\mathcal{S}$ is the set of states
  $\mathcal{A}$ is the set of actions
  $\mathcal{P}$ defines the transition probabilities between states, $\mathcal{P}(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
  $\mathcal{R}$ is the reward function, and
  $\gamma$ is a discount factor, $0 \le \gamma \le 1$
• And the transition probabilities follow the Markov property: the next state depends only on the current state and action, $\Pr(S_{t+1} \mid S_t, A_t) = \Pr(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)$
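To make the tuple concrete, below is one possible Python sketch of a finite MDP; the class and field names are my own, not taken from the slides or any library, and rewards are assumed to depend only on the state being entered, as in the maze example later.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]   # e.g. (x, y) grid coordinates
Action = str              # e.g. "N", "E", "S", "W"

@dataclass
class MDP:
    states: List[State]                                                  # set of states S
    actions: List[Action]                                                # set of actions A
    transitions: Dict[Tuple[State, Action], List[Tuple[State, float]]]   # P: (s, a) -> [(s', prob), ...]
    rewards: Dict[State, float]                                          # R: reward received when entering s'
    gamma: float                                                         # discount factor, 0 <= gamma <= 1
    terminals: List[State]                                               # absorbing terminal states
```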
11. Picture of an MDP
• States $s \in \mathcal{S}$
• Actions $a \in \mathcal{A}$
• Transition probabilities $\mathcal{P}(s' \mid s, a)$
• Rewards $r$
  Reward function $\mathcal{R}(s, a, s')$
• Discount factor $\gamma$, $0 \le \gamma \le 1$
12. Maze (or grid-world) as an MDP
• States indexed with (x, y) tuples
  • terminal states (here green and red)
• Reward received when entering a state
  • nonzero rewards defined for the terminal states
  • zero for all other states
  • additionally, a small living cost (here -0.04) for each step
• Actions N, E, S, W in each state
  • walls bounce the agent back
  • moves are noisy
A commonly used example, the ”canonical maze”
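As a data sketch only, the grid world could be laid out roughly as follows; this assumes the familiar 4x3 layout often used with these numbers and is hypothetical (the slides' maze may differ in detail). The -0.04 living cost and the 0.8/0.1/0.1 move noise come from the worked backups later in the deck.

```python
# Assumed 4x3 grid-world layout (hypothetical; the maze in the slides may differ)
WIDTH, HEIGHT = 4, 3
WALLS = {(2, 2)}                                  # inner wall cell; moves into it bounce back
TERMINAL_REWARDS = {(4, 3): +1.0, (4, 2): -1.0}   # reward received when entering a terminal
LIVING_COST = -0.04                               # small cost added to every step
NOISE_INTENDED, NOISE_SIDE = 0.8, 0.1             # move noise used in the later backups
```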
14. Maze transition probs with noisy moves
• Two-state maze
• Noisy moves: probability 0.8 of moving in the intended direction, 0.1 for each perpendicular direction
• Attempted move NORTH shown; east, south, west omitted for clarity
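A minimal sketch of how the noisy transition distribution for one attempted move could be built, assuming the 0.8/0.1/0.1 probabilities above and that moves into a wall (or off the grid) bounce the agent back; the function and argument names are illustrative only.

```python
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
SIDES = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def noisy_transitions(state, action, walls, p_intended=0.8, p_side=0.1):
    """Return {next_state: probability} for one attempted move with noisy dynamics.
    `walls` should contain wall cells and out-of-bounds cells."""
    def step(s, direction):
        dx, dy = MOVES[direction]
        nxt = (s[0] + dx, s[1] + dy)
        return s if nxt in walls else nxt            # bounce back off walls

    probs = {}
    for direction, p in [(action, p_intended), (SIDES[action][0], p_side), (SIDES[action][1], p_side)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# e.g. attempting NORTH from (1, 1) with a wall directly above:
# noisy_transitions((1, 1), "N", walls={(1, 2)}) -> {(1, 1): 0.8, (2, 1): 0.1, (0, 1): 0.1}
```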
15. Policy
• A policy $\pi$ defines how the agent behaves in an MDP environment
• A policy is a mapping from each state to an action
• A deterministic policy $\pi(s) = a$ always returns the same action for a state
• A stochastic policy $\pi(a \mid s)$ gives a probability for each action in a state
One possible deterministic policy for the maze
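As a small illustration (the state coordinates and probabilities below are made up), the two kinds of policy can be stored as plain mappings:

```python
# Deterministic policy: pi maps each state to a single action
policy_det = {(1, 1): "N", (1, 2): "N", (1, 3): "E"}

# Stochastic policy: pi(a | s) maps each state to a distribution over actions
policy_stoch = {(1, 1): {"N": 0.7, "E": 0.3},
                (1, 2): {"N": 1.0}}
```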
16. Value function
• An agent exploring the MDP environment would observe a sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$
• The discounted return, or utility, from time step $t$ onwards is the sum of discounted rewards received:
  $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
17. The state-value function
• If the agent was following a policy $\pi$, then in each state $s$, the agent would select the action defined by that policy
• The state-value function of a state $s$ under policy $\pi$, denoted $v_\pi(s)$, is the expected discounted return when following the policy from state $s$ onwards:
  $v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,]$
18. Bellman expectation equation for $v_\pi$
• Previously, we defined $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
• Manipulating, $G_t = R_{t+1} + \gamma G_{t+1}$, so $v_\pi(s) = \mathbb{E}_\pi[\,R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\,]$
• Expanding the expectation, we get
  $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_\pi(s')\,]$
19. Bellman expectation equation for $v_\pi$
• The Bellman expectation equation for the state-value function defines a recursive relationship between the value of a state and the values of its successor states
• The value of a state decomposes into
  • the expected immediate reward received when selecting the action as defined by the policy
  • the expected discounted value of the successor state
20. State-value function, action-value function
• The state-value function $v_\pi(s)$ defines the expected discounted return when following the policy from state $s$ onwards
• Similarly, we can define the action-value function for policy $\pi$:
  $q_\pi(s, a) = \mathbb{E}_\pi[\,G_t \mid S_t = s, A_t = a\,]$
• The action-value function defines the expected utility when starting in state $s$, performing action $a$, and following policy $\pi$ thereafter
• After receiving the immediate reward $R_{t+1}$, the discounted future reward as defined by the successor state value, $\gamma\, v_\pi(S_{t+1})$, is received
• This illustrates the recursive relation between $q_\pi$ and $v_\pi$:
  $q_\pi(s, a) = \mathbb{E}[\,R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\,]$
21. Optimal value functions
• An optimal value function is one that gives the best expected utility for each state or state-action pair, $v_*(s)$ and $q_*(s, a)$, over all policies:
  $v_*(s) = \max_\pi v_\pi(s) \qquad q_*(s, a) = \max_\pi q_\pi(s, a)$
• $v_*(s)$ gives the expected return for being in state $s$ and following the optimal policy
• while $q_*(s, a)$ gives the expected return for taking action $a$ in state $s$ and following the optimal policy thereafter
22. Optimal policy
• We consider a policy $\pi$ to be equal to or better than some other policy $\pi'$, written $\pi \ge \pi'$, if the state value in all states is at least as good for $\pi$: $v_\pi(s) \ge v_{\pi'}(s)$ for all $s$
• For any MDP, there always exists an optimal deterministic policy $\pi_*$ that is at least as good as all other policies: $\pi_* \ge \pi$ for all $\pi$
• There can be multiple optimal policies, but all optimal policies achieve the optimal value functions $v_*$ and $q_*$
• So, finally, the optimal policy is the policy (or one of the best policies) that gives the best value for each state
23. Bellman optimality equations
• A couple of slides back we defined the Bellman expectation equation for $v_\pi$
• We can derive a similar relation between the optimal value functions, called the Bellman optimality equation (for $v_*$):
  $v_*(s) = \max_a \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_*(s')\,]$
25. Bellman equations as update rules (backups)
• We can derive update rules from the Bellman equations, called Bellman backups
• For the Bellman expectation equation:
  $v_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_k(s')\,]$
• For the Bellman optimality equation:
  $v_{k+1}(s) \leftarrow \max_a \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_k(s')\,]$
26. Policy evaluation
• The task of computing the state-value function $v_\pi$ for the states under policy $\pi$ is called policy evaluation
• Applying the Bellman expectation equation gives us a system of simultaneous linear equations that can be solved using a suitable standard method
• Alternatively, we can use the Bellman expectation backup as an update rule, sweep through the states repeatedly, and update each state in turn
• This is called iterative policy evaluation and is guaranteed to converge to $v_\pi$
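A minimal sketch of iterative policy evaluation, reusing the hypothetical `MDP` structure sketched earlier (rewards are received on entering the successor state, and any living cost is assumed to be folded into `rewards`). This is the in-place variant that updates each state as soon as its new value is computed.

```python
def iterative_policy_evaluation(mdp, policy, theta=1e-6):
    """Repeatedly sweep the states, applying the Bellman expectation backup,
    until the largest value change in a sweep falls below theta."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s in mdp.terminals:
                continue
            a = policy[s]  # deterministic policy: one action per state
            new_v = sum(p * (mdp.rewards[s2] + mdp.gamma * V[s2])
                        for s2, p in mdp.transitions[(s, a)])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V
```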
27. Matrix implementation of policy evaluation
• To solve the system using a matrix equation, we define the following matrices:
  $\mathbf{v}$ is the vector of state values, where $v_i = v_\pi(s_i)$
  $\mathbf{r}$ is the vector* of rewards, with $r_i$ the reward associated with state $s_i$
  $\mathbf{P}^\pi$ is the matrix of transition probabilities under $\pi$, where $P^\pi_{ij}$ represents the probability of transitioning from state $s_i$ to state $s_j$
• We can represent the Bellman expectation equation in matrix form as $\mathbf{v} = \mathbf{r} + \gamma\, \mathbf{P}^\pi \mathbf{v}$
• Solving for $\mathbf{v}$ yields $\mathbf{v} = (\mathbf{I} - \gamma\, \mathbf{P}^\pi)^{-1} \mathbf{r}$
• This involves inversion of an $n \times n$ matrix, a complexity $O(n^3)$ operation
*) when rewards depend only on the target state $s'$
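A NumPy sketch of the direct solution, assuming the reward vector and the policy's transition matrix have already been assembled in a fixed state ordering; `numpy.linalg.solve` avoids forming the explicit inverse but has the same $O(n^3)$ cost.

```python
import numpy as np

def policy_evaluation_direct(P_pi, r, gamma):
    """Solve v = r + gamma * P_pi @ v, i.e. (I - gamma * P_pi) v = r."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)
```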
28. Value iteration algorithm
• Perform a synchronous sweep of the state space
• Update the value of each state based on the values of adjacent states using the Bellman optimality backup
• Stop when state values no longer change (much)
• Extract the policy corresponding to the state values
• The algorithm will converge to the optimal value function*
*) assuming finite episodes or $\gamma < 1$.
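A sketch of the algorithm under the same assumptions as the earlier snippets (the `MDP` structure is hypothetical; a living cost, if any, is folded into `rewards`):

```python
def value_iteration(mdp, theta=1e-6):
    """Synchronous sweeps with the Bellman optimality backup until values stop changing (much)."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        V_new = dict(V)
        for s in mdp.states:
            if s in mdp.terminals:
                continue
            V_new[s] = max(sum(p * (mdp.rewards[s2] + mdp.gamma * V[s2])
                               for s2, p in mdp.transitions[(s, a)])
                           for a in mdp.actions)
        if max(abs(V_new[s] - V[s]) for s in mdp.states) < theta:
            return V_new
        V = V_new

def extract_policy(mdp, V):
    """Greedy policy with respect to the converged state values."""
    def backup(s, a):
        return sum(p * (mdp.rewards[s2] + mdp.gamma * V[s2])
                   for s2, p in mdp.transitions[(s, a)])
    return {s: max(mdp.actions, key=lambda a: backup(s, a))
            for s in mdp.states if s not in mdp.terminals}
```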
30. Round 0
State values initialized to zero
Example Bellman optimality backups with the 0.8/0.1/0.1 move noise and -0.04 living cost:
0.8*(1+0) + 0.1*(0+0) + 0.1*(0+0) - 0.04 = 0.76   (state whose intended move enters the +1 terminal)
0.8*(0+0) + 0.1*(0+0) + 0.1*(0+0) - 0.04 = -0.04   (state with no terminal neighbours)
31. Round 0
State values after round 0, and the policy corresponding to these state values (which we would not normally know or bother to extract until the end)
38. Policy iteration algorithm
• Perform policy evaluation: evaluate state values under the current policy*
• Improve the policy by determining the best action in each state using the state-value function determined during the policy evaluation step
• Stop when the policy no longer changes
• Each improvement step improves the value function, and when improvement stops, the optimal policy and value function have been found
*) using either iterative policy evaluation or solving the linear system
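A sketch of policy iteration that reuses the `iterative_policy_evaluation` and `extract_policy` sketches above; the evaluation step could equally well use the direct matrix solution.

```python
def policy_iteration(mdp):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    # start from an arbitrary deterministic policy
    policy = {s: mdp.actions[0] for s in mdp.states if s not in mdp.terminals}
    while True:
        V = iterative_policy_evaluation(mdp, policy)   # evaluation step
        improved = extract_policy(mdp, V)              # greedy improvement step
        if improved == policy:                         # no change -> policy and values are optimal
            return policy, V
        policy = improved
```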
42. So…
• When we have a fully defined MDP, i.e. we know all of $\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $\mathcal{R}$, and $\gamma$
• We can apply an iterative dynamic programming algorithm to the problem
• The algorithm is guaranteed to converge to optimal value function and policy
• Where we do a full sweep* of the state space or invert a matrix
*) not strictly necessary, as asynchronous alternatives exist where we do not perform full sweeps (i.e. synchronous backups)
43. Utility of classical DP
“Classical DP algorithms are of limited utility in reinforcement learning both
because of their assumption of a perfect model and because of their great
computational expense, but they are still important theoretically.
DP provides an essential foundation for the understanding of the methods
presented in the rest of this book.
In fact, all of these methods can be viewed as attempts to achieve much the same
effect as DP, only with less computation and without assuming a perfect model of
the environment.”
- Sutton and Barto, Reinforcement Learning: An Introduction
44. Part II
• What if we do not know the MDP
• Or don’t care to know
• Or cannot iterate across all states
• Or do not want to blindly iterate
-> Reinforcement Learning!
45. References
• Video lectures: David Silver, https://www.davidsilver.uk/teaching/
• Book: Sutton and Barto, Reinforcement Learning: An Introduction, http://incompleteideas.net/book/RLbook2020.pdf
• Book: Poole and Mackworth, Artificial Intelligence: Foundations of Computational Agents, http://artint.info/2e/index.html
Random selection of university classes & lecture materials:
• http://mlg.eng.cam.ac.uk/teaching/4f13/1011/lect1214.pdf
• https://www.cs.cmu.edu/~mgormley/courses/10701-f16/schedule.html
• https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture26-ri.pdf
• https://www.andrew.cmu.edu/course/10-703/slides/lecture3_exactmethods-9-5-2018.pdf
• https://www.cs.cmu.edu/~15381-f17/
• http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15780-s16/www/
• https://web.mit.edu/6.246/www/lectures/L3-2021sp.pdf
• https://web.stanford.edu/class/cs234/slides/lecture2.pdf
• https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf
• https://people.eecs.berkeley.edu/~pabbeel/cs287-fa11/slides/mdps-intro-value-iteration.pdf
• https://inst.eecs.berkeley.edu/~cs188/sp20/assets/lecture/lec11_6up.pdf