Introduction to Reinforcement Learning, part I: Dynamic programming
This is the first presentation in a three-part series covering the basics of Reinforcement Learning (RL).
In this presentation, we introduce reinforcement learning as a machine learning approach. We cover the terminology and building blocks needed, such as agents and environments, policies and value functions, and Markov decision processes.
We introduce two basic dynamic programming algorithms, Value iteration and Policy iteration, and illustrate them using a simple (canonical) maze as an example.
1. Introduction to Reinforcement Learning
Part I: Dynamic programming
Mikko Mäkipää 18.11.2021
https://mmakipaa.github.io/dp/
2. Agenda
• Today: Part I
• Intro: Reinforcement learning as a ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: Value iteration, Policy iteration
• Later: Parts II and III
• Some more building blocks (likely): exploration, TD updates, semi-gradient descent, “model” as in “model-free”, the deadly triad, …
• Value function approximation: tabular state representation vs. linear approximation using semi-gradient Sarsa; polynomial, tile coding, and Fourier cosine bases
• Basic value-based reinforcement learning algorithms illustrated on Blackjack: on- vs. off-policy episodic Monte Carlo; Sarsa, Expected Sarsa, and Q-learning using TD; batch updates with LSPI-LSTDQ
4. Evolution of Learning Approach
• Lie on couch
• Lie on couch, read Nature article
• Lie on couch, watch video lectures
• Lie on couch, watch lectures, read textbook
• Watch lectures, read textbook, code
• Watch lectures, read textbook, code, document
5. Learning objectives (and self-assessment)
• Understand what reinforcement learning is
• Understand what is behind the spectacular results
• Improve Python skills, especially for modeling object structures
• Get ideas for implementing RL with real tools in real problems
7. RL problem setting: Agent and environment
• The agent performs an action; it then observes the resulting environment state and the reward*
*) This would be a fully observable environment
8. RL problem setting: Agent and environment
• The agent performs an action and observes the resulting environment state and reward
• The agent models the environment as a Markov decision process
• The agent maintains a policy that defines what action to take when in a state
• The agent approximates the value function of each state and action
• The agent creates an internal representation of state
9. Dynamic programming (Bellman, 1950s)
“Dynamic programming refers to simplifying a complicated problem by breaking it
down into simpler sub-problems in a recursive manner”
- Wikipedia
“A collection of algorithms that can be used to compute optimal policies given a
perfect model of the environment as a Markov decision process (MDP)”
- Sutton and Barto, Reinforcement Learning: An Introduction
10. Markov decision process (MDP)
• Reinforcement learning problems are framed as Markov decision processes (MDPs)
• Formally, a Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where
  $\mathcal{S}$ is the set of states
  $\mathcal{A}$ is the set of actions
  $\mathcal{P}$ defines the transition probabilities between states, $\mathcal{P}(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
  $\mathcal{R}$ is the reward function, and
  $\gamma$ is a discount factor, $0 \le \gamma \le 1$
• And the transition probabilities follow the Markov property: the next state depends only on the current state and action, $\Pr(S_{t+1} \mid S_t, A_t) = \Pr(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)$
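To make the tuple concrete, below is one possible Python sketch of a finite MDP; the class and field names are my own, not taken from the slides or any library, and rewards are assumed to depend only on the state being entered, as in the maze example later.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]   # e.g. (x, y) grid coordinates
Action = str              # e.g. "N", "E", "S", "W"

@dataclass
class MDP:
    states: List[State]                                                  # set of states S
    actions: List[Action]                                                # set of actions A
    transitions: Dict[Tuple[State, Action], List[Tuple[State, float]]]   # P: (s, a) -> [(s', prob), ...]
    rewards: Dict[State, float]                                          # R: reward received when entering s'
    gamma: float                                                         # discount factor, 0 <= gamma <= 1
    terminals: List[State]                                               # absorbing terminal states
```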
11. Picture of an MDP
• States $s \in \mathcal{S}$
• Actions $a \in \mathcal{A}$
• Transition probabilities $\mathcal{P}(s' \mid s, a)$
• Rewards $r$
  Reward function $\mathcal{R}(s, a, s')$
• Discount factor $\gamma$, $0 \le \gamma \le 1$
12. Maze (or grid-world) as an MDP
• States indexed with (x, y) tuples
  • terminal states (here green and red)
• Reward received when entering a state
  • nonzero rewards defined for the terminal states
  • zero for all other states
  • additionally, a small living cost (here -0.04) for each step
• Actions N, E, S, W in each state
  • walls bounce the agent back
  • moves are noisy
A commonly used example, the ”canonical maze”
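As a data sketch only, the grid world could be laid out roughly as follows; this assumes the familiar 4x3 layout often used with these numbers and is hypothetical (the slides' maze may differ in detail). The -0.04 living cost and the 0.8/0.1/0.1 move noise come from the worked backups later in the deck.

```python
# Assumed 4x3 grid-world layout (hypothetical; the maze in the slides may differ)
WIDTH, HEIGHT = 4, 3
WALLS = {(2, 2)}                                  # inner wall cell; moves into it bounce back
TERMINAL_REWARDS = {(4, 3): +1.0, (4, 2): -1.0}   # reward received when entering a terminal
LIVING_COST = -0.04                               # small cost added to every step
NOISE_INTENDED, NOISE_SIDE = 0.8, 0.1             # move noise used in the later backups
```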
14. Maze transition probs with noisy moves
• Two-state maze
• Noisy moves: probability 0.8 of moving in the intended direction, 0.1 for each perpendicular direction
• Attempted move NORTH shown; east, south, west omitted for clarity
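A minimal sketch of how the noisy transition distribution for one attempted move could be built, assuming the 0.8/0.1/0.1 probabilities above and that moves into a wall (or off the grid) bounce the agent back; the function and argument names are illustrative only.

```python
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
SIDES = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def noisy_transitions(state, action, walls, p_intended=0.8, p_side=0.1):
    """Return {next_state: probability} for one attempted move with noisy dynamics.
    `walls` should contain wall cells and out-of-bounds cells."""
    def step(s, direction):
        dx, dy = MOVES[direction]
        nxt = (s[0] + dx, s[1] + dy)
        return s if nxt in walls else nxt            # bounce back off walls

    probs = {}
    for direction, p in [(action, p_intended), (SIDES[action][0], p_side), (SIDES[action][1], p_side)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# e.g. attempting NORTH from (1, 1) with a wall directly above:
# noisy_transitions((1, 1), "N", walls={(1, 2)}) -> {(1, 1): 0.8, (2, 1): 0.1, (0, 1): 0.1}
```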
15. Policy
• A policy $\pi$ defines how the agent behaves in an MDP environment
• A policy is a mapping from each state to an action
• A deterministic policy $\pi(s) = a$ always returns the same action for a state
• A stochastic policy $\pi(a \mid s)$ gives a probability for each action in a state
One possible deterministic policy for the maze
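As a small illustration (the state coordinates and probabilities below are made up), the two kinds of policy can be stored as plain mappings:

```python
# Deterministic policy: pi maps each state to a single action
policy_det = {(1, 1): "N", (1, 2): "N", (1, 3): "E"}

# Stochastic policy: pi(a | s) maps each state to a distribution over actions
policy_stoch = {(1, 1): {"N": 0.7, "E": 0.3},
                (1, 2): {"N": 1.0}}
```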
16. Value function
• An agent exploring the MDP environment would observe a sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$
• The discounted return, or utility, from time step $t$ onwards is the sum of discounted rewards received:
  $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
17. The state-value function
• If the agent was following a policy $\pi$, then in each state $s$, the agent would select the action defined by that policy
• The state-value function of a state $s$ under policy $\pi$, denoted $v_\pi(s)$, is the expected discounted return when following the policy from state $s$ onwards:
  $v_\pi(s) = \mathbb{E}_\pi[\,G_t \mid S_t = s\,]$
18. Bellman expectation equation for $v_\pi$
• Previously, we defined $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
• Manipulating, $G_t = R_{t+1} + \gamma G_{t+1}$, so $v_\pi(s) = \mathbb{E}_\pi[\,R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\,]$
• Expanding the expectation, we get
  $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_\pi(s')\,]$
19. Bellman expectation equation for $v_\pi$
• The Bellman expectation equation for the state-value function defines a recursive relationship between the value of a state and the values of its successor states
• The value of a state decomposes into
  • the expected immediate reward received when selecting the action as defined by the policy
  • the expected discounted value of the successor state
20. State-value function, action-value function
• The state-value function $v_\pi(s)$ defines the expected discounted return when following the policy from state $s$ onwards
• Similarly, we can define the action-value function for policy $\pi$:
  $q_\pi(s, a) = \mathbb{E}_\pi[\,G_t \mid S_t = s, A_t = a\,]$
• The action-value function defines the expected utility when starting in state $s$, performing action $a$, and following policy $\pi$ thereafter
• After receiving the immediate reward $R_{t+1}$, the discounted future reward as defined by the successor state value, $\gamma\, v_\pi(S_{t+1})$, is received
• This illustrates the recursive relation between $q_\pi$ and $v_\pi$:
  $q_\pi(s, a) = \mathbb{E}[\,R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\,]$
21. Optimal value functions
• An optimal value function is one that gives the best expected utility for each state or state-action pair, $v_*(s)$ and $q_*(s, a)$, over all policies:
  $v_*(s) = \max_\pi v_\pi(s) \qquad q_*(s, a) = \max_\pi q_\pi(s, a)$
• $v_*(s)$ gives the expected return for being in state $s$ and following the optimal policy
• while $q_*(s, a)$ gives the expected return for taking action $a$ in state $s$ and following the optimal policy thereafter
22. Optimal policy
• We consider a policy $\pi$ to be equal to or better than some other policy $\pi'$, written $\pi \ge \pi'$, if the state value in all states is at least as good for $\pi$: $v_\pi(s) \ge v_{\pi'}(s)$ for all $s$
• For any MDP, there always exists an optimal deterministic policy $\pi_*$ that is at least as good as all other policies: $\pi_* \ge \pi$ for all $\pi$
• There can be multiple optimal policies, but all optimal policies achieve the optimal value functions $v_*$ and $q_*$
• So, finally, the optimal policy is the policy (or one of the best policies) that gives the best value for each state
23. Bellman optimality equations
• A couple of slides back we defined the Bellman expectation equation for $v_\pi$
• We can derive a similar relation between the optimal value functions, called the Bellman optimality equation (for $v_*$):
  $v_*(s) = \max_a \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_*(s')\,]$
25. Bellman equations as update rules (backups)
• We can derive update rules from the Bellman equations, called Bellman backups
• For the Bellman expectation equation:
  $v_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_k(s')\,]$
• For the Bellman optimality equation:
  $v_{k+1}(s) \leftarrow \max_a \sum_{s'} \mathcal{P}(s' \mid s, a)\,[\,\mathcal{R}(s, a, s') + \gamma\, v_k(s')\,]$
26. Policy evaluation
• The task of computing the state-value function $v_\pi$ for the states under policy $\pi$ is called policy evaluation
• Applying the Bellman expectation equation gives us a system of simultaneous linear equations that can be solved using a suitable standard method
• Alternatively, we can use the Bellman expectation backup as an update rule, sweep through the states repeatedly, and update each state in turn
• This is called iterative policy evaluation and is guaranteed to converge to $v_\pi$
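A minimal sketch of iterative policy evaluation, reusing the hypothetical `MDP` structure sketched earlier (rewards are received on entering the successor state, and any living cost is assumed to be folded into `rewards`). This is the in-place variant that updates each state as soon as its new value is computed.

```python
def iterative_policy_evaluation(mdp, policy, theta=1e-6):
    """Repeatedly sweep the states, applying the Bellman expectation backup,
    until the largest value change in a sweep falls below theta."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s in mdp.terminals:
                continue
            a = policy[s]  # deterministic policy: one action per state
            new_v = sum(p * (mdp.rewards[s2] + mdp.gamma * V[s2])
                        for s2, p in mdp.transitions[(s, a)])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V
```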
27. Matrix implementation of policy evaluation
• To solve the system using a matrix equation, we define the following matrices:
  $\mathbf{v}$ is the vector of state values, where $v_i = v_\pi(s_i)$
  $\mathbf{r}$ is the vector* of rewards, with $r_i$ the reward associated with state $s_i$
  $\mathbf{P}^\pi$ is the matrix of transition probabilities under $\pi$, where $P^\pi_{ij}$ represents the probability of transitioning from state $s_i$ to state $s_j$
• We can represent the Bellman expectation equation in matrix form as $\mathbf{v} = \mathbf{r} + \gamma\, \mathbf{P}^\pi \mathbf{v}$
• Solving for $\mathbf{v}$ yields $\mathbf{v} = (\mathbf{I} - \gamma\, \mathbf{P}^\pi)^{-1} \mathbf{r}$
• This involves inversion of an $n \times n$ matrix, a complexity $O(n^3)$ operation
*) when rewards depend only on the target state $s'$
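A NumPy sketch of the direct solution, assuming the reward vector and the policy's transition matrix have already been assembled in a fixed state ordering; `numpy.linalg.solve` avoids forming the explicit inverse but has the same $O(n^3)$ cost.

```python
import numpy as np

def policy_evaluation_direct(P_pi, r, gamma):
    """Solve v = r + gamma * P_pi @ v, i.e. (I - gamma * P_pi) v = r."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)
```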
28. Value iteration algorithm
• Perform a synchronous sweep of the state space
• Update the value of each state based on the values of adjacent states using the Bellman optimality backup
• Stop when state values no longer change (much)
• Extract the policy corresponding to the state values
• The algorithm will converge to the optimal value function*
*) assuming finite episodes or $\gamma < 1$.
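A sketch of the algorithm under the same assumptions as the earlier snippets (the `MDP` structure is hypothetical; a living cost, if any, is folded into `rewards`):

```python
def value_iteration(mdp, theta=1e-6):
    """Synchronous sweeps with the Bellman optimality backup until values stop changing (much)."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        V_new = dict(V)
        for s in mdp.states:
            if s in mdp.terminals:
                continue
            V_new[s] = max(sum(p * (mdp.rewards[s2] + mdp.gamma * V[s2])
                               for s2, p in mdp.transitions[(s, a)])
                           for a in mdp.actions)
        if max(abs(V_new[s] - V[s]) for s in mdp.states) < theta:
            return V_new
        V = V_new

def extract_policy(mdp, V):
    """Greedy policy with respect to the converged state values."""
    def backup(s, a):
        return sum(p * (mdp.rewards[s2] + mdp.gamma * V[s2])
                   for s2, p in mdp.transitions[(s, a)])
    return {s: max(mdp.actions, key=lambda a: backup(s, a))
            for s in mdp.states if s not in mdp.terminals}
```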
30. Round 0
State values initialized to zero
Example Bellman optimality backups with the 0.8/0.1/0.1 move noise and -0.04 living cost:
0.8*(1+0) + 0.1*(0+0) + 0.1*(0+0) - 0.04 = 0.76   (state whose intended move enters the +1 terminal)
0.8*(0+0) + 0.1*(0+0) + 0.1*(0+0) - 0.04 = -0.04   (state with no terminal neighbours)
31. Round 0
State values after round 0, and the policy corresponding to these state values (which we would not normally know or bother to extract until the end)
38. Policy iteration algorithm
• Perform policy evaluation: evaluate state values under the current policy*
• Improve the policy by determining the best action in each state using the state-value function determined during the policy evaluation step
• Stop when the policy no longer changes
• Each improvement step improves the value function, and when improvement stops, the optimal policy and value function have been found
*) using either iterative policy evaluation or solving the linear system
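A sketch of policy iteration that reuses the `iterative_policy_evaluation` and `extract_policy` sketches above; the evaluation step could equally well use the direct matrix solution.

```python
def policy_iteration(mdp):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    # start from an arbitrary deterministic policy
    policy = {s: mdp.actions[0] for s in mdp.states if s not in mdp.terminals}
    while True:
        V = iterative_policy_evaluation(mdp, policy)   # evaluation step
        improved = extract_policy(mdp, V)              # greedy improvement step
        if improved == policy:                         # no change -> policy and values are optimal
            return policy, V
        policy = improved
```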
42. So…
• When we have a fully defined MDP, i.e. we know all of $\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $\mathcal{R}$, and $\gamma$
• We can apply an iterative dynamic programming algorithm to the problem
• The algorithm is guaranteed to converge to optimal value function and policy
• Where we do a full sweep* of the state space or invert a matrix
*) not strictly necessary, as asynchronous alternatives exist where we do not perform full sweeps (i.e. synchronous backups)
43. Utility of classical DP
“Classical DP algorithms are of limited utility in reinforcement learning both
because of their assumption of a perfect model and because of their great
computational expense, but they are still important theoretically.
DP provides an essential foundation for the understanding of the methods
presented in the rest of this book.
In fact, all of these methods can be viewed as attempts to achieve much the same
effect as DP, only with less computation and without assuming a perfect model of
the environment.”
- Sutton and Barto, Reinforcement Learning: An Introduction
44. Part II
• What if we do not know the MDP
• Or don’t care to know
• Or cannot iterate across all states
• Or do not want to blindly iterate
-> Reinforcement Learning!
45. References
• Video lectures: David Silver, https://www.davidsilver.uk/teaching/
• Book: Sutton and Barto, Reinforcement Learning: An Introduction, http://incompleteideas.net/book/RLbook2020.pdf
• Book: Poole and Mackworth, Artificial Intelligence: Foundations of Computational Agents, http://artint.info/2e/index.html
Random selection of university classes & lecture materials:
• http://mlg.eng.cam.ac.uk/teaching/4f13/1011/lect1214.pdf
• https://www.cs.cmu.edu/~mgormley/courses/10701-f16/schedule.html
• https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture26-ri.pdf
• https://www.andrew.cmu.edu/course/10-703/slides/lecture3_exactmethods-9-5-2018.pdf
• https://www.cs.cmu.edu/~15381-f17/
• http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15780-s16/www/
• https://web.mit.edu/6.246/www/lectures/L3-2021sp.pdf
• https://web.stanford.edu/class/cs234/slides/lecture2.pdf
• https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf
• https://people.eecs.berkeley.edu/~pabbeel/cs287-fa11/slides/mdps-intro-value-iteration.pdf
• https://inst.eecs.berkeley.edu/~cs188/sp20/assets/lecture/lec11_6up.pdf