Introduction to Reinforcement Learning
Part I: Dynamic programming
Mikko Mäkipää 18.11.2021
https://mmakipaa.github.io/dp/
Agenda
• Today: Part I
• Intro: Reinforcement learning as a ML approach
• Basic building blocks: agent and environment, MDP, policies, value functions, Bellman
equations, optimal policies and value functions
• Basic dynamic programming algorithms illustrated on a simple maze: Value iteration, Policy
iteration
• Later: Parts II and III
• Some more building blocks (likely): exploration, TD updates, semi-gradient error descent,
“model” as in “model-free”, the deadly triad, …
• Value function approximation, tabular state representation vs linear approximations using
semi-gradient Sarsa; polynomial, Tile coding, Fourier cosine basis
• Basic value-based reinforcement learning algorithms illustrated on Blackjack: episodic on- vs off-policy Monte Carlo; Sarsa, Expected Sarsa and Q-learning using TD; batch updates with LSPI-LSTDQ
Reinforcement learning?
Evolution of Learning Approach
• Lie on couch
• Lie on couch, read Nature article
• Lie on couch, watch video lectures
• Lie on couch, watch lectures, read textbook
• Watch lectures, read textbook, code
• Watch lectures, read textbook, code, document
Learning objectives (and self-assessment)
• Understand what reinforcement learning is ✓
• Understand what is behind the spectacular results ✓
• Improve Python skills, especially for modeling object structures ✓
• Get ideas for implementing RL with real tools in real problems ✓
Machine learning approaches
[Figure: scatter plots of the iris dataset, sepal width (cm) vs sepal length (cm)]
• Supervised learning
• Unsupervised learning
• Reinforcement learning?
RL problem setting: Agent and environment
[Diagram: agent–environment interaction loop]
• Agent performs an action
• Agent observes the environment state* and a reward
*) This would be a fully observable environment
RL problem setting: Agent and environment
[Diagram: agent–environment interaction loop]
• Agent performs an action; agent observes the environment state and a reward
• Agent creates an internal representation of state
• Agent models the environment as a Markov Decision Process
• Agent maintains a policy that defines what action to take when in a state
• Agent approximates the value function of each state and action
Dynamic programming (Bellman, 1950s)
“Dynamic programming refers to simplifying a complicated problem by breaking it
down into simpler sub-problems in a recursive manner”
- Wikipedia
“A collection of algorithms that can be used to compute optimal policies given a
perfect model of the environment as a Markov decision process (MDP)”
- Sutton and Barto, Reinforcement Learning: An Introduction
Markov decision process (MDP)
• Reinforcement learning problems are framed as Markov Decision Processes (MDP)
• Formally, a Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where
  $\mathcal{S}$ is the set of states
  $\mathcal{A}$ is the set of actions
  $\mathcal{P}$ defines the transition probabilities between states, $\mathcal{P}(s' \mid s, a)$
  $\mathcal{R}$ is the reward function, and
  $\gamma$ is a discount factor, $\gamma \in [0, 1]$
• And transition probabilities follow the Markov property:
  $\Pr[S_{t+1} = s' \mid S_t, A_t] = \Pr[S_{t+1} = s' \mid S_1, A_1, \dots, S_t, A_t]$
Picture of an MDP
[Diagram: an example MDP with states, actions, transition probabilities and rewards]
• States $s \in \mathcal{S}$
• Actions $a \in \mathcal{A}$
• Transition probabilities $\mathcal{P}(s' \mid s, a)$
• Rewards $r$; reward function $\mathcal{R}(s, a, s')$
• Discount factor $\gamma$, $0 \le \gamma \le 1$
Maze (or grid-world) as an MDP
• States indexed with (x, y) tuples
  • terminal states (here green and red)
• Reward received when entering a state
  • defined for terminal states
  • zero for all other states
  • additionally, a small living cost (here -0.04) for each step
• Actions N, E, S, W in each state
  • walls bounce the agent back
  • moves are noisy
A commonly used example, the ”canonical maze” (one possible encoding is sketched below)
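To make the setup concrete, here is a hypothetical Python encoding of such a grid-world. The layout below (a 4×3 grid with one blocked cell and +1/−1 terminal states) is the standard Russell–Norvig version of the canonical maze and is an assumption on my part; only the living cost of -0.04 is taken from the value-iteration example later in the deck.

```python
# Hypothetical encoding of a "canonical maze" grid-world (layout assumed:
# 4x3 grid, blocked cell at (2, 2), +1 terminal at (4, 3), -1 terminal at (4, 2)).

GRID_WIDTH, GRID_HEIGHT = 4, 3
BLOCKED = {(2, 2)}
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
LIVING_COST = -0.04            # small negative reward for every step taken

STATES = {
    (x, y)
    for x in range(1, GRID_WIDTH + 1)
    for y in range(1, GRID_HEIGHT + 1)
    if (x, y) not in BLOCKED
}

def reward(state, action, next_state):
    """Reward received when entering next_state, plus the living cost."""
    return TERMINALS.get(next_state, 0.0) + LIVING_COST
```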
Maze - Noisy moves
Maze transition probabilities with noisy moves
• Two-state maze
• Noisy moves: the intended move succeeds with probability 0.8; with probability 0.1 each, the agent slips to one of the two perpendicular directions
• Attempted move NORTH shown; east, south, west omitted for clarity (see the sketch below)
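A minimal sketch of the noisy-move transition model just described: probability 0.8 for the intended direction and 0.1 for each perpendicular slip (the values used in the round-0 computation later on), with walls bouncing the agent back. The function name and state encoding are my own.

```python
# Noisy-move transition model: 0.8 intended direction, 0.1 per perpendicular slip.
ACTIONS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def transition_probs(state, action, states, noise=0.1):
    """Return {successor_state: probability} for a noisy move from state."""
    probs = {}
    moves = [(action, 1.0 - 2 * noise)] + [(a, noise) for a in PERPENDICULAR[action]]
    for direction, p in moves:
        dx, dy = ACTIONS[direction]
        target = (state[0] + dx, state[1] + dy)
        if target not in states:          # wall: bounce back to the current state
            target = state
        probs[target] = probs.get(target, 0.0) + p
    return probs

# Two-state example: attempting NORTH when only (1, 1) and (1, 2) exist
print(transition_probs((1, 1), "N", states={(1, 1), (1, 2)}))
# {(1, 2): 0.8, (1, 1): 0.2}
```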
Policy
• A policy $\pi$ defines how the agent behaves in an MDP environment
• A policy is a mapping from each state to an action
• A deterministic policy $\pi(s) = a$ always returns the same action for a state
• A stochastic policy $\pi(a \mid s)$ gives a probability for each action in a state
One possible deterministic policy for the maze (both kinds are sketched below)
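As a small illustration (an assumed representation, not taken from the slides), a deterministic policy can be stored as a state-to-action mapping and a stochastic policy as a state-to-distribution mapping:

```python
import random

deterministic_policy = {(1, 1): "N", (1, 2): "E"}     # pi(s) = a
stochastic_policy = {(1, 1): {"N": 0.7, "E": 0.3}}    # pi(a | s)

def sample_action(policy, state, rng=random):
    """Return an action: directly for a deterministic policy, sampled for a stochastic one."""
    entry = policy[state]
    if isinstance(entry, str):
        return entry
    actions, probs = zip(*entry.items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action(deterministic_policy, (1, 1)))    # always 'N'
print(sample_action(stochastic_policy, (1, 1)))       # 'N' about 70% of the time
```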
Value function
• An agent exploring the MDP environment would observe a sequence $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$
• Discounted return, or utility, from time step $t$ onwards is the sum of discounted rewards received:
  $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
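A quick numerical check of this definition, assuming an arbitrary reward sequence and $\gamma = 0.9$ (the discount value is only for illustration):

```python
# G_t = sum_k gamma^k * R_{t+k+1}, here for the reward sequence [-0.04, -0.04, 1.0].
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([-0.04, -0.04, 1.0], gamma=0.9))   # -0.04 - 0.036 + 0.81 = 0.734
```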
The state-value function
• If the agent were following a policy $\pi$, then in each state $s$ the agent would select the action defined by that policy
• The state-value function of a state $s$ under policy $\pi$, denoted $v_\pi(s)$, is the expected discounted return when following the policy from state $s$ onwards:
  $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
Bellman expectation equation for $v_\pi$
• Previously, we defined
  $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ and $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
• Manipulating, $G_t = R_{t+1} + \gamma G_{t+1}$, so
  $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
• Expanding the expectation, we get
  $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma v_\pi(s') \right]$
Bellman expectation equation for $v_\pi$
• The Bellman expectation equation for the state-value function defines a recursive relationship between the value of a state and the values of its successor states
• The value of a state decomposes into
  • the expected immediate reward received when selecting the action as defined by the policy
  • the expected discounted value of the successor state
State-value function, action-value function
• The state-value function $v_\pi(s)$ defines the expected discounted return when following the policy from state $s$ onwards:
  $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
• Similarly, we can define the action-value function for policy $\pi$:
  $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
• The action-value function defines the expected utility when starting in state $s$, performing action $a$ and following policy $\pi$ thereafter
• After receiving the immediate reward, the discounted value of the successor state is received:
  $q_\pi(s, a) = \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma v_\pi(s') \right]$
• This illustrates the recursive relation between $v_\pi$ and $q_\pi$: $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
Optimal value functions
• An optimal value function is one that gives the best expected utility for each state or state-action pair, $v_*(s)$ and $q_*(s, a)$, over all policies:
  $v_*(s) = \max_\pi v_\pi(s)$, $\quad q_*(s, a) = \max_\pi q_\pi(s, a)$
• $v_*(s)$ gives the expected return for being in state $s$ and following the optimal policy
• while $q_*(s, a)$ gives the expected return for taking action $a$ in state $s$ and following the optimal policy thereafter
Optimal policy
• We consider a policy $\pi$ to be equal to or better than some other policy $\pi'$ if the state value in all states is at least as good for $\pi$:
  $\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s)$ for all $s$
• For any MDP, there always exists an optimal deterministic policy $\pi_*$ that is at least as good as all other policies: $\pi_* \ge \pi$ for all $\pi$
• There can be multiple optimal policies, but all optimal policies achieve the optimal value functions $v_*$ and $q_*$
• So, finally, the optimal policy is the policy (or one of the best policies) that gives the best value for each state
Bellman optimality equations
• A couple of slides back we defined the Bellman expectation equation for $v_\pi$
• We can derive a similar relation between the optimal value functions, called the Bellman optimality equation (for $v_*$):
  $v_*(s) = \max_a \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma v_*(s') \right]$
Bellman optimality equations
• Starting from
  $q_*(s, a) = \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma v_*(s') \right]$
• Noting that
  $v_*(s) = \max_a q_*(s, a)$
• We get
  $v_*(s) = \max_a \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma v_*(s') \right]$
• And combining the above, also
  $q_*(s, a) = \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma \max_{a'} q_*(s', a') \right]$
Bellman equations as update rules (backups)
• We can derive update rules from the Bellman equations, called Bellman backups
• For the Bellman expectation equation:
  $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma V(s') \right]$
• For the Bellman optimality equation (see the sketch below):
  $V(s) \leftarrow \max_a \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma V(s') \right]$
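In code, the two backups differ only in whether we average over the policy's actions or maximize over all actions. The interfaces below (policy[s] as an action-to-probability mapping, P(s, a) returning a successor-to-probability dictionary, R(s, a, s2) returning a reward) are assumptions for this sketch, not the author's implementation.

```python
def expectation_backup(s, policy, P, R, V, gamma):
    """V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]"""
    return sum(
        pi_a * sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
        for a, pi_a in policy[s].items()
    )

def optimality_backup(s, actions, P, R, V, gamma):
    """V(s) <- max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]"""
    return max(
        sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
        for a in actions
    )
```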
Policy evaluation
• The task of computing the state-value function $v_\pi(s)$ for all states under policy $\pi$ is called policy evaluation
• Applying the Bellman expectation equation
  $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma v_\pi(s') \right]$
• gives us a system of simultaneous linear equations that can be solved using a suitable standard method
• Alternatively, we can use the Bellman expectation backup as an update rule, sweep through the states repeatedly and update each state in turn
• This is called iterative policy evaluation and is guaranteed to converge to $v_\pi$ (sketched below)
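A minimal sketch of iterative policy evaluation under the same assumed interfaces as above; it sweeps the states until the largest value change falls below a threshold.

```python
def iterative_policy_evaluation(states, policy, P, R, gamma=0.9, theta=1e-6):
    """Sweep the states repeatedly, applying the Bellman expectation backup in place."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi_a * sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
                for a, pi_a in policy.get(s, {}).items()   # terminal states: empty mapping
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```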
Matrix implementation of policy evaluation
• To solve the system using a matrix equation, we define the following matrices:
  $v^\pi$ is the vector of state values, where $v^\pi_i = v_\pi(s_i)$
  $r^\pi$ is the vector* of rewards
  $P^\pi$ is the matrix of transition probabilities, where $P^\pi_{ij}$ represents the probability of moving from state $s_i$ to state $s_j$ under policy $\pi$
• We can represent the Bellman expectation equation in matrix form as
  $v^\pi = r^\pi + \gamma P^\pi v^\pi$
• Solving for $v^\pi$ yields
  $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$
• This involves inversion of an $|\mathcal{S}| \times |\mathcal{S}|$ matrix, an $O(|\mathcal{S}|^3)$ complexity operation (see the sketch below)
*) when rewards depend only on the target state $s'$.
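A sketch of the direct solve using NumPy. Here P_pi and r are assumed to be the policy-induced transition matrix and the expected immediate reward vector, and the two-state chain is only a toy example, not the maze.

```python
import numpy as np

def solve_policy_evaluation(P_pi, r, gamma=0.9):
    """Solve (I - gamma * P_pi) v = r exactly; O(|S|^3), like the inversion above."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)

# Toy two-state chain: state 0 always moves to state 1, state 1 loops on itself.
# Expected immediate rewards are 0 in state 0 and 1 in state 1.
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
r = np.array([0.0, 1.0])
print(solve_policy_evaluation(P_pi, r))   # v = [9, 10] with gamma = 0.9
```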
Value iteration algorithm
• Perform a synchronous sweep of the state space
• Update the value of each state based on the values of adjacent states using the Bellman optimality backup
• Stop when state values no longer change (much)
• Extract the policy corresponding to the state values (see the sketch below)
• The algorithm will converge to the optimal value function*
*) assuming finite episodes or $\gamma < 1$.
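A compact sketch of value iteration with greedy policy extraction at the end. As before, the interfaces (actions(s) returning a list of legal actions, empty for terminal states; P(s, a); R(s, a, s2)) are assumptions, not the author's implementation.

```python
def value_iteration(states, actions, P, R, gamma=1.0, theta=1e-6):
    """Synchronous value iteration; gamma < 1 or episodic termination assumed for convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        V_new = {}                        # synchronous sweep: update from the old values
        for s in states:
            backups = [
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
                for a in actions(s)
            ]
            V_new[s] = max(backups) if backups else V[s]
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:
            break
    # extract a greedy policy from the converged values
    policy = {}
    for s in states:
        acts = actions(s)
        if acts:
            policy[s] = max(
                acts,
                key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items()),
            )
    return V, policy
```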
Value iteration example – canonical maze
State rewards (including the living cost of -0.04)
Round 0
State values initialized to zero
0.8*(1+0) + 0.1*(0+0) + 0.1*(0+0) - 0.04 = 0.76
0.8*(0+0) + 0.1*(0+0) + 0.1*(0+0) - 0.04 = -0.04
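Reproducing the two round-0 backups above as a sanity check: all successor values are still zero, so only the immediate rewards and the living cost contribute.

```python
# Round-0 backups: cell next to the +1 terminal, and a cell surrounded by zero-reward states.
cell_next_to_goal = 0.8 * (1 + 0) + 0.1 * (0 + 0) + 0.1 * (0 + 0) - 0.04
other_cell = 0.8 * (0 + 0) + 0.1 * (0 + 0) + 0.1 * (0 + 0) - 0.04
print(round(cell_next_to_goal, 5), round(other_cell, 5))   # 0.76 -0.04
```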
Round 0
ROUND 0, 0.76
State values after round 0 Policy corresponding to state values
(which we would not normally know or bother to extract until the end)
Round 1
ROUND 0, 0.76
ROUND 1, 0.6
Round 2
ROUND 0, 0.76
ROUND 1, 0.6
ROUND 2, 0.47199
Round 3
ROUND 0, 0.76
ROUND 1, 0.6
ROUND 2, 0.47199
ROUND 3, 0.3696
Round 7
ROUND 0, 0.76
ROUND 1, 0.6
ROUND 2, 0.47199
ROUND 3, 0.3696
ROUND 4, 0.32249
ROUND 5, 0.22240
ROUND 6, 0.14542
ROUND 7, 0.07915
Round 11
ROUND 0, 0.76
ROUND 1, 0.6
ROUND 2, 0.47199
ROUND 3, 0.3696
ROUND 4, 0.32249
ROUND 5, 0.22240
ROUND 6, 0.14542
ROUND 7, 0.07915
ROUND 8, 0.05098
ROUND 9, 0.04350
ROUND 10, 0.02816
ROUND 11, 0.01685
Round 21
ROUND 0, 0.76
ROUND 1, 0.6
ROUND 2, 0.47199
ROUND 3, 0.3696
ROUND 4, 0.32249
ROUND 5, 0.22240
ROUND 6, 0.14542
ROUND 7, 0.07915
ROUND 8, 0.05098
ROUND 9, 0.04350
ROUND 10, 0.02816
ROUND 11, 0.01685
ROUND 12, 0.00947
ROUND 13, 0.00829
ROUND 14, 0.00723
ROUND 15, 0.00466
ROUND 16, 0.00258
ROUND 17, 0.00135
ROUND 18, 0.00069
ROUND 19, 0.00034
ROUND 20, 0.00017
ROUND 21, 8.3e-05
Policy iteration algorithm
• Perform policy evaluation: evaluate state values under the current policy*
• Improve the policy by determining the best action in each state using the state-value function determined during the policy evaluation step
• Stop when the policy no longer changes (see the sketch below)
• Each improvement step improves the value function, and when improvement stops, the optimal policy and value function have been found
*) using either iterative policy evaluation or solving the linear system
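A sketch of policy iteration that alternates the two steps above, using iterative policy evaluation for step 1. Interfaces are assumed as in the earlier sketches: actions(s) returns a list of legal actions (empty for terminal states), and the policy maps each non-terminal state to a single action.

```python
def policy_iteration(states, actions, P, R, gamma=1.0, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    # start from an arbitrary deterministic policy (first legal action in each state)
    policy = {s: actions(s)[0] for s in states if actions(s)}
    while True:
        # 1. Policy evaluation: iterative, in place, under the current policy
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s, a in policy.items():
                v_new = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to the evaluated values
        stable = True
        for s in policy:
            best = max(
                actions(s),
                key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items()),
            )
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```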
Policy iteration, canonical maze - Round 4
ROUND 0, 6
ROUND 1, 1
ROUND 2, 1
ROUND 3, 1
ROUND 4, 0
Example – a complex maze
Example – a complex maze
ROUND 0, 20
ROUND 1, 13
ROUND 2, 8
ROUND 3, 3
ROUND 4, 0
So…
• When we have a fully defined MDP, i.e. we know all of $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$
• We can apply an iterative dynamic programming algorithm to the problem
• The algorithm is guaranteed to converge to the optimal value function and policy
• Where we do a full sweep* of the state space or invert a matrix
*) not strictly necessary, as asynchronous alternatives exist where we do not perform full sweeps, i.e. synchronous backups
Utility of classical DP
“Classical DP algorithms are of limited utility in reinforcement learning both
because of their assumption of a perfect model and because of their great
computational expense, but they are still important theoretically.
DP provides an essential foundation for the understanding of the methods
presented in the rest of this book.
In fact, all of these methods can be viewed as attempts to achieve much the same
effect as DP, only with less computation and without assuming a perfect model of
the environment.”
- Sutton and Barto, Reinforcement Learning: An Introduction
Part II
• What if we do not know the MDP
• Or don’t care to know
• Or cannot iterate across all states
• Or do not want to blindly iterate
-> Reinforcement Learning!
References
Video lectures: https://www.davidsilver.uk/teaching/
Books:
• http://incompleteideas.net/book/RLbook2020.pdf
• http://artint.info/2e/index.html
Random selection of university classes & lecture materials:
• http://mlg.eng.cam.ac.uk/teaching/4f13/1011/lect1214.pdf
• https://www.cs.cmu.edu/~mgormley/courses/10701-f16/schedule.html
• https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture26-ri.pdf
• https://www.andrew.cmu.edu/course/10-703/slides/lecture3_exactmethods-9-5-2018.pdf
• https://www.cs.cmu.edu/~15381-f17/
• http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15780-s16/www/
• https://web.mit.edu/6.246/www/lectures/L3-2021sp.pdf
• https://web.stanford.edu/class/cs234/slides/lecture2.pdf
• https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf
• https://people.eecs.berkeley.edu/~pabbeel/cs287-fa11/slides/mdps-intro-value-iteration.pdf
• https://inst.eecs.berkeley.edu/~cs188/sp20/assets/lecture/lec11_6up.pdf