Optimal Control
= given an MDP (S, A, P, R, γ, H),
  find the optimal policy π*

Outline
§ Exact Methods:
  § Value Iteration
  § Policy Iteration

For now: small discrete state-action spaces, as they make it simpler to get the main concepts across. We will consider large / continuous spaces later!
Value Iteration
§ $V^*_0(s)$ = optimal value for state s when H = 0
  $V^*_0(s) = 0 \quad \forall s$
§ $V^*_1(s)$ = optimal value for state s when H = 1
  $V^*_1(s) = \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^*_0(s')\right)$
§ $V^*_2(s)$ = optimal value for state s when H = 2
  $V^*_2(s) = \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^*_1(s')\right)$
§ $V^*_k(s)$ = optimal value for state s when H = k
  $V^*_k(s) = \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^*_{k-1}(s')\right)$
Value Iteration Convergence
Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite horizon problem, which satisfies the Bellman equations:
  $V^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$
§ Now we know how to act for the infinite horizon with discounted rewards!
§ Run value iteration till convergence.
§ This produces V*, which in turn tells us how to act, namely by following:
  $\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$
§ Note: the infinite horizon optimal policy is stationary, i.e., the optimal action at a state s is the same action at all times. (Efficient to store!)
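A sketch of this infinite-horizon variant, under the same hypothetical tabular arrays as before: keep applying the backup until the max-norm change falls below a tolerance, then read off the stationary greedy policy π* from the final Q-values.

```python
import numpy as np

def value_iteration_inf(P, R, gamma, tol=1e-8):
    """Run Bellman backups to (approximate) convergence; return V* and greedy pi*."""
    V = np.zeros(P.shape[0])
    while True:
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # max-norm stopping rule
            return V_new, Q.argmax(axis=1)   # V*, stationary policy pi*(s)
        V = V_new
```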
Convergence and Contractions
§ Define the max-norm: $\|V\| = \max_s |V(s)|$
§ Theorem: For any two approximations U and V,
  $\|U_{k+1} - V_{k+1}\| \le \gamma \, \|U_k - V_k\|$
  I.e., any two distinct approximations must get closer to each other; so, in particular, any approximation must get closer to the true V*, and value iteration converges to a unique, stable, optimal solution.
§ Theorem:
  $\|V_{k+1} - V_k\| < \epsilon \;\Rightarrow\; \|V_{k+1} - V^*\| < 2\epsilon\gamma/(1 - \gamma)$
  I.e., once the change in our approximation is small, it must also be close to correct.
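As a quick numerical illustration of the first theorem (an assumed example, not from the slides): on a random MDP, two value-iteration runs started from different initializations shrink their max-norm distance by at least a factor of γ per backup.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into valid transition probabilities
R = rng.random((nS, nA, nS))

def backup(V):
    # One Bellman backup: max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    return np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :]).max(axis=1)

U, V = rng.random(nS), rng.random(nS)
for _ in range(5):
    dist = np.max(np.abs(U - V))  # max-norm ||U - V||
    U, V = backup(U), backup(V)
    assert np.max(np.abs(U - V)) <= gamma * dist + 1e-12  # contraction holds
```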
Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally

Bellman Equation:
  $Q^*(s, a) = \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right)$
Q-Value Iteration:
  $Q^*_{k+1}(s, a) \leftarrow \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma \max_{a'} Q^*_k(s', a')\right)$
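A minimal sketch of Q-value iteration under the same hypothetical tabular representation as the earlier snippets; the function name is illustrative.

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iters):
    """Q*_{k+1}(s,a) <- sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * max_{a'} Q*_k(s',a'))."""
    n_states, n_actions = P.shape[0], P.shape[1]
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # target[s, a, s'] = R(s,a,s') + gamma * max_{a'} Q(s', a')
        target = R + gamma * Q.max(axis=1)[None, None, :]
        Q = np.einsum("sap,sap->sa", P, target)
    return Q
```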
Exercise 2
Consider a stochastic policy π(a|s), where π(a|s) is the probability of taking action a when in state s. Which of the following is the correct update to perform policy evaluation for this stochastic policy?

1. $V^\pi_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^\pi_k(s')\right)$
2. $V^\pi_{k+1}(s) \leftarrow \sum_{s'} \sum_a \pi(a|s)\, P(s'|s,a)\left(R(s,a,s') + \gamma V^\pi_k(s')\right)$
3. $V^\pi_{k+1}(s) \leftarrow \sum_a \pi(a|s) \max_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^\pi_k(s')\right)$
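To make the notation concrete, here is a sketch of how update 2 above would look in code; pi is a hypothetical array holding pi[s, a] = π(a|s), and P, R are the tabular arrays assumed in the earlier snippets.

```python
import numpy as np

def stochastic_policy_update(V, pi, P, R, gamma):
    """One application of update 2: expectation over actions and next states."""
    # Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
    return (pi * Q).sum(axis=1)  # weight each action by pi(a|s)
```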
Optimal Control
= given an MDP (S, A, P, R, γ, H),
  find the optimal policy π*

Outline
§ Exact Methods:
  § Value Iteration
  § Policy Iteration

Limitations:
• Iteration over / storage for all states and actions: requires a small, discrete state-action space
• Update equations require access to the dynamics model