Optimal Control
= given an MDP (S, A, P, R, γ, H),
  find the optimal policy π*

Outline
§ Exact Methods:
  § Value Iteration
  § Policy Iteration

For now: small discrete state-action spaces, as they make it simpler to get the main concepts across. We will consider large / continuous spaces later!
Value Iteration
§ $V^*_0(s)$ = optimal value for state s when H = 0
  $V^*_0(s) = 0 \quad \forall s$
§ $V^*_1(s)$ = optimal value for state s when H = 1
  $V^*_1(s) = \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^*_0(s')\right)$
§ $V^*_2(s)$ = optimal value for state s when H = 2
  $V^*_2(s) = \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^*_1(s')\right)$
§ $V^*_k(s)$ = optimal value for state s when H = k
  $V^*_k(s) = \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^*_{k-1}(s')\right)$
Value Iteration Convergence
Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite horizon problem, which satisfies the Bellman equations:
  $V^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$
§ Now we know how to act for the infinite horizon with discounted rewards!
§ Run value iteration till convergence.
§ This produces V*, which in turn tells us how to act, namely by following:
  $\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$
§ Note: the infinite horizon optimal policy is stationary, i.e., the optimal action at a state s is the same action at all times. (Efficient to store!)
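A sketch of this infinite-horizon variant, under the same hypothetical tabular arrays as before: keep applying the backup until the max-norm change falls below a tolerance, then read off the stationary greedy policy π* from the final Q-values.

```python
import numpy as np

def value_iteration_inf(P, R, gamma, tol=1e-8):
    """Run Bellman backups to (approximate) convergence; return V* and greedy pi*."""
    V = np.zeros(P.shape[0])
    while True:
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # max-norm stopping rule
            return V_new, Q.argmax(axis=1)   # V*, stationary policy pi*(s)
        V = V_new
```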
Convergence and Contractions
§ Define the max-norm: $\|V\| = \max_s |V(s)|$
§ Theorem: For any two approximations U and V,
  $\|U_{k+1} - V_{k+1}\| \le \gamma \, \|U_k - V_k\|$
  I.e., any two distinct approximations must get closer to each other; so, in particular, any approximation must get closer to the true V*, and value iteration converges to a unique, stable, optimal solution.
§ Theorem:
  $\|V_{k+1} - V_k\| < \epsilon \;\Rightarrow\; \|V_{k+1} - V^*\| < 2\epsilon\gamma/(1 - \gamma)$
  I.e., once the change in our approximation is small, it must also be close to correct.
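As a quick numerical illustration of the first theorem (an assumed example, not from the slides): on a random MDP, two value-iteration runs started from different initializations shrink their max-norm distance by at least a factor of γ per backup.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into valid transition probabilities
R = rng.random((nS, nA, nS))

def backup(V):
    # One Bellman backup: max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    return np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :]).max(axis=1)

U, V = rng.random(nS), rng.random(nS)
for _ in range(5):
    dist = np.max(np.abs(U - V))  # max-norm ||U - V||
    U, V = backup(U), backup(V)
    assert np.max(np.abs(U - V)) <= gamma * dist + 1e-12  # contraction holds
```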
Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally

Bellman Equation:
  $Q^*(s, a) = \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right)$
Q-Value Iteration:
  $Q^*_{k+1}(s, a) \leftarrow \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma \max_{a'} Q^*_k(s', a')\right)$
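A minimal sketch of Q-value iteration under the same hypothetical tabular representation as the earlier snippets; the function name is illustrative.

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iters):
    """Q*_{k+1}(s,a) <- sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * max_{a'} Q*_k(s',a'))."""
    n_states, n_actions = P.shape[0], P.shape[1]
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # target[s, a, s'] = R(s,a,s') + gamma * max_{a'} Q(s', a')
        target = R + gamma * Q.max(axis=1)[None, None, :]
        Q = np.einsum("sap,sap->sa", P, target)
    return Q
```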
Exercise 2
Consider a stochastic policy π(a|s), where π(a|s) is the probability of taking action a when in state s. Which of the following is the correct update to perform policy evaluation for this stochastic policy?

1. $V^\pi_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^\pi_k(s')\right)$
2. $V^\pi_{k+1}(s) \leftarrow \sum_{s'} \sum_a \pi(a|s)\, P(s'|s,a)\left(R(s,a,s') + \gamma V^\pi_k(s')\right)$
3. $V^\pi_{k+1}(s) \leftarrow \sum_a \pi(a|s) \max_{s'} P(s'|s,a)\left(R(s,a,s') + \gamma V^\pi_k(s')\right)$
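To make the notation concrete, here is a sketch of how update 2 above would look in code; pi is a hypothetical array holding pi[s, a] = π(a|s), and P, R are the tabular arrays assumed in the earlier snippets.

```python
import numpy as np

def stochastic_policy_update(V, pi, P, R, gamma):
    """One application of update 2: expectation over actions and next states."""
    # Q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
    return (pi * Q).sum(axis=1)  # weight each action by pi(a|s)
```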
Optimal Control
= given an MDP (S, A, P, R, γ, H),
  find the optimal policy π*

Outline
§ Exact Methods:
  § Value Iteration
  § Policy Iteration

Limitations:
• Iteration over / storage for all states and actions: requires a small, discrete state-action space
• Update equations require access to the dynamics model