Example: Maze Navigation
• State s_t: agent location
• Action a_t: where to move
• Reward r_t:
  • Prize for reaching target
  • Cost for hitting wall
Markov Decision Process (MDP)
• Trajectory: sequence of states, actions and rewards
  s_0, a_0, r_0, s_1, a_1, r_1, s_2, ...
• State dynamics: p(s_{t+1} | s_t, a_t)
• Policy: deterministic a_t = π(s_t); stochastic π(a_t | s_t)
• Reward: r_t = r(s_t, a_t)
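To make the maze MDP concrete, here is a minimal Python sketch of its ingredients; the grid size, wall cells, reward values, and helper names (p_next, reward, policy) are illustrative assumptions rather than details taken from the example.

import random

WALLS = {(1, 1), (2, 1)}    # assumed wall cells on a 4x4 grid
TARGET = (3, 3)             # assumed target cell
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def p_next(s, a):
    # State dynamics p(s_{t+1} | s_t, a_t): deterministic grid movement,
    # clamped to the grid; bumping into a wall leaves the state unchanged.
    dx, dy = ACTIONS[a]
    nxt = (min(max(s[0] + dx, 0), 3), min(max(s[1] + dy, 0), 3))
    return s if nxt in WALLS else nxt

def reward(s, a):
    # Reward r(s_t, a_t): prize for reaching the target, cost for hitting a wall.
    dx, dy = ACTIONS[a]
    if (s[0] + dx, s[1] + dy) in WALLS:
        return -1.0
    if p_next(s, a) == TARGET:
        return +1.0
    return 0.0

def policy(s):
    # A stochastic policy π(a_t | s_t): here simply uniform over the four actions.
    return random.choice(list(ACTIONS))

A trajectory s_0, a_0, r_0, s_1, a_1, r_1, ... is then generated by alternating the policy, the reward, and the dynamics, as in the rollout loop sketched after the next slide.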
"Rolling Out"
• Environment
  • Reset() → get initial state s_0
  • Step(a_t) → get reward r(s_t, a_t), draw next state s_{t+1} ~ p(s_{t+1} | s_t, a_t)
• Agent policy
  • Action(s_t) → draw next action a_t ~ π(a_t | s_t)
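Below is a sketch of the rollout loop the slide describes, reusing the maze functions above; the Environment/Agent class names mirror the slide's Reset/Step/Action interface, and their exact signatures are assumptions.

class Environment:
    def reset(self):
        # Reset() → get initial state s_0 (assumed start cell).
        self.s = (0, 0)
        return self.s

    def step(self, a):
        # Step(a_t) → get reward r(s_t, a_t), draw next state s_{t+1} ~ p(. | s_t, a_t).
        r = reward(self.s, a)
        self.s = p_next(self.s, a)
        return r, self.s

class Agent:
    def action(self, s):
        # Action(s_t) → draw next action a_t ~ π(a_t | s_t).
        return policy(s)

def rollout(env, agent, horizon=20):
    # Produce a trajectory s_0, a_0, r_0, s_1, a_1, r_1, ...
    s = env.reset()
    trajectory = []
    for _ in range(horizon):
        a = agent.action(s)
        r, s_next = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory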
Example: Game of Go
• State: board
• Action: place stone
• Reward: captures
• Environment: can be simulated
Policy Evaluation
• Return: R = r_0 + γ r_1 + γ^2 r_2 + ...
• Discount 0 ≤ γ ≤ 1: prefer early rewards, late costs
• Policy value is its expected return: V_π = E[R]
  [Figure: sampled rollouts with returns R = 7, 5, 7, 7, 5, 7; average return ≈ 6.3]
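As a sketch of how the expected return can be estimated in practice, the discounted return of each rollout can be averaged over many rollouts (Monte Carlo policy evaluation); the discount value and number of rollouts below are assumptions.

def discounted_return(trajectory, gamma=0.9):
    # R = r_0 + γ r_1 + γ^2 r_2 + ...
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))

def estimate_value(env, agent, gamma=0.9, n_rollouts=1000):
    # Empirical estimate of V_π = E[R], averaged over sampled rollouts.
    returns = [discounted_return(rollout(env, agent), gamma) for _ in range(n_rollouts)]
    return sum(returns) / len(returns)

With the six sampled returns in the slide's figure (7, 5, 7, 7, 5, 7), this average comes out to roughly 6.3.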
Value Function
• Policy value function is its expected return given the current state and action:
  Q_π(s, a) = E[R | s, a]
• Optimal policy satisfies π(s) = arg max_a Q_π(s, a)
Value Iteration
• Bellman equation:
  Q(s_t, a_t) = r(s_t, a_t) + γ E[max_{a_{t+1}} Q(s_{t+1}, a_{t+1})]
• Iterate to convergence
• Final policy is π(s) = arg max_a Q(s, a)
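A minimal tabular sketch of value iteration on the maze MDP above, repeating the Bellman update until the largest change falls below a tolerance; since the assumed dynamics are deterministic, the expectation over the next state reduces to a single term.

def value_iteration(states, gamma=0.9, tol=1e-6):
    Q = {(s, a): 0.0 for s in states for a in ACTIONS}
    delta = tol + 1.0
    while delta > tol:                    # iterate to convergence
        delta = 0.0
        for s in states:
            for a in ACTIONS:
                s_next = p_next(s, a)     # deterministic dynamics: E[.] is one term
                target = reward(s, a) + gamma * max(Q[(s_next, b)] for b in ACTIONS)
                delta = max(delta, abs(target - Q[(s, a)]))
                Q[(s, a)] = target
    # Final policy: π(s) = arg max_a Q(s, a)
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in states}, Q

states = [(x, y) for x in range(4) for y in range(4) if (x, y) not in WALLS]
greedy_policy, Q = value_iteration(states)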
Representing Value
• How to represent Q(s, a) for a large state space?
• Approximate Q with deep representations
• Deep Q Net generalizes to unseen states
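For large state spaces, the table above is replaced by a function approximator. A minimal sketch of a Q-network in PyTorch follows; the framework choice, layer sizes, and input encoding are illustrative assumptions, not the architecture from the slides.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one Q-value per action, so the greedy policy
    # π(s) = arg max_a Q(s, a) is a forward pass followed by an argmax.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)            # shape: (batch, n_actions)

q_net = QNetwork(state_dim=2, n_actions=4)         # e.g. (x, y) maze state, 4 moves
greedy_action = q_net(torch.tensor([[0.0, 0.0]])).argmax(dim=1)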