Lecture 10: Policy gradient
Radoslav Neychev
Machine Learning course
advanced track
MIPT
08.11.2019, Moscow, Russia
These slides are almost an exact copy of the Practical RL course week 6 slides.
Special thanks to the YSDA team for making them publicly available.
Original slides link: week06_policy_based
References
2
Small experiment
The next slide contains a question
Please respond as fast as you can!
3
Small experiment
left or right?
4
Small experiment
Right! Ready for the next one?
5
Small experiment
What's Q(s,right) under gamma=0.99?
6
Small experiment
What's Q(s,right) under gamma=0.99?
7
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(st, at) − (rt + γ·max_{a'} Q(st+1, a')))² ]

            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100
Q: Which prediction is better (A/B)?
8
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(st, at) − (rt + γ·max_{a'} Q(st+1, a')))² ]

            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100

(A) gives the better policy, (B) gives the lower MSE.
9
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(st, at) − (rt + γ·max_{a'} Q(st+1, a')))² ]

            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100

(A) gives the better policy, (B) gives the lower MSE.
Q-learning will prefer worse policy (B)!
10
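To make the point concrete, here is the arithmetic behind the table as a small sketch (not from the slides): (A) takes a huge squared error on Q(s1,a1) yet keeps every greedy action correct, while (B) gets a tiny MSE and flips the greedy action in s0.

```python
import numpy as np

q_true = np.array([[1, 2], [3, 100]])   # rows: s0, s1; columns: a0, a1
q_pred = {"A": np.array([[1, 2], [3, 50]]),
          "B": np.array([[2, 1], [3, 100]])}

for name, q in q_pred.items():
    mse = ((q - q_true) ** 2).mean()
    greedy = q.argmax(axis=1)            # greedy action in s0 and s1
    print(name, "MSE =", mse, "greedy actions =", greedy)

# A: MSE = 625.0, greedy actions = [1 1]  -> same policy as the true Q
# B: MSE = 0.5,   greedy actions = [0 1]  -> wrong action in s0
```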
Conclusion
● Often computing q-values is harder than
picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)
Q: what algorithm works that way?
(of those we studied)
11
Conclusion
● Often computing q-values is harder than
picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)
Q: what algorithm works that way?
e.g. crossentropy method
12
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]
13
how humans survived
π(run∣s)=1
14
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
15
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
Q: Any case where stochastic is better?
e.g. rock-paper-scissors
16
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
– same action each time
– e.g. genetic algos (week 0), deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
– sampling takes care of exploration
– e.g. crossentropy method, policy gradient
Q: how to represent policy in continuous action space?
17
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
– same action each time
– e.g. genetic algos (week 0), deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
– sampling takes care of exploration
– e.g. crossentropy method, policy gradient
categorical, normal, mixture of normals, whatever
18
Two approaches
● Value based:
– Learn value function Qθ(s,a) or Vθ(s)
– Infer policy: a = argmax_a Qθ(s,a)
● Policy based:
– Explicitly learn policy πθ(a∣s) or πθ(s)→a
– Implicitly maximize reward over the policy
19
Recap: crossentropy method
● Initialize NN weights
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
θ0 ← random
θi+1 = θi + α·∇ Σ_i log πθi(ai∣si) · [si, ai ∈ Elite]
20
Recap: crossentropy method
● Initialize NN weights
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
TD version: elite (s,a) that have highest G(s,a)
(select elites independently from each state)
θ0 ← random
θi+1 = θi + α·∇ Σ_i log πθi(ai∣si) · [si, ai ∈ Elite]
21
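A minimal sketch of one crossentropy-method iteration for a discrete-action policy network, under the assumption that helpers named `policy`, `optimizer`, and `sample_session` exist (these names are hypothetical, not from the slides). Maximizing Σ log πθ(ai∣si) over elite pairs is just a cross-entropy classification loss on the elite actions.

```python
import numpy as np
import torch
import torch.nn.functional as F

def cem_step(policy, optimizer, sample_session, n_sessions=100, percentile=70):
    # each session: (states [T, obs_dim], actions [T], total_reward)
    sessions = [sample_session(policy) for _ in range(n_sessions)]
    rewards = np.array([r for _, _, r in sessions])
    threshold = np.percentile(rewards, percentile)

    # elite = concatenate (s, a) pairs from the best sessions
    elite_states = np.concatenate([s for s, a, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for s, a, r in sessions if r >= threshold])

    # maximize sum of log pi(a|s) over elites  <=>  minimize cross-entropy
    logits = policy(torch.as_tensor(elite_states, dtype=torch.float32))
    loss = F.cross_entropy(logits, torch.as_tensor(elite_actions, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```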
Policy gradient main idea
Why so complicated?
We'd rather maximize reward directly!
22
Objective
Expected reward:
J = E_{s∼p(s), a∼πθ(a∣s), ...} [ R(s, a, s', a', ...) ]
Expected discounted reward:
J = E_{s∼p(s), a∼πθ(a∣s)} [ G(s, a) ]
23
Objective
Expected reward (the “R(z)” setting):
J = E_{s∼p(s), a∼πθ(a∣s), ...} [ R(s, a, s', a', ...) ]
Expected discounted reward:
J = E_{s∼p(s), a∼πθ(a∣s)} [ G(s, a) ],   where G(s,a) = r + γ·G(s',a')
24
Objective
Consider a 1-step process for simplicity:
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
25
Objective
Consider a 1-step process for simplicity:
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
Here p(s) is the state visitation frequency (it may depend on the policy), and R(s,a) is the reward for the 1-step session.
Q: how do we compute that?
26
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
Sample N sessions under the current policy:
J ≈ (1/N) Σ_{i=1..N} R(si, ai)
27
Objective
Can we optimize the policy now?
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
Sample N sessions:
J ≈ (1/N) Σ_{i=1..N} R(si, ai)
28
Objective
The parameters θ “sit” inside πθ(a∣s); we don't know how to compute dJ/dθ directly.
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} R(s, a)
29
Optimization
Finite differences:
– Change the policy a little, evaluate:
∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
30
Optimization
Finite differences:
– Change the policy a little, evaluate:
∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
Q: any problems with those two?
31
Optimization
Finite differences:
– Change the policy a little, evaluate:
∇J ≈ (J_{θ+ε} − J_θ) / ε
– VERY noisy, especially if both J's are estimated from samples
Stochastic optimization:
– Good old crossentropy method: maximize probability of “elite” actions
– “quantile convergence” problems with stochastic MDPs
32
Objective
Wish list:
– Analytical gradient
– Easy/stable approximations
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
33
Log-derivative trick
Simple math (try the chain rule):
∇ log π(z) = ???
34
Log-derivative trick
Simple math:
∇ log π(z) = (1/π(z)) · ∇π(z)
π(z) · ∇ log π(z) = ∇π(z)
35
Policy gradient
Analytical inference
π(z) · ∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
36
Policy gradient
Analytical inference
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
Using π(z) · ∇ log π(z) = ∇π(z):
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇log πθ(a∣s) R(s, a) da ds
Q: anything curious about that formula?
37
Policy gradient
Analytical inference
π(z) · ∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇log πθ(a∣s) R(s, a) da ds
That's an expectation :)
38
REINFORCE (bandit)
∇J ≈ (1/N) Σ_{i=1..N} ∇log πθ(ai∣si) · R(si, ai)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
39
Discounted reward case
● Replace R with Q :)
π(z) · ∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) Q(s, a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇log πθ(a∣s) Q(s, a) da ds
That's an expectation :)
Here Q(s,a) is the true action value, a.k.a. E[ G(s,a) ].
40
REINFORCE (discounted)
● Policy gradient:
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) · Q(s, a) ]
● Approximate with sampling:
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
41
REINFORCE algorithm
We can estimate Q using the return G:
Gt = rt + γ·rt+1 + γ²·rt+2 + ...
Qπ(st, at) = E_{s'}[ G(st, at) ]
[figure: a trajectory tree rooted at (prev s, prev a), with states s, s', s'', actions a, a', a'' and rewards r, r', r'', r''' along sampled branches]
42
Recap: discounted rewards
Gt = rt + γ·rt+1 + γ²·rt+2 + ...
   = rt + γ·(rt+1 + γ·rt+2 + ...)
   = rt + γ·Gt+1
We can use this to compute all G's in linear time (see the sketch below).
43
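A minimal helper (assumed, not part of the original slides) that applies this recursion backwards over a finished session:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for all t in one backward pass."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# e.g. discounted_returns([0.0, 0.0, 1.0], gamma=0.99) -> [0.9801, 0.99, 1.0]
```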
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
44
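A sketch of this update in PyTorch, assuming a `policy` network that maps a batch of states to action logits and a batch of (states, actions, returns) collected from sampled sessions; autograd takes the place of the explicit ∇ log πθ, and the average is taken over all sampled (s, a) pairs.

```python
import torch
import torch.nn.functional as F

def reinforce_step(policy, optimizer, states, actions, returns):
    """One REINFORCE update: ascend mean of log pi(a|s) * G(s,a)."""
    states = torch.as_tensor(states, dtype=torch.float32)    # [T, obs_dim]
    actions = torch.as_tensor(actions, dtype=torch.long)     # [T]
    returns = torch.as_tensor(returns, dtype=torch.float32)  # [T], discounted G_t

    log_probs = F.log_softmax(policy(states), dim=-1)         # [T, n_actions]
    log_pi_a = log_probs[torch.arange(len(actions)), actions]  # log pi(a_t | s_t)

    loss = -(log_pi_a * returns).mean()   # minus sign: the optimizer minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```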
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
Q: is it off- or on-policy?
45
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
Actions are sampled under the current policy ⇒ on-policy
Value-based vs. policy-based
Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only?
Value-based vs. policy-based
Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only
We'll learn much more soon!
48
REINFORCE baselines
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient
What is better for learning: a random action in a good state, or a great action in a bad state?
49
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
50
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
Note that b(s) does not depend on a
Q: Can you simplify the second term?
51
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = E_{s∼p(s)} [ b(s) · E_{a∼πθ(a∣s)} ∇log πθ(a∣s) ] = 0
(since E_{a∼πθ(a∣s)} ∇log πθ(a∣s) = ∇ Σ_a πθ(a∣s) = ∇1 = 0)
52
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ]
Gradient direction doesn’t change!
53
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
54
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
If b(s) correlates with Q(s,a), the variance decreases.
55
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
Q: can you suggest any such b(s)?
56
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
Naive baseline: b = a moving average of Q over all (s, a); then Var[b(s)] = 0 and Cov[Q, b] > 0.
57
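A minimal sketch of such a naive baseline (the class name and update rule are assumptions, not from the slides): keep an exponential moving average of observed returns and subtract it from G before the gradient step.

```python
class MovingAverageBaseline:
    """Running average of observed returns, used as a baseline b."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.value = 0.0

    def update(self, returns):
        for g in returns:
            self.value += self.alpha * (g - self.value)  # exponential moving average
        return self.value

# usage inside REINFORCE: advantages = [g - baseline.value for g in returns]
#                         baseline.update(returns)
```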
REINFORCE baselines
Better baseline: b(s) = V(s)
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · (Q(s,a) − V(s))
Q: but how do we predict V(s)?
58
Actor-critic
● Learn both V(s) and πθ(a∣s)
● Hope for the best of both worlds :)
59
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
Q: how can we estimate A(s,a) from (s,a,r,s') and the V-function?
60
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
61
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
Also: an n-step version exists
62
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = r + γ·V(s') − V(s)   (consider it a constant: no gradient flows through A(s,a) in the actor update)
∇Jactor ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · A(s, a)
63
Advantage actor-critic
Improve policy:
∇Jactor ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · A(s, a)
Improve value:
Lcritic ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} (Vθ(s) − [r + γ·V(s')])²
[figure: one model with parameters W takes state s and outputs both πθ(a∣s) and Vθ(s)]
64
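A sketch of both updates with a shared network, assuming `net(states)` returns a pair (action logits [T, n_actions], state values [T]) and the batch tensors come from sampled sessions; this is a minimal single-process version, not the exact course implementation.

```python
import torch
import torch.nn.functional as F

def a2c_step(net, optimizer, states, actions, rewards, next_states, done, gamma=0.99):
    """One advantage-actor-critic update on a batch of transitions."""
    states = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.long)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    logits, values = net(states)            # pi(.|s) logits and V(s)
    _, next_values = net(next_states)       # V(s')

    target = rewards + gamma * next_values.detach() * (1.0 - done)
    advantage = target - values             # A(s,a) = r + gamma*V(s') - V(s)

    log_probs = F.log_softmax(logits, dim=-1)
    log_pi_a = log_probs[torch.arange(len(actions)), actions]

    actor_loss = -(log_pi_a * advantage.detach()).mean()   # treat A(s,a) as a constant
    critic_loss = advantage.pow(2).mean()                   # (V(s) - [r + gamma*V(s')])^2
    loss = actor_loss + critic_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```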
Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
65
Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
It doesn't :)
Just plug in a different formula for π(a∣s), e.g. a normal distribution (see the sketch below).
66
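For instance, a Gaussian policy head in PyTorch; a minimal sketch with assumed layer sizes, where only the distribution and its log-probability change while the REINFORCE/actor-critic machinery stays the same.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi(a|s) = Normal(mu(s), sigma) for continuous actions."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_sigma.exp())

# usage: dist = policy(states); a = dist.sample()
#        loss = -(dist.log_prob(a).sum(-1) * advantages).mean()
```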
Asynchronous advantage actor-critic
● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage
Read more: https://arxiv.org/abs/1602.01783
67
IMPALA
Read more: https://arxiv.org/abs/1802.01561
● Massively parallel
● Separate actor / learner processes
● Small experience replay
w/ importance sampling
68
Duct tape zone
● V(s) errors less important than in Q-learning
– actor still learns even if critic is random, just slower
● Regularize with entropy
– to prevent premature convergence
● Learn on parallel sessions
– Or super-small experience replay
● Use logsoftmax for numerical stability (see the sketch below)
69
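A sketch of the last two tricks (the function name and the entropy weight `beta` are assumptions, not from the slides): compute log-probabilities with `log_softmax` instead of `log(softmax(...))` and subtract an entropy bonus from the actor loss.

```python
import torch
import torch.nn.functional as F

def actor_loss_with_entropy(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss with an entropy bonus; advantages are assumed
    to be precomputed and detached from the computation graph."""
    log_probs = F.log_softmax(logits, dim=-1)      # numerically stable log pi(.|s)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)     # H[pi(.|s)] for each state

    log_pi_a = log_probs[torch.arange(len(actions)), actions]
    pg_loss = -(log_pi_a * advantages).mean()
    return pg_loss - beta * entropy.mean()         # entropy bonus fights premature convergence
```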
Outro and Q & A
● Remember the log-derivative trick
● Combining the best from both worlds is generally a good idea
● See this paper for the proof of the policy gradient for discounted rewards
● Time to write some code!
70
