2. References
These slides are almost an exact copy of the Practical RL course week 6 slides.
Special thanks to the YSDA team for making them publicly available.
Original slides link: week06_policy_based
10. Approximation error
DQN is trained to minimize
L ≈ E[(Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))²]
Simple 2-state world:
            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100

(A) gives the better policy; (B) gives the lower MSE.
Q-learning will prefer the worse policy (B)!
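A quick numeric check of the table above (a minimal numpy sketch; each Q-table is just the four values in its column):

```python
import numpy as np

# Rows: (s0,a0), (s0,a1), (s1,a0), (s1,a1)
q_true = np.array([1.0, 2.0, 3.0, 100.0])
q_A    = np.array([1.0, 2.0, 3.0, 50.0])   # approximation (A)
q_B    = np.array([2.0, 1.0, 3.0, 100.0])  # approximation (B)

for name, q in [("A", q_A), ("B", q_B)]:
    mse = np.mean((q - q_true) ** 2)        # A: 625.0, B: 0.5
    greedy_s0 = int(np.argmax(q[:2]))       # greedy action in s0 (true optimum: a1)
    print(f"({name}) MSE={mse}, greedy action in s0: a{greedy_s0}")
# (B) wins on MSE but picks the wrong action in s0.
```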
11. Conclusion
● Often computing Q-values is harder than picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)
Q: what algorithm works that way? (of those we studied)
12. Conclusion
● Often computing Q-values is harder than picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)
Q: what algorithm works that way?
A: e.g. the crossentropy method
13. NOT how humans survived
argmax [ Q(s, pet the tiger),
         Q(s, run from tiger),
         Q(s, provoke tiger),
         Q(s, ignore tiger) ]
15. Policies
In general, two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
16. Policies
In general, two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
A: e.g. rock-paper-scissors
17. Policies
In general, two kinds:
● Deterministic policy: a = πθ(s)
  – same action each time
  – used by genetic algos (week 0) and deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
  – sampling takes care of exploration
  – used by the crossentropy method and policy gradient
Q: how to represent the policy in a continuous action space?
18. Policies
In general, two kinds:
● Deterministic policy: a = πθ(s)
  – same action each time
  – used by genetic algos (week 0) and deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
  – sampling takes care of exploration
  – used by the crossentropy method and policy gradient
Q: how to represent the policy in a continuous action space?
A: categorical, normal, mixture of normals, whatever
19. Two approaches
● Value-based:
  – Learn a value function Qθ(s,a) or Vθ(s)
  – Infer the policy: a = argmax_a Qθ(s,a)
● Policy-based:
  – Explicitly learn the policy πθ(a∣s) or πθ(s)→a
  – Implicitly maximize reward over the policy
20. Recap: crossentropy method
● Initialize NN weights: θ_0 ← random
● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate
  – θ_{i+1} = θ_i + α·∇ Σ_t log πθ_i(a_t∣s_t) · [s_t, a_t ∈ Elite]
21. Recap: crossentropy method
● Initialize NN weights: θ_0 ← random
● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate
  – θ_{i+1} = θ_i + α·∇ Σ_t log πθ_i(a_t∣s_t) · [s_t, a_t ∈ Elite]
TD version: elite = the (s,a) pairs that have the highest G(s,a)
(select elites independently for each state)
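For concreteness, a minimal tabular sketch of the elite-refitting step (the input format is a hypothetical one: each session is a (list of (s,a) pairs, total reward) tuple):

```python
import numpy as np

def cem_update(policy, sessions, percentile=70, smoothing=1e-3):
    """One crossentropy-method step: keep elite sessions and
    refit pi(a|s) to the (s,a) pairs that occur in them."""
    rewards = np.array([r for _, r in sessions])
    threshold = np.percentile(rewards, percentile)
    counts = np.zeros_like(policy)
    for pairs, r in sessions:
        if r >= threshold:                     # elite session
            for s, a in pairs:
                counts[s, a] += 1
    counts += smoothing                        # keep every action possible
    return counts / counts.sum(axis=1, keepdims=True)

# Usage: policy = np.full((n_states, n_actions), 1 / n_actions), then iterate.
```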
26. Objective
Consider a 1-step process for simplicity, so R(s,a) is the reward for the whole 1-step session:
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
Here p(s) is the state visitation frequency (it may depend on the policy).
Q: how do we compute that?
27. Objective
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
A: sample N sessions under the current policy:
J ≈ (1/N) Σ_{i=1..N} R(s_i, a_i)
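In code the estimate is one line (a sketch; `run_session` is a hypothetical helper that plays one session under the current policy and returns its total reward):

```python
import numpy as np

def estimate_J(policy, run_session, N=100):
    # Monte Carlo estimate: average total reward over N sampled sessions
    return np.mean([run_session(policy) for _ in range(N)])
```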
28. Objective
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
J ≈ (1/N) Σ_{i=1..N} R(s_i, a_i)   (sample N sessions)
Can we optimize the policy now?
29. Objective
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈z_i} R(s,a)
The parameters θ “sit” inside πθ(a∣s), and we don't know how to compute dJ/dθ from the sampled estimate.
30. Optimization
Finite differences:
– Change the policy a little, evaluate:
  ∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
31. Optimization
Finite differences:
– Change the policy a little, evaluate:
  ∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
Q: any problems with those two?
32. Optimization
Finite differences:
– Change the policy a little, evaluate:
  ∇J ≈ (J_{θ+ε} − J_θ) / ε
– VERY noisy, especially if both J's are sampled
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
– “quantile convergence” problems with stochastic MDPs
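A minimal finite-difference sketch that makes the noise problem concrete (`evaluate_policy` is a hypothetical helper returning the mean sampled reward for parameters theta):

```python
import numpy as np

def fd_gradient(evaluate_policy, theta, eps=1e-2):
    # One coordinate at a time: nudge theta, re-evaluate, take the slope.
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[k] += eps
        # Both J's below are themselves noisy Monte Carlo estimates,
        # so their difference (and hence the gradient) is VERY noisy.
        grad[k] = (evaluate_policy(theta_plus) - evaluate_policy(theta)) / eps
    return grad
```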
39. REINFORCE (bandit)
∇J ≈ (1/N) Σ_{i=1..N} ∇ log πθ(a_i∣s_i) · R(s_i, a_i)
● Initialize NN weights: θ_0 ← random
● Loop:
  – Sample N sessions z under the current policy
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J
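A sketch of one such step in torch (`logits_net` is a hypothetical module mapping a batch of states to action logits; sessions are one step long, so R is just the immediate reward):

```python
import torch

def reinforce_bandit_step(logits_net, optimizer, states, actions, rewards):
    log_probs = torch.log_softmax(logits_net(states), dim=-1)    # log pi(.|s)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(log_pi_a * rewards).mean()   # minus: ascend J by descending -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```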
40. Discounted reward case
● Replace R with Q :)  Here Q is the true action value, a.k.a. E[G(s,a)]:
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) Q(s,a) da ds
● Log-derivative trick: πθ(a∣s)·∇ log πθ(a∣s) = ∇πθ(a∣s), so
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇ log πθ(a∣s) Q(s,a) da ds
— and that's an expectation :)
41. REINFORCE (discounted)
● Policy gradient:
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s) · Q(s,a)
● Approximate with sampling:
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s,a)
42. REINFORCE algorithm
We can estimate Q using G:
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …
Qπ(s_t, a_t) = E_{s'}[G(s_t, a_t)]
[figure: a tree of states (prev s, s, s', s'') and actions (prev a, a, a', a'') with rewards r, r', r'', r''' along sampled trajectories]
43. Recap: discounted rewards
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …
    = r_t + γ·(r_{t+1} + γ·r_{t+2} + …)
    = r_t + γ·G_{t+1}
We can use this recursion to compute all G's in linear time.
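A minimal sketch of that linear-time backward pass:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # One backward sweep: G_t = r_t + gamma * G_{t+1}
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # -> [1.5, 1.0, 2.0]
```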
44. REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J
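Putting the pieces together, a sketch of one full REINFORCE update (`policy_net` is a hypothetical torch module mapping a [T, state_dim] batch of states to action logits; each session here is a (states, actions, rewards) tuple of tensors):

```python
import torch

def reinforce_update(policy_net, optimizer, sessions, gamma=0.99):
    loss = 0.0
    for states, actions, rewards in sessions:
        # Returns-to-go G_t via the linear-time recursion from the recap
        returns, running = [], 0.0
        for r in reversed(rewards.tolist()):
            running = r + gamma * running
            returns.append(running)
        G = torch.tensor(returns[::-1], dtype=torch.float32)
        log_pi = torch.log_softmax(policy_net(states), dim=-1)
        log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss - (log_pi_a * G).sum()    # minus: gradient ascent on J
    optimizer.zero_grad()
    (loss / len(sessions)).backward()         # 1/N over sessions
    optimizer.step()
```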
45. REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J
Q: is it off- or on-policy?
46. REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J
A: actions are sampled under the current policy ⇒ on-policy
47. Value-based vs policy-based
Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only?
48. Value-based vs policy-based
Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only
We'll learn much more soon!
49. REINFORCE baselines
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient
Q: What is better for learning: a random action in a good state, or a great action in a bad state?
50. REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
51. REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
Note that b(s) does not depend on a.
Q: Can you simplify the second term?
52. REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
A: for any fixed s, b(s) factors out of the inner expectation over a, and
E_{a∼πθ(a∣s)} ∇ log πθ(a∣s) = ∫_a ∇πθ(a∣s) da = ∇1 = 0,
so the whole second term is zero.
53. REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a)
The gradient direction doesn't change!
54. REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Variance of the ∇J estimate, treating (s,a) as random variables:
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
55. REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Variance of the ∇J estimate, treating (s,a) as random variables:
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
If b(s) correlates with Q(s,a), the variance decreases.
56. REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Variance of the ∇J estimate, treating (s,a) as random variables:
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
Q: can you suggest any such b(s)?
57. REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Variance of the ∇J estimate, treating (s,a) as random variables:
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
A: naive baseline: b = moving average of Q over all (s,a); then Var[b(s)] = 0 and Cov[Q,b] > 0.
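A sketch of that naive moving-average baseline (names are hypothetical; the baseline is just a running scalar subtracted from sampled returns):

```python
import numpy as np

class MovingAverageBaseline:
    """Running mean of observed returns; subtract it before weighting grad log pi."""
    def __init__(self, alpha=0.01):
        self.alpha, self.value = alpha, 0.0
    def update(self, returns):
        self.value += self.alpha * (np.mean(returns) - self.value)

baseline = MovingAverageBaseline()
returns = np.array([10.0, 12.0, 8.0])       # sampled G(s,a) values
advantages = returns - baseline.value       # use instead of raw Q in the gradient
baseline.update(returns)
```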
60. Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
Q: how can we estimate A(s,a) from (s,a,r,s') and the V-function?
61. Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s). Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
62. Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s). Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
Also: n-step version
63. Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s). Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = r + γ·V(s') − V(s)
∇J_actor ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · A(s,a)
(treat A(s,a) as a constant here: don't backpropagate through the critic)
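A sketch of one advantage actor-critic step on a batch of transitions (`policy_net` and `value_net` are hypothetical torch modules; the advantage is detached so the actor treats it as a constant):

```python
import torch
import torch.nn.functional as F

def a2c_step(policy_net, value_net, optimizer,
             states, actions, rewards, next_states, gamma=0.99):
    v = value_net(states).squeeze(-1)
    v_next = value_net(next_states).squeeze(-1)
    # A(s,a) = r + gamma*V(s') - V(s), treated as a constant for the actor
    advantage = (rewards + gamma * v_next - v).detach()
    log_pi = torch.log_softmax(policy_net(states), dim=-1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(log_pi_a * advantage).mean()
    # Critic: regress V(s) toward the one-step TD target
    critic_loss = F.mse_loss(v, (rewards + gamma * v_next).detach())
    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```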
65. Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
66. Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
A: it doesn't :) Just plug in a different formula for πθ(a∣s), e.g. a normal distribution.
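For instance, a Gaussian policy head (a sketch; `mu_net` is a hypothetical torch module mapping states to action means, with a learned state-independent log-std):

```python
import torch

log_std = torch.zeros(1, requires_grad=True)   # learned log standard deviation

def gaussian_log_prob(mu_net, states, actions):
    # log pi(a|s): plugs into the same REINFORCE / actor-critic losses
    dist = torch.distributions.Normal(mu_net(states), log_std.exp())
    return dist.log_prob(actions).sum(dim=-1)

def sample_action(mu_net, state):
    with torch.no_grad():
        return torch.distributions.Normal(mu_net(state), log_std.exp()).sample()
```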
67. Asynchronous advantage actor-critic (A3C)
● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage
Read more: https://arxiv.org/abs/1602.01783
69. Duct tape zone
● V(s) errors are less important than in Q-learning
  – the actor still learns even if the critic is random, just slower
● Regularize with entropy
  – to prevent premature convergence
● Learn on parallel sessions
  – or a super-small experience replay
● Use log-softmax for numerical stability
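A sketch of the last two tricks together: an entropy bonus computed from log-softmax (`entropy_coef` is a hypothetical hyperparameter):

```python
import torch

def entropy_bonus(logits, entropy_coef=0.01):
    # log-softmax is numerically stable; entropy = -sum pi * log pi
    log_pi = torch.log_softmax(logits, dim=-1)
    entropy = -(log_pi.exp() * log_pi).sum(dim=-1).mean()
    return entropy_coef * entropy   # add to J (i.e., subtract from the loss)
```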
71. Outro and Q & A
● Remember the log-derivative trick
● Combining the best from both worlds is generally a good idea
● See this paper for the proof of the policy gradient for discounted rewards
● Time to write some code!