Lecture 10: Policy gradient
Radoslav Neychev
Machine Learning course
advanced track
MIPT
08.11.2019, Moscow, Russia
These slides are almost an exact copy of the Practical RL course week 6 slides.
Special thanks to the YSDA team for making them publicly available.
Original slides link: week06_policy_based
References
2
Small experiment
The next slide contains a question
Please respond as fast as you can!
3
Small experiment
left or right?
4
Small experiment
Right! Ready for the next one?
5
Small experiment
What's Q(s,right) under gamma=0.99?
6
Small experiment
What's Q(s,right) under gamma=0.99?
7
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(st, at) − (rt + γ·max_{a'} Q(st+1, a')))² ]

            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100
Q: Which prediction is better (A/B)?
8
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(st, at) − (rt + γ·max_{a'} Q(st+1, a')))² ]

            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100

(A) gives the better policy, (B) gives the lower MSE.
9
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(st, at) − (rt + γ·max_{a'} Q(st+1, a')))² ]

            True   (A)   (B)
Q(s0,a0)      1     1     2
Q(s0,a1)      2     2     1
Q(s1,a0)      3     3     3
Q(s1,a1)    100    50   100

(A) gives the better policy, (B) gives the lower MSE.
Q-learning will prefer worse policy (B)!
10
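To make the point concrete, here is the arithmetic behind the table as a small sketch (not from the slides): (A) takes a huge squared error on Q(s1,a1) yet keeps every greedy action correct, while (B) gets a tiny MSE and flips the greedy action in s0.

```python
import numpy as np

q_true = np.array([[1, 2], [3, 100]])   # rows: s0, s1; columns: a0, a1
q_pred = {"A": np.array([[1, 2], [3, 50]]),
          "B": np.array([[2, 1], [3, 100]])}

for name, q in q_pred.items():
    mse = ((q - q_true) ** 2).mean()
    greedy = q.argmax(axis=1)            # greedy action in s0 and s1
    print(name, "MSE =", mse, "greedy actions =", greedy)

# A: MSE = 625.0, greedy actions = [1 1]  -> same policy as the true Q
# B: MSE = 0.5,   greedy actions = [0 1]  -> wrong action in s0
```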
Conclusion
● Often computing q-values is harder than
picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)
Q: what algorithm works that way?
(of those we studied)
11
Conclusion
● Often computing q-values is harder than
picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)
Q: what algorithm works that way?
e.g. crossentropy method
12
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]
13
how humans survived
π(run∣s)=1
14
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
15
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
Q: Any case where stochastic is better?
e.g. rock-paper-scissors
16
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
– same action each time
– e.g. genetic algos (week 0), deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
– sampling takes care of exploration
– e.g. crossentropy method, policy gradient
Q: how to represent policy in continuous action space?
17
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
– same action each time
– e.g. genetic algos (week 0), deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
– sampling takes care of exploration
– e.g. crossentropy method, policy gradient
categorical, normal, mixture of normals, whatever
18
Two approaches
● Value based:
– Learn value function Qθ(s,a) or Vθ(s)
– Infer policy: a = argmax_a Qθ(s,a)
● Policy based:
– Explicitly learn policy πθ(a∣s) or πθ(s)→a
– Implicitly maximize reward over the policy
19
Recap: crossentropy method
● Initialize NN weights
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
θ0 ← random
θi+1 = θi + α·∇ Σ_i log πθi(ai∣si) · [si, ai ∈ Elite]
20
Recap: crossentropy method
● Initialize NN weights
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
TD version: elite (s,a) that have highest G(s,a)
(select elites independently from each state)
θ0 ← random
θi+1 = θi + α·∇ Σ_i log πθi(ai∣si) · [si, ai ∈ Elite]
21
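A minimal sketch of one crossentropy-method iteration for a discrete-action policy network, under the assumption that helpers named `policy`, `optimizer`, and `sample_session` exist (these names are hypothetical, not from the slides). Maximizing Σ log πθ(ai∣si) over elite pairs is just a cross-entropy classification loss on the elite actions.

```python
import numpy as np
import torch
import torch.nn.functional as F

def cem_step(policy, optimizer, sample_session, n_sessions=100, percentile=70):
    # each session: (states [T, obs_dim], actions [T], total_reward)
    sessions = [sample_session(policy) for _ in range(n_sessions)]
    rewards = np.array([r for _, _, r in sessions])
    threshold = np.percentile(rewards, percentile)

    # elite = concatenate (s, a) pairs from the best sessions
    elite_states = np.concatenate([s for s, a, r in sessions if r >= threshold])
    elite_actions = np.concatenate([a for s, a, r in sessions if r >= threshold])

    # maximize sum of log pi(a|s) over elites  <=>  minimize cross-entropy
    logits = policy(torch.as_tensor(elite_states, dtype=torch.float32))
    loss = F.cross_entropy(logits, torch.as_tensor(elite_actions, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```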
Policy gradient main idea
Why so complicated?
We'd rather maximize reward directly!
22
Objective
Expected reward:
J = E_{s∼p(s), a∼πθ(a∣s), ...} [ R(s, a, s', a', ...) ]
Expected discounted reward:
J = E_{s∼p(s), a∼πθ(a∣s)} [ G(s, a) ]
23
Objective
Expected reward (the “R(z)” setting):
J = E_{s∼p(s), a∼πθ(a∣s), ...} [ R(s, a, s', a', ...) ]
Expected discounted reward:
J = E_{s∼p(s), a∼πθ(a∣s)} [ G(s, a) ],   where G(s,a) = r + γ·G(s',a')
24
Objective
Consider a 1-step process for simplicity:
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
25
Objective
Consider a 1-step process for simplicity:
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
Here p(s) is the state visitation frequency (it may depend on the policy), and R(s,a) is the reward for the 1-step session.
Q: how do we compute that?
26
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
Sample N sessions under the current policy:
J ≈ (1/N) Σ_{i=1..N} R(si, ai)
27
Objective
Can we optimize the policy now?
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
Sample N sessions:
J ≈ (1/N) Σ_{i=1..N} R(si, ai)
28
Objective
The parameters θ “sit” inside πθ(a∣s); we don't know how to compute dJ/dθ directly.
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} R(s, a)
29
Optimization
Finite differences:
– Change the policy a little, evaluate:
∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
30
Optimization
Finite differences:
– Change the policy a little, evaluate:
∇J ≈ (J_{θ+ε} − J_θ) / ε
Stochastic optimization:
– Good old crossentropy method
– Maximize probability of “elite” actions
Q: any problems with those two?
31
Optimization
Finite differences:
– Change the policy a little, evaluate:
∇J ≈ (J_{θ+ε} − J_θ) / ε
– VERY noisy, especially if both J's are estimated from samples
Stochastic optimization:
– Good old crossentropy method: maximize probability of “elite” actions
– “quantile convergence” problems with stochastic MDPs
32
Objective
Wish list:
– Analytical gradient
– Easy/stable approximations
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
33
Log-derivative trick
Simple math (try the chain rule):
∇ log π(z) = ???
34
Log-derivative trick
Simple math:
∇ log π(z) = (1/π(z)) · ∇π(z)
π(z) · ∇ log π(z) = ∇π(z)
35
Policy gradient
Analytical inference
π(z) · ∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
36
Policy gradient
Analytical inference
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
Using π(z) · ∇ log π(z) = ∇π(z):
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇log πθ(a∣s) R(s, a) da ds
Q: anything curious about that formula?
37
Policy gradient
Analytical inference
π(z) · ∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇log πθ(a∣s) R(s, a) da ds
That's an expectation :)
38
REINFORCE (bandit)
∇J ≈ (1/N) Σ_{i=1..N} ∇log πθ(ai∣si) · R(si, ai)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
39
Discounted reward case
● Replace R with Q :)
π(z) · ∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) Q(s, a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇log πθ(a∣s) Q(s, a) da ds
That's an expectation :)
Here Q(s,a) is the true action value, a.k.a. E[ G(s,a) ].
40
REINFORCE (discounted)
● Policy gradient:
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) · Q(s, a) ]
● Approximate with sampling:
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
41
REINFORCE algorithm
We can estimate Q using the return G:
Gt = rt + γ·rt+1 + γ²·rt+2 + ...
Qπ(st, at) = E_{s'}[ G(st, at) ]
[figure: a trajectory tree rooted at (prev s, prev a), with states s, s', s'', actions a, a', a'' and rewards r, r', r'', r''' along sampled branches]
42
Recap: discounted rewards
Gt = rt + γ·rt+1 + γ²·rt+2 + ...
   = rt + γ·(rt+1 + γ·rt+2 + ...)
   = rt + γ·Gt+1
We can use this to compute all G's in linear time (see the sketch below).
43
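A minimal helper (assumed, not part of the original slides) that applies this recursion backwards over a finished session:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for all t in one backward pass."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# e.g. discounted_returns([0.0, 0.0, 1.0], gamma=0.99) -> [0.9801, 0.99, 1.0]
```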
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
44
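A sketch of this update in PyTorch, assuming a `policy` network that maps a batch of states to action logits and a batch of (states, actions, returns) collected from sampled sessions; autograd takes the place of the explicit ∇ log πθ, and the average is taken over all sampled (s, a) pairs.

```python
import torch
import torch.nn.functional as F

def reinforce_step(policy, optimizer, states, actions, returns):
    """One REINFORCE update: ascend mean of log pi(a|s) * G(s,a)."""
    states = torch.as_tensor(states, dtype=torch.float32)    # [T, obs_dim]
    actions = torch.as_tensor(actions, dtype=torch.long)     # [T]
    returns = torch.as_tensor(returns, dtype=torch.float32)  # [T], discounted G_t

    log_probs = F.log_softmax(policy(states), dim=-1)         # [T, n_actions]
    log_pi_a = log_probs[torch.arange(len(actions)), actions]  # log pi(a_t | s_t)

    loss = -(log_pi_a * returns).mean()   # minus sign: the optimizer minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```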
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
Q: is it off- or on-policy?
45
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θi+1 ← θi + α·∇J
Actions are sampled under the current policy ⇒ on-policy
Value-based vs. policy-based
Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only?
Value-based vs. policy-based
Value-based:
● Q-learning, SARSA, MCTS, value iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)
Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only
We'll learn much more soon!
48
REINFORCE baselines
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · Q(s, a)
● Initialize NN weights: θ0 ← random
● Loop:
– Sample N sessions z under the current πθ(a∣s)
– Evaluate the policy gradient
What is better for learning: a random action in a good state, or a great action in a bad state?
49
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
50
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
Note that b(s) does not depend on a
Q: Can you simplify the second term?
51
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = E_{s∼p(s)} [ b(s) · E_{a∼πθ(a∣s)} ∇log πθ(a∣s) ] = 0
(since E_{a∼πθ(a∣s)} ∇log πθ(a∣s) = ∇ Σ_a πθ(a∣s) = ∇1 = 0)
52
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) (Q(s,a) − b(s)) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) b(s) ] = ...
... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇log πθ(a∣s) Q(s,a) ]
Gradient direction doesn’t change!
53
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
54
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
If b(s) correlates with Q(s,a), the variance decreases.
55
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
Q: can you suggest any such b(s)?
56
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating ∇J as a random variable over (s, a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
Naive baseline: b = a moving average of Q over all (s, a); then Var[b(s)] = 0 and Cov[Q, b] > 0.
57
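A minimal sketch of such a naive baseline (the class name and update rule are assumptions, not from the slides): keep an exponential moving average of observed returns and subtract it from G before the gradient step.

```python
class MovingAverageBaseline:
    """Running average of observed returns, used as a baseline b."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.value = 0.0

    def update(self, returns):
        for g in returns:
            self.value += self.alpha * (g - self.value)  # exponential moving average
        return self.value

# usage inside REINFORCE: advantages = [g - baseline.value for g in returns]
#                         baseline.update(returns)
```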
REINFORCE baselines
Better baseline: b(s) = V(s)
∇J ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · (Q(s,a) − V(s))
Q: but how do we predict V(s)?
58
Actor-critic
● Learn both V(s) and πθ(a∣s)
● Hope for the best of both worlds :)
59
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
Q: how can we estimate A(s,a) from (s,a,r,s') and the V-function?
60
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
61
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
Also: an n-step version exists
62
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = r + γ·V(s') − V(s)   (consider it a constant: no gradient flows through A(s,a) in the actor update)
∇Jactor ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · A(s, a)
63
Advantage actor-critic
Improve policy:
∇Jactor ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} ∇log πθ(a∣s) · A(s, a)
Improve value:
Lcritic ≈ (1/N) Σ_{i=1..N} Σ_{s,a∈zi} (Vθ(s) − [r + γ·V(s')])²
[figure: one model with parameters W takes state s and outputs both πθ(a∣s) and Vθ(s)]
64
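A sketch of both updates with a shared network, assuming `net(states)` returns a pair (action logits [T, n_actions], state values [T]) and the batch tensors come from sampled sessions; this is a minimal single-process version, not the exact course implementation.

```python
import torch
import torch.nn.functional as F

def a2c_step(net, optimizer, states, actions, rewards, next_states, done, gamma=0.99):
    """One advantage-actor-critic update on a batch of transitions."""
    states = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.long)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    logits, values = net(states)            # pi(.|s) logits and V(s)
    _, next_values = net(next_states)       # V(s')

    target = rewards + gamma * next_values.detach() * (1.0 - done)
    advantage = target - values             # A(s,a) = r + gamma*V(s') - V(s)

    log_probs = F.log_softmax(logits, dim=-1)
    log_pi_a = log_probs[torch.arange(len(actions)), actions]

    actor_loss = -(log_pi_a * advantage.detach()).mean()   # treat A(s,a) as a constant
    critic_loss = advantage.pow(2).mean()                   # (V(s) - [r + gamma*V(s')])^2
    loss = actor_loss + critic_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```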
Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
65
Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
It doesn't :)
Just plug in a different formula for π(a∣s), e.g. a normal distribution (see the sketch below).
66
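For instance, a Gaussian policy head in PyTorch; a minimal sketch with assumed layer sizes, where only the distribution and its log-probability change while the REINFORCE/actor-critic machinery stays the same.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi(a|s) = Normal(mu(s), sigma) for continuous actions."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_sigma = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_sigma.exp())

# usage: dist = policy(states); a = dist.sample()
#        loss = -(dist.log_prob(a).sum(-1) * advantages).mean()
```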
Asynchronous advantage actor-critic
● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage
Read more: https://arxiv.org/abs/1602.01783
67
IMPALA
Read more: https://arxiv.org/abs/1802.01561
● Massively parallel
● Separate actor / learner processes
● Small experience replay
w/ importance sampling
68
Duct tape zone
● V(s) errors less important than in Q-learning
– actor still learns even if critic is random, just slower
● Regularize with entropy
– to prevent premature convergence
● Learn on parallel sessions
– Or super-small experience replay
● Use logsoftmax for numerical stability (see the sketch below)
69
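A sketch of the last two tricks (the function name and the entropy weight `beta` are assumptions, not from the slides): compute log-probabilities with `log_softmax` instead of `log(softmax(...))` and subtract an entropy bonus from the actor loss.

```python
import torch
import torch.nn.functional as F

def actor_loss_with_entropy(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss with an entropy bonus; advantages are assumed
    to be precomputed and detached from the computation graph."""
    log_probs = F.log_softmax(logits, dim=-1)      # numerically stable log pi(.|s)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)     # H[pi(.|s)] for each state

    log_pi_a = log_probs[torch.arange(len(actions)), actions]
    pg_loss = -(log_pi_a * advantages).mean()
    return pg_loss - beta * entropy.mean()         # entropy bonus fights premature convergence
```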
Outro and Q & A
● Remember the log-derivative trick
● Combining the best from both worlds is generally a good idea
● See this paper for the proof of the policy gradient for discounted rewards
● Time to write some code!
70
