Lecture 10: Policy gradient
Radoslav Neychev
Machine Learning course
advanced track
MIPT
08.11.2019, Moscow, Russia
These slides are almost an exact copy of the Practical RL course week 6 slides.
Special thanks to the YSDA team for making them publicly available.
Original slides link: week06_policy_based
References
2
Small experiment
The next slide contains a question
Please respond as fast as you can!
3
Small experiment
left or right?
4
Small experiment
Right! Ready for next one?
5
Small experiment
What's Q(s,right) under gamma=0.99?
6
Small experiment
What's Q(s,right) under gamma=0.99?
7
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_a' Q(s_{t+1}, a')))² ]
True (A) (B)
Q(s0,a0) 1 1 2
Q(s0,a1) 2 2 1
Q(s1,a0) 3 3 3
Q(s1,a1) 100 50 100
Q: Which prediction is better (A/B)?
8
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_a' Q(s_{t+1}, a')))² ]
True (A) (B)
Q(s0,a0) 1 1 2
Q(s0,a1) 2 2 1
Q(s1,a0) 3 3 3
Q(s1,a1) 100 50 100
better
policy
less
MSE
9
Approximation error
DQN is trained to minimize
Simple 2-state world
L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_a' Q(s_{t+1}, a')))² ]
True (A) (B)
Q(s0,a0) 1 1 2
Q(s0,a1) 2 2 1
Q(s1,a0) 3 3 3
Q(s1,a1) 100 50 100
better
policy
less
MSE
Q-learning will prefer worse policy (B)!
10
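A quick numeric check of the table above (a minimal sketch using only the numbers from the slide):

```python
# Compare predictions (A) and (B): mean squared error vs the greedy policy they induce.
import numpy as np

q_true = np.array([[1., 2.], [3., 100.]])   # rows: s0, s1; columns: a0, a1
q_a    = np.array([[1., 2.], [3., 50.]])    # prediction (A)
q_b    = np.array([[2., 1.], [3., 100.]])   # prediction (B)

for name, q in [("A", q_a), ("B", q_b)]:
    mse = np.mean((q - q_true) ** 2)
    same_policy = np.array_equal(q.argmax(axis=1), q_true.argmax(axis=1))
    print(name, "MSE =", mse, "recovers optimal policy:", same_policy)

# (A) has MSE 625.0 but the correct argmax in every state;
# (B) has MSE 0.5 but picks a0 in s0, i.e. the worse policy.
```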
Conclusion
● Often computing q-values is harder than
picking optimal actions!
● We could avoid learning value functions by
directly learning agent's policy
Q: what algorithm works that way?
(of those we studied)
πθ(a∣s)
11
Conclusion
● Often computing q-values is harder than
picking optimal actions!
● We could avoid learning value functions by
directly learning agent's policy
Q: what algorithm works that way?
πθ(a∣s)
e.g. crossentropy method
12
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]
13
how humans survived
π(run∣s)=1
14
Policies
In general, two kinds
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
15
Policies
In general, two kinds
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)
Q: Any case where stochastic is better?
e.g. rock-paper
-scissors
16
Policies
In general, two kinds
● Deterministic policy: a = πθ(s) — same action each time (e.g. genetic algos (week 0), deterministic policy gradient)
● Stochastic policy: a ∼ πθ(a∣s) — sampling takes care of exploration (e.g. crossentropy method, policy gradient)
Q: how to represent policy in continuous action space?
17
Policies
In general, two kinds
● Deterministic policy: a = πθ(s) — same action each time (e.g. genetic algos (week 0), deterministic policy gradient)
● Stochastic policy: a ∼ πθ(a∣s) — sampling takes care of exploration (e.g. crossentropy method, policy gradient)
Categorical, normal, mixture of normals — whatever.
18
Two approaches
● Value based: learn a value function Qθ(s,a) or Vθ(s), then infer the policy: a = argmax_a Qθ(s,a)
● Policy based: explicitly learn the policy πθ(a∣s) or πθ(s)→a, implicitly maximizing reward over policies
19
Recap: crossentropy method
● Initialize NN weights
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
θ_0 ← random
θ_{i+1} = θ_i + α·∇ Σ_i log πθ_i(a_i∣s_i)·[s_i, a_i ∈ Elite]
20
Recap: crossentropy method
● Initialize NN weights
● Loop:
– Sample N sessions
– elite = take M best sessions and concatenate
TD version: elite (s,a) that have highest G(s,a)
(select elites independently from each state)
θ_0 ← random
θ_{i+1} = θ_i + α·∇ Σ_i log πθ_i(a_i∣s_i)·[s_i, a_i ∈ Elite]
21
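A minimal sketch of this crossentropy-method loop, assuming a gym-style discrete-action env and a hypothetical PyTorch module `policy(state) -> logits` (neither comes from the slides):

```python
import numpy as np
import torch
import torch.nn.functional as F

def cem_step(policy, optimizer, env, n_sessions=100, percentile=70, t_max=200):
    # 1) Sample N sessions under the current policy
    sessions = []
    for _ in range(n_sessions):
        s, states, actions, total_r = env.reset(), [], [], 0.0
        for _ in range(t_max):
            with torch.no_grad():
                probs = F.softmax(policy(torch.as_tensor(s, dtype=torch.float32)), dim=-1)
            a = np.random.choice(len(probs), p=probs.numpy())
            states.append(s); actions.append(a)
            s, r, done, _ = env.step(a)          # assumes the old 4-tuple gym API
            total_r += r
            if done:
                break
        sessions.append((states, actions, total_r))

    # 2) Keep the elite sessions and maximize log pi(elite action | elite state)
    threshold = np.percentile([r for _, _, r in sessions], percentile)
    elite_s = [s for ss, aa, r in sessions if r >= threshold for s in ss]
    elite_a = [a for ss, aa, r in sessions if r >= threshold for a in aa]

    logits = policy(torch.as_tensor(np.array(elite_s), dtype=torch.float32))
    loss = F.cross_entropy(logits, torch.as_tensor(elite_a))   # = -mean log pi(a|s) on elites
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```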
Policy gradient main idea
Why so complicated?
We'd rather maximize reward directly!
22
Objective
Expected reward:   J = E_{s∼p(s), a∼πθ(a∣s), ...} R(s, a, s', a', ...)
Expected discounted reward:   J = E_{s∼p(s), a∼πθ(a∣s)} G(s,a)
23
Objective
Expected reward:   J = E_{s∼p(s), a∼πθ(a∣s), ...} R(s, a, s', a', ...)   (the R(z) setting)
Expected discounted reward:   J = E_{s∼p(s), a∼πθ(a∣s)} G(s,a),   where G(s,a) = r + γ·G(s',a')
24
Objective
Consider a 1-step process for simplicity:
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s)·R(s,a) da ds
25
Objective
Consider a 1-step process for simplicity:
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s)·R(s,a) da ds
Here p(s) is the state visitation frequency (may depend on the policy) and R(s,a) is the reward for the 1-step session.
Q: how do we compute that?
26
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s)·R(s,a) da ds
Sample N sessions under the current policy:
J ≈ (1/N) Σ_{i=1}^N R(s_i, a_i)
27
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s)·R(s,a) da ds
J ≈ (1/N) Σ_{i=1}^N R(s_i, a_i)   (sample N sessions)
Can we optimize the policy now?
28
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s)·R(s,a) da ds
J ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} R(s,a)
The parameters θ “sit” inside πθ(a∣s), and we don't know how to compute dJ/dθ.
29
Optimization
Finite differences
– Change policy a little, evaluate
Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions
∇J ≈ (J_{θ+ε} − J_θ) / ε
30
Optimization
Finite differences
– Change policy a little, evaluate
Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions
∇J ≈ (J_{θ+ε} − J_θ) / ε
Q: any problems with those two?
31
Optimization
Finite differences
– Change policy a little, evaluate
Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions
∇J ≈ (J_{θ+ε} − J_θ) / ε
Finite differences: VERY noisy, especially if both J's are sampled.
Crossentropy method: “quantile convergence” problems with stochastic MDPs.
32
Objective
Wish list:
– Analytical gradient
– Easy/stable approximations
J = E_{s∼p(s), a∼πθ(a∣s)} R(s,a) = ∫_s p(s) ∫_a πθ(a∣s)·R(s,a) da ds
33
Log-derivative trick
Simple math (try the chain rule):
∇ log π(z) = ?
34
Log-derivative trick
Simple math:
∇ log π(z) = (1/π(z))·∇π(z)
⇒ π(z)·∇ log π(z) = ∇π(z)
35
Policy gradient
Analytical inference
Recall: π(z)·∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s)·R(s,a) da ds
36
Policy gradient
Analytical inference
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s)·R(s,a) da ds
Apply π(z)·∇ log π(z) = ∇π(z):
∇J = ∫_s p(s) ∫_a πθ(a∣s)·∇ log πθ(a∣s)·R(s,a) da ds
Q: anything curious about that formula?
37
Policy gradient
Analytical inference
Recall: π(z)·∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s)·R(s,a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s)·∇ log πθ(a∣s)·R(s,a) da ds
That's an expectation :)
38
REINFORCE (bandit)
∇J ≈ (1/N) Σ_{i=1}^N ∇ log πθ(a∣s)·R(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
– Sample N sessions z under the current policy πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θ_{i+1} ← θ_i + α·∇J
39
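A minimal sketch of this one-step ("bandit") update; `policy` is a hypothetical PyTorch module mapping states to action logits and `sessions` is a list of (state, action, reward) triples (both are assumptions, not from the slides):

```python
import torch
import torch.nn.functional as F

def reinforce_bandit_step(policy, optimizer, sessions):
    states  = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _ in sessions])
    actions = torch.as_tensor([a for _, a, _ in sessions])
    rewards = torch.as_tensor([r for _, _, r in sessions], dtype=torch.float32)

    log_probs = F.log_softmax(policy(states), dim=-1)                # log pi(a|s) for all actions
    log_pi_a  = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1) # log pi(chosen a|s)

    loss = -(log_pi_a * rewards).mean()   # minimizing this ascends J ≈ mean[log pi(a|s)·R(s,a)]
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```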
Discounted reward case
● Replace R with Q :)
Recall: π(z)·∇ log π(z) = ∇π(z)
∇J = ∫_s p(s) ∫_a ∇πθ(a∣s)·Q(s,a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s)·∇ log πθ(a∣s)·Q(s,a) da ds   ← that's an expectation :)
Here Q(s,a) is the true action value, a.k.a. E[ G(s,a) ].
40
REINFORCE (discounted)
● Policy gradient:
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a)
● Approximate with sampling:
∇J ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·Q(s,a)
41
REINFORCE algorithm
We can estimate Q using the return G:
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
Q^π(s_t, a_t) = E_{s'}[ G(s_t, a_t) ]
[figure: MDP trajectory — prev s, prev a → s → a → r → s' → a' → r' → s'' → a'' → ...]
42
Recap: discounted rewards
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
    = r_t + γ·(r_{t+1} + γ·r_{t+2} + ...)
    = r_t + γ·G_{t+1}
We can use this to compute all G's in linear time.
43
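The recursion gives a simple linear-time implementation (a pure-Python sketch over one session's reward list):

```python
def cumulative_returns(rewards, gamma=0.99):
    """Compute G_t for every step of a session using G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):   # walk the session backwards
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# cumulative_returns([1, 1, 1], gamma=0.5) -> [1.75, 1.5, 1.0]
```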
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
– Sample N sessions z under the current policy πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θ_{i+1} ← θ_i + α·∇J
44
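A sketch of one such update for full sessions, reusing the `cumulative_returns` helper sketched earlier; `policy` is again a hypothetical PyTorch module returning action logits, and each session is a (states, actions, rewards) tuple:

```python
import numpy as np
import torch
import torch.nn.functional as F

def reinforce_step(policy, optimizer, sessions, gamma=0.99):
    loss = 0.0
    for states, actions, rewards in sessions:
        returns = torch.as_tensor(cumulative_returns(rewards, gamma), dtype=torch.float32)
        states  = torch.as_tensor(np.array(states), dtype=torch.float32)
        actions = torch.as_tensor(actions)

        log_pi   = F.log_softmax(policy(states), dim=-1)
        log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss - (log_pi_a * returns).sum()   # minus sign: the optimizer minimizes

    loss = loss / len(sessions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```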
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
– Sample N sessions z under the current policy πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θ_{i+1} ← θ_i + α·∇J
Q: is it off- or on-policy?
45
REINFORCE algorithm
∇J ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
– Sample N sessions z under the current policy πθ(a∣s)
– Evaluate the policy gradient ∇J
– Ascend: θ_{i+1} ← θ_i + α·∇J
Actions are sampled under the current policy ⇒ on-policy.
Value-based vs policy-based
Value-based
● Q-learning, SARSA, MCTS
value-iteration
● Solves harder problem
● Artificial exploration
● Learns from partial experience
(temporal difference)
● Evaluates strategy for free :)
Policy-based
● REINFORCE, CEM
● Solves easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full session only?
Value-based vs policy-based
Value-based
● Q-learning, SARSA, MCTS
value-iteration
● Solves harder problem
● Artificial exploration
● Learns from partial experience
(temporal difference)
● Evaluates strategy for free :)
Policy-based
● REINFORCE, CEM
● Solves easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full session only
We'll learn much more soon!
48
REINFORCE baselines
∇J ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·Q(s,a)
● Initialize NN weights: θ_0 ← random
● Loop:
– Sample N sessions z under the current policy πθ(a∣s)
– Evaluate the policy gradient ∇J
What is better for learning:
random action in good state
or
great action in bad state?
49
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
50
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
Note that b(s) does not depend on a
Q: Can you simplify the second term?
51
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
The second term vanishes, since b(s) does not depend on a:
E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = E_{s∼p(s)} [ b(s)·E_{a∼πθ(a∣s)} ∇ log πθ(a∣s) ] = 0
52
REINFORCE baselines
We can subtract an arbitrary baseline b(s):
∇J = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·(Q(s,a) − b(s)) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a) − E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·b(s) = ...
... = E_{s∼p(s), a∼πθ(a∣s)} ∇ log πθ(a∣s)·Q(s,a)
Gradient direction doesn’t change!
53
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating the ∇J summand as a random variable over (s,a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
54
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating the ∇J summand as a random variable over (s,a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
If b(s) correlates with Q(s,a), the variance decreases.
55
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating the ∇J summand as a random variable over (s,a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
Q: can you suggest any such b(s)?
56
REINFORCE baselines
● Gradient direction stays the same
● Variance may change
Gradient variance (treating the ∇J summand as a random variable over (s,a)):
Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
Naive baseline: b = moving average of Q over all (s,a); then Var[b(s)] = 0 and Cov[Q, b] > 0.
57
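A small sketch of this moving-average baseline (the class name and momentum value are illustrative, not from the slides): keep a running average of observed returns and subtract it before the REINFORCE update.

```python
class RunningBaseline:
    """Exponential moving average of returns, used as a state-independent baseline."""
    def __init__(self, momentum=0.99):
        self.momentum, self.value, self.initialized = momentum, 0.0, False

    def update(self, returns):
        batch_mean = sum(returns) / len(returns)
        if not self.initialized:
            self.value, self.initialized = batch_mean, True
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * batch_mean
        return self.value

# usage per batch of sessions:
#   b = baseline.update(all_returns)
#   advantages = [g - b for g in all_returns]
```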
REINFORCE baselines
Better baseline: b(s) = V(s)
∇J ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·(Q(s,a) − V(s))
Q: but how do we predict V(s)?
58
Actor-critic
● Learn both V(s) and πθ(a∣s)
● Hope for the best of both worlds :)
59
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s).
Use Vθ(s) to learn πθ(a∣s) faster!
Q: how can we estimate A(s,a) from (s,a,r,s') and the V-function?
60
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s). Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
61
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s). Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
Also: n-step version.
62
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s). Use Vθ(s) to learn πθ(a∣s) faster!
A(s,a) = r + γ·V(s') − V(s)   (treated as a constant w.r.t. θ in the actor update)
∇J_actor ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·A(s,a)
63
Advantage actor-critic
Improve policy: ∇J_actor ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} ∇ log πθ(a∣s)·A(s,a)
Improve value: L_critic ≈ (1/N) Σ_{i=1}^N Σ_{s,a∈z_i} (Vθ(s) − [r + γ·V(s')])²
[figure: a single model (W = params) maps state s to both πθ(a∣s) and Vθ(s)]
64
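A minimal sketch of these two losses, assuming a hypothetical two-headed network `model(states) -> (logits, state_values)` and batched tensors for the transitions (all names here are assumptions):

```python
import torch
import torch.nn.functional as F

def a2c_losses(model, states, actions, rewards, next_states, done, gamma=0.99):
    logits, values = model(states)                  # pi-head and V-head share one body
    with torch.no_grad():
        _, next_values = model(next_states)
        targets = rewards + gamma * next_values * (1 - done)   # r + gamma*V(s'), 0 at terminal

    advantages = (targets - values).detach()        # A(s,a), treated as a constant ("consider const")
    log_pi_a = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)

    actor_loss  = -(log_pi_a * advantages).mean()   # minimizing this ascends grad J_actor
    critic_loss = F.mse_loss(values, targets)       # (V(s) - [r + gamma*V(s')])^2
    return actor_loss + critic_loss
```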
Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
65
Continuous action spaces
What if there are continuously many actions?
● Robot control: control motor voltage
● Trading: assign money to equity
How does the algorithm change?
It doesn't :)
Just plug in a different distribution for πθ(a∣s), e.g. a normal distribution.
66
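For instance, a Gaussian policy head (a sketch; `mu_net` and `log_std` are hypothetical parameters, the rest of the policy-gradient code stays exactly the same):

```python
import torch
from torch.distributions import Normal

def sample_and_logprob(mu_net, log_std, state):
    mu = mu_net(state)                        # network predicts the mean action
    dist = Normal(mu, log_std.exp())          # learnable, state-independent std in this sketch
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(-1)  # sum over action dimensions
    return action, log_prob                   # plug log_prob into the same policy-gradient loss
```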
Asynchronous advantage actor-critic
● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage
Read more: https://arxiv.org/abs/1602.01783
67
IMPALA
Read more: https://arxiv.org/abs/1802.01561
● Massively parallel
● Separate actor / learner processes
● Small experience replay
w/ importance sampling
68
Duct tape zone
● V(s) errors less important than in Q-learning
– actor still learns even if critic is random, just slower
● Regularize with entropy
– to prevent premature convergence
● Learn on parallel sessions
– Or super-small experience replay
● Use logsoftmax for numerical stability
69
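A sketch of the entropy-regularization and log-softmax points from this list (the coefficient value and function name `policy_loss_with_entropy` are illustrative):

```python
import torch
import torch.nn.functional as F

def policy_loss_with_entropy(logits, actions, advantages, entropy_coef=0.01):
    log_probs = F.log_softmax(logits, dim=-1)          # more stable than log(softmax(logits))
    probs = log_probs.exp()
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    entropy = -(probs * log_probs).sum(dim=-1).mean()  # bonus against premature convergence
    return -(log_pi_a * advantages).mean() - entropy_coef * entropy
```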
Asynchronous advantage actor-critic
● Remember log-derivative trick
● Combining best from both worlds is generally a good idea
● See this paper for the proof of the policy gradient for
discounted rewards
● Time to write some code!
Outro and Q & A
70