Lecture 10: Policy gradient
Radoslav Neychev
Machine Learning course
advanced track
MIPT
08.11.2019, Moscow, Russia
These slides are an almost exact copy of the Practical RL course week 6 slides.
Special thanks to the YSDA team for making them publicly available.
Original slides link: week06_policy_based
References
2
Small experiment
The next slide contains a question
Please respond as fast as you can!
3
Small experiment
left or right?
4
Small experiment
Right! Ready for next one?
5
Small experiment
What's Q(s,right) under gamma=0.99?
6
Small experiment
What's Q(s,right) under gamma=0.99?
7
Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world:

          True  (A)  (B)
Q(s0,a0)    1    1    2
Q(s0,a1)    2    2    1
Q(s1,a0)    3    3    3
Q(s1,a1)  100   50  100

Q: Which prediction is better (A/B)?
8
Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world:

          True  (A)  (B)
Q(s0,a0)    1    1    2
Q(s0,a1)    2    2    1
Q(s1,a0)    3    3    3
Q(s1,a1)  100   50  100

(A) gives the better policy, (B) gives the lower MSE.
9
Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world:

          True  (A)  (B)
Q(s0,a0)    1    1    2
Q(s0,a1)    2    2    1
Q(s1,a0)    3    3    3
Q(s1,a1)  100   50  100

(A) gives the better policy, (B) gives the lower MSE.
Q-learning will prefer the worse policy (B)!
10
Conclusion
● Often computing Q-values is harder than picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)

Q: what algorithm works that way?
(of those we studied)
11
Conclusion
● Often computing Q-values is harder than picking optimal actions!
● We could avoid learning value functions by directly learning the agent's policy πθ(a∣s)

Q: what algorithm works that way?
A: e.g. the crossentropy method
12
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]
13
how humans survived
π(run∣s)=1
14
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)

Q: Any case where stochastic is better?
15
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
● Stochastic policy: a ∼ πθ(a∣s)

Q: Any case where stochastic is better?
A: e.g. rock-paper-scissors
16
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
  – same action each time
  – e.g. genetic algos (week 0), deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
  – sampling takes care of exploration
  – e.g. crossentropy method, policy gradient

Q: how to represent a policy in a continuous action space?
17
Policies
In general, there are two kinds:
● Deterministic policy: a = πθ(s)
  – same action each time
  – e.g. genetic algos (week 0), deterministic policy gradient
● Stochastic policy: a ∼ πθ(a∣s)
  – sampling takes care of exploration
  – e.g. crossentropy method, policy gradient

Q: how to represent a policy in a continuous action space?
A: categorical, normal, mixture of normals, whatever
18
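A minimal sketch of the two parameterizations named above, assuming PyTorch and purely illustrative network/observation sizes (not from the slides): a categorical head for discrete actions and a normal (Gaussian) head for continuous ones.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

obs_dim, n_actions, action_dim = 4, 2, 1   # illustrative sizes

# Discrete actions: the network outputs logits, pi_theta(a|s) is categorical
policy_logits = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

# Continuous actions: the network outputs a mean, log-std is a free parameter
policy_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
log_std = nn.Parameter(torch.zeros(action_dim))

s = torch.randn(1, obs_dim)                 # a fake observation

pi_d = Categorical(logits=policy_logits(s))
a_d = pi_d.sample()                         # a ~ pi_theta(a|s), discrete
print(a_d, pi_d.log_prob(a_d))              # log pi_theta(a|s) is what policy gradient needs

pi_c = Normal(policy_mean(s), log_std.exp())
a_c = pi_c.sample()                         # a ~ pi_theta(a|s), continuous
print(a_c, pi_c.log_prob(a_c).sum(dim=-1))
```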
Two approaches
● Value based:
  Learn a value function Qθ(s,a) or Vθ(s)
  Infer the policy: a = argmax_a Qθ(s,a)
● Policy based:
  Explicitly learn the policy πθ(a∣s) or πθ(s)→a
  Implicitly maximize reward over the policy
19
Recap: crossentropy method
● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate
  – θ_{i+1} = θ_i + α·∇ Σ_i log πθ_i(a_i∣s_i)·[s_i, a_i ∈ Elite]
20
Recap: crossentropy method
● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate
  – θ_{i+1} = θ_i + α·∇ Σ_i log πθ_i(a_i∣s_i)·[s_i, a_i ∈ Elite]

TD version: elite = (s,a) pairs that have the highest G(s,a)
(select elites independently for each state)
21
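A minimal sketch of that crossentropy update, assuming a PyTorch policy net like the one sketched earlier; the session format (a list of `(states, actions, total_reward)` tuples) and the percentile are illustrative assumptions, not from the slides.

```python
import torch

def cem_step(policy, optimizer, sessions, percentile=70):
    """One crossentropy-method update: fit the policy to actions from elite sessions."""
    rewards = torch.tensor([r for (_, _, r) in sessions], dtype=torch.float32)
    threshold = torch.quantile(rewards, percentile / 100.0)

    elite_states, elite_actions = [], []
    for states, actions, total_reward in sessions:
        if total_reward >= threshold:               # [s_i, a_i in Elite]
            elite_states.extend(states)
            elite_actions.extend(actions)

    s = torch.as_tensor(elite_states, dtype=torch.float32)
    a = torch.as_tensor(elite_actions)

    # maximize sum of log pi_theta(a_i | s_i) over elite pairs
    log_probs = torch.log_softmax(policy(s), dim=-1)
    loss = -log_probs[torch.arange(len(a)), a].sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```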
Policy gradient main idea
Why so complicated?
We'd rather maximize reward directly!
22
Objective
Expected reward:
J = E_{s∼p(s), a∼πθ(a∣s), ...} [ R(s, a, s', a', ...) ]

Expected discounted reward:
J = E_{s∼p(s), a∼πθ(a∣s)} [ G(s, a) ]
23
Objective
Expected reward (the R(z) setting):
J = E_{s∼p(s), a∼πθ(a∣s), ...} [ R(s, a, s', a', ...) ]

Expected discounted reward:
J = E_{s∼p(s), a∼πθ(a∣s)} [ G(s, a) ],   where G(s,a) = r + γ·G(s',a')
24
Objective
Consider a 1-step process for simplicity:

J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
25
Objective
Consider a 1-step process for simplicity:

J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds

p(s) is the state visitation frequency (it may depend on the policy);
R(s,a) is the reward for a 1-step session.

Q: how do we compute that?
26
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds

Sample N sessions under the current policy:

J ≈ (1/N) · Σ_{i=0}^{N} R(s, a)
27
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds

Sample N sessions:

J ≈ (1/N) · Σ_{i=0}^{N} R(s, a)

Can we optimize the policy now?
28
Objective
J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds

J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} R(s, a)

The parameters θ “sit” inside πθ(a∣s), but we don't know how to compute dJ/dθ.
29
Optimization
Finite differences
– Change the policy a little, evaluate:
  ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions
30
Optimization
Finite differences
– Change the policy a little, evaluate:
  ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions

Q: any problems with those two?
31
Optimization
Finite differences
– Change the policy a little, evaluate:
  ∇J ≈ (J_{θ+ε} − J_θ) / ε
– VERY noisy, especially if both J's are sampled

Stochastic optimization
– Good old crossentropy method
– Maximize probability of “elite” actions
– “quantile convergence” problems with stochastic MDPs
32
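A tiny numerical sketch of why finite differences are noisy when J itself is a Monte Carlo estimate; `estimate_J` is a hypothetical stand-in for "run a few sessions with parameters θ and average the return".

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_J(theta, n_sessions=10):
    # stand-in for "sample n sessions and average the reward":
    # the true objective is -theta**2, observed with sampling noise
    return -theta**2 + rng.normal(0.0, 1.0, size=n_sessions).mean()

theta, eps = 1.0, 0.1
true_grad = -2 * theta
fd_grads = [(estimate_J(theta + eps) - estimate_J(theta)) / eps for _ in range(5)]
print("true gradient:", true_grad)
print("finite-difference estimates:", np.round(fd_grads, 2))  # scattered widely around -2
```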
Objective
Wish list:
– Analytical gradient
– Easy/stable approximations

J = E_{s∼p(s), a∼πθ(a∣s)} [ R(s, a) ] = ∫_s p(s) ∫_a πθ(a∣s) R(s,a) da ds
33
Log-derivative trick
Simple math (try the chain rule):

∇ log π(z) = ???
34
Log-derivative trick
Simple math:

∇ log π(z) = (1 / π(z)) · ∇π(z)

π(z) · ∇ log π(z) = ∇π(z)
35
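A quick numerical sanity check of the trick above (not from the slides): for a Bernoulli(θ) "policy", the score-function estimate E_z[ f(z)·∇_θ log π_θ(z) ] should match the analytic ∇_θ E_z[ f(z) ].

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3                      # pi(z=1) = theta, pi(z=0) = 1 - theta
f = np.array([1.0, 5.0])         # "reward" for z = 0 and z = 1

# analytic gradient of E[f(z)] = (1-theta)*f0 + theta*f1  w.r.t. theta
analytic = f[1] - f[0]

# log-derivative (score-function) estimate: average f(z) * d/dtheta log pi(z)
z = (rng.random(200_000) < theta).astype(int)
score = np.where(z == 1, 1.0 / theta, -1.0 / (1.0 - theta))
estimate = (f[z] * score).mean()

print(analytic, round(estimate, 3))   # both close to 4.0
```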
Policy gradient
Analytical inference:

π·∇ log π(z) = ∇π(z)

∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
36
Policy gradient
Analytical inference:

π·∇ log π(z) = ∇π(z)

∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇ log πθ(a∣s) R(s,a) da ds

Q: anything curious about that formula?
37
Policy gradient
Analytical inference:

π·∇ log π(z) = ∇π(z)

∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) R(s, a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇ log πθ(a∣s) R(s,a) da ds   ← that's an expectation :)
38
REINFORCE (bandit)
∇J ≈ (1/N) · Σ_{i=0}^{N} ∇ log πθ(a∣s) · R(s, a)

● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current policy
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J
39
Discounted reward case
● Replace R with Q :)

π·∇ log π(z) = ∇π(z)

∇J = ∫_s p(s) ∫_a ∇πθ(a∣s) Q(s,a) da ds
∇J = ∫_s p(s) ∫_a πθ(a∣s) ∇ log πθ(a∣s) Q(s,a) da ds   ← that's an expectation :)

Q(s,a) is the true action value, a.k.a. E[ G(s,a) ]
40
REINFORCE (discounted)
● Policy gradient:
  ∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · Q(s, a) ]

● Approximate with sampling:
  ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s, a)
41
REINFORCE algorithm
We can estimate Q using G:

G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
Q_π(s_t, a_t) = E_{s'} [ G(s_t, a_t) ]

[diagram: a tree of states (s, s', s'') and actions (a, a', a'') with rewards r, r', r'', r''' along the branches]
42
Recap: discounted rewards

G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
    = r_t + γ·(r_{t+1} + γ·r_{t+2} + ...)
    = r_t + γ·G_{t+1}

We can use this recursion to compute all G's in linear time.
43
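A minimal sketch of that linear-time recursion (the reward list and γ below are purely illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step, in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```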
REINFORCE algorithm
∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s, a)

● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J
44
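A minimal REINFORCE sketch following the loop above. It assumes PyTorch, the Gymnasium CartPole environment, and the categorical policy head sketched earlier; the network size, learning rate, number of sessions and γ are illustrative choices, not values from the slides.

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_session(max_steps=500):
    """Play one episode with the current stochastic policy."""
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    for _ in range(max_steps):
        with torch.no_grad():
            logits = policy(torch.as_tensor(s, dtype=torch.float32))
        a = Categorical(logits=logits).sample().item()
        s_next, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if terminated or truncated:
            break
    return states, actions, rewards

n_sessions, gamma = 8, 0.99
for iteration in range(100):
    loss = 0.0
    for _ in range(n_sessions):                       # sample N sessions under current pi_theta
        states, actions, rewards = run_session()
        G, running = [0.0] * len(rewards), 0.0        # G_t = r_t + gamma * G_{t+1}
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            G[t] = running
        logits = policy(torch.as_tensor(np.array(states), dtype=torch.float32))
        logp = Categorical(logits=logits).log_prob(torch.as_tensor(actions))
        loss = loss - (logp * torch.tensor(G)).sum() / n_sessions   # minimize -J
    opt.zero_grad(); loss.backward(); opt.step()      # theta <- theta + alpha * grad J
```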
REINFORCE algorithm
∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s, a)

● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J

Q: is it off- or on-policy?
45
REINFORCE algorithm
∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s, a)

● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient ∇J
  – Ascend: θ_{i+1} ← θ_i + α·∇J

Actions are taken under the current policy ⇒ on-policy.
Value-based vs policy-based

Value-based:
● Q-learning, SARSA, MCTS value-iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)

Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only?
Value-based vs policy-based

Value-based:
● Q-learning, SARSA, MCTS value-iteration
● Solves a harder problem
● Artificial exploration
● Learns from partial experience (temporal difference)
● Evaluates the strategy for free :)

Policy-based:
● REINFORCE, CEM
● Solves an easier problem
● Innate exploration
● Innate stochasticity
● Supports continuous action spaces
● Learns from full sessions only

We'll learn much more soon!
48
REINFORCE baselines
∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · Q(s, a)

● Initialize NN weights θ_0 ← random
● Loop:
  – Sample N sessions z under the current πθ(a∣s)
  – Evaluate the policy gradient ∇J

What is better for learning:
a random action in a good state
or
a great action in a bad state?
49
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · (Q(s,a) − b(s)) ] = ...

... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · b(s) ] = ...
50
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · (Q(s,a) − b(s)) ] = ...

... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · b(s) ] = ...

Note that b(s) does not depend on a.
Q: Can you simplify the second term?
51
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · (Q(s,a) − b(s)) ] = ...

... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · b(s) ] = ...

E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · b(s) ] = E_{s∼p(s)} [ b(s) · E_{a∼πθ(a∣s)} ∇ log πθ(a∣s) ] = 0
52
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

∇J = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · (Q(s,a) − b(s)) ] = ...

... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · Q(s,a) ] − E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · b(s) ] = ...

... = E_{s∼p(s), a∼πθ(a∣s)} [ ∇ log πθ(a∣s) · Q(s,a) ]

The gradient direction doesn't change!
53
REINFORCE baselines
● Gradient direction stays the same
● Variance may change

Gradient variance (treating ∇J as a random variable over (s, a)):

Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]
54
REINFORCE baselines
● Gradient direction stays the same
● Variance may change

Gradient variance (treating ∇J as a random variable over (s, a)):

Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]

If b(s) correlates with Q(s,a), the variance decreases.
55
REINFORCE baselines
● Gradient direction stays the same
● Variance may change

Gradient variance (treating ∇J as a random variable over (s, a)):

Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]

Q: can you suggest any such b(s)?
56
REINFORCE baselines
● Gradient direction stays the same
● Variance may change

Gradient variance (treating ∇J as a random variable over (s, a)):

Var[Q(s,a) − b(s)] = Var[Q(s,a)] − 2·Cov[Q(s,a), b(s)] + Var[b(s)]

Naive baseline: b = moving average of Q over all (s, a); then Var[b(s)] = 0 and Cov[Q, b] > 0.
57
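A minimal sketch of that naive baseline, assuming the returns passed in are the Monte Carlo estimates of Q(s,a) from sampled sessions; the decay constant is an illustrative choice.

```python
import torch

class MovingAverageBaseline:
    """Naive baseline b = moving average of observed returns over all (s, a)."""
    def __init__(self, decay=0.99):
        self.decay, self.b = decay, 0.0

    def __call__(self, returns):
        self.b = self.decay * self.b + (1 - self.decay) * returns.mean().item()
        return returns - self.b          # Q(s,a) - b, plugged in place of Q(s,a)

# usage inside the REINFORCE loss:
#   weights = baseline(G)
#   loss = -(logp * weights).sum() / n_sessions
baseline = MovingAverageBaseline()
print(baseline(torch.tensor([1.0, 2.0, 3.0])))
```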
REINFORCE baselines
∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · (Q(s,a) − V(s))

Better baseline: b(s) = V(s)
Q: but how do we predict V(s)?
58
Actor-critic
● Learn both V(s) and πθ(a∣s)
● Hope for the best of both worlds :)
59
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s)
Use Vθ(s) to learn πθ(a∣s) faster!

Q: how can we estimate A(s,a) from (s,a,r,s') and the V-function?
60
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s)
Use Vθ(s) to learn πθ(a∣s) faster!

A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)
61
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s)
Use Vθ(s) to learn πθ(a∣s) faster!

A(s,a) = Q(s,a) − V(s)
Q(s,a) = r + γ·V(s')
A(s,a) = r + γ·V(s') − V(s)

Also: an n-step version exists.
62
Advantage actor-critic
Idea: learn both πθ(a∣s) and Vθ(s)
Use Vθ(s) to learn πθ(a∣s) faster!

A(s,a) = r + γ·V(s') − V(s)   (treat V as a constant when differentiating the actor)

∇J_actor ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · A(s,a)
63
Advantage actor-critic
[diagram: one model with parameters W maps state s to both πθ(a∣s) and Vθ(s)]

Improve policy:
∇J_actor ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} ∇ log πθ(a∣s) · A(s,a)

Improve value:
L_critic ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a∈z_i} (Vθ(s) − [r + γ·V(s')])²
64
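A minimal sketch of the two losses above, assuming a small shared-body PyTorch network with a policy head and a value head; the batch tensors (states, actions, rewards, next_states, dones) and sizes are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Shared body, one head for pi_theta(a|s) (logits) and one for V_theta(s)."""
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, s):
        h = self.body(s)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_loss(net, states, actions, rewards, next_states, dones, gamma=0.99):
    logits, v = net(states)
    with torch.no_grad():                      # targets and advantages: treated as constants
        _, v_next = net(next_states)
        target = rewards + gamma * v_next * (1 - dones)
        advantage = target - v                 # A(s,a) = r + gamma*V(s') - V(s)

    logp = Categorical(logits=logits).log_prob(actions)
    actor_loss = -(logp * advantage).mean()    # ascend grad J_actor
    critic_loss = F.mse_loss(v, target)        # (V(s) - [r + gamma*V(s')])^2
    return actor_loss + critic_loss
```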
Continuous action spaces
What if there are continuously many actions?
● Robot control: set motor voltages
● Trading: allocate money across equities

How does the algorithm change?
65
Continuous action spaces
What if there are continuously many actions?
● Robot control: set motor voltages
● Trading: allocate money across equities

How does the algorithm change? It doesn't :)
Just plug in a different formula for π(a∣s), e.g. a normal distribution.
66
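A minimal sketch of that swap, assuming a Gaussian policy head (sizes are illustrative): the REINFORCE/A2C losses stay exactly the same, only log πθ(a∣s) now comes from a Normal distribution.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

obs_dim, action_dim = 8, 2                     # illustrative sizes
policy_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
log_std = nn.Parameter(torch.zeros(action_dim))

def log_prob(states, actions):
    """log pi_theta(a|s) for continuous actions; plugs into the same policy-gradient losses."""
    pi = Normal(policy_mean(states), log_std.exp())
    return pi.log_prob(actions).sum(dim=-1)    # sum over action dimensions

s = torch.randn(5, obs_dim)
a = Normal(policy_mean(s), log_std.exp()).sample()
print(log_prob(s, a).shape)                    # torch.Size([5])
```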
Asynchronous advantage actor-critic
● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage

Read more: https://arxiv.org/abs/1602.01783
67
IMPALA
Read more: https://arxiv.org/abs/1802.01561
● Massively parallel
● Separate actor / learner processes
● Small experience replay
w/ importance sampling
68
Duct tape zone
● V(s) errors less important than in Q-learning
– actor still learns even if critic is random, just slower
● Regularize with entropy
– to prevent premature convergence
● Learn on parallel sessions
– Or super-small experience replay
● Use logsoftmax for numerical stability
69
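A minimal sketch of the last two tips, assuming logits from a categorical policy head: compute log-probabilities with log_softmax for numerical stability and add an entropy bonus to discourage premature convergence (the 0.01 coefficient is an illustrative choice).

```python
import torch
import torch.nn.functional as F

def actor_loss_with_entropy(logits, actions, advantages, entropy_coef=0.01):
    """Policy-gradient loss with an entropy bonus, using log_softmax for stability."""
    log_probs = F.log_softmax(logits, dim=-1)          # log pi_theta(.|s), numerically stable
    probs = log_probs.exp()
    logp_taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    pg_loss = -(logp_taken * advantages).mean()        # ascend grad J_actor
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # H[pi_theta(.|s)]
    return pg_loss - entropy_coef * entropy            # maximize entropy => subtract it

logits = torch.randn(6, 3, requires_grad=True)
loss = actor_loss_with_entropy(logits, torch.randint(0, 3, (6,)), torch.randn(6))
loss.backward()
```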
Outro and Q & A

● Remember the log-derivative trick
● Combining the best from both worlds (value-based and policy-based) is generally a good idea
● See this paper for the proof of the policy gradient for discounted rewards
● Time to write some code!
70
