Advanced Policy Gradient Methods:
Natural Gradient, TRPO, and More
John Schulman
August 26, 2017
Two Limitations of “Vanilla” Policy Gradient Methods
Hard to choose stepsizes
Input data is nonstationary due to changing policy: observation and reward
distributions change
Bad step is more damaging than in supervised learning, since it affects
visitation distribution
Step too far → bad policy
Next batch: collected under bad policy
Can’t recover—collapse in performance
Sample efficiency
Only one gradient step per environment sample
Dependent on scaling of coordinates
Reducing reinforcement learning to optimization
Much of modern ML: reduce learning to numerical optimization problem
Supervised learning: minimize training error
RL: how to use all data so far and compute the best policy?
Q-learning: can (in principle) include all transitions seen so far; however,
we’re optimizing the wrong objective
Policy gradient methods: yes stochastic gradients, but no optimization
problem∗
This lecture: write down an optimization problem that allows you to do a
small update to policy π based on data sampled from π (on-policy data)
What Loss to Optimize?
Policy gradients
ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) Â_t ]
Can differentiate the following loss
L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ],
but don’t want to optimize it too far
Equivalently, differentiate
L^IS_θold(θ) = Ê_t[ π_θ(a_t | s_t) / π_θold(a_t | s_t) · Â_t ]
at θ = θold, where state-actions are sampled using θold. (IS = importance sampling)
Just the chain rule: ∇_θ log f(θ)|_θold = ∇_θ f(θ)|_θold / f(θold) = ∇_θ ( f(θ) / f(θold) )|_θold
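A minimal NumPy sketch of the two losses on a sampled batch, assuming per-timestep log-probabilities under the current and data-collecting policies and advantage estimates are already available (all array names here are illustrative placeholders, not part of the lecture):

```python
import numpy as np

def pg_loss(logp, advantages):
    """'Vanilla' policy gradient surrogate: L^PG = mean_t[ log pi_theta(a_t|s_t) * A_t ]."""
    return np.mean(logp * advantages)

def is_loss(logp, logp_old, advantages):
    """Importance-sampled surrogate: L^IS = mean_t[ (pi_theta / pi_theta_old) * A_t ].

    The ratio is computed in log space for numerical stability.
    """
    ratio = np.exp(logp - logp_old)
    return np.mean(ratio * advantages)

# Toy batch: at theta = theta_old the two losses have the same gradient,
# even though their values differ (L^IS equals the mean advantage there).
rng = np.random.default_rng(0)
logp_old = np.log(rng.uniform(0.1, 0.9, size=64))
advantages = rng.normal(size=64)
print(pg_loss(logp_old, advantages), is_loss(logp_old, logp_old, advantages))
```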
Surrogate Loss: Importance Sampling Interpretation
Importance sampling interpretation:
E_{s_t ∼ π_θold, a_t ∼ π_θ}[ A^π(s_t, a_t) ]
  = E_{s_t ∼ π_θold, a_t ∼ π_θold}[ π_θ(a_t | s_t) / π_θold(a_t | s_t) · A^πθold(s_t, a_t) ]   (importance sampling)
  = E_{s_t ∼ π_θold, a_t ∼ π_θold}[ π_θ(a_t | s_t) / π_θold(a_t | s_t) · Â_t ]   (replace A^π with its estimator)
  = L^IS_θold(θ)
Kakade and Langford¹ and Schulman et al.² analyze how L^IS approximates the actual
performance difference between θ and θold.
In practice, L^IS is not much different from the logprob version
L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ], for reasonably small policy changes.
¹ S. Kakade and J. Langford. “Approximately optimal approximate reinforcement learning”. ICML (2002).
² J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust Region Policy Optimization”. ICML (2015).
Trust Region Policy Optimization
Define the following trust region update:
maximize_θ  Ê_t[ π_θ(a_t | s_t) / π_θold(a_t | s_t) · Â_t ]
subject to  Ê_t[ KL[π_θold(· | s_t), π_θ(· | s_t)] ] ≤ δ
Also worth considering using a penalty instead of a constraint:
maximize_θ  Ê_t[ π_θ(a_t | s_t) / π_θold(a_t | s_t) · Â_t ] − β Ê_t[ KL[π_θold(· | s_t), π_θ(· | s_t)] ]
Method of Lagrange multipliers: optimality point of δ-constrained problem
is also an optimality point of β-penalized problem for some β.
In practice, δ is easier to tune, and fixed δ is better than fixed β
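As a concrete reading of the constraint, here is a small NumPy sketch of the estimated mean KL between the old and new policies over a batch of sampled states, written for categorical (discrete-action) policies; the probability arrays are illustrative placeholders:

```python
import numpy as np

def mean_kl_categorical(p_old, p_new, eps=1e-8):
    """Estimate E_t[ KL[pi_old(.|s_t), pi_new(.|s_t)] ] for discrete actions.

    p_old, p_new: arrays of shape (batch, num_actions) with action probabilities.
    """
    kl_per_state = np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1)
    return kl_per_state.mean()

# Toy check: identical policies give KL = 0; a perturbed policy gives KL > 0.
p_old = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])
p_new = np.array([[0.6, 0.3, 0.1], [0.3, 0.3, 0.4]])
print(mean_kl_categorical(p_old, p_old), mean_kl_categorical(p_old, p_new))
```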
Monotonic Improvement Result
Consider the KL penalized objective
maximize_θ  Ê_t[ π_θ(a_t | s_t) / π_θold(a_t | s_t) · Â_t ] − β Ê_t[ KL[π_θold(· | s_t), π_θ(· | s_t)] ]
Theory result: if we use max KL instead of mean KL in penalty, then we get
a lower (=pessimistic) bound on policy performance
Trust Region Policy Optimization: Pseudocode
Pseudocode:
for iteration=1, 2, . . . do
Run policy for T timesteps or N trajectories
Estimate advantage function at all timesteps
maximize_θ  Σ_{n=1}^N  π_θ(a_n | s_n) / π_θold(a_n | s_n) · Â_n
subject to  KL_πθold(π_θ) ≤ δ
end for
Can solve constrained optimization problem efficiently by using conjugate
gradient
Closely related to natural policy gradients (Kakade, 2002), natural actor
critic (Peters and Schaal, 2005), REPS (Peters et al., 2010)
Solving KL Penalized Problem
maximize_θ  L_πθold(π_θ) − β · KL_πθold(π_θ)
Make a linear approximation to L_πθold and a quadratic approximation to the KL term:
maximize_θ  g · (θ − θold) − (β/2) (θ − θold)^T F (θ − θold)
where g = ∂/∂θ L_πθold(π_θ) |_{θ=θold},   F = ∂²/∂θ² KL_πθold(π_θ) |_{θ=θold}
Quadratic part of L is negligible compared to KL term
F is positive semidefinite, but not if we include Hessian of L
Solution: θ − θold = (1/β) F⁻¹ g, where F is the Fisher information matrix and g is the
policy gradient. This is called the natural policy gradient³.
³ S. Kakade. “A Natural Policy Gradient”. NIPS (2001).
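A minimal NumPy sketch of solving F s = g with conjugate gradient, as TRPO does; only matrix-vector products with F are needed, so `fisher_vector_product` below stands in for whatever Fisher-vector product routine an implementation provides (e.g. a KL Hessian-vector product via autodiff). The explicit small matrix is only there to make the snippet runnable:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only a function fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x = 0 initially)
    p = g.copy()          # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        r_dot_new = r @ r
        if r_dot_new < tol:
            break
        p = r + (r_dot_new / r_dot) * p
        r_dot = r_dot_new
    return x

# Toy check with an explicit symmetric positive-definite "Fisher" matrix.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
F = M @ M.T + 1e-2 * np.eye(5)     # SPD, like a (damped) Fisher matrix
g = rng.normal(size=5)
s = conjugate_gradient(lambda v: F @ v, g, iters=50)
print(np.allclose(F @ s, g, atol=1e-6))
```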
Solving KL Constrained Problem
Method of Lagrange multipliers: solve penalized problem to get θ∗(β). Then
substitute θ∗(β) into original problem and solve for β.
β only affects scaling of solution, not direction. Compute scaling as follows:
Compute s = F⁻¹g (solution with β = 1)
Rescale θ − θold = αs so that the constraint is satisfied. Under the quadratic approximation
KL_πθold(π_θ) ≈ (1/2)(θ − θold)^T F (θ − θold), we want
(1/2)(αs)^T F(αs) = δ   ⇒   α = sqrt( 2δ / (s^T F s) )
Even better, we can do a line search to solve the original nonlinear problem.
maximize  L_πθold(π_θ) − ∞ · 1[KL_πθold(π_θ) > δ]
Try α, α/2, α/4, . . . until line search objective improves
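A sketch of the rescaling and backtracking line search in NumPy. Here `surrogate` and `mean_kl` stand in for functions that evaluate L_πθold and the KL on the sampled batch at a candidate θ (assumed to come from the rest of the implementation); the toy quadratic versions below exist only to make the snippet runnable:

```python
import numpy as np

def trpo_line_search(theta_old, s, delta, surrogate, mean_kl, fvp, max_backtracks=10):
    """Scale the search direction s to the trust-region boundary, then backtrack."""
    alpha = np.sqrt(2.0 * delta / (s @ fvp(s)))   # from (1/2) (alpha s)^T F (alpha s) = delta
    L_old = surrogate(theta_old)
    for i in range(max_backtracks):
        theta = theta_old + (alpha / 2**i) * s
        # Accept the first step that improves the surrogate and respects the KL constraint.
        if surrogate(theta) > L_old and mean_kl(theta) <= delta:
            return theta
    return theta_old    # no acceptable step found; keep the old parameters

# Toy problem: quadratic surrogate and quadratic "KL", with F as the KL curvature.
F = np.diag([2.0, 0.5])
g = np.array([1.0, 1.0])
theta0 = np.zeros(2)
surrogate = lambda th: g @ (th - theta0) - 0.1 * np.sum((th - theta0) ** 4)
mean_kl = lambda th: 0.5 * (th - theta0) @ F @ (th - theta0)
s = np.linalg.solve(F, g)            # s = F^{-1} g (from conjugate gradient in practice)
print(trpo_line_search(theta0, s, delta=0.01, surrogate=surrogate,
                       mean_kl=mean_kl, fvp=lambda v: F @ v))
```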
“Proximal” Policy Optimization: KL Penalty Version
Use penalty instead of constraint
maximize_θ  Σ_{n=1}^N  π_θ(a_n | s_n) / π_θold(a_n | s_n) · Â_n  −  β · KL_πθold(π_θ)
Pseudocode:
for iteration=1, 2, . . . do
Run policy for T timesteps or N trajectories
Estimate advantage function at all timesteps
Do SGD on above objective for some number of epochs
If KL too high, increase β. If KL too low, decrease β.
end for
≈ same performance as TRPO, but only first-order optimization
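A sketch of the adaptive penalty-coefficient update from the pseudocode, using the thresholds and factors from the PPO paper; the target KL `kl_target` is a hyperparameter you choose:

```python
def update_kl_penalty(beta, measured_kl, kl_target):
    """Adapt the KL penalty coefficient beta after each policy update.

    Follows the schedule in the PPO paper: double beta when the measured KL is
    well above target, halve it when well below, otherwise leave it alone.
    """
    if measured_kl > 1.5 * kl_target:
        beta *= 2.0     # policy moved too far: penalize KL more strongly
    elif measured_kl < kl_target / 1.5:
        beta /= 2.0     # policy barely moved: relax the penalty
    return beta

# Example: beta drifts up when updates overshoot the KL target.
beta = 1.0
for kl in [0.02, 0.05, 0.001]:
    beta = update_kl_penalty(beta, kl, kl_target=0.01)
    print(beta)
```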
Review
Suggested optimizing surrogate loss L^PG or L^IS
Suggested using KL to constrain size of update
Corresponds to natural gradient step F⁻¹g under linear-quadratic approximation
Can solve for this step approximately using conjugate gradient method
Connection Between Trust Region Problem and Other Things
maximize_θ  Σ_{n=1}^N  π_θ(a_n | s_n) / π_θold(a_n | s_n) · Â_n
subject to  KL_πθold(π_θ) ≤ δ
Linear-quadratic approximation + penalty ⇒ natural gradient
No constraint ⇒ policy iteration
Euclidean penalty instead of KL ⇒ vanilla policy gradient
Limitations of TRPO
Hard to use with architectures with multiple outputs, e.g. policy and value
function (need to weight different terms in distance metric)
Empirically performs poorly on tasks requiring deep CNNs and RNNs, e.g.
Atari benchmark
CG makes implementation more complicated
Calculating Natural Gradient Step with KFAC
Summary: do a blockwise approximation to the FIM, and approximate each block
using a certain factorization
Alternate expression for the FIM as an outer product (instead of the second derivative of the KL):
Ê_t[ ∇_θ log π_θ(a_t | s_t) ∇_θ log π_θ(a_t | s_t)^T ]
R. Grosse and J. Martens. “A Kronecker-factored approximate Fisher matrix for convolution layers”. 2016
J. Martens and R. Grosse. “Optimizing neural networks with Kronecker-factored approximate curvature”. 2015
Calculating Natural Gradient Step with KFAC
Consider a network with weight matrix W appearing once in the network, with
y = Wx. Then
∇_{W_ij} L = x_i ∇_{y_j} L
F = Ê_t[ ∇_W log π_θ(a_t | s_t) ∇_W log π_θ(a_t | s_t)^T ],  i.e.
F_{ij,i′j′} = Ê_t[ x_i (∇_{y_j} log π_θ) x_{i′} (∇_{y_{j′}} log π_θ) ]
KFAC approximation:
F_{ij,i′j′} = Ê_t[ x_i x_{i′} (∇_{y_j} log π_θ)(∇_{y_{j′}} log π_θ) ] ≈ Ê_t[ x_i x_{i′} ] Ê_t[ (∇_{y_j} log π_θ)(∇_{y_{j′}} log π_θ) ]
Approximates each Fisher block as a tensor product A ⊗ B, where A is n_in × n_in
and B is n_out × n_out. Since (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹, we can compute the
matrix-vector product without forming the full matrix.
Maintain running estimates of the covariance matrices Ê_t[ x_i x_{i′} ] and
Ê_t[ (∇_{y_j} log π_θ)(∇_{y_{j′}} log π_θ) ], and periodically compute the matrix inverses (small overhead)
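A minimal NumPy sketch of the per-layer Kronecker-factored natural gradient step for one dense layer, under one particular layout convention (weight gradient of shape n_out × n_in, A built from layer inputs, B from gradients w.r.t. the layer outputs); damping and the running-average scheme are simplified relative to the KFAC papers:

```python
import numpy as np

def kfac_natural_gradient(grad_W, X, Dy, damping=1e-3):
    """Approximate F^{-1} grad for one dense layer via the KFAC factorization.

    grad_W : (n_out, n_in) gradient of the objective w.r.t. the weight matrix
    X      : (batch, n_in)  layer inputs x_t
    Dy     : (batch, n_out) gradients of log pi w.r.t. the layer outputs y_t
    """
    batch = X.shape[0]
    A = X.T @ X / batch + damping * np.eye(X.shape[1])      # input covariance  (n_in x n_in)
    B = Dy.T @ Dy / batch + damping * np.eye(Dy.shape[1])   # output-grad covariance (n_out x n_out)
    # With the Fisher block approximated as A (x) B, applying its inverse to a weight
    # gradient reduces to two small solves instead of inverting the full block.
    return np.linalg.solve(B, grad_W) @ np.linalg.inv(A)

# Toy shapes: a 4 -> 3 layer with a batch of 32 samples.
rng = np.random.default_rng(0)
X, Dy = rng.normal(size=(32, 4)), rng.normal(size=(32, 3))
grad_W = rng.normal(size=(3, 4))
print(kfac_natural_gradient(grad_W, X, Dy).shape)   # (3, 4)
```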
ACKTR: Combine A2C with KFAC Natural Gradient
Combined with A2C, gives excellent results on the Atari benchmark and on continuous
control from images⁴.
Note: we’re already computing ∇_θ log π_θ(a_t | s_t) for the policy gradient: no
extra gradient computation is needed
Matrix inverses can be computed asynchronously
Limitation: works straightforwardly for feedforward nets (including
convolutions), less straightforward for RNNs or architectures with shared
weights
⁴ Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. “Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation”. (2017).
Proximal Policy Optimization: KL Penalized Version
Back to penalty instead of constraint
maximize_θ  Σ_{n=1}^N  π_θ(a_n | s_n) / π_θold(a_n | s_n) · Â_n  −  β · KL_πθold(π_θ)
Pseudocode:
for iteration=1, 2, . . . do
Run policy for T timesteps or N trajectories
Estimate advantage function at all timesteps
Do SGD on above objective for some number of epochs
If KL too high, increase β. If KL too low, decrease β.
end for
≈ same performance as TRPO, but only first-order optimization
Proximal Policy Optimization: Clipping Objective
Recall the surrogate objective
L^IS(θ) = Ê_t[ π_θ(a_t | s_t) / π_θold(a_t | s_t) · Â_t ] = Ê_t[ r_t(θ) Â_t ]   (1)
Form a lower bound via clipped importance ratios
L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t,  clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]   (2)
[Figure: the surrogate objectives L^CPI = Ê_t[r_t Â_t], Ê_t[clip(r_t, 1 − ε, 1 + ε) Â_t], and L^CLIP = Ê_t[min(r_t Â_t, clip(r_t, 1 − ε, 1 + ε) Â_t)], together with Ê_t[KL_t], plotted against the linear interpolation factor between θold and the updated θ.]
Forms pessimistic bound on objective, can be optimized using SGD
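A short NumPy sketch of L^CLIP on a batch, given per-timestep log-probabilities under the new and old policies and advantage estimates (illustrative array names, not part of the lecture):

```python
import numpy as np

def clipped_surrogate(logp, logp_old, advantages, epsilon=0.2):
    """PPO clipped objective L^CLIP (to be maximized; negate it for a minimizer)."""
    ratio = np.exp(logp - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # elementwise min, then average

# The clipped term removes the incentive to push the ratio outside [1 - eps, 1 + eps].
rng = np.random.default_rng(0)
logp_old = np.log(rng.uniform(0.1, 0.9, size=128))
logp_new = logp_old + rng.normal(scale=0.3, size=128)    # a moderately changed policy
adv = rng.normal(size=128)
print(clipped_surrogate(logp_new, logp_old, adv))
```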
Proximal Policy Optimization
Pseudocode:
for iteration=1, 2, . . . do
Run policy for T timesteps or N trajectories
Estimate advantage function at all timesteps
Do SGD on the L^CLIP(θ) objective for some number of epochs
end for
A bit better than TRPO on continuous control, much better on Atari
Compatible with multi-output networks and RNNs
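To make the loop concrete, here is a self-contained toy: a softmax "policy" over a handful of actions in a one-step bandit, updated for several epochs per batch by gradient ascent on L^CLIP. The bandit, its reward table, and the finite-difference gradients are purely illustrative choices to keep the sketch dependency-free; a real implementation would use a neural-network policy and automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, EPSILON, LR = 4, 0.2, 0.5
REWARDS = np.array([0.0, 0.2, 0.5, 1.0])   # hidden per-action mean reward (toy environment)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def clipped_objective(theta, actions, logp_old, adv):
    logp = np.log(softmax(theta)[actions])
    ratio = np.exp(logp - logp_old)
    return np.mean(np.minimum(ratio * adv,
                              np.clip(ratio, 1 - EPSILON, 1 + EPSILON) * adv))

def finite_diff_grad(f, theta, h=1e-5):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

theta = np.zeros(NUM_ACTIONS)
for iteration in range(20):
    # "Run policy": sample a batch of actions and rewards from the current policy.
    probs = softmax(theta)
    actions = rng.choice(NUM_ACTIONS, size=64, p=probs)
    rewards = REWARDS[actions] + rng.normal(scale=0.1, size=64)
    adv = rewards - rewards.mean()             # crude advantage estimate (baseline = batch mean)
    logp_old = np.log(probs[actions])
    # Several epochs of gradient ascent on L^CLIP using the same batch.
    for _ in range(10):
        f = lambda th: clipped_objective(th, actions, logp_old, adv)
        theta += LR * finite_diff_grad(f, theta)

print(softmax(theta))   # probability mass should concentrate on the best (highest-reward) action
```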
Further Reading
S. Kakade. “A Natural Policy Gradient.” NIPS. 2001
S. Kakade and J. Langford. “Approximately optimal approximate reinforcement learning”. ICML. 2002
J. Peters and S. Schaal. “Natural actor-critic”. Neurocomputing (2008)
J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust Region Policy Optimization”. ICML (2015)
Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. “Benchmarking Deep Reinforcement Learning for Continuous Control”. ICML (2016)
J. Martens and I. Sutskever. “Training deep and recurrent networks with Hessian-free optimization”. Springer, 2012
Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. “Sample Efficient Actor-Critic with Experience Replay”. (2016)
Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. “Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation”. (2017)
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. “Proximal Policy Optimization Algorithms”. (2017)
blog.openai.com: recent posts on baselines releases
That's all. Questions?