1. The document discusses natural policy gradients, Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO) for deep reinforcement learning and control.
2. It explains that policy gradient methods aim to directly optimize the policy parameters but have challenges with step size selection and sample efficiency.
3. Natural policy gradients address this by taking steps in the policy space rather than parameter space, using the Fisher information matrix to define a natural gradient that considers how parameter changes affect the policy distribution. This makes step size selection easier by constraining changes to a threshold KL divergence in policy space.
1. Natural Policy Gradients, TRPO, PPO
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon
School of Computer Science
CMU 10703
2. Part of the slides adapted from John Schulman and Joshua Achiam
4. Policy Gradients

Monte Carlo Policy Gradients (REINFORCE), gradient direction:
$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$

What Loss to Optimize?
- Policy gradients: $\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$
- We can differentiate the loss
  $$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],$$
  but we don't want to optimize it too far.
- Equivalently, we can differentiate the importance-sampled (IS) loss at $\theta = \theta_{\text{old}}$, where the state-actions are sampled using $\theta_{\text{old}}$:
  $$L^{IS}_{\theta_{\text{old}}}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]$$
  This is just the chain rule: $\nabla_\theta \log f(\theta)\big|_{\theta_{\text{old}}} = \dfrac{\nabla_\theta f(\theta)\big|_{\theta_{\text{old}}}}{f(\theta_{\text{old}})} = \nabla_\theta\!\left(\dfrac{f(\theta)}{f(\theta_{\text{old}})}\right)\Big|_{\theta_{\text{old}}}$

Actor-Critic Policy Gradient: $\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,A_w(s_t)\right]$

[Figure: Gaussian policy with mean $\mu_\theta(s)$ and std $\sigma_\theta(s)$, before ($\theta_{\text{old}}$) and after ($\theta_{\text{new}}$) an update.]

Update: $\theta_{\text{new}} = \theta + \epsilon \cdot \hat{g}$

1. Collect trajectories for policy $\pi_\theta$
2. Estimate advantages $\hat{A}$
3. Compute policy gradient $\hat{g}$
4. Update policy parameters
5. GOTO 1

This lecture is all about the stepsize $\epsilon$.
5. What is the underlying objective function?

Policy gradients:
$$\hat{g} \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\,A(s_t^{(i)}, a_t^{(i)}), \qquad \tau_i \sim \pi_\theta$$

What is our objective? The gradient above results from differentiating the objective function
$$J^{PG}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\,A(s_t^{(i)}, a_t^{(i)}), \qquad \tau_i \sim \pi_\theta$$

Is this our objective? We cannot both maximize over a variable and sample from it.

Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we have access to expert actions; then the loss function we want to optimize is
$$J^{SL}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta(\tilde{a}_t^{(i)} \mid s_t^{(i)}), \qquad \tau_i \sim \pi^*,$$
which maximizes the probability of expert actions in the training set (+regularization).

Is this our SL objective? Well, we cannot optimize it too far: our advantage estimates come from samples of $\pi_{\theta_{\text{old}}}$. However, this constraint of "cannot optimize too far from $\theta_{\text{old}}$" does not appear anywhere in the objective. As a matter of fact we care about test error, but that is a long story; the short answer is that this is good enough for us to optimize, as long as we regularize.
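As a concrete anchor for the discussion above, here is a minimal PyTorch sketch, not from the lecture, of the $J^{PG}$ surrogate: the quantity whose gradient is the policy gradient. The module name `policy`, the tensor names, and the discrete-action assumption are mine; advantages are treated as fixed targets.

```python
import torch
import torch.nn.functional as F

def pg_surrogate_loss(policy, states, actions, advantages):
    """Negative of J^PG(theta) = mean_t[ log pi_theta(a_t|s_t) * A_t ]."""
    logits = policy(states)                              # (batch, num_actions)
    log_probs = F.log_softmax(logits, dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradients flow only through log pi_theta; the advantages are constants.
    return -(log_pi_a * advantages.detach()).mean()
```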
6. Policy Gradients

(Recap of Slide 4: the REINFORCE gradient, the losses $L^{PG}$ and $L^{IS}_{\theta_{\text{old}}}$, the actor-critic gradient, the update $\theta_{\text{new}} = \theta + \epsilon \cdot \hat{g}$, and the collect/estimate/update loop.)

This lecture is all about the stepsize. It is also about writing down an objective that we can optimize with PG, and the procedure 1, 2, 3, 4, 5 will be the result of this objective maximization.

[Figure: Gaussian policy $\mu_\theta(s), \sigma_\theta(s)$ before and after the update $\theta_{\text{old}} \to \theta_{\text{new}}$.]
7. Policy Gradients

(Recap of Slide 4.)

Two problems with the vanilla formulation:
1. Hard to choose the stepsize $\epsilon$ in $\theta_{\text{new}} = \theta + \epsilon \cdot \hat{g}$.
2. Sample inefficient: we cannot use data collected with policies of previous iterations.
8. Two Problems: Hard to Choose Stepsizes

- Step too big: a bad policy leads to data collected under that bad policy, and we cannot recover. (In supervised learning, the data does not depend on the neural network weights.)
- Step too small: not an efficient use of experience. (In supervised learning, data can be trivially re-used.)

Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between $\pi_{\theta_{\text{old}}}(s)$ and $\pi_{\theta_{\text{new}}}(s)$.

(Recap of Slide 4 and the collect/estimate/update loop.)
9. Two Problems: Hard to Choose Stepsizes

(Recap of Slide 4.)

The Problem is More Than Step Size

Consider a family of policies with parametrization:
$$\pi_\theta(a) = \begin{cases} \sigma(\theta) & a = 1 \\ 1 - \sigma(\theta) & a = 2 \end{cases}$$

Figure: Small changes in the policy parameters can unexpectedly lead to big changes in the policy.
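A small numeric illustration of this slide's point, using the two-action policy $\pi_\theta(a{=}1) = \sigma(\theta)$ above; the step size, the $\theta$ values, and the helper names are made up. The same parameter step changes the action distribution by a huge or a negligible amount depending on where $\theta$ is, so Euclidean distance in parameter space is a poor proxy for distance in policy space.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_bernoulli(p, q):
    """KL( Bern(p) || Bern(q) )."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

step = 2.0
for theta in (0.0, 6.0):
    p_old, p_new = sigmoid(theta), sigmoid(theta + step)
    print(f"theta={theta:4.1f}: pi(a=1) {p_old:.3f} -> {p_new:.3f}, "
          f"KL={kl_bernoulli(p_old, p_new):.4f}")
# Near theta=0 the step moves pi(a=1) from 0.50 to 0.88 (large KL);
# near theta=6 the same step barely changes the policy (KL of order 1e-3).
```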
10. Notation

We will use the following to denote values of parameters and corresponding policies before and after an update:
$$\theta_{\text{old}} \to \theta_{\text{new}}, \qquad \pi_{\text{old}} \to \pi_{\text{new}}$$
$$\theta \to \theta', \qquad \pi \to \pi'$$
11. Gradient Descent in Parameter Space

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search.

SGD (Euclidean distance in parameter space):
$$d^* = \arg\max_{\|d\| \le \epsilon} J(\theta + d), \qquad \theta_{\text{new}} = \theta_{\text{old}} + d^*$$

It is hard to predict the effect of the step on the parameterized distribution $\mu_\theta(s), \sigma_\theta(s)$.
12. Gradient Descent in Distribution Space

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search.

SGD (Euclidean distance in parameter space):
$$d^* = \arg\max_{\|d\| \le \epsilon} J(\theta + d), \qquad \theta_{\text{new}} = \theta_{\text{old}} + d^*$$
It is hard to predict the effect on the parameterized distribution, and hard to pick the threshold $\epsilon$.

Natural gradient descent (KL divergence in distribution space): the step in parameter space is determined by considering the KL divergence between the distributions before and after the update:
$$d^* = \arg\max_{d \,:\, \mathrm{KL}(\pi_\theta \| \pi_{\theta+d}) \le \epsilon} J(\theta + d)$$
Easier to pick the distance threshold!
13. Solving the KL Constrained Problem

Unconstrained penalized objective:
$$d^* = \arg\max_d\ J(\theta + d) - \lambda\left(D_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\theta+d}\right] - \epsilon\right)$$

First order Taylor expansion for the loss and second order for the KL:
$$\approx \arg\max_d\ J(\theta_{\text{old}}) + \nabla_\theta J(\theta)\big|_{\theta=\theta_{\text{old}}} \cdot d - \frac{1}{2}\lambda\left(d^\top \nabla^2_\theta D_{\mathrm{KL}}\!\left[\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\right]\big|_{\theta=\theta_{\text{old}}}\, d\right) + \lambda\epsilon$$
16. Fisher Information Matrix

$$F(\theta) = \mathbb{E}_\theta\!\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$$

Exactly equivalent to the Hessian of the KL divergence:
$$D_{\mathrm{KL}}(p_{\theta_{\text{old}}} \| p_\theta) \approx D_{\mathrm{KL}}(p_{\theta_{\text{old}}} \| p_{\theta_{\text{old}}}) + d^\top \nabla_\theta D_{\mathrm{KL}}(p_{\theta_{\text{old}}} \| p_\theta)\big|_{\theta=\theta_{\text{old}}} + \frac{1}{2}\, d^\top \nabla^2_\theta D_{\mathrm{KL}}(p_{\theta_{\text{old}}} \| p_\theta)\big|_{\theta=\theta_{\text{old}}}\, d$$
$$= \frac{1}{2}\, d^\top F(\theta_{\text{old}})\, d = \frac{1}{2} (\theta - \theta_{\text{old}})^\top F(\theta_{\text{old}}) (\theta - \theta_{\text{old}})$$
(The first two terms vanish: the KL at $\theta_{\text{old}}$ is zero and so is its gradient there.)
$$F(\theta_{\text{old}}) = \nabla^2_\theta D_{\mathrm{KL}}(p_{\theta_{\text{old}}} \| p_\theta)\big|_{\theta=\theta_{\text{old}}}$$

Since KL divergence is roughly analogous to a distance measure between distributions, Fisher information serves as a local distance metric between distributions: how much you change the distribution if you move the parameters a little bit in a given direction.
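A hedged sketch that checks the equivalence above numerically for a small categorical distribution: the outer-product definition of $F$ matches the Hessian of the KL at $\theta_{\text{old}}$. The variable names and the 3-action setup are assumptions, not from the lecture.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(3, requires_grad=True)        # logits of a 3-outcome distribution

def log_probs(params):
    return torch.log_softmax(params, dim=0)

# Exact expectation over outcomes of the score outer product: E[grad log p grad log p^T].
probs = torch.softmax(theta, dim=0).detach()
F_outer = torch.zeros(3, 3)
for a in range(3):
    g = torch.autograd.grad(log_probs(theta)[a], theta)[0]
    F_outer += probs[a] * torch.outer(g, g)

# Hessian of KL(p_theta_old || p_theta) with respect to theta, evaluated at theta_old.
theta_old = theta.detach()
p_old = torch.softmax(theta_old, dim=0)

def kl(params):
    return (p_old * (torch.log(p_old) - log_probs(params))).sum()

F_hess = torch.autograd.functional.hessian(kl, theta_old)
print(torch.allclose(F_outer, F_hess, atol=1e-5))   # True: the two definitions agree
```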
17. Solving the KL Constrained Problem

First order Taylor expansion for the loss and second order for the KL (unconstrained penalized objective):
$$d^* = \arg\max_d\ J(\theta + d) - \lambda\left(D_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\theta+d}\right] - \epsilon\right)$$
$$\approx \arg\max_d\ J(\theta_{\text{old}}) + \nabla_\theta J(\theta)\big|_{\theta=\theta_{\text{old}}} \cdot d - \frac{1}{2}\lambda\left(d^\top \nabla^2_\theta D_{\mathrm{KL}}\!\left[\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\right]\big|_{\theta=\theta_{\text{old}}}\, d\right) + \lambda\epsilon$$

Substituting the Fisher information matrix for the KL Hessian:
$$= \arg\max_d\ \nabla_\theta J(\theta)\big|_{\theta=\theta_{\text{old}}} \cdot d - \frac{1}{2}\lambda\, d^\top F(\theta_{\text{old}})\, d$$
$$= \arg\min_d\ -\nabla_\theta J(\theta)\big|_{\theta=\theta_{\text{old}}} \cdot d + \frac{1}{2}\lambda\, d^\top F(\theta_{\text{old}})\, d$$
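Setting the gradient of this quadratic objective to zero gives a step in the direction $F^{-1}\nabla_\theta J$ (the natural gradient, as Slide 26 spells out). For a small, dense problem the update can be written directly as in the sketch below; the naming is mine, and the step is rescaled so the quadratic KL estimate $\tfrac{1}{2} d^\top F d$ hits the threshold $\epsilon$. TRPO later avoids the explicit matrix inverse via conjugate gradient.

```python
import numpy as np

def natural_gradient_step(grad, fisher, kl_limit, damping=1e-3):
    """Return d = beta * F^{-1} g, scaled so that 0.5 * d^T F d ~= kl_limit."""
    F = fisher + damping * np.eye(len(grad))     # damping keeps F positive definite
    nat_grad = np.linalg.solve(F, grad)          # F^{-1} g
    beta = np.sqrt(2.0 * kl_limit / (grad @ nat_grad + 1e-12))
    return beta * nat_grad

# Tiny usage example with made-up numbers.
g = np.array([0.5, -1.0])
F = np.array([[2.0, 0.0], [0.0, 50.0]])          # very different curvature per direction
print(natural_gradient_step(g, F, kl_limit=0.01))
# Unlike plain gradient ascent, the step is shrunk along the high-curvature direction.
```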
21. Policy Gradients

(Recap of Slide 4, now written in terms of $\theta_{\text{old}}$: update $\theta_{\text{new}} = \theta_{\text{old}} + \epsilon \cdot \hat{g}$; collect trajectories for $\pi_{\theta_{\text{old}}}$, estimate advantages $\hat{A}$, compute the policy gradient $\hat{g}$, update the parameters, GOTO 1.)

[Figure: Gaussian policy $\mu_{\theta_{\text{old}}}(s), \sigma_{\theta_{\text{old}}}(s)$ and $\mu_{\theta_{\text{new}}}(s), \sigma_{\theta_{\text{new}}}(s)$.]
22. Policy Gradients

(Recap of Slide 21.)

- On-policy learning can be extremely inefficient.
- The policy changes only a little bit with each gradient step.
- I want to be able to use earlier data. How to do that?
24. Off-Policy Learning with Importance Sampling

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\!\left[R(\tau)\right] = \sum_\tau \pi_\theta(\tau) R(\tau) = \sum_\tau \pi_{\theta_{\text{old}}}(\tau)\, \frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{old}}}(\tau)}\, R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{old}}}(\tau)}\, R(\tau)\right]$$

The trajectory ratio factorizes into per-step action ratios:
$$\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{old}}}(\tau)} = \prod_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

so, with advantages,
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\!\left[\sum_{t=1}^{T} \left(\prod_{t'=1}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{\text{old}}}(a_{t'} \mid s_{t'})}\right) \hat{A}_t\right]$$

Gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_{\theta_{\text{old}}}(\tau)}\, R(\tau)\right], \qquad \nabla_\theta J(\theta)\big|_{\theta=\theta_{\text{old}}} = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\!\left[\nabla_\theta \log \pi_\theta(\tau)\big|_{\theta=\theta_{\text{old}}}\, R(\tau)\right]$$

Now we can use data from the old policy, but the variance has increased by a lot! Those products of ratios can explode or vanish!
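A small made-up simulation of why those products are dangerous: even when each per-step ratio stays close to 1, the cumulative product over a long trajectory wanders far from 1, typically spanning orders of magnitude over a few hundred steps. The noise model below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
# Suppose each per-step ratio pi_new(a|s)/pi_old(a|s) is close to 1 (lognormal noise).
ratios = np.exp(rng.normal(loc=0.0, scale=0.2, size=T))
cumulative = np.cumprod(ratios)
print(cumulative[[9, 49, 199]])            # the trajectory weight drifts as T grows
print(cumulative.max(), cumulative.min())  # spread of the running weight over the episode
```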
25. Trust Region Policy Optimization

Define the following trust region update:
$$\max_\theta\ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$

Also worth considering: use a penalty instead of a constraint,
$$\max_\theta\ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] - \beta\, \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$

By the method of Lagrange multipliers, an optimum of the $\delta$-constrained problem is also an optimum of the $\beta$-penalized problem for some $\beta$. In practice, $\delta$ is easier to tune, and fixed $\delta$ is better than fixed $\beta$.

Again the KL penalized problem!
26. Solving the KL Penalized Problem

$$\max_\theta\ L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - \beta\, \mathrm{KL}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$$

Make a linear approximation to $L_{\pi_{\theta_{\text{old}}}}$ and a quadratic approximation to the KL term:
$$\max_\theta\ g \cdot (\theta - \theta_{\text{old}}) - \frac{\beta}{2} (\theta - \theta_{\text{old}})^\top F\, (\theta - \theta_{\text{old}})$$
$$\text{where } g = \frac{\partial}{\partial\theta} L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\Big|_{\theta=\theta_{\text{old}}}, \qquad F = \frac{\partial^2}{\partial\theta^2} \mathrm{KL}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\Big|_{\theta=\theta_{\text{old}}}$$

- The quadratic part of $L$ is negligible compared to the KL term.
- $F$ is positive semidefinite, but not if we include the Hessian of $L$.
- Solution: $\theta - \theta_{\text{old}} = \frac{1}{\beta} F^{-1} g$, where $F$ is the Fisher information matrix and $g$ is the policy gradient. This is called the natural policy gradient (S. Kakade, "A Natural Policy Gradient," NIPS 2001).

Exactly what we saw with the natural policy gradient! One important detail follows.
27. Trust Region Policy Optimization

Small problems with the NPG update:
- It might not be robust to the trust region size $\delta$; at some iterations $\delta$ may be too large and performance can degrade.
- Because of the quadratic approximation, the KL-divergence constraint may be violated.

Solution:
- Require improvement in the surrogate (make sure that $L_{\theta_k}(\theta_{k+1}) \ge 0$).
- Enforce the KL constraint.

How? Backtracking line search with exponential decay (decay coefficient $\alpha \in (0,1)$, budget $L$).

Algorithm 2: Line Search for TRPO
  Compute proposed policy step $\Delta_k = \sqrt{\dfrac{2\delta}{\hat{g}_k^\top \hat{H}_k^{-1} \hat{g}_k}}\; \hat{H}_k^{-1} \hat{g}_k$
  for $j = 0, 1, 2, \dots, L$ do
      Compute proposed update $\theta = \theta_k + \alpha^j \Delta_k$
      if $L_{\theta_k}(\theta) \ge 0$ and $\bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_k) \le \delta$ then
          accept the update and set $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$; break
      end if
  end for

Due to the quadratic approximation, the KL constraint may be violated! What if we just do a line search to find the best stepsize, making sure:
- I am improving my objective $J(\theta)$
- The KL constraint is not violated!

(Recap of the Slide 25 trust region update: constrained and penalized forms; $\delta$ is easier to tune than $\beta$.)
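A compact sketch of Algorithm 2's backtracking loop. The callables `surrogate_improvement` and `mean_kl` are placeholders of my own standing in for sample-based estimates of $L_{\theta_k}(\theta)$ and $\bar D_{\mathrm{KL}}(\theta\|\theta_k)$; they are not defined in the lecture.

```python
import numpy as np

def backtracking_line_search(theta_k, full_step, surrogate_improvement, mean_kl,
                             delta, alpha=0.8, budget=10):
    """Shrink the proposed natural-gradient step until it both improves the surrogate
    and satisfies the KL constraint; return theta_k unchanged if nothing is accepted."""
    for j in range(budget + 1):
        theta = theta_k + (alpha ** j) * full_step
        if surrogate_improvement(theta) >= 0 and mean_kl(theta) <= delta:
            return theta
    return theta_k
```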
28. Trust Region Policy Optimization

Trust Region Policy Optimization is implemented as TNPG (truncated natural policy gradient) plus a line search. Putting it all together:

Algorithm 3: Trust Region Policy Optimization
  Input: initial policy parameters $\theta_0$
  for $k = 0, 1, 2, \dots$ do
      Collect set of trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
      Estimate advantages $\hat{A}_t^{\pi_k}$ using any advantage estimation algorithm
      Form sample estimates for the policy gradient $\hat{g}_k$ (using the advantage estimates) and the KL-divergence Hessian-vector product function $f(v) = \hat{H}_k v$
      Use CG with $n_{cg}$ iterations to obtain $x_k \approx \hat{H}_k^{-1} \hat{g}_k$
      Estimate proposed step $\Delta_k \approx \sqrt{\dfrac{2\delta}{x_k^\top \hat{H}_k x_k}}\; x_k$
      Perform backtracking line search with exponential decay to obtain the final update $\theta_{k+1} = \theta_k + \alpha^j \Delta_k$
  end for

TRPO = NPG + line search
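The "CG with $n_{cg}$ iterations" step can be sketched as below. This is a plain conjugate-gradient solver that only touches $\hat H_k$ through Hessian-vector products; here the product is faked with an explicit matrix so the snippet runs standalone, whereas in practice it would come from differentiating the sampled KL twice.

```python
import numpy as np

def conjugate_gradient(hvp, g, n_iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - H x (x starts at 0)
    p = r.copy()
    r_dot = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        step = r_dot / (p @ Hp)
        x += step * p
        r -= step * Hp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Usage with a made-up positive-definite H standing in for the sampled KL Hessian.
H = np.array([[3.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -2.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(H @ x, g))    # True: x approximates H^{-1} g
```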
29. Trust Region Policy Optimization

(Same as Algorithm 3 on Slide 28: TNPG step via conjugate gradient, followed by a backtracking line search.)

TRPO = NPG + line search + monotonic improvement theorem!
30. Relating Objectives of Two Policies

Policy objective:
$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

The policy objective can be written in terms of the old one:
$$J(\pi_{\theta'}) - J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi_\theta}(s_t, a_t)\right]$$

Equivalently, for succinctness:
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right]$$
31. Proof of Relative Policy Performance Identity

$$\begin{aligned}
J(\pi') - J(\pi) &= \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right] \\
&= \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\right)\right] \\
&= J(\pi') + \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi}(s_{t+1}) - \sum_{t=0}^{\infty} \gamma^t V^{\pi}(s_t)\right] \\
&= J(\pi') + \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=1}^{\infty} \gamma^{t} V^{\pi}(s_t) - \sum_{t=0}^{\infty} \gamma^t V^{\pi}(s_t)\right] \\
&= J(\pi') - \mathbb{E}_{\tau \sim \pi'}\!\left[V^{\pi}(s_0)\right] \\
&= J(\pi') - J(\pi)
\end{aligned}$$

(Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002)

The initial state distribution is the same for both policies!
32. Relating Objectives of Two Policies

Discounted state visitation distribution:
$$d^{\pi}(s) = (1 - \gamma)\sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$$

$$J(\pi') - J(\pi) = \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right] = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'}\!\left[A^{\pi}(s, a)\right] = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi}\!\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s, a)\right]$$

But how are we supposed to sample states from the policy we are trying to optimize for? Let's use the previous policy to sample them:
$$J(\pi') - J(\pi) \approx \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi},\, a \sim \pi}\!\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s, a)\right] \doteq \mathcal{L}_{\pi}(\pi')$$

A Useful Approximation: what if we just said $d^{\pi'} \approx d^{\pi}$ and didn't worry about it? It turns out this approximation is pretty good when $\pi'$ and $\pi$ are close, and we can bound the approximation error. But why, and how close do they have to be?

Relative policy performance bounds:
$$J(\pi') \ge J(\pi) + \mathcal{L}_{\pi}(\pi') - C\sqrt{\mathbb{E}_{s \sim d^{\pi}}\!\left[D_{\mathrm{KL}}(\pi' \| \pi)[s]\right]}$$

If the policies are close in KL divergence, the approximation is good! (Constrained Policy Optimization, Achiam et al., 2017)
33. Relating Objectives of Two Policies

$$\mathcal{L}_{\pi}(\pi') = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi},\, a \sim \pi}\!\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^{\pi}(s, a)\right] = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \frac{\pi'(a_t \mid s_t)}{\pi(a_t \mid s_t)}\, A^{\pi}(s_t, a_t)\right]$$

This is something we can optimize using trajectories from the old policy! Now we do not have the product of ratios, so the gradient will have much smaller variance. (Yes, but we have approximated; that's why.) What is the gradient?

$$\nabla_\theta \mathcal{L}_{\theta_k}(\theta)\big|_{\theta=\theta_k} = \mathbb{E}_{\tau \sim \pi_{\theta_k}}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)\big|_{\theta=\theta_k}}{\pi_{\theta_k}(a_t \mid s_t)}\, A^{\pi_{\theta_k}}(s_t, a_t)\right] = \mathbb{E}_{\tau \sim \pi_{\theta_k}}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big|_{\theta=\theta_k}\, A^{\pi_{\theta_k}}(s_t, a_t)\right]$$

Compare to importance sampling over whole trajectories:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\!\left[\sum_{t=1}^{T} \left(\prod_{t'=1}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{\text{old}}}(a_{t'} \mid s_{t'})}\right) \hat{A}_t\right]$$
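A minimal PyTorch sketch of this surrogate as a loss (tensor and module names are my own assumptions): only per-step ratios appear, in contrast to the trajectory-long product of the importance-sampling objective above. This is the quantity TRPO maximizes inside its trust region, and the starting point for PPO.

```python
import torch
import torch.nn.functional as F

def surrogate_loss(policy, states, actions, advantages, old_log_probs):
    """Negative surrogate; old_log_probs are log pi_old(a|s) stored at collection time."""
    new_log_probs = F.log_softmax(policy(states), dim=-1) \
                        .gather(1, actions.unsqueeze(1)).squeeze(1)
    ratios = torch.exp(new_log_probs - old_log_probs.detach())
    return -(ratios * advantages.detach()).mean()
```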
34. Monotonic Improvement Theorem

$$\left|J(\pi') - \left(J(\pi) + \mathcal{L}_{\pi}(\pi')\right)\right| \le C\sqrt{\mathbb{E}_{s \sim d^{\pi}}\!\left[\mathrm{KL}(\pi' \| \pi)[s]\right]}
\;\;\Rightarrow\;\;
J(\pi') - J(\pi) \ge \mathcal{L}_{\pi}(\pi') - C\sqrt{\mathbb{E}_{s \sim d^{\pi}}\!\left[\mathrm{KL}(\pi' \| \pi)[s]\right]}$$

Given policy $\pi$, we want to optimize over policy $\pi'$ to maximize $J(\pi')$.
- If we maximize the RHS, we are guaranteed to maximize the LHS (the RHS is a lower bound).
- We know how to maximize the RHS: both quantities involving $\pi'$ can be estimated with samples from $\pi$.
- But will I have a better policy $\pi'$? Knowing that the lower bound is maximized is not enough; the improvement needs to be positive or equal to zero.
35. Monotonic Improvement Theorem

Proof of improvement guarantee: suppose $\pi_{k+1}$ and $\pi_k$ are related by
$$\pi_{k+1} = \arg\max_{\pi'}\ \mathcal{L}_{\pi_k}(\pi') - C\sqrt{\mathbb{E}_{s \sim d^{\pi_k}}\!\left[D_{\mathrm{KL}}(\pi' \| \pi_k)[s]\right]}.$$

$\pi_k$ is a feasible point, and the objective at $\pi_k$ is equal to 0:
$$\mathcal{L}_{\pi_k}(\pi_k) \propto \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi_k}\!\left[A^{\pi_k}(s, a)\right] = 0, \qquad D_{\mathrm{KL}}(\pi_k \| \pi_k)[s] = 0$$
$$\Rightarrow\ \text{optimal value} \ge 0 \ \Rightarrow\ \text{by the performance bound, } J(\pi_{k+1}) - J(\pi_k) \ge 0.$$
36. Approximate Monotonic Improvement

The theory is very conservative (high value of $C$), so we will use the KL distance between $\pi'$ and $\pi$ as a constraint (trust region) as opposed to a penalty.

KL-penalized update:
$$\pi_{k+1} = \arg\max_{\pi'}\ \mathcal{L}_{\pi_k}(\pi') - C\sqrt{\mathbb{E}_{s \sim d^{\pi_k}}\!\left[D_{\mathrm{KL}}(\pi' \| \pi_k)[s]\right]} \tag{3}$$

Problem: the $C$ provided by the theory is quite high when $\gamma$ is near 1, so the steps from (3) are too small.

Solution: instead of a KL penalty, use a KL constraint (called a trust region). We can control the worst-case error through the constraint's upper limit:
$$\pi_{k+1} = \arg\max_{\pi'}\ \mathcal{L}_{\pi_k}(\pi') \quad \text{s.t.} \quad \mathbb{E}_{s \sim d^{\pi_k}}\!\left[D_{\mathrm{KL}}(\pi' \| \pi_k)[s]\right] \le \delta \tag{4}$$
37. Trust Region Policy Optimization

(Same as Algorithm 3 on Slide 28: TNPG step via conjugate gradient, followed by a backtracking line search.)

TRPO = NPG + line search + monotonic improvement theorem!
38. Proximal Policy Optimization

Can I achieve similar performance without second-order information (no Fisher matrix)?

Proximal Policy Optimization (PPO) is a family of methods that approximately enforce the KL constraint without computing natural gradients. Two variants:

Adaptive KL Penalty
- The policy update solves an unconstrained optimization problem:
  $$\theta_{k+1} = \arg\max_\theta\ L_{\theta_k}(\theta) - \beta_k\, \bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_k)$$
- The penalty coefficient $\beta_k$ changes between iterations to approximately enforce the KL-divergence constraint.

Clipped Objective
- New objective function: let $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_k}(a_t \mid s_t)$. Then
  $$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{\tau \sim \pi_k}\!\left[\sum_{t=0}^{T} \min\!\left(r_t(\theta)\hat{A}_t^{\pi_k},\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t^{\pi_k}\right)\right]$$
  where $\epsilon$ is a hyperparameter (e.g., $\epsilon = 0.2$).
- The policy update is $\theta_{k+1} = \arg\max_\theta L^{CLIP}_{\theta_k}(\theta)$.
39. Proximal Policy Optimization with Adaptive KL Penalty

Algorithm 4: PPO with Adaptive KL Penalty
  Input: initial policy parameters $\theta_0$, initial KL penalty $\beta_0$, target KL divergence $\delta$
  for $k = 0, 1, 2, \dots$ do
      Collect set of partial trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
      Estimate advantages $\hat{A}_t^{\pi_k}$ using any advantage estimation algorithm
      Compute the policy update $\theta_{k+1} = \arg\max_\theta\ L_{\theta_k}(\theta) - \beta_k\, \bar{D}_{\mathrm{KL}}(\theta \,\|\, \theta_k)$ by taking $K$ steps of minibatch SGD (via Adam)
      if $\bar{D}_{\mathrm{KL}}(\theta_{k+1} \,\|\, \theta_k) \ge 1.5\,\delta$ then $\beta_{k+1} = 2\beta_k$
      else if $\bar{D}_{\mathrm{KL}}(\theta_{k+1} \,\|\, \theta_k) \le \delta/1.5$ then $\beta_{k+1} = \beta_k/2$
      end if
  end for

- The initial KL penalty is not that important; it adapts quickly.
- Some iterations may violate the KL constraint, but most don't.

PPO, Adaptive KL Penalty: don't use the expensive second-order approximation of the KL; use standard gradient descent.
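The adaptive-penalty rule in Algorithm 4 is easy to state in code; here is a sketch with placeholder names of my own.

```python
def update_kl_penalty(beta, measured_kl, target_kl):
    """Double beta if the realized KL overshot the target, halve it if it undershot."""
    if measured_kl >= 1.5 * target_kl:
        return 2.0 * beta
    if measured_kl <= target_kl / 1.5:
        return beta / 2.0
    return beta

# Example: the policy moved too far this iteration, so the penalty grows.
print(update_kl_penalty(beta=1.0, measured_kl=0.03, target_kl=0.01))   # 2.0
```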
40. Proximal Policy Optimization: Clipped Objective

Recall the surrogate objective
$$L^{IS}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right]. \tag{1}$$

Form a lower bound via clipped importance ratios:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \tag{2}$$

This forms a pessimistic bound on the objective and can be optimized using SGD.
41. Proximal Policy Optimization with Clipped Objective

But how does clipping keep the policy close? By making the objective as pessimistic as possible about performance far away from $\theta_k$.

Figure: Various objectives as a function of the interpolation factor $\alpha$ between $\theta_{k+1}$ and $\theta_k$ after one update of PPO-Clip (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
42. Proximal Policy Optimization with Clipped Objective

Algorithm 5: PPO with Clipped Objective
  Input: initial policy parameters $\theta_0$, clipping threshold $\epsilon$
  for $k = 0, 1, 2, \dots$ do
      Collect set of partial trajectories $\mathcal{D}_k$ on policy $\pi_k = \pi(\theta_k)$
      Estimate advantages $\hat{A}_t^{\pi_k}$ using any advantage estimation algorithm
      Compute the policy update $\theta_{k+1} = \arg\max_\theta L^{CLIP}_{\theta_k}(\theta)$ by taking $K$ steps of minibatch SGD (via Adam), where
      $$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{\tau \sim \pi_k}\!\left[\sum_{t=0}^{T} \min\!\left(r_t(\theta)\hat{A}_t^{\pi_k},\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t^{\pi_k}\right)\right]$$
  end for

- Clipping prevents the policy from having an incentive to go far away from $\theta_k$.
- Clipping seems to work at least as well as PPO with a KL penalty, but is simpler to implement.
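A hedged PyTorch sketch of the clipped objective maximized in Algorithm 5, written as a loss to minimize; the tensor names are my own assumptions, and the log-probabilities of the old policy are assumed to have been stored when the data was collected.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative of E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]."""
    ratios = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Because the minimum is taken elementwise, a step that pushes a ratio outside $[1-\epsilon, 1+\epsilon]$ in the direction that would increase the objective simply stops contributing gradient, which is what removes the incentive to move far from $\theta_k$.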
43. Empirical Performance of PPO

Figure: Performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
44. Summary

- Gradient descent in parameter space vs. distribution space
- Natural gradients: we need to keep track of how the KL changes from iteration to iteration
- Natural policy gradients
- The clipped objective (PPO) works well

Further Reading
- S. Kakade. "A Natural Policy Gradient." NIPS, 2001.
- S. Kakade and J. Langford. "Approximately Optimal Approximate Reinforcement Learning." ICML, 2002.
- J. Peters and S. Schaal. "Natural Actor-Critic." Neurocomputing, 2008.
- J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization." ICML, 2015.
- Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." ICML, 2016.
- J. Martens and I. Sutskever. "Training Deep and Recurrent Networks with Hessian-Free Optimization." Springer, 2012.
- Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. "Sample Efficient Actor-Critic with Experience Replay." 2016.
- Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation." 2017.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms." 2017.
- J. Achiam, D. Held, A. Tamar, and P. Abbeel. "Constrained Policy Optimization." 2017.
- blog.openai.com: posts on baselines releases