Proximal Policy Optimization Algorithms,
Schulman et al, 2017
옥찬호
utilForever@gmail.com
Introduction
• In recent years, several different approaches have been
proposed for reinforcement learning with neural network
function approximators. The leading contenders are deep Q-
learning, “vanilla” policy gradient methods, and trust region /
natural policy gradient methods.
• However, there is room for improvement in developing a
method that is scalable (to large models and parallel
implementations), data efficient, and robust (i.e., successful
on a variety of problems without hyperparameter tuning).
• DQN
• fails on many simple problems and is poorly understood
• A3C – “Vanilla” policy gradient methods
• have poor data efficiency and robustness
• TRPO
• relatively complicated
• is not compatible with architectures that include noise (such as dropout) or
parameter sharing (between the policy and value function, or with auxiliary
tasks)
• We propose a novel objective with clipped probability ratios,
which forms a pessimistic estimate (i.e., lower bound) of the
performance of the policy.
• This paper seeks to improve the current state of affairs by
introducing an algorithm that attains the data efficiency and
reliable performance of TRPO, while using only first-order
optimization.
• To optimize policies, we alternate between sampling data from
the policy and performing several epochs of optimization on
the sampled data.
• Our experiments compare the performance of several different
versions of the surrogate objective, and find that the version
with the clipped probability ratios performs best.
• We also compare PPO to several previous algorithms from the
literature.
• On continuous control tasks, it performs better than the algorithms we
compare against.
• On Atari, it performs significantly better (in terms of sample complexity) than
A2C and similarly to ACER, though it is much simpler.
Background: Policy Optimization
• Policy gradient methods work by computing an estimator of
the policy gradient and plugging it into a stochastic gradient
ascent algorithm. The most commonly used gradient estimator
has the form
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
• where $\pi_\theta$ is a stochastic policy and $\hat{A}_t$ is an estimator of the advantage
function at timestep $t$. Here, the expectation $\hat{\mathbb{E}}_t[\ldots]$ indicates the empirical
average over a finite batch of samples, in an algorithm that alternates
between sampling and optimization.
• Implementations that use automatic differentiation software
work by constructing an objective function whose gradient is
the policy gradient estimator; the estimator ො𝑔 is obtained by
differentiating the objective
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
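
To make the connection to automatic differentiation concrete, here is a minimal sketch (not the paper's code) of this objective in PyTorch, assuming a discrete-action policy network `policy_net` and precomputed advantage estimates; minimizing the returned value with any optimizer performs gradient ascent on $L^{PG}$.

```python
import torch
import torch.nn as nn

def pg_objective_loss(policy_net: nn.Module, states, actions, advantages):
    """Negative of L^PG; its gradient is -g_hat, so minimizing it ascends L^PG."""
    logits = policy_net(states)                       # (batch, num_actions); discrete actions assumed
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                # log pi_theta(a_t | s_t)
    return -(log_probs * advantages).mean()           # empirical average over the batch
```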
• While it is appealing to perform multiple steps of optimization
on this loss $L^{PG}$ using the same trajectory, doing so is not
well-justified, and empirically it often leads to destructively large
policy updates.
• In TRPO, an objective function (the “surrogate” objective) is
maximized subject to a constraint on the size of the policy
update. Specifically,
$$\underset{\theta}{\text{maximize}}\quad \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right]$$
$$\text{subject to}\quad \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$
• Here, $\theta_{\text{old}}$ is the vector of policy parameters before the update.
• This problem can efficiently be approximately solved using the
conjugate gradient algorithm, after making a linear
approximation to the objective and a quadratic approximation
to the constraint.
• The theory justifying TRPO actually suggests using a penalty
instead of a constraint, i.e., solving the unconstrained
optimization problem for some coefficient $\beta$:
$$\underset{\theta}{\text{maximize}}\quad \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$$
• This follows from the fact that a certain surrogate objective
(which computes the max KL over states instead of the mean)
forms a lower bound (i.e., a pessimistic bound) on the
performance of the policy $\pi$.
• TRPO uses a hard constraint rather than a penalty because it is
hard to choose a single value of $\beta$ that performs well across
different problems, or even within a single problem, where the
characteristics change over the course of learning.
Clipped Surrogate Objective
• Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$,
so $r_t(\theta_{\text{old}}) = 1$. TRPO maximizes a “surrogate” objective
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right]$$
• The superscript CPI refers to conservative policy iteration,
where this objective was proposed.
• Without a constraint, maximization of $L^{CPI}$ would lead to an
excessively large policy update; hence, we now consider how
to modify the objective, to penalize changes to the policy that
move $r_t(\theta)$ away from 1.
• The main objective we propose is the following:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\, \hat{A}_t\right)\right]$$
• where $\epsilon$ is a hyperparameter, say, $\epsilon = 0.2$.
• The motivation for this objective is as follows.
• The first term inside the min is $L^{CPI}(\theta)$.
• The second term, $\mathrm{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\, \hat{A}_t$, modifies the surrogate objective
by clipping the probability ratio, which removes the incentive for moving
$r_t(\theta)$ outside of the interval $[1-\epsilon,\ 1+\epsilon]$.
• Finally, we take the minimum of the clipped and unclipped objective, so the
final objective is a lower bound (i.e., a pessimistic bound) on the unclipped
objective (a minimal implementation sketch follows).
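
The clipped term is straightforward to implement. Below is a minimal sketch, assuming the log-probabilities of the sampled actions under the old policy were stored at collection time so that the ratio $r_t(\theta)$ can be formed without re-evaluating $\pi_{\theta_{\text{old}}}$; the tensor names are illustrative.

```python
import torch

def clipped_surrogate(log_probs_new, log_probs_old, advantages, eps=0.2):
    """L^CLIP averaged over a batch; all inputs are 1-D tensors aligned by timestep."""
    ratio = torch.exp(log_probs_new - log_probs_old)                  # r_t(theta)
    unclipped = ratio * advantages                                    # the L^CPI term
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages   # clipped term
    return torch.min(unclipped, clipped).mean()                       # pessimistic (lower) bound
```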
Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function
$L^{CLIP}$ as a function of the probability ratio $r$, for positive advantages (left) and negative
advantages (right). The red circle on each plot shows the starting point for the
optimization, i.e., $r = 1$. Note that $L^{CLIP}$ sums many of these terms.
Figure 2: Surrogate objectives, as we interpolate between the initial policy parameter
$\theta_{\text{old}}$ and the updated policy parameter, which we compute after one iteration of PPO.
The updated policy has a KL divergence of about 0.02 from the initial policy, and this is
the point at which $L^{CLIP}$ is maximal.
Adaptive KL Penalty Coefficient
• Another approach, which can be used as an alternative to the
clipped surrogate objective, or in addition to it, is to use a
penalty on KL divergence, and to adapt the penalty coefficient
so that we achieve some target value of the KL divergence
$d_{\text{targ}}$ each policy update.
• In our experiments, we found that the KL penalty performed worse than the
clipped surrogate objective; however, we've included it here because it's an
important baseline.
• In the simplest instantiation of this algorithm, we perform the
following steps in each policy update:
• Using several epochs of minibatch SGD, optimize the KL-penalized objective
$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$$
• Compute $d = \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$
• If $d < d_{\text{targ}} / 1.5$, set $\beta \leftarrow \beta / 2$
• If $d > d_{\text{targ}} \times 1.5$, set $\beta \leftarrow \beta \times 2$
• The updated $\beta$ is used for the next policy update (a small sketch of this rule follows).
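
The $\beta$-update rule reduces to a few lines. The sketch below assumes `kl` is the measured value of $d$ after the epochs of SGD; the function name is illustrative rather than taken from any official implementation.

```python
def update_kl_coefficient(beta: float, kl: float, d_targ: float) -> float:
    """Adapt the KL-penalty coefficient so the measured KL tracks d_targ."""
    if kl < d_targ / 1.5:
        beta /= 2.0   # policy changed too little: weaken the penalty
    elif kl > d_targ * 1.5:
        beta *= 2.0   # policy changed too much: strengthen the penalty
    return beta
```

The factors 1.5 and 2 are the heuristic values given above; the paper notes the algorithm is not very sensitive to them.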
Algorithm
• Most techniques for computing variance-reduced advantage-function
estimators make use of a learned state-value function $V(s)$:
• Generalized advantage estimation
• The finite-horizon estimators
• If using a neural network architecture that shares parameters
between the policy and value function, we must use a loss function
that combines the policy surrogate and a value function error term.
• This objective can further be augmented by adding an entropy bonus
to ensure sufficient exploration, as suggested in past work (A3C).
• Combining these terms, we obtain the following objective,
which is (approximately) maximized each iteration:
$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$$
• where $c_1, c_2$ are coefficients, $S$ denotes an entropy bonus,
and $L_t^{VF}$ is a squared-error loss $\left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2$ (a sketch of this combined loss follows).
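
A minimal sketch of this combined objective follows, written as a loss to minimize (i.e., the negative of $L_t^{CLIP+VF+S}$). The coefficient values and the assumption that the actor-critic network exposes an action distribution `dist`, value predictions `values`, and value targets `returns` are illustrative choices, not taken from the paper's code.

```python
import torch

def ppo_combined_loss(dist, values, log_probs_old, actions, advantages, returns,
                      eps=0.2, c1=0.5, c2=0.01):
    """Negative of L_t^{CLIP+VF+S}, averaged over a batch of timesteps."""
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - log_probs_old)
    clip_obj = torch.min(ratio * advantages,
                         torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    value_loss = ((values - returns) ** 2).mean()   # L_t^{VF}: squared error against V_t^targ
    entropy = dist.entropy().mean()                 # S[pi_theta]: entropy bonus
    return -(clip_obj - c1 * value_loss + c2 * entropy)
```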
• One style of policy gradient implementation, popularized in
A3C and well-suited for use with recurrent neural networks,
runs the policy for $T$ timesteps (where $T$ is much less than the
episode length), and uses the collected samples for an update.
• This style requires an advantage estimator that does not look
beyond timestep $T$. The estimator used by A3C is
$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$$
• where $t$ specifies the time index in $[0, T]$,
within a given length-$T$ trajectory segment.
• Generalizing this choice, we can use a truncated version of
generalized advantage estimation, which reduces to the previous
equation when $\lambda = 1$:
$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\,\delta_{T-1}, \qquad \text{where } \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
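
The truncated estimator can be computed with a single backward sweep over the segment using the recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. The sketch below assumes `rewards` has length $T$ and `values` holds $V(s_0), \dots, V(s_T)$ (length $T{+}1$, the last entry serving only as a bootstrap); episode-termination masking is omitted for brevity, and the $\gamma$, $\lambda$ defaults are just common choices.

```python
import numpy as np

def truncated_gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over one length-T segment."""
    T = len(rewards)                     # len(values) == T + 1
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages
```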
• A proximal policy optimization (PPO) algorithm that uses
fixed-length trajectory segments works as follows.
• Each iteration, each of $N$ (parallel) actors collects $T$ timesteps of data.
• Then we construct the surrogate loss on these $NT$ timesteps of data, and
optimize it with minibatch SGD (or, usually for better performance, Adam)
for $K$ epochs (a high-level sketch of this loop follows).
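
As an outline, the sampling/optimization loop described above might look like the sketch below; `rollout_fn`, `minibatch_fn`, and `loss_fn` are hypothetical placeholders for the environment plumbing, the minibatch iterator, and a surrogate loss such as the combined PPO loss sketched earlier, and the learning rate is just an illustrative default.

```python
import torch

def ppo_train(actor_critic, rollout_fn, minibatch_fn, loss_fn,
              iterations, k_epochs, lr=3e-4):
    """Alternate between sampling with the current policy and K epochs of minibatch updates."""
    optimizer = torch.optim.Adam(actor_critic.parameters(), lr=lr)
    for _ in range(iterations):
        batch = rollout_fn()                 # N actors x T timesteps, gathered with the current policy
        for _ in range(k_epochs):            # several epochs over the same N*T samples
            for minibatch in minibatch_fn(batch):
                loss = loss_fn(actor_critic, minibatch)   # e.g. the combined loss above
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```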
Experiments
• First, we compare several different surrogate objectives under
different hyperparameters. Here, we compare the surrogate
objective $L^{CLIP}$ to several natural variations and ablated
versions.
• No clipping or penalty: $L_t(\theta) = r_t(\theta)\, \hat{A}_t$
• Clipping: $L_t(\theta) = \min\left(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\, \hat{A}_t\right)$
• KL penalty (fixed or adaptive): $L_t(\theta) = r_t(\theta)\, \hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{\text{old}}},\ \pi_\theta\right]$
Conclusion
• We have introduced proximal policy optimization, a family of
policy optimization methods that use multiple epochs of
stochastic gradient ascent to perform each policy update.
• These methods have the stability and reliability of trust-region
methods but are much simpler to implement, requiring only a
few lines of code change to a vanilla policy gradient (A3C)
implementation; they are applicable in more general settings (for
example, when using a joint architecture for the policy and
value function), and have better overall performance.
References
• https://reinforcement-learning-kr.github.io/2018/06/22/7_ppo/
• https://lynnn.tistory.com/73
• https://jay.tech.blog/2018/10/09/trpo%EC%99%80-ppo/
• https://talkingaboutme.tistory.com/entry/RL-Policy-Gradient-Algorithms
Thank you!