Trust Region Policy Optimization, Schulman et al, 2015.
옥찬호
utilForever@gmail.com
Introduction
• Most algorithms for policy optimization can be classified into three broad categories:
• (1) Policy iteration methods (Alternate between estimating the value
function under the current policy and improving the policy)
• (2) Policy gradient methods (Use an estimator of the gradient of the
expected return)
• (3) Derivative-free optimization methods (Treat the return as a black box
function to be optimized in terms of the policy parameters) such as
• The Cross-Entropy Method (CEM)
• The Covariance Matrix Adaptation (CMA)
Introduction
• General derivative-free stochastic optimization methods
such as CEM and CMA are preferred on many problems,
because they achieve good results while being simple to understand and implement.
• For example, while Tetris is a classic benchmark problem for
approximate dynamic programming (ADP) methods, stochastic
optimization methods are difficult to beat on this task.
Trust Region Policy Optimization, Schulman et al, 2015.
Introduction
• For continuous control problems, methods like CMA have been
successful at learning control policies for challenging tasks like
locomotion when provided with hand-engineered policy
classes with low-dimensional parameterizations.
• The inability of ADP and gradient-based methods to
consistently beat gradient-free random search is unsatisfying,
since gradient-based optimization algorithms enjoy much better
sample complexity guarantees than gradient-free methods.
Trust Region Policy Optimization, Schulman et al, 2015.
Introduction
• Continuous gradient-based optimization has been very
successful at learning function approximators for supervised
learning tasks with huge numbers of parameters, and
extending their success to reinforcement learning would allow
for efficient training of complex and powerful policies.
Trust Region Policy Optimization, Schulman et al, 2015.
Preliminaries
• Markov Decision Process (MDP): (𝒮, 𝒜, 𝑃, 𝑟, 𝜌₀, 𝛾)
• 𝒮 is a finite set of states
• 𝒜 is a finite set of actions
• 𝑃 : 𝒮 × 𝒜 × 𝒮 → ℝ is the transition probability distribution
• 𝑟 : 𝒮 → ℝ is the reward function
• 𝜌₀ : 𝒮 → ℝ is the distribution of the initial state 𝑠₀
• 𝛾 ∈ (0, 1) is the discount factor
• Policy
• 𝜋 : 𝒮 × 𝒜 → [0, 1]
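As a concrete sketch of these objects (a hypothetical two-state MDP with illustrative numbers, not taken from the paper), the tuple and a stochastic policy can be written as plain Python dictionaries, with a rollout that accumulates the discounted reward Σₜ 𝛾^𝑡 𝑟(𝑠ₜ):

```python
import random

# Hypothetical 2-state, 2-action MDP; all names and numbers are illustrative.
S, A = [0, 1], [0, 1]
gamma = 0.9
r = {0: 0.0, 1: 1.0}                            # r : S -> R
P = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],    # P(s' | s, a) as rows over S
     (1, 0): [0.8, 0.2], (1, 1): [0.1, 0.9]}
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}             # pi : S x A -> [0, 1]

def rollout_return(s0, horizon=100, seed=0):
    """Sample one trajectory from pi and accumulate the discounted reward."""
    rng = random.Random(seed)
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        total += discount * r[s]
        a = rng.choices(A, weights=pi[s])[0]
        s = rng.choices(S, weights=P[(s, a)])[0]
        discount *= gamma
    return total
```

Since rewards lie in [0, 1], any return is bounded by 1/(1 − 𝛾) = 10 here.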
Trust Region Policy Optimization, Schulman et al, 2015.
Preliminaries
How do we measure a policy's performance?
Preliminaries
• Expected cumulative discounted reward
• Action value, Value/Advantage functions
Preliminaries
• Lemma 1 (Kakade & Langford, 2002):
𝜂(𝜋̃) = 𝜂(𝜋) + 𝔼_{𝑠₀,𝑎₀,…∼𝜋̃}[Σ_{𝑡=0}^∞ 𝛾^𝑡 𝐴_𝜋(𝑠_𝑡, 𝑎_𝑡)]
• Proof idea: write 𝐴_𝜋(𝑠, 𝑎) = 𝔼_{𝑠′}[𝑟(𝑠) + 𝛾𝑉_𝜋(𝑠′) − 𝑉_𝜋(𝑠)] and telescope
the discounted 𝑉_𝜋 terms along trajectories sampled from 𝜋̃.
Preliminaries
• Discounted (unnormalized) visitation frequencies:
𝜌_𝜋(𝑠) = 𝑃(𝑠₀ = 𝑠) + 𝛾𝑃(𝑠₁ = 𝑠) + 𝛾²𝑃(𝑠₂ = 𝑠) + ⋯
• Rewriting Lemma 1 as a sum over states and actions:
𝜂(𝜋̃) = 𝜂(𝜋) + Σ_𝑠 𝜌_𝜋̃(𝑠) Σ_𝑎 𝜋̃(𝑎|𝑠) 𝐴_𝜋(𝑠, 𝑎)
Preliminaries
• This equation implies that any policy update 𝜋 → 𝜋̃ that has
a nonnegative expected advantage at every state 𝑠,
i.e., Σ_𝑎 𝜋̃(𝑎|𝑠) 𝐴_𝜋(𝑠, 𝑎) ≥ 0, is guaranteed to increase
the policy performance 𝜂, or leave it constant in the case that
the expected advantage is zero everywhere.
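The per-state condition can be checked numerically. A toy sketch (the tabular advantages and policies below are made-up numbers, not from the paper): the proposed 𝜋̃ shifts probability toward positive-advantage actions, so its expected advantage is nonnegative at every state.

```python
# Toy tabular advantage A_pi(s, a) and a proposed policy pi_tilde(a | s).
A_pi = {(0, 0): -0.2, (0, 1): 0.4,
        (1, 0): 0.1,  (1, 1): -0.3}
pi_tilde = {0: [0.2, 0.8],   # most mass on the positive-advantage action
            1: [0.9, 0.1]}

def expected_advantage(s):
    """sum_a pi_tilde(a|s) * A_pi(s, a)"""
    return sum(p * A_pi[(s, a)] for a, p in enumerate(pi_tilde[s]))

# Nonnegative at every state => the update cannot decrease eta.
improves = all(expected_advantage(s) >= 0 for s in pi_tilde)
```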
Preliminaries
• However, in the approximate setting, it will typically be
unavoidable, due to estimation and approximation error, that
there will be some states 𝑠 for which the expected advantage
is negative, that is, Σ_𝑎 𝜋̃(𝑎|𝑠) 𝐴_𝜋(𝑠, 𝑎) < 0.
• The complex dependency of 𝜌_𝜋̃(𝑠) on 𝜋̃ makes the equation
difficult to optimize directly.
Preliminaries
• Instead, we introduce the following local approximation to 𝜂:
𝐿_𝜋(𝜋̃) = 𝜂(𝜋) + Σ_𝑠 𝜌_𝜋(𝑠) Σ_𝑎 𝜋̃(𝑎|𝑠) 𝐴_𝜋(𝑠, 𝑎)
• Note that 𝐿_𝜋 uses the visitation frequency 𝜌_𝜋 rather than 𝜌_𝜋̃,
ignoring changes in state visitation density due to changes in
the policy.
Preliminaries
• However, if we have a parameterized policy 𝜋_𝜃, where
𝜋_𝜃(𝑎|𝑠) is a differentiable function of the parameter vector 𝜃,
then 𝐿_𝜋 matches 𝜂 to first order:
𝐿_{𝜋_𝜃₀}(𝜋_𝜃₀) = 𝜂(𝜋_𝜃₀),  ∇_𝜃 𝐿_{𝜋_𝜃₀}(𝜋_𝜃)|_{𝜃=𝜃₀} = ∇_𝜃 𝜂(𝜋_𝜃)|_{𝜃=𝜃₀}
• This equation implies that a sufficiently small step 𝜋_𝜃₀ → 𝜋̃ that
improves 𝐿_{𝜋_𝜃old} will also improve 𝜂, but does not give us any
guidance on how big of a step to take.
Preliminaries
• Conservative Policy Iteration
• Provides an explicit lower bound on the improvement of 𝜂.
• Let 𝜋_old denote the current policy, and let 𝜋′ = argmax_{𝜋′} 𝐿_{𝜋_old}(𝜋′).
The new policy 𝜋_new was defined to be the following mixture:
𝜋_new(𝑎|𝑠) = (1 − 𝛼) 𝜋_old(𝑎|𝑠) + 𝛼 𝜋′(𝑎|𝑠)
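The mixture update is a one-liner per state. A minimal sketch (toy distributions; the greedy 𝜋′ here is a stand-in, since computing the true argmax is the hard part):

```python
def mixture_policy(pi_old, pi_prime, alpha):
    """pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s)."""
    return {s: [(1 - alpha) * po + alpha * pp
                for po, pp in zip(pi_old[s], pi_prime[s])]
            for s in pi_old}

pi_old   = {0: [0.7, 0.3]}
pi_prime = {0: [0.0, 1.0]}   # stand-in for argmax_{pi'} L_{pi_old}(pi')
pi_new = mixture_policy(pi_old, pi_prime, alpha=0.1)
```

Because each 𝜋_new(·|𝑠) is a convex combination of two distributions, it remains a valid distribution for any 𝛼 ∈ [0, 1].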
Preliminaries
• Conservative Policy Iteration
• Kakade & Langford derived the following lower bound:
𝜂(𝜋_new) ≥ 𝐿_{𝜋_old}(𝜋_new) − (2𝜖𝛾 / (1 − 𝛾)²) 𝛼²,
where 𝜖 = max_𝑠 |𝔼_{𝑎∼𝜋′(𝑎|𝑠)}[𝐴_𝜋(𝑠, 𝑎)]|
• However, this policy class is unwieldy and restrictive in practice,
and it is desirable for a practical policy update scheme to be applicable to
all general stochastic policy classes.
Monotonic Improvement Guarantee
• The policy improvement bound in the preceding equation can be
extended to general stochastic policies by
• (1) Replacing 𝛼 with a distance measure between 𝜋 and 𝜋̃
• (2) Changing the constant 𝜖 appropriately
Monotonic Improvement Guarantee
• The particular distance measure we use is
the total variation divergence, defined for
discrete probability distributions 𝑝, 𝑞 by:
𝐷_TV(𝑝 ∥ 𝑞) = (1/2) Σ_𝑖 |𝑝_𝑖 − 𝑞_𝑖|
• Define 𝐷_TV^max(𝜋, 𝜋̃) as:
𝐷_TV^max(𝜋, 𝜋̃) = max_𝑠 𝐷_TV(𝜋(·|𝑠) ∥ 𝜋̃(·|𝑠))
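Both definitions translate directly into code. A minimal sketch over two toy tabular policies (illustrative numbers):

```python
def tv(p, q):
    """Total variation: D_TV(p || q) = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_max(pi, pi_tilde):
    """D_TV^max: the worst per-state total variation over all states."""
    return max(tv(pi[s], pi_tilde[s]) for s in pi)

pi       = {0: [0.5, 0.5], 1: [0.9, 0.1]}
pi_tilde = {0: [0.6, 0.4], 1: [0.9, 0.1]}
```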
Monotonic Improvement Guarantee
• Theorem 1. Let 𝛼 = 𝐷_TV^max(𝜋_old, 𝜋_new).
Then the following bound holds:
𝜂(𝜋_new) ≥ 𝐿_{𝜋_old}(𝜋_new) − (4𝜖𝛾 / (1 − 𝛾)²) 𝛼²,
where 𝜖 = max_{𝑠,𝑎} |𝐴_𝜋(𝑠, 𝑎)|
Monotonic Improvement Guarantee
• We note the following relationship between the total variation
divergence and the KL divergence:
𝐷_TV(𝑝 ∥ 𝑞)² ≤ 𝐷_KL(𝑝 ∥ 𝑞)
• Let 𝐷_KL^max(𝜋, 𝜋̃) = max_𝑠 𝐷_KL(𝜋(·|𝑠) ∥ 𝜋̃(·|𝑠)).
The following bound then follows directly from Theorem 1:
𝜂(𝜋̃) ≥ 𝐿_𝜋(𝜋̃) − 𝐶 𝐷_KL^max(𝜋, 𝜋̃),  where 𝐶 = 4𝜖𝛾 / (1 − 𝛾)²
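The TV–KL inequality used here is easy to check numerically on random distributions (a sanity-check sketch, not a proof):

```python
import math
import random

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def random_dist(rng, n=4):
    """A random strictly-positive categorical distribution."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
holds = all(tv(p, q) ** 2 <= kl(p, q) + 1e-12
            for p, q in ((random_dist(rng), random_dist(rng))
                         for _ in range(1000)))
```

In fact Pinsker's inequality gives the tighter 𝐷_TV² ≤ 𝐷_KL/2, so the bound above holds with room to spare.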
Monotonic Improvement Guarantee
• Algorithm 1
Monotonic Improvement Guarantee
• Algorithm 1 is guaranteed to generate a monotonically
improving sequence of policies 𝜂(𝜋₀) ≤ 𝜂(𝜋₁) ≤ 𝜂(𝜋₂) ≤ ⋯
• Let 𝑀_𝑖(𝜋) = 𝐿_{𝜋_𝑖}(𝜋) − 𝐶 𝐷_KL^max(𝜋_𝑖, 𝜋). Then:
𝜂(𝜋_{𝑖+1}) ≥ 𝑀_𝑖(𝜋_{𝑖+1}) and 𝜂(𝜋_𝑖) = 𝑀_𝑖(𝜋_𝑖),
so 𝜂(𝜋_{𝑖+1}) − 𝜂(𝜋_𝑖) ≥ 𝑀_𝑖(𝜋_{𝑖+1}) − 𝑀_𝑖(𝜋_𝑖)
• Thus, by maximizing 𝑀_𝑖 at each iteration, we guarantee that
the true objective 𝜂 is non-decreasing.
• This algorithm is a type of minorization-maximization (MM) algorithm.
Optimization of Parameterized Policies
• Overload our previous notation to use functions of 𝜃:
• 𝜂(𝜃) ≔ 𝜂(𝜋_𝜃)
• 𝐿_𝜃(𝜃̃) ≔ 𝐿_{𝜋_𝜃}(𝜋_𝜃̃)
• 𝐷_KL(𝜃 ∥ 𝜃̃) ≔ 𝐷_KL(𝜋_𝜃 ∥ 𝜋_𝜃̃)
• Use 𝜃_old to denote the previous policy parameters.
Optimization of Parameterized Policies
• The preceding section showed that:
𝜂(𝜃) ≥ 𝐿_{𝜃_old}(𝜃) − 𝐶 𝐷_KL^max(𝜃_old, 𝜃)
• Thus, by performing the following maximization,
we are guaranteed to improve the true objective 𝜂:
maximize_𝜃 [𝐿_{𝜃_old}(𝜃) − 𝐶 𝐷_KL^max(𝜃_old, 𝜃)]
Optimization of Parameterized Policies
• In practice, if we used the penalty coefficient 𝐶 recommended
by the theory above, the step sizes would be very small.
• One way to take larger steps in a robust way is to use a
constraint on the KL divergence between the new policy and
the old policy, i.e., a trust region constraint:
maximize_𝜃 𝐿_{𝜃_old}(𝜃)  subject to  𝐷_KL^max(𝜃_old, 𝜃) ≤ 𝛿
Optimization of Parameterized Policies
• This problem imposes a constraint that the KL divergence is
bounded at every point in the state space. While it is motivated
by the theory, this problem is impractical to solve due to the
large number of constraints.
• Instead, we can use a heuristic approximation which considers
the average KL divergence:
𝐷̄_KL^𝜌(𝜃₁, 𝜃₂) ≔ 𝔼_{𝑠∼𝜌}[𝐷_KL(𝜋_𝜃₁(·|𝑠) ∥ 𝜋_𝜃₂(·|𝑠))]
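In practice the expectation over 𝜌 becomes a sample average over visited states. A minimal sketch with toy tabular policies and made-up sampled states (the value of 𝛿 below is illustrative):

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def mean_kl(states, pi_old, pi_new):
    """Approximate E_{s ~ rho}[D_KL(pi_old(.|s) || pi_new(.|s))]
    with a sample average over visited states."""
    return sum(kl(pi_old[s], pi_new[s]) for s in states) / len(states)

pi_old = {0: [0.5, 0.5], 1: [0.9, 0.1]}
pi_new = {0: [0.6, 0.4], 1: [0.9, 0.1]}
sampled_states = [0, 1, 0, 0]   # states drawn under the old policy
delta = 0.01
within_trust_region = mean_kl(sampled_states, pi_old, pi_new) <= delta
```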
Optimization of Parameterized Policies
• We therefore propose solving the following optimization
problem to generate a policy update:
maximize_𝜃 𝐿_{𝜃_old}(𝜃)  subject to  𝐷̄_KL^{𝜌_{𝜃_old}}(𝜃_old, 𝜃) ≤ 𝛿
Sample-Based Estimation
• How can the objective and constraint functions be
approximated using Monte Carlo simulation?
• We seek to solve the following optimization problem, obtained
by expanding 𝐿_{𝜃_old} in the previous equation:
maximize_𝜃 Σ_𝑠 𝜌_{𝜃_old}(𝑠) Σ_𝑎 𝜋_𝜃(𝑎|𝑠) 𝐴_{𝜃_old}(𝑠, 𝑎)
subject to  𝐷̄_KL^{𝜌_{𝜃_old}}(𝜃_old, 𝜃) ≤ 𝛿
Sample-Based Estimation
• We replace:
• Σ_𝑠 𝜌_{𝜃_old}(𝑠) […] → (1/(1−𝛾)) 𝔼_{𝑠∼𝜌_{𝜃_old}}[…]
• 𝐴_{𝜃_old} → 𝑄_{𝜃_old}
• Σ_𝑎 𝜋_𝜃(𝑎|𝑠) 𝐴_{𝜃_old}(𝑠, 𝑎) → 𝔼_{𝑎∼𝑞}[(𝜋_𝜃(𝑎|𝑠) / 𝑞(𝑎|𝑠)) 𝐴_{𝜃_old}(𝑠, 𝑎)]
(We replace the sum over the actions by an importance sampling estimator
with sampling distribution 𝑞.)
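The importance-sampling replacement can be verified on a toy single-state problem (illustrative numbers): sampling actions from 𝑞 and reweighting by 𝜋_𝜃/𝑞 recovers the exact sum over actions.

```python
import random

# Toy single-state problem: compare the exact sum with the IS estimate.
pi_theta = [0.3, 0.7]          # pi_theta(a | s)
q        = [0.5, 0.5]          # sampling distribution q(a | s)
A_old    = [-1.0, 2.0]         # A_{theta_old}(s, a)

# Exact: sum_a pi_theta(a|s) * A_old(s, a)
exact = sum(pi_theta[a] * A_old[a] for a in range(2))

# Importance sampling: E_{a~q}[pi_theta(a|s) / q(a|s) * A_old(s, a)]
rng = random.Random(0)
samples = [rng.choices(range(2), weights=q)[0] for _ in range(100000)]
estimate = sum(pi_theta[a] / q[a] * A_old[a] for a in samples) / len(samples)
```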
Sample-Based Estimation
• Now, our optimization problem in the previous equation is exactly
equivalent to the following one, written in terms of
expectations:
maximize_𝜃 𝔼_{𝑠∼𝜌_{𝜃_old}, 𝑎∼𝑞}[(𝜋_𝜃(𝑎|𝑠) / 𝑞(𝑎|𝑠)) 𝑄_{𝜃_old}(𝑠, 𝑎)]
subject to  𝔼_{𝑠∼𝜌_{𝜃_old}}[𝐷_KL(𝜋_{𝜃_old}(·|𝑠) ∥ 𝜋_𝜃(·|𝑠))] ≤ 𝛿
• Constrained Optimization → Sample-Based Estimation
Sample-Based Estimation
• All that remains is to replace the expectations by sample
averages and replace the 𝑄 value by an empirical estimate.
The following sections describe two different schemes for
performing this estimation.
• Single Path: Typically used for policy gradient estimation (Bartlett & Baxter,
2011), and is based on sampling individual trajectories.
• Vine: Constructing a rollout set and then performing multiple actions from
each state in the rollout set.
Practical Algorithm
1. Use the single path or vine procedures to collect a set of state-
action pairs along with Monte Carlo estimates of their 𝑄-values.
2. By averaging over samples, construct the estimated objective and
constraint from the preceding equation.
3. Approximately solve this constrained optimization problem to
update the policy’s parameter vector 𝜃. We use the conjugate
gradient algorithm followed by a line search, which is altogether
only slightly more expensive than computing the gradient itself.
See Appendix C for details.
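The linear solve at the heart of step 3 can be sketched with a textbook conjugate-gradient routine. This pure-Python version uses a dense toy matrix as a stand-in for the Fisher matrix; TRPO itself only needs Fisher-vector products, which is exactly the `matvec` interface below.

```python
def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A,
    given only the matrix-vector product v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x (x starts at 0)
    p = list(b)
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # SPD toy stand-in for the Fisher matrix
b = [1.0, 2.0]                 # toy policy gradient
x = conjugate_gradient(lambda v: [sum(a * vi for a, vi in zip(row, v))
                                  for row in A], b)
```

For an n×n SPD system, CG converges in at most n iterations in exact arithmetic, which is why a small fixed iteration budget suffices in practice.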
Practical Algorithm
• With regard to (3), we construct the Fisher information matrix
(FIM) by analytically computing the Hessian of KL divergence,
rather than using the covariance matrix of the gradients.
• That is, we estimate 𝐴_𝑖𝑗 as ∂²𝐷̄_KL(𝜃_old, 𝜃) / ∂𝜃_𝑖 ∂𝜃_𝑗,
rather than as the empirical average of
(∂ log 𝜋_𝜃(𝑎|𝑠)/∂𝜃_𝑖)(∂ log 𝜋_𝜃(𝑎|𝑠)/∂𝜃_𝑗).
• This analytic estimator has computational benefits in the
large-scale setting, since it removes the need to store a dense
Hessian or all policy gradients from a batch of trajectories.
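The two constructions agree in expectation. A sanity-check sketch for a toy softmax distribution over three actions (not the paper's network): the exact score covariance Σ_𝑎 𝑝(𝑎) ∇log 𝑝(𝑎) ∇log 𝑝(𝑎)ᵀ, which equals diag(𝑝) − 𝑝𝑝ᵀ for softmax logits, matches a finite-difference Hessian of 𝜃′ ↦ 𝐷_KL(𝑝_𝜃 ∥ 𝑝_𝜃′) at 𝜃′ = 𝜃.

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))

theta = [0.2, -0.5, 1.0]
p = softmax(theta)
n = len(theta)

# Score covariance: sum_a p(a) g_a g_a^T with g_a = e_a - p
# (the gradient of log softmax w.r.t. the logits).
cov = [[sum(p[a] * ((i == a) - p[i]) * ((j == a) - p[j]) for a in range(n))
        for j in range(n)] for i in range(n)]

# Hessian of d -> KL(p || softmax(theta + d)) at d = 0,
# by central finite differences.
eps = 1e-3
def kl_at(d):
    return kl(p, softmax([t + di for t, di in zip(theta, d)]))
def perturb(i, si, j, sj):
    d = [0.0] * n
    d[i] += si * eps
    d[j] += sj * eps
    return kl_at(d)
hess = [[(perturb(i, 1, j, 1) - perturb(i, 1, j, -1)
          - perturb(i, -1, j, 1) + perturb(i, -1, j, -1)) / (4 * eps * eps)
         for j in range(n)] for i in range(n)]
```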
Experiments
We designed experiments to investigate the following questions:
1. What are the performance characteristics of the single path and vine
sampling procedures?
2. TRPO is related to prior methods but makes several changes, most notably by
using a fixed KL divergence constraint rather than a fixed penalty coefficient.
How does this affect the performance of the algorithm?
3. Can TRPO be used to solve challenging large-scale problems? How does
TRPO compare with other methods when applied to large-scale problems,
with regard to final performance, computation time, and sample complexity?
Experiments
• Simulated Robotic Locomotion
Experiments
• Playing Games from Images
• To evaluate TRPO on a partially observed task with complex observations, we
trained policies for playing Atari games, using raw images as input.
• Challenging points
• In addition to the high dimensionality, challenging elements of these games
include delayed rewards (no immediate penalty is incurred when a life is lost
in Breakout or Space Invaders)
• Complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms)
• Non-stationary image statistics (Enduro involves a changing and flickering background)
Discussion
• We proposed and analyzed trust region methods for optimizing
stochastic control policies.
• We proved monotonic improvement for an algorithm that
repeatedly optimizes a local approximation to the expected
return of the policy with a KL divergence penalty.
Discussion
• In the domain of robotic locomotion, we successfully learned
controllers for swimming, walking and hopping in a physics
simulator, using general purpose neural networks and
minimally informative rewards.
• At the intersection of the two experimental domains we
explored, there is the possibility of learning robotic control
policies that use vision and raw sensory data as input, providing
a unified scheme for training robotic controllers that perform
both perception and control.
Discussion
• By combining our method with model learning, it would also be
possible to substantially reduce its sample complexity, making
it applicable to real-world settings where samples are
expensive.
Summary
References
• https://www.slideshare.net/WoongwonLee/trpo-87165690
• https://reinforcement-learning-kr.github.io/2018/06/24/5_trpo/
• https://www.youtube.com/watch?v=XBO4oPChMfI
• https://www.youtube.com/watch?v=CKaN5PgkSBc
Thank you!
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 

Recently uploaded (20)

Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 

Trust Region Policy Optimization, Schulman et al, 2015

  • 1. Trust Region Policy Optimization, Schulman et al, 2015. 옥찬호 utilForever@gmail.com
  • 2. Introduction • Most algorithms for policy optimization can be classified into three broad categories: • (1) Policy iteration methods (alternate between estimating the value function under the current policy and improving the policy) • (2) Policy gradient methods (use an estimator of the gradient of the expected return) • (3) Derivative-free optimization methods (treat the return as a black-box function to be optimized in terms of the policy parameters), such as • The Cross-Entropy Method (CEM) • The Covariance Matrix Adaptation (CMA)
  • 3. Introduction • General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement. • For example, while Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task.
  • 4. Introduction • For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations. • The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods.
  • 5. Introduction • Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.
  • 6. Preliminaries • Markov Decision Process (MDP): $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$ • $\mathcal{S}$ is a finite set of states • $\mathcal{A}$ is a finite set of actions • $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution • $r : \mathcal{S} \to \mathbb{R}$ is the reward function • $\rho_0 : \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$ • $\gamma \in (0, 1)$ is the discount factor • Policy: $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$
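To make the notation concrete, the tuple above can be sketched for a small tabular MDP; the array names and sizes below are illustrative, not from the paper.

```python
import numpy as np

# A tabular MDP (S, A, P, r, rho0, gamma) with |S| = 3 states, |A| = 2 actions.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] = transition probability; each (s, a) row sums to 1.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

r = np.array([0.0, 1.0, -1.0])    # reward per state
rho0 = np.array([1.0, 0.0, 0.0])  # initial state distribution
gamma = 0.99                      # discount factor

# A stochastic policy pi(a | s): one categorical distribution per state.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def sample_trajectory(T=5):
    """Sample (state, action, reward) triples by rolling out pi in the MDP."""
    s = rng.choice(n_states, p=rho0)
    traj = []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        traj.append((s, a, r[s]))
        s = s_next
    return traj

print(len(sample_trajectory()))  # 5
```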
  • 7. Preliminaries
  • 8. Preliminaries
  • 9. Preliminaries
  • 10. Preliminaries
  • 11. Preliminaries • How do we measure a policy's performance?
  • 12. Preliminaries • Expected cumulative discounted reward • Action-value, value, and advantage functions
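The quantities this slide refers to (shown as images in the deck) are, in the paper's notation:

```latex
\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big],
\quad s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)

Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big],
\qquad
V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big]

A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)
```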
  • 13. Preliminaries
  • 14. Preliminaries
  • 15. Preliminaries • Lemma 1 (Kakade & Langford, 2002) • Proof
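Lemma 1, whose statement appeared as an image on this slide, expresses the performance of one policy in terms of the advantage over another:

```latex
\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]
```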
  • 16. Preliminaries • Discounted (unnormalized) visitation frequencies • Rewrite Lemma 1
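The visitation frequencies and the rewritten form of Lemma 1 (displayed as images on this slide) are:

```latex
\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots

\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)
```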
  • 17. Preliminaries • This equation implies that any policy update $\pi \to \tilde{\pi}$ that has a nonnegative expected advantage at every state $s$, i.e., $\sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) \ge 0$, is guaranteed to increase the policy performance $\eta$, or leave it constant in the case that the expected advantage is zero everywhere.
  • 18. Preliminaries
  • 19. Preliminaries • However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states $s$ for which the expected advantage is negative, that is, $\sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) < 0$. • The complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes the equation difficult to optimize directly.
  • 20. Preliminaries • Instead, we introduce the following local approximation to $\eta$: • Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde{\pi}}$, ignoring changes in state visitation density due to changes in the policy.
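The local approximation referred to here is:

```latex
L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)
```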
  • 21. Preliminaries
  • 22. Preliminaries • However, if we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$, then $L_\pi$ matches $\eta$ to first order. • This equation implies that a sufficiently small step $\pi_{\theta_0} \to \tilde{\pi}$ that improves $L_{\pi_{\theta_{\mathrm{old}}}}$ will also improve $\eta$, but does not give us any guidance on how big a step to take.
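The first-order match at $\theta_0$ (shown as an image on this slide) is:

```latex
L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}),
\qquad
\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}
```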
  • 23. Preliminaries
  • 24. Preliminaries • Conservative Policy Iteration • Provides explicit lower bounds on the improvement of $\eta$. • Let $\pi_{\mathrm{old}}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_{\mathrm{old}}}(\pi')$. The new policy $\pi_{\mathrm{new}}$ was defined to be the following mixture: $\pi_{\mathrm{new}}(a \mid s) = (1 - \alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s)$
  • 25. Preliminaries • Conservative Policy Iteration • The following lower bound holds: $\eta(\pi_{\mathrm{new}}) \ge L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$, where $\epsilon = \max_s \big|\mathbb{E}_{a \sim \pi'(\cdot \mid s)}[A_\pi(s, a)]\big|$ • However, this policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.
  • 26. Monotonic Improvement Guarantee • The policy improvement bound in the above equation can be extended to general stochastic policies by • (1) Replacing $\alpha$ with a distance measure between $\pi$ and $\tilde{\pi}$ • (2) Changing the constant $\epsilon$ appropriately
  • 27. Monotonic Improvement Guarantee • The particular distance measure we use is the total variation divergence for discrete probability distributions $p, q$: $D_{\mathrm{TV}}(p \parallel q) = \frac{1}{2}\sum_i |p_i - q_i|$ • Define $D_{\mathrm{TV}}^{\max}(\pi, \tilde{\pi}) = \max_s D_{\mathrm{TV}}(\pi(\cdot \mid s) \parallel \tilde{\pi}(\cdot \mid s))$
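For tabular policies these two quantities can be sketched directly; the function and variable names below are illustrative:

```python
import numpy as np

def tv_divergence(p, q):
    """Total variation divergence between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def tv_max(pi, pi_tilde):
    """D_TV^max over states; pi and pi_tilde are (n_states, n_actions) tables."""
    return max(tv_divergence(pi[s], pi_tilde[s]) for s in range(len(pi)))

p = np.array([0.5, 0.5])
q = np.array([0.8, 0.2])
print(round(tv_divergence(p, q), 6))  # 0.3
```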
  • 28. Monotonic Improvement Guarantee • Theorem 1. Let $\alpha = D_{\mathrm{TV}}^{\max}(\pi_{\mathrm{old}}, \pi_{\mathrm{new}})$. Then the following bound holds: $\eta(\pi_{\mathrm{new}}) \ge L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2$, where $\epsilon = \max_{s,a} |A_\pi(s, a)|$
  • 29. Monotonic Improvement Guarantee • We note the following relationship between the total variation divergence and the KL divergence: $D_{\mathrm{TV}}(p \parallel q)^2 \le D_{\mathrm{KL}}(p \parallel q)$ • Let $D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}) = \max_s D_{\mathrm{KL}}(\pi(\cdot \mid s) \parallel \tilde{\pi}(\cdot \mid s))$. The following bound then follows directly from Theorem 1: $\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\,D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi})$, where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$
  • 33. Monotonic Improvement Guarantee • Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies $\eta(\pi_0) \le \eta(\pi_1) \le \eta(\pi_2) \le \cdots$ • Let $M_i(\pi) = L_{\pi_i}(\pi) - C\,D_{\mathrm{KL}}^{\max}(\pi_i, \pi)$. Then $\eta(\pi_{i+1}) \ge M_i(\pi_{i+1})$ and $\eta(\pi_i) = M_i(\pi_i)$, so $\eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$. • Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing. • This algorithm is a type of minorization-maximization (MM) algorithm.
  • 34. Optimization of Parameterized Policies • Overload our previous notation to use functions of $\theta$: • $\eta(\theta) := \eta(\pi_\theta)$ • $L_\theta(\tilde{\theta}) := L_{\pi_\theta}(\pi_{\tilde{\theta}})$ • $D_{\mathrm{KL}}(\theta \parallel \tilde{\theta}) := D_{\mathrm{KL}}(\pi_\theta \parallel \pi_{\tilde{\theta}})$ • Use $\theta_{\mathrm{old}}$ to denote the previous policy parameters
  • 35. Optimization of Parameterized Policies • The preceding section showed that $\eta(\theta) \ge L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta)$ • Thus, by performing the following maximization, we are guaranteed to improve the true objective $\eta$: $\mathrm{maximize}_\theta \big[ L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \big]$
  • 36. Optimization of Parameterized Policies • In practice, if we used the penalty coefficient $C$ recommended by the theory above, the step sizes would be very small. • One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint: $\mathrm{maximize}_\theta\; L_{\theta_{\mathrm{old}}}(\theta)$ subject to $D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \le \delta$
  • 37. Optimization of Parameterized Policies • This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. • Instead, we can use a heuristic approximation which considers the average KL divergence: $\bar{D}_{\mathrm{KL}}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\big[ D_{\mathrm{KL}}(\pi_{\theta_1}(\cdot \mid s) \parallel \pi_{\theta_2}(\cdot \mid s)) \big]$
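A minimal sketch of this average-KL heuristic for tabular policies, assuming the policies are stored as (state × action) probability tables; the names are illustrative:

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, with a small epsilon for stability."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mean_kl(pi_old, pi_new, states):
    """Average KL divergence over a batch of sampled states (s ~ rho)."""
    return float(np.mean([kl_categorical(pi_old[s], pi_new[s]) for s in states]))
```

Unlike the max-KL constraint, this estimate only needs the states actually visited, which is what makes it tractable with sampled rollouts.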
  • 38. Optimization of Parameterized Policies • We therefore propose solving the following optimization problem to generate a policy update: $\mathrm{maximize}_\theta\; L_{\theta_{\mathrm{old}}}(\theta)$ subject to $\bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta$
  • 40. Sample-Based Estimation • How can the objective and constraint functions be approximated using Monte Carlo simulation? • We seek to solve the following optimization problem, obtained by expanding $L_{\theta_{\mathrm{old}}}$ in the previous equation: $\mathrm{maximize}_\theta \sum_s \rho_{\theta_{\mathrm{old}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_{\mathrm{old}}}(s, a)$ subject to $\bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta$
  • 41. Sample-Based Estimation • We replace • $\sum_s \rho_{\theta_{\mathrm{old}}}(s)\,[\ldots] \to \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}[\ldots]$ • $A_{\theta_{\mathrm{old}}} \to Q_{\theta_{\mathrm{old}}}$ • $\sum_a \pi_\theta(a \mid s)\, A_{\theta_{\mathrm{old}}} \to \mathbb{E}_{a \sim q}\Big[ \frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, A_{\theta_{\mathrm{old}}} \Big]$ (We replace the sum over the actions by an importance sampling estimator with sampling distribution $q$.)
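The resulting importance-sampled surrogate can be sketched as a Monte Carlo average over sampled state-action pairs; the names are illustrative, and `q_behavior` stands for the sampling distribution $q$:

```python
import numpy as np

def surrogate_loss(pi_theta, q_behavior, q_values, states, actions):
    """Monte Carlo estimate of E_{a~q}[ pi_theta(a|s) / q(a|s) * Q(s, a) ].

    pi_theta, q_behavior: (n_states, n_actions) probability tables.
    q_values: empirical Q-value estimate for each sampled (state, action) pair.
    """
    ratios = np.array([pi_theta[s, a] / q_behavior[s, a]
                       for s, a in zip(states, actions)])
    return float(np.mean(ratios * np.asarray(q_values)))
```

When `pi_theta` equals `q_behavior`, every ratio is 1 and the surrogate reduces to the plain average of the sampled Q-values, as expected.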
  • 44. Sample-Based Estimation • Now, our optimization problem in the previous equation is exactly equivalent to the following one, written in terms of expectations: $\mathrm{maximize}_\theta\; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\Big[ \frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\mathrm{old}}}(s, a) \Big]$ subject to $\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\big[ D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \parallel \pi_\theta(\cdot \mid s)) \big] \le \delta$ • Constrained Optimization → Sample-based Estimation
  • 45. Sample-Based Estimation • All that remains is to replace the expectations by sample averages and replace the $Q$-values by empirical estimates. The following sections describe two different schemes for performing this estimation. • Single Path: Typically used for policy gradient estimation (Bartlett & Baxter, 2011), based on sampling individual trajectories. • Vine: Constructing a rollout set and then performing multiple actions from each state in the rollout set.
  • 49. Practical Algorithm 1. Use the single path or vine procedure to collect a set of state-action pairs along with Monte Carlo estimates of their $Q$-values. 2. By averaging over samples, construct the estimated objective and constraint above. 3. Approximately solve this constrained optimization problem to update the policy's parameter vector $\theta$. We use the conjugate gradient algorithm followed by a line search, which is altogether only slightly more expensive than computing the gradient itself. See Appendix C for details.
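Step 3 relies on the conjugate gradient method, which only needs matrix-vector products with the constraint's Hessian rather than the matrix itself. A generic sketch (not the paper's implementation) under that assumption:

```python
import numpy as np

def conjugate_gradient(mvp, b, iters=10, tol=1e-10):
    """Solve A x = b given only matrix-vector products mvp(v) = A v.

    In TRPO, mvp would be a Fisher-vector product computed from the
    Hessian of the KL divergence, so A is never formed explicitly.
    """
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x (x starts at 0)
    p = r.copy()          # search direction
    rs_old = r @ r
    for _ in range(iters):
        Ap = mvp(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy symmetric positive-definite system standing in for the FIM.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
print(np.allclose(A @ x, b))  # True
```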
  • 50. Practical Algorithm • With regard to (3), we construct the Fisher information matrix (FIM) by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix of the gradients. • That is, we estimate $A_{ij}$ as $\frac{1}{N}\sum_{n=1}^{N} \frac{\partial^2}{\partial\theta_i\,\partial\theta_j} D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_n) \parallel \pi_\theta(\cdot \mid s_n))$, rather than the covariance of the policy gradient terms. • This analytic estimator has computational benefits in the large-scale setting, since it removes the need to store a dense Hessian or all policy gradients from a batch of trajectories.
  • 51. Experiments We designed experiments to investigate the following questions: 1. What are the performance characteristics of the single path and vine sampling procedures? 2. TRPO is related to prior methods but makes several changes, most notably by using a fixed KL divergence constraint rather than a fixed penalty coefficient. How does this affect the performance of the algorithm? 3. Can TRPO be used to solve challenging large-scale problems? How does TRPO compare with other methods when applied to large-scale problems, with regard to final performance, computation time, and sample complexity?
  • 52. Experiments • Simulated Robotic Locomotion
  • 53. Experiments • Simulated Robotic Locomotion
  • 54. Experiments • Simulated Robotic Locomotion
  • 55. Experiments • Simulated Robotic Locomotion
  • 56. Experiments • Playing Games from Images • To evaluate TRPO on a partially observed task with complex observations, we trained policies for playing Atari games, using raw images as input. • Challenging points • In addition to the high dimensionality, challenging elements of these games include delayed rewards (no immediate penalty is incurred when a life is lost in Breakout or Space Invaders) • Complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms) • Non-stationary image statistics (Enduro involves a changing and flickering background)
  • 57. Experiments • Playing Games from Images
  • 58. Experiments • Playing Games from Images
  • 59. Experiments • Playing Games from Images
  • 60. Discussion • We proposed and analyzed trust region methods for optimizing stochastic control policies. • We proved monotonic improvement for an algorithm that repeatedly optimizes a local approximation to the expected return of the policy with a KL divergence penalty.
  • 61. Discussion • In the domain of robotic locomotion, we successfully learned controllers for swimming, walking, and hopping in a physics simulator, using general-purpose neural networks and minimally informative rewards. • At the intersection of the two experimental domains we explored is the possibility of learning robotic control policies that use vision and raw sensory data as input, providing a unified scheme for training robotic controllers that perform both perception and control.
  • 62. Discussion • By combining our method with model learning, it would also be possible to substantially reduce its sample complexity, making it applicable to real-world settings where samples are expensive.
  • 63. Summary
  • 64. References • https://www.slideshare.net/WoongwonLee/trpo-87165690 • https://reinforcement-learning-kr.github.io/2018/06/24/5_trpo/ • https://www.youtube.com/watch?v=XBO4oPChMfI • https://www.youtube.com/watch?v=CKaN5PgkSBc