Trust Region Policy Optimization, Schulman et al, 2015.
옥찬호
utilForever@gmail.com
Introduction
• Most algorithms for policy optimization can be classified into three broad categories:
• (1) Policy iteration methods (Alternate between estimating the value
function under the current policy and improving the policy)
• (2) Policy gradient methods (Use an estimator of the gradient of the
expected return)
• (3) Derivative-free optimization methods (Treat the return as a black box
function to be optimized in terms of the policy parameters) such as
• The Cross-Entropy Method (CEM)
• The Covariance Matrix Adaptation (CMA)
Introduction
• General derivative-free stochastic optimization methods
such as CEM and CMA are preferred on many problems,
• because they achieve good results while being simple to understand and implement.
• For example, while Tetris is a classic benchmark problem for
approximate dynamic programming(ADP) methods, stochastic
optimization methods are difficult to beat on this task.
Introduction
• For continuous control problems, methods like CMA have been
successful at learning control policies for challenging tasks like
locomotion when provided with hand-engineered policy
classes with low-dimensional parameterizations.
• The inability of ADP and gradient-based methods to
consistently beat gradient-free random search is unsatisfying,
• since gradient-based optimization algorithms enjoy much better sample
complexity guarantees than gradient-free methods.
Introduction
• Continuous gradient-based optimization has been very
successful at learning function approximators for supervised
learning tasks with huge numbers of parameters, and
extending their success to reinforcement learning would allow
for efficient training of complex and powerful policies.
Preliminaries
• Markov Decision Process (MDP): (𝒮, 𝒜, P, r, ρ0, γ)
• 𝒮 is a finite set of states
• 𝒜 is a finite set of actions
• P : 𝒮 × 𝒜 × 𝒮 → ℝ is the transition probability
• r : 𝒮 → ℝ is the reward function
• ρ0 : 𝒮 → ℝ is the distribution of the initial state s0
• γ ∈ (0, 1) is the discount factor
• Policy
• π : 𝒮 × 𝒜 → [0, 1]
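To make the notation concrete, here is a small illustrative sketch in Python; the names (TabularMDP, sample_episode) are ours and not from the paper, and rewards depend only on the state, matching the definition above.

```python
import numpy as np

class TabularMDP:
    """A tiny finite MDP (S, A, P, r, rho0, gamma) with a tabular stochastic policy."""

    def __init__(self, P, r, rho0, gamma):
        self.P = P          # P[s, a, s'] = transition probability
        self.r = r          # r[s] = reward for being in state s
        self.rho0 = rho0    # rho0[s] = initial-state distribution
        self.gamma = gamma  # discount factor in (0, 1)

    def sample_episode(self, pi, horizon, rng):
        """Roll out policy pi (pi[s, a] = probability of action a in state s) for `horizon` steps."""
        s = rng.choice(len(self.rho0), p=self.rho0)
        states, actions, rewards = [], [], []
        for _ in range(horizon):
            a = rng.choice(pi.shape[1], p=pi[s])
            states.append(s); actions.append(a); rewards.append(self.r[s])
            s = rng.choice(self.P.shape[2], p=self.P[s, a])
        return states, actions, rewards
```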
Preliminaries
How to measure policy’s performance?
Preliminaries
• Expected cumulative discounted reward
• Action value, Value/Advantage functions
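The definitions behind these two bullets appear as images on the slide; as given in the paper, they are:

η(π) = E_{s0, a0, …}[ Σ_{t=0}^∞ γ^t r(s_t) ],  with s0 ∼ ρ0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)
Q_π(s_t, a_t) = E_{s_{t+1}, a_{t+1}, …}[ Σ_{l=0}^∞ γ^l r(s_{t+l}) ]
V_π(s_t) = E_{a_t, s_{t+1}, …}[ Σ_{l=0}^∞ γ^l r(s_{t+l}) ]
A_π(s, a) = Q_π(s, a) − V_π(s)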
Preliminaries
• Lemma 1. Kakade & Langford (2002)
• Proof
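The statement and proof of Lemma 1 are images on the slide; the identity it states, as given in the paper, is:

η(π̃) = η(π) + E_{s0, a0, … ∼ π̃}[ Σ_{t=0}^∞ γ^t A_π(s_t, a_t) ]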
Preliminaries
• Discounted (unnormalized) visitation frequencies
• Rewrite Lemma 1
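Both expressions are shown as images on the slide; as given in the paper, the visitation frequency and the rewritten form of Lemma 1 are:

ρ_π(s) = P(s0 = s) + γ P(s1 = s) + γ² P(s2 = s) + ⋯
η(π̃) = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s, a)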
Preliminaries
• This equation implies that any policy update π → π̃ that has
a nonnegative expected advantage at every state s,
i.e., Σ_a π̃(a|s) A_π(s, a) ≥ 0, is guaranteed to increase
the policy performance η, or leave it constant in the case that
the expected advantage is zero everywhere.
Preliminaries
• However, in the approximate setting, it will typically be
unavoidable, due to estimation and approximation error, that
there will be some states s for which the expected advantage
is negative, that is, Σ_a π̃(a|s) A_π(s, a) < 0.
• The complex dependency of ρ_π̃(s) on π̃ makes the equation
difficult to optimize directly.
Preliminaries
• Instead, we introduce the following local approximation to 𝜂:
• Note that L_π uses the visitation frequency ρ_π rather than ρ_π̃,
ignoring changes in state visitation density due to changes in
the policy.
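The local approximation itself, shown as an image on the slide, is (as given in the paper):

L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)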
Preliminaries
• However, if we have a parameterized policy π_θ, where
π_θ(a|s) is a differentiable function of the parameter vector θ,
then L_π matches η to first order.
• This equation implies that a sufficiently small step π_θ0 → π̃ that
improves L_π_θold will also improve η, but does not give us any
guidance on how big of a step to take.
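The first-order matching conditions referenced here are shown as images on the slide; as given in the paper, they are:

L_π_θ0(π_θ0) = η(π_θ0)
∇_θ L_π_θ0(π_θ) |_{θ=θ0} = ∇_θ η(π_θ) |_{θ=θ0}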
Preliminaries
• Conservative Policy Iteration
• Provide explicit lower bounds on the improvement of 𝜂.
• Let π_old denote the current policy, and let π′ = argmax_π′ L_π_old(π′).
The new policy π_new was defined to be the following mixture:
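As given in the paper (the slide shows it as an image):

π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)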
Preliminaries
• Conservative Policy Iteration
• The following lower bound (reproduced after this list):
• However, this policy class is unwieldy and restrictive in practice,
and it is desirable for a practical policy update scheme to be applicable to
all general stochastic policy classes.
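The lower bound referred to above, as given in the paper (an image on the slide), is:

η(π_new) ≥ L_π_old(π_new) − (2εγ / (1 − γ)²) α²,  where ε = max_s | E_{a∼π′(a|s)}[ A_π(s, a) ] |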
Monotonic Improvement Guarantee
• The policy improvement bound in the equation above can be
extended to general stochastic policies, by
• (1) Replacing 𝛼 with a distance measure between 𝜋 and ෤𝜋
• (2) Changing the constant 𝜖 appropriately
Monotonic Improvement Guarantee
• The particular distance measure we use is
the total variation divergence for
discrete probability distributions p, q:
D_TV(p ∥ q) = (1/2) Σ_i |p_i − q_i|
• Define D_TV^max(π, π̃) as
D_TV^max(π, π̃) = max_s D_TV(π(·|s) ∥ π̃(·|s))
Monotonic Improvement Guarantee
• Theorem 1. Let α = D_TV^max(π_old, π_new).
Then the following bound holds:
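As given in the paper (the slide shows the bound as an image):

η(π_new) ≥ L_π_old(π_new) − (4εγ / (1 − γ)²) α²,  where ε = max_{s,a} | A_π(s, a) |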
Monotonic Improvement Guarantee
• We note the following relationship between the total variation
divergence and the KL divergence:
D_TV(p ∥ q)² ≤ D_KL(p ∥ q)
• Let D_KL^max(π, π̃) = max_s D_KL(π(·|s) ∥ π̃(·|s)).
The following bound then follows directly from Theorem 1:
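As given in the paper (an image on the slide):

η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃),  where C = 4εγ / (1 − γ)²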
Monotonic Improvement Guarantee
• Algorithm 1
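Algorithm 1 itself appears as an image on the slide. The sketch below is a Python-flavored rendering of it; the helper names (compute_advantages, solve_penalized_max) are ours, and evaluating all advantages exactly is the idealized step that the rest of the paper approximates.

```python
def policy_iteration_with_guarantee(pi0, compute_advantages, solve_penalized_max, num_iters):
    """Sketch of Algorithm 1: approximate policy iteration with a KL penalty.

    compute_advantages(pi)      -> A_pi(s, a) for all (s, a)   (assumed exact here)
    solve_penalized_max(pi, A)  -> argmax_pi' [ L_pi(pi') - C * D_KL_max(pi, pi') ]
    """
    pi = pi0
    for i in range(num_iters):
        A = compute_advantages(pi)       # evaluate all advantages under the current policy
        pi = solve_penalized_max(pi, A)  # maximize the penalized surrogate M_i(pi')
    return pi
```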
Monotonic Improvement Guarantee
• Algorithm 1 is guaranteed to generate a monotonically
improving sequence of policies η(π_0) ≤ η(π_1) ≤ η(π_2) ≤ ⋯.
• Let M_i(π) = L_π_i(π) − C · D_KL^max(π_i, π). Then,
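as given in the paper (the slide shows these as images):

η(π_{i+1}) ≥ M_i(π_{i+1})  and  η(π_i) = M_i(π_i),  hence
η(π_{i+1}) − η(π_i) ≥ M_i(π_{i+1}) − M_i(π_i)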
• Thus, by maximizing 𝑀𝑖 at each iteration, we guarantee that
the true objective 𝜂 is non-decreasing.
• This algorithm is a type of minorization-maximization (MM) algorithm.
Optimization of Parameterized Policies
• Overload our previous notation to use functions of 𝜃.
• η(θ) ≔ η(π_θ)
• L_θ(θ̃) ≔ L_π_θ(π_θ̃)
• D_KL(θ ∥ θ̃) ≔ D_KL(π_θ ∥ π_θ̃)
• Use θold to denote the previous policy parameters
Optimization of Parameterized Policies
• The preceding section showed that
η(θ) ≥ L_θold(θ) − C · D_KL^max(θold, θ)
• Thus, by performing the following maximization,
we are guaranteed to improve the true objective η:
maximize_θ [ L_θold(θ) − C · D_KL^max(θold, θ) ]
Optimization of Parameterized Policies
• In practice, if we used the penalty coefficient C recommended
by the theory above, the step sizes would be very small.
• One way to take larger steps in a robust way is to use a
constraint on the KL divergence between the new policy and
the old policy, i.e., a trust region constraint:
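As written in the paper (an image on the slide):

maximize_θ L_θold(θ)  subject to  D_KL^max(θold, θ) ≤ δ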
Optimization of Parameterized Policies
• This problem imposes a constraint that the KL divergence is
bounded at every point in the state space. While it is motivated
by the theory, this problem is impractical to solve due to the
large number of constraints.
• Instead, we can use a heuristic approximation which considers
the average KL divergence:
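As defined in the paper (an image on the slide):

D̄_KL^ρ(θ1, θ2) := E_{s∼ρ}[ D_KL(π_θ1(·|s) ∥ π_θ2(·|s)) ]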
Optimization of Parameterized Policies
• We therefore propose solving the following optimization
problem to generate a policy update:
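As given in the paper (an image on the slide):

maximize_θ L_θold(θ)  subject to  D̄_KL^{ρ_θold}(θold, θ) ≤ δ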
Sample-Based Estimation
• How can the objective and constraint functions be
approximated using Monte Carlo simulation?
• We seek to solve the following optimization problem, obtained
by expanding L_θold in the previous equation:
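As written in the paper (an image on the slide):

maximize_θ Σ_s ρ_θold(s) Σ_a π_θ(a|s) A_θold(s, a)  subject to  D̄_KL^{ρ_θold}(θold, θ) ≤ δ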
Sample-Based Estimation
• We replace
• Σ_s ρ_θold(s) […] → (1 / (1 − γ)) E_{s∼ρ_θold}[…]
• A_θold → Q_θold
• Σ_a π_θ(a|s) A_θold(s, a) → E_{a∼q}[ (π_θ(a|s) / q(a|s)) A_θold(s, a) ]
(We replace the sum over the actions by an importance sampling estimator, with q denoting the sampling distribution.)
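As a rough illustration (not the authors' code), the resulting sample-based estimates of the surrogate objective and the average KL can be computed as below; the function and argument names are ours, and the inputs are assumed to come from sampled trajectories.

```python
import numpy as np

def surrogate_and_kl(logp_new, logp_q, q_values, kl_per_state):
    """Monte Carlo estimates of the surrogate objective and the average KL.

    logp_new     : log pi_theta(a_n | s_n) for the sampled actions
    logp_q       : log q(a_n | s_n) under the sampling distribution (q = pi_theta_old for single path)
    q_values     : empirical Q estimates for the sampled (s_n, a_n)
    kl_per_state : D_KL(pi_theta_old(.|s_n) || pi_theta(.|s_n)) at each sampled state
    """
    ratio = np.exp(logp_new - logp_q)      # importance weights pi_theta / q
    surrogate = np.mean(ratio * q_values)  # sample average of the importance-sampled objective
    mean_kl = np.mean(kl_per_state)        # estimate of the average KL constraint
    return surrogate, mean_kl
```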
Sample-Based Estimation
• Now, our optimization problem in the previous equation is exactly
equivalent to the following one, written in terms of
expectations:
• Constrained Optimization → Sample-based Estimation
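The expectation form, as given in the paper (an image on the slide), is:

maximize_θ E_{s∼ρ_θold, a∼q}[ (π_θ(a|s) / q(a|s)) Q_θold(s, a) ]
subject to  E_{s∼ρ_θold}[ D_KL(π_θold(·|s) ∥ π_θ(·|s)) ] ≤ δ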
Sample-Based Estimation
• All that remains is to replace the expectations by sample
averages and replace the 𝑄 value by an empirical estimate.
The following sections describe two different schemes for
performing this estimation.
• Single Path: Typically used for policy gradient estimation (Bartlett & Baxter,
2011), and is based on sampling individual trajectories.
• Vine: Constructing a rollout set and then performing multiple actions from
each state in the rollout set.
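A minimal sketch of the single path scheme, reusing the hypothetical TabularMDP from earlier (so q is simply π_θold and the empirical Q value is the discounted return from each visited state-action pair):

```python
def single_path_samples(mdp, pi_old, num_episodes, horizon, rng):
    """Collect (state, action, discounted return) tuples by rolling out pi_old (single path scheme)."""
    samples = []
    for _ in range(num_episodes):
        states, actions, rewards = mdp.sample_episode(pi_old, horizon, rng)
        ret = 0.0
        for s, a, r in reversed(list(zip(states, actions, rewards))):
            ret = r + mdp.gamma * ret    # discounted return from (s, a) onward
            samples.append((s, a, ret))  # used as the empirical Q estimate
    return samples
```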
Practical Algorithm
1. Use the single path or vine procedures to collect a set of state-
action pairs along with Monte Carlo estimates of their 𝑄-values.
2. By averaging over samples, construct the estimated objective and
constraint in the equation above.
3. Approximately solve this constrained optimization problem to
update the policy’s parameter vector 𝜃. We use the conjugate
gradient algorithm followed by a line search, which is altogether
only slightly more expensive than computing the gradient itself.
See Appendix C for details.
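As an illustration of step 3 (a sketch, not the authors' reference implementation), assuming a fisher_vector_product routine that returns F·v products and an eval_surrogate_and_kl(theta) closure that returns the sample-based surrogate and average KL at parameters theta (e.g., built on the estimator sketched earlier):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, given only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, policy_gradient, fvp, eval_surrogate_and_kl, max_kl=0.01, backtracks=10):
    """Conjugate gradient step direction, rescaled to the trust region, then a backtracking line search."""
    step_dir = conjugate_gradient(fvp, policy_gradient)
    step_size = np.sqrt(2.0 * max_kl / (step_dir @ fvp(step_dir)))  # largest step satisfying the quadratic KL model
    full_step = step_size * step_dir
    surr_old, _ = eval_surrogate_and_kl(theta)
    for frac in 0.5 ** np.arange(backtracks):                       # shrink until the surrogate improves and KL is acceptable
        theta_new = theta + frac * full_step
        surr, kl = eval_surrogate_and_kl(theta_new)
        if surr > surr_old and kl <= max_kl:
            return theta_new
    return theta
```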
Practical Algorithm
• With regard to (3), we construct the Fisher information matrix
(FIM) by analytically computing the Hessian of KL divergence,
rather than using the covariance matrix of the gradients.
• That is, we estimate A_ij as the sample-averaged Hessian of the KL
divergence (first expression below), rather than as the covariance of the
policy gradients (second expression below).
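The two estimators, as described in the paper (shown as images on the slide), are:

A_ij ≈ (1/N) Σ_n ∂²/∂θ_i ∂θ_j D_KL(π_θold(·|s_n) ∥ π_θ(·|s_n))
rather than
A_ij ≈ (1/N) Σ_n (∂/∂θ_i log π_θ(a_n|s_n)) (∂/∂θ_j log π_θ(a_n|s_n))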
• This analytic estimator has computational benefits in the
large-scale setting, since it removes the need to store a dense
Hessian or all policy gradients from a batch of trajectories.
Experiments
We designed experiments to investigate the following questions:
1. What are the performance characteristics of the single path and vine
sampling procedures?
2. TRPO is related to prior methods but makes several changes, most notably by
using a fixed KL divergence rather than a fixed penalty coefficient. How does
this affect the performance of the algorithm?
3. Can TRPO be used to solve challenging large-scale problems? How does
TRPO compare with other methods when applied to large-scale problems,
with regard to final performance, computation time, and sample complexity?
Experiments
• Simulated Robotic Locomotion
Experiments
• Playing Games from Images
• To evaluate TRPO on a partially observed task with complex observations, we
trained policies for playing Atari games, using raw images as input.
• Challenging points
• In addition to the high dimensionality, challenging elements of these games include delayed rewards (no
immediate penalty is incurred when a life is lost in Breakout or Space Invaders)
• Complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms)
• Non-stationary image statistics (Enduro involves a changing and flickering background)
Discussion
• We proposed and analyzed trust region methods for optimizing
stochastic control policies.
• We proved monotonic improvement for an algorithm that
repeatedly optimizes a local approximation to the expected
return of the policy with a KL divergence penalty.
Discussion
• In the domain of robotic locomotion, we successfully learned
controllers for swimming, walking and hopping in a physics
simulator, using general purpose neural networks and
minimally informative rewards.
• At the intersection of the two experimental domains we
explored, there is the possibility of learning robotic control
policies that use vision and raw sensory data as input, providing
a unified scheme for training robotic controllers that perform
both perception and control.
Discussion
• By combining our method with model learning, it would also be
possible to substantially reduce its sample complexity, making
it applicable to real-world settings where samples are
expensive.
Summary
References
• https://www.slideshare.net/WoongwonLee/trpo-87165690
• https://reinforcement-learning-kr.github.io/2018/06/24/5_trpo/
• https://www.youtube.com/watch?v=XBO4oPChMfI
• https://www.youtube.com/watch?v=CKaN5PgkSBc
Thank you!