Reward-Conditioned Policies
Aviral Kumar, Xue Bin Peng, Sergey Levine, 2019
Changhoon, Kevin Jeong
Seoul National University
chjeong@bi.snu.ac.kr
June 7, 2020
Contents
I. Motivation
II. Preliminaries
III. Reward-Conditioned Policies
IV. Experimental Evaluation
V. Discussion and Future Work
I. Motivation
Motivation
Supervised Learning
– Learns from existing, given sample data (examples)
– Direct feedback (labels) is provided for each prediction
– Commonly used and well understood
Reinforcement Learning
– Learns by interacting with the environment
– Concerns sequential decision making (e.g. games, robots)
– RL algorithms can be brittle, difficult to use and tune
Can we learn effective policies via supervised learning?
Motivation
One possible method: imitation learning
– Behavioural cloning, direct policy learning, inverse RL, etc.
– Imitation learning utilizes standard, well-understood supervised learning methods
– But it requires near-optimal expert data in advance
So, can we learn effective policies via supervised learning without demonstrations?
– Non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision
– not for maximizing the reward, but for matching the reward of the
given trajectory
II. Preliminaries
Preliminaries
Reinforcement Learning
Objective
J(θ) = E_{s_0 ∼ p(s_0), a_{0:∞} ∼ π, s_{t+1} ∼ p(·|s_t, a_t)} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ]
– Policy-based: compute the derivative of J(θ) w.r.t. the policy parameter θ
– Value-based: estimate the value (or Q) function by means of temporal
difference learning
– How to avoid high-variance policy gradient estimators, as well as the
complexity of temporal difference learning?
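As a concrete reading of the objective above, here is a minimal Monte-Carlo sketch in plain Python (illustrative only, not from the paper; the `env`/`policy` interface with `env.step(a)` returning `(s, r, done)` is an assumption):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, num_episodes=100, gamma=0.99):
    """Monte-Carlo estimate of J(theta) = E[ sum_t gamma^t r(s_t, a_t) ]."""
    returns = []
    for _ in range(num_episodes):
        s, done, rewards = env.reset(), False, []
        while not done:
            a = policy(s)                # sample a ~ pi_theta(.|s)
            s, r, done = env.step(a)     # assumed minimal env interface
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```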
Preliminaries
Monte-Carlo update
V(S_t) ← V(S_t) + α (G_t − V(S_t)), where G_t = Σ_{k=0}^{∞} γ^k r(s_{t+k}, a_{t+k})
– Pros: unbiased, good convergence properties
– Cons: high variance
Temporal-Difference update
V (St) ← V (St) + α (Rt+1 + γV (St+1) − V (St))
– Pros: learn online every step, low variance
– Cons: bootstrapping - update involves an estimate; biased
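A small tabular sketch of the two updates (plain Python; the episode format `[(state, reward), ...]`, the value table `V`, and `alpha` are illustrative assumptions):

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte-Carlo: move V(S_t) toward the full return G_t (unbiased, high variance)."""
    G = 0.0
    for s, r in reversed(episode):        # episode = [(s_0, r_1), (s_1, r_2), ...]
        G = r + gamma * G                 # return observed from state s onward
        V[s] += alpha * (G - V[s])
    return V

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0): bootstrap from the current estimate V(S_{t+1}) (biased, low variance)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```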
Preliminaries
Function Approximation: Policy Gradient
Policy Gradient Theorem
For any differentiable policy πθ(s, a), for any of the policy objective
functions, the policy gradient is
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]
Monte-Carlo Policy Gradient (REINFORCE)
– using the return G_t as an unbiased sample of Q^{π_θ}(s_t, a_t)
Δθ_t = α ∇_θ log π_θ(s_t, a_t) G_t
Reducing variance using a baseline
– A good baseline is the state value function V^{π_θ}(s)
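A minimal REINFORCE sketch for a linear softmax policy in NumPy (illustrative; the feature representation, array shapes, and learning rate are assumptions, not the paper's setup):

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """theta: (num_actions, feat_dim) weights of pi(a|s) = softmax(theta @ s).
    episode: list of (s, a, r); applies Delta theta = alpha * grad log pi * G_t."""
    returns, G = [], 0.0
    for (_, _, r) in reversed(episode):    # compute G_t for every step
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        probs = softmax(theta @ s)          # pi_theta(.|s)
        grad_log_pi = -np.outer(probs, s)   # d log pi(a|s) / d theta
        grad_log_pi[a] += s
        theta += alpha * grad_log_pi * G_t  # REINFORCE ascent step
    return theta
```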
Preliminaries
Actor-critic algorithm
– Critic: updates Q-function parameters w
error = E_{π_θ} [ (Q^{π_θ}(s, a) − Q_w(s, a))^2 ]
– Actor: updates policy parameters θ, in the direction suggested by the critic
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) Q_w(s, a) ]
Reducing variance using a baseline: Advantage function
– A good baseline is the state value function V^{π_θ}(s)
– Advantage function:
A^{π_θ}(s, a) = Q^{π_θ}(s, a) − V^{π_θ}(s)
– Rewriting the policy gradient using the advantage function:
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) A^{π_θ}(s, a) ]
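To make the baseline idea concrete, a small sketch that turns Monte-Carlo returns into advantage estimates G_t − V(s_t), which would replace G_t in the REINFORCE update above (illustrative; the value table `V` is an assumption):

```python
def advantage_estimates(episode, V, gamma=0.99):
    """A_t ~ G_t - V(s_t): Monte-Carlo return minus a state-value baseline.
    episode: list of (s, a, r); V: mapping state -> value estimate."""
    advantages, G = [], 0.0
    for (s, _, r) in reversed(episode):
        G = r + gamma * G
        advantages.append(G - V[s])   # subtracting the baseline keeps the gradient
    advantages.reverse()              # unbiased while lowering its variance
    return advantages
```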
III. Reward-Conditioned Policies
Reward-Conditioned Policies
RCPs algorithm (left) and architecture (right)
– Z can be the return (RCP-R) or the advantage (RCP-A)
– Z can be incorporated in the form of multiplicative interactions (π_θ(a|s, Z))
– p̂_k(Z) is represented as a Gaussian distribution, and µ_Z and σ_Z are updated based on the soft-maximum, i.e. log Σ exp, of the target values Z observed so far in the dataset D
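Reading the algorithm box as pseudocode, the outer loop might look like the sketch below (plain Python, my reading of the slides rather than the authors' code; `rollout`, `compute_targets`, `policy.fit`, and the exact soft-maximum update are stand-ins):

```python
import numpy as np

def train_rcp(policy, env, rollout, compute_targets, num_iters=2000, beta=1.0):
    """Sketch of the RCP outer loop (RCP-R flavour); helper callables are stand-ins."""
    dataset = []                                  # stores (s, a, Z) tuples
    mu_Z, sigma_Z = 0.0, 1.0                      # parameters of the target distribution p_hat(Z)
    for _ in range(num_iters):
        # 1) sample a target value and roll out the policy conditioned on it
        Z_hat = np.random.normal(mu_Z, sigma_Z)
        trajectory = rollout(env, policy, Z_hat)
        # 2) relabel each step with the value Z actually achieved
        #    (Z = return for RCP-R, advantage for RCP-A)
        dataset.extend(compute_targets(trajectory))
        # 3) supervised regression: maximize log pi_theta(a | s, Z) over the dataset
        policy.fit(dataset)
        # 4) update p_hat(Z) toward a soft-maximum of observed targets
        #    (numerically naive log-mean-exp; a stand-in for the paper's exact rule)
        Zs = np.array([Z for (_, _, Z) in dataset])
        mu_Z = beta * np.log(np.mean(np.exp(Zs / beta)))
        sigma_Z = float(np.std(Zs))
    return policy
```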
Theoretical Motivation for RCPs
Derivation of the two variants of RCPs:
– RCP-R: use Z as the return
– RCP-A: use Z as the advantage
RCP-R
Constrained Optimization
arg max_π E_{τ, Z ∼ p_π(τ, Z)} [Z]
s.t. D_KL( p_π(τ, Z) ‖ p_µ(τ, Z) ) ≤ ε
Forming the Lagrangian of the constrained optimization with Lagrange multiplier β,
L(π, β) = E_{τ, Z ∼ p_π(τ, Z)} [Z] + β ( ε − E_{τ, Z ∼ p_π(τ, Z)} [ log ( p_π(τ, Z) / p_µ(τ, Z) ) ] )
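The jump from this Lagrangian to the exponential form on the next slide is spelled out in AWR Appendix A; a compact sketch of the stationarity step (treating p_π(τ, Z) as a free density with a normalization constraint and multiplier λ, which is my summary rather than the paper's exact derivation) is:

```latex
% Stationarity of the Lagrangian w.r.t. the free density p_\pi(\tau, Z),
% with normalization constraint \int p_\pi \, d\tau \, dZ = 1 (multiplier \lambda):
\frac{\partial \mathcal{L}}{\partial p_\pi(\tau, Z)}
  = Z - \beta \left( \log \frac{p_\pi(\tau, Z)}{p_\mu(\tau, Z)} + 1 \right) + \lambda = 0
\quad\Longrightarrow\quad
p_{\pi^*}(\tau, Z) \propto p_\mu(\tau, Z) \exp\!\left( \tfrac{Z}{\beta} \right).
```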
Theoretical Motivation for RCPs
Constrained Optimization
Differentiating L(π, β) with respect to π and β and applying the optimality conditions, we obtain a non-parametric form for the joint trajectory-return distribution of the optimal policy, p_{π*}(τ, Z) (see AWR Appendix A):
p_{π*}(τ, Z) ∝ p_µ(τ, Z) exp( Z / β )
Decomposing the joint distribution p_π(τ, Z) into the conditionals p_π(Z) and p_π(τ|Z):
p_{π*}(τ|Z) p_{π*}(Z) ∝ [ p_µ(τ|Z) p_µ(Z) ] exp( Z / β )
Theoretical Motivation for RCPs
Constrained Optimization
p_{π*}(τ|Z) ∝ p_µ(τ|Z) → corresponds to Line 9 of the algorithm
p_{π*}(Z) ∝ p_µ(Z) exp( Z / β ) → corresponds to Line 10 of the algorithm
Theoretical Motivation for RCPs
Maximum likelihood estimation
Factorizing p_π(τ|Z) as p_π(τ|Z) = Π_t π(a_t|s_t, Z) p(s_{t+1}|s_t, a_t), the parametric policy π_θ(a|s, Ẑ) is trained by projecting the optimal non-parametric policy p_{π*} computed above onto the manifold of parametric policies, according to
π_θ(a|s, Z) = arg min_θ E_{Z∼D} [ D_KL( p_{π*}(τ|Z) ‖ p_{π_θ}(τ|Z) ) ]
= arg max_θ E_{Z∼D} E_{a ∼ µ(a|s, Ẑ)} [ log π_θ(a|s, Z) ]
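As a supervised-learning view of this projection, a minimal sketch of the resulting loss, assuming a Gaussian policy head π_θ(a|s, Z) (the Gaussian head is my assumption for illustration; the paper's architecture may differ):

```python
import numpy as np

def gaussian_nll(action, mean, log_std):
    """-log N(action; mean, exp(log_std)^2), summed over action dimensions."""
    var = np.exp(2.0 * log_std)
    return 0.5 * np.sum((action - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))

def rcp_projection_loss(policy_head, batch):
    """Monte-Carlo estimate of -E_{(s,a,Z)~D}[ log pi_theta(a | s, Z) ].
    policy_head(s, Z) -> (mean, log_std); batch is a list of (s, a, Z) tuples."""
    losses = [gaussian_nll(a, *policy_head(s, Z)) for s, a, Z in batch]
    return float(np.mean(losses))
```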
Theoretical motivation of RCP-A (see Section 4.3.2 of the paper)
For RCP-A, a new sample for Z is drawn at each time step, while for RCP-R, a sample for the return Z is drawn once for the whole trajectory (Line 5 of the algorithm)
IV. Experimental Evaluation
Experimental Evaluation
– Results are averaged across 5 random seeds
– Comparison to RL benchmarks: on-policy (TRPO, PPO) and off-policy (SAC, DDPG)
– AWR: an off-policy RL method that also utilizes supervised learning as a subroutine, but does not condition on rewards and requires an exponential weighting scheme during training
Experimental Evaluation
– Heatmap: relationship between the target value Ẑ and the observed values of Z after 2,000 training iterations, for both RCP variants
V. Discussion and Future Work
Discussion and Future work
Proposes a general class of algorithms that enables learning control policies with standard supervised learning approaches
Sub-optimal trajectories can be regarded as optimal supervision for a
policy that does not aim to attain the largest possible reward, but
rather to match the reward of that trajectory
By then conditioning the policy on the reward, we can train a single
model to simultaneously represent policies for all possible reward
values, and generalize to larger reward values
Discussion and Future work
Limitations
– Sample efficiency and final performance still lag behind the best and most efficient approximate dynamic programming methods (SAC, DDPG, etc.)
– Sometimes the reward-conditioned policies generalize successfully, and sometimes they do not
– Main challenge for these variants: exploration?
References
– Xue Bin Peng et al., "Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning", arXiv, 2019
– Jan Peters and Stefan Schaal, "Reinforcement Learning by Reward-Weighted Regression for Operational Space Control", ICML 2007
– RL Course by David Silver, DeepMind
Thank you for your attention!