Reward-Conditioned Policies
Aviral Kumar, Xue Bin Peng, Sergey Levine, 2019
Changhoon, Kevin Jeong
Seoul National University
chjeong@bi.snu.ac.kr
June 7, 2020
Contents
I. Motivation
II. Preliminaries
III. Reward-Conditioned Policies
IV. Experimental Evaluation
V. Discussion and Future Work
I. Motivation
Motivation
Supervised Learning
– Learns from existing, given sample data or examples
– Direct (labeled) feedback is provided
– Commonly used and well understood
Reinforcement Learning
– Learns by interacting with the environment
– Concerns sequential decision making (e.g. games, robotics)
– RL algorithms can be brittle, difficult to use and tune
Can we learn effective policies via supervised learning?
Motivation
One possible approach: imitation learning
– Behavioural cloning, direct policy learning, inverse RL, etc.
– Imitation learning uses standard, well-understood supervised learning methods
– But it requires near-optimal expert data in advance
So, can we learn effective policies via supervised learning without demonstrations?
– Non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision
– Not for maximizing the reward, but for matching the reward of the given trajectory
II. Preliminaries
Preliminaries
Reinforcement Learning
Objective
J(θ) = E_{s_0 ∼ p(s_0), a_{0:∞} ∼ π, s_{t+1} ∼ p(·|s_t, a_t)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]
– Policy-based: compute the derivative of J(θ) w.r.t. the policy parameters θ
– Value-based: estimate the value (or Q) function by means of temporal-difference learning
– How can we avoid high-variance policy gradient estimators, as well as the complexity of temporal-difference learning?
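As a rough illustration of the objective above, the sketch below estimates J(θ) by averaging discounted returns over sampled rollouts. It is a minimal sketch assuming a hypothetical environment interface (`env.reset`, `env.step`) and a `policy` callable, not the paper's implementation.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for a single rollout."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, n_rollouts=100, gamma=0.99):
    """Monte-Carlo estimate of J(theta) = E[ sum_t gamma^t r(s_t, a_t) ]."""
    returns = []
    for _ in range(n_rollouts):
        s, done, rewards = env.reset(), False, []
        while not done:
            a = policy(s)               # a_t ~ pi(.|s_t)
            s, r, done = env.step(a)    # s_{t+1} ~ p(.|s_t, a_t), reward r(s_t, a_t)
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```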
Preliminaries
Monte-Carlo update
V(S_t) ← V(S_t) + α (G_t − V(S_t)),
where G_t = Σ_{k=0}^∞ γ^k r(s_{t+k}, a_{t+k})
– Pros: unbiased, good convergence properties
– Cons: high variance
Temporal-Difference update
V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t))
– Pros: learn online every step, low variance
– Cons: bootstraps from an estimate, so the update is biased
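A minimal tabular sketch of the two updates above, assuming a dictionary-backed value table and transition data in the simple formats noted in the comments (not tied to any particular environment):

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value estimates V[s]
alpha, gamma = 0.1, 0.99

def mc_update(episode):
    """Monte-Carlo: move each visited state toward the full return G_t (unbiased, high variance)."""
    G = 0.0
    for s, r in reversed(episode):      # episode = [(s_0, r_0), (s_1, r_1), ...]
        G = r + gamma * G               # G_t = r_t + gamma * G_{t+1}
        V[s] += alpha * (G - V[s])

def td_update(s, r, s_next):
    """TD(0): bootstrap from the current estimate of the next state (biased, low variance)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```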
Preliminaries
Function Approximation: Policy Gradient
Policy Gradient Theorem
For any differentiable policy πθ(s, a), for any of the policy objective
functions, the policy gradient is
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]
Monte-Carlo Policy Gradient (REINFORCE)
– Uses the return G_t as an unbiased sample of Q^{π_θ}(s_t, a_t):
∆θ_t = α ∇_θ log π_θ(s_t, a_t) G_t
Reducing variance using a baseline
– A good baseline is the state value function V^{π_θ}(s)
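A minimal REINFORCE-with-baseline sketch in PyTorch for a discrete-action policy; `policy_net` and `value_net` are hypothetical user-defined modules producing action logits and state values, and the batch tensors are assumed to come from complete episodes:

```python
import torch

def reinforce_loss(policy_net, value_net, states, actions, returns):
    """Surrogate loss whose gradient is -E[ grad log pi(a|s) * (G_t - V(s)) ]."""
    logits = policy_net(states)                                       # (T, num_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    baseline = value_net(states).squeeze(-1)                          # V(s_t)
    advantages = returns - baseline.detach()                          # G_t - V(s_t); no grad through baseline
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - baseline).pow(2).mean()                   # fit the baseline by regression
    return policy_loss + value_loss
```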
Preliminaries
Actor-critic algorithm
– Critic: updates Q-function parameters w by minimizing
error = E_{π_θ} [ (Q^{π_θ}(s, a) − Q_w(s, a))^2 ]
– Actor: updates policy parameters θ in the direction suggested by the critic
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) Q_w(s, a) ]
Reducing variance using a baseline: Advantage function
– A good baseline is the state value function V^{π_θ}(s)
– Advantage function: A^{π_θ}(s, a) = Q^{π_θ}(s, a) − V^{π_θ}(s)
– Rewriting the policy gradient using the advantage function:
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) A^{π_θ}(s, a) ]
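A one-step actor-critic sketch matching the advantage form above, with the advantage approximated by the TD error r + γ V(s') − V(s); again the networks are hypothetical stand-ins for a discrete-action task:

```python
import torch

def actor_critic_losses(policy_net, value_net, s, a, r, s_next, done, gamma=0.99):
    """One-step actor-critic: A(s, a) is estimated by the TD error."""
    v = value_net(s).squeeze(-1)
    v_next = value_net(s_next).squeeze(-1)
    td_target = r + gamma * v_next.detach() * (1.0 - done)
    advantage = td_target - v                                # ~ A^pi(s, a): biased but low variance
    log_prob = torch.distributions.Categorical(logits=policy_net(s)).log_prob(a)
    actor_loss = -(log_prob * advantage.detach()).mean()     # grad_theta J = E[grad log pi * A]
    critic_loss = advantage.pow(2).mean()                    # regress V toward the TD target
    return actor_loss, critic_loss
```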
III. Reward-Conditioned Policies
Reward-Conditioned Policies
RCPs algorithm (left) and architecture (right) (figure not shown)
– Z can be the return (RCP-R) or the advantage (RCP-A)
– Z can be incorporated in the form of multiplicative interactions (π_θ(a|s, Z))
– p̂_k(Z) is represented as a Gaussian distribution, and µ_Z and σ_Z are updated based on the soft-maximum (i.e. log-sum-exp) of the target values Z observed so far in the dataset D
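A condensed sketch of the training loop described above (sample a target value Ẑ, roll out the Z-conditioned policy, relabel the data with achieved values, do a supervised update, and move p̂(Z) toward the soft-maximum of observed targets). The helper names (`rollout`, `policy.act`, `policy.supervised_update`, `dataset.*`) are hypothetical, and the details follow the paper only loosely:

```python
import numpy as np

def train_rcp(env, policy, dataset, n_iters=1000, batch_size=256, beta=1.0):
    """Reward-conditioned policy training as plain supervised learning on relabeled data."""
    mu_Z, sigma_Z = 0.0, 1.0                                     # parameters of the Gaussian p_hat(Z)
    for _ in range(n_iters):
        # 1. Sample a target value and roll out the Z-conditioned policy
        Z_target = np.random.normal(mu_Z, sigma_Z)
        traj = rollout(env, lambda s: policy.act(s, Z_target))   # hypothetical helper
        dataset.add(traj)                                        # relabel with achieved return/advantage Z

        # 2. Supervised update: maximize log pi_theta(a | s, Z) on relabeled batches
        s, a, Z = dataset.sample(batch_size)
        policy.supervised_update(s, a, Z)

        # 3. Update p_hat(Z) using the soft-maximum (log-sum-exp style) of observed targets
        Z_all = dataset.all_targets()
        mu_Z = beta * np.log(np.mean(np.exp(Z_all / beta)))
        sigma_Z = float(np.std(Z_all))
    return policy
```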
Theoretical Motivation for RCPs
Derivation of the two variants of RCPs:
– RCP-R: use Z as the return
– RCP-A: use Z as the advantage
RCP-R
Constrained Optimization
arg max_π E_{τ,Z ∼ p_π(τ,Z)} [Z]
s.t. D_KL( p_π(τ, Z) || p_µ(τ, Z) ) ≤ ε
Forming the Lagrangian of the constrained problem with Lagrange multiplier β gives
L(π, β) = E_{τ,Z ∼ p_π(τ,Z)} [Z] + β ( ε − E_{τ,Z ∼ p_π(τ,Z)} [ log ( p_π(τ, Z) / p_µ(τ, Z) ) ] )
Theoretical Motivation for RCPs
Constrained Optimization
Differentiating L(π, β) with respect to π and β and applying the optimality conditions yields a non-parametric form for the joint trajectory-return distribution of the optimal policy, p_{π*}(τ, Z) (see AWR Appendix A):
p_{π*}(τ, Z) ∝ p_µ(τ, Z) exp(Z / β)
Decomposing the joint distribution p_π(τ, Z) into the conditionals p_π(Z) and p_π(τ|Z):
p_{π*}(τ|Z) p_{π*}(Z) ∝ [ p_µ(τ|Z) p_µ(Z) ] exp(Z / β)
Theoretical Motivation for RCPs
Constrained Optimization
p_{π*}(τ|Z) ∝ p_µ(τ|Z) → corresponds to Line 9 of the algorithm
p_{π*}(Z) ∝ p_µ(Z) exp(Z / β) → corresponds to Line 10 of the algorithm
Theoretical Motivation for RCPs
Maximum likelihood estimation
Factorize p_π(τ|Z) as p_π(τ|Z) = Π_t π(a_t|s_t, Z) p(s_{t+1}|s_t, a_t). To train a parametric policy π_θ(a|s, Ẑ), project the optimal non-parametric policy p_{π*} computed above onto the manifold of parametric policies:
π_θ(a|s, Z) = arg min_θ E_{Z∼D} [ D_KL( p_{π*}(τ|Z) || p_{π_θ}(τ|Z) ) ]
            = arg max_θ E_{Z∼D} E_{a∼µ(a|s, Ẑ)} [ log π_θ(a|s, Z) ]
Theoretical motivation of RCP-A (see Section 4.3.2 of the paper)
– For RCP-A, a new sample of Z is drawn at each time step, while for RCP-R, a sample of the return Z is drawn once for the whole trajectory (Line 5 of the algorithm)
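In practice the projection above reduces to a maximum-likelihood (behavioural-cloning style) objective on the relabeled dataset. A minimal PyTorch sketch for a discrete-action policy, where `policy_net` is a hypothetical network taking (s, Z); continuous actions would use a Gaussian log-likelihood instead:

```python
import torch

def rcp_supervised_loss(policy_net, states, actions, targets_Z):
    """Maximize log pi_theta(a | s, Z) over relabeled data, where Z is the return
    (RCP-R) or the per-step advantage (RCP-A) actually achieved in the dataset."""
    logits = policy_net(states, targets_Z)       # policy conditioned on the target value Z
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -log_probs.mean()                     # negative log-likelihood
```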
IV. Experimental Evaluation
Experimental Evaluation
– Results are averaged across 5 random seeds
– Comparison to RL baselines: on-policy (TRPO, PPO) and off-policy (SAC, DDPG)
– AWR: an off-policy RL method that also uses supervised learning as a subroutine, but does not condition on rewards and requires an exponential weighting scheme during training
Experimental Evaluation
– Heatmap: relationship between the target value Ẑ and the observed values of Z after 2,000 training iterations, for both RCP variants
V. Discussion and Future Work
Discussion and Future work
Proposes a general class of algorithms that enables learning control policies with standard supervised learning approaches
Sub-optimal trajectories can be regarded as optimal supervision for a policy that does not aim to attain the largest possible reward, but rather to match the reward of that trajectory
By conditioning the policy on the reward, a single model can be trained to simultaneously represent policies for all possible reward values, and to generalize to larger reward values
Discussion and Future work
Limitations
– Sample efficiency and final performance still lag behind the best and most efficient approximate dynamic programming methods (SAC, DDPG, etc.)
– Reward-conditioned policies sometimes generalize successfully and sometimes do not
– Main challenge for these variants: exploration?
References
– Xue Bin Peng, et al., "Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning," 2019
– Jan Peters, et al., "Reinforcement Learning by Reward-Weighted Regression for Operational Space Control," ICML 2007
– RL Course by David Silver, DeepMind
Thank you for your attention!