Reward-Conditioned Policies
Aviral Kumar, Xue Bin Peng, Sergey Levine, 2019
Changhoon, Kevin Jeong
Seoul National University
chjeong@bi.snu.ac.kr
June 7, 2020
Contents
I. Motivation
II. Preliminaries
III. Reward-Conditioned Policies
IV. Experimental Evaluation
V. Discussion and Future Work
I. Motivation
Motivation
Supervised Learning
– Learns from existing, given sample data or examples
– Direct (labeled) feedback is provided
– Commonly used and well understood
Reinforcement Learning
– Learns by interacting with the environment
– Concerns sequential decision making (e.g. games, robotics)
– RL algorithms can be brittle, difficult to use and tune
Can we learn effective policies via supervised learning?
Motivation
One possible approach: imitation learning
– Behavioural cloning, direct policy learning, inverse RL, etc.
– Imitation learning uses standard, well-understood supervised learning methods
– But it requires near-optimal expert data in advance
So, can we learn effective policies via supervised learning without demonstrations?
– Non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision
– Not for maximizing the reward, but for matching the reward of the given trajectory
II. Preliminaries
Preliminaries
Reinforcement Learning
Objective
J(θ) = E_{s_0 ∼ p(s_0), a_{0:∞} ∼ π, s_{t+1} ∼ p(·|s_t, a_t)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]
– Policy-based: compute the derivative of J(θ) w.r.t. the policy parameters θ
– Value-based: estimate the value (or Q) function by means of temporal-difference learning
– How can we avoid high-variance policy gradient estimators, as well as the complexity of temporal-difference learning?
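As a rough illustration of the objective above, the sketch below estimates J(θ) by averaging discounted returns over sampled rollouts. It is a minimal sketch assuming a hypothetical environment interface (`env.reset`, `env.step`) and a `policy` callable, not the paper's implementation.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for a single rollout."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, n_rollouts=100, gamma=0.99):
    """Monte-Carlo estimate of J(theta) = E[ sum_t gamma^t r(s_t, a_t) ]."""
    returns = []
    for _ in range(n_rollouts):
        s, done, rewards = env.reset(), False, []
        while not done:
            a = policy(s)               # a_t ~ pi(.|s_t)
            s, r, done = env.step(a)    # s_{t+1} ~ p(.|s_t, a_t), reward r(s_t, a_t)
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```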
Preliminaries
Monte-Carlo update
V(S_t) ← V(S_t) + α (G_t − V(S_t)),
where G_t = Σ_{k=0}^∞ γ^k r(s_{t+k}, a_{t+k})
– Pros: unbiased, good convergence properties
– Cons: high variance
Temporal-Difference update
V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t))
– Pros: learn online every step, low variance
– Cons: bootstraps from an estimate, so the update is biased
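A minimal tabular sketch of the two updates above, assuming a dictionary-backed value table and transition data in the simple formats noted in the comments (not tied to any particular environment):

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value estimates V[s]
alpha, gamma = 0.1, 0.99

def mc_update(episode):
    """Monte-Carlo: move each visited state toward the full return G_t (unbiased, high variance)."""
    G = 0.0
    for s, r in reversed(episode):      # episode = [(s_0, r_0), (s_1, r_1), ...]
        G = r + gamma * G               # G_t = r_t + gamma * G_{t+1}
        V[s] += alpha * (G - V[s])

def td_update(s, r, s_next):
    """TD(0): bootstrap from the current estimate of the next state (biased, low variance)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```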
Preliminaries
Function Approximation: Policy Gradient
Policy Gradient Theorem
For any differentiable policy πθ(s, a), for any of the policy objective
functions, the policy gradient is
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]
Monte-Carlo Policy Gradient (REINFORCE)
– Uses the return G_t as an unbiased sample of Q^{π_θ}(s_t, a_t):
∆θ_t = α ∇_θ log π_θ(s_t, a_t) G_t
Reducing variance using a baseline
– A good baseline is the state value function V^{π_θ}(s)
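A minimal REINFORCE-with-baseline sketch in PyTorch for a discrete-action policy; `policy_net` and `value_net` are hypothetical user-defined modules producing action logits and state values, and the batch tensors are assumed to come from complete episodes:

```python
import torch

def reinforce_loss(policy_net, value_net, states, actions, returns):
    """Surrogate loss whose gradient is -E[ grad log pi(a|s) * (G_t - V(s)) ]."""
    logits = policy_net(states)                                       # (T, num_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    baseline = value_net(states).squeeze(-1)                          # V(s_t)
    advantages = returns - baseline.detach()                          # G_t - V(s_t); no grad through baseline
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - baseline).pow(2).mean()                   # fit the baseline by regression
    return policy_loss + value_loss
```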
Preliminaries
Actor-critic algorithm
– Critic: updates Q-function parameters w by minimizing
error = E_{π_θ} [ (Q^{π_θ}(s, a) − Q_w(s, a))^2 ]
– Actor: updates policy parameters θ in the direction suggested by the critic
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) Q_w(s, a) ]
Reducing variance using a baseline: Advantage function
– A good baseline is the state value function V^{π_θ}(s)
– Advantage function: A^{π_θ}(s, a) = Q^{π_θ}(s, a) − V^{π_θ}(s)
– Rewriting the policy gradient using the advantage function:
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(s, a) A^{π_θ}(s, a) ]
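A one-step actor-critic sketch matching the advantage form above, with the advantage approximated by the TD error r + γ V(s') − V(s); again the networks are hypothetical stand-ins for a discrete-action task:

```python
import torch

def actor_critic_losses(policy_net, value_net, s, a, r, s_next, done, gamma=0.99):
    """One-step actor-critic: A(s, a) is estimated by the TD error."""
    v = value_net(s).squeeze(-1)
    v_next = value_net(s_next).squeeze(-1)
    td_target = r + gamma * v_next.detach() * (1.0 - done)
    advantage = td_target - v                                # ~ A^pi(s, a): biased but low variance
    log_prob = torch.distributions.Categorical(logits=policy_net(s)).log_prob(a)
    actor_loss = -(log_prob * advantage.detach()).mean()     # grad_theta J = E[grad log pi * A]
    critic_loss = advantage.pow(2).mean()                    # regress V toward the TD target
    return actor_loss, critic_loss
```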
III. Reward-Conditioned Policies
Reward-Conditioned Policies
RCPs algorithm (left) and architecture (right) (figure not shown)
– Z can be the return (RCP-R) or the advantage (RCP-A)
– Z can be incorporated in the form of multiplicative interactions (π_θ(a|s, Z))
– p̂_k(Z) is represented as a Gaussian distribution, and µ_Z and σ_Z are updated based on the soft-maximum (i.e. log-sum-exp) of the target values Z observed so far in the dataset D
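A condensed sketch of the training loop described above (sample a target value Ẑ, roll out the Z-conditioned policy, relabel the data with achieved values, do a supervised update, and move p̂(Z) toward the soft-maximum of observed targets). The helper names (`rollout`, `policy.act`, `policy.supervised_update`, `dataset.*`) are hypothetical, and the details follow the paper only loosely:

```python
import numpy as np

def train_rcp(env, policy, dataset, n_iters=1000, batch_size=256, beta=1.0):
    """Reward-conditioned policy training as plain supervised learning on relabeled data."""
    mu_Z, sigma_Z = 0.0, 1.0                                     # parameters of the Gaussian p_hat(Z)
    for _ in range(n_iters):
        # 1. Sample a target value and roll out the Z-conditioned policy
        Z_target = np.random.normal(mu_Z, sigma_Z)
        traj = rollout(env, lambda s: policy.act(s, Z_target))   # hypothetical helper
        dataset.add(traj)                                        # relabel with achieved return/advantage Z

        # 2. Supervised update: maximize log pi_theta(a | s, Z) on relabeled batches
        s, a, Z = dataset.sample(batch_size)
        policy.supervised_update(s, a, Z)

        # 3. Update p_hat(Z) using the soft-maximum (log-sum-exp style) of observed targets
        Z_all = dataset.all_targets()
        mu_Z = beta * np.log(np.mean(np.exp(Z_all / beta)))
        sigma_Z = float(np.std(Z_all))
    return policy
```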
Theoretical Motivation for RCPs
Derivation of the two variants of RCPs:
– RCP-R: use Z as the return
– RCP-A: use Z as the advantage
RCP-R
Constrained Optimization
arg max_π E_{τ,Z ∼ p_π(τ,Z)} [Z]
s.t. D_KL( p_π(τ, Z) || p_µ(τ, Z) ) ≤ ε
Forming the Lagrangian of the constrained problem with Lagrange multiplier β gives
L(π, β) = E_{τ,Z ∼ p_π(τ,Z)} [Z] + β ( ε − E_{τ,Z ∼ p_π(τ,Z)} [ log ( p_π(τ, Z) / p_µ(τ, Z) ) ] )
Theoretical Motivation for RCPs
Constrained Optimization
Differentiating L(π, β) with respect to π and β and applying the optimality conditions yields a non-parametric form for the joint trajectory-return distribution of the optimal policy, p_{π*}(τ, Z) (see AWR Appendix A):
p_{π*}(τ, Z) ∝ p_µ(τ, Z) exp(Z / β)
Decomposing the joint distribution p_π(τ, Z) into the conditionals p_π(Z) and p_π(τ|Z):
p_{π*}(τ|Z) p_{π*}(Z) ∝ [ p_µ(τ|Z) p_µ(Z) ] exp(Z / β)
Theoretical Motivation for RCPs
Constrained Optimization
p_{π*}(τ|Z) ∝ p_µ(τ|Z) → corresponds to Line 9 of the algorithm
p_{π*}(Z) ∝ p_µ(Z) exp(Z / β) → corresponds to Line 10 of the algorithm
Theoretical Motivation for RCPs
Maximum likelihood estimation
Factorize p_π(τ|Z) as p_π(τ|Z) = Π_t π(a_t|s_t, Z) p(s_{t+1}|s_t, a_t). To train a parametric policy π_θ(a|s, Ẑ), project the optimal non-parametric policy p_{π*} computed above onto the manifold of parametric policies:
π_θ(a|s, Z) = arg min_θ E_{Z∼D} [ D_KL( p_{π*}(τ|Z) || p_{π_θ}(τ|Z) ) ]
            = arg max_θ E_{Z∼D} E_{a∼µ(a|s, Ẑ)} [ log π_θ(a|s, Z) ]
Theoretical motivation of RCP-A (see Section 4.3.2 of the paper)
– For RCP-A, a new sample of Z is drawn at each time step, while for RCP-R, a sample of the return Z is drawn once for the whole trajectory (Line 5 of the algorithm)
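In practice the projection above reduces to a maximum-likelihood (behavioural-cloning style) objective on the relabeled dataset. A minimal PyTorch sketch for a discrete-action policy, where `policy_net` is a hypothetical network taking (s, Z); continuous actions would use a Gaussian log-likelihood instead:

```python
import torch

def rcp_supervised_loss(policy_net, states, actions, targets_Z):
    """Maximize log pi_theta(a | s, Z) over relabeled data, where Z is the return
    (RCP-R) or the per-step advantage (RCP-A) actually achieved in the dataset."""
    logits = policy_net(states, targets_Z)       # policy conditioned on the target value Z
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -log_probs.mean()                     # negative log-likelihood
```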
IV. Experimental Evaluation
Experimental Evaluation
– Results are averaged across 5 random seeds
– Comparison to RL baselines: on-policy (TRPO, PPO) and off-policy (SAC, DDPG)
– AWR: an off-policy RL method that also uses supervised learning as a subroutine, but does not condition on rewards and requires an exponential weighting scheme during training
Experimental Evaluation
– Heatmap: relationship between the target value Ẑ and the observed values of Z after 2,000 training iterations, for both RCP variants
V. Discussion and Future Work
Discussion and Future work
Proposes a general class of algorithms that enables learning control policies with standard supervised learning approaches
Sub-optimal trajectories can be regarded as optimal supervision for a policy that does not aim to attain the largest possible reward, but rather to match the reward of that trajectory
By conditioning the policy on the reward, a single model can be trained to simultaneously represent policies for all possible reward values, and to generalize to larger reward values
Discussion and Future work
Limitations
– Sample efficiency and final performance still lag behind the best and most efficient approximate dynamic programming methods (SAC, DDPG, etc.)
– Reward-conditioned policies sometimes generalize successfully and sometimes do not
– Main challenge for these variants: exploration?
References
– Xue Bin Peng, et al., "Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning," 2019
– Jan Peters, et al., "Reinforcement Learning by Reward-Weighted Regression for Operational Space Control," ICML 2007
– RL Course by David Silver, DeepMind
Thank you for your attention!