Natural Policy Gradients, TRPO, PPO
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon
School of Computer Science
CMU 10703
Part of the slides adapted from John Schulman and Joshua Achiam
Stochastic policies

Continuous actions: usually a multivariate Gaussian, a ∼ N(μθ(s), σθ²(s)), with mean μθ(s) and variance σθ²(s) produced by a network with parameters θ.

Discrete actions: almost always a categorical distribution, a ∼ Cat(pθ(s)), with action probabilities pθ(s) produced by a network with parameters θ.
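As a minimal sketch (assuming PyTorch; the layer sizes are arbitrary), the two policy heads might look like:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Continuous actions: a ~ N(mu_theta(s), sigma_theta(s)^2)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mu = nn.Linear(state_dim, action_dim)               # mean head
        self.log_sigma = nn.Parameter(torch.zeros(action_dim))   # state-independent log-std

    def forward(self, s):
        dist = torch.distributions.Normal(self.mu(s), self.log_sigma.exp())
        a = dist.sample()
        return a, dist.log_prob(a).sum(-1)                       # log pi_theta(a|s)

class CategoricalPolicy(nn.Module):
    """Discrete actions: a ~ Cat(p_theta(s))."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.logits = nn.Linear(state_dim, n_actions)

    def forward(self, s):
        dist = torch.distributions.Categorical(logits=self.logits(s))
        a = dist.sample()
        return a, dist.log_prob(a)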
Policy Gradients

Monte Carlo Policy Gradients (REINFORCE), gradient direction:

ĝ = Ê_t[∇θ log πθ(a_t|s_t) Â_t]

We can differentiate the following loss,

L^PG(θ) = Ê_t[log πθ(a_t|s_t) Â_t],

but we don't want to optimize it too far. Equivalently, we can differentiate the importance-sampled loss

L^IS_θold(θ) = Ê_t[(πθ(a_t|s_t) / πθold(a_t|s_t)) Â_t]

at θ = θold, where state-actions are sampled using θold (IS = importance sampling). This is just the chain rule: ∇θ log f(θ)|θold = ∇θ f(θ)|θold / f(θold).

Actor-Critic Policy Gradient: ĝ = Ê_t[∇θ log πθ(a_t|s_t) A_w(s_t, a_t)]

(Figure: Gaussian policy before and after an update: μθ(s), σθ(s) → μθnew(s), σθnew(s).)

The vanilla policy gradient loop, with update θnew = θ + ϵ · ĝ:
1. Collect trajectories for policy πθ
2. Estimate advantages Â
3. Compute policy gradient ĝ
4. Update policy parameters
5. GOTO 1 (a sketch of one iteration follows below)

This lecture is all about the stepsize ϵ.
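A minimal sketch of steps 3–4 (assumes PyTorch and the categorical policy head above; the state/action/advantage tensors stand in for collected rollouts):

import torch

def vanilla_pg_step(policy, optimizer, states, actions, advantages):
    """Tensors collected by rolling out the current policy pi_theta."""
    dist = torch.distributions.Categorical(logits=policy.logits(states))
    # L^PG(theta) = E_t[ log pi_theta(a_t|s_t) * A_hat_t ]; we ascend it by
    # minimizing its negation. Advantage estimates are treated as constants.
    loss = -(dist.log_prob(actions) * advantages.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # one small step: we don't want to optimize this loss too far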
What is the underlying objective function?

Policy gradients:

ĝ ≈ (1/N) Σ_{i=1}^N Σ_{t=1}^T ∇θ log πθ(a_t^(i)|s_t^(i)) A(s_t^(i), a_t^(i)),   τ_i ∼ πθ

What is our objective? The gradient above results from differentiating the objective function

J^PG(θ) = (1/N) Σ_{i=1}^N Σ_{t=1}^T log πθ(a_t^(i)|s_t^(i)) A(s_t^(i), a_t^(i)),   τ_i ∼ πθ

Is this our objective? We cannot both maximize over a variable and sample from it. Moreover, we cannot optimize it too far: our advantage estimates come from samples of πθold, yet this constraint of "do not optimize too far from θold" does not appear anywhere in the objective.

Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we have access to expert actions; then the loss function we want to optimize is

J^SL(θ) = (1/N) Σ_{i=1}^N Σ_{t=1}^T log πθ(ã_t^(i)|s_t^(i)),   τ_i ∼ π*,

which maximizes the probability of expert actions in the training set.

Is this our SL objective? As a matter of fact we care about test error, but that is a long story; the short answer is yes, this is good enough for us to optimize if we add regularization.
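To make the parallel concrete, here is a minimal side-by-side sketch (assumes PyTorch; the input tensors are hypothetical batches of log-probabilities and advantages):

import torch

def j_pg(log_probs, advantages):
    # Policy-gradient objective: log pi_theta(a|s) weighted by advantage
    # estimates computed from the policy's own rollouts.
    return (log_probs * advantages.detach()).mean()

def j_sl(expert_log_probs):
    # Supervised / MLE objective: plain average log-likelihood of expert actions.
    return expert_log_probs.mean()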
This lecture is also about writing down an objective that we can optimize with policy gradients; the procedure 1–5 above will be the result of maximizing this objective.
Two problems with the vanilla formulation:
1. It is hard to choose the stepsize ϵ.
2. It is sample inefficient: we cannot use data collected with policies of previous iterations.
Hard to choose stepsizes
• Step too big: a bad policy means data is collected under a bad policy, and we cannot recover. (In supervised learning, the data does not depend on the neural network weights.)
• Step too small: not an efficient use of experience. (In supervised learning, data can be trivially re-used.)

Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between πθold(s) and πθnew(s).
The Problem is More Than Step Size

Consider a family of policies with parametrization:

πθ(a) = σ(θ) if a = 1,  1 − σ(θ) if a = 2

(Figure: small changes in the policy parameters can unexpectedly lead to big changes in the policy.)
Notation

We will use the following to denote values of parameters and corresponding policies before and after an update:

θold → θnew,  πold → πnew

or equivalently θ → θ′,  π → π′.
Gradient Descent in Parameter Space

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search:

SGD:  d* = arg max_{∥d∥≤ϵ} J(θ + d)   (Euclidean distance in parameter space)

θnew = θold + d*

It is hard to predict the effect on the parameterized distribution.
Gradient Descent in Distribution Space

Natural gradient descent: the stepsize in parameter space is determined by considering the KL divergence between the distributions before and after the update:

d* = arg max_d J(θ + d)  s.t.  KL(πθ ∥ πθ+d) ≤ ϵ   (KL divergence in distribution space)

θnew = θold + d*

Compare to SGD, d* = arg max_{∥d∥≤ϵ} J(θ + d), which uses Euclidean distance in parameter space: there it is hard to predict the effect on the parameterized distribution, and hard to pick the threshold ϵ. In distribution space it is much easier to pick the distance threshold!
Solving the KL Constrained Problem

First-order Taylor expansion for the loss and second-order for the KL. Unconstrained penalized objective:

d* = arg max_d J(θ + d) − λ(D_KL[πθ ∥ πθ+d] − ϵ)
   ≈ arg max_d J(θold) + ∇θJ(θ)|θ=θold · d − (λ/2)(dᵀ ∇²θ D_KL[πθold ∥ πθ]|θ=θold d) + λϵ
Taylor expansion of KL

D_KL(pθold ∥ pθ) = 𝔼_{x∼pθold} log(Pθold(x) / Pθ(x))

D_KL(pθold ∥ pθ) ≈ D_KL(pθold ∥ pθold) + dᵀ ∇θ D_KL(pθold ∥ pθ)|θ=θold + (1/2) dᵀ ∇²θ D_KL(pθold ∥ pθ)|θ=θold d

The first-order term vanishes:

∇θ D_KL(pθold ∥ pθ)|θ=θold = −∇θ 𝔼_{x∼pθold} log Pθ(x)|θ=θold
 = −𝔼_{x∼pθold} ∇θ log Pθ(x)|θ=θold
 = −𝔼_{x∼pθold} (1/Pθold(x)) ∇θ Pθ(x)|θ=θold
 = −∫_x Pθold(x) (1/Pθold(x)) ∇θ Pθ(x)|θ=θold
 = −∫_x ∇θ Pθ(x)|θ=θold
 = −∇θ ∫_x Pθ(x)|θ=θold
 = 0.
Taylor expansion of KL (second-order term)

∇²θ D_KL(pθold ∥ pθ)|θ=θold = −𝔼_{x∼pθold} ∇²θ log Pθ(x)|θ=θold
 = −𝔼_{x∼pθold} ∇θ (∇θPθ(x) / Pθ(x))|θ=θold
 = −𝔼_{x∼pθold} ((∇²θPθ(x) Pθ(x) − ∇θPθ(x) ∇θPθ(x)ᵀ) / Pθ(x)²)|θ=θold
 = −𝔼_{x∼pθold} (∇²θPθ(x)|θ=θold / Pθold(x)) + 𝔼_{x∼pθold} ∇θ log Pθ(x) ∇θ log Pθ(x)ᵀ|θ=θold
 = 𝔼_{x∼pθold} ∇θ log Pθ(x) ∇θ log Pθ(x)ᵀ|θ=θold,

where the first term vanishes by the same argument as before: 𝔼_{x∼pθold}[∇²θPθ(x)/Pθold(x)] = ∫_x ∇²θPθ(x) = ∇²θ ∫_x Pθ(x) = 0.
Fisher Information Matrix

F(θ) = 𝔼θ[∇θ log pθ(x) ∇θ log pθ(x)ᵀ]

This is exactly the Hessian of the KL divergence:

F(θold) = ∇²θ D_KL(pθold ∥ pθ)|θ=θold

so the Taylor expansion becomes

D_KL(pθold ∥ pθ) ≈ D_KL(pθold ∥ pθold) + dᵀ ∇θ D_KL(pθold ∥ pθ)|θ=θold + (1/2) dᵀ ∇²θ D_KL(pθold ∥ pθ)|θ=θold d
 = (1/2) dᵀ F(θold) d
 = (1/2) (θ − θold)ᵀ F(θold) (θ − θold)

Since KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: how much you change the distribution if you move the parameters a little bit in a given direction.
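A quick numerical sanity check of this quadratic form (numpy; a 1-D Gaussian with parameters θ = (μ, log σ), whose Fisher matrix diag(1/σ², 2) and KL divergence have closed forms):

import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    # KL(N(mu0, s0^2) || N(mu1, s1^2)), closed form
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

def fisher_gauss(sigma):
    # Fisher matrix of N(mu, sigma^2) in (mu, log_sigma) coordinates
    return np.diag([1.0 / sigma**2, 2.0])

mu, log_sigma = 0.3, -0.2
d = np.array([1e-2, -5e-3])                 # small step in parameter space
F = fisher_gauss(np.exp(log_sigma))
quad = 0.5 * d @ F @ d                      # (1/2) d^T F d
kl = kl_gauss(mu, np.exp(log_sigma), mu + d[0], np.exp(log_sigma + d[1]))
print(kl, quad)                             # nearly equal for small d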
Solving the KL Constrained Problem

First-order Taylor expansion for the loss and second-order for the KL:

d* = arg max_d J(θ + d) − λ(D_KL[πθ ∥ πθ+d] − ϵ)
   ≈ arg max_d J(θold) + ∇θJ(θ)|θ=θold · d − (λ/2)(dᵀ ∇²θ D_KL[πθold ∥ πθ]|θ=θold d) + λϵ

Substituting the Fisher information matrix:

   = arg max_d ∇θJ(θ)|θ=θold · d − (λ/2)(dᵀ F(θold) d)
   = arg min_d −∇θJ(θ)|θ=θold · d + (λ/2)(dᵀ F(θold) d)
Natural Gradient Descent

Setting the gradient to zero:

0 = ∂/∂d (−∇θJ(θ)|θ=θold · d + (λ/2)(dᵀ F(θold) d))
  = −∇θJ(θ)|θ=θold + λ F(θold) d

d = (1/λ) F⁻¹(θold) ∇θJ(θ)|θ=θold

The natural gradient:

∇̃J(θ) = F⁻¹(θold) ∇θJ(θ)

with the constant 1/λ absorbed into the stepsize α of the update

θnew = θold + α · F⁻¹(θold) ĝ.

The stepsize α follows from the KL constraint. Using D_KL(πθold ∥ πθ) ≈ (1/2)(θ − θold)ᵀ F(θold)(θ − θold) with step θ − θold = α g_N, where g_N = F⁻¹(θold) ĝ is the natural gradient, we set

(1/2)(α g_N)ᵀ F (α g_N) = ϵ  ⟹  α = √(2ϵ / (g_Nᵀ F g_N))
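As a concrete sketch (numpy; ĝ and the Fisher estimate would come from samples of the current policy, here they are placeholder inputs), one natural-gradient update with the KL-derived stepsize:

import numpy as np

def natural_gradient_step(theta, g_hat, F_hat, eps=0.01, damping=1e-3):
    """theta: parameters; g_hat: policy gradient; F_hat: Fisher estimate."""
    F = F_hat + damping * np.eye(len(theta))        # damping keeps F invertible
    g_nat = np.linalg.solve(F, g_hat)               # F^{-1} g, the natural gradient
    alpha = np.sqrt(2 * eps / (g_nat @ F @ g_nat))  # stepsize from the KL constraint
    return theta + alpha * g_nat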
Natural Gradient Descent

Both the policy gradient ĝ and the Fisher matrix F are estimated with samples from the current policy π_k. Note, however, that F is very expensive to compute (and invert) for a large number of parameters!
• On-policy learning can be extremely inefficient.
• The policy changes only a little bit with each gradient step.
• We want to be able to use earlier data. How can we do that?
Off-policy learning with Importance Sampling

J(θ) = 𝔼_{τ∼πθ(τ)}[R(τ)]
 = Σ_τ πθ(τ) R(τ)
 = Σ_τ πθold(τ) (πθ(τ) / πθold(τ)) R(τ)
 = 𝔼_{τ∼πθold} (πθ(τ) / πθold(τ)) R(τ)

∇θJ(θ) = 𝔼_{τ∼πθold} (∇θπθ(τ) / πθold(τ)) R(τ)

The gradient evaluated at θold is unchanged:

∇θJ(θ)|θ=θold = 𝔼_{τ∼πθold} ∇θ log πθ(τ)|θ=θold R(τ)
Off-policy learning with Importance Sampling

The trajectory ratio factorizes over time steps,

πθ(τ) / πθold(τ) = Π_{t=1}^T πθ(a_t|s_t) / πθold(a_t|s_t),

so the objective becomes

J(θ) = 𝔼_{τ∼πθold} Σ_{t=1}^T (Π_{t′=1}^t πθ(a_{t′}|s_{t′}) / πθold(a_{t′}|s_{t′})) Â_t

Now we can use data from the old policy, but the variance has increased by a lot! Those multiplications can explode or vanish!
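A quick numerical illustration (numpy; the per-step ratios are synthetic lognormal stand-ins with mean 1, not real policy ratios) of how the product of per-step ratios behaves over a long horizon:

import numpy as np

rng = np.random.default_rng(0)
T = 100
# Hypothetical per-step ratios pi_theta/pi_theta_old, each with mean 1:
ratios = rng.lognormal(mean=-0.02, sigma=0.2, size=(10000, T))
traj_weights = ratios.prod(axis=1)        # pi_theta(tau) / pi_theta_old(tau)
# The sample mean stays near 1, but the spread is huge: the estimate is
# dominated by rare, enormous trajectory weights.
print(traj_weights.mean(), traj_weights.std())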
Trust Region Policy Optimization

Define the following trust region update:

maximize_θ Ê_t[(πθ(a_t|s_t) / πθold(a_t|s_t)) Â_t]
subject to Ê_t[KL[πθold(·|s_t), πθ(·|s_t)]] ≤ δ.

It is also worth considering using a penalty instead of a constraint:

maximize_θ Ê_t[(πθ(a_t|s_t) / πθold(a_t|s_t)) Â_t] − β Ê_t[KL[πθold(·|s_t), πθ(·|s_t)]]

By the method of Lagrange multipliers, an optimality point of the δ-constrained problem is also an optimality point of the β-penalized problem for some β. In practice, δ is easier to tune, and fixed δ is better than fixed β.
Again the KL penalized problem!
Solving the KL Penalized Problem

maximize_θ L_{πθold}(πθ) − β · KL_{πθold}(πθ)

Make a linear approximation to L_{πθold} and a quadratic approximation to the KL term:

maximize_θ g · (θ − θold) − (β/2)(θ − θold)ᵀ F (θ − θold)

where g = ∂/∂θ L_{πθold}(πθ)|θ=θold and F = ∂²/∂θ² KL_{πθold}(πθ)|θ=θold.

• The quadratic part of L is negligible compared to the KL term.
• F is positive semidefinite, but not if we include the Hessian of L.
• Solution: θ − θold = (1/β) F⁻¹ g, where F is the Fisher information matrix and g is the policy gradient. This is called the natural policy gradient (S. Kakade, "A Natural Policy Gradient," NIPS 2001).

Exactly what we saw with the natural policy gradient! But there is one important detail.
Trust Region Policy Optimization

Small problems with the NPG update:
• It might not be robust to the trust region size δ; at some iterations δ may be too large and performance can degrade.
• Because of the quadratic approximation, the KL-divergence constraint may be violated.

Solution:
• Require improvement in the surrogate (make sure that L_{θk}(θk+1) ≥ 0).
• Enforce the KL constraint.

How? Backtracking line search with exponential decay (decay coefficient α ∈ (0, 1), budget L), sketched after the algorithm below:

Algorithm: Line Search for TRPO
  Compute proposed policy step Δ_k = √(2δ / (ĝ_kᵀ Ĥ_k⁻¹ ĝ_k)) Ĥ_k⁻¹ ĝ_k
  for j = 0, 1, 2, ..., L do
    Compute proposed update θ = θ_k + αʲ Δ_k
    if L_{θk}(θ) ≥ 0 and D̄_KL(θ ∥ θ_k) ≤ δ then
      accept the update and set θ_{k+1} = θ_k + αʲ Δ_k
      break
    end if
  end for

Due to the quadratic approximation, the KL constraint may be violated! So we do a line search to find the best stepsize, making sure that:
• we are improving our objective J(θ), and
• the KL constraint is not violated.
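A minimal sketch of this backtracking line search (Python; surrogate and mean_kl are assumed callables estimated from samples of π_k):

def line_search(theta_k, step, surrogate, mean_kl, delta, decay=0.8, budget=10):
    """Accept the largest decay**j * step that improves the surrogate
    (surrogate(theta_k) = 0 is the baseline) and satisfies the KL constraint;
    otherwise make no update."""
    for j in range(budget):
        theta = theta_k + (decay ** j) * step
        if surrogate(theta) >= 0 and mean_kl(theta) <= delta:
            return theta           # first acceptable proposal wins
    return theta_k                 # all proposals rejected: keep old parameters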
Trust Region Policy Optimization

Trust Region Policy Optimization is implemented as truncated natural policy gradient (TNPG) plus a line search. Putting it all together:

Algorithm: Trust Region Policy Optimization
  Input: initial policy parameters θ_0
  for k = 0, 1, 2, ... do
    Collect set of trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Form sample estimates for the policy gradient ĝ_k (using the advantage estimates) and for the KL-divergence Hessian-vector product function f(v) = Ĥ_k v
    Use CG with n_cg iterations to obtain x_k ≈ Ĥ_k⁻¹ ĝ_k (see the conjugate-gradient sketch below)
    Estimate proposed step Δ_k ≈ √(2δ / (x_kᵀ Ĥ_k x_k)) x_k
    Perform backtracking line search with exponential decay to obtain the final update θ_{k+1} = θ_k + αʲ Δ_k
  end for

TRPO = NPG + line search
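The CG step never forms Ĥ_k explicitly; it only needs Hessian-vector products. A minimal conjugate-gradient sketch (numpy; hvp is an assumed callable v ↦ Ĥ_k v):

import numpy as np

def conjugate_gradient(hvp, g, n_iters=10, tol=1e-10):
    """hvp: callable v -> H v; returns x ~= H^{-1} g for symmetric positive-definite H."""
    x = np.zeros_like(g)
    r = g.copy()                   # residual g - H x (x = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x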
TRPO = NPG + line search + the monotonic improvement theorem!
Relating objectives of two policies

Policy objective:

J(πθ) = 𝔼_{τ∼πθ} Σ_{t=0}^∞ γᵗ r_t

The new policy's objective can be written in terms of the old one:

J(πθ′) − J(πθ) = 𝔼_{τ∼πθ′} Σ_{t=0}^∞ γᵗ A^{πθ}(s_t, a_t)

Equivalently, for succinctness:

J(π′) − J(π) = 𝔼_{τ∼π′} Σ_{t=0}^∞ γᵗ A^π(s_t, a_t)
Proof of Relative Policy Performance Identity

𝔼_{τ∼π′}[Σ_{t=0}^∞ γᵗ A^π(s_t, a_t)]
 = 𝔼_{τ∼π′}[Σ_{t=0}^∞ γᵗ (R(s_t, a_t, s_{t+1}) + γ V^π(s_{t+1}) − V^π(s_t))]
 = J(π′) + 𝔼_{τ∼π′}[Σ_{t=0}^∞ γ^{t+1} V^π(s_{t+1}) − Σ_{t=0}^∞ γᵗ V^π(s_t)]
 = J(π′) + 𝔼_{τ∼π′}[Σ_{t=1}^∞ γᵗ V^π(s_t) − Σ_{t=0}^∞ γᵗ V^π(s_t)]
 = J(π′) − 𝔼_{τ∼π′}[V^π(s_0)]
 = J(π′) − J(π)

(Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002)

Note the last step: the initial state distribution is the same for both policies, so 𝔼_{τ∼π′}[V^π(s_0)] = J(π).
Relating objectives of two policies

Discounted state visitation distribution:

d^π(s) = (1 − γ) Σ_{t=0}^∞ γᵗ P(s_t = s | π)

With it, the identity becomes

J(π′) − J(π) = 𝔼_{τ∼π′} Σ_{t=0}^∞ γᵗ A^π(s_t, a_t)
 = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π′} A^π(s, a)
 = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π}[(π′(a|s) / π(a|s)) A^π(s, a)]

But how are we supposed to sample states from the policy we are trying to optimize for? Let's use the previous policy to sample them. What if we just said d^{π′} ≈ d^π and didn't worry about it?

J(π′) − J(π) ≈ (1/(1 − γ)) 𝔼_{s∼d^π, a∼π}[(π′(a|s) / π(a|s)) A^π(s, a)] =: 𝓛_π(π′)

It turns out this approximation is pretty good when π′ and π are close, and we can bound the approximation error. But why, and how close do they have to be?

Relative policy performance bounds:

|J(π′) − (J(π) + 𝓛_π(π′))| ≤ C √(𝔼_{s∼d^π}[D_KL(π′ ∥ π)[s]])

If the policies are close in KL divergence, the approximation is good!

(Constrained Policy Optimization, Achiam et al., 2017)
Relating objectives of two policies

𝓛_π(π′) = (1/(1 − γ)) 𝔼_{s∼d^π, a∼π}[(π′(a|s) / π(a|s)) A^π(s, a)] = 𝔼_{τ∼π}[Σ_{t=0}^∞ γᵗ (π′(a_t|s_t) / π(a_t|s_t)) A^π(s_t, a_t)]

This is something we can optimize using trajectories from the old policy! Compare to importance sampling over whole trajectories,

J(θ) = 𝔼_{τ∼πθold} Σ_{t=1}^T (Π_{t′=1}^t πθ(a_{t′}|s_{t′}) / πθold(a_{t′}|s_{t′})) Â_t:

now we do not have the product of ratios, so the gradient will have much smaller variance. (Yes, but only because we have approximated; that's why!) What is the gradient?

∇θ 𝓛_{θk}(θ)|θ=θk = 𝔼_{τ∼πθk}[Σ_{t=0}^∞ γᵗ (∇θπθ(a_t|s_t)|θ=θk / πθk(a_t|s_t)) A^{πθk}(s_t, a_t)]
 = 𝔼_{τ∼πθk}[Σ_{t=0}^∞ γᵗ ∇θ log πθ(a_t|s_t)|θ=θk A^{πθk}(s_t, a_t)]

From the performance bound,

|J(π′) − (J(π) + 𝓛_π(π′))| ≤ C √(𝔼_{s∼d^π}[D_KL(π′ ∥ π)[s]])
⇒ J(π′) − J(π) ≥ 𝓛_π(π′) − C √(𝔼_{s∼d^π}[D_KL(π′ ∥ π)[s]])

Given policy π, we want to optimize over policy π′ to maximize J(π′) − J(π).
• If we maximize the RHS, we are guaranteed to maximize a lower bound on the LHS.
• We know how to maximize the RHS: both quantities involving π′ can be estimated with samples from π.
• But will we actually have a better policy π′? Knowing that the bound is maximized is not enough; its value needs to be positive or equal to zero.
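Before stating the theorem, a minimal sample-based sketch of the surrogate (assumes PyTorch; the log-probability and advantage tensors are hypothetical batches of (s, a) samples collected under π):

import torch

def surrogate_loss(log_probs_new, log_probs_old, advantages):
    """Estimates L_pi(pi') from old-policy samples, up to the constant
    1/(1-gamma) normalization. Old log-probs and advantages are constants."""
    ratios = torch.exp(log_probs_new - log_probs_old.detach())  # pi'(a|s)/pi(a|s)
    return (ratios * advantages.detach()).mean()                 # maximize this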
Monotonic Improvement Theorem

Proof of improvement guarantee: suppose π_{k+1} and π_k are related by

π_{k+1} = arg max_{π′} 𝓛_{π_k}(π′) − C √(𝔼_{s∼d^{π_k}}[D_KL(π′ ∥ π_k)[s]]).

π_k is a feasible point, and the objective at π_k is equal to 0:

𝓛_{π_k}(π_k) ∝ 𝔼_{s,a∼d^{π_k},π_k}[A^{π_k}(s, a)] = 0,   D_KL(π_k ∥ π_k)[s] = 0

⟹ the optimal value is ≥ 0
⟹ by the performance bound, J(π_{k+1}) − J(π_k) ≥ 0.
Approximate Monotonic Improvement

The theory is very conservative (a high value of C), so we will use the KL distance between π′ and π as a constraint (trust region) as opposed to a penalty:

π_{k+1} = arg max_{π′} 𝓛_{π_k}(π′) − C 𝔼_{s∼d^{π_k}}[D_KL(π′ ∥ π_k)[s]]   (3)

Problem: the C provided by theory is quite high when γ is near 1, so the steps from (3) are too small.

Solution: instead of the KL penalty, use a KL constraint (called a trust region). We can then control the worst-case error through the constraint's upper limit:

π_{k+1} = arg max_{π′} 𝓛_{π_k}(π′)  s.t.  𝔼_{s∼d^{π_k}}[D_KL(π′ ∥ π_k)[s]] ≤ δ   (4)
TRPO = NPG + line search + the monotonic improvement theorem!
Proximal Policy Optimization

Can we achieve similar performance without second-order information (no Fisher matrix)?

Proximal Policy Optimization (PPO) is a family of methods that approximately enforce the KL constraint without computing natural gradients. Two variants:

Adaptive KL Penalty
• The policy update solves an unconstrained optimization problem,
  θ_{k+1} = arg max_θ L_{θk}(θ) − β_k D̄_KL(θ ∥ θ_k),
  where the penalty coefficient β_k changes between iterations to approximately enforce the KL-divergence constraint.

Clipped Objective
• New objective function: let r_t(θ) = πθ(a_t|s_t) / πθk(a_t|s_t). Then
  L^CLIP_{θk}(θ) = 𝔼_{τ∼πk}[Σ_{t=0}^T min(r_t(θ) Â_t^{πk}, clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t^{πk})],
  where ϵ is a hyperparameter (maybe ϵ = 0.2). The policy update is θ_{k+1} = arg max_θ L^CLIP_{θk}(θ).
PPO with Adaptive KL Penalty

Algorithm: PPO with Adaptive KL Penalty
  Input: initial policy parameters θ_0, initial KL penalty β_0, target KL-divergence δ
  for k = 0, 1, 2, ... do
    Collect set of partial trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Compute policy update θ_{k+1} = arg max_θ L_{θk}(θ) − β_k D̄_KL(θ ∥ θ_k) by taking K steps of minibatch SGD (via Adam)
    if D̄_KL(θ_{k+1} ∥ θ_k) ≥ 1.5 δ then
      β_{k+1} = 2 β_k
    else if D̄_KL(θ_{k+1} ∥ θ_k) ≤ δ / 1.5 then
      β_{k+1} = β_k / 2
    end if
  end for

• The initial KL penalty is not that important; it adapts quickly.
• Some iterations may violate the KL constraint, but most don't.
• We don't use the second-order approximation of the KL, which is expensive; we use standard gradient descent.
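The β-update rule in isolation, as a minimal sketch (plain Python; measured_kl and delta are the quantities named in the algorithm above):

def adapt_beta(beta, measured_kl, delta):
    """Doubles/halves the KL penalty coefficient depending on the measured
    mean KL between the updated and previous policy."""
    if measured_kl >= 1.5 * delta:
        return 2.0 * beta        # KL too large: penalize divergence more
    if measured_kl <= delta / 1.5:
        return beta / 2.0        # KL too small: penalize divergence less
    return beta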
PPO: Clipped Objective

Recall the surrogate objective

L^IS(θ) = Ê_t[(πθ(a_t|s_t) / πθold(a_t|s_t)) Â_t] = Ê_t[r_t(θ) Â_t].   (1)

Form a lower bound via clipped importance ratios:

L^CLIP(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t)]   (2)

This forms a pessimistic bound on the objective and can be optimized using SGD.
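A minimal sketch of L^CLIP as a training loss (assumes PyTorch; inputs are hypothetical per-timestep tensors gathered under πθold):

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Returns the negated clipped surrogate of eq. (2), to be minimized
    with minibatch SGD/Adam."""
    advantages = advantages.detach()                      # precomputed constants
    r = torch.exp(log_probs_new - log_probs_old.detach()) # r_t(theta)
    unclipped = r * advantages
    clipped = torch.clamp(r, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()           # pessimistic bound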
PPO with Clipped Objective

But how does clipping keep the policy close? By making the objective as pessimistic as possible about performance far away from θ_k.

(Figure: various objectives as a function of the interpolation factor α between θ_{k+1} and θ_k after one update of PPO-Clip. Schulman, Wolski, Dhariwal, Radford, Klimov, 2017.)
Algorithm: PPO with Clipped Objective
  Input: initial policy parameters θ_0, clipping threshold ϵ
  for k = 0, 1, 2, ... do
    Collect set of partial trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Compute policy update θ_{k+1} = arg max_θ L^CLIP_{θk}(θ) by taking K steps of minibatch SGD (via Adam), where
    L^CLIP_{θk}(θ) = 𝔼_{τ∼πk}[Σ_{t=0}^T min(r_t(θ) Â_t^{πk}, clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t^{πk})]
  end for

• Clipping prevents the policy from having an incentive to go far away from θ_k.
• Clipping seems to work at least as well as PPO with the KL penalty, but is simpler to implement.
Empirical Performance of PPO

(Figure: performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks. Schulman, Wolski, Dhariwal, Radford, Klimov, 2017.)
Summary
• Gradient descent in parameter space vs. in distribution space.
• Natural gradients: we need to keep track of how the KL changes from iteration to iteration.
• Natural policy gradients, TRPO.
• The clipped objective (PPO) works well.
Further Reading

• S. Kakade. "A Natural Policy Gradient." NIPS, 2001.
• S. Kakade and J. Langford. "Approximately Optimal Approximate Reinforcement Learning." ICML, 2002.
• J. Peters and S. Schaal. "Natural Actor-Critic." Neurocomputing, 2008.
• J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization." ICML, 2015.
• Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." ICML, 2016.
• J. Martens and I. Sutskever. "Training Deep and Recurrent Networks with Hessian-Free Optimization." Springer, 2012.
• Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. "Sample Efficient Actor-Critic with Experience Replay." 2016.
• Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation." 2017.
• J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms." 2017.
• J. Achiam, D. Held, A. Tamar, and P. Abbeel. "Constrained Policy Optimization." 2017.
• blog.openai.com: recent posts on baselines releases.

More Related Content

Similar to Lecture_NaturalPolicyGradientsTRPOPPO.pdf

Logisticregression
LogisticregressionLogisticregression
Logisticregressionrnunoo
 
logisticregression.ppt
logisticregression.pptlogisticregression.ppt
logisticregression.pptShrutiPanda12
 
3.3 Rates of Change and Behavior of Graphs
3.3 Rates of Change and Behavior of Graphs3.3 Rates of Change and Behavior of Graphs
3.3 Rates of Change and Behavior of Graphssmiller5
 
Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...
Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...
Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...Kenshi Abe
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach謙益 黃
 
Interpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptxInterpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptxGairuzazmiMGhani
 
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...Chiheb Ben Hammouda
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
 
100 things I know
100 things I know100 things I know
100 things I knowr-uribe
 
DDPG algortihm for angry birds
DDPG algortihm for angry birdsDDPG algortihm for angry birds
DDPG algortihm for angry birdsWangyu Han
 
Leveraged ETFs Performance Evaluation
Leveraged ETFs Performance EvaluationLeveraged ETFs Performance Evaluation
Leveraged ETFs Performance Evaluationguasoni
 
Logisticregression
LogisticregressionLogisticregression
Logisticregressionsujimaa
 
MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)Arnaud de Myttenaere
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Reinforcement Learning in Configurable Environments
Reinforcement Learning in Configurable EnvironmentsReinforcement Learning in Configurable Environments
Reinforcement Learning in Configurable EnvironmentsEmanuele Ghelfi
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakaoJunho Kim
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIJack Clark
 

Similar to Lecture_NaturalPolicyGradientsTRPOPPO.pdf (20)

Logisticregression
LogisticregressionLogisticregression
Logisticregression
 
logisticregression
logisticregressionlogisticregression
logisticregression
 
logisticregression.ppt
logisticregression.pptlogisticregression.ppt
logisticregression.ppt
 
3.3 Rates of Change and Behavior of Graphs
3.3 Rates of Change and Behavior of Graphs3.3 Rates of Change and Behavior of Graphs
3.3 Rates of Change and Behavior of Graphs
 
Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...
Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...
Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demon...
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
 
Interpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptxInterpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptx
 
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
Hierarchical Deterministic Quadrature Methods for Option Pricing under the Ro...
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
100 things I know
100 things I know100 things I know
100 things I know
 
DDPG algortihm for angry birds
DDPG algortihm for angry birdsDDPG algortihm for angry birds
DDPG algortihm for angry birds
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
 
Leveraged ETFs Performance Evaluation
Leveraged ETFs Performance EvaluationLeveraged ETFs Performance Evaluation
Leveraged ETFs Performance Evaluation
 
Logisticregression
LogisticregressionLogisticregression
Logisticregression
 
MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)MAPE regression, seminar @ QUT (Brisbane)
MAPE regression, seminar @ QUT (Brisbane)
 
Asymptotic Notation
Asymptotic NotationAsymptotic Notation
Asymptotic Notation
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Reinforcement Learning in Configurable Environments
Reinforcement Learning in Configurable EnvironmentsReinforcement Learning in Configurable Environments
Reinforcement Learning in Configurable Environments
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakao
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
 

More from Dongseo University

Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03Dongseo University
 
Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02Dongseo University
 
Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01Dongseo University
 
Average Linear Selection Algorithm
Average Linear Selection AlgorithmAverage Linear Selection Algorithm
Average Linear Selection AlgorithmDongseo University
 
Lower Bound of Comparison Sort
Lower Bound of Comparison SortLower Bound of Comparison Sort
Lower Bound of Comparison SortDongseo University
 
Running Time of Building Binary Heap using Array
Running Time of Building Binary Heap using ArrayRunning Time of Building Binary Heap using Array
Running Time of Building Binary Heap using ArrayDongseo University
 
Proof By Math Induction Example
Proof By Math Induction ExampleProof By Math Induction Example
Proof By Math Induction ExampleDongseo University
 
Estimating probability distributions
Estimating probability distributionsEstimating probability distributions
Estimating probability distributionsDongseo University
 
2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)Dongseo University
 
2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)Dongseo University
 
2017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #52017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #5Dongseo University
 
2017-2 ML W8 Reinforcement Learning #4
2017-2 ML W8 Reinforcement Learning #42017-2 ML W8 Reinforcement Learning #4
2017-2 ML W8 Reinforcement Learning #4Dongseo University
 

More from Dongseo University (20)

Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03Evolutionary Computation Lecture notes03
Evolutionary Computation Lecture notes03
 
Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02Evolutionary Computation Lecture notes02
Evolutionary Computation Lecture notes02
 
Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01Evolutionary Computation Lecture notes01
Evolutionary Computation Lecture notes01
 
Markov Chain Monte Carlo
Markov Chain Monte CarloMarkov Chain Monte Carlo
Markov Chain Monte Carlo
 
Simplex Lecture Notes
Simplex Lecture NotesSimplex Lecture Notes
Simplex Lecture Notes
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Median of Medians
Median of MediansMedian of Medians
Median of Medians
 
Average Linear Selection Algorithm
Average Linear Selection AlgorithmAverage Linear Selection Algorithm
Average Linear Selection Algorithm
 
Lower Bound of Comparison Sort
Lower Bound of Comparison SortLower Bound of Comparison Sort
Lower Bound of Comparison Sort
 
Running Time of Building Binary Heap using Array
Running Time of Building Binary Heap using ArrayRunning Time of Building Binary Heap using Array
Running Time of Building Binary Heap using Array
 
Running Time of MergeSort
Running Time of MergeSortRunning Time of MergeSort
Running Time of MergeSort
 
Binary Trees
Binary TreesBinary Trees
Binary Trees
 
Proof By Math Induction Example
Proof By Math Induction ExampleProof By Math Induction Example
Proof By Math Induction Example
 
TRPO and PPO notes
TRPO and PPO notesTRPO and PPO notes
TRPO and PPO notes
 
Estimating probability distributions
Estimating probability distributionsEstimating probability distributions
Estimating probability distributions
 
2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)2018-2 Machine Learning (Wasserstein GAN and BEGAN)
2018-2 Machine Learning (Wasserstein GAN and BEGAN)
 
2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)2018-2 Machine Learning (Linear regression, Logistic regression)
2018-2 Machine Learning (Linear regression, Logistic regression)
 
2017-2 ML W11 GAN #1
2017-2 ML W11 GAN #12017-2 ML W11 GAN #1
2017-2 ML W11 GAN #1
 
2017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #52017-2 ML W9 Reinforcement Learning #5
2017-2 ML W9 Reinforcement Learning #5
 
2017-2 ML W8 Reinforcement Learning #4
2017-2 ML W8 Reinforcement Learning #42017-2 ML W8 Reinforcement Learning #4
2017-2 ML W8 Reinforcement Learning #4
 

Recently uploaded

What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 

Recently uploaded (20)

Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 

Lecture_NaturalPolicyGradientsTRPOPPO.pdf

  • 1. Natural Policy Gradients, TRPO, PPO Deep Reinforcement Learning and Control Katerina Fragkiadaki Carnegie Mellon School of Computer Science CMU 10703
  • 2. Part of the slides adapted from John Shulman and Joshua Achiam
  • 3. Stochastic policies continuous actions discrete actions usually multivariate Gaussian almost always categorical a ⇠ N(µ✓(s), 2 ✓(s)) a ∼ Cat(pθ(s)) µ✓(s) ✓(s) pθ(s) θ θ
  • 4. Policy Gradients Monte Carlo Policy Gradients (REINFORCE), gradient direction: What Loss to Optimize? I Policy gradients ĝ = Êt h r✓ log ⇡✓(at | st)Ât i I Can di↵erentiate the following loss LPG (✓) = Êt h log ⇡✓(at | st)Ât i but don’t want to optimize it too far I Equivalently di↵erentiate LIS ✓old (✓) = Êt  ⇡✓(at | st) ⇡✓old (at | st) Ât at ✓ = ✓old, state-actions are sampled using ✓old. (IS = Just the chain rule: r✓ log f (✓) ✓old = r✓f (✓) ✓old f (✓old) = r✓ ⇣ Actor-Critic Policy Gradient: ̂ g = ̂ 𝔼t [∇θlog πθ(at |st)Aw(st)] θold θnew μθ(s) σθ(s) σθnew (s) μθnew (s) θnew = θ + ϵ ⋅ ̂ g 1. Collect trajectories for policy 2. Estimate advantages 3. Compute policy gradient 4. Update policy parameters 5. GOTO 1 ̂ g A πθ This lecture is all about the stepwise
  • 5. What is the underlying objective function? ̂ g ≈ 1 N N ∑ i=1 T ∑ t=1 ∇θlog πθ(α(i) t |s(i) t )A(s(i) t , a(i) t ), τi ∼ πθ Policy gradients: What is our objective? Result from differentiating the objective function: JPG (θ) = 1 N N ∑ i=1 T ∑ t=1 log πθ(α(i) t |s(i) t )A(s(i) t , a(i) t ) τi ∼ πθ Is this our objective? We cannot both maximize over a variable and sample from it. Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we have access to expert actions, then the loss function we want to optimize is: JSL (θ) = 1 N N ∑ i=1 T ∑ t=1 log πθ(α̃(i) t |s(i) t ), τi ∼ π* which maximizes the probability of expert actions in the training set. Is this our SL objective? Well, we cannot optimize it too far, our advantage estimates are from samples of pi_theta_{old}. However, this constraint of “cannot optimize too far from theta_{old}” does not appear anywhere in the objective. Well, as a matter of fact, we care about test error, but this is a long story, the short answer is yes, this is good enough for us to optimize if we regularize. +regularization
  • 6. Policy Gradients Monte Carlo Policy Gradients (REINFORCE), gradient direction: What Loss to Optimize? I Policy gradients ĝ = Êt h r✓ log ⇡✓(at | st)Ât i I Can di↵erentiate the following loss LPG (✓) = Êt h log ⇡✓(at | st)Ât i but don’t want to optimize it too far I Equivalently di↵erentiate LIS ✓old (✓) = Êt  ⇡✓(at | st) ⇡✓old (at | st) Ât at ✓ = ✓old, state-actions are sampled using ✓old. (IS = Just the chain rule: r✓ log f (✓) ✓old = r✓f (✓) ✓old f (✓old) = r✓ ⇣ Actor-Critic Policy Gradient: ̂ g = ̂ 𝔼t [∇θlog πθ(at |st)Aw(st)] θnew = θ + ϵ ⋅ ̂ g 1. Collect trajectories for policy 2. Estimate advantages 3. Compute policy gradient 4. Update policy parameters 5. GOTO 1 ̂ g A πθ This lecture is all about the stepwise It is also about writing down an objective that we can optimize with PG, and the procedure 1,2,3,4,5 will be the result of this objective maximization θold θnew μθ(s) σθ(s) σθnew (s) μθnew (s)
  • 7. Policy Gradients Monte Carlo Policy Gradients (REINFORCE), gradient direction: What Loss to Optimize? I Policy gradients ĝ = Êt h r✓ log ⇡✓(at | st)Ât i I Can di↵erentiate the following loss LPG (✓) = Êt h log ⇡✓(at | st)Ât i but don’t want to optimize it too far I Equivalently di↵erentiate LIS ✓old (✓) = Êt  ⇡✓(at | st) ⇡✓old (at | st) Ât at ✓ = ✓old, state-actions are sampled using ✓old. (IS = Just the chain rule: r✓ log f (✓) ✓old = r✓f (✓) ✓old f (✓old) = r✓ ⇣ Actor-Critic Policy Gradient: ̂ g = ̂ 𝔼t [∇θlog πθ(at |st)Aw(st)] Two problems with the vanilla formulation: 1. Hard to choose stepwise 2. Sample inefficient: we cannot use data collected with policies of previous iterations ϵ θnew = θ + ϵ ⋅ ̂ g 1. Collect trajectories for policy 2. Estimate advantages 3. Compute policy gradient 4. Update policy parameters 5. GOTO 1 ̂ g A πθ θold θnew μθ(s) σθ(s) σθnew (s) μθnew (s)
  • 8. Two Hard to choose stepsizes • Step too big Bad policy->data collected under bad policy-> we cannot recover (in Supervised Learning, data does not depend on neural network weights) • Step too small Not efficient use of experience (in Supervised Learning, data can be trivially re-used) Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between and πθold (s) πθnew (s) Monte Carlo Policy Gradients (REINFORCE), gradient direction: What Loss to Optimize? I Policy gradients ĝ = Êt h r✓ log ⇡✓(at | st)Ât i I Can di↵erentiate the following loss LPG (✓) = Êt h log ⇡✓(at | st)Ât i but don’t want to optimize it too far I Equivalently di↵erentiate LIS ✓old (✓) = Êt  ⇡✓(at | st) ⇡✓old (at | st) Ât at ✓ = ✓old, state-actions are sampled using ✓old. (IS = Just the chain rule: r✓ log f (✓) ✓old = r✓f (✓) ✓old f (✓old) = r✓ ⇣ Actor-Critic Policy Gradient: ̂ g = ̂ 𝔼t [∇θlog πθ(at |st)Aw(st)] θnew = θ + ϵ ⋅ ̂ g 1. Collect trajectories for policy 2. Estimate advantages 3. Compute policy gradient 4. Update policy parameters 5. GOTO 1 ̂ g A πθ θold θnew μθ(s) σθ(s) σθnew (s) μθnew (s)
  • 9. Two Hard to choose stepsizes Monte Carlo Policy Gradients (REINFORCE), gradient direction: What Loss to Optimize? I Policy gradients ĝ = Êt h r✓ log ⇡✓(at | st)Ât i I Can di↵erentiate the following loss LPG (✓) = Êt h log ⇡✓(at | st)Ât i but don’t want to optimize it too far I Equivalently di↵erentiate LIS ✓old (✓) = Êt  ⇡✓(at | st) ⇡✓old (at | st) Ât at ✓ = ✓old, state-actions are sampled using ✓old. (IS = Just the chain rule: r✓ log f (✓) ✓old = r✓f (✓) ✓old f (✓old) = r✓ ⇣ Actor-Critic Policy Gradient: ̂ g = ̂ 𝔼t [∇θlog πθ(at |st)Aw(st)] The Problem is More Than Step Size Consider a family of policies with parametrization: ⇡✓(a) = ⇢ (✓) a = 1 1 (✓) a = 2 Figure: Small changes in the policy parameters can unexpectedly lead to big changes in the policy. θnew = θ + ϵ ⋅ ̂ g 1. Collect trajectories for policy 2. Estimate advantages 3. Compute policy gradient 4. Update policy parameters 5. GOTO 1 ̂ g A πθ
  • 10. Two Notation We will use the following to denote values of parameters and corresponding policies before and after an update: θold → θnew πold → πnew θ → θ′ π → π′
  • 11. Gradient Descent in Parameter Space The stepwise in gradient descent results from solving the following optimization problem, e.g., using line search: Euclidean distance in parameter space θnew = θold + d * SGD: d * = arg max ∥d∥≤ϵ J(θ + d) It is hard to predict the result on the parameterized distribution.. µ✓(s) ✓(s) θ
• 12. Gradient Descent in Distribution Space — Natural gradient descent: the step in parameter space is determined by considering the KL divergence between the distributions before and after the update: d* = argmax_{d : KL(πθ ∥ πθ+d) ≤ ε} J(θ + d). Compare to SGD's Euclidean constraint, d* = argmax_{∥d∥ ≤ ε} J(θ + d), where it is hard to predict the effect on the parameterized distribution and hard to pick the threshold ε. With a KL-divergence constraint in distribution space, it is much easier to pick the distance threshold!
• 13. Solving the KL-Constrained Problem — Take a first-order Taylor expansion of the objective and a second-order expansion of the KL. The unconstrained penalized objective is:
d* = argmax_d J(θ + d) − λ(D_KL[πθ ∥ πθ+d] − ε)
≈ argmax_d J(θold) + ∇θJ(θ)|θ=θold · d − ½ λ d⊤ ∇²θ D_KL[πθold ∥ πθ]|θ=θold d + λε.
• 14. Taylor Expansion of KL — With D_KL(pθold ∥ pθ) = 𝔼_{x∼pθold} log(Pθold(x)/Pθ(x)), the expansion around θold is
D_KL(pθold ∥ pθ) ≈ D_KL(pθold ∥ pθold) + d⊤ ∇θD_KL(pθold ∥ pθ)|θ=θold + ½ d⊤ ∇²θD_KL(pθold ∥ pθ)|θ=θold d.
The zeroth-order term is 0, and so is the first-order term:
∇θD_KL(pθold ∥ pθ)|θ=θold = −∇θ𝔼_{x∼pθold} log Pθ(x)|θ=θold
= −𝔼_{x∼pθold} ∇θ log Pθ(x)|θ=θold
= −𝔼_{x∼pθold} (1/Pθold(x)) ∇θPθ(x)|θ=θold
= −∫_x Pθold(x) (1/Pθold(x)) ∇θPθ(x)|θ=θold
= −∫_x ∇θPθ(x)|θ=θold = −∇θ ∫_x Pθ(x)|θ=θold = −∇θ 1 = 0.
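A quick finite-difference sanity check of the vanishing first-order term, again assuming (as in the example on slide 9) a Bernoulli family pθ = Bern(σ(θ)); this concrete family is our assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl(theta_old, theta):
    """KL( p_theta_old || p_theta ) for the Bernoulli family p = sigmoid(theta)."""
    p, q = sigmoid(theta_old), sigmoid(theta)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta_old, h = 0.7, 1e-5
grad_at_old = (kl(theta_old, theta_old + h) - kl(theta_old, theta_old - h)) / (2 * h)
print(grad_at_old)  # ~ 0: the KL is flat to first order at theta = theta_old
```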
• 15. Taylor Expansion of KL — The second-order term gives:
∇²θD_KL(pθold ∥ pθ)|θ=θold = −𝔼_{x∼pθold} ∇²θ log Pθ(x)|θ=θold
= −𝔼_{x∼pθold} ∇θ(∇θPθ(x)/Pθ(x))|θ=θold
= −𝔼_{x∼pθold} [(∇²θPθ(x) Pθ(x) − ∇θPθ(x) ∇θPθ(x)⊤)/Pθ(x)²]|θ=θold
= −𝔼_{x∼pθold} [∇²θPθ(x)|θ=θold / Pθold(x)] + 𝔼_{x∼pθold} [∇θ log Pθ(x) ∇θ log Pθ(x)⊤]|θ=θold
= 𝔼_{x∼pθold} ∇θ log Pθ(x) ∇θ log Pθ(x)⊤|θ=θold,
where the first term vanishes for the same reason as the first-order term: 𝔼_{x∼pθold}[∇²θPθ(x)/Pθold(x)] = ∫_x ∇²θPθ(x) = ∇²θ ∫_x Pθ(x) = 0.
• 16. Fisher Information Matrix — F(θ) = 𝔼θ[∇θ log pθ(x) ∇θ log pθ(x)⊤]. By the calculation above, it is exactly the Hessian of the KL divergence: F(θold) = ∇²θD_KL(pθold ∥ pθ)|θ=θold, so
D_KL(pθold ∥ pθ) ≈ ½ d⊤ F(θold) d = ½ (θ − θold)⊤ F(θold)(θ − θold).
Since the KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it measures how much you change the distribution if you move the parameters a little bit in a given direction.
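Continuing the assumed Bernoulli family from above, a short check that the closed-form Fisher information p(1 − p) matches a finite-difference Hessian of the KL.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl(theta_old, theta):
    p, q = sigmoid(theta_old), sigmoid(theta)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta_old, h = 0.7, 1e-4
p = sigmoid(theta_old)
fisher = p * (1 - p)  # closed form of E[(d/dtheta log p_theta(x))^2] for Bernoulli
hessian = (kl(theta_old, theta_old + h) - 2 * kl(theta_old, theta_old)
           + kl(theta_old, theta_old - h)) / h**2
print(fisher, hessian)  # both ~ 0.2217
```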
• 17. Solving the KL-Constrained Problem — Substituting the Fisher matrix for the KL Hessian:
d* = argmax_d J(θ + d) − λ(D_KL[πθ ∥ πθ+d] − ε)
≈ argmax_d J(θold) + ∇θJ(θ)|θ=θold · d − ½ λ d⊤ ∇²θ D_KL[πθold ∥ πθ]|θ=θold d + λε
= argmax_d ∇θJ(θ)|θ=θold · d − ½ λ d⊤ F(θold) d
= argmin_d −∇θJ(θ)|θ=θold · d + ½ λ d⊤ F(θold) d.
• 18. Natural Gradient Descent — Setting the gradient with respect to d to zero:
0 = ∂/∂d [−∇θJ(θ)|θ=θold · d + ½ λ d⊤ F(θold) d] = −∇θJ(θ)|θ=θold + λ F(θold) d
⇒ d = (1/λ) F⁻¹(θold) ∇θJ(θ)|θ=θold (the constant 1/λ is absorbed into the stepsize).
The natural gradient: ∇̃J(θ) = F⁻¹(θold) ∇θJ(θ), with update θnew = θold + α · F⁻¹(θold) ĝ.
Using D_KL(πθold ∥ πθ) ≈ ½ (θ − θold)⊤ F(θold)(θ − θold) and writing g_N = F⁻¹(θold) ĝ, choosing α so that the step lands exactly on the trust-region boundary, ½ (α g_N)⊤ F (α g_N) = ε, gives α = √(2ε / (g_N⊤ F g_N)).
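A minimal sketch of one natural-gradient step in numpy, with a random positive-definite matrix standing in for the Fisher matrix (toy quantities of our own, not a policy).

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=3)            # stand-in policy gradient estimate g_hat
A = rng.normal(size=(3, 3))
F = A @ A.T + 1e-3 * np.eye(3)    # stand-in Fisher matrix (symmetric PSD)

g_nat = np.linalg.solve(F, g)     # natural gradient F^{-1} g (no explicit inverse)
eps = 0.01                        # KL trust-region radius
alpha = np.sqrt(2 * eps / (g_nat @ F @ g_nat))

theta_old = np.zeros(3)
theta_new = theta_old + alpha * g_nat

d = theta_new - theta_old
print(0.5 * d @ F @ d)  # ~ 0.01: the quadratic KL model sits exactly on the boundary
```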
• 19. Natural Gradient Descent — (Figure: the vanilla and natural policy gradient update loops side by side, with KL radius ε.) Both use samples from the current policy π_k.
• 20. Natural Gradient Descent — The catch: the Fisher matrix, and in particular its inverse F⁻¹, is very expensive to compute for a large number of parameters!
• 21. Policy Gradients — (Recap of slide 6, now with explicit θold labels: ĝ = 𝔼̂_t[∇θ log πθ(a_t|s_t) A_w(s_t)], θnew = θold + ε · ĝ; collect trajectories with πθold, estimate advantages, compute the gradient, update, GOTO 1.)
• 22. Policy Gradients — Problems with this on-policy loop: on-policy learning can be extremely inefficient; the policy changes only a little bit with each gradient step; we want to be able to use earlier data. How to do that?
• 23. Off-Policy Learning with Importance Sampling —
J(θ) = 𝔼_{τ∼πθ}[R(τ)] = Σ_τ πθ(τ) R(τ) = Σ_τ πθold(τ) (πθ(τ)/πθold(τ)) R(τ) = 𝔼_{τ∼πθold}[(πθ(τ)/πθold(τ)) R(τ)].
Differentiating: ∇θJ(θ) = 𝔼_{τ∼πθold}[(∇θπθ(τ)/πθold(τ)) R(τ)], and at θ = θold this reduces to
∇θJ(θ)|θ=θold = 𝔼_{τ∼πθold}[∇θ log πθ(τ)|θ=θold R(τ)] ← the gradient evaluated at θold is unchanged.
• 24. Off-Policy Learning with Importance Sampling — The trajectory ratio factorizes over time steps:
πθ(τ)/πθold(τ) = Π_{t=1}^T πθ(a_t|s_t)/πθold(a_t|s_t),
so the surrogate becomes
J(θ) = 𝔼_{τ∼πθold} Σ_{t=1}^T (Π_{t′=1}^t πθ(a_{t′}|s_{t′})/πθold(a_{t′}|s_{t′})) Â_t.
Now we can use data from the old policy, but the variance has increased by a lot: those products of per-step ratios can explode or vanish! (A toy simulation below makes this concrete.)
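A toy simulation of the effect (our own illustration, not the lecture's): even per-step ratios that fluctuate mildly around 1 compound into trajectory weights with enormous variance.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_traj = 200, 10_000
# assume per-step log-ratios log(pi_new/pi_old) fluctuating mildly around 0
log_ratios = rng.normal(loc=0.0, scale=0.1, size=(n_traj, T))
traj_ratio = np.exp(log_ratios.sum(axis=1))  # product of T per-step ratios

print(np.median(traj_ratio))  # ~ 1: the typical trajectory weight looks harmless
print(traj_ratio.mean())      # ~ e ~ 2.7 in expectation: pulled up by rare huge weights
print(traj_ratio.std())       # >> mean: a handful of trajectories dominate the estimate
```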
• 25. Trust Region Policy Optimization — Define the following trust-region update:
maximize_θ 𝔼̂_t[(πθ(a_t|s_t)/πθold(a_t|s_t)) Â_t] subject to 𝔼̂_t[KL[πθold(·|s_t), πθ(·|s_t)]] ≤ δ.
Also worth considering is a penalty instead of a constraint:
maximize_θ 𝔼̂_t[(πθ(a_t|s_t)/πθold(a_t|s_t)) Â_t] − β 𝔼̂_t[KL[πθold(·|s_t), πθ(·|s_t)]].
By the method of Lagrange multipliers, an optimality point of the δ-constrained problem is also an optimality point of the β-penalized problem for some β. In practice, δ is easier to tune, and a fixed δ works better than a fixed β. Again the KL-penalized problem!
• 26. Solving the KL-Penalized Problem — maximize_θ L_{πθold}(πθ) − β · KL_{πθold}(πθ). Make a linear approximation to L_{πθold} and a quadratic approximation to the KL term:
maximize_θ g · (θ − θold) − (β/2) (θ − θold)⊤ F (θ − θold),
where g = ∂/∂θ L_{πθold}(πθ)|θ=θold and F = ∂²/∂θ² KL_{πθold}(πθ)|θ=θold.
The quadratic part of L is negligible compared to the KL term, and F is positive semidefinite (it would not be if we included the Hessian of L). Solution: θ − θold = (1/β) F⁻¹ g, where F is the Fisher information matrix and g the policy gradient. This is called the natural policy gradient (Kakade, 2001) — exactly what we saw with natural gradient descent! One important detail remains.
• 27. Trust Region Policy Optimization — Small problems with the NPG update: it might not be robust to the trust-region size δ (at some iterations δ may be too large and performance can degrade), and because of the quadratic approximation the KL-divergence constraint may be violated. Solution: require improvement in the surrogate (make sure L_{θk}(θk+1) ≥ 0) and enforce the KL constraint. How? A backtracking line search with exponential decay (decay coefficient α ∈ (0, 1), budget L). Algorithm 2 (Line Search for TRPO):
Compute the proposed policy step Δk = √(2δ / (ĝk⊤ Ĥk⁻¹ ĝk)) Ĥk⁻¹ ĝk.
For j = 0, 1, 2, …, L: compute the proposed update θ = θk + α^j Δk; if L_{θk}(θ) ≥ 0 and D̄_KL(θ ∥ θk) ≤ δ, accept the update, set θk+1 = θ, and break.
In other words: just do a line search for the stepsize, making sure that (a) we improve the objective and (b) the KL constraint is not violated. (A sketch follows below.)
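A minimal sketch of this backtracking line search, assuming user-supplied callables `surrogate_improvement(theta)` (an estimate of L_{θk}(θ)) and `mean_kl(theta)` (an estimate of D̄_KL(θ ∥ θk)); both names are ours, not the paper's.

```python
import numpy as np

def backtracking_line_search(theta_k, step, surrogate_improvement, mean_kl,
                             delta, alpha=0.8, budget=10):
    """Shrink the proposed NPG step until the surrogate improves and the
    exact (not quadratically approximated) KL constraint is satisfied."""
    for j in range(budget):
        theta = theta_k + (alpha ** j) * step
        if surrogate_improvement(theta) >= 0 and mean_kl(theta) <= delta:
            return theta          # accept the (possibly shrunk) step
    return theta_k                # no acceptable step found: keep old parameters
```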
• 28. Trust Region Policy Optimization — TRPO is implemented as truncated natural policy gradient (TNPG) plus a line search. Putting it all together, Algorithm 3 (TRPO):
Input: initial policy parameters θ0.
For k = 0, 1, 2, …:
 - Collect a set of trajectories D_k on policy π_k = π(θk).
 - Estimate advantages Â_t^{πk} using any advantage estimation algorithm.
 - Form sample estimates of the policy gradient ĝk (using the advantage estimates) and of the KL-divergence Hessian-vector product function f(v) = Ĥk v.
 - Use conjugate gradient (CG) with n_cg iterations to obtain x_k ≈ Ĥk⁻¹ ĝk.
 - Estimate the proposed step Δk ≈ √(2δ / (x_k⊤ Ĥk x_k)) x_k.
 - Perform a backtracking line search with exponential decay to obtain the final update θk+1 = θk + α^j Δk.
TRPO = NPG + line search. (A CG sketch follows below.)
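The CG step is what avoids ever forming or inverting Ĥk: it only needs Hessian-vector products. A standard CG sketch (our code, not the authors'), checked on a toy SPD matrix standing in for the KL Hessian:

```python
import numpy as np

def conjugate_gradient(hvp, g, n_iters=10, tol=1e-10):
    """Solve H x = g given only hvp(v) = H v, without ever forming H."""
    x = np.zeros_like(g)
    r = g.copy()                  # residual g - H x (x = 0 initially)
    p = r.copy()                  # search direction
    rr = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        step = rr / (p @ Hp)
        x += step * p
        r -= step * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + np.eye(5)           # toy SPD matrix in place of the KL Hessian
g = rng.normal(size=5)
x = conjugate_gradient(lambda v: H @ v, g, n_iters=50)
print(np.allclose(H @ x, g, atol=1e-6))  # True
```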
• 29. Trust Region Policy Optimization — (Algorithm 3 repeated.) The full recipe: TRPO = NPG + line search + monotonic improvement theorem!
• 30. Relating Objectives of Two Policies — Policy objective: J(πθ) = 𝔼_{τ∼πθ} Σ_{t=0}^∞ γ^t r_t. The objective of a new policy can be written in terms of the old one: J(πθ′) − J(πθ) = 𝔼_{τ∼πθ′} Σ_{t=0}^∞ γ^t A^{πθ}(s_t, a_t), or, equivalently and more succinctly: J(π′) − J(π) = 𝔼_{τ∼π′} Σ_{t=0}^∞ γ^t A^π(s_t, a_t).
• 31. Proof of the Relative Policy Performance Identity —
𝔼_{τ∼π′}[Σ_{t=0}^∞ γ^t A^π(s_t, a_t)]
= 𝔼_{τ∼π′}[Σ_{t=0}^∞ γ^t (R(s_t, a_t, s_{t+1}) + γ V^π(s_{t+1}) − V^π(s_t))]
= J(π′) + 𝔼_{τ∼π′}[Σ_{t=0}^∞ γ^{t+1} V^π(s_{t+1}) − Σ_{t=0}^∞ γ^t V^π(s_t)]
= J(π′) + 𝔼_{τ∼π′}[Σ_{t=1}^∞ γ^t V^π(s_t) − Σ_{t=0}^∞ γ^t V^π(s_t)]
= J(π′) − 𝔼_{τ∼π′}[V^π(s_0)]
= J(π′) − J(π).
The last step works because the initial state distribution is the same for both policies! (Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002.)
• 32. Relating Objectives of Two Policies — Discounted state visitation distribution: d^π(s) = (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s | π). Then
J(π′) − J(π) = 𝔼_{τ∼π′} Σ_{t=0}^∞ γ^t A^π(s_t, a_t) = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π′}[A^π(s, a)] = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π}[(π′(a|s)/π(a|s)) A^π(s, a)].
But how are we supposed to sample states from the policy we are trying to optimize for? Let's use the previous policy to sample them. What if we just said d^{π′} ≈ d^π and didn't worry about it?
J(π′) − J(π) ≈ (1/(1 − γ)) 𝔼_{s∼d^π, a∼π}[(π′(a|s)/π(a|s)) A^π(s, a)] ≐ L_π(π′).
It turns out this approximation is pretty good when π′ and π are close, and we can bound the error with a relative policy performance bound:
|J(π′) − (J(π) + L_π(π′))| ≤ C √(𝔼_{s∼d^π}[D_KL(π′ ∥ π)[s]]).
If the policies are close in KL divergence, the approximation is good! (Constrained Policy Optimization, Achiam et al., 2017.)
• 33. Relating Objectives of Two Policies —
L_π(π′) = 𝔼_{s∼d^π, a∼π}[(π′(a|s)/π(a|s)) A^π(s, a)] = 𝔼_{τ∼π}[Σ_{t=0}^∞ γ^t (π′(a_t|s_t)/π(a_t|s_t)) A^π(s_t, a_t)].
This is something we can optimize using trajectories from the old policy! Compare to the importance-sampled objective J(θ) = 𝔼_{τ∼πθold} Σ_{t=1}^T (Π_{t′=1}^t πθ(a_{t′}|s_{t′})/πθold(a_{t′}|s_{t′})) Â_t: now we do not have the product, so the gradient will have much smaller variance. (Yes — but only because we have approximated; that's why!) What is the gradient?
∇θ L_{θk}(θ)|θ=θk = 𝔼_{τ∼πθk}[Σ_{t=0}^∞ γ^t (∇θπθ(a_t|s_t)|θ=θk / πθk(a_t|s_t)) A^{πθk}(s_t, a_t)] = 𝔼_{τ∼πθk}[Σ_{t=0}^∞ γ^t ∇θ log πθ(a_t|s_t)|θ=θk A^{πθk}(s_t, a_t)].
(A sample-based sketch of L follows below.)
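A single-trajectory, sample-based estimate of L_π(π′), sketched with assumed per-timestep arrays `logp_new`, `logp_old`, `adv` (log-probabilities under π′ and π, and advantage estimates; the array names are ours). Note the one ratio per step — no running product.

```python
import numpy as np

def surrogate_estimate(logp_new, logp_old, adv, gamma=0.99):
    """Estimate L_pi(pi') from one trajectory collected under the old policy."""
    t = np.arange(len(adv))
    ratios = np.exp(logp_new - logp_old)   # one per-step importance ratio
    return np.sum(gamma ** t * ratios * adv)
```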
• 34. Monotonic Improvement Theorem —
|J(π′) − (J(π) + L_π(π′))| ≤ C √(𝔼_{s∼d^π}[KL(π′ ∥ π)[s]])
⇒ J(π′) − J(π) ≥ L_π(π′) − C √(𝔼_{s∼d^π}[KL(π′ ∥ π)[s]]).
Given policy π, we want to optimize over policy π′ to maximize J(π′) − J(π).
- If we maximize the RHS, we are guaranteed to maximize a lower bound on the LHS.
- We know how to maximize the RHS: both quantities involving π′ can be estimated from samples drawn with π.
- But will we actually have a better policy π′? Maximizing the lower bound is not enough — we also need it to be greater than or equal to zero.
• 35. Monotonic Improvement Theorem — Proof of the improvement guarantee. Suppose πk+1 and πk are related by
πk+1 = argmax_{π′} L_{πk}(π′) − C √(𝔼_{s∼d^{πk}}[D_KL(π′ ∥ πk)[s]]).
πk is a feasible point, and the objective at πk equals 0: L_{πk}(πk) ∝ 𝔼_{s,a∼d^{πk},πk}[A^{πk}(s, a)] = 0 and D_KL(πk ∥ πk)[s] = 0. Hence the optimal value is ≥ 0, and by the performance bound, J(πk+1) − J(πk) ≥ 0.
• 36. Approximate Monotonic Improvement — The theory is very conservative: the C it provides is quite high when γ is near 1, so steps from the penalized update
πk+1 = argmax_{π′} L_{πk}(π′) − C √(𝔼_{s∼d^{πk}}[D_KL(π′ ∥ πk)[s]])  (3)
are too small. Solution: instead of the KL penalty, use a KL constraint (called a trust region), which lets us control the worst-case error through the constraint's upper limit:
πk+1 = argmax_{π′} L_{πk}(π′) s.t. 𝔼_{s∼d^{πk}}[D_KL(π′ ∥ πk)[s]] ≤ δ.  (4)
• 37. Trust Region Policy Optimization — (Algorithm 3 from slide 28 again, now with its full justification.) TRPO = NPG + line search + monotonic improvement theorem!
• 38. Proximal Policy Optimization — Can we achieve similar performance without second-order information (no Fisher matrix)? PPO is a family of methods that approximately enforce the KL constraint without computing natural gradients. Two variants:
- Adaptive KL penalty: the policy update solves the unconstrained problem θk+1 = argmax_θ L_{θk}(θ) − βk D̄_KL(θ ∥ θk), where the penalty coefficient βk changes between iterations to approximately enforce the KL-divergence constraint.
- Clipped objective: let r_t(θ) = πθ(a_t|s_t)/πθk(a_t|s_t). The new objective is L^CLIP_{θk}(θ) = 𝔼_{τ∼πk}[Σ_{t=0}^T min(r_t(θ) Â_t^{πk}, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^{πk})], where ε is a hyperparameter (e.g., ε = 0.2). The policy update is θk+1 = argmax_θ L^CLIP_{θk}(θ).
• 39. PPO with Adaptive KL Penalty — Algorithm 4:
Input: initial policy parameters θ0, initial KL penalty β0, target KL divergence δ.
For k = 0, 1, 2, …:
 - Collect a set of partial trajectories D_k on policy π_k = π(θk).
 - Estimate advantages Â_t^{πk} using any advantage estimation algorithm.
 - Compute the policy update θk+1 = argmax_θ L_{θk}(θ) − βk D̄_KL(θ ∥ θk) by taking K steps of minibatch SGD (via Adam).
 - If D̄_KL(θk+1 ∥ θk) ≥ 1.5 δ: βk+1 = 2 βk. Else if D̄_KL(θk+1 ∥ θk) ≤ δ / 1.5: βk+1 = βk / 2.
The initial KL penalty is not that important — it adapts quickly. Some iterations may violate the KL constraint, but most don't. No expensive second-order approximation of the KL is needed: this is plain first-order gradient descent. (The β update rule is sketched below.)
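The β adaptation rule from Algorithm 4 as a tiny function (`kl_measured` is the measured average KL between the updated and previous policy; the function and argument names are ours).

```python
def adapt_beta(beta, kl_measured, delta):
    """PPO adaptive-penalty rule: raise beta if the policy moved too far
    in KL, lower it if the policy barely moved."""
    if kl_measured >= 1.5 * delta:
        return 2.0 * beta
    if kl_measured <= delta / 1.5:
        return beta / 2.0
    return beta
```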
• 40. PPO: Clipped Objective — Recall the surrogate objective
L^IS(θ) = 𝔼̂_t[(πθ(a_t|s_t)/πθold(a_t|s_t)) Â_t] = 𝔼̂_t[r_t(θ) Â_t].  (1)
Form a lower bound via clipped importance ratios:
L^CLIP(θ) = 𝔼̂_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)].  (2)
This is a pessimistic bound on the objective and can be optimized using SGD. (A sketch follows below.)
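The clipped surrogate written out as a numpy function over assumed per-timestep arrays (`logp_new`, `logp_old`, `adv` — the same hypothetical names as before).

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Sample estimate of L^CLIP: the pointwise min of the unclipped and
    clipped importance-weighted advantages."""
    r = np.exp(logp_new - logp_old)               # importance ratios r_t
    unclipped = r * adv
    clipped = np.clip(r, 1 - eps, 1 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))
```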
• 41. PPO with Clipped Objective — But how does clipping keep the policy close? By making the objective as pessimistic as possible about performance far away from θk. Figure: various objectives as a function of the interpolation factor α between θk+1 and θk after one update of PPO-Clip (Schulman et al., 2017).
• 42. PPO with Clipped Objective — Algorithm 5:
Input: initial policy parameters θ0, clipping threshold ε.
For k = 0, 1, 2, …:
 - Collect a set of partial trajectories D_k on policy π_k = π(θk).
 - Estimate advantages Â_t^{πk} using any advantage estimation algorithm.
 - Compute the policy update θk+1 = argmax_θ L^CLIP_{θk}(θ) by taking K steps of minibatch SGD (via Adam), where L^CLIP_{θk}(θ) = 𝔼_{τ∼πk}[Σ_{t=0}^T min(r_t(θ) Â_t^{πk}, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^{πk})].
Clipping removes the policy's incentive to move far away from θk. It seems to work at least as well as PPO with a KL penalty, and it is simpler to implement. (A skeleton of the inner update loop follows below.)
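A structural sketch of the "K steps of minibatch SGD" inner loop, assuming a user-supplied `update` callable that takes one minibatch and performs a single optimizer step on the clipped objective (everything here is our scaffolding, not the paper's code).

```python
import numpy as np

def ppo_inner_loop(data, update, K=10, batch_size=64, seed=0):
    """data: dict of equally long per-timestep arrays, e.g. obs, act,
    logp_old, adv; update: callable performing one SGD/Adam step."""
    rng = np.random.default_rng(seed)
    n = len(data["adv"])
    for _ in range(K):                       # K passes over the collected data
        idx = rng.permutation(n)             # reshuffle each epoch
        for start in range(0, n, batch_size):
            mb = idx[start:start + batch_size]
            update({k: v[mb] for k, v in data.items()})
```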
• 43. Empirical Performance of PPO — Figure: performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks (Schulman et al., 2017).
• 44. Summary — Gradient descent in parameter space vs. in distribution space. Natural gradients: we need to keep track of how the KL changes from iteration to iteration. Natural policy gradients, TRPO, PPO: the clipped objective works well.
Further Reading:
- S. Kakade. "A Natural Policy Gradient." NIPS 2001.
- S. Kakade and J. Langford. "Approximately Optimal Approximate Reinforcement Learning." ICML 2002.
- J. Peters and S. Schaal. "Natural Actor-Critic." Neurocomputing, 2008.
- J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization." ICML 2015.
- Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." ICML 2016.
- J. Martens and I. Sutskever. "Training Deep and Recurrent Networks with Hessian-Free Optimization." Springer, 2012.
- Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. "Sample Efficient Actor-Critic with Experience Replay." 2016.
- Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation." 2017.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms." 2017.
- J. Achiam, D. Held, A. Tamar, and P. Abbeel. "Constrained Policy Optimization." 2017.
- blog.openai.com: recent posts on baselines releases.