This document contains slides about policy gradients, an approach to reinforcement learning. It discusses the likelihood ratio policy gradient method, which estimates the gradient of the expected return with respect to the policy parameters. The gradient aims to increase the probability of high-reward paths and decrease the probability of low-reward paths. The derivation from importance sampling is shown, and it is noted that this suggests looking at more than just the gradient. Fixes for practical use include adding a baseline to reduce variance and exploiting the temporal structure of the paths.
7. Why Policy Optimization
- Often \pi can be simpler than Q or V
  - E.g., robotic grasp
- V: doesn't prescribe actions
  - Would need a dynamics model (+ compute 1 Bellman back-up)
- Q: need to be able to efficiently solve \arg\max_u Q_\theta(s, u)
  - Challenge for continuous / high-dimensional action spaces*
*some recent work (partially) addressing this:
NAF: Gu, Lillicrap, Sutskever, Levine, ICML 2016
Input Convex NNs: Amos, Xu, Kolter, arXiv 2016
Deep Energy Q: Haarnoja, Tang, Abbeel, Levine, ICML 2017
8. Policy Optimization vs. Dynamic Programming
- Conceptually:
  - Policy optimization: optimize what you care about
  - Dynamic programming: indirect; exploits problem structure and self-consistency
- Empirically:
  - Policy optimization: more compatible with rich architectures (including recurrence), more versatile, more compatible with auxiliary objectives
  - Dynamic programming: more compatible with exploration and off-policy learning, more sample-efficient when it works
17. Likelihood Ratio Gradient: Intuition
- Gradient tries to:
  - Increase probability of paths with positive R
  - Decrease probability of paths with negative R

\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

→ The likelihood ratio gradient changes the probabilities of experienced paths; it does not try to change the paths themselves (<-> path derivative methods).
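As a concrete illustration of this estimator (not from the slides), here is a minimal sketch for a toy one-step "path" u ~ N(theta, 1) with reward R(u) = -(u - 3)^2, so that \nabla_\theta \log P(\tau; \theta) has a closed form; the toy problem and all names are assumptions made only for the example.

```python
# Minimal sketch: Monte Carlo likelihood-ratio gradient
#   g_hat = (1/m) * sum_i grad_theta log P(tau_i; theta) * R(tau_i)
# for a toy 1-step "path" u ~ N(theta, 1) with reward R(u) = -(u - 3)^2.
import numpy as np

def likelihood_ratio_gradient(theta, m=1000, rng=np.random.default_rng(0)):
    u = rng.normal(loc=theta, scale=1.0, size=m)   # sample m paths
    returns = -(u - 3.0) ** 2                      # R(tau_i) for each path
    grad_log_p = u - theta                         # d/dtheta log N(u; theta, 1)
    return np.mean(grad_log_p * returns)           # g_hat

theta = 0.0
for _ in range(200):                               # plain gradient ascent on U(theta)
    theta += 0.01 * likelihood_ratio_gradient(theta)
print(theta)   # drifts toward the high-reward region around u = 3
```

Note how the update only re-weights the log-probabilities of sampled paths by their returns; it never differentiates through the paths themselves.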
30. Likelihood Ratio Gradient: Baseline
→ Consider a baseline b:

\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, \big( R(\tau^{(i)}) - b \big)

still unbiased! [Williams 1992]

E\big[ \nabla_\theta \log P(\tau; \theta) \, b \big]
  = \sum_\tau P(\tau; \theta) \, \nabla_\theta \log P(\tau; \theta) \, b
  = \sum_\tau P(\tau; \theta) \, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} \, b
  = \sum_\tau \nabla_\theta P(\tau; \theta) \, b
  = \nabla_\theta \Big( \sum_\tau P(\tau; \theta) \, b \Big)
  = b \, \nabla_\theta \Big( \sum_\tau P(\tau; \theta) \Big) = b \times 0

OK as long as the baseline doesn't depend on the action in \log\text{prob}(\text{action})
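The zero-expectation argument above can be sanity-checked numerically; a small sketch, reusing the toy Gaussian "policy" from before (names are illustrative, not from the slides):

```python
# Sketch: E[ grad_theta log P(tau; theta) * b ] = 0 for a constant baseline b.
# For u ~ N(theta, 1), grad_theta log P(u; theta) = (u - theta).
import numpy as np

rng = np.random.default_rng(1)
theta, b = 0.5, 7.3
u = rng.normal(theta, 1.0, size=1_000_000)
print(np.mean((u - theta) * b))   # ~0: subtracting b leaves the estimator unbiased
```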
31. Likelihood Ratio and Temporal Structure
- Current estimate:

\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, \big( R(\tau^{(i)}) - b \big)
        = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \right) \left( \sum_{t=0}^{H-1} R(s_t^{(i)}, u_t^{(i)}) - b \right)
        = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left[ \sum_{k=0}^{t-1} R(s_k^{(i)}, u_k^{(i)}) + \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b \right]

- Removing terms that don't depend on the current action can lower variance: the past rewards \sum_{k=0}^{t-1} R(s_k^{(i)}, u_k^{(i)}) don't depend on u_t^{(i)}, and the baseline is OK to depend on s_t^{(i)}:

\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b(s_t^{(i)}) \right)

[Policy Gradient Theorem: Sutton et al., NIPS 1999; GPOMDP: Bartlett & Baxter, JAIR 2001; Survey: Peters & Schaal, IROS 2006]
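In implementations this reduced-variance form is usually computed as per-timestep "reward-to-go" weights. Below is a sketch under assumed array shapes (an (m, H) reward matrix); the function name and layout are made up for the example.

```python
# Sketch: reward-to-go weights sum_{k=t}^{H-1} R(s_k, u_k) - b for a batch of rollouts.
# rewards: shape (m, H), entry (i, t) = R(s_t^(i), u_t^(i)).
import numpy as np

def reward_to_go_weights(rewards, b=None):
    # reverse cumulative sum along time gives sum_{k=t}^{H-1} r_k at each t
    rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    if b is not None:
        rtg = rtg - b            # b: scalar, length-H vector, or per-(i, t) values
    return rtg

rewards = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 1.0]])
print(reward_to_go_weights(rewards))   # [[3. 2. 2.], [2. 2. 1.]]
```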
32. Baseline Choices
- Good choice for b?
- Constant baseline: b = E[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)})
- Optimal constant baseline (the constant that minimizes the variance of the estimator)
- Time-dependent baseline: b_t = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)})
- State-dependent expected return: b(s_t) = E[r_t + r_{t+1} + r_{t+2} + \dots + r_{H-1}] = V^\pi(s_t)
→ Increase the log probability of an action proportionally to how much its return is better than the expected return under the current policy.
[See Greensmith, Bartlett & Baxter, JMLR 2004 for variance reduction techniques.]
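A small sketch (illustrative names, same assumed (m, H) reward array as above) of how the constant and time-dependent baselines would be computed from a batch of rollouts:

```python
# Sketch: constant and time-dependent baselines from a batch of rollout rewards.
import numpy as np

def constant_baseline(rewards):
    # b = E[R(tau)] ~ (1/m) sum_i R(tau_i), where R(tau_i) sums rollout i's rewards
    return rewards.sum(axis=1).mean()

def time_dependent_baseline(rewards):
    # b_t = (1/m) sum_i sum_{k=t}^{H-1} R(s_k^(i), u_k^(i)): average reward-to-go at t
    rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    return rtg.mean(axis=0)          # shape (H,)

rewards = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 1.0]])
print(constant_baseline(rewards))        # 2.5
print(time_dependent_baseline(rewards))  # [2.5 2.  1.5]
```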
33. Estimation of V^\pi
- Gradient estimate with a state-value baseline:

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

- How to estimate V^\pi?
  - Init V^\pi_{\phi_0}
  - Collect trajectories \tau_1, \dots, \tau_m
  - Regress against the empirical return:

\phi_{i+1} \leftarrow \arg\min_\phi \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \left( V^\pi_\phi(s_t^{(i)}) - \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) \right)^2
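A minimal sketch of this regression, assuming a linear value function V_\phi(s) = \phi^\top s purely to keep the example self-contained; the data layout and names are assumptions, not the slides' setup.

```python
# Sketch: fit V^pi_phi by regressing onto empirical returns (Monte Carlo targets).
# states: (N, d) array of visited states s_t^(i); returns_to_go: (N,) array of
# sum_{k=t}^{H-1} R(s_k^(i), u_k^(i)) for the corresponding (i, t).
import numpy as np

def fit_value_function(states, returns_to_go, lam=1e-3):
    # linear value function V_phi(s) = phi . s, fit by ridge-regularized least squares
    d = states.shape[1]
    A = states.T @ states + lam * np.eye(d)
    b = states.T @ returns_to_go
    return np.linalg.solve(A, b)     # phi

# tiny illustrative example
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))
true_phi = np.array([1.0, -2.0, 0.5, 0.0])
returns_to_go = states @ true_phi + 0.1 * rng.normal(size=500)
print(fit_value_function(states, returns_to_go))   # close to true_phi
```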
34. Estimation of V^\pi: Bellman Equation
- Bellman equation for V^\pi:

V^\pi(s) = \sum_u \pi(u \mid s) \sum_{s'} P(s' \mid s, u) \big[ R(s, u, s') + V^\pi(s') \big]

- Init V^\pi_{\phi_0}
- Collect data \{s, u, s', r\}
- Fitted V iteration:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s, u, s', r)} \| r + V^\pi_{\phi_i}(s') - V_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2
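One step of this fitted V iteration can be sketched the same way, again with an assumed linear V_\phi; the one-step target r + V^\pi_{\phi_i}(s') is held fixed while solving for the new \phi. Names and shapes are illustrative, not from the slides.

```python
# Sketch: one step of fitted V iteration with a linear V_phi(s) = phi . s.
# Data {s, u, s', r} is assumed to arrive as arrays s (N, d), r (N,), s_next (N, d).
import numpy as np

def fitted_v_step(phi_i, s, r, s_next, lam=1e-2):
    # regression target uses the previous parameters phi_i (the "fitted" part)
    target = r + s_next @ phi_i                      # r + V^pi_{phi_i}(s')
    d = s.shape[1]
    # minimize ||target - V_phi(s)||^2 + lam * ||phi - phi_i||^2 in closed form
    A = s.T @ s + lam * np.eye(d)
    b = s.T @ target + lam * phi_i
    return np.linalg.solve(A, b)                     # phi_{i+1}
```

Iterating fitted_v_step corresponds to the loop above; the \lambda term keeps \phi_{i+1} close to \phi_i.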
36. Recall Our Likelihood Ratio PG Estimator

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

- Estimation of Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u] from a single roll-out
  = high variance per sample / no generalization used
- Reduce variance by discounting (see the sketch below)
- Reduce variance by function approximation (= critic)
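For the discounting point in the list above, the undiscounted reward-to-go is replaced by a discounted one; a brief sketch with an assumed discount gamma (not a value from the slides):

```python
# Sketch: discounted reward-to-go, sum_{k=t}^{H-1} gamma^(k-t) * r_k, for one rollout.
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):          # backward recursion
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_reward_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.5))  # [1.5 1.  2. ]
```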
49. Actor-Critic with A3C or GAE
- Policy Gradient + Generalized Advantage Estimation:
  - Init V^\pi_{\phi_0}, \pi_{\theta_0}
  - Collect roll-outs \{s, u, s', r\} and return estimates \hat{Q}_i(s, u)
  - Update the critic:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s, u, s', r)} \| \hat{Q}_i(s, u) - V^\pi_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2

  - Update the actor:

\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \left( \hat{Q}_i(s_t^{(k)}, u_t^{(k)}) - V^\pi_{\phi_i}(s_t^{(k)}) \right)

Note: many variations exist; e.g., could instead use a 1-step target for V and the full roll-out for \pi:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s, u, s', r)} \| r + V^\pi_{\phi_i}(s') - V_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2

\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \left( \sum_{t'=t}^{H-1} r_{t'}^{(k)} - V^\pi_{\phi_i}(s_t^{(k)}) \right)
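To make the two updates concrete, here is a compact sketch of one actor-critic iteration with a tabular critic V[s] and a softmax policy with logits theta[s, u] on a small discrete problem. The data layout and every name are assumptions for illustration, not the slides' implementation; the regularized regression is simplified to a moving-average critic update.

```python
# Sketch: one actor-critic iteration for a small discrete MDP.
# states, actions: integer arrays of visited (s_t, u_t); q_hat: targets Q_hat(s_t, u_t).
import numpy as np

def actor_critic_update(theta, V, states, actions, q_hat, alpha=0.1, beta=0.5):
    # Actor: theta <- theta + alpha * mean_t grad log pi(u_t|s_t) * (Q_hat_t - V(s_t)),
    # using the critic from the previous iteration as the baseline.
    grad = np.zeros_like(theta)
    for s, u, q in zip(states, actions, q_hat):
        pi_s = np.exp(theta[s] - theta[s].max())
        pi_s /= pi_s.sum()                 # softmax over the actions available in s
        adv = q - V[s]                     # advantage estimate Q_hat - V
        grad[s] -= pi_s * adv              # grad of log pi(u|s) w.r.t. theta[s, :]
        grad[s, u] += adv                  #   equals one_hot(u) - pi(.|s)
    theta = theta + alpha * grad / len(states)

    # Critic: move V(s) toward the targets Q_hat(s, u) (simple regression step)
    for s, q in zip(states, q_hat):
        V[s] += beta * (q - V[s])
    return theta, V
```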