This document contains slides about policy gradients, an approach to reinforcement learning. It discusses the likelihood ratio policy gradient method, which estimates the gradient of the expected return with respect to the policy parameters. The gradient aims to increase the probability of high-reward paths and decrease the probability of low-reward paths. The derivation from importance sampling is shown, and it is noted that this suggests looking at more than just the gradient. Fixes for practical use include adding a baseline to reduce variance and exploiting the temporal structure of the paths.
7. Why Policy Optimization
- Often \pi can be simpler than Q or V
  - E.g., robotic grasp
- V: doesn't prescribe actions
  - Would need a dynamics model (+ compute 1 Bellman back-up)
- Q: need to be able to efficiently solve \arg\max_u Q_\theta(s, u)
  - Challenge for continuous / high-dimensional action spaces*
*some recent work (partially) addressing this:
NAF: Gu, Lillicrap, Sutskever, Levine, ICML 2016
Input Convex NNs: Amos, Xu, Kolter, arXiv 2016
Deep Energy Q: Haarnoja, Tang, Abbeel, Levine, ICML 2017
8. Policy Optimization vs. Dynamic Programming
- Conceptually:
  - Policy optimization: optimize what you care about
  - Dynamic programming: indirect; exploits problem structure and self-consistency
- Empirically:
  - Policy optimization: more compatible with rich architectures (including recurrence), more versatile, more compatible with auxiliary objectives
  - Dynamic programming: more compatible with exploration and off-policy learning, more sample-efficient when it works
17. Likelihood Ratio Gradient: Intuition
- Gradient tries to:
  - Increase probability of paths with positive R
  - Decrease probability of paths with negative R

\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, R(\tau^{(i)})

→ The likelihood ratio gradient changes the probabilities of experienced paths; it does not try to change the paths themselves (<-> path derivative methods).
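As a concrete illustration of this estimator (not from the slides), here is a minimal sketch for a toy one-step "path" u ~ N(theta, 1) with reward R(u) = -(u - 3)^2, so that \nabla_\theta \log P(\tau; \theta) has a closed form; the toy problem and all names are assumptions made only for the example.

```python
# Minimal sketch: Monte Carlo likelihood-ratio gradient
#   g_hat = (1/m) * sum_i grad_theta log P(tau_i; theta) * R(tau_i)
# for a toy 1-step "path" u ~ N(theta, 1) with reward R(u) = -(u - 3)^2.
import numpy as np

def likelihood_ratio_gradient(theta, m=1000, rng=np.random.default_rng(0)):
    u = rng.normal(loc=theta, scale=1.0, size=m)   # sample m paths
    returns = -(u - 3.0) ** 2                      # R(tau_i) for each path
    grad_log_p = u - theta                         # d/dtheta log N(u; theta, 1)
    return np.mean(grad_log_p * returns)           # g_hat

theta = 0.0
for _ in range(200):                               # plain gradient ascent on U(theta)
    theta += 0.01 * likelihood_ratio_gradient(theta)
print(theta)   # drifts toward the high-reward region around u = 3
```

Note how the update only re-weights the log-probabilities of sampled paths by their returns; it never differentiates through the paths themselves.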
30. Likelihood Ratio Gradient: Baseline
→ Consider a baseline b:

\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, \big( R(\tau^{(i)}) - b \big)

still unbiased! [Williams 1992]

E\big[ \nabla_\theta \log P(\tau; \theta) \, b \big]
  = \sum_\tau P(\tau; \theta) \, \nabla_\theta \log P(\tau; \theta) \, b
  = \sum_\tau P(\tau; \theta) \, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} \, b
  = \sum_\tau \nabla_\theta P(\tau; \theta) \, b
  = \nabla_\theta \Big( \sum_\tau P(\tau; \theta) \, b \Big)
  = b \, \nabla_\theta \Big( \sum_\tau P(\tau; \theta) \Big) = b \times 0

OK as long as the baseline doesn't depend on the action in \log\text{prob}(\text{action})
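The zero-expectation argument above can be sanity-checked numerically; a small sketch, reusing the toy Gaussian "policy" from before (names are illustrative, not from the slides):

```python
# Sketch: E[ grad_theta log P(tau; theta) * b ] = 0 for a constant baseline b.
# For u ~ N(theta, 1), grad_theta log P(u; theta) = (u - theta).
import numpy as np

rng = np.random.default_rng(1)
theta, b = 0.5, 7.3
u = rng.normal(theta, 1.0, size=1_000_000)
print(np.mean((u - theta) * b))   # ~0: subtracting b leaves the estimator unbiased
```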
31. Likelihood Ratio and Temporal Structure
- Current estimate:

\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta) \, \big( R(\tau^{(i)}) - b \big)
        = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \right) \left( \sum_{t=0}^{H-1} R(s_t^{(i)}, u_t^{(i)}) - b \right)
        = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left[ \sum_{k=0}^{t-1} R(s_k^{(i)}, u_k^{(i)}) + \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b \right]

- Removing terms that don't depend on the current action can lower variance: the past rewards \sum_{k=0}^{t-1} R(s_k^{(i)}, u_k^{(i)}) don't depend on u_t^{(i)}, and the baseline is OK to depend on s_t^{(i)}:

\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - b(s_t^{(i)}) \right)

[Policy Gradient Theorem: Sutton et al., NIPS 1999; GPOMDP: Bartlett & Baxter, JAIR 2001; Survey: Peters & Schaal, IROS 2006]
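In implementations this reduced-variance form is usually computed as per-timestep "reward-to-go" weights. Below is a sketch under assumed array shapes (an (m, H) reward matrix); the function name and layout are made up for the example.

```python
# Sketch: reward-to-go weights sum_{k=t}^{H-1} R(s_k, u_k) - b for a batch of rollouts.
# rewards: shape (m, H), entry (i, t) = R(s_t^(i), u_t^(i)).
import numpy as np

def reward_to_go_weights(rewards, b=None):
    # reverse cumulative sum along time gives sum_{k=t}^{H-1} r_k at each t
    rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    if b is not None:
        rtg = rtg - b            # b: scalar, length-H vector, or per-(i, t) values
    return rtg

rewards = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 1.0]])
print(reward_to_go_weights(rewards))   # [[3. 2. 2.], [2. 2. 1.]]
```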
32. Baseline Choices
- Good choice for b?
- Constant baseline: b = E[R(\tau)] \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)})
- Optimal constant baseline (the constant that minimizes the variance of the estimator)
- Time-dependent baseline: b_t = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)})
- State-dependent expected return: b(s_t) = E[r_t + r_{t+1} + r_{t+2} + \dots + r_{H-1}] = V^\pi(s_t)
→ Increase the log probability of an action proportionally to how much its return is better than the expected return under the current policy.
[See Greensmith, Bartlett & Baxter, JMLR 2004 for variance reduction techniques.]
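A small sketch (illustrative names, same assumed (m, H) reward array as above) of how the constant and time-dependent baselines would be computed from a batch of rollouts:

```python
# Sketch: constant and time-dependent baselines from a batch of rollout rewards.
import numpy as np

def constant_baseline(rewards):
    # b = E[R(tau)] ~ (1/m) sum_i R(tau_i), where R(tau_i) sums rollout i's rewards
    return rewards.sum(axis=1).mean()

def time_dependent_baseline(rewards):
    # b_t = (1/m) sum_i sum_{k=t}^{H-1} R(s_k^(i), u_k^(i)): average reward-to-go at t
    rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    return rtg.mean(axis=0)          # shape (H,)

rewards = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 1.0]])
print(constant_baseline(rewards))        # 2.5
print(time_dependent_baseline(rewards))  # [2.5 2.  1.5]
```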
33. Estimation of V^\pi
- Gradient estimate with a state-value baseline:

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

- How to estimate V^\pi?
  - Init V^\pi_{\phi_0}
  - Collect trajectories \tau_1, \dots, \tau_m
  - Regress against the empirical return:

\phi_{i+1} \leftarrow \arg\min_\phi \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \left( V^\pi_\phi(s_t^{(i)}) - \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) \right)^2
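A minimal sketch of this regression, assuming a linear value function V_\phi(s) = \phi^\top s purely to keep the example self-contained; the data layout and names are assumptions, not the slides' setup.

```python
# Sketch: fit V^pi_phi by regressing onto empirical returns (Monte Carlo targets).
# states: (N, d) array of visited states s_t^(i); returns_to_go: (N,) array of
# sum_{k=t}^{H-1} R(s_k^(i), u_k^(i)) for the corresponding (i, t).
import numpy as np

def fit_value_function(states, returns_to_go, lam=1e-3):
    # linear value function V_phi(s) = phi . s, fit by ridge-regularized least squares
    d = states.shape[1]
    A = states.T @ states + lam * np.eye(d)
    b = states.T @ returns_to_go
    return np.linalg.solve(A, b)     # phi

# tiny illustrative example
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))
true_phi = np.array([1.0, -2.0, 0.5, 0.0])
returns_to_go = states @ true_phi + 0.1 * rng.normal(size=500)
print(fit_value_function(states, returns_to_go))   # close to true_phi
```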
34. Estimation of V^\pi: Bellman Equation
- Bellman equation for V^\pi:

V^\pi(s) = \sum_u \pi(u \mid s) \sum_{s'} P(s' \mid s, u) \big[ R(s, u, s') + V^\pi(s') \big]

- Init V^\pi_{\phi_0}
- Collect data \{s, u, s', r\}
- Fitted V iteration:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s, u, s', r)} \| r + V^\pi_{\phi_i}(s') - V_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2
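One step of this fitted V iteration can be sketched the same way, again with an assumed linear V_\phi; the one-step target r + V^\pi_{\phi_i}(s') is held fixed while solving for the new \phi. Names and shapes are illustrative, not from the slides.

```python
# Sketch: one step of fitted V iteration with a linear V_phi(s) = phi . s.
# Data {s, u, s', r} is assumed to arrive as arrays s (N, d), r (N,), s_next (N, d).
import numpy as np

def fitted_v_step(phi_i, s, r, s_next, lam=1e-2):
    # regression target uses the previous parameters phi_i (the "fitted" part)
    target = r + s_next @ phi_i                      # r + V^pi_{phi_i}(s')
    d = s.shape[1]
    # minimize ||target - V_phi(s)||^2 + lam * ||phi - phi_i||^2 in closed form
    A = s.T @ s + lam * np.eye(d)
    b = s.T @ target + lam * phi_i
    return np.linalg.solve(A, b)                     # phi_{i+1}
```

Iterating fitted_v_step corresponds to the loop above; the \lambda term keeps \phi_{i+1} close to \phi_i.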
36. Recall Our Likelihood Ratio PG Estimator

\frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t^{(i)} \mid s_t^{(i)}) \left( \sum_{k=t}^{H-1} R(s_k^{(i)}, u_k^{(i)}) - V^\pi(s_t^{(i)}) \right)

- Estimation of Q^\pi(s, u) = E[r_0 + r_1 + r_2 + \cdots \mid s_0 = s, u_0 = u] from a single roll-out
  = high variance per sample / no generalization used
- Reduce variance by discounting (see the sketch below)
- Reduce variance by function approximation (= critic)
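For the discounting point in the list above, the undiscounted reward-to-go is replaced by a discounted one; a brief sketch with an assumed discount gamma (not a value from the slides):

```python
# Sketch: discounted reward-to-go, sum_{k=t}^{H-1} gamma^(k-t) * r_k, for one rollout.
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):          # backward recursion
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_reward_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.5))  # [1.5 1.  2. ]
```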
49. Actor-Critic with A3C or GAE
- Policy Gradient + Generalized Advantage Estimation:
  - Init V^\pi_{\phi_0}, \pi_{\theta_0}
  - Collect roll-outs \{s, u, s', r\} and return estimates \hat{Q}_i(s, u)
  - Update the critic:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s, u, s', r)} \| \hat{Q}_i(s, u) - V^\pi_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2

  - Update the actor:

\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \left( \hat{Q}_i(s_t^{(k)}, u_t^{(k)}) - V^\pi_{\phi_i}(s_t^{(k)}) \right)

Note: many variations exist; e.g., could instead use a 1-step target for V and the full roll-out for \pi:

\phi_{i+1} \leftarrow \min_\phi \sum_{(s, u, s', r)} \| r + V^\pi_{\phi_i}(s') - V_\phi(s) \|_2^2 + \lambda \| \phi - \phi_i \|_2^2

\theta_{i+1} \leftarrow \theta_i + \alpha \frac{1}{m} \sum_{k=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_i}(u_t^{(k)} \mid s_t^{(k)}) \left( \sum_{t'=t}^{H-1} r_{t'}^{(k)} - V^\pi_{\phi_i}(s_t^{(k)}) \right)
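To make the two updates concrete, here is a compact sketch of one actor-critic iteration with a tabular critic V[s] and a softmax policy with logits theta[s, u] on a small discrete problem. The data layout and every name are assumptions for illustration, not the slides' implementation; the regularized regression is simplified to a moving-average critic update.

```python
# Sketch: one actor-critic iteration for a small discrete MDP.
# states, actions: integer arrays of visited (s_t, u_t); q_hat: targets Q_hat(s_t, u_t).
import numpy as np

def actor_critic_update(theta, V, states, actions, q_hat, alpha=0.1, beta=0.5):
    # Actor: theta <- theta + alpha * mean_t grad log pi(u_t|s_t) * (Q_hat_t - V(s_t)),
    # using the critic from the previous iteration as the baseline.
    grad = np.zeros_like(theta)
    for s, u, q in zip(states, actions, q_hat):
        pi_s = np.exp(theta[s] - theta[s].max())
        pi_s /= pi_s.sum()                 # softmax over the actions available in s
        adv = q - V[s]                     # advantage estimate Q_hat - V
        grad[s] -= pi_s * adv              # grad of log pi(u|s) w.r.t. theta[s, :]
        grad[s, u] += adv                  #   equals one_hot(u) - pi(.|s)
    theta = theta + alpha * grad / len(states)

    # Critic: move V(s) toward the targets Q_hat(s, u) (simple regression step)
    for s, q in zip(states, q_hat):
        V[s] += beta * (q - V[s])
    return theta, V
```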