
Continuous control

Reinforcement Learning, Continuous Control, DeepX, Model-free, TRPO, DDPG, NAF, A3C


  1. Model-free Continuous Control Reinforcement Learning (Reiji Hatsugai / 初谷怜慈)
  2. Self-introduction • American football -> University of Tokyo Warriors • Second-year master's student, Graduate School of Information Science and Technology, The University of Tokyo • DeepX, senior engineer • Research -> reinforcement learning, especially toward real-world environments • Twitter -> @Reiji_Hatsu • GitHub -> rarilurelo
  3. What is reinforcement learning? (diagram: Environment and Agent)
  4. What is reinforcement learning? (Environment -> Agent: state, reward)
  5. What is reinforcement learning? (Agent -> Environment: action)
  6. What is reinforcement learning? (Environment <-> Agent: state, reward, action) Described as an MDP or POMDP
  7. Formulation: the Agent samples $A_t \sim \pi(a \mid S_t)$; the Environment returns $S_{t+1} \sim P(s' \mid S_t, A_t)$ and $r_{t+1} = r(S_t, A_t, S_{t+1})$.
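A minimal sketch of this interaction loop, assuming an OpenAI-Gym-style environment whose `step` returns `(state, reward, done, info)`; the `policy` callable and the horizon are placeholders.

```python
def rollout(env, policy, horizon=200, gamma=0.99):
    """Run one episode and return the discounted return sum_t gamma^t * r_t."""
    s = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)                  # A_t ~ pi(a | S_t)
        s, r, done, _ = env.step(a)    # S_{t+1} ~ P(s' | S_t, A_t), r_{t+1} = r(...)
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret
```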
  8. Formulation (model-free): model only $\pi$. Find $\pi^* = \arg\max_\pi E_\pi\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$, with $A_t \sim \pi(a \mid S_t)$, $S_{t+1} \sim P(s' \mid S_t, A_t)$, $r_{t+1} = r(S_t, A_t, S_{t+1})$.
  9. Formulation (model-based): model both $\pi$ and $P$. Find $\pi^* = \arg\max_\pi E_\pi\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$.
  10. Example • DQN – learn a Q function – ε-greedy is the policy π: $\pi(a \mid s) = \arg\max_a Q(s,a)$ if $\varepsilon < u$, a random action if $u < \varepsilon$, where $u \sim \mathrm{Uniform}(0,1)$.
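A minimal sketch of the ε-greedy rule above, assuming `q_values` is an array of Q(s, a) values over a discrete action set.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random):
    """Return argmax_a Q(s, a) when epsilon < u, a random action when u < epsilon."""
    u = rng.uniform(0.0, 1.0)              # u ~ Uniform(0, 1)
    if u < epsilon:
        return rng.randint(len(q_values))  # explore
    return int(np.argmax(q_values))        # exploit
```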
  11. What is Continuous Control? Atari Invaders (discrete actions: stick directions, buttons) vs. a robot arm (continuous actions: torques).
  12. What is Continuous Control? For a robot arm, assume π is Gaussian (a Gaussian policy): the action is sampled from $\mathcal{N}(\mu(s), \sigma(s))$, where μ(s) and σ(s) are represented by a neural network.
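A sketch of such a Gaussian policy, with μ(s) and σ(s) produced by a small MLP. PyTorch is an assumption here (the slides do not name a framework), and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi(a | s) = N(mu(s), sigma(s)^2), with mu and log-sigma heads on a shared MLP."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_sigma_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        return self.mu_head(h), self.log_sigma_head(h).exp()

    def sample(self, state):
        mu, sigma = self(state)
        dist = torch.distributions.Normal(mu, sigma)
        action = dist.sample()             # action sampled from N(mu(s), sigma(s))
        return action, dist.log_prob(action).sum(-1)
```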
  13. Overview of reinforcement learning's complexity (figure: state-space complexity vs. action-space complexity, each ranging from discrete to continuous)
  14. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
  15. Definition (Bellman equations): $Q^\pi(s,a) = E_{P,\pi}\left[\sum_t \gamma^t r_t \mid s, a\right] = E_P\left[r + \gamma E_\pi\left[Q^\pi(s', a')\right]\right]$ (be in s, take a, then act according to π); $V^\pi(s) = E_{P,\pi}\left[\sum_t \gamma^t r_t \mid s\right] = E_{P,\pi}\left[r + \gamma V^\pi(s')\right]$ (be in s, then act according to π); $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ (advantage function: the true influence of a).
  16. Q-learning: consider the optimal policy $\pi^*$. $Q^{\pi^*}(s,a) = E_P\left[r + \gamma E_{\pi^*}\left[Q^{\pi^*}(s', a')\right]\right]$, with $\pi^*(a \mid s) = 1$ for $a = \arg\max_a Q^{\pi^*}(s,a)$ and $0$ otherwise, so $Q^{\pi^*}(s,a) = E_P\left[r + \gamma \max_{a'} Q^{\pi^*}(s', a')\right]$. With function approximation, minimize (lhs − rhs)².
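A sketch of that squared Bellman error for a function approximator, assuming a PyTorch `q_net` over discrete actions and a frozen `target_net` (both hypothetical names).

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, batch, gamma=0.99):
    """Minimize (Q(s,a) - (r + gamma * max_a' Q_target(s',a')))^2 over a batch."""
    s, a, r, s_next, done = batch          # a: LongTensor of action indices
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                  # the target (rhs) is held fixed
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```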
  17. NAF (Normalized Advantage Function): in DQN we can take the max of Q over actions. How can we get $\max_a Q$ with continuous actions?
  18. NAF (Normalized Advantage Function): $Q(s,a) = A(s,a) + V(s)$ with $A(s,a) = -\frac{1}{2}(a - \mu(s))^{T} P(s)(a - \mu(s))$, where $P(s)$ is a positive-definite matrix, so $A$ is a concave quadratic in $a$ with maximum $0$ at $a = \mu(s)$. Thus $\max_a Q(s,a) = 0 + V(s)$, and we minimize $(r + \gamma \max_{a'} Q(s', a') - Q(s,a))^2$ w.r.t. all parameters.
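A sketch of the NAF parameterization in PyTorch; the layer sizes are illustrative, and for simplicity P(s) is built as L L^T from an unconstrained lower-triangular L (the NAF paper additionally transforms the diagonal of L to keep P strictly positive definite).

```python
import torch
import torch.nn as nn

class NAFNetwork(nn.Module):
    """Q(s,a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)), so max_a Q(s,a) = V(s)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.action_dim = action_dim
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.v_head = nn.Linear(hidden, 1)
        self.mu_head = nn.Linear(hidden, action_dim)
        self.l_head = nn.Linear(hidden, action_dim * action_dim)

    def forward(self, s, a):
        h = self.body(s)
        v = self.v_head(h).squeeze(-1)
        mu = self.mu_head(h)
        L = torch.tril(self.l_head(h).view(-1, self.action_dim, self.action_dim))
        P = L @ L.transpose(1, 2)                        # (at least) positive semi-definite
        d = (a - mu).unsqueeze(-1)
        adv = -0.5 * (d.transpose(1, 2) @ P @ d).reshape(-1)
        return v + adv                                   # advantage is 0 at a = mu(s)
```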
  19. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
  20. Formulation (model-free, recap): model only $\pi$; find $\pi^* = \arg\max_\pi E_\pi\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$, with $A_t \sim \pi(a \mid S_t)$, $S_{t+1} \sim P(s' \mid S_t, A_t)$, $r_{t+1} = r(S_t, A_t, S_{t+1})$.
  21. Policy gradient: a more direct approach than Q-learning. $J = E_{\pi_\theta}\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$, $\pi_\theta(a \mid s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$; $\nabla_\theta J$ is what we want.
  22. Policy gradient derivation:
      $\nabla_\theta J = \nabla_\theta E_{\pi_\theta}\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$
      $= \nabla_\theta E_{s_0 \sim \rho,\, s' \sim P}\left[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$ (expand the expectation over trajectories)
      $= E_{s_0 \sim \rho,\, s' \sim P}\left[\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$ (differentiate w.r.t. θ)
      $= E_{s \sim \rho}\left[\prod_{t=0} \pi_\theta(a_t \mid s_t) \frac{\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t)}{\prod_{t=0} \pi_\theta(a_t \mid s_t)} \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$ (multiply π into numerator and denominator)
      $= E_{s \sim \rho}\left[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$ (logarithmic differentiation)
      $= E_{\pi_\theta}\left[\sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=t}^{\infty} \gamma^\tau r_\tau\right]$ (causality)
      Approximated by Monte Carlo sampling.
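The last line is the REINFORCE estimator; a minimal Monte Carlo sketch, assuming `log_probs` (a list of log π_θ(a_t|s_t) tensors) and `rewards` collected from one rollout, and keeping the slide's γ^τ convention.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """-sum_t log pi(a_t|s_t) * sum_{tau>=t} gamma^tau * r_tau (minimize to ascend grad J)."""
    returns, g = [], 0.0
    for t, r in reversed(list(enumerate(rewards))):
        g = (gamma ** t) * r + g                    # causality: only rewards from tau >= t
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    return -(torch.stack(log_probs) * returns).sum()
```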
  23. Intuition
  24. Intuition
  25. Properties of policy gradient • unbiased estimate – stable • on-policy and high-variance estimate – needs a large batch size (or A3C-like asynchronous training) – lower sample efficiency • on-policy vs. off-policy – the current policy can be updated only with the current policy's samples (on-policy) – the current policy can be updated with any policy's samples (off-policy)
  26. High variance • In policy gradient we have to estimate $\sum_{\tau=0}^{\infty} \gamma^\tau R_\tau$ • This estimate has high variance – long time sequences – the environment's state-transition probability • There are several methods to reduce the variance
  27. Actor-critic method: in the policy gradient, $Q^\pi(s_t, a_t)$ evaluates how good $\pi(a_t \mid s_t)$ was; it depends only on $\tau \geq t$ (causality). $\nabla_\theta J \approx E_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\right]$. This reduces variance, but gives a biased estimate.
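A sketch of this actor-critic gradient, reusing the Gaussian policy sketched earlier and an assumed critic `q_net(states, actions)`; the critic is treated as fixed when updating the actor.

```python
import torch

def actor_critic_loss(policy, q_net, states, actions):
    """-sum_t log pi(a_t|s_t) * Q(s_t, a_t), with no gradient flowing into the critic."""
    mu, sigma = policy(states)
    log_probs = torch.distributions.Normal(mu, sigma).log_prob(actions).sum(-1)
    with torch.no_grad():
        q = q_net(states, actions).reshape(-1)
    return -(log_probs * q).mean()
```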
  28. Bias-variance
  29. Bias-variance
  30. Bias-variance
  31. Baseline: $\nabla_\theta J \approx E_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(Q^\pi(s_t, a_t) - b(s_t))\right]$. This stays unbiased because $E_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = b(s_t)\,\nabla_\theta \int \pi_\theta(a \mid s_t)\, da = b(s_t)\,\nabla_\theta 1 = 0$. b = V is a good choice because Q and V are correlated, giving $\nabla_\theta J \approx E_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$.
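The same sketch with the baseline subtracted, using an assumed state-value network `v_net` so that the weight on each log-probability becomes the advantage Q − V.

```python
import torch

def advantage_pg_loss(policy, q_net, v_net, states, actions):
    """-sum_t log pi(a_t|s_t) * (Q(s_t,a_t) - V(s_t)); the baseline leaves the gradient unbiased."""
    mu, sigma = policy(states)
    log_probs = torch.distributions.Normal(mu, sigma).log_prob(actions).sum(-1)
    with torch.no_grad():
        advantage = q_net(states, actions).reshape(-1) - v_net(states).reshape(-1)
    return -(log_probs * advantage).mean()
```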
  32. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
  33. Value gradient: with state(-visitation) distribution $\rho$, $J = E_\rho\left[Q^\pi(s, a)\right]$. DPG: $\nabla_\theta J = E_\rho\left[\nabla_\theta Q^\pi(s, a)\right] = E_\rho\left[\nabla_a Q(s,a)\big|_{a=\mu(s)}\, \nabla_\theta \mu(s)\right]$. SVG: $\nabla_\theta J = E_\rho\left[\nabla_a Q(s,a)\big|_{a=\mu(s)+\varepsilon\sigma(s)}\, \nabla_\theta(\mu(s) + \varepsilon\sigma(s))\right]$.
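A sketch of the DPG-style actor update: when Q is evaluated at a = μ(s), autograd applies the chain rule ∇_a Q(s,a)|_{a=μ(s)} ∇_θ μ(s). The names are assumed, roughly in the style of DDPG.

```python
import torch

def dpg_actor_loss(mu_net, q_net, states):
    """Maximize E[Q(s, mu(s))]; backprop through a = mu(s) gives grad_a Q * grad_theta mu."""
    actions = mu_net(states)                 # deterministic a = mu_theta(s)
    return -q_net(states, actions).mean()

# Typical usage (only the actor's parameters are stepped; the critic stays fixed):
# actor_optimizer.zero_grad()
# dpg_actor_loss(mu_net, q_net, states).backward()
# actor_optimizer.step()
```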
  34. Similarity to GANs: the policy (generator) is updated by the gradient of the Q function (discriminator).
  35. Properties of value gradient • biased estimate – it depends on the function approximation of Q – less stable • off-policy and low-variance estimate – high sample efficiency
  36. Recent approaches • TRPO • A3C • Q-Prop
  37. TRPO (Trust Region Policy Optimization) • Problem with policy gradient (on-policy) methods – a large step size may cause the policy to diverge – once the policy becomes bad, it is updated with bad samples • Choose the step size carefully – an update should not cause a large change – KL constraint
  38. TRPO: maximize the surrogate $L_{\pi_{old}}(\pi) = E_{\pi_{old}}\left[\frac{\pi}{\pi_{old}} A^\pi(s,a)\right]$ (a variant of the policy gradient) subject to the constraint $KL(\pi_{old} \| \pi) < C$, i.e. $\max_\pi\, L_{\pi_{old}}(\pi) - \lambda\, KL(\pi_{old} \| \pi)$ with Lagrange multiplier $\lambda$. Making a linear approximation to $L$ and a quadratic approximation to the KL term gives $\max_\theta\, g^{T}(\theta - \theta_{old}) - \frac{\lambda}{2}(\theta - \theta_{old})^{T} F (\theta - \theta_{old})$, where $F = \frac{\partial^2}{\partial \theta^2} KL$.
  39. TRPO: solving gives $\theta - \theta_{old} = \frac{1}{\lambda} F^{-1} g$; finally, the natural gradient is obtained. In practice $F^{-1} g$ is computed with the conjugate gradient method, followed by a line search.
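A sketch of the conjugate gradient solve used to approximate F^{-1} g without forming F explicitly, assuming a function `fvp(v)` that returns the Fisher-vector product Fv (as TRPO implementations typically provide); `g` is the flattened, detached policy gradient.

```python
import torch

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = torch.zeros_like(g)
    r = g.clone()                     # residual g - F x (x starts at 0)
    p = g.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / p.dot(Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                          # x approximates F^{-1} g, the natural gradient step
```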
  40. A3C • Asynchronous Advantage Actor-Critic • Advantage Actor-Critic (A2C) is a variant of the policy gradient • Asynchronous updates – no need for a large batch – no need for experience replay
  41. A3C
  42. Q-Prop • On-policy + off-policy • Policy gradient + value gradient -> stability and sample efficiency
  43. Two main ideas • First-order Taylor expansion • Control variate
  44. First-order Taylor expansion
  45. The value gradient appears
  46. Can we compute these?
  47. Control variate
  48. Adaptive Q-Prop
  49. More detail about Q-Prop • https://www.slideshare.net/ReijiHatsugai/q-prop
