Model-free Continuous Control
Reinforcement Learning
初谷怜慈
2
Self-introduction
• American football -> 東⼤Warriors
• Second-year master's student, Graduate School of Information Science and Technology, The University of Tokyo
• DeepX, Senior Engineer
• Research -> reinforcement learning
– especially toward real-world environments
• twitter -> @Reiji_Hatsu
• github -> rarilurelo
3
What is reinforcement learning?
Environment Agent
4
What is reinforcement learning?
Environment Agent
state, reward
5
What is reinforcement learning?
Environment Agent
action
6
What is reinforcement learning?
Environment Agent
state, reward
action
Described as MDP or POMDP
7
Formulation
Agent
Environment
$A_t \sim \pi(a \mid S_t)$
$S_{t+1} \sim P(s' \mid S_t, A_t)$
$r_{t+1} = r(S_t, A_t, S_{t+1})$
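In code, this interaction is a rollout loop. A minimal sketch, assuming a hypothetical `Environment` with `reset()`/`step()` and a `policy` with a `sample()` method (real libraries such as Gym differ in the exact signatures):

```python
def rollout(env, policy, horizon=1000):
    """Collect one trajectory by alternating agent and environment."""
    s = env.reset()                      # initial state S_0
    trajectory = []
    for t in range(horizon):
        a = policy.sample(s)             # A_t ~ pi(a | S_t)
        s_next, r, done = env.step(a)    # S_{t+1} ~ P(s' | S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1})
        trajectory.append((s, a, r, s_next))
        s = s_next
        if done:                         # episode termination
            break
    return trajectory
```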
8
Formulation
Agent
Environment
$A_t \sim \pi(a \mid S_t)$
$S_{t+1} \sim P(s' \mid S_t, A_t)$
$r_{t+1} = r(S_t, A_t, S_{t+1})$
Modeling π!
Get $\pi^* = \operatorname*{argmax}_\pi \; \mathbb{E}_\pi\!\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$
Model-free
9
Formulation
Agent
Environment
$A_t \sim \pi(a \mid S_t)$
$S_{t+1} \sim P(s' \mid S_t, A_t)$
$r_{t+1} = r(S_t, A_t, S_{t+1})$
Modeling π and P!
Get $\pi^* = \operatorname*{argmax}_\pi \; \mathbb{E}_\pi\!\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$
Model-based
10
Example
• DQN
– Learn Q function
– ε-greedy <- policy π!
$\pi(a \mid s) = \begin{cases} \operatorname{argmax}_a Q(s, a) & (\varepsilon < u) \\ \text{random action} & (u < \varepsilon) \end{cases} \qquad (u \sim \mathrm{uniform}(0, 1))$
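A minimal sketch of this ε-greedy rule in Python (the function name and the use of NumPy are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """q_values: array of Q(s, a) for each discrete action a."""
    u = np.random.uniform(0.0, 1.0)          # u ~ uniform(0, 1)
    if u < epsilon:                           # explore with probability epsilon
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))           # otherwise act greedily
```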
11
What is Continuous Control?
Atari Invaders
• Stick directions
• Buttons
Robot arm
• Torques
12
What is Continuous Control?
Robot arm
Assume π is Gaussian (a Gaussian policy), with mean μ(s) and standard deviation σ(s).
The action is sampled from this distribution.
μ(s) and σ(s) are represented by a neural network.
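A minimal Gaussian-policy sketch in PyTorch, where one network outputs μ(s) and log σ(s); the architecture and layer sizes are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, act_dim)         # mu(s)
        self.log_sigma_head = nn.Linear(hidden, act_dim)  # log sigma(s), keeps sigma positive

    def forward(self, s):
        h = self.body(s)
        return torch.distributions.Normal(self.mu_head(h), self.log_sigma_head(h).exp())

# Usage: dist = policy(state); a = dist.sample(); log_pi = dist.log_prob(a).sum(-1)
```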
13
Overview of reinforcement learning's complexity
Action space complexity
State space complexity
Discrete Continuous
14
Learning methods of continuous policy
• NAF
• Policy gradient
• Value gradient
15
Definition
$Q^\pi(s, a) = \mathbb{E}_{p,\pi}\!\left[\sum_t \gamma^t r_t \,\middle|\, s, a\right] = \mathbb{E}_p\!\left[r + \gamma\,\mathbb{E}_\pi\!\left[Q^\pi(s', a')\right]\right]$
(start in s, take a, then act according to π)
$V^\pi(s) = \mathbb{E}_{p,\pi}\!\left[\sum_t \gamma^t r_t \,\middle|\, s\right] = \mathbb{E}_{p,\pi}\!\left[r + \gamma V^\pi(s')\right]$
(start in s, then act according to π)
$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
(advantage function: the true influence of a)
The second equality in each of the first two lines is the Bellman equation.
16
Q-learning
Consider the optimal policy $\pi^*$:
$Q^{\pi^*}(s, a) = \mathbb{E}_p\!\left[r + \gamma\,\mathbb{E}_{\pi^*}\!\left[Q^{\pi^*}(s', a')\right]\right]$
$\pi^*(a \mid s) = \begin{cases} 1 & (a = \operatorname{argmax}_a Q^{\pi^*}(s, a)) \\ 0 & (\text{otherwise}) \end{cases}$
$Q^{\pi^*}(s, a) = \mathbb{E}_p\!\left[r + \gamma \max_{a'} Q^{\pi^*}(s', a')\right]$
Minimize (lhs − rhs)² (for function approximation).
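A sketch of this squared Bellman error with function approximation, DQN-style (discrete action indices; the batch layout and the separate target network are common practice, assumed here rather than taken from the slide):

```python
import torch

def q_learning_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                                  # tensors; done is a float 0/1 flag
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)    # lhs: Q(s, a)
    with torch.no_grad():                                          # rhs is treated as a fixed target
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return ((q_sa - target) ** 2).mean()                           # minimize (lhs - rhs)**2
```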
17
NAF (Normalized Advantage Function)
In DQN, max Q is a simple maximum over a finite set of actions.
How do we get max Q with continuous actions?
18
NAF (Normalized Advantage Function)
$Q(s, a) = A(s, a) + V(s)$
$A(s, a) = -\tfrac{1}{2}\,(a - \mu(s))^\top P(s)\,(a - \mu(s))$
P(s) is a positive definite matrix, so the quadratic form is convex and A(s, a) ≤ 0, with its maximum 0 reached at a = μ(s).
We can get max Q as V!
$\max_a Q(s, a) = 0 + V(s)$
Minimize (r + γ max Q − Q)² w.r.t. all parameters.
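A sketch of the NAF decomposition for a batch of states, assuming the network outputs μ(s), a lower-triangular L(s) with positive diagonal (so P = L Lᵀ is positive definite), and V(s):

```python
import torch

def naf_q(mu, L, V, a):
    """mu: (B, d), L: (B, d, d) lower-triangular, V: (B,), a: (B, d)."""
    P = L @ L.transpose(-1, -2)                  # positive definite P(s)
    diff = (a - mu).unsqueeze(-1)                # (B, d, 1)
    A = -0.5 * (diff.transpose(-1, -2) @ P @ diff).reshape(-1)
    return A + V                                 # Q(s, a); max_a Q(s, a) = V(s), reached at a = mu(s)
```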
19
Learning methods of continuous policy
• NAF
• Policy gradient
• Value gradient
20
Formulation
Agent
Environment
$A_t \sim \pi(a \mid S_t)$
$S_{t+1} \sim P(s' \mid S_t, A_t)$
$r_{t+1} = r(S_t, A_t, S_{t+1})$
Modeling π!
Get $\pi^* = \operatorname*{argmax}_\pi \; \mathbb{E}_\pi\!\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$
Model-free
21
Policy gradient
A more direct approach than Q-learning.
$J = \mathbb{E}_{\pi_\theta}\!\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right]$
$\pi_\theta(a \mid s) = \mathcal{N}\!\left(\mu_\theta(s), \sigma_\theta(s)\right)$
$\nabla_\theta J$ is what we want.
22
Policy gradient
$$
\begin{aligned}
\nabla_\theta J &= \nabla_\theta\, \mathbb{E}_{\pi_\theta}\!\left[\sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right] \\
&= \nabla_\theta\, \mathbb{E}_{s_0 \sim \rho,\, s' \sim p}\!\left[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right] && \text{write the expectation over trajectories} \\
&= \mathbb{E}_{s_0 \sim \rho,\, s' \sim p}\!\left[\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right] && \text{differentiate w.r.t. } \theta \\
&= \mathbb{E}_{s \sim \rho}\!\left[\prod_{t=0} \pi_\theta(a_t \mid s_t)\, \frac{\nabla_\theta \prod_{t=0} \pi_\theta(a_t \mid s_t)}{\prod_{t=0} \pi_\theta(a_t \mid s_t)} \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right] && \text{multiply and divide by } \textstyle\prod_t \pi_\theta \\
&= \mathbb{E}_{s \sim \rho}\!\left[\prod_{t=0} \pi_\theta(a_t \mid s_t) \sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=0}^{\infty} \gamma^\tau r_\tau\right] && \text{logarithmic differentiation} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{\tau=t}^{\infty} \gamma^\tau r_\tau\right] && \text{causality}
\end{aligned}
$$
The last line is approximated by Monte Carlo samples.
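A Monte Carlo sketch of the final expression for a single sampled trajectory (the overall γ^t factor in front of the reward-to-go is kept here to match the formula above; many implementations drop it):

```python
import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t | s_t) tensors, rewards: list of r_t floats."""
    returns, g = [], 0.0
    for r in reversed(rewards):                        # reward-to-go: sum_{tau>=t} gamma^(tau-t) r_tau
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    # gamma^t * G_t = sum_{tau>=t} gamma^tau r_tau, as in the last line of the derivation
    weights = torch.tensor([gamma ** t * g for t, g in enumerate(returns)])
    return -(torch.stack(log_probs) * weights).sum()   # minimizing this ascends the policy gradient
```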
23
Intuition
24
Intuition
25
Property of Policy gradient
• Unbiased estimate
– stable
• On-policy and high-variance estimate
– needs a large batch size (or A3C-like asynchronous training)
– lower sample efficiency
• On-policy vs. off-policy
– the current policy can be updated only with samples from the current policy (on-policy)
– the current policy can be updated with samples from any policy (off-policy)
26
High variance
• In policy gradient, we have to estimate $\sum_{\tau=0}^{\infty} \gamma^\tau R_\tau$
• This estimate has high variance
– long time sequences
– stochasticity of the environment's state transitions
• There are several methods to reduce the variance
27
Actor-critic method
The policy gradient evaluates how good π(a_t|s_t) was.
That evaluation depends only on τ ≥ t (causality), so we can replace the return with a critic:
$\nabla_\theta J \approx \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^\pi(s_t, a_t)\right]$
This reduces variance, but gives a biased estimate.
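A sketch of this critic-weighted update for one trajectory, assuming Q(s_t, a_t) estimates come from a learned critic (the baseline on the following slides subtracts V(s_t) from these weights):

```python
import torch

def actor_critic_loss(log_probs, q_values):
    """log_probs, q_values: (T,) tensors for one trajectory."""
    weights = q_values.detach()              # critic output; no policy gradient flows through the critic
    return -(log_probs * weights).sum()      # ascend E[ grad log pi(a_t|s_t) * Q^pi(s_t, a_t) ]
```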
28
Bias-variance
29
Bias-variance
30
Bias-variance
31
Baseline
$\nabla_\theta J \approx \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(Q^\pi(s_t, a_t) - b(s_t)\bigr)\right]$
Subtracting a baseline b(s) does not change the expectation:
$\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = b(s_t)\,\nabla_\theta\!\int\! \pi_\theta(a \mid s_t)\,da = 0$
b = V is a good choice, because Q and V are correlated!
$\nabla_\theta J \approx \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$
32
Learning methods of continuous policy
• NAF
• Policy gradient
• Value gradient
33
Value gradient
Let ρ be the state (visitation) distribution.
$J = \mathbb{E}_\rho\!\left[Q^\pi(s, a)\right]$
$\nabla_\theta J = \mathbb{E}_\rho\!\left[\nabla_\theta Q^\pi(s, a)\right]$
$\qquad = \mathbb{E}_\rho\!\left[\nabla_a Q(s, a)\big|_{a = \mu(s)}\, \nabla_\theta \mu(s)\right]$  (DPG)
$\qquad = \mathbb{E}_\rho\!\left[\nabla_a Q(s, a)\big|_{a = \mu(s) + \varepsilon\sigma(s)}\, \nabla_\theta\bigl(\mu(s) + \varepsilon\sigma(s)\bigr)\right]$  (SVG)
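A sketch of the DPG branch in PyTorch: the chain rule ∇_a Q · ∇_θ μ is obtained simply by backpropagating through Q(s, μ_θ(s)). `actor` and `critic` are assumed nn.Modules, not names from the slides.

```python
import torch

def dpg_actor_loss(actor, critic, states):
    actions = actor(states)             # a = mu_theta(s), deterministic
    q = critic(states, actions)         # Q(s, mu_theta(s)); gradients flow through `actions`
    return -q.mean()                    # minimizing this implements grad_a Q|_{a=mu(s)} * grad_theta mu(s)
```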
34
Similarity to GANs
The policy (generator) is updated by the gradient of the Q function (discriminator).
35
Property of Value gradient
• Biased estimate
– the bias depends on the function approximation of Q
– less stable
• Off-policy and low-variance estimate
– high sample efficiency
36
Recent approaches
• TRPO
• A3C
• Q-Prop
37
TRPO (Trust Region Policy Optimization)
• Problem of policy gradient (on-policy) methods
– a large step size may cause the policy to diverge
– once the policy becomes bad, it keeps being updated with bad samples
• Choose the step size carefully
– an update should not cause a large change
– KL constraint
38
TRPO
$L_{\pi_{old}}(\pi) = \mathbb{E}_{\pi_{old}}\!\left[\frac{\pi}{\pi_{old}}\, A^\pi(s, a)\right]$  (a variant of the policy gradient objective)
constraint: $KL(\pi_{old} \,\|\, \pi) < C$
$\max\; L_{\pi_{old}}(\pi) - \lambda\, KL(\pi_{old} \,\|\, \pi)$  (Lagrange multiplier λ)
Make a linear approximation to L and a quadratic approximation to the KL:
$\max\; g^\top(\theta - \theta_{old}) - \frac{\lambda}{2}\,(\theta - \theta_{old})^\top F\,(\theta - \theta_{old}), \qquad F = \frac{\partial^2}{\partial \theta^2} KL$
39
TRPO
$\theta - \theta_{old} = \frac{1}{\lambda} F^{-1} g$
Finally, the natural gradient is obtained.
In practice it is computed with the conjugate gradient method plus a line search.
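A sketch of the conjugate gradient step used to get F⁻¹g without ever forming F explicitly; `fvp(v)` is an assumed callback returning the Fisher-vector product Fv (e.g. via Hessian-vector products of the KL), and `g` is the policy gradient as a float vector:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()                         # residual of F x = g at x = 0
    p = g.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                             # approximately F^{-1} g, the natural gradient direction
```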
40
A3C
• Asynchronous Advantage Actor-Critic
• Advantage Actor-Critic (A2C) is a variant of policy gradient
• Asynchronous updates
– no need for a large batch
– no need for experience replay
41
A3C
42
Q-Prop
• On-policy + Off-policy
• Policy gradient + value gradient
Stability and sample efficiency
43
Two main ideas
• First-order Taylor expansion
• Control variate
44
First-order Taylor expansion
45
The value gradient appears
46
Can we compute these?
47
Control variate
48
Adaptive Q-Prop
49
More details about Q-Prop
• https://www.slideshare.net/ReijiHatsugai/q-prop
