Stochastic Optimal Control & Reinforcement Learning
Jinwon Choi
Contents
01 Reinforcement Learning
02 Stochastic Optimal Control
03 Stochastic Control to Reinforcement Learning
04 Large Scale Reinforcement Learning
05 Summary
Reinforcement Learning
[Image: Ivan Pavlov]
[Diagram: agent–environment loop — the agent applies an action to the environment; the environment returns a reward and the next state to the agent.]
Markov Decision Process
Markov?
"The future is independent of the past given the present"
$\mathbb{P}(s_{t+1} \mid s_1, \dots, s_t) = \mathbb{P}(s_{t+1} \mid s_t)$
Memoryless process!
Markov "Decision" Process:
$\mathbb{P}(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = \mathbb{P}(s_{t+1} \mid s_t, a_t)$
The future state depends only on the current state and action,
and the policy also depends only on the current state:
$\pi(a_t \mid s_1, a_1, \dots, s_t) = \pi(a_t \mid s_t)$
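A minimal numerical illustration of the Markov property (the 3-state transition matrix below is made up for the example): each next state is sampled from a distribution that depends only on the current state, not on the earlier history.

```python
import numpy as np

# Hypothetical 3-state Markov chain: row s gives the distribution p(s' | s).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

rng = np.random.default_rng(0)

def step(s):
    """Sample s_{t+1} ~ P(. | s_t); the distribution depends only on the current state."""
    return int(rng.choice(3, p=P[s]))

# Roll out a trajectory: at every step the sampling rule ignores s_1, ..., s_{t-1}.
s, trajectory = 0, [0]
for _ in range(10):
    s = step(s)
    trajectory.append(s)
print(trajectory)
```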
Reinforcement Learning
[Diagram: agent–environment loop — action out, reward and state back in.]
• State $s \in S \subset \mathbb{R}^n$
• Action $a \in A \subset \mathbb{R}^m$
• Action sequence $a_0, a_1, \dots$ with $a_i \in A$, $i = 1, 2, \dots$
• Reward $r: S \times A \to \mathbb{R}$
• Discount factor $\gamma \in (0,1)$
• Transition probability $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
• Total reward $R_{tot} = \mathbb{E}_{s\sim p}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
• Policy $\pi: S \to A$
• Total reward w.r.t. $\pi$: $R^{\pi} = \mathbb{E}_{s\sim p,\, a\sim\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
Objective function: $\max_{\pi\in\Pi} R^{\pi}$
Terminology of RL and Optimal Control

RL                              | Optimal Control
State                           | State
Action                          | Control input
Agent                           | Controller
Environment                     | System
Reward of a stage               | Cost of a stage
Reward (or value) function      | Value (or cost) function
Maximizing the value function   | Minimizing the value function
Bellman operator                | DP mapping or operator
Greedy policy w.r.t. $J$        | Minimizing policy w.r.t. $J$
Stochastic Optimal Control
System Dynamics

              | Continuous                              | Discrete
Deterministic | $\dot{x} = f(x, u)$                     | $x_{k+1} = f(x_k, u_k)$
Stochastic    | $dx = f(x,u)\,dt + \sigma(x,u)\,dW$     | $x_{k+1} = f(x_k, u_k, w_k)$ ($w_k$ a random Gaussian noise) or $x_{k+1} \sim p(x_{k+1} \mid x_k, u_k)$
Stochastic Optimal Control
Control Input and Policy

                              | Continuous                  | Discrete
Deterministic (control input) | $u(x)$                      | $\{u_0, u_1, u_2, \dots\}$
Stochastic (policy)           | $u(x) \sim \pi(u \mid x)$   | $u_k \sim \pi(u_k \mid x_k)$
Stochastic Optimal Control
Value Function

Continuous:
• Finite-horizon: $\inf_{u\in U} \mathbb{E}_{x\sim p}\!\left[\int_{0}^{T} r(x(t), u(t))\,dt + q(x(T))\right]$
• Infinite-horizon: $\inf_{u\in U} \mathbb{E}_{x\sim p}\!\left[\int_{0}^{\infty} e^{-\gamma t}\, r(x(t), u(t))\,dt\right]$

Discrete:
• Finite-horizon: $\inf_{u_k\in U} \mathbb{E}_{x\sim p}\!\left[\sum_{k=0}^{N} r(x_k, u_k) + q(x_N)\right]$
• Infinite-horizon: $\inf_{u_k\in U} \mathbb{E}_{x\sim p}\!\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, u_k)\right]$
Stochastic Optimal Control
Dynamic Programming

Continuous ($V(x(t))$):
• Finite-horizon: $V(x(t)) = \inf_{u\in U} \mathbb{E}_{x\sim p}\!\left[\int_{t}^{t+\Delta t} r(x(s), u(s))\,ds + V(x(t+\Delta t))\right]$, with $V(x(T)) = q(x(T))$
• Infinite-horizon: $V(x(t)) = \inf_{u\in U} \mathbb{E}_{x\sim p}\!\left[\int_{t}^{t+\Delta t} e^{-\gamma s}\, r(x(s), u(s))\,ds + V(x(t+\Delta t))\right]$

Discrete ($V(x_k)$):
• Finite-horizon: $V(x_k) = \inf_{u_k\in U}\left[ r(x_k, u_k) + \mathbb{E}_{x\sim p}[V(x_{k+1})] \right]$, with $V(x_T) = q(x_T)$
• Infinite-horizon: $V(x_k) = \inf_{u_k\in U}\left[ r(x_k, u_k) + \mathbb{E}_{x\sim p}[\gamma V(x_{k+1})] \right]$
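For the finite-horizon discrete case, this recursion can be solved exactly by backward induction when the state and input sets are finite. A minimal sketch on a made-up tabular problem (the transition tensor `P`, stage cost `r`, and terminal cost `q` are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
nX, nU, N = 5, 3, 10                       # states, inputs, horizon

# Hypothetical model: P[u, x, x'] = p(x' | x, u), r[x, u] stage cost, q terminal cost.
P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nX, nU))
q = rng.random(nX)

# Backward induction: V_N = q, V_k(x) = min_u [ r(x,u) + E[V_{k+1}(x')] ].
V = q.copy()
policy = np.zeros((N, nX), dtype=int)
for k in reversed(range(N)):
    Q = r + np.stack([P[u] @ V for u in range(nU)], axis=1)   # Q[x, u]
    policy[k] = Q.argmin(axis=1)
    V = Q.min(axis=1)

print("V_0:", V)
```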
Stochastic Optimal Control
Dynamic Programming: HJB equation (continuous) and Bellman equation (discrete)

Continuous ($V(x(t))$), HJB equation:
• Finite-horizon:
$\dfrac{\partial V}{\partial t} + \inf_{u\in U}\left[ r(x(t), u(t)) + \dfrac{\partial V}{\partial x} f(x(t), u(t)) + \dfrac{1}{2}\, \sigma^{T}(x(t), u(t))\, \dfrac{\partial^2 V}{\partial x^2}\, \sigma(x(t), u(t)) \right] = 0$, with $V(x(T)) = q(x(T))$
• Infinite-horizon:
$-\gamma V + \inf_{u\in U}\left[ r(x(t), u(t)) + \dfrac{\partial V}{\partial x} f(x(t), u(t)) + \dfrac{1}{2}\, \sigma^{T}(x(t), u(t))\, \dfrac{\partial^2 V}{\partial x^2}\, \sigma(x(t), u(t)) \right] = 0$

Discrete ($V(x_k)$), Bellman equation:
• Finite-horizon: $V(x_k) = \inf_{u_k\in U}\left[ r(x_k, u_k) + \mathbb{E}_{x\sim p}[V(x_{k+1})] \right]$, with $V(x_T) = q(x_T)$
• Infinite-horizon: $V(x_k) = \inf_{u_k\in U}\left[ r(x_k, u_k) + \mathbb{E}_{x\sim p}[\gamma V(x_{k+1})] \right]$
Stochastic Optimal Control
Dynamic Programming
$\inf_{u_k\in U}\left[ r(x_k, u_k) + \mathbb{E}_{x\sim p}[\gamma V(x_{k+1})] \right]$
How do we solve the infinite-horizon, discrete-time stochastic optimal control problem?
(Note: there is also another approach based on a different dynamic programming equation, the average-reward formulation.)
→ Value Iteration & Policy Iteration
Bellman Operator
Definition. Given a policy $\pi$, the state-value function $V^{\pi}: \mathbb{R}^n \to \mathbb{R}$ is defined by
$V^{\pi}(x_0) := \mathbb{E}_{x\sim p,\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, \pi(x_k), w_k) \,\middle|\, x = x_0 \text{ at } t = 0\right] = r(x_0, \pi(x_0)) + \mathbb{E}_{x\sim p,\pi}\!\left[\sum_{k=1}^{\infty} \gamma^k r(x_k, \pi(x_k), w_k)\right]$
and the state-input value function $Q^{\pi}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is defined by
$Q^{\pi}(x_0, u_0) := \mathbb{E}_{x\sim p,\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, \pi(x_k), w_k) \,\middle|\, x = x_0,\ u = u_0 \text{ at } t = 0\right] = r(x_0, u_0) + \mathbb{E}_{x\sim p,\pi}\!\left[\sum_{k=1}^{\infty} \gamma^k r(x_k, \pi(x_k), w_k)\right]$
The optimal value function is
$V(x_0) = \inf_{u_k\in U} \mathbb{E}_{x\sim p}\!\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, u_k)\right]$
Bellman Operator
Dynamic programming (recursive) forms:
$V(x_k) = \inf_{u_k\in U}\left[ r(x_k, u_k) + \gamma\, \mathbb{E}_{x\sim p}[V(x_{k+1})] \right]$
$V^{\pi}(x_k) = r(x_k, \pi(x_k)) + \gamma\, \mathbb{E}_{x\sim p}[V^{\pi}(x_{k+1})]$
$Q^{\pi}(x_k, u_k) = r(x_k, u_k) + \gamma\, \mathbb{E}_{x\sim p,\pi}[Q^{\pi}(x_{k+1}, \pi(x_{k+1}))]$
Bellman Operator
Let $(\mathbb{B}, \|\cdot\|_\infty, d_\infty)$ be a metric space, where $\mathbb{B} = \{\psi: \Omega \to \mathbb{R} \mid \psi \text{ continuous and bounded}\}$, $\|\psi\|_\infty := \sup_{x\in X} |\psi(x)|$, and $d_\infty(\psi, \psi') = \sup_{x\in X} |\psi(x) - \psi'(x)|$.
Definition. Given a policy $\pi$, the Bellman operator $T^{\pi}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{\pi}\psi)(x_k) = r(x_k, \pi(x_k)) + \gamma\, \mathbb{E}_{x\sim p}[\psi(x_{k+1})]$
and the Bellman optimality operator $T^{*}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{*}\psi)(x_k) = \min_{u_k\in U(x_k)}\left[ r(x_k, u_k) + \gamma\, \mathbb{E}_{x\sim p}[\psi(x_{k+1})] \right]$
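A minimal tabular sketch of the two operators (the MDP arrays `P` and `r` below are hypothetical, and cost is minimized as in the slides); the last line numerically checks the contraction property stated in Proposition 3 below.

```python
import numpy as np

rng = np.random.default_rng(2)
nX, nU, gamma = 4, 2, 0.9

P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)  # p(x'|x,u)
r = rng.random((nX, nU))                                         # stage cost r(x,u)
pi = rng.integers(nU, size=nX)                                   # a fixed deterministic policy

def T_pi(psi):
    """(T^pi psi)(x) = r(x, pi(x)) + gamma * E[psi(x') | x, pi(x)]."""
    return np.array([r[x, pi[x]] + gamma * P[pi[x], x] @ psi for x in range(nX)])

def T_star(psi):
    """(T* psi)(x) = min_u [ r(x,u) + gamma * E[psi(x') | x, u] ]."""
    Q = r + gamma * np.stack([P[u] @ psi for u in range(nU)], axis=1)
    return Q.min(axis=1)

psi, psi2 = rng.random(nX), rng.random(nX)
# Numerical check of the gamma-contraction property in the sup norm:
print(np.abs(T_star(psi) - T_star(psi2)).max() <= gamma * np.abs(psi - psi2).max())
```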
Bellman Operator
Proposition 1 (Monotonicity). The Bellman operators $T^{\pi}$ and $T^{*}$ are monotone, i.e. if $\psi(x) \le \psi'(x)$ for all $x \in X$, then
$(T^{\pi}\psi)(x) \le (T^{\pi}\psi')(x)$ and $(T^{*}\psi)(x) \le (T^{*}\psi')(x)$ for all $x \in X$.
Proposition 2 (Constant shift). For any scalar $r$,
$(T^{\pi}(\psi + re))(x) = (T^{\pi}\psi)(x) + \gamma r$ and $(T^{*}(\psi + re))(x) = (T^{*}\psi)(x) + \gamma r$ for all $x \in X$.
Proposition 3 (Contraction). The Bellman operators $T^{\pi}$ and $T^{*}$ are contractions with modulus $\gamma$ with respect to the sup norm $\|\cdot\|_\infty$, i.e.
$\|T^{\pi}\psi - T^{\pi}\psi'\|_\infty \le \gamma \|\psi - \psi'\|_\infty$ and $\|T^{*}\psi - T^{*}\psi'\|_\infty \le \gamma \|\psi - \psi'\|_\infty$ for all $\psi, \psi' \in \mathbb{B}$.
Bellman Operator
Theorem 2.3 (Contraction Mapping Theorem). Let $(\mathbb{B}, \|\cdot\|_\infty, d_\infty)$ be a metric space and $T: \mathbb{B} \to \mathbb{B}$ a contraction mapping with modulus $\gamma$. Then
1) $T$ has a unique fixed point in $\mathbb{B}$, i.e. there exists a unique $f^{*} \in \mathbb{B}$ such that $Tf^{*} = f^{*}$.
2) For any $f_0 \in \mathbb{B}$, the sequence $\{f_n\}$ defined by $f_{n+1} = T f_n$ satisfies $\lim_{n\to\infty} T^{n} f_0 = f^{*}$.
Value Iteration
Algorithm: Value Iteration
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. Initialize $V(x)$ arbitrarily for all $x \in X$
2. repeat
3.   for all $x \in X$ do
4.     $V_{k+1} \leftarrow T^{*} V_k$
5. until $\|V_{k+1} - V_k\| < \Delta$
6. $\pi^{*}(x) \in \arg\min_{u\in U(x)}\left[ r(x,u) + \gamma \sum_{x'\in X} p(x' \mid x, u)\, V_k(x') \right]$
7. return $\pi^{*}$
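A minimal tabular value-iteration sketch under a hypothetical randomly generated MDP (the arrays `P` and `r` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
nX, nU, gamma, delta = 6, 3, 0.95, 1e-8

P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)  # p(x'|x,u)
r = rng.random((nX, nU))                                         # stage cost

V = np.zeros(nX)
while True:
    Q = r + gamma * np.stack([P[u] @ V for u in range(nU)], axis=1)
    V_next = Q.min(axis=1)                       # V_{k+1} = T* V_k
    if np.abs(V_next - V).max() < delta:
        V = V_next
        break
    V = V_next

pi_star = Q.argmin(axis=1)                       # greedy (minimizing) policy w.r.t. V_k
print("V*:", V, "\npi*:", pi_star)
```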
Policy Iteration
Algorithm: Policy Iteration
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. Initialize $V(x)$ and $\pi(x)$ arbitrarily for all $x \in X$
2. repeat
   1) Policy Evaluation
3.   for all $x \in X$, with $\pi_k$ fixed, do
4.     $V^{\pi_k}_{k+1} \leftarrow T^{\pi_k} V^{\pi_k}_k$
5.   until $\|V^{\pi_k}_{k+1} - V^{\pi_k}_k\| < \Delta$
   2) Policy Improvement
6.   $\pi_{k+1}(x) \in \arg\min_{u\in U(x)}\left[ r(x,u) + \gamma \sum_{x'\in X} p(x' \mid x, u)\, V^{\pi_k}_{k+1}(x') \right]$
7. until $\|V^{\pi_{k+1}}_{k+1} - V^{\pi_k}_k\| < \Delta$
8. return $\pi$
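A matching policy-iteration sketch on the same kind of hypothetical MDP; as a shortcut for small tabular problems, the policy-evaluation step here solves the linear system $(I - \gamma P^{\pi})V = r^{\pi}$ exactly rather than iterating $T^{\pi}$:

```python
import numpy as np

rng = np.random.default_rng(4)
nX, nU, gamma = 6, 3, 0.95

P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nX, nU))

pi = np.zeros(nX, dtype=int)                     # initial policy
while True:
    # 1) Policy evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi
    P_pi = P[pi, np.arange(nX)]                  # P_pi[x, x'] = p(x' | x, pi(x))
    r_pi = r[np.arange(nX), pi]
    V = np.linalg.solve(np.eye(nX) - gamma * P_pi, r_pi)
    # 2) Policy improvement: greedy (minimizing) policy w.r.t. V^pi
    Q = r + gamma * np.stack([P[u] @ V for u in range(nU)], axis=1)
    pi_next = Q.argmin(axis=1)
    if np.array_equal(pi_next, pi):
        break
    pi = pi_next

print("optimal policy:", pi, "\nV^pi:", V)
```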
Stochastic Control to RL
Learning-based approach
Q: If the system dynamics $p$ and the reward function $r$ are unknown, how do we solve the DP equation?
1. Estimate the model ($r$ and $p$) from simulation data, then use the previous methods
→ Model-based approach (model learning)
2. Without system identification, obtain the value function and policy directly from simulation data
→ Model-free approach
$\inf_{\pi\in\Pi} \mathbb{E}_{\pi}\!\left[ r(x_k, \pi(x_k)) + \mathbb{E}_{x\sim p}[\gamma V(x_{k+1})] \right]$
Three families of approximations:
• Approximation in value space: parametric approximation, problem approximation, rollout, MPC
• Approximation in policy space: policy search, policy gradient
• Approximation of the expectation $E[\cdot]$: Monte-Carlo search, certainty equivalence
Actor-critic methods combine approximation in value space (TD, SARSA, Q-learning, function approximation) with approximation in policy space (policy search, policy gradient): DPG, DDPG, TRPO, CPO, PPO, Soft Actor-Critic, …
Approximation in Value Space
DP algorithms sweep over all states at every step.
→ Use Monte-Carlo search instead: $\mathbb{E}[f] \approx \frac{1}{N}\sum_{i=1}^{N} f_i$
But when the number of samples is huge (e.g. $N \sim 14{,}000{,}605$), even this becomes impractical.
Stochastic Approximation
Consider the fixed-point problem $x = L(x)$. It can be solved by the iterative algorithm
$x_{k+1} = L(x_k)$
or
$x_{k+1} = (1-\alpha_k)\, x_k + \alpha_k L(x_k)$.
If $L(x)$ is of the form $\mathbb{E}[f(x, w)]$, where $w$ is a random noise, then $L(x)$ can be approximated by
$L(x) \approx \frac{1}{N}\sum_{i=1}^{N} f(x, w_i)$,
which becomes inefficient when $N$ is large.
Stochastic Approximation
Use a single sample as an estimate of the expectation in each update:
$x_{k+1} = (1-\alpha_k)\, x_k + \alpha_k f(x_k, w_k)$.
This update can be seen as a stochastic approximation of the form
$x_{k+1} = (1-\alpha_k)\, x_k + \alpha_k \big( \mathbb{E}[f(x_k, w_k)] + \varepsilon_k \big) = (1-\alpha_k)\, x_k + \alpha_k \big( L(x_k) + \varepsilon_k \big)$,
where $\varepsilon_k = f(x_k, w_k) - \mathbb{E}[f(x_k, w_k)]$.
Robbins-Monro stochastic approximation guarantees convergence under contraction or monotonicity assumptions on the mapping $L$, together with the step-size conditions
$\sum_{k=0}^{\infty} \alpha_k = \infty$ and $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$.
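A minimal sketch of the single-sample Robbins-Monro update for a made-up contraction $L(x) = \mathbb{E}[f(x,w)]$ with $f(x,w) = 0.5x + 1 + w$ and zero-mean noise, so the fixed point is $x^* = 2$:

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x, w):
    # L(x) = E[f(x, w)] = 0.5*x + 1 is a contraction with fixed point x* = 2.
    return 0.5 * x + 1.0 + w

x = 0.0
for k in range(1, 20001):
    alpha = 1.0 / k                      # satisfies sum alpha = inf, sum alpha^2 < inf
    w = rng.normal(scale=1.0)            # a single noise sample per update
    x = (1 - alpha) * x + alpha * f(x, w)

print(x)   # close to the fixed point 2.0
```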
Policy Iteration
Algorithm: Policy Iteration (Classical Dynamic Programming)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. Initialize $V(x)$ and $\pi(x)$ arbitrarily for all $x \in X$
2. repeat
   1) Policy Evaluation
3.   for all $x \in X$, with $\pi_k$ fixed, do
4.     $V^{\pi_k}_{k+1} \leftarrow T^{\pi_k} V^{\pi_k}_k$
5.   until $\|V^{\pi_k}_{k+1} - V^{\pi_k}_k\| < \Delta$
   2) Policy Improvement
6.   $\pi_{k+1}(x) \in \arg\min_{u\in U(x)}\left[ r(x,u) + \gamma \sum_{x'\in X} p(x' \mid x, u)\, V^{\pi_k}_{k+1}(x') \right]$
7. until $\|V^{\pi_{k+1}}_{k+1} - V^{\pi_k}_k\| < \Delta$
8. return $\pi$
Policy Iteration
Algorithm: Policy Iteration (Temporal Difference)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. Initialize $V(x)$ and $\pi(x)$ arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate $\pi_k$ and store $\mathcal{D} = \{x_i,\ \pi_k(x_i),\ r(x_i, \pi_k(x_i))\}$
   1) Policy Evaluation
4.   for $x_i \in \mathcal{D}$ (instead of all $x \in X$), with $\pi_k$ fixed, do
5.     $V^{\pi_k}_{k+1}(x_i) \leftarrow (1-\alpha_k)\, V^{\pi_k}_k(x_i) + \alpha_k \left[ r(x_i, \pi_k(x_i)) + \gamma V^{\pi_k}_k(x_{i+1}) \right]$
6.   until $\|V^{\pi_k}_{k+1} - V^{\pi_k}_k\| < \Delta$
   2) Policy Improvement
7.   $\pi_{k+1}(x) \in \arg\min_{u\in U(x)}\left[ r(x,u) + \gamma \sum_{x'\in X} p(x' \mid x, u)\, V^{\pi_k}_{k+1}(x') \right]$
8. until $\|V^{\pi_{k+1}}_{k+1} - V^{\pi_k}_k\| < \Delta$
9. return $\pi$
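A minimal TD(0) policy-evaluation sketch for a fixed policy on a hypothetical tabular MDP, matching the update on line 5 above (the step-size schedule is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
nX, nU, gamma = 6, 3, 0.95

P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nX, nU))
pi = rng.integers(nU, size=nX)           # fixed policy to evaluate

V = np.zeros(nX)
x = 0
for i in range(1, 200001):
    u = pi[x]
    x_next = rng.choice(nX, p=P[u, x])   # simulate one transition under pi
    alpha = 1.0 / (1 + i / 1000)         # slowly decreasing step size
    # TD(0) update: V(x) <- (1-alpha) V(x) + alpha [ r(x, pi(x)) + gamma V(x') ]
    V[x] = (1 - alpha) * V[x] + alpha * (r[x, u] + gamma * V[x_next])
    x = x_next

print("TD(0) estimate of V^pi:", V)
```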
Policy Iteration
Algorithm: Policy Iteration (SARSA)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. Initialize $Q(x,u)$ and $\pi(x)$ arbitrarily for all $x \in X$, $u \in U$
2. repeat
3.   given $x_0$, simulate $\pi_k$ and store $\mathcal{D} = \{x_i,\ \pi_k(x_i),\ r(x_i, \pi_k(x_i))\}$
   1) Policy Evaluation
4.   for $x_i \in \mathcal{D}$, with $\pi_k$ fixed, do
5.     $Q^{\pi_k}_{k+1}(x_i, u_i) \leftarrow (1-\alpha_k)\, Q^{\pi_k}_k(x_i, u_i) + \alpha_k \left[ r(x_i, \pi_k(x_i)) + \gamma Q^{\pi_k}_k(x_{i+1}, \pi_k(x_{i+1})) \right]$
6.   until $\|Q^{\pi_k}_{k+1} - Q^{\pi_k}_k\| < \Delta$
   2) Policy Improvement
7.   $\pi_{k+1}(x_i) \in \arg\min_{u\in U(x_i)} Q^{\pi_k}_{k+1}(x_i, u)$
8. until $\|Q^{\pi_{k+1}}_{k+1} - Q^{\pi_k}_k\| < \Delta$
9. return $\pi$
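A minimal SARSA sketch on the same kind of hypothetical MDP; an ε-greedy behavior policy is added for exploration, which is an assumption on top of the slide (the slide simulates $\pi_k$ directly):

```python
import numpy as np

rng = np.random.default_rng(7)
nX, nU, gamma, eps = 6, 3, 0.95, 0.1

P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nX, nU))

Q = np.zeros((nX, nU))

def policy(x):
    # epsilon-greedy w.r.t. the current Q (minimizing, as in the slides)
    return int(rng.integers(nU)) if rng.random() < eps else int(Q[x].argmin())

x, u = 0, policy(0)
for i in range(1, 300001):
    x_next = rng.choice(nX, p=P[u, x])
    u_next = policy(x_next)                       # on-policy: next action from the same policy
    alpha = 1.0 / (1 + i / 1000)
    # SARSA update: Q(x,u) <- (1-alpha) Q(x,u) + alpha [ r(x,u) + gamma Q(x',u') ]
    Q[x, u] = (1 - alpha) * Q[x, u] + alpha * (r[x, u] + gamma * Q[x_next, u_next])
    x, u = x_next, u_next

print("greedy policy from SARSA:", Q.argmin(axis=1))
```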
Value Iteration
Algorithm: Value Iteration (Classical Dynamic Programming)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. Initialize $V(x)$ arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate and store $\mathcal{D} = \{x_i,\ r(x_i, u_i)\}$
4.   for all $x \in X$ do
5.     $V_{k+1}(x) \leftarrow T^{*} V_k$
6. until $\|V_{k+1} - V_k\| < \Delta$
7. $\pi^{*}(x) \in \arg\min_{u\in U(x)}\left[ r(x,u) + \gamma \sum_{x'\in X} p(x' \mid x, u)\, V_k(x') \right]$
8. return $\pi^{*}$
Value Iteration
Algorithm: Value Iteration (Q-learning)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. Initialize $Q(x, u)$ arbitrarily for all $x \in X$, $u \in U$
2. repeat
3.   given $x_0$, simulate and store $\mathcal{D} = \{x_i,\ u_i,\ r(x_i, u_i),\ x_{i+1}\}$
4.   for $(x_i, u_i, r(x_i, u_i), x_{i+1}) \in \mathcal{D}$ (instead of all $x \in X$) do
5.     $Q_{k+1}(x_i, u_i) \leftarrow (1-\alpha_k)\, Q_k(x_i, u_i) + \alpha_k \left[ r(x_i, u_i) + \gamma \min_{u_{i+1}\in U} Q_k(x_{i+1}, u_{i+1}) \right]$
6. until $\|Q_{k+1} - Q_k\| < \Delta$
7. $\pi^{*}(x) \in \arg\min_{u\in U(x)} Q_k(x, u)$
8. return $\pi^{*}$
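A minimal tabular Q-learning sketch on a hypothetical MDP; the behavior policy explores uniformly at random, while the update bootstraps with $\min_{u'} Q(x', u')$ as on line 5:

```python
import numpy as np

rng = np.random.default_rng(8)
nX, nU, gamma = 6, 3, 0.95

P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nX, nU))

Q = np.zeros((nX, nU))
x = 0
for i in range(1, 300001):
    u = int(rng.integers(nU))                     # exploratory (random) behavior policy
    x_next = rng.choice(nX, p=P[u, x])
    alpha = 1.0 / (1 + i / 1000)
    # Q-learning update: Q(x,u) <- (1-alpha) Q(x,u) + alpha [ r(x,u) + gamma min_u' Q(x',u') ]
    Q[x, u] = (1 - alpha) * Q[x, u] + alpha * (r[x, u] + gamma * Q[x_next].min())
    x = x_next

pi_star = Q.argmin(axis=1)                        # greedy (minimizing) policy w.r.t. Q
print("Q-learning policy:", pi_star)
```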
Approximation in Value Space

Policy Evaluation (lookup table):
          | $T^{\pi}$                  | $T^{*}$
$V^{\pi}$ | TD(0), TD($\lambda$)       |
$Q^{\pi}$ | SARSA(0), SARSA($\lambda$) |
$Q$       |                            | Q-learning
Large Scale RL
Large-scale RL: the number of states exceeds $2^{200}$ (for a $10 \times 20$ board)
→ "Function approximation"
Recall the approximation families: approximation in value space (TD, SARSA, Q-learning, function approximation), approximation in policy space (policy search, policy gradient), approximation of the expectation (Monte-Carlo search, certainty equivalence), and actor-critic methods that combine value- and policy-space approximation (DPG, DDPG, TRPO, CPO, PPO, Soft Actor-Critic, …).
Approximate Dynamic Programming
Direct method (gradient methods)
$\min_{\theta} \sum_{i=1}^{N} \big( V^{\pi}(x_i) - \hat{V}(x_i;\theta) \big)^2 \;\approx\; \min_{\theta} \sum_{x_i\in X} \sum_{m=1}^{M} \big( J(x_i, m) - \hat{V}(x_i;\theta) \big)^2$
• $\hat{V}(x;\theta)$: approximate value function (e.g. polynomial approximation, neural network)
• $V^{\pi}(x)$: state-value function
• $J(x_i, m)$: $m$-th sample of the cost at $x_i$, $m = 1, 2, \dots, M$
Gradient update:
$\theta_{k+1} = \theta_k - \eta \sum_{x_i\in X} \sum_{m=1}^{M} \nabla \hat{V}(x_i;\theta)\, \big( \hat{V}(x_i;\theta) - J(x_i, m) \big)$
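A minimal sketch of the direct method with a linear-in-features approximation $\hat V(x;\theta) = \phi(x)^{\top}\theta$, fitted to Monte-Carlo cost samples $J(x_i, m)$ by gradient descent on the squared error; the feature map and the sample generator are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(9)
nX, d, M, eta = 20, 3, 30, 0.05

phi = rng.random((nX, d))                  # hypothetical feature map, phi[x] in R^d
theta_true = np.array([1.0, -2.0, 0.5])
V_pi = phi @ theta_true                    # "true" V^pi, used only to generate samples

# J[x, m]: m-th noisy Monte-Carlo sample of the cost at state x (hypothetical data).
J = V_pi[:, None] + rng.normal(scale=0.3, size=(nX, M))

theta = np.zeros(d)
for _ in range(5000):
    V_hat = phi @ theta                                   # \hat V(x; theta)
    # gradient of sum_{x,m} ( J(x,m) - \hat V(x;theta) )^2 w.r.t. theta
    grad = -2.0 * phi.T @ (J - V_hat[:, None]).sum(axis=1)
    theta = theta - eta * grad / (nX * M)

print("fitted theta:", theta, " (target:", theta_true, ")")
```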
Approximate Dynamic Programming
Indirect method (projected equation)
Solve the projected Bellman equation: $\Phi\theta = \Pi T(\Phi\theta)$
[Diagram: on the subspace $S = \{\Phi\theta \mid \theta \in \mathbb{R}^s\}$, the direct method computes the projection $\Pi J$ of $J$ onto $S$, while the indirect method finds the fixed point of the projected Bellman operator, $\Phi\theta = \Pi T(\Phi\theta)$.]
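A minimal sketch of the indirect route for policy evaluation: LSTD(0) builds the projected-equation statistics from sampled transitions and solves for $\theta$ in closed form; the MDP, features, and fixed policy below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(10)
nX, nU, d, gamma = 20, 3, 4, 0.95

P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nX, nU))
pi = rng.integers(nU, size=nX)            # fixed policy to evaluate
Phi = rng.random((nX, d))                 # hypothetical feature matrix (rows phi(x)^T)

# Collect samples (x, r, x') under pi and accumulate the LSTD(0) statistics:
#   A = sum_i phi(x_i) (phi(x_i) - gamma phi(x_{i+1}))^T,   b = sum_i phi(x_i) r_i
A = np.zeros((d, d))
b = np.zeros(d)
x = 0
for _ in range(50000):
    u = pi[x]
    x_next = rng.choice(nX, p=P[u, x])
    A += np.outer(Phi[x], Phi[x] - gamma * Phi[x_next])
    b += Phi[x] * r[x, u]
    x = x_next

theta = np.linalg.solve(A, b)             # Phi @ theta approximately solves Phi theta = Pi T^pi(Phi theta)
print("LSTD value estimates:", Phi @ theta)
```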
Policy Evaluation (lookup table):
          | $T^{\pi}$                  | $T^{*}$
$V^{\pi}$ | TD(0), TD($\lambda$)       |
$Q^{\pi}$ | SARSA(0), SARSA($\lambda$) |
$Q$       |                            | Q-learning

Function Approximation:
Direct (gradient methods):
          | $T^{\pi}$ | $T^{*}$
$V^{\pi}$ | TD        |
$Q^{\pi}$ | SARSA     |
$Q$       |           | DQN

Indirect (projected DP):
          | $T^{\pi}$ | $T^{*}$
$V^{\pi}$ | TD, LSTD  |
$Q^{\pi}$ | TD, LSTD  |
$Q$       |           | LSPE
Summary
RL is a toolbox for solving the infinite-horizon, discrete-time DP problem
$\inf_{\pi\in\Pi} \mathbb{E}_{\pi}\!\left[ r(x_k, \pi(x_k)) + \mathbb{E}_{x\sim p}[\gamma V(x_{k+1})] \right]$
• Approximation in value space: parametric approximation, problem approximation, rollout, MPC; lookup-table methods TD(0)/TD($\lambda$) for $V^{\pi}$, SARSA(0)/SARSA($\lambda$) for $Q^{\pi}$, and Q-learning for $Q$
• Approximation in policy space: policy search, policy gradient
• Approximation of the expectation $E[\cdot]$: Monte-Carlo search, certainty equivalence
• Actor-critic methods combine value- and policy-space approximation: DPG, DDPG, TRPO, CPO, PPO, Soft Actor-Critic, …
• With function approximation, policy evaluation is done either directly (gradient methods: TD, SARSA, DQN) or indirectly (projected DP: TD, LSTD, LSPE)
Q&A
Thank you