Stochastic Optimal Control
&
Reinforcement Learning
Jinwon Choi
Contents
01 Reinforcement Learning
02 Stochastic Optimal Control
03 Stochastic Control to Reinforcement Learning
04 Large Scale Reinforcement Learning
05 Summary
Reinforcement Learning
Reinforcement Learning
Ivan Pavlov
Reinforcement Learning
[Diagram: agent-environment interaction loop. The Agent sends an Action to the Environment; the Environment returns a Reward and the next State to the Agent.]
Markov Decision Process
Markov?
"The future is independent of the past given the present"
$\mathbb{P}(s_{t+1} \mid s_1, \dots, s_t) = \mathbb{P}(s_{t+1} \mid s_t)$
Memoryless process!
Markov "Decision" Process:
$\mathbb{P}(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = \mathbb{P}(s_{t+1} \mid s_t, a_t)$
The future state depends only on the current state and action,
and the policy also depends only on the current state:
$\pi(a_t \mid s_1, a_1, \dots, s_t) = \pi(a_t \mid s_t)$
Reinforcement Learning
[Diagram: agent-environment loop with Action, Reward, State.]
• State $s \in S \subset \mathbb{R}^n$
• Action $a \in A \subset \mathbb{R}^m$
• Action sequence $a_0, a_1, \dots$ with $a_i \in A$, $i = 1, 2, \dots$
• Reward $r: S \times A \to \mathbb{R}$
• Discount factor $\gamma \in (0,1)$
• Transition probability $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
• Total reward $R_{tot} = E_{s \sim p}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
• Policy $\pi: S \to A$
• Total reward w.r.t. $\pi$: $R^\pi = E_{s \sim p,\, a \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
Objective function: $\max_{\pi \in \Pi} R^\pi$
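To make the objective concrete, here is a minimal sketch (my own illustration, not part of the original slides) that estimates the discounted total reward $R^\pi$ of a fixed policy by Monte-Carlo rollout on a toy MDP; the transition matrix, reward table, and policy below are illustrative assumptions.

```python
import numpy as np

# Toy MDP (illustrative assumption): 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] = p(s' | s, a)
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a] = r(s, a)
              [0.5, 2.0]])
gamma = 0.95
rng = np.random.default_rng(0)

def policy(s):
    """A fixed deterministic policy pi: S -> A (assumed for illustration)."""
    return 0 if s == 0 else 1

def rollout_return(s0, horizon=500):
    """One sampled discounted return sum_t gamma^t r(s_t, a_t), truncated at `horizon`."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * R[s, a]
        s = rng.choice(2, p=P[s, a])
        discount *= gamma
    return total

# Monte-Carlo estimate of R^pi starting from state 0.
print(np.mean([rollout_return(0) for _ in range(200)]))
```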
Terminology of RL and Optimal Control
RL                             | Optimal Control
State                          | State
Action                         | Control input
Agent                          | Controller
Environment                    | System
Reward of a stage              | Cost of a stage
Reward (or value) function     | Value (or cost) function
Maximizing the value function  | Minimizing the value function
Bellman operator               | DP mapping or operator
Greedy policy w.r.t. $J$       | Minimizing policy w.r.t. $J$
Stochastic Optimal Control
Stochastic Optimal Control
System Dynamics (Continuous vs. Discrete, Deterministic vs. Stochastic):
• Continuous, deterministic: $\dot{x} = f(x, u)$
• Continuous, stochastic: $dx = f(x, u)\,dt + \sigma(x, u)\,dW$
• Discrete, deterministic: $x_{k+1} = f(x_k, u_k)$
• Discrete, stochastic: $x_{k+1} = f(x_k, u_k, w_k)$ ($w_k$ is random Gaussian noise), or $x_{k+1} \sim p(x_{k+1} \mid x_k, u_k)$
Stochastic Optimal Control
Control Input and Policy (Continuous vs. Discrete):
• Continuous, deterministic (control input): $u(x)$
• Continuous, stochastic (policy): $u(x) \sim \pi(u \mid x)$
• Discrete, deterministic (control input): $\{u_0, u_1, u_2, \dots\}$
• Discrete, stochastic (policy): $u_k \sim \pi(u_k \mid x_k)$
Stochastic Optimal Control
Value function (Continuous vs. Discrete, Finite vs. Infinite horizon):
• Continuous, finite-horizon: $\inf_{u \in U} E_{x \sim p}\!\left[\int_{0}^{T} r(x(t), u(t))\,dt + q(x(T))\right]$
• Continuous, infinite-horizon: $\inf_{u \in U} E_{x \sim p}\!\left[\int_{0}^{\infty} e^{-\gamma t} r(x(t), u(t))\,dt\right]$
• Discrete, finite-horizon: $\inf_{u_k \in U} E_{x \sim p}\!\left[\sum_{k=0}^{N} r(x_k, u_k) + q(x_N)\right]$
• Discrete, infinite-horizon: $\inf_{u_k \in U} E_{x \sim p}\!\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, u_k)\right]$
Stochastic Optimal Control
Dynamic Programming (Continuous $V(x(t))$ vs. Discrete $V(x_k)$):
• Continuous, finite-horizon: $V(x(t)) = \inf_{u \in U} E_{x \sim p}\!\left[\int_{t}^{t+\Delta t} r(x(s), u(s))\,ds + V(x(t+\Delta t))\right]$, with $V(x(T)) = q(x(T))$
• Continuous, infinite-horizon: $V(x(t)) = \inf_{u \in U} E_{x \sim p}\!\left[\int_{t}^{t+\Delta t} e^{-\gamma s} r(x(s), u(s))\,ds + V(x(t+\Delta t))\right]$
• Discrete, finite-horizon: $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})]\right]$, with $V(x_T) = q(x_T)$
• Discrete, infinite-horizon: $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})]\right]$
Stochastic Optimal Control
Dynamic Programming: HJB equation (continuous) and Bellman equation (discrete):
• Continuous, finite-horizon (HJB): $\frac{\partial V}{\partial t} + \inf_{u \in U}\left[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\sigma^{T}(x(t), u(t))\,\frac{\partial^2 V}{\partial x^2}\,\sigma(x(t), u(t))\right] = 0$, with $V(x(T)) = q(x(T))$
• Continuous, infinite-horizon (HJB): $-\gamma V + \inf_{u \in U}\left[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\sigma^{T}(x(t), u(t))\,\frac{\partial^2 V}{\partial x^2}\,\sigma(x(t), u(t))\right] = 0$
• Discrete, finite-horizon (Bellman): $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})]\right]$, with $V(x_T) = q(x_T)$
• Discrete, infinite-horizon (Bellman): $V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})]\right]$
Stochastic Optimal Control
Dynamic Programming: $\inf_{u_k \in U}\left[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})]\right]$
How do we solve the infinite-horizon, discrete-time stochastic optimal control problem?
Note) There is another approach that uses a different dynamic programming equation, the average-reward formulation.
→ Value Iteration & Policy Iteration
Bellman Operator
Definition. Given a policy $\pi$, the state-value function $V^\pi: \mathbb{R}^n \to \mathbb{R}$ is defined by
$V^\pi(x_0) := E_{x \sim p, \pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \,\middle|\, x = x_0 \text{ at } t = 0\right]$
$\qquad\quad\;\; = r(x_0, \pi(x_0)) + E_{x \sim p, \pi}\!\left[\sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k)\right]$
and the state-input value function $Q^\pi: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is defined by
$Q^\pi(x_0, u_0) := E_{x \sim p, \pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \,\middle|\, x = x_0,\, u = u_0 \text{ at } t = 0\right]$
$\qquad\qquad\quad\; = r(x_0, u_0) + E_{x \sim p, \pi}\!\left[\sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k)\right]$
The optimal value function is
$V(x_0) = \inf_{u_k \in U} E_{x \sim p}\!\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, u_k)\right]$
Bellman Operator
Dynamic Programming:
$V(x_k) = \inf_{u_k \in U}\left[ r(x_k, u_k) + \gamma E_{x \sim p}[V(x_{k+1})]\right]$
$V^\pi(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[V^\pi(x_{k+1})]$
$Q^\pi(x_k, u_k) = r(x_k, u_k) + \gamma E_{x \sim p, \pi}[Q^\pi(x_{k+1}, \pi(x_{k+1}))]$
Bellman Operator
Let $(\mathbb{B}, \|\cdot\|_\infty, d_\infty)$ be a metric space where $\mathbb{B} = \{\psi: \Omega \to \mathbb{R} \mid \psi \text{ continuous and bounded}\}$, $\|\psi\|_\infty := \sup_{x \in X} |\psi(x)|$, and $d_\infty(\psi, \psi') = \sup_{x \in X} |\psi(x) - \psi'(x)|$.
Definition. Given a policy $\pi$, the Bellman operator $T^\pi: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^\pi \psi)(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[\psi(x_{k+1})]$
and the Bellman optimal operator $T^*: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^* \psi)(x_k) = \min_{u_k \in U(x_k)}\left[ r(x_k, u_k) + \gamma E_{x \sim p}[\psi(x_{k+1})]\right]$
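As a concrete sketch (my own, not the author's code), both operators can be written down for a finite MDP with a tabular $\psi$; the transition and cost arrays are the same illustrative assumptions used earlier.

```python
import numpy as np

def bellman_policy_op(psi, P, R, pi, gamma):
    """(T^pi psi)(x) = r(x, pi(x)) + gamma * E_{x'~p(.|x,pi(x))}[psi(x')]."""
    idx = np.arange(len(psi))
    return R[idx, pi] + gamma * P[idx, pi] @ psi

def bellman_optimal_op(psi, P, R, gamma):
    """(T* psi)(x) = min_u [ r(x, u) + gamma * E_{x'~p(.|x,u)}[psi(x')] ]."""
    Q = R + gamma * P @ psi        # Q[x, u] = one-step cost plus expected cost-to-go
    return Q.min(axis=1)

# Illustrative 2-state, 2-action MDP (assumption).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # P[x, u, x']
R = np.array([[1.0, 0.0], [0.5, 2.0]])      # stage cost r(x, u)
psi = np.zeros(2)
pi = np.array([0, 1])
print(bellman_policy_op(psi, P, R, pi, gamma=0.9))
print(bellman_optimal_op(psi, P, R, gamma=0.9))
```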
Bellman Operator
Proposition 1. (Monotonicity) The Bellman operators $T^\pi$, $T^*$ are monotone, i.e. if $\psi(x) \le \psi'(x)$ $\forall x \in X$, then
$(T^\pi \psi)(x) \le (T^\pi \psi')(x)$ $\forall x \in X$ and $(T^* \psi)(x) \le (T^* \psi')(x)$ $\forall x \in X$.
Proposition 2. (Constant shift property) For any scalar $r$,
$(T^\pi(\psi + re))(x) = (T^\pi \psi)(x) + \gamma r$ $\forall x \in X$ and $(T^*(\psi + re))(x) = (T^* \psi)(x) + \gamma r$ $\forall x \in X$.
Proposition 3. (Contraction) The Bellman operators $T^\pi$, $T^*$ are contractions with modulus $\gamma$ with respect to the sup norm $\|\cdot\|_\infty$, i.e.
$\|T^\pi \psi - T^\pi \psi'\|_\infty \le \gamma \|\psi - \psi'\|_\infty$ and $\|T^* \psi - T^* \psi'\|_\infty \le \gamma \|\psi - \psi'\|_\infty$ for all $\psi, \psi' \in \mathbb{B}$.
Bellman Operator
Theorem 2.3. (Contraction Mapping Theorem) Let $(\mathbb{B}, \|\cdot\|_\infty, d_\infty)$ be a metric space and $T: \mathbb{B} \to \mathbb{B}$ a contraction mapping with modulus $\gamma$. Then,
1) $T$ has a unique fixed point in $\mathbb{B}$, i.e. there exists a unique $f^* \in \mathbb{B}$ s.t. $Tf^* = f^*$.
2) Consider the sequence $\{f_n\}$ with $f_{n+1} = Tf_n$ for any $f_0 \in \mathbb{B}$. Then $\lim_{n \to \infty} T^n f_0 = f^*$.
Value Iteration
Algorithm: Value Iteration
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   for all $x \in X$ do
4.     $V_{k+1} \leftarrow T^* V_k$
5. until $\|V_{k+1} - V_k\| < \Delta$
6. $\pi^*(x) \in \arg\min_{u \in U(x)}\left[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u) V_k(x')\right]$
7. return $\pi^*$
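A minimal tabular version of this loop (a sketch under the same toy-MDP assumptions as before; the min reflects the slides' cost-minimization convention):

```python
import numpy as np

def value_iteration(P, R, gamma, delta=1e-8):
    """Repeat V <- T* V until the sup-norm change drops below delta, then act greedily."""
    V = np.zeros(R.shape[0])                     # arbitrary initialization
    while True:
        Q = R + gamma * P @ V                    # Q[x, u] = r(x, u) + gamma * E[V(x')]
        V_next = Q.min(axis=1)                   # V_{k+1} = T* V_k
        if np.max(np.abs(V_next - V)) < delta:
            return V_next, Q.argmin(axis=1)      # greedy (minimizing) policy w.r.t. V
        V = V_next

# Illustrative 2-state, 2-action MDP (assumption).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
V, pi = value_iteration(P, R, gamma=0.9)
print(V, pi)
```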
Policy Iteration
Algorithm: Policy Iteration
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
   1) Policy Evaluation
3.   for $x \in X$ with fixed $\pi_k$ do
4.     $V_{k+1}^{\pi_k} \leftarrow T^\pi V_k^{\pi_k}$
5.   until $\|V_{k+1}^{\pi_k} - V_k^{\pi_k}\| < \Delta$
   2) Policy Improvement
6.   $\pi_{k+1}(x) \in \arg\min_{u \in U(x)}\left[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u) V_{k+1}^{\pi_k}(x')\right]$
7. until $\|V_{k+1}^{\pi_{k+1}} - V_k^{\pi_k}\| < \Delta$
8. return $\pi$
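A matching tabular sketch (my own illustration): here the policy-evaluation step solves the linear system $V^\pi = r^\pi + \gamma P^\pi V^\pi$ exactly rather than iterating $T^\pi$, which reaches the same fixed point.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation with greedy (minimizing) policy improvement."""
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # 1) Policy evaluation: V^pi = (I - gamma * P_pi)^{-1} r_pi
        P_pi = P[np.arange(n_states), pi]           # P_pi[x, x'] = p(x' | x, pi(x))
        r_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2) Policy improvement: greedy w.r.t. the evaluated V
        pi_new = (R + gamma * P @ V).argmin(axis=1)
        if np.array_equal(pi_new, pi):              # stop once the policy is stable
            return V, pi
        pi = pi_new

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
print(policy_iteration(P, R, gamma=0.9))
```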
Stochastic Control to RL
Q: If the system dynamics $p$ and the reward function $r$ are unknown, how do we solve the DP equation?
Learning-based approach:
1. Estimate the model ($r$ and $p$) from simulation data and use the previous methods
   → Model-based approach (model learning)
2. Without system identification, obtain the value function and policy directly from simulation data
   → Model-free approach
$\inf_{\pi \in \Pi} E_\pi\!\left[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})]\right]$
• Approximation in value space: parametric approximation, problem approximation, rollout, MPC
• Approximation in policy space: policy search, policy gradient
• Approximate $E[\cdot]$: Monte-Carlo search, certainty equivalence
Policy gradient
Approximation
in Value space
Approximation
in Policy space
Actor-Critic
• TD, SARSA, Q-learning
• Function approximation
• Policy search
• Policy gradient
DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
Approximate
Expectation
• Monte-Carlo search
• Certainty equivalence
Approximation in Value Space
DP algorithms sweep over "all states" at each step.
→ Use Monte-Carlo search: $E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f_i$
With $N \sim 14{,}000{,}605$ samples, this is still impractical.
Stochastic Approximation
Consider the fixed-point problem
$x = L(x)$.
This problem can be solved by the iterative algorithm
$x_{k+1} = L(x_k)$
or
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k L(x_k)$.
If $L(x)$ is of the form $E[f(x, w)]$, where $w$ is a random noise, then $L(x)$ can be approximated by
$L(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x, w_i)$,
which becomes inefficient when $N$ is large.
Stochastic Approximation
Use a single sample as an estimate of the expectation in each update:
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k f(x_k, w_k)$.
This update can be seen as a stochastic approximation of the form
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k \left( E[f(x_k, w_k)] + \varepsilon_k \right) = (1 - \alpha_k) x_k + \alpha_k \left( L(x_k) + \varepsilon_k \right)$,
where $\varepsilon_k = f(x_k, w_k) - E[f(x_k, w_k)]$.
Robbins-Monro stochastic approximation guarantees convergence under contraction or monotonicity assumptions on the mapping $L$, together with the step-size conditions
$\sum_{k=0}^{\infty} \alpha_k = +\infty$ and $\sum_{k=0}^{\infty} \alpha_k^2 < +\infty$.
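A tiny numerical sketch of the idea (illustrative; the map $L(x) = E[f(x, w)] = 0.5x + 1$ below is an assumed example, not from the slides): the single-sample iterate converges to the fixed point when the step sizes satisfy the Robbins-Monro conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    """Noisy observation of the contraction L(x) = 0.5 * x + 1 (fixed point x* = 2)."""
    return 0.5 * x + 1.0 + w

x = 0.0
for k in range(1, 20001):
    alpha = 1.0 / k                    # sum(alpha_k) = inf, sum(alpha_k^2) < inf
    w = rng.normal()                   # zero-mean noise, so E[f(x, w)] = L(x)
    x = (1 - alpha) * x + alpha * f(x, w)

print(f"stochastic approximation ends at x ~ {x:.3f} (fixed point 2.0)")
```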
Policy Iteration
Algorithm: Policy Iteration (Classical Dynamic Programming)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x), \pi(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
   1) Policy Evaluation
3.   for $x \in X$ with fixed $\pi_k$ do
4.     $V_{k+1}^{\pi_k} \leftarrow T^\pi V_k^{\pi_k}$
5.   until $\|V_{k+1}^{\pi_k} - V_k^{\pi_k}\| < \Delta$
   2) Policy Improvement
6.   $\pi_{k+1}(x) \in \arg\min_{u \in U(x)}\left[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u) V_{k+1}^{\pi_k}(x')\right]$
7. until $\|V_{k+1}^{\pi_{k+1}} - V_k^{\pi_k}\| < \Delta$
8. return $\pi$
Policy Iteration
Algorithm: Policy Iteration (Temporal Difference)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x), \pi(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate $\pi_k$ and store $\mathcal{D} = \{x_i, \pi_k(x_i), r(x_i, \pi_k(x_i))\}$
   1) Policy Evaluation (the sweep over all $x \in X$ and the exact backup $V_{k+1}^{\pi_k} \leftarrow T^\pi V_k^{\pi_k}$ are replaced by sampled TD updates)
4.   for $x_i \in \mathcal{D}$ with fixed $\pi_k$ do
5.     $V_{k+1}^{\pi_k}(x_i) \leftarrow (1 - \alpha_k) V_k^{\pi_k}(x_i) + \alpha_k\left[ r(x_i, \pi_k(x_i)) + \gamma V_k^{\pi_k}(x_{i+1})\right]$
6.   until $\|V_{k+1}^{\pi_k} - V_k^{\pi_k}\| < \Delta$
   2) Policy Improvement
7.   $\pi_{k+1}(x) \in \arg\min_{u \in U(x)}\left[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u) V_{k+1}^{\pi_k}(x')\right]$
8. until $\|V_{k+1}^{\pi_{k+1}} - V_k^{\pi_k}\| < \Delta$
9. return $\pi$
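The evaluation step above is plain TD(0). A self-contained sketch (my own illustration; the simulator and policy are assumptions) that estimates $V^\pi$ from sampled transitions only:

```python
import numpy as np

def td0_evaluate(P, R, pi, gamma, episodes=2000, steps=50, seed=0):
    """TD(0): V(x_i) <- (1 - a) V(x_i) + a * [ r(x_i, pi(x_i)) + gamma * V(x_{i+1}) ],
    one sampled transition per update instead of a full sweep over X."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    V = np.zeros(n_states)
    visits = np.zeros(n_states)
    for _ in range(episodes):
        x = rng.integers(n_states)                  # arbitrary start state
        for _ in range(steps):
            u = pi[x]
            x_next = rng.choice(n_states, p=P[x, u])
            visits[x] += 1
            alpha = 1.0 / visits[x]                 # Robbins-Monro step size
            V[x] = (1 - alpha) * V[x] + alpha * (R[x, u] + gamma * V[x_next])
            x = x_next
    return V

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
print(td0_evaluate(P, R, pi=np.array([0, 1]), gamma=0.9))
```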
Policy Iteration
Algorithm: Policy Iteration (SARSA)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x), \pi(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate $\pi_k$ and store $\mathcal{D} = \{x_i, \pi_k(x_i), r(x_i, \pi_k(x_i))\}$
   1) Policy Evaluation (the exact backup is replaced by sampled updates on $Q^{\pi_k}$)
4.   for $x_i \in \mathcal{D}$ with fixed $\pi_k$ do
5.     $Q_{k+1}^{\pi_k}(x_i, u_i) \leftarrow (1 - \alpha_k) Q_k^{\pi_k}(x_i, u_i) + \alpha_k\left[ r(x_i, \pi_k(x_i)) + \gamma Q_k^{\pi_k}(x_{i+1}, \pi_k(x_{i+1}))\right]$
6.   until $\|V_{k+1}^{\pi_k} - V_k^{\pi_k}\| < \Delta$
   2) Policy Improvement
7.   $\pi_{k+1}(x) \in \arg\min_{u \in U(x)}\left[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u) V_{k+1}^{\pi_k}(x')\right]$
8. until $\|V_{k+1}^{\pi_{k+1}} - V_k^{\pi_k}\| < \Delta$
9. return $\pi$
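A standalone SARSA sketch on the same assumed toy MDP (mine, not the author's): the on-policy update bootstraps from the action the behaviour policy actually takes at the next state; the epsilon-greedy exploration is an added assumption.

```python
import numpy as np

def sarsa(P, R, gamma, episodes=2000, steps=50, alpha=0.1, eps=0.1, seed=0):
    """SARSA: Q(x, u) <- (1 - a) Q(x, u) + a * [ r(x, u) + gamma * Q(x', u') ],
    where u' is the action actually chosen at x' (on-policy)."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))

    def act(x):
        # epsilon-greedy, cost-minimizing behaviour policy (assumption)
        return rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmin())

    for _ in range(episodes):
        x = rng.integers(n_states)
        u = act(x)
        for _ in range(steps):
            x_next = rng.choice(n_states, p=P[x, u])
            u_next = act(x_next)
            Q[x, u] = (1 - alpha) * Q[x, u] + alpha * (R[x, u] + gamma * Q[x_next, u_next])
            x, u = x_next, u_next
    return Q

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
print(sarsa(P, R, gamma=0.9))
```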
Value Iteration
Algorithm: Value Iteration (Classical Dynamic Programming)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate and store $\mathcal{D} = \{x_i, r(x_i, u_i)\}$
4.   for all $x \in X$ do
5.     $V_{k+1}(x) \leftarrow T^* V_k$
6. until $\|V_{k+1} - V_k\| < \Delta$
7. $\pi^*(x) \in \arg\min_{u \in U(x)}\left[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u) V_k(x')\right]$
8. return $\pi^*$
Value Iteration
Algorithm: Value Iteration (Q-learning)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $Q(x, u) \leftarrow$ initialize arbitrarily for all $x \in X$, $u \in U$
2. repeat
3.   given $x_0$, simulate and store $\mathcal{D} = \{x_i, u_i, r(x_i, u_i), x_{i+1}\}$
4.   for $(x_i, u_i, r(x_i, u_i), x_{i+1}) \in \mathcal{D}$ do (replacing the sweep over all $x \in X$)
5.     $Q_{k+1}(x_i, u_i) \leftarrow (1 - \alpha_k) Q_k(x_i, u_i) + \alpha_k\left[ r(x_i, u_i) + \gamma \min_{u_{i+1} \in U} Q_k(x_{i+1}, u_{i+1})\right]$ (replacing $Q_{k+1} \leftarrow T^* Q_k$)
6. until $\|Q_{k+1} - Q_k\| < \Delta$
7. $\pi^*(x) \in \arg\min_{u \in U(x)} Q_{k+1}(x, u)$
8. return $\pi^*$
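A standalone Q-learning sketch (my own illustration on the assumed toy MDP): the backup bootstraps from the minimizing action at the next state regardless of which action the behaviour policy actually took.

```python
import numpy as np

def q_learning(P, R, gamma, episodes=2000, steps=50, alpha=0.1, eps=0.2, seed=0):
    """Q-learning: Q(x, u) <- (1 - a) Q(x, u) + a * [ r(x, u) + gamma * min_u' Q(x', u') ]."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = rng.integers(n_states)
        for _ in range(steps):
            # epsilon-greedy behaviour policy (assumption) so every (x, u) keeps being visited
            u = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmin())
            x_next = rng.choice(n_states, p=P[x, u])
            Q[x, u] = (1 - alpha) * Q[x, u] + alpha * (R[x, u] + gamma * Q[x_next].min())
            x = x_next
    return Q

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
Q = q_learning(P, R, gamma=0.9)
print(Q, Q.argmin(axis=1))    # the learned Q and its greedy (minimizing) policy
```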
Approximation in Value Space
Policy Evaluation:
• $V^\pi$ with $T^\pi$: TD(0), TD($\lambda$)
• $Q^\pi$ with $T^\pi$: SARSA(0), SARSA($\lambda$)
• $Q$ with $T^*$: Q-learning
Large Scale RL
Large-scale RL
Number of states > $2^{200}$ (for a $10 \times 20$ board)
→ "Function approximation"
• Approximation in value space: TD, SARSA, Q-learning; function approximation
• Approximation in policy space: policy search; policy gradient
• Actor-Critic (combining both): DPG, DDPG, TRPO, CPO, PPO, Soft Actor-Critic, …
• Approximate expectation: Monte-Carlo search; certainty equivalence
Approximate Dynamic Programming
Direct method (Gradient methods)
$\min_\theta \sum_{i=1}^{N} \left( V^\pi(x_i) - \hat{V}(x_i; \theta) \right)^2 \;\approx\; \min_\theta \sum_{x_i \in X} \sum_{m=1}^{M} \left( J(x_i, m) - \hat{V}(x_i; \theta) \right)^2$
$\hat{V}(x; \theta)$: approximated value function (e.g. polynomial approximation, neural network, etc.)
$V^\pi(x)$: state-value function
$J(x_i, m)$: $m$-th sample of the cost function at $x_i$, $m = 1, 2, \dots, M$
Gradient step:
$\theta_{k+1} = \theta_k - \eta \sum_{x_i \in X} \sum_{m=1}^{M} \nabla_\theta \hat{V}(x_i; \theta_k)\left( \hat{V}(x_i; \theta_k) - J(x_i, m) \right)$
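A sketch of this direct fit (my own, not from the slides), assuming a linear-in-features approximation $\hat V(x;\theta) = \phi(x)^\top \theta$ and Monte-Carlo cost samples $J(x_i, m)$; the feature map, data, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Feature map for V_hat(x; theta) = phi(x)^T theta (assumed for illustration)."""
    return np.array([1.0, x, x * x])

# Assumed data: for each sampled state x_i, M noisy cost samples J(x_i, m) around x_i^2.
states = rng.uniform(-1.0, 1.0, size=50)
J = np.array([[x * x + rng.normal(scale=0.1) for _ in range(10)] for x in states])

theta = np.zeros(3)
eta = 0.2                                            # step size (assumption)
for _ in range(2000):
    grad = np.zeros_like(theta)
    for x, samples in zip(states, J):
        v_hat = phi(x) @ theta
        # gradient of sum_m (J(x, m) - V_hat(x; theta))^2 with respect to theta
        grad += phi(x) * np.sum(v_hat - samples)
    theta -= eta * grad / J.size                     # averaged gradient step
print("fitted theta ~", np.round(theta, 3))          # roughly [0, 0, 1] since J ~ x^2
```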
Approximate Dynamic Programming
Indirect method (Projected equation)
Solve the projected Bellman equation: $\Phi\theta = \Pi T(\Phi\theta)$
[Diagram: projection onto the subspace $S = \{\Phi\theta \mid \theta \in \mathbb{R}^s\}$. Direct method: project the target $J$ to $\Pi J$. Indirect method: find the fixed point $\Phi\theta = \Pi T(\Phi\theta)$ of the projected operator.]
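One standard solver for the projected equation is LSTD(0); the sketch below (my own, under the assumption of linear features and on-policy transition samples $(x_i, r_i, x_{i+1})$) forms and solves the corresponding linear system.

```python
import numpy as np

def lstd(Phi, rewards, Phi_next, gamma, reg=1e-6):
    """LSTD(0): solve A theta = b with
    A = sum_i phi(x_i) (phi(x_i) - gamma * phi(x_{i+1}))^T,  b = sum_i phi(x_i) r_i,
    whose solution theta satisfies the projected Bellman equation Phi theta = Pi T^pi(Phi theta)."""
    s = Phi.shape[1]
    A = np.zeros((s, s))
    b = np.zeros(s)
    for phi_x, r, phi_xn in zip(Phi, rewards, Phi_next):
        A += np.outer(phi_x, phi_x - gamma * phi_xn)
        b += phi_x * r
    return np.linalg.solve(A + reg * np.eye(s), b)   # small ridge term for numerical stability

# Illustrative usage with synthetic features standing in for sampled transitions (assumption).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 3))        # phi(x_i) for 200 sampled states
Phi_next = rng.normal(size=(200, 3))   # phi(x_{i+1}) for the successor states
r = rng.normal(size=200)               # observed one-stage costs
print(lstd(Phi, r, Phi_next, gamma=0.9))
```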
Policy Evaluation (tabular):
• $V^\pi$ with $T^\pi$: TD(0), TD($\lambda$)
• $Q^\pi$ with $T^\pi$: SARSA(0), SARSA($\lambda$)
• $Q$ with $T^*$: Q-learning
Function Approximation, Direct (gradient methods):
• $V^\pi$: TD
• $Q^\pi$: SARSA
• $Q$: DQN
Function Approximation, Indirect (projected DP):
• $V^\pi$: TD, LSTD
• $Q^\pi$: TD, LSTD
• $Q$: LSPE
Summary
RL is a toolbox for solving the infinite-horizon, discrete-time DP equation
$\inf_{\pi \in \Pi} E_\pi\!\left[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})]\right]$
Approximation families:
• Approximation in value space: parametric approximation, problem approximation, rollout, MPC
• Approximation in policy space: policy search, policy gradient
• Approximate $E[\cdot]$: Monte-Carlo search, certainty equivalence
Approximation
in Value space
Approximation
in Policy space
Actor-Critic
• TD, SARSA, Q-learning
• Function approximation
• Policy search
• Policy gradient
DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
Approximate
Expectation
• Monte-Carlo search
• Certainty equivalence
Policy Evaluation (tabular):
• $V^\pi$ with $T^\pi$: TD(0), TD($\lambda$)
• $Q^\pi$ with $T^\pi$: SARSA(0), SARSA($\lambda$)
• $Q$ with $T^*$: Q-learning
Function Approximation, Direct (gradient methods):
• $V^\pi$: TD
• $Q^\pi$: SARSA
• $Q$: DQN
Function Approximation, Indirect (projected DP):
• $V^\pi$: TD, LSTD
• $Q^\pi$: TD, LSTD
• $Q$: LSPE
Q&A
Thank you