Stochastic Optimal Control
&
Reinforcement Learning
Jinwon Choi
Contents
01 Reinforcement Learning
02 Stochastic Optimal Control
03 Stochastic Control to Reinforcement Learning
04 Large Scale Reinforcement Learning
05 Summary
Reinforcement Learning

Ivan Pavlov

[Slide figure: agent-environment interaction loop. The Agent sends an Action to the Environment; the Environment returns a Reward and the next State.]
Markov Decision Process

Markov?
"The future is independent of the past given the present"
$\mathbb{P}(s_{t+1} \mid s_1, \dots, s_t) = \mathbb{P}(s_{t+1} \mid s_t)$
Memoryless process!

Markov "Decision" Process:
$\mathbb{P}(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = \mathbb{P}(s_{t+1} \mid s_t, a_t)$
The future state depends only on the current state and action,
and the policy also depends only on the current state:
$\pi(a_t \mid s_1, a_1, \dots, s_t) = \pi(a_t \mid s_t)$
Reinforcement Learning

[Slide figure: agent-environment loop with Action, Reward, and State.]

• State $s \in S \subset \mathbb{R}^n$
• Action $a \in A \subset \mathbb{R}^m$
• Action sequence $a_0, a_1, \dots$ with $a_i \in A$, $i = 0, 1, 2, \dots$
• Reward $r : S \times A \to \mathbb{R}$
• Discount factor $\gamma \in (0, 1)$
• Transition probability $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
• Total reward $R_{tot} = E_{s \sim p}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$
• Policy $\pi : S \to A$
• Total reward w.r.t. $\pi$: $R^{\pi} = E_{s \sim p,\, a \sim \pi}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$

Objective function:
$\max_{\pi \in \Pi} R^{\pi}$
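Not part of the original slides: a minimal Python sketch of how the discounted total reward above is computed from one sampled trajectory, assuming the per-step rewards $r(s_t, a_t)$ have already been collected into a list.

```python
# Hypothetical illustration: discounted return of one sampled trajectory.
def discounted_return(rewards, gamma=0.9):
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for a finite sample path."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Example rewards collected while following some policy.
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))
```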
Terminology of RL and Optimal Control

RL ↔ Optimal Control
• State ↔ State
• Action ↔ Control input
• Agent ↔ Controller
• Environment ↔ System
• Reward of a stage ↔ Cost of a stage
• Reward (or value) function ↔ Value (or cost) function
• Maximizing the value function ↔ Minimizing the value function
• Bellman operator ↔ DP mapping or operator
• Greedy policy w.r.t. $J$ ↔ Minimizing policy w.r.t. $J$
Stochastic Optimal Control
System Dynamics
• Continuous, deterministic: $\dot{x} = f(x, u)$
• Continuous, stochastic: $dx = f(x, u)\,dt + \sigma(x, u)\,dW$
• Discrete, deterministic: $x_{k+1} = f(x_k, u_k)$
• Discrete, stochastic: $x_{k+1} = f(x_k, u_k, w_k)$ ($w_k$ is a random Gaussian noise), or $x_{k+1} \sim p(x_{k+1} \mid x_k, u_k)$
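As an illustration (not from the slides), the discrete-time stochastic dynamics $x_{k+1} = f(x_k, u_k, w_k)$ can be rolled out as below; the linear $f$ and the noise scale are arbitrary assumptions chosen only to make the sketch runnable.

```python
import numpy as np

# Hypothetical example: x_{k+1} = f(x_k, u_k, w_k) with linear f and Gaussian noise w_k.
def f(x, u, w):
    return 0.9 * x + 0.5 * u + w   # assumed dynamics, for illustration only

def rollout(x0, controls, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    xs = [x0]
    for u in controls:
        w = rng.normal(0.0, noise_std)      # w_k ~ N(0, noise_std^2)
        xs.append(f(xs[-1], u, w))
    return xs

print(rollout(x0=1.0, controls=[0.0, -0.2, 0.1]))
```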
Stochastic Optimal Control
Control Input and Policy
• Continuous, deterministic (control input): $u(x)$
• Continuous, stochastic (policy): $u(x) \sim \pi(u \mid x)$
• Discrete, deterministic (control input): $\{u_0, u_1, u_2, \dots\}$
• Discrete, stochastic (policy): $u_k \sim \pi(u_k \mid x_k)$
Stochastic Optimal Control
Value Function
Continuous time:
• Finite-horizon: $\inf_{u \in U} E_{x \sim p}\big[\int_{t=0}^{T} r(x(t), u(t))\,dt + q(x(T))\big]$
• Infinite-horizon: $\inf_{u \in U} E_{x \sim p}\big[\int_{t=0}^{\infty} e^{-\gamma t} r(x(t), u(t))\,dt\big]$
Discrete time:
• Finite-horizon: $\inf_{u_k \in U} E_{x \sim p}\big[\sum_{k=0}^{N} r(x_k, u_k) + q(x_N)\big]$
• Infinite-horizon: $\inf_{u_k \in U} E_{x \sim p}\big[\sum_{k=0}^{\infty} \gamma^k r(x_k, u_k)\big]$
Stochastic Optimal Control
Dynamic Programming
Continuous time ($V(x(t))$):
• Finite-horizon: $\inf_{u \in U} E_{x \sim p}\big[\int_{t}^{t+\Delta t} r(x(s), u(s))\,ds + V(x(t+\Delta t))\big]$, with $V(x(T)) = q(x(T))$
• Infinite-horizon: $\inf_{u \in U} E_{x \sim p}\big[\int_{t}^{t+\Delta t} e^{-\gamma s} r(x(s), u(s))\,ds + V(x(t+\Delta t))\big]$
Discrete time ($V(x_k)$):
• Finite-horizon: $\inf_{u_k \in U}\big[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})]\big]$, with $V(x_T) = q(x_T)$
• Infinite-horizon: $\inf_{u_k \in U}\big[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})]\big]$
Stochastic Optimal Control
Dynamic Programming
Continuous time ($V(x(t))$): HJB equation
• Finite-horizon: $\frac{\partial V}{\partial t} + \inf_{u \in U}\big[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\sigma^{T}(x(t), u(t))\,\frac{\partial^2 V}{\partial x^2}\,\sigma(x(t), u(t))\big] = 0$, with $V(x(T)) = q(x(T))$
• Infinite-horizon: $-\gamma V + \inf_{u \in U}\big[ r(x(t), u(t)) + \frac{\partial V}{\partial x} f(x(t), u(t)) + \frac{1}{2}\sigma^{T}(x(t), u(t))\,\frac{\partial^2 V}{\partial x^2}\,\sigma(x(t), u(t))\big] = 0$
Discrete time ($V(x_k)$): Bellman equation
• Finite-horizon: $V(x_k) = \inf_{u_k \in U}\big[ r(x_k, u_k) + E_{x \sim p}[V(x_{k+1})]\big]$, with $V(x_T) = q(x_T)$
• Infinite-horizon: $V(x_k) = \inf_{u_k \in U}\big[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})]\big]$
Stochastic Optimal Control
Dynamic Programming
$\inf_{u_k \in U}\big[ r(x_k, u_k) + E_{x \sim p}[\gamma V(x_{k+1})]\big]$

How do we solve the infinite-horizon, discrete-time stochastic optimal control problem?
Note: there is another approach that uses a different dynamic programming equation, the average reward.
→ Value Iteration & Policy Iteration
Bellman Operator

Definition. Given a policy $\pi$, the state-value function $V^{\pi}: \mathbb{R}^n \to \mathbb{R}$ is defined by
$V^{\pi}(x_0) := E_{x \sim p,\,\pi}\big[\sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \,\big|\, x = x_0 \text{ at } t = 0\big]$
$\qquad\quad\;\, = r(x_0, \pi(x_0)) + E_{x \sim p,\,\pi}\big[\sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k)\big]$
and the state-input value function $Q^{\pi}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is defined by
$Q^{\pi}(x_0, u_0) := E_{x \sim p,\,\pi}\big[\sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \,\big|\, x = x_0,\, u = u_0 \text{ at } t = 0\big]$
$\qquad\qquad\quad\;\, = r(x_0, u_0) + E_{x \sim p,\,\pi}\big[\sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k)\big]$

The optimal value function is
$V(x_0) = \inf_{u_k \in U} E_{x \sim p}\big[\sum_{k=0}^{\infty} \gamma^k r(x_k, u_k)\big]$
Bellman Operator
Dynamic Programming
$V(x_k) = \inf_{u_k \in U}\big[ r(x_k, u_k) + \gamma E_{x \sim p}[V(x_{k+1})]\big]$
$V^{\pi}(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[V^{\pi}(x_{k+1})]$
$Q^{\pi}(x_k, u_k) = r(x_k, u_k) + \gamma E_{x \sim p,\,\pi}[Q^{\pi}(x_{k+1}, \pi(x_{k+1}))]$
Bellman Operator

Let $(\mathbb{B}, \|\cdot\|_{\infty}, d_{\infty})$ be a metric space, where $\mathbb{B} = \{\psi : \Omega \to \mathbb{R} \mid \psi \text{ continuous and bounded}\}$, $\|\psi\|_{\infty} := \sup_{x \in X} |\psi(x)|$, and $d_{\infty}(\psi, \psi') = \sup_{x \in X} |\psi(x) - \psi'(x)|$.

Definition. Given a policy $\pi$, the Bellman operator $T^{\pi}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{\pi}\psi)(x_k) = r(x_k, \pi(x_k)) + \gamma E_{x \sim p}[\psi(x_{k+1})]$
and the Bellman optimality operator $T^{*}: \mathbb{B} \to \mathbb{B}$ is defined by
$(T^{*}\psi)(x_k) = \min_{u_k \in U(x_k)}\big[ r(x_k, u_k) + \gamma E_{x \sim p}[\psi(x_{k+1})]\big]$
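Not from the slides: a minimal tabular sketch of the two operators above, assuming a finite MDP specified by a stage-cost array r[x, u] and transition probabilities p[x, u, x'] (the minimization convention matches the slides).

```python
import numpy as np

def bellman_policy_operator(psi, r, p, pi, gamma):
    """(T^pi psi)(x) = r(x, pi(x)) + gamma * E_{x' ~ p(.|x, pi(x))}[psi(x')]."""
    n_states = r.shape[0]
    return np.array([r[x, pi[x]] + gamma * p[x, pi[x]] @ psi for x in range(n_states)])

def bellman_optimal_operator(psi, r, p, gamma):
    """(T* psi)(x) = min_u [ r(x, u) + gamma * E_{x' ~ p(.|x, u)}[psi(x')] ]."""
    # r: (n_states, n_actions), p: (n_states, n_actions, n_states), psi: (n_states,)
    return np.min(r + gamma * p @ psi, axis=1)
```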
Bellman Operator

Proposition 1. (Monotonicity) The Bellman operators $T^{\pi}$, $T^{*}$ are monotone, i.e. if
$\psi(x) \le \psi'(x) \quad \forall x \in X$,
then
$(T^{\pi}\psi)(x) \le (T^{\pi}\psi')(x) \quad \forall x \in X$
$(T^{*}\psi)(x) \le (T^{*}\psi')(x) \quad \forall x \in X$.

Proposition 2. (Constant shift property) For any scalar $r$,
$(T^{\pi}(\psi + re))(x) = (T^{\pi}\psi)(x) + \gamma r \quad \forall x \in X$
$(T^{*}(\psi + re))(x) = (T^{*}\psi)(x) + \gamma r \quad \forall x \in X$.

Proposition 3. (Contraction) The Bellman operators $T^{\pi}$, $T^{*}$ are contractions with modulus $\gamma$ with respect to the sup norm $\|\cdot\|_{\infty}$, i.e.
$\|T^{\pi}\psi - T^{\pi}\psi'\|_{\infty} \le \gamma \|\psi - \psi'\|_{\infty} \quad \forall \psi, \psi' \in \mathbb{B}$
$\|T^{*}\psi - T^{*}\psi'\|_{\infty} \le \gamma \|\psi - \psi'\|_{\infty} \quad \forall \psi, \psi' \in \mathbb{B}$.
Bellman Operator

Theorem 2.3. (Contraction Mapping Theorem) Let $(\mathbb{B}, \|\cdot\|_{\infty}, d_{\infty})$ be a metric space and $T: \mathbb{B} \to \mathbb{B}$ a contraction mapping with modulus $\gamma$. Then,
1) $T$ has a unique fixed point in $\mathbb{B}$, i.e. there exists a unique $f^{*} \in \mathbb{B}$ s.t. $Tf^{*} = f^{*}$.
2) Consider the sequence $\{f_n\}$ with $f_{n+1} = Tf_n$ for any $f_0 \in \mathbb{B}$. Then $\lim_{n \to \infty} T^{n} f_0 = f^{*}$.
Value Iteration

Algorithm: Value Iteration
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   for all $x \in X$ do
4.     $V_{k+1} \leftarrow T^{*} V_k$
5. until $\|V_{k+1} - V_k\| < \Delta$
6. $\pi^{*}(x) \in \arg\min_{u \in U(x)}\big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V_k(x')\big]$
7. return $\pi^{*}$
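A runnable sketch of the value-iteration loop above for a small finite MDP (not part of the slides): it applies $T^{*}$ until the sup-norm change drops below $\Delta$ and then extracts a greedy (minimizing) policy. The 2-state, 2-action problem at the bottom is an assumed toy example.

```python
import numpy as np

def value_iteration(r, p, gamma=0.9, delta=1e-8):
    """r: (S, A) stage costs, p: (S, A, S) transition probabilities. Returns (V, pi)."""
    V = np.zeros(r.shape[0])                     # 1. arbitrary initialization
    while True:                                  # 2. repeat
        Q = r + gamma * p @ V                    # 4. one application of T*
        V_next = Q.min(axis=1)
        if np.max(np.abs(V_next - V)) < delta:   # 5. until ||V_{k+1} - V_k|| < delta
            V = V_next
            break
        V = V_next
    return V, Q.argmin(axis=1)                   # 6. greedy (minimizing) policy

# Tiny assumed example: 2 states, 2 actions.
r = np.array([[1.0, 2.0], [0.0, 3.0]])
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
V, pi = value_iteration(r, p)
print(V, pi)
```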
Policy Iteration

Algorithm: Policy Iteration
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
   1) Policy Evaluation
3.   for $x \in X$ with fixed $\pi_k$ do
4.     $V^{\pi_k}_{k+1} \leftarrow T^{\pi} V^{\pi_k}_{k}$
5.   until $\|V^{\pi_k}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
   2) Policy Improvement
6.   $\pi_{k+1}(x) \in \arg\min_{u \in U(x)}\big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V^{\pi_k}_{k+1}(x')\big]$
7. until $\|V^{\pi_{k+1}}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
8. return $\pi$
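A corresponding sketch of policy iteration for the same kind of tabular MDP (not from the slides). One assumption differs from the slide: the outer loop stops when the policy is stable rather than when the value change falls below $\Delta$, which is an equivalent and common stopping rule in the tabular case.

```python
import numpy as np

def policy_iteration(r, p, gamma=0.9, delta=1e-8):
    """r: (S, A) stage costs, p: (S, A, S). Returns (V, pi)."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial policy
    V = np.zeros(S)
    while True:
        # 1) Policy evaluation: iterate V <- T^pi V until convergence.
        while True:
            V_next = r[np.arange(S), pi] + gamma * p[np.arange(S), pi] @ V
            if np.max(np.abs(V_next - V)) < delta:
                V = V_next
                break
            V = V_next
        # 2) Policy improvement: greedy (minimizing) policy w.r.t. the evaluated V.
        pi_next = (r + gamma * p @ V).argmin(axis=1)
        if np.array_equal(pi_next, pi):          # policy stable -> return
            return V, pi
        pi = pi_next
```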
Stochastic Control to RL

Q: If the system dynamics $p$ and the reward function $r$ are unknown, how do we solve the DP equation?

Learning-based approach
1. Estimate the model ($r$ and $p$) from simulation data and use the previous methods
   → Model-based approach (model learning)
2. Without system identification, obtain the value function and policy directly from simulation data
   → Model-free approach
$\inf_{\pi \in \Pi} E_{\pi}\big[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})]\big]$

• Approximation in value space: parametric approximation, problem approximation, rollout, MPC
• Approximation in policy space: policy search, policy gradient
• Approximate $E[\cdot]$: Monte-Carlo search, certainty equivalence

[Slide figure: overlapping method families]
• Approximation in Value space: TD, SARSA, Q-learning; function approximation
• Approximation in Policy space: policy search; policy gradient
• Actor-Critic (their intersection): DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
• Approximate Expectation: Monte-Carlo search; certainty equivalence
Approximation in Value Space

DP algorithms sweep over "all states" at each step.
→ Use Monte-Carlo search: $E[f] \approx \frac{1}{N}\sum_{i=1}^{N} f_i$
With $N \sim 14{,}000{,}605$ samples, however, this is impractical.
Stochastic Approximation

Consider the fixed-point problem
$x = L(x)$.
This problem can be solved by the iterative algorithm
$x_{k+1} = L(x_k)$
or
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k L(x_k)$.
If $L(x)$ is of the form $E[f(x, w)]$, where $w$ is a random noise, then $L(x)$ can be approximated by
$L(x) \approx \frac{1}{N}\sum_{i=1}^{N} f(x, w_i)$,
which becomes inefficient when $N$ is large.
Stochastic Approximation

Instead, use a single sample as an estimate of the expectation in each update:
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k f(x_k, w_k)$.
This update can be seen as a stochastic approximation of the form
$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k \big( E[f(x_k, w_k)] + \varepsilon_k \big) = (1 - \alpha_k) x_k + \alpha_k \big( L(x_k) + \varepsilon_k \big)$,
where $\varepsilon_k = f(x_k, w_k) - E[f(x_k, w_k)]$.
Robbins-Monro stochastic approximation guarantees convergence under contraction or monotonicity assumptions on the mapping $L$, together with the step-size conditions
$\sum_{k=0}^{\infty} \alpha_k = +\infty$ and $\sum_{k=0}^{\infty} \alpha_k^2 < +\infty$.
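A toy sketch of the single-sample update above (not from the slides): estimating the fixed point of $L(x) = E[f(x, w)]$ with the step size $\alpha_k = 1/(k+1)$. The specific $f$ and noise distribution are arbitrary assumptions; $L$ here is a contraction with modulus 0.5 and fixed point 2.

```python
import numpy as np

# Hypothetical fixed-point problem: L(x) = E[0.5*x + 1 + w] with E[w] = 0, so x* = 2.
rng = np.random.default_rng(0)

def f(x, w):
    return 0.5 * x + 1.0 + w

x = 0.0
for k in range(10000):
    w = rng.normal(0.0, 1.0)               # single noisy sample per update
    alpha = 1.0 / (k + 1)                  # sum(alpha) = inf, sum(alpha^2) < inf
    x = (1 - alpha) * x + alpha * f(x, w)  # x_{k+1} = (1 - a_k) x_k + a_k f(x_k, w_k)
print(x)   # close to 2.0
```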
Policy Iteration

Algorithm: Policy Iteration (Classical Dynamic Programming)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x), \pi(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
   1) Policy Evaluation
3.   for $x \in X$ with fixed $\pi_k$ do
4.     $V^{\pi_k}_{k+1} \leftarrow T^{\pi} V^{\pi_k}_{k}$
5.   until $\|V^{\pi_k}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
   2) Policy Improvement
6.   $\pi_{k+1}(x) \in \arg\min_{u \in U(x)}\big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V^{\pi_k}_{k+1}(x')\big]$
7. until $\|V^{\pi_{k+1}}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
8. return $\pi$
Policy Iteration

Algorithm: Policy Iteration (Temporal Difference)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x), \pi(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate $\pi_k$ and store $\mathcal{D} = \{x_i, \pi_k(x_i), r(x_i, \pi_k(x_i))\}$
   1) Policy Evaluation
4.   for $x_i \in \mathcal{D}$ with fixed $\pi_k$ do
5.     $V^{\pi_k}_{k+1}(x_i) \leftarrow (1 - \alpha_k)\, V^{\pi_k}_{k}(x_i) + \alpha_k\big[ r(x_i, \pi_k(x_i)) + \gamma V^{\pi_k}_{k}(x_{i+1})\big]$
6.   until $\|V^{\pi_k}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
   2) Policy Improvement
7.   $\pi_{k+1}(x_i) \in \arg\min_{u \in U(x)}\big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V^{\pi_k}_{k+1}(x')\big]$
8. until $\|V^{\pi_{k+1}}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
9. return $\pi$
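A sketch of the TD(0) evaluation step in line 5 (not from the slides), assuming transitions $(x_i, r_i, x_{i+1})$ sampled while following a fixed policy $\pi_k$. A constant step size $\alpha$ is used for brevity; the Robbins-Monro conditions above would call for a diminishing one.

```python
import numpy as np

def td0_evaluation(transitions, n_states, gamma=0.9, alpha=0.1, n_passes=50):
    """transitions: list of (x, r, x_next) generated by a fixed policy."""
    V = np.zeros(n_states)
    for _ in range(n_passes):
        for x, r, x_next in transitions:
            # V(x) <- (1 - alpha) V(x) + alpha [ r + gamma V(x') ]
            V[x] = (1 - alpha) * V[x] + alpha * (r + gamma * V[x_next])
    return V
```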
Policy Iteration

Algorithm: Policy Iteration (SARSA)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x), \pi(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate $\pi_k$ and store $\mathcal{D} = \{x_i, \pi_k(x_i), r(x_i, \pi_k(x_i))\}$
   1) Policy Evaluation
4.   for $x_i \in \mathcal{D}$ with fixed $\pi_k$ do
5.     $Q^{\pi_k}_{k+1}(x_i, u_i) \leftarrow (1 - \alpha_k)\, Q^{\pi_k}_{k}(x_i, u_i) + \alpha_k\big[ r(x_i, \pi_k(x_i)) + \gamma Q^{\pi_k}_{k}(x_{i+1}, \pi_k(x_{i+1}))\big]$
6.   until $\|V^{\pi_k}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
   2) Policy Improvement
7.   $\pi_{k+1}(x_i) \in \arg\min_{u \in U(x)}\big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V^{\pi_k}_{k+1}(x')\big]$
8. until $\|V^{\pi_{k+1}}_{k+1} - V^{\pi_k}_{k}\| < \Delta$
9. return $\pi$
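A similar sketch for the SARSA evaluation step (not from the slides), assuming tuples $(x_i, u_i, r_i, x_{i+1}, u_{i+1})$ collected while following $\pi_k$, again with a constant step size for brevity.

```python
import numpy as np

def sarsa_evaluation(transitions, n_states, n_actions, gamma=0.9, alpha=0.1, n_passes=50):
    """transitions: list of (x, u, r, x_next, u_next) generated by a fixed policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_passes):
        for x, u, r, x_next, u_next in transitions:
            # Q(x, u) <- (1 - alpha) Q(x, u) + alpha [ r + gamma Q(x', pi(x')) ]
            Q[x, u] = (1 - alpha) * Q[x, u] + alpha * (r + gamma * Q[x_next, u_next])
    return Q
```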
Value Iteration

Algorithm: Value Iteration (Classical Dynamic Programming)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $V(x) \leftarrow$ initialize arbitrarily for all $x \in X$
2. repeat
3.   given $x_0$, simulate and store $\mathcal{D} = \{x_i, r(x_i, u_i)\}$
4.   for all $x \in X$ do
5.     $V_{k+1}(x) \leftarrow T^{*} V_k$
6. until $\|V_{k+1} - V_k\| < \Delta$
7. $\pi^{*}(x) \in \arg\min_{u \in U(x)}\big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V_k(x')\big]$
8. return $\pi^{*}$
Value Iteration

Algorithm: Value Iteration (Q-learning)
Input: $r, p, \gamma, \Delta$
Output: $\pi$
1. $Q(x, u) \leftarrow$ initialize arbitrarily for all $x \in X$, $u \in U$
2. repeat
3.   given $x_0$, simulate and store $\mathcal{D} = \{x_i, u_i, r(x_i, u_i), x_{i+1}\}$
4.   for $(x_i, u_i, r(x_i, u_i), x_{i+1}) \in \mathcal{D}$ do
5.     $Q_{k+1}(x_i, u_i) \leftarrow (1 - \alpha_k)\, Q_k(x_i, u_i) + \alpha_k\big[ r(x_i, u_i) + \gamma \min_{u_{i+1} \in U} Q_k(x_{i+1}, u_{i+1})\big]$
6. until $\|Q_{k+1} - Q_k\| < \Delta$
7. $\pi^{*}(x) \in \arg\min_{u \in U(x)} Q_{k+1}(x, u)$
8. return $\pi^{*}$
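A runnable sketch of the Q-learning update in line 5 (not from the slides), in the cost-minimization convention of the slides and assuming sampled tuples $(x_i, u_i, r_i, x_{i+1})$ from an arbitrary behavior policy.

```python
import numpy as np

def q_learning(transitions, n_states, n_actions, gamma=0.9, alpha=0.1, n_passes=50):
    """transitions: list of (x, u, r, x_next); the behavior policy can be arbitrary."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_passes):
        for x, u, r, x_next in transitions:
            # Q(x, u) <- (1 - alpha) Q(x, u) + alpha [ r + gamma min_u' Q(x', u') ]
            Q[x, u] = (1 - alpha) * Q[x, u] + alpha * (r + gamma * Q[x_next].min())
    return Q, Q.argmin(axis=1)   # greedy (minimizing) policy
```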
Approximation in Value Space
Policy Evaluation
• $V^{\pi}$ (with $T^{\pi}$): TD(0), TD($\lambda$)
• $Q^{\pi}$ (with $T^{\pi}$): SARSA(0), SARSA($\lambda$)
• $Q$ (with $T^{*}$): Q-learning
Large Scale RL

Large-scale RL
Number of states $> 2^{200}$ (for a $10 \times 20$ board)
→ "Function approximation"

[Slide figure (recap): Approximation in Value space (TD, SARSA, Q-learning; function approximation), Approximation in Policy space (policy search; policy gradient), their intersection Actor-Critic (DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …), and Approximate Expectation (Monte-Carlo search; certainty equivalence).]
Approximate Dynamic Programming

Direct method (gradient methods)

$\min_{\theta} \sum_{i=1}^{N} \big( V^{\pi}(x_i) - \hat{V}(x_i; \theta) \big)^2 \approx \min_{\theta} \sum_{x_i \in X} \sum_{m=1}^{M} \big( J(x_i, m) - \hat{V}(x_i; \theta) \big)^2$

$\hat{V}(x; \theta)$: approximated value function (e.g. polynomial approximation, neural network, etc.)
$V^{\pi}(x)$: state-value function
$J(x_i, m)$: $m$-th sample of the cost function at $x_i$, where $m = 1, 2, \dots, M$

Gradient descent on the sampled objective gives
$\theta_{k+1} = \theta_k + \eta \sum_{x_i \in X} \sum_{m=1}^{M} \nabla\hat{V}(x_i; \theta_k)\,\big( J(x_i, m) - \hat{V}(x_i; \theta_k) \big)$
(with the constant factor absorbed into the step size $\eta$).
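A minimal sketch of the direct (gradient) method above (not from the slides): fitting a linear value approximation $\hat{V}(x; \theta) = \phi(x)^{T}\theta$ to sampled cost values $J(x_i, m)$ by stochastic gradient descent. The feature map $\phi$ is an arbitrary assumption.

```python
import numpy as np

def fit_value_function(samples, phi, dim, eta=0.01, n_epochs=100):
    """samples: list of (x_i, J_im) pairs; phi(x): feature vector of length dim."""
    theta = np.zeros(dim)
    for _ in range(n_epochs):
        for x, J in samples:
            feats = phi(x)
            v_hat = feats @ theta
            # SGD step on (J - V_hat)^2: theta <- theta + eta * (J - V_hat) * grad V_hat
            theta += eta * (J - v_hat) * feats
    return theta

# Assumed polynomial features for a scalar state x.
phi = lambda x: np.array([1.0, x, x * x])
```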
Approximate Dynamic Programming

Indirect method (projected equation)

Solve the projected Bellman equation: $\Phi\theta = \Pi T(\Phi\theta)$

[Slide figure: Direct method projects $J$ onto the subspace $S = \{\Phi\theta \mid \theta \in \mathbb{R}^s\}$, giving $\Pi J$; Indirect method finds $\Phi\theta \in S$ such that $\Phi\theta = \Pi T(\Phi\theta)$, the projection of $T(\Phi\theta)$ back onto $S$.]
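One concrete instance of the indirect route is LSTD(0), which solves a sampled version of the projected Bellman equation $\Phi\theta = \Pi T(\Phi\theta)$ by forming a linear system $A\theta = b$ from transitions $(x_i, r_i, x_{i+1})$ under a fixed policy. This is a hedged sketch rather than the derivation on the slides, and it assumes enough distinct samples for $A$ to be invertible.

```python
import numpy as np

def lstd0(transitions, phi, dim, gamma=0.9):
    """transitions: list of (x, r, x_next) under a fixed policy; phi(x): features."""
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for x, r, x_next in transitions:
        f, f_next = phi(x), phi(x_next)
        A += np.outer(f, f - gamma * f_next)   # A ~ Phi^T D (Phi - gamma P Phi)
        b += f * r                             # b ~ Phi^T D r
    theta = np.linalg.solve(A, b)              # Phi @ theta approximates V^pi
    return theta
```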
Function Approximation
Policy Evaluation
Policy evaluation (tabular):
• $V^{\pi}$ (with $T^{\pi}$): TD(0), TD($\lambda$)
• $Q^{\pi}$ (with $T^{\pi}$): SARSA(0), SARSA($\lambda$)
• $Q$ (with $T^{*}$): Q-learning
Direct (gradient methods):
• $V^{\pi}$: TD
• $Q^{\pi}$: SARSA
• $Q$: DQN
Indirect (projected DP):
• $V^{\pi}$: TD, LSTD
• $Q^{\pi}$: TD, LSTD
• $Q$: LSPE
Summary

RL is a toolbox for solving the infinite-horizon, discrete-time DP problem
$\inf_{\pi \in \Pi} E_{\pi}\big[ r(x_k, \pi(x_k)) + E_{x \sim p}[\gamma V(x_{k+1})]\big]$
[Summary recap: approximation taxonomy]
• Approximation in value space: parametric approximation, problem approximation, rollout, MPC
• Approximation in policy space: policy search, policy gradient
• Approximate $E[\cdot]$: Monte-Carlo search, certainty equivalence

[Summary recap: method families]
• Approximation in Value space: TD, SARSA, Q-learning; function approximation
• Approximation in Policy space: policy search; policy gradient
• Actor-Critic (their intersection): DPG, DDPG, TRPO, CPO, PPO, Soft actor-critic, …
• Approximate Expectation: Monte-Carlo search; certainty equivalence

[Summary recap: policy evaluation with function approximation]
• Policy evaluation (tabular): $V^{\pi}$ with $T^{\pi}$ → TD(0), TD($\lambda$); $Q^{\pi}$ with $T^{\pi}$ → SARSA(0), SARSA($\lambda$); $Q$ with $T^{*}$ → Q-learning
• Direct (gradient methods): $V^{\pi}$ → TD; $Q^{\pi}$ → SARSA; $Q$ → DQN
• Indirect (projected DP): $V^{\pi}$ → TD, LSTD; $Q^{\pi}$ → TD, LSTD; $Q$ → LSPE
Q&A
Thank you