"Stochastic Optimal Control and Reinforcement Learning", invited to speak at the Nonlinear Dynamic Systems class taught by Prof. Frank Chong-woo Park, Seoul National University, December 4, 2019.
15. Markov Decision Process
Markov?
“The future is independent of the past given the present”
$\mathbb{P}(s_{t+1} \mid s_1, \dots, s_t) = \mathbb{P}(s_{t+1} \mid s_t)$
Markov “Decision” Process:
$\mathbb{P}(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = \mathbb{P}(s_{t+1} \mid s_t, a_t)$
Future state only depends on the current state and action
&
Policy also depends on the current state only
$\pi(a_t \mid s_1, a_1, \dots, s_t) = \pi(a_t \mid s_t)$
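A minimal tabular-MDP sketch in Python illustrating these two properties; the state/action counts, the transition tensor, and the policy table below are hypothetical placeholders, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP with 3 states and 2 actions.
n_states, n_actions = 3, 2

# P[s, a, s'] = probability of landing in s' when taking action a in state s.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # each P[s, a, :] is a distribution

policy_table = rng.integers(n_actions, size=n_states)   # a deterministic Markov policy

def step(s, a):
    """Sample the next state; it depends only on (s, a), not on the earlier history."""
    return rng.choice(n_states, p=P[s, a])

def policy(s):
    """The action depends only on the current state s."""
    return policy_table[s]

# Roll out a short trajectory.
s = 0
for t in range(5):
    a = policy(s)
    s_next = step(s, a)
    print(f"t={t}: s={s}, a={a}, s'={s_next}")
    s = s_next
```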
19. Terminology of RL and Optimal control
RL | Optimal Control
State | State
Action | Control Input
Agent | Controller
Environment | System
Reward of a stage | Cost of a stage
Reward (or value) function | Value (or cost) function
Maximizing the value function | Minimizing the value function
Bellman operator | DP mapping or operator
Greedy policy w.r.t. $J$ | Minimizing policy w.r.t. $J$
28. Stochastic Optimal Control
Dynamic Programming:
$$\inf_{u_k \in U} \Big[ r(x_k, u_k) + E_{x \sim p}\big[ \gamma V(x_{k+1}) \big] \Big]$$
How do we solve the infinite-horizon, discrete-time stochastic optimal control problem?
Note) There is another approach that uses a different dynamic programming equation, based on the average reward.
Value Iteration & Policy Iteration
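For a finite state space, the expectation becomes a sum over next states, which is the form the value iteration algorithm uses later in the slides:
$$V(x) = \inf_{u \in U(x)} \Big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V(x') \Big], \qquad \forall x \in X.$$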
29. Bellman Operator
Definition. Given a policy $\pi$, the state-value function $V^\pi: \mathbb{R}^n \to \mathbb{R}$ is defined by
$$V^\pi(x_0) := E_{x \sim p,\, \pi}\Big[ \sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \;\Big|\; x = x_0 \text{ at } t = 0 \Big] = r(x_0, \pi(x_0)) + E_{x \sim p,\, \pi}\Big[ \sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \Big]$$
and the state-input value function $Q^\pi: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is defined by
$$Q^\pi(x_0, u_0) := E_{x \sim p,\, \pi}\Big[ \sum_{k=0}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \;\Big|\; x = x_0,\, u = u_0 \text{ at } t = 0 \Big] = r(x_0, u_0) + E_{x \sim p,\, \pi}\Big[ \sum_{k=1}^{\infty} \gamma^k r_k(x_k, \pi(x_k), w_k) \Big]$$
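For reference, the two functions are related by the standard identities, which follow directly from the definitions above:
$$V^\pi(x) = Q^\pi(x, \pi(x)), \qquad Q^\pi(x, u) = r(x, u) + \gamma\, E_{x' \sim p(\cdot \mid x, u)}\big[ V^\pi(x') \big].$$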
34. Bellman Operator
Proposition 1. (Monotonicity) The Bellman operators $T^\pi$, $T^*$ are monotone, i.e., if
$$\psi(x) \le \psi'(x) \quad \forall x \in X,$$
then
$$T^\pi \psi(x) \le T^\pi \psi'(x) \quad \forall x \in X,$$
$$T^* \psi(x) \le T^* \psi'(x) \quad \forall x \in X.$$
Proposition 2. (Constant shift property) For any scalar $r$,
$$T^\pi(\psi + r e)(x) = T^\pi \psi(x) + \gamma r \quad \forall x \in X,$$
$$T^*(\psi + r e)(x) = T^* \psi(x) + \gamma r \quad \forall x \in X,$$
where $e$ is the unit function, $e(x) = 1$ for all $x$.
Proposition 3. The Bellman operators $T^\pi$, $T^*$ are contractions
with modulus $\gamma$ with respect to the sup norm $\|\cdot\|_\infty$, i.e.,
$$\|T^\pi \psi - T^\pi \psi'\|_\infty \le \gamma \|\psi - \psi'\|_\infty \quad \forall \psi, \psi' \in \mathbb{B},$$
$$\|T^* \psi - T^* \psi'\|_\infty \le \gamma \|\psi - \psi'\|_\infty \quad \forall \psi, \psi' \in \mathbb{B}.$$
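A small numerical sanity check of these propositions on a random tabular model; the model arrays below are hypothetical, and $T^\pi$, $T^*$ are written in the finite-state form $r(x,u) + \gamma \sum_{x'} p(x' \mid x, u)\, V(x')$ used elsewhere in the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9

# Hypothetical tabular model: stage costs r[s, a] and transitions P[s, a, s'].
r = rng.random((nS, nA))
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)

pi = rng.integers(nA, size=nS)   # a fixed deterministic policy, pi[s] = action

def T_pi(V):
    """(T^pi V)(s) = r(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) V(s')."""
    return r[np.arange(nS), pi] + gamma * (P[np.arange(nS), pi] @ V)

def T_star(V):
    """(T* V)(s) = min_u [ r(s, u) + gamma * sum_{s'} P(s'|s, u) V(s') ]."""
    return (r + gamma * (P @ V)).min(axis=1)

V1, V2 = rng.random(nS), rng.random(nS)

# Contraction with modulus gamma in the sup norm.
print(np.abs(T_pi(V1) - T_pi(V2)).max() <= gamma * np.abs(V1 - V2).max() + 1e-12)
print(np.abs(T_star(V1) - T_star(V2)).max() <= gamma * np.abs(V1 - V2).max() + 1e-12)

# Constant shift property: T*(V + c*e) = T*(V) + gamma * c.
print(np.allclose(T_star(V1 + 0.5), T_star(V1) + gamma * 0.5))
```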
35. Bellman Operator
Theorem 2.3. (Contraction Mapping Theorem) Let $(\mathbb{B}, d_\infty)$ be a complete metric space, where $d_\infty$ is the metric induced by the sup norm $\|\cdot\|_\infty$, and let $T: \mathbb{B} \to \mathbb{B}$ be a contraction mapping with modulus $\gamma$. Then,
1) $T$ has a unique fixed point in $\mathbb{B}$, i.e., there exists a unique $f^* \in \mathbb{B}$ such that $T f^* = f^*$.
2) For any $f_0 \in \mathbb{B}$, the sequence $\{f_n\}$ defined by $f_{n+1} = T f_n$ satisfies $\lim_{n \to \infty} T^n f_0 = f^*$.
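A standard consequence of the contraction property, relevant for the next slide: iterating $T$ converges geometrically,
$$d_\infty(T^n f_0, f^*) = d_\infty(T^n f_0, T^n f^*) \le \gamma^n\, d_\infty(f_0, f^*),$$
so value iteration inherits a geometric convergence rate $\gamma$.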
36. Value Iteration
Algorithm: Value Iteration
Input: $r$, $p$, $\gamma$, $\Delta$
Output: $\pi^*$
1. Initialize $V_0(x)$ arbitrarily for all $x \in X$
2. repeat
3.   for all $x \in X$ do
4.     $V_{k+1}(x) \leftarrow (T^* V_k)(x)$
5. until $\|V_{k+1} - V_k\|_\infty < \Delta$
6. $\pi^*(x) \in \arg\min_{u \in U(x)} \big[ r(x, u) + \gamma \sum_{x' \in X} p(x' \mid x, u)\, V_k(x') \big]$
7. return $\pi^*$
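A compact Python sketch of this algorithm for a finite MDP, reusing the tabular conventions of the earlier sketches; the cost/transition arrays and the threshold value are hypothetical.

```python
import numpy as np

def value_iteration(r, P, gamma, delta):
    """Tabular value iteration.

    r: (nS, nA) stage costs, P: (nS, nA, nS) transition probabilities.
    Returns an approximately optimal value function and a greedy policy.
    """
    nS, nA = r.shape
    V = np.zeros(nS)                          # arbitrary initialization
    while True:
        Q = r + gamma * (P @ V)               # Q[s, a] = r(s,a) + gamma * E[V(x')]
        V_next = Q.min(axis=1)                # apply the optimal Bellman operator T*
        if np.max(np.abs(V_next - V)) < delta:
            V = V_next
            break
        V = V_next
    pi = (r + gamma * (P @ V)).argmin(axis=1) # greedy (minimizing) policy w.r.t. V
    return V, pi

# Example usage with a small random (hypothetical) model.
rng = np.random.default_rng(0)
nS, nA = 5, 3
r = rng.random((nS, nA))
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
V, pi = value_iteration(r, P, gamma=0.9, delta=1e-8)
print(V, pi)
```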
42. Learning-based approach
Q: If the system dynamics $p$ and the reward function $r$ are unknown, how do we solve the DP equation?
1. Estimate the model ($r$ and $p$) from simulation data and use the previous methods
Model-based approach (model learning)
2. Without system identification, obtain the value function and policy directly from simulation data
Model-free approach
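A minimal sketch of option 1 (model learning): estimate $p$ and $r$ by empirical averages over observed transitions, then run value iteration on the estimates. The data format, a list of (x, u, reward, x') tuples, is an assumption for illustration.

```python
import numpy as np

def estimate_model(transitions, nS, nA):
    """Empirical estimates of P(x'|x,u) and r(x,u) from (x, u, reward, x') tuples."""
    counts = np.zeros((nS, nA, nS))
    r_sum = np.zeros((nS, nA))
    n_sa = np.zeros((nS, nA))
    for x, u, rew, x_next in transitions:
        counts[x, u, x_next] += 1
        r_sum[x, u] += rew
        n_sa[x, u] += 1
    # Unvisited (x, u) pairs keep zero estimates and would need exploration in practice.
    P_hat = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    r_hat = r_sum / np.maximum(n_sa, 1)
    return P_hat, r_hat

# P_hat, r_hat can then be passed to the value_iteration sketch above.
```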
44.
$$\inf_{\pi \in \Pi} E_\pi\Big[ r(x_k, \pi(x_k)) + E_{x \sim p}\big[ \gamma V(x_{k+1}) \big] \Big]$$
Approximation in value space: parametric approximation, problem approximation, rollout, MPC
Approximate $E[\cdot]$: Monte-Carlo search, certainty equivalence
Approximation in policy space: policy search, policy gradient
45.
Approximation in Value space
• TD, SARSA, Q-learning
• Function approximation
Approximation in Policy space
• Policy search
• Policy gradient
Actor-Critic (combining value- and policy-space approximation)
• DPG, DDPG, TRPO, CPO, PPO, Soft Actor-Critic, …
Approximate Expectation
• Monte-Carlo search
• Certainty equivalence
48. Approximation in Value Space
DP algorithms sweep over all states at each step.
Use Monte-Carlo search:
$$E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f_i$$
With $N \sim 14{,}000{,}605$, this is impractical.
49. Stochastic Approximation
Consider the problem
$$x = L(x).$$
This problem can be solved by the iterative algorithm
$$x_{k+1} = L(x_k)$$
or
$$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k L(x_k).$$
If $L(x)$ is of the form $E[f(x, w)]$, where $w$ is random noise, then $L(x)$ can be approximated by
$$L(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x, w_i),$$
which becomes inefficient when $N$ is large.
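A toy sketch of the sample-average scheme; the mapping $f(x, w) = 0.5x + w$ with zero-mean noise is a made-up example (so $L(x) = 0.5x$ and the fixed point is $x^* = 0$), not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Toy example: L(x) = E[f(x, w)] = 0.5 * x, fixed point x* = 0.
    return 0.5 * x + w

def L_sample_average(x, N):
    """Approximate L(x) = E[f(x, w)] with N i.i.d. noise samples."""
    w = rng.normal(size=N)
    return np.mean(f(x, w))

x = 10.0
for k in range(50):
    alpha = 1.0 / (k + 1)
    x = (1 - alpha) * x + alpha * L_sample_average(x, N=1000)   # expensive if N is large
print(x)   # close to the fixed point 0
```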
50. Stochastic Approximation
Use a single sample as an estimate of the expectation in each update:
$$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k f(x_k, w_k).$$
This update can be seen as a stochastic approximation of the form
$$x_{k+1} = (1 - \alpha_k) x_k + \alpha_k \big( E[f(x_k, w_k)] + \varepsilon_k \big) = (1 - \alpha_k) x_k + \alpha_k \big( L(x_k) + \varepsilon_k \big),$$
where $\varepsilon_k = f(x_k, w_k) - E[f(x_k, w_k)]$.
Robbins-Monro stochastic approximation guarantees convergence under contraction or monotonicity assumptions on the mapping $L$, together with the step-size conditions
$$\sum_{k=0}^{\infty} \alpha_k = +\infty \quad \text{and} \quad \sum_{k=0}^{\infty} \alpha_k^2 < +\infty.$$
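For comparison with the previous sketch, the single-sample Robbins-Monro version of the same hypothetical toy example, with step sizes $\alpha_k = 1/(k+1)$ satisfying the two conditions above:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Same toy mapping as before: E[f(x, w)] = 0.5 * x, fixed point x* = 0.
    return 0.5 * x + w

x = 10.0
for k in range(10000):
    alpha = 1.0 / (k + 1)            # sum(alpha) = inf, sum(alpha^2) < inf
    w = rng.normal()                 # one noise sample per update
    x = (1 - alpha) * x + alpha * f(x, w)
print(x)   # converges (approximately) to the fixed point 0
```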