This document presents Q-learning methods for finding optimal control policies for nonlinear systems with continuous state spaces. It outlines a five-step approach: 1) recognize the fixed point equation for the Q-function; 2) find a stabilizing policy that is ergodic; 3) minimize the Bellman error as the optimality criterion; 4) apply an adjoint operation; 5) interpret and simulate the results. As an example, these steps are applied to the linear quadratic regulator (LQR) problem, where the Q-function is approximated within a parameterized function class; the goal is to seek the best approximation within that class.
Q-Learning and Pontryagin's Minimum Principle for Coarse Models
1. Q-Learning and Pontryagin's Minimum Principle
Sean Meyn
Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois
Joint work with Prashant Mehta
NSF support: ECS-0523620
2. Outline
Coarse models - what to do with them?
Q-learning for nonlinear state space models:
    Step 1: Recognize
    Step 2: Find a stabilizing policy
    Step 3: Optimality
    Step 4: Adjoint
    Step 5: Interpret
Example: Local approximation
Example: Decentralized control
4. Coarse Models: A rich collection of model reduction techniques
Many of today's participants have contributed to this research. A biased list:
Fluid models: Law of Large Numbers scaling, most likely paths in large deviations
Workload relaxation for networks
Heavy-traffic limits
Clustering: spectral graph theory
Markov spectral theory
Singular perturbations
Large population limits: interacting particle systems
5. Workload Relaxations
An example from CTCN:
Figure 7.1: Demand-driven model with routing, scheduling, and re-work.
Workload at the two stations evolves as a two-dimensional system, and the cost is projected onto these coordinates.
[Figure 7.2: Optimal policies for two instances of the network shown in Figure 7.1. In each figure the optimal stochastic control region R_STO is compared with the optimal region R* obtained for the two-dimensional fluid model (workload coordinates w1, w2, drift -(1-ρ) in each coordinate). The optimal policy for the relaxation is a hedging policy, shown alongside the policy for the full network.]
6. Workload Relaxations and Simulation
An example from CTCN: a two-station network with arrival rates α and service rates µ.
Decision making at Stations 1 & 2, e.g., setting safety-stock levels.
DP and simulations are accelerated using the fluid value function for the workload relaxation.
[Figures: value iteration (VIA) initialized with zero versus the fluid value function, average cost per iteration; simulated mean with and without control variate, average cost versus safety-stock levels.]
8. What To Do With a Coarse Model?
Setting: we have qualitative or partial quantitative insight regarding optimal control.
The network examples relied on specific network structure. What about other models?
An answer lies in a new formulation of Q-learning.
9. Outline
Coarse models - what to do with them?
Q-learning for nonlinear state space models:
    Step 1: Recognize
    Step 2: Find a stabilizing policy
    Step 3: Optimality
    Step 4: Adjoint
    Step 5: Interpret
Example: Local approximation
Example: Decentralized control
11. What is Q-learning?
Watkins' formulation (1989/1992), applied to finite state space MDPs.
    [C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 1992]
The idea is similar to Mayne & Jacobson's differential dynamic programming.
    [D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. American Elsevier Pub. Co., 1970]
Deterministic formulation: a nonlinear system on Euclidean space,
    \frac{d}{dt} x(t) = f(x(t), u(t)), \qquad t \ge 0
Infinite-horizon discounted cost criterion,
    J^*(x) = \inf \int_0^\infty e^{-\gamma s} c(x(s), u(s)) \, ds, \qquad x(0) = x,
with c a non-negative cost function.
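To make the cost criterion concrete, here is a minimal numerical sketch, assuming forward-Euler integration and an illustrative scalar system (the dynamics f, cost c, policy phi, and all constants are stand-ins, not from the slides):

```python
import numpy as np

# Minimal sketch (not from the slides): evaluate the discounted cost of a
# fixed feedback policy by forward-Euler integration. The dynamics f, cost c,
# policy phi, and all numeric values below are illustrative assumptions.
def discounted_cost(f, c, phi, x0, gamma=1.0, T=50.0, dt=1e-3):
    x, J = x0, 0.0
    for k in range(int(T / dt)):
        t = k * dt
        u = phi(x)
        J += np.exp(-gamma * t) * c(x, u) * dt   # accumulates e^{-gamma s} c ds
        x = x + f(x, u) * dt                     # Euler step of dx/dt = f(x,u)
    return J

# Example: scalar linear system dx/dt = -x + u, quadratic cost, zero input.
f = lambda x, u: -x + u
c = lambda x, u: 0.5 * x**2 + 0.5 * u**2
print(discounted_cost(f, c, phi=lambda x: 0.0, x0=1.0))
```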
14. What is Q-learning?
[Figure: the estimated optimal policy on [-1, 1].]
Deterministic formulation: a nonlinear system on Euclidean space,
    \frac{d}{dt} x(t) = f(x(t), u(t)), \qquad t \ge 0
Infinite-horizon discounted cost criterion,
    J^*(x) = \inf \int_0^\infty e^{-\gamma s} c(x(s), u(s)) \, ds, \qquad x(0) = x,
with c a non-negative cost function.
Differential generator: for any smooth function h,
    D^u h\,(x) := (\nabla h(x))^T f(x, u).
HJB equation:
    \min_u \bigl\{ c(x, u) + D^u J^*(x) \bigr\} = \gamma J^*(x).
The Q-function of Q-learning is this function of two variables.
16. Q-learning - Steps towards an algorithm
Sequence of five steps:
Step 1: Recognize the fixed point equation for the Q-function
Step 2: Find a stabilizing policy that is ergodic
Step 3: Optimality criterion - minimize the Bellman error
Step 4: Adjoint operation
Step 5: Interpret and simulate!
Goal: seek the best approximation within a parameterized class.
18. Q-learning - Steps towards an algorithm
Step 1: Recognize the fixed point equation for the Q-function.
Q-function:
    H^*(x, u) = c(x, u) + D^u J^*(x)
Its minimum:
    \underline{H}^*(x) := \min_{u \in U} H^*(x, u) = \gamma J^*(x)
Fixed point equation:
    D^u \underline{H}^*(x) = -\gamma \bigl( c(x, u) - H^*(x, u) \bigr)
Key observation for learning: for any input-output pair,
    D^u \underline{H}^*(x) \Big|_{x = x(t),\, u = u(t)} = \frac{d}{dt} \underline{H}^*(x(t)).
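The key observation is what makes learning possible: the left-hand side can be formed from trajectory data alone, without knowledge of f. A minimal sketch, assuming an illustrative system, a trial Q-function, and a grid minimization over u (none of these are from the slides):

```python
import numpy as np

# Sketch of the key observation, under illustrative assumptions: along a
# trajectory, (d/dt) H_min(x(t)) can be formed from samples alone, so the
# fixed point equation D^u H_min = -gamma (c - H) can be tested from data.
# The system, trial Q-function H, and input are stand-ins, not the authors'.
dt, T, gamma = 1e-3, 10.0, 1.0
f = lambda x, u: -x + u
c = lambda x, u: 0.5 * x**2 + 0.5 * u**2
H = lambda x, u: c(x, u) + x * u                    # trial Q-function (arbitrary)
ugrid = np.linspace(-2.0, 2.0, 201)
Hmin = lambda x: min(H(x, u) for u in ugrid)        # H_min(x) = min_u H(x,u)

x, errs = 1.0, []
for k in range(int(T / dt)):
    u = np.sin(k * dt)                              # exploratory input
    x_next = x + f(x, u) * dt
    dH = (Hmin(x_next) - Hmin(x)) / dt              # (d/dt) H_min along the path
    errs.append(dH + gamma * (c(x, u) - H(x, u)))   # residual of the fixed point eq.
    x = x_next
print("mean-square Bellman error:", np.mean(np.square(errs)))
```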
19. Q-learning - LQR example
Linear model and quadratic cost,
    Cost:  c(x, u) = \tfrac{1}{2} x^T Q x + \tfrac{1}{2} u^T R u
Q-function:
    H^*(x, u) = c(x, u) + (Ax + Bu)^T P^* x = c(x, u) + D^u J^*(x),
where J^*(x) = \tfrac{1}{2} x^T P^* x and P^* solves the Riccati equation.
20. Q-learning - LQR example
Linear model and quadratic cost,
    Cost:  c(x, u) = \tfrac{1}{2} x^T Q x + \tfrac{1}{2} u^T R u
Q-function:
    H^*(x, u) = c(x, u) + (Ax + Bu)^T P^* x, \quad \text{with } P^* \text{ solving the Riccati equation.}
Q-function approximation:
    H^\theta(x, u) = c(x, u) + \tfrac{1}{2} \sum_{i=1}^{d_x} \theta_i \, x^T E^i x + \sum_{j=1}^{d_{xu}} \theta_j \, x^T F^j u
Minimum:
    \underline{H}^\theta(x) = \tfrac{1}{2} x^T \bigl( Q + E^\theta - F^{\theta\,T} R^{-1} F^\theta \bigr) x
Minimizer:
    u^\theta(x) = \phi^\theta(x) = -R^{-1} F^\theta x
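A small sketch of this parameterization, with illustrative basis matrices E^i, F^j and dimensions (all assumptions, chosen only to exhibit the algebra of H^θ and its minimizer):

```python
import numpy as np

# Sketch of the LQR Q-function approximation above, with illustrative basis
# matrices E^i, F^j and dimensions (assumptions, not the authors' choices).
n, m = 2, 1
Q, R = np.eye(n), np.eye(m)

E = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]            # x^T E^i x terms
F = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]  # x^T F^j u terms

def H_theta(x, u, thE, thF):
    """H^theta(x,u) = c(x,u) + (1/2) sum_i th_i x'E^i x + sum_j th_j x'F^j u."""
    Eth = sum(t * Ei for t, Ei in zip(thE, E))
    Fth = sum(t * Fj for t, Fj in zip(thF, F))
    cxu = 0.5 * x @ Q @ x + 0.5 * u @ R @ u
    return cxu + 0.5 * x @ Eth @ x + x @ Fth @ u

def u_theta(x, thF):
    """Minimizer of H^theta over u: u = -R^{-1} (F^theta)^T x."""
    Fth = sum(t * Fj for t, Fj in zip(thF, F))
    return -np.linalg.solve(R, Fth.T @ x)

x = np.array([1.0, -0.5])
print(u_theta(x, thF=[0.3, 0.1]))
print(H_theta(x, np.array([0.2]), thE=[0.1, 0.1], thF=[0.3, 0.1]))
```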
21. Q-learning - Steps towards an algorithm
Step 2: a stationary policy that is ergodic?
Assume the Law of Large Numbers holds for continuous functions F : X \times U \to \mathbb{R}: as T \to \infty,
    \frac{1}{T} \int_0^T F(x(t), u(t)) \, dt \longrightarrow \int_{X \times U} F(x, u) \, \pi(dx, du).
23. Q-learning - Steps towards an algorithm
Step 2: a stationary policy that is ergodic?
Suppose, for example, that the input is scalar and the system is stable [bounded-input/bounded-state]. Then one can try a linear combination of sinusoids,
    u(t) = A \bigl( \sin(t) + \sin(\pi t) + \sin(e t) \bigr).
[Figure: the resulting estimate of the optimal policy.]
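A sketch of this exploration signal driving a stable system, with an illustrative amplitude, horizon, and test function F, to show the time average forming:

```python
import numpy as np

# Sketch of the exploration signal above: sinusoids at the incommensurate
# frequencies 1, pi, and e, so that time averages converge for a stable
# (bounded-input/bounded-state) system. Amplitude, horizon, step size, and
# the test function F are illustrative assumptions.
A, dt, T = 0.5, 1e-3, 100.0
t = np.arange(0.0, T, dt)
u = A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

x = np.zeros_like(t)
for k in range(len(t) - 1):
    x[k + 1] = x[k] + (-x[k] ** 3 + u[k]) * dt   # stable system dx/dt = -x^3 + u

F_avg = np.mean(x ** 2 * u ** 2)                 # (1/T) \int F(x(t), u(t)) dt
print("time average of F(x,u) = x^2 u^2:", F_avg)
```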
24. Q-learning - Steps towards an algorithm
Step 3: Bellman error.
Based on observations, minimize the mean-square Bellman error,
    \mathcal{E}(\theta) := \tfrac{1}{2} \| \mathcal{L}^\theta \|^2, \qquad \mathcal{L}^\theta := D^u \underline{H}^\theta + \gamma \bigl( c - H^\theta \bigr).
First-order condition for optimality:
    \langle \mathcal{L}^\theta,\; D^u \psi_i^\theta - \gamma \psi_i^\theta \rangle = 0, \qquad \text{with } \psi_i^\theta(x) = \psi_i(x, \phi^\theta(x)), \quad 1 \le i \le d.
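A sketch of the empirical mean-square Bellman error for a parameterization linear in θ; for brevity the derivative is taken along the applied input rather than after minimizing over u, a simplification relative to the slides (trajectory, basis ψ, and constants are illustrative):

```python
import numpy as np

# Sketch of the empirical mean-square Bellman error for H^theta = c + theta.psi
# (linear in theta). The time derivative is taken along the applied input
# (a simplification relative to the slides); data and basis are illustrative.
gamma, dt = 1.0, 1e-3

def mean_square_bellman_error(theta, xs, us, psi, c):
    H = np.array([c(x, u) + theta @ psi(x, u) for x, u in zip(xs, us)])
    dH = np.gradient(H, dt)                  # d/dt H^theta(x(t),u(t)) from samples
    cs = np.array([c(x, u) for x, u in zip(xs, us)])
    L = dH + gamma * (cs - H)                # pointwise Bellman error L^theta
    return 0.5 * np.mean(L ** 2)

# Tiny demo on a simulated trajectory of dx/dt = -x + u.
t = np.arange(0.0, 10.0, dt)
us = np.sin(t)
xs = np.zeros_like(t)
for k in range(len(t) - 1):
    xs[k + 1] = xs[k] + (-xs[k] + us[k]) * dt
psi = lambda x, u: np.array([x * x, x * u])
c = lambda x, u: 0.5 * x * x + 0.5 * u * u
print(mean_square_bellman_error(np.array([0.1, -0.2]), xs, us, psi, c))
```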
25. Q-learning - Convex Reformulation
Step 3: Bellman error.
Based on observations, minimize the mean-square Bellman error, now over the pair (G^\theta, H^\theta), with the pointwise minimum \underline{H}^\theta replaced by a function G^\theta constrained to be a global lower bound:
    G^\theta(x) \le H^\theta(x, u), \qquad \text{for all } x, u.
26. Q-learning - LQR example
Linear model and quadratic cost,
    Cost:  c(x, u) = \tfrac{1}{2} x^T Q x + \tfrac{1}{2} u^T R u
Q-function:
    H^*(x, u) = c(x, u) + (Ax + Bu)^T P^* x, \quad \text{with } P^* \text{ solving the Riccati equation.}
Q-function approximation:
    H^\theta(x, u) = c(x, u) + \tfrac{1}{2} \sum_{i=1}^{d_x} \theta_i \, x^T E^i x + \sum_{j=1}^{d_{xu}} \theta_j \, x^T F^j u
Approximation to the minimum:
    G^\theta(x) = \tfrac{1}{2} x^T G^\theta x
Minimizer:
    u^\theta(x) = \phi^\theta(x) = -R^{-1} F^\theta x
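For these quadratics the lower-bound constraint is a linear matrix inequality, affine in the unknowns, which is one way to see the convexity of the reformulation. A sketch of the equivalence (the block-matrix form is a reconstruction, not stated on the slide):

```latex
% For the quadratic parameterization, G^theta(x) <= H^theta(x,u) for all (x,u)
% reads as nonnegativity of a quadratic form, hence an LMI:
\tfrac{1}{2}
\begin{bmatrix} x \\ u \end{bmatrix}^{\!T}
\begin{bmatrix} Q + E^\theta - G^\theta & F^\theta \\ (F^\theta)^T & R \end{bmatrix}
\begin{bmatrix} x \\ u \end{bmatrix} \ge 0 \quad \forall (x, u)
\;\iff\;
\begin{bmatrix} Q + E^\theta - G^\theta & F^\theta \\ (F^\theta)^T & R \end{bmatrix} \succeq 0,
% which is affine in (G^theta, E^theta, F^theta).
```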
28. Q-learning - Steps towards an algorithm
Step 4: Causal smoothing to avoid differentiation.
For any function of two variables g : X \times W \to \mathbb{R}, the resolvent gives a new function,
    R_\beta g\,(x, w) = \int_0^\infty e^{-\beta t} g(x(t), \xi(t)) \, dt, \qquad \beta > 0,
with the system controlled using the nominal policy u(t) = \phi(x(t), \xi(t)), t \ge 0 (stabilizing & ergodic).
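A sketch of computing R_β g along a recorded closed-loop trajectory by a backward discounted recursion (system, exploration signal, and g are illustrative):

```python
import numpy as np

# Sketch of the resolvent R_beta g along a recorded closed-loop trajectory,
# using the backward recursion r_k = g_k*dt + e^{-beta*dt} * r_{k+1} to
# approximate the discounted integral over the future. The nominal loop,
# signal xi, and test function g are illustrative assumptions.
beta, dt, T = 1.0, 1e-3, 20.0
t = np.arange(0.0, T, dt)
xi = np.sin(t) + np.sin(np.pi * t)               # exploration signal xi(t)
x = np.zeros_like(t)
for k in range(len(t) - 1):
    x[k + 1] = x[k] + (-x[k] + xi[k]) * dt       # nominal stabilizing loop

g = x ** 2                                       # g evaluated along the path
r = np.zeros_like(g)
for k in range(len(g) - 2, -1, -1):
    r[k] = g[k] * dt + np.exp(-beta * dt) * r[k + 1]
print("(R_beta g) at t = 0:", r[0])
```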
30. Q-learning - Steps towards an algorithm
Step 4: Causal smoothing to avoid differentiation.
Resolvent equation:
    R_\beta D^u = \beta R_\beta - I.
Smoothed Bellman error:
    \mathcal{L}^{\theta,\beta} = R_\beta \mathcal{L}^\theta = [\beta R_\beta - I] \, \underline{H}^\theta + \gamma R_\beta \bigl( c - H^\theta \bigr).
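The smoothed error follows in one line from the resolvent equation; spelling out the step (notation as above, with \underline{H}^\theta = \min_u H^\theta):

```latex
% One-line derivation of the smoothed Bellman error, using the resolvent
% equation R_beta D^u = beta R_beta - I along the closed loop:
\mathcal{L}^{\theta,\beta}
  := R_\beta \mathcal{L}^\theta
   = R_\beta \bigl( D^u \underline{H}^\theta + \gamma (c - H^\theta) \bigr)
   = [\beta R_\beta - I] \, \underline{H}^\theta + \gamma R_\beta (c - H^\theta).
```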
32. Q-learning - Steps towards an algorithm
Smoothed Bellman error:
    \mathcal{E}_\beta(\theta) := \tfrac{1}{2} \| \mathcal{L}^{\theta,\beta} \|^2, \qquad \nabla \mathcal{E}_\beta(\theta) = \langle \mathcal{L}^{\theta,\beta},\; \nabla_\theta \mathcal{L}^{\theta,\beta} \rangle = 0 \text{ at an optimum.}
This involves terms of the form \langle R_\beta g, R_\beta h \rangle.
34. Q-learning - Steps towards an algorithm
Adjoint operation:
    R_\beta^\dagger R_\beta = \frac{1}{2\beta} \bigl( R_\beta + R_\beta^\dagger \bigr),
so that
    \langle R_\beta g, R_\beta h \rangle = \frac{1}{2\beta} \bigl( \langle g, R_\beta h \rangle + \langle h, R_\beta g \rangle \bigr).
Adjoint realization: time-reversal,
    R_\beta^\dagger g\,(x, w) = \int_0^\infty e^{-\beta t} \, \mathsf{E}_{x,w}\bigl[ g(x^\circ(-t), \xi^\circ(-t)) \bigr] \, dt,
with the expectation conditional on x^\circ(0) = x, \xi^\circ(0) = w.
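A numerical sanity check of the adjoint identity, with inner products approximated by time averages along an ergodic trajectory; the system and test functions are illustrative, and agreement is only approximate because of discretization, transients, and the truncated horizon:

```python
import numpy as np

# Numerical check (sketch) of the adjoint identity
#   <R_b g, R_b h> = (1/(2 beta)) ( <g, R_b h> + <h, R_b g> ),
# with <.,.> approximated by time averages along an ergodic trajectory.
beta, dt, T = 0.5, 1e-3, 200.0
t = np.arange(0.0, T, dt)
xi = np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t)
x = np.zeros_like(t)
for k in range(len(t) - 1):
    x[k + 1] = x[k] + (-x[k] + xi[k]) * dt

def resolvent(sig):
    r = np.zeros_like(sig)
    for k in range(len(sig) - 2, -1, -1):
        r[k] = sig[k] * dt + np.exp(-beta * dt) * r[k + 1]
    return r

g, h = x ** 2, x * xi
Rg, Rh = resolvent(g), resolvent(h)
m = slice(0, int(0.8 * len(t)))          # drop the tail, where R is truncated
lhs = np.mean(Rg[m] * Rh[m])
rhs = (np.mean(g[m] * Rh[m]) + np.mean(h[m] * Rg[m])) / (2.0 * beta)
print(lhs, rhs)                          # approximately equal
```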
36. Q-learning - Steps towards an algorithm
After Step 5 - not quite adaptive control:
[Diagram: an ergodic input is applied to the complex system; measured behavior is compared with desired behavior, and inputs/outputs are used to learn.]
Based on observations, minimize the mean-square Bellman error.
37. Deterministic Stochastic Approximation
Gradient descent:
    \frac{d}{dt} \theta = -\varepsilon \, \langle \mathcal{L}^\theta,\; D^u \nabla_\theta \underline{H}^\theta - \gamma \nabla_\theta H^\theta \rangle
This converges* to the minimizer of the mean-square Bellman error.
(*Convergence observed in experiments! For a convex re-formulation of the problem, see Mehta & Meyn 2009.)
Here \frac{d}{dt} h(x(t)) \big|_{x = x(t),\, w = \xi(t)} = D^u h\,(x), so the derivative is available from observations.
38. Deterministic Stochastic Approximation
Stochastic approximation:
    \frac{d}{dt} \theta = -\varepsilon_t \, \mathcal{L}^\theta_t \, \zeta_t,
with the instantaneous Bellman error and its gradient term
    \mathcal{L}^\theta_t := \frac{d}{dt} \underline{H}^\theta(x^\circ(t)) + \gamma \bigl( c(x^\circ(t), u^\circ(t)) - H^\theta(x^\circ(t), u^\circ(t)) \bigr),
    \zeta_t := \frac{d}{dt} \nabla_\theta \underline{H}^\theta(x^\circ(t)) - \gamma \nabla_\theta H^\theta(x^\circ(t), u^\circ(t)).
As in the gradient-descent form, \frac{d}{dt} h(x(t)) evaluated along the trajectory replaces D^u h\,(x).
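A discrete-time sketch of this update for a parameterization linear in θ, H^θ = c + θ·ψ; the basis, gain schedule, and the use of H^θ along the applied input in place of its minimum over u are simplifying assumptions:

```python
import numpy as np

# Discrete-time sketch of the update above for a parameterization linear in
# theta, H^theta = c + theta . psi (so c - H^theta = -theta . psi). Psi, the
# gains, and the applied-input simplification are all assumptions.
def sa_step(theta, eps, psi_now, psi_next, c_now, c_next, gamma, dt):
    dpsi = (psi_next - psi_now) / dt          # d/dt psi along the trajectory
    dc = (c_next - c_now) / dt                # d/dt c along the trajectory
    L = dc + theta @ dpsi - gamma * (theta @ psi_now)   # Bellman error L_t
    zeta = dpsi - gamma * psi_now             # zeta_t = grad_theta L_t
    return theta - eps * L * zeta * dt        # theta <- theta - eps L zeta dt
```

The local-learning example below applies exactly this update along a simulated trajectory.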
39. Outline
Coarse models - what to do with them?
Q-learning for nonlinear state space models:
    Step 1: Recognize
    Step 2: Find a stabilizing policy
    Step 3: Optimality
    Step 4: Adjoint
    Step 5: Interpret
Example: Local approximation
Example: Decentralized control
43. Q-learning - Local Learning
Cubic nonlinearity:
    \frac{d}{dt} x = -x^3 + u, \qquad c(x, u) = \tfrac{1}{2} x^2 + \tfrac{1}{2} u^2
HJB:
    \min_u \bigl\{ \tfrac{1}{2} x^2 + \tfrac{1}{2} u^2 + (-x^3 + u) \, \nabla J^*(x) \bigr\} = \gamma J^*(x)
Basis:
    H^\theta(x, u) = c(x, u) + \theta^x x^2 + \theta^{xu} \frac{x u}{1 + 2x^2}
[Figures: estimated optimal policy on [-1, 1], for a low-amplitude versus a high-amplitude exploration input u(t) = A(\sin(t) + \sin(\pi t) + \sin(e t)).]
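Putting the pieces together, an end-to-end sketch of this example: simulate the cubic system under the sinusoidal exploration input and run the stochastic-approximation update on the two basis weights. Discount, amplitude, gains, and the same applied-input simplification as before are assumptions; this mirrors the construction, not the authors' exact experiment:

```python
import numpy as np

# End-to-end sketch of the local-learning example: dx/dt = -x^3 + u with
# c = x^2/2 + u^2/2, basis psi = (x^2, xu/(1+2x^2)), sinusoidal exploration.
# Discount, amplitude, gains, and differentiating H^theta along the applied
# input (rather than its minimum over u) are simplifying assumptions.
gamma, dt, T, A = 0.1, 1e-3, 200.0, 0.5
psi = lambda x, u: np.array([x * x, x * u / (1.0 + 2.0 * x * x)])
c = lambda x, u: 0.5 * x * x + 0.5 * u * u

theta, x = np.zeros(2), 0.0
for k in range(int(T / dt)):
    t = k * dt
    u = A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))
    x_next = x + (-x ** 3 + u) * dt
    dpsi = (psi(x_next, u) - psi(x, u)) / dt
    dc = (c(x_next, u) - c(x, u)) / dt
    L = dc + theta @ dpsi - gamma * (theta @ psi(x, u))   # Bellman error L_t
    zeta = dpsi - gamma * psi(x, u)                       # grad_theta L_t
    theta -= (1.0 / (1.0 + 1e-3 * k)) * L * zeta * dt     # vanishing-gain step
    x = x_next
print("learned theta (theta_x, theta_xu):", theta)
```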
44. Outline
Coarse models - what to do with them?
Q-learning for nonlinear state space models:
    Step 1: Recognize
    Step 2: Find a stabilizing policy
    Step 3: Optimality
    Step 4: Adjoint
    Step 5: Interpret
Example: Local approximation
Example: Decentralized control
45. Multi-agent model
Huang et al.: local optimization for global coordination.
    [M. Huang, P. E. Caines, and R. P. Malhame. Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Auto. Control, 52(9):1560-1571, 2007.]
47. Multi-agent model
Model: linear autonomous models, global cost objective.
HJB: individual state + global average.
Basis: consistent with the low-dimensional LQG model.
Results from a five-agent model:
[Figure: estimated state feedback gains (individual state and ensemble state) versus time. Gains for agent 4: Q-learning sample paths and the gains predicted from the ∞-agent limit.]
48. Outline
Coarse models - what to do with them?
Q-learning for nonlinear state space models:
    Step 1: Recognize
    Step 2: Find a stabilizing policy
    Step 3: Optimality
    Step 4: Adjoint
    Step 5: Interpret
Example: Local approximation
Example: Decentralized control
Conclusions
51. Conclusions
Coarse models give tremendous insight. They are also tremendously useful for design in approximate dynamic programming algorithms.
Q-learning is as fundamental as the Riccati equation - this should be included in our first-year graduate control courses.
Current research: algorithm analysis and improvements; applications in biology and economics; analysis of game-theoretic issues in coupled systems.
52. References
[1] D. H. Jacobson. PhD thesis, University of London, London, England, 1967.
[2] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. American Elsevier Pub. Co., New York, NY, 1970.
[3] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.
[4] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
[5] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447-469, 2000.
[6] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2):477-484, 2009.
[7] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. Submitted to the 48th IEEE Conference on Decision and Control, December 16-18, 2009.
[8] C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks. Preprint available at http://moallemi.com/ciamac/research-interests.php, 2008.