Q-Learning
  and Pontryagin's Minimum Principle

Sean Meyn
Department of Electrical and Computer Engineering
and the Coordinated Science Laboratory
    University of Illinois


   Joint work with Prashant Mehta
   NSF support: ECS-0523620
Outline

    Coarse models - what to do with them?

    Q-learning for nonlinear state space models
        Step 1: Recognize
        Step 2: Find a stabilizing policy
        Step 3: Optimality
        Step 4: Adjoint
        Step 5: Interpret

    Example: Local approximation

    Example: Decentralized control
Coarse Models: A rich collection of model reduction techniques

Many of today's participants have contributed to this research.
A biased list:

    Fluid models: Law of Large Numbers scaling,
                  most likely paths in large deviations
    Workload relaxation for networks
    Heavy-traffic limits

    Clustering: spectral graph theory,
                Markov spectral theory

    Singular perturbations
    Large population limits: interacting particle systems
Workload Relaxations

An example from CTCN:

[Figure 7.1: Demand-driven model with routing, scheduling, and re-work.]

Workload at the two stations evolves as a two-dimensional system.
Cost is projected onto these coordinates.

[Figure 7.2: Optimal policies for two instances of the network shown in Figure 7.1.
In each figure the optimal stochastic control region R_STO is compared with the optimal
region R* obtained for the two-dimensional fluid model, in the workload coordinates
(w1, w2). Annotation: optimal policy for the relaxation = hedging policy for the full
network.]
Workload Relaxations and Simulation

An example from CTCN:

[Figure: Two-station network with arrival streams α and service rates µ at Stations 1 and 2.]

Decision making at stations 1 & 2, e.g., setting safety-stock levels.

DP and simulations accelerated using the fluid value function for the workload relaxation:

[Left plot: Average cost vs. iteration, for VIA initialized with zero and with the fluid
value function.]
[Right plot: Simulated mean average cost vs. safety-stock levels, with and without a
control variate.]
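To make the acceleration idea concrete, here is a minimal value-iteration sketch (my own,
not from the talk) in which the initialization can be either zero or a surrogate such as a
fluid value function. It is a simplified discounted variant rather than the average-cost
VIA used in the plots, and the MDP arrays P, cost, and the initializer V0 are hypothetical
placeholders.

    import numpy as np

    def value_iteration(P, cost, gamma=0.95, V0=None, n_iter=300):
        """Discounted value iteration: V_{k+1}(x) = min_u [ c(x,u) + gamma * E[V_k(next)] ].

        P    : array of shape (n_u, n_x, n_x), one transition matrix per action
        cost : array of shape (n_x, n_u)
        V0   : optional initial guess (e.g., a fluid value function); zero if None
        """
        n_u, n_x, _ = P.shape
        V = np.zeros(n_x) if V0 is None else np.asarray(V0, dtype=float).copy()
        history = []
        for _ in range(n_iter):
            # Q[x, u] = c(x, u) + gamma * E[V(next state) | x, u]
            Q = cost + gamma * np.einsum('uxy,y->xu', P, V)
            V = Q.min(axis=1)
            history.append(V.mean())   # crude progress measure per iteration
        return V, history

    # Usage sketch (hypothetical MDP):
    # V_zero,  h0 = value_iteration(P, cost, V0=None)
    # V_fluid, h1 = value_iteration(P, cost, V0=fluid_value_function)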
What To Do With a Coarse Model?

Setting: we have qualitative or partial quantitative
insight regarding optimal control.

The network examples relied on specific network structure.
    What about other models?

An answer lies in a new formulation of Q-learning.
What is Q learning?

Watkins' 1992 formulation applied to finite state space MDPs:
        Q-Learning
        C. J. C. H. Watkins and P. Dayan
        Machine Learning, 1992

The idea is similar to Mayne & Jacobson's differential dynamic programming:
        Differential dynamic programming
        D. H. Jacobson and D. Q. Mayne
        American Elsevier Pub. Co., 1970
Deterministic formulation: nonlinear system on Euclidean space,

        (d/dt) x(t) = f(x(t), u(t)),      t ≥ 0

Infinite-horizon discounted cost criterion,

        J*(x) = inf ∫₀^∞ e^{−γs} c(x(s), u(s)) ds,      x(0) = x,

with c a non-negative cost function.

Differential generator: for any smooth function h,

        D_u h(x) := (∇h(x))ᵀ f(x, u)

HJB equation:

        min_u { c(x, u) + D_u J*(x) } = γ J*(x)

The Q-function of Q-learning is this function of two variables:
        H*(x, u) := c(x, u) + D_u J*(x)
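As an illustration of this definition (my sketch, not from the slides), the code below
forms H(x, u) = c(x, u) + ∇J(x)ᵀ f(x, u) from a candidate value function and minimizes it
over a grid of inputs. The dynamics and cost are borrowed from the scalar example later in
the deck; the candidate J is a made-up stand-in, not the optimal value function.

    import numpy as np

    # Hypothetical scalar example: dynamics, cost, and a candidate value function.
    f = lambda x, u: -x**3 + u                 # (d/dt) x = f(x, u)
    c = lambda x, u: 0.5 * x**2 + 0.5 * u**2   # running cost
    J = lambda x: 0.5 * x**2                   # candidate (not necessarily optimal) value function

    def grad_J(x, eps=1e-6):
        """Numerical derivative of J (scalar state)."""
        return (J(x + eps) - J(x - eps)) / (2 * eps)

    def Q(x, u):
        """Q-function: running cost plus derivative of J along the flow, D_u J(x)."""
        return c(x, u) + grad_J(x) * f(x, u)

    def greedy_input(x, u_grid):
        """Minimize the Q-function over a grid of candidate inputs."""
        values = np.array([Q(x, u) for u in u_grid])
        return u_grid[np.argmin(values)]

    u_grid = np.linspace(-2.0, 2.0, 401)
    print(greedy_input(0.5, u_grid))   # input minimizing H(x, u) at x = 0.5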
Q learning - Steps towards an algorithm

Sequence of five steps:

    Step 1: Recognize the fixed point equation for the Q-function
    Step 2: Find a stabilizing policy that is ergodic
    Step 3: Optimality criterion - minimize the Bellman error
    Step 4: Adjoint operation
    Step 5: Interpret and simulate!

Goal: seek the best approximation within a parameterized class.
Q learning - Steps towards an algorithm

Step 1: Recognize the fixed point equation for the Q-function

    Q-function:      H*(x, u) = c(x, u) + D_u J*(x)

    Its minimum:     H*(x) := min_{u∈U} H*(x, u) = γ J*(x)

    Fixed point equation:

        D_u H*(x) = −γ ( c(x, u) − H*(x, u) )

    Key observation for learning: for any input-output pair,

        D_u H*(x) = (d/dt) H*(x(t))   evaluated at x = x(t), u = u(t)
Q learning - LQR example

Linear model and quadratic cost,

    Cost:           c(x, u) = ½ xᵀQx + ½ uᵀRu

    Q-function:     H*(x, u) = c(x, u) + (Ax + Bu)ᵀ P* x = c(x, u) + D_u J*(x),
                    where J*(x) = ½ xᵀP*x solves the Riccati equation

    Q-function approximation:

        H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ^x_i xᵀE^i x + Σ_{j=1}^{d_xu} θ^xu_j xᵀF^j u

    Minimum:        H^θ(x) = ½ xᵀ( Q + E^θ − (F^θ)ᵀ R^{−1} F^θ ) x

    Minimizer:      u^θ(x) = φ^θ(x) = −R^{−1} F^θ x
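To make the parameterization concrete, here is a small numpy sketch (my own, not from the
talk) that assembles E^θ and F^θ from basis matrices and returns the induced feedback gain.
The basis matrices, dimensions, and parameter values are made-up placeholders.

    import numpy as np

    def q_approx_terms(theta_x, theta_xu, E_basis, F_basis):
        """Assemble E^theta = sum_i theta_x[i] E_i and F^theta = sum_j theta_xu[j] F_j."""
        E_theta = sum(t * E for t, E in zip(theta_x, E_basis))
        F_theta = sum(t * F for t, F in zip(theta_xu, F_basis))
        return E_theta, F_theta

    def policy_gain(R, F_theta):
        """Minimizer of H^theta(x, u) over u is linear feedback: u = -R^{-1} F^theta x."""
        return -np.linalg.solve(R, F_theta)

    # Hypothetical 2-state, 1-input example.
    Q = np.eye(2); R = np.eye(1)
    E_basis = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]        # basis for the quadratic-in-x term
    F_basis = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]  # basis for the cross term (1 x 2)
    theta_x, theta_xu = [0.3, 0.1], [0.5, -0.2]

    E_theta, F_theta = q_approx_terms(theta_x, theta_xu, E_basis, F_basis)
    K = policy_gain(R, F_theta)                                  # u^theta(x) = K @ x
    G = Q + E_theta - F_theta.T @ np.linalg.solve(R, F_theta)    # H^theta(x) = 0.5 * x' G x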
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Assume the LLN holds for continuous functions F : X × U → R.
As T → ∞,

    (1/T) ∫₀ᵀ F(x(t), u(t)) dt  →  ∫_{X×U} F(x, u) ϖ(dx, du),

where ϖ denotes the steady-state distribution of the state-input pair.
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Suppose for example the input is scalar, and the system is stable
[bounded-input/bounded-state]. Then one can try a linear combination of sinusoids,

    u(t) = A ( sin(t) + sin(πt) + sin(et) )
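A minimal simulation sketch (mine, not from the slides) of this kind of exploratory input:
Euler integration of a stable scalar system driven by the three-sinusoid signal, with a
running time-average to illustrate the ergodic-average assumption. The system
dx/dt = -x^3 + u is borrowed from the local-learning example later in the deck; the
amplitude, step size, and horizon are arbitrary.

    import numpy as np

    A, dt, T = 0.5, 1e-3, 200.0                  # amplitude, Euler step, horizon (arbitrary)
    n = int(T / dt)

    def u_explore(t, A=A):
        """Quasi-periodic exploration input: a linear combination of sinusoids."""
        return A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

    x = 0.0
    running_sum = 0.0
    for k in range(n):
        t = k * dt
        u = u_explore(t)
        running_sum += (0.5 * x**2 + 0.5 * u**2) * dt   # accumulate F(x,u) = c(x,u) along the path
        x += dt * (-x**3 + u)                           # Euler step of dx/dt = -x^3 + u

    ergodic_average = running_sum / T                   # approximates the steady-state mean for large T
    print(ergodic_average)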
Q learning - Steps towards an algorithm

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error:

    E(θ) := ½ ‖L^θ‖²,     L^θ(x, u) := D_u H^θ(x) + γ ( c(x, u) − H^θ(x, u) )

First order condition for optimality:

    ⟨ L^θ, D_u ψ^θ_i − γ ψ_i ⟩ = 0,     with ψ^θ_i(x) := ψ_i(x, φ^θ(x)),   1 ≤ i ≤ d
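A rough sketch (my own, not the authors' code) of estimating the mean-square Bellman error
from trajectory samples, using a finite difference for the time derivative of the minimized
Q-function; H, H_bar, c, and the sampled arrays are hypothetical placeholders.

    import numpy as np

    def mean_square_bellman_error(xs, us, dt, H, H_bar, c, gamma):
        """Estimate E(theta) = 0.5 * average of L^theta(t)^2 along a sampled trajectory.

        xs, us : arrays of sampled states x(t_k) and inputs u(t_k), with t_k = k*dt
        H      : H(x, u)  -> Q-function approximation H^theta(x, u)
        H_bar  : H_bar(x) -> its minimum over u
        """
        Hb = np.array([H_bar(x) for x in xs])
        dHb_dt = np.gradient(Hb, dt)                 # finite-difference d/dt H_bar(x(t))
        L = np.array([dHb_dt[k] + gamma * (c(xs[k], us[k]) - H(xs[k], us[k]))
                      for k in range(len(xs))])
        return 0.5 * np.mean(L**2)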
Q learning - Convex Reformulation

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error, with the minimum of the
Q-function approximation replaced by a separate function G^θ, constrained to lie below it:

    G^θ(x) ≤ H^θ(x, u),     for all x, u
Q learning - LQR example

Linear model and quadratic cost,

    Cost:           c(x, u) = ½ xᵀQx + ½ uᵀRu

    Q-function:     H*(x, u) = c(x, u) + (Ax + Bu)ᵀ P* x,
                    where P* solves the Riccati equation

    Q-function approximation:

        H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ^x_i xᵀE^i x + Σ_{j=1}^{d_xu} θ^xu_j xᵀF^j u

    Approximation to the minimum:    G^θ(x) = ½ xᵀ G^θ x

    Minimizer:      u^θ(x) = φ^θ(x) = −R^{−1} F^θ x
Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables, g : Rⁿ × Rʷ → R, the resolvent gives a new function,

    R_β g (x, w) = ∫₀^∞ e^{−βt} g(x(t), ξ(t)) dt,     β > 0,   with x(0) = x, ξ(0) = w,

where the system is controlled using the nominal policy

    u(t) = φ(x(t), ξ(t)),     t ≥ 0      (stabilizing & ergodic)
Resolvent equation:     R_β ( D h ) = β R_β h − h,
                        with D the derivative along the closed-loop trajectories

Smoothed Bellman error:

    L^{θ,β} := R_β L^θ = [ β R_β − I ] H^θ + γ R_β ( c − H^θ )
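A small numerical sketch (mine, not from the slides) of the smoothing operation:
approximating R_β g at every sample time of a recorded trajectory by a discretized
discounted integral of the trajectory's future, computed with a backward recursion. The
trajectory samples are hypothetical.

    import numpy as np

    def resolvent_along_trajectory(g_samples, dt, beta):
        """Approximate (R_beta g)(x(t_k), xi(t_k)) = integral_0^inf e^{-beta*t} g(t_k + t) dt
        for every sample time t_k, via a backward recursion over the recorded samples."""
        n = len(g_samples)
        Rg = np.zeros(n)
        decay = np.exp(-beta * dt)
        acc = 0.0
        for k in range(n - 1, -1, -1):        # sweep backward in time
            acc = g_samples[k] * dt + decay * acc
            Rg[k] = acc
        return Rg

    # Usage sketch: g_samples[k] = g(x(k*dt), xi(k*dt)) recorded under the nominal policy.
    # Rg = resolvent_along_trajectory(g_samples, dt=1e-2, beta=0.5)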
Q learning - Steps towards an algorithm

Smoothed Bellman error:     E_β(θ) := ½ ‖ L^{θ,β} ‖²

Its gradient,

    ∇_θ E_β(θ) = ⟨ L^{θ,β}, ∇_θ L^{θ,β} ⟩,

is zero at an optimum, and involves terms of the form ⟨ R_β g, R_β h ⟩.

Adjoint operation:

    R_β† R_β = (1/2β) ( R_β + R_β† )

    ⟨ R_β g, R_β h ⟩ = (1/2β) ( ⟨ g, R_β† h ⟩ + ⟨ h, R_β† g ⟩ )

Adjoint realization: time-reversal,

    R_β† g (x, w) = ∫₀^∞ e^{−βt} E_{x,w}[ g(x°(−t), ξ°(−t)) ] dt,

with the expectation conditional on x°(0) = x, ξ°(0) = w.
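A short justification sketch (my addition, not on the slides) of the resolvent identity
used above, under the assumption that the generator D of the stationary joint
state-exploration process is skew-adjoint in L²(ϖ); this holds for a deterministic ergodic
flow with invariant distribution ϖ, since ∫ D(gh) dϖ = 0 and D(gh) = g Dh + h Dg.

    % Assumption: D + D^\dagger = 0 on L^2(\varpi).
    \begin{aligned}
    R_\beta^\dagger R_\beta
      &= (\beta I - D^\dagger)^{-1}(\beta I - D)^{-1} \\
      &= \tfrac{1}{2\beta}\,(\beta I - D^\dagger)^{-1}
           \bigl[(\beta I - D) + (\beta I - D^\dagger)\bigr](\beta I - D)^{-1}
         \qquad \text{(using } D + D^\dagger = 0\text{, so } 2\beta I = (\beta I - D) + (\beta I - D^\dagger)\text{)} \\
      &= \tfrac{1}{2\beta}\bigl(R_\beta^\dagger + R_\beta\bigr).
    \end{aligned}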
Q learning - Steps towards an algorithm

After Step 5: Not quite adaptive control.

[Block diagram: an ergodic input is applied; inputs drive the complex system, which
produces outputs; measured behavior is compared with desired behavior, and the comparison
is used to learn.]
Deterministic Stochastic Approximation

Based on observations, minimize the mean-square Bellman error by gradient descent:

    (d/dt) θ = −ε ⟨ L^θ, D_u ∇_θ H^θ(x) − γ ∇_θ H^θ(x, u) ⟩

This converges* to the minimizer of the mean-square Bellman error.

    * Convergence observed in experiments! For a convex re-formulation of the problem,
      see Mehta & Meyn 2009.

Recall that along the trajectory, (d/dt) h(x(t)) = D_u h(x), evaluated at x = x(t), u = u(t).
Deterministic Stochastic Approximation

Stochastic Approximation:

    (d/dt) θ = −ε_t L^θ_t ( (d/dt) ∇_θ H^θ(x°(t)) − γ ∇_θ H^θ(x°(t), u°(t)) )

    L^θ_t := (d/dt) H^θ(x°(t)) + γ ( c(x°(t), u°(t)) − H^θ(x°(t), u°(t)) )

This is the trajectory-based counterpart of the gradient descent above: the mean-square
Bellman error is replaced by its observed, instantaneous value.
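A rough discrete-time rendering (my own sketch, not the authors' code) of one parameter
update, with the time derivatives replaced by finite differences. Here H, H_bar, grad_H,
grad_H_bar, and c are hypothetical callables for the parameterized Q-function, its minimum,
their θ-gradients, and the cost.

    import numpy as np

    def sa_update(theta, x_prev, x, u, dt, step, gamma, H, H_bar, grad_H, grad_H_bar, c):
        """One stochastic-approximation step along an observed trajectory.

        Finite differences stand in for d/dt H_bar(x(t)) and d/dt grad_theta H_bar(x(t))."""
        dH_bar = (H_bar(theta, x) - H_bar(theta, x_prev)) / dt
        dgrad_H_bar = (grad_H_bar(theta, x) - grad_H_bar(theta, x_prev)) / dt
        L_t = dH_bar + gamma * (c(x, u) - H(theta, x, u))      # instantaneous Bellman error
        return theta - step * L_t * (dgrad_H_bar - gamma * grad_H(theta, x, u))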
Desired behavior

                                                               Compare                                                           Outputs


     Q learning - Local Learning                               and learn                  Inputs


                                                                                                           Complex system

                                                                                       Measured behavior




     Cubic nonlinearity:         d
                                 dt x    = −x3 + u,        c(x, u) = 1 x2 + 1 u2
                                                                     2      2


HJB:                  min ( ½ x² + ½ u² + (−x³ + u) ∇J∗(x) ) = γ J∗(x)
                       u

Basis:                H^θ(x, u) = c(x, u) + θ_x x² + θ_xu · x/(1 + 2x²) · u

[Two panels over x ∈ [−1, 1], each titled "Optimal policy", with color scales: results for the low- and high-amplitude exploring inputs]

            Low amplitude input                    High amplitude input

                  u(t) = A( sin(t) + sin(πt) + sin(e t) )
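As a concrete illustration of the whole procedure on this example, here is a hedged sketch (my own reconstruction, not the authors' code): simulate dx/dt = −x³ + u under the exploring input above, form H^θ(x, u) = c(x, u) + θ_x x² + θ_xu · x/(1 + 2x²) · u, and descend the empirical mean-square Bellman error. The discount γ, the amplitude A, the horizon, the Euler step, and the diagonal scaling of the gradient step are all illustrative choices.

# Hedged sketch (my reconstruction, not the authors' code): Q-learning for
#     dx/dt = -x^3 + u,   c(x,u) = x^2/2 + u^2/2,
# with the two-parameter approximation from the slide,
#     H_theta(x,u) = c(x,u) + th_x*x^2 + th_xu*psi(x)*u,   psi(x) = x/(1 + 2x^2).
# The parameters are fit by descending the empirical mean-square Bellman error
# along a trajectory driven by u(t) = A(sin t + sin(pi t) + sin(e t)).
import numpy as np

gamma, A, dt, T = 0.1, 1.0, 1e-3, 100.0      # illustrative values
t = np.arange(0.0, T, dt)
u = A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

x = np.zeros_like(t)                          # Euler simulation of the driven system
for k in range(len(t) - 1):
    x[k + 1] = x[k] + dt * (-x[k] ** 3 + u[k])

psi = x / (1.0 + 2.0 * x ** 2)
c = 0.5 * x ** 2 + 0.5 * u ** 2

def error_and_grad(theta):
    th_x, th_xu = theta
    # Minimizing H_theta over u gives u = -th_xu*psi(x), hence
    # Hbar(x) = x^2/2 + th_x*x^2 - th_xu^2*psi(x)^2/2.
    Hbar = 0.5 * x ** 2 + th_x * x ** 2 - 0.5 * th_xu ** 2 * psi ** 2
    H = c + th_x * x ** 2 + th_xu * psi * u
    # Bellman error along the trajectory:  L = d/dt Hbar(x(t)) + gamma*(c - H).
    L = np.diff(Hbar) / dt + gamma * (c[:-1] - H[:-1])
    dL_dthx = np.diff(x ** 2) / dt - gamma * x[:-1] ** 2
    dL_dthxu = -th_xu * np.diff(psi ** 2) / dt - gamma * psi[:-1] * u[:-1]
    grad = np.array([np.mean(L * dL_dthx), np.mean(L * dL_dthxu)])
    scale = np.array([np.mean(dL_dthx ** 2), np.mean(dL_dthxu ** 2)]) + 1e-9
    return 0.5 * np.mean(L ** 2), grad / scale      # diagonally scaled gradient

theta = np.zeros(2)
for _ in range(500):                          # gradient descent on the empirical error
    _, g = error_and_grad(theta)
    theta -= 0.2 * g

err, _ = error_and_grad(theta)
print("mean-square Bellman error: %.3g" % err)
print("learned policy:  u(x) = -%.3f * x/(1 + 2x^2)" % theta[1])

Re-running the sketch with a larger amplitude A plays the role of the high-amplitude experiment in the figure above.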
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control
Multi-agent model

M. Huang, P. E. Caines, and R. P. Malhamé. Large-population cost-coupled LQG problems with
nonuniform agents: Individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans.
Auto. Control, 52(9):1560–1571, 2007.

Huang et al.: Local optimization for global coordination
Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low dimensional LQG model

Results from five agent model:
Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low dimensional LQG model

Results from the five-agent model:

[Plot: estimated state feedback gains vs. time, with one trace for the individual-state gain and one for the ensemble-state gain]
Gains for agent 4: Q-learning sample paths and the gains predicted from the ∞-agent limit
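The slide specifies only that the basis is consistent with the low-dimensional LQG model; the sketch below is an assumption about what such a quadratic parameterization could look like for a single agent, built from the individual state x_i and the ensemble average xbar. The particular terms, the control weight r, and the numerical values are hypothetical, chosen only to show how one gain on the individual state and one on the ensemble state (the two traces in the plot above) emerge from minimizing H^θ over u_i.

# Hedged sketch (hypothetical structure, not the authors' construction): a quadratic
# Q-function approximation for agent i using its own state x_i and the ensemble
# average xbar, assuming the running cost contains a control penalty 0.5*r*u_i^2:
#   H_theta(x_i, xbar, u_i) = c(x_i, xbar, u_i)
#                             + th1*x_i**2 + th2*x_i*xbar + th3*xbar**2
#                             + th4*x_i*u_i + th5*xbar*u_i
import numpy as np

r = 1.0                                    # assumed control weight in the cost

def policy_gains(theta):
    """Gains of the minimizer u_i = -(k_ind*x_i + k_ens*xbar) of H_theta over u_i."""
    th4, th5 = theta[3], theta[4]
    # Stationarity in u_i:  r*u_i + th4*x_i + th5*xbar = 0.
    return th4 / r, th5 / r

# Illustrative parameter values only; in the algorithm these would be learned.
k_ind, k_ens = policy_gains(np.array([0.9, 0.2, 0.1, 0.8, 0.3]))
print("u_i = -(%.2f * x_i + %.2f * xbar)" % (k_ind, k_ens))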
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control


                                                          ... Conclusions
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
in the design of approximate dynamic programming algorithms
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
in the design of approximate dynamic programming algorithms

Q-learning is as fundamental as the Riccati equation - this
should be included in our first-year graduate control courses
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
in the design of approximate dynamic programming algorithms

Q-learning is as fundamental as the Riccati equation - this
should be included in our first-year graduate control courses

Current research: Algorithm analysis and improvements
                  Applications in biology and economics
                  Analysis of game-theoretic issues
                                 in coupled systems
References


      PhD thesis, University of London, London, England, 1967.

      D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. American Elsevier Pub. Co., New York, NY, 1970.

      C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.

      C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.


      V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.




      D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2):477–484, 2009.


      P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. Submitted to the 48th IEEE Conference on Decision and Control, December 16-18, 2009.

[9]   C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks.
      Preprint available at http://moallemi.com/ciamac/research-interests.php, 2008.
