Q-Learning
  and Pontryagin's Minimum Principle

Sean Meyn
Department of Electrical and Computer Engineering
and the Coordinated Science Laboratory
    University of Illinois


   Joint work with Prashant Mehta
   NSF support: ECS-0523620
Outline

    Coarse models - what to do with them?

    Q-learning for nonlinear state space models
        Step 1: Recognize
        Step 2: Find a stabilizing policy
        Step 3: Optimality
        Step 4: Adjoint
        Step 5: Interpret

    Example: Local approximation

    Example: Decentralized control
Coarse Models: A rich collection of model reduction techniques

Many of today's participants have contributed to this research.
A biased list:

    Fluid models: Law of Large Numbers scaling,
                  most likely paths in large deviations
    Workload relaxation for networks
    Heavy-traffic limits

    Clustering: spectral graph theory,
                Markov spectral theory

    Singular perturbations
    Large population limits: interacting particle systems
Workload Relaxations

An example from CTCN:

[Figure 7.1: Demand-driven model with routing, scheduling, and re-work.]

Workload at the two stations evolves as a two-dimensional system.
Cost is projected onto these coordinates.

[Figure 7.2: Optimal policies for two instances of the network shown in Figure 7.1.
In each figure the optimal stochastic control region R_STO is compared with the optimal
region R* obtained for the two-dimensional fluid model, in the workload coordinates
(w1, w2). Annotation: optimal policy for the relaxation = hedging policy for the full
network.]
Workload Relaxations and Simulation

An example from CTCN:

[Figure: Two-station network with arrival streams α and service rates µ at Stations 1 and 2.]

Decision making at stations 1 & 2, e.g., setting safety-stock levels.

DP and simulations accelerated using the fluid value function for the workload relaxation:

[Left plot: Average cost vs. iteration, for VIA initialized with zero and with the fluid
value function.]
[Right plot: Simulated mean average cost vs. safety-stock levels, with and without a
control variate.]
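To make the acceleration idea concrete, here is a minimal value-iteration sketch (my own,
not from the talk) in which the initialization can be either zero or a surrogate such as a
fluid value function. It is a simplified discounted variant rather than the average-cost
VIA used in the plots, and the MDP arrays P, cost, and the initializer V0 are hypothetical
placeholders.

    import numpy as np

    def value_iteration(P, cost, gamma=0.95, V0=None, n_iter=300):
        """Discounted value iteration: V_{k+1}(x) = min_u [ c(x,u) + gamma * E[V_k(next)] ].

        P    : array of shape (n_u, n_x, n_x), one transition matrix per action
        cost : array of shape (n_x, n_u)
        V0   : optional initial guess (e.g., a fluid value function); zero if None
        """
        n_u, n_x, _ = P.shape
        V = np.zeros(n_x) if V0 is None else np.asarray(V0, dtype=float).copy()
        history = []
        for _ in range(n_iter):
            # Q[x, u] = c(x, u) + gamma * E[V(next state) | x, u]
            Q = cost + gamma * np.einsum('uxy,y->xu', P, V)
            V = Q.min(axis=1)
            history.append(V.mean())   # crude progress measure per iteration
        return V, history

    # Usage sketch (hypothetical MDP):
    # V_zero,  h0 = value_iteration(P, cost, V0=None)
    # V_fluid, h1 = value_iteration(P, cost, V0=fluid_value_function)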
What To Do With a Coarse Model?

Setting: we have qualitative or partial quantitative
insight regarding optimal control.

The network examples relied on specific network structure.
    What about other models?

An answer lies in a new formulation of Q-learning.
What is Q learning?

Watkins' 1992 formulation applied to finite state space MDPs:
        Q-Learning
        C. J. C. H. Watkins and P. Dayan
        Machine Learning, 1992

The idea is similar to Mayne & Jacobson's differential dynamic programming:
        Differential dynamic programming
        D. H. Jacobson and D. Q. Mayne
        American Elsevier Pub. Co., 1970
Deterministic formulation: nonlinear system on Euclidean space,

        (d/dt) x(t) = f(x(t), u(t)),      t ≥ 0

Infinite-horizon discounted cost criterion,

        J*(x) = inf ∫₀^∞ e^{−γs} c(x(s), u(s)) ds,      x(0) = x,

with c a non-negative cost function.

Differential generator: for any smooth function h,

        D_u h(x) := (∇h(x))ᵀ f(x, u)

HJB equation:

        min_u { c(x, u) + D_u J*(x) } = γ J*(x)

The Q-function of Q-learning is this function of two variables:
        H*(x, u) := c(x, u) + D_u J*(x)
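As an illustration of this definition (my sketch, not from the slides), the code below
forms H(x, u) = c(x, u) + ∇J(x)ᵀ f(x, u) from a candidate value function and minimizes it
over a grid of inputs. The dynamics and cost are borrowed from the scalar example later in
the deck; the candidate J is a made-up stand-in, not the optimal value function.

    import numpy as np

    # Hypothetical scalar example: dynamics, cost, and a candidate value function.
    f = lambda x, u: -x**3 + u                 # (d/dt) x = f(x, u)
    c = lambda x, u: 0.5 * x**2 + 0.5 * u**2   # running cost
    J = lambda x: 0.5 * x**2                   # candidate (not necessarily optimal) value function

    def grad_J(x, eps=1e-6):
        """Numerical derivative of J (scalar state)."""
        return (J(x + eps) - J(x - eps)) / (2 * eps)

    def Q(x, u):
        """Q-function: running cost plus derivative of J along the flow, D_u J(x)."""
        return c(x, u) + grad_J(x) * f(x, u)

    def greedy_input(x, u_grid):
        """Minimize the Q-function over a grid of candidate inputs."""
        values = np.array([Q(x, u) for u in u_grid])
        return u_grid[np.argmin(values)]

    u_grid = np.linspace(-2.0, 2.0, 401)
    print(greedy_input(0.5, u_grid))   # input minimizing H(x, u) at x = 0.5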
Q learning - Steps towards an algorithm

Sequence of five steps:

    Step 1: Recognize the fixed point equation for the Q-function
    Step 2: Find a stabilizing policy that is ergodic
    Step 3: Optimality criterion - minimize the Bellman error
    Step 4: Adjoint operation
    Step 5: Interpret and simulate!

Goal: seek the best approximation within a parameterized class.
Q learning - Steps towards an algorithm

Step 1: Recognize the fixed point equation for the Q-function

    Q-function:      H*(x, u) = c(x, u) + D_u J*(x)

    Its minimum:     H*(x) := min_{u∈U} H*(x, u) = γ J*(x)

    Fixed point equation:

        D_u H*(x) = −γ ( c(x, u) − H*(x, u) )

    Key observation for learning: for any input-output pair,

        D_u H*(x) = (d/dt) H*(x(t))   evaluated at x = x(t), u = u(t)
Q learning - LQR example

Linear model and quadratic cost,

    Cost:           c(x, u) = ½ xᵀQx + ½ uᵀRu

    Q-function:     H*(x, u) = c(x, u) + (Ax + Bu)ᵀ P* x = c(x, u) + D_u J*(x),
                    where J*(x) = ½ xᵀP*x solves the Riccati equation

    Q-function approximation:

        H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ^x_i xᵀE^i x + Σ_{j=1}^{d_xu} θ^xu_j xᵀF^j u

    Minimum:        H^θ(x) = ½ xᵀ( Q + E^θ − (F^θ)ᵀ R^{−1} F^θ ) x

    Minimizer:      u^θ(x) = φ^θ(x) = −R^{−1} F^θ x
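To make the parameterization concrete, here is a small numpy sketch (my own, not from the
talk) that assembles E^θ and F^θ from basis matrices and returns the induced feedback gain.
The basis matrices, dimensions, and parameter values are made-up placeholders.

    import numpy as np

    def q_approx_terms(theta_x, theta_xu, E_basis, F_basis):
        """Assemble E^theta = sum_i theta_x[i] E_i and F^theta = sum_j theta_xu[j] F_j."""
        E_theta = sum(t * E for t, E in zip(theta_x, E_basis))
        F_theta = sum(t * F for t, F in zip(theta_xu, F_basis))
        return E_theta, F_theta

    def policy_gain(R, F_theta):
        """Minimizer of H^theta(x, u) over u is linear feedback: u = -R^{-1} F^theta x."""
        return -np.linalg.solve(R, F_theta)

    # Hypothetical 2-state, 1-input example.
    Q = np.eye(2); R = np.eye(1)
    E_basis = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]        # basis for the quadratic-in-x term
    F_basis = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]  # basis for the cross term (1 x 2)
    theta_x, theta_xu = [0.3, 0.1], [0.5, -0.2]

    E_theta, F_theta = q_approx_terms(theta_x, theta_xu, E_basis, F_basis)
    K = policy_gain(R, F_theta)                                  # u^theta(x) = K @ x
    G = Q + E_theta - F_theta.T @ np.linalg.solve(R, F_theta)    # H^theta(x) = 0.5 * x' G x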
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Assume the LLN holds for continuous functions F : X × U → R.
As T → ∞,

    (1/T) ∫₀ᵀ F(x(t), u(t)) dt  →  ∫_{X×U} F(x, u) ϖ(dx, du),

where ϖ denotes the steady-state distribution of the state-input pair.
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Suppose for example the input is scalar, and the system is stable
[bounded-input/bounded-state]. Then one can try a linear combination of sinusoids,

    u(t) = A ( sin(t) + sin(πt) + sin(et) )
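A minimal simulation sketch (mine, not from the slides) of this kind of exploratory input:
Euler integration of a stable scalar system driven by the three-sinusoid signal, with a
running time-average to illustrate the ergodic-average assumption. The system
dx/dt = -x^3 + u is borrowed from the local-learning example later in the deck; the
amplitude, step size, and horizon are arbitrary.

    import numpy as np

    A, dt, T = 0.5, 1e-3, 200.0                  # amplitude, Euler step, horizon (arbitrary)
    n = int(T / dt)

    def u_explore(t, A=A):
        """Quasi-periodic exploration input: a linear combination of sinusoids."""
        return A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

    x = 0.0
    running_sum = 0.0
    for k in range(n):
        t = k * dt
        u = u_explore(t)
        running_sum += (0.5 * x**2 + 0.5 * u**2) * dt   # accumulate F(x,u) = c(x,u) along the path
        x += dt * (-x**3 + u)                           # Euler step of dx/dt = -x^3 + u

    ergodic_average = running_sum / T                   # approximates the steady-state mean for large T
    print(ergodic_average)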
Q learning - Steps towards an algorithm

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error:

    E(θ) := ½ ‖L^θ‖²,     L^θ(x, u) := D_u H^θ(x) + γ ( c(x, u) − H^θ(x, u) )

First order condition for optimality:

    ⟨ L^θ, D_u ψ^θ_i − γ ψ_i ⟩ = 0,     with ψ^θ_i(x) := ψ_i(x, φ^θ(x)),   1 ≤ i ≤ d
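A rough sketch (my own, not the authors' code) of estimating the mean-square Bellman error
from trajectory samples, using a finite difference for the time derivative of the minimized
Q-function; H, H_bar, c, and the sampled arrays are hypothetical placeholders.

    import numpy as np

    def mean_square_bellman_error(xs, us, dt, H, H_bar, c, gamma):
        """Estimate E(theta) = 0.5 * average of L^theta(t)^2 along a sampled trajectory.

        xs, us : arrays of sampled states x(t_k) and inputs u(t_k), with t_k = k*dt
        H      : H(x, u)  -> Q-function approximation H^theta(x, u)
        H_bar  : H_bar(x) -> its minimum over u
        """
        Hb = np.array([H_bar(x) for x in xs])
        dHb_dt = np.gradient(Hb, dt)                 # finite-difference d/dt H_bar(x(t))
        L = np.array([dHb_dt[k] + gamma * (c(xs[k], us[k]) - H(xs[k], us[k]))
                      for k in range(len(xs))])
        return 0.5 * np.mean(L**2)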
Q learning - Convex Reformulation

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error, with the minimum of the
Q-function approximation replaced by a separate function G^θ, constrained to lie below it:

    G^θ(x) ≤ H^θ(x, u),     for all x, u
Q learning - LQR example

Linear model and quadratic cost,

    Cost:           c(x, u) = ½ xᵀQx + ½ uᵀRu

    Q-function:     H*(x, u) = c(x, u) + (Ax + Bu)ᵀ P* x,
                    where P* solves the Riccati equation

    Q-function approximation:

        H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ^x_i xᵀE^i x + Σ_{j=1}^{d_xu} θ^xu_j xᵀF^j u

    Approximation to the minimum:    G^θ(x) = ½ xᵀ G^θ x

    Minimizer:      u^θ(x) = φ^θ(x) = −R^{−1} F^θ x
Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables, g : Rⁿ × Rʷ → R, the resolvent gives a new function,

    R_β g (x, w) = ∫₀^∞ e^{−βt} g(x(t), ξ(t)) dt,     β > 0,   with x(0) = x, ξ(0) = w,

where the system is controlled using the nominal policy

    u(t) = φ(x(t), ξ(t)),     t ≥ 0      (stabilizing & ergodic)
Resolvent equation:     R_β ( D h ) = β R_β h − h,
                        with D the derivative along the closed-loop trajectories

Smoothed Bellman error:

    L^{θ,β} := R_β L^θ = [ β R_β − I ] H^θ + γ R_β ( c − H^θ )
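A small numerical sketch (mine, not from the slides) of the smoothing operation:
approximating R_β g at every sample time of a recorded trajectory by a discretized
discounted integral of the trajectory's future, computed with a backward recursion. The
trajectory samples are hypothetical.

    import numpy as np

    def resolvent_along_trajectory(g_samples, dt, beta):
        """Approximate (R_beta g)(x(t_k), xi(t_k)) = integral_0^inf e^{-beta*t} g(t_k + t) dt
        for every sample time t_k, via a backward recursion over the recorded samples."""
        n = len(g_samples)
        Rg = np.zeros(n)
        decay = np.exp(-beta * dt)
        acc = 0.0
        for k in range(n - 1, -1, -1):        # sweep backward in time
            acc = g_samples[k] * dt + decay * acc
            Rg[k] = acc
        return Rg

    # Usage sketch: g_samples[k] = g(x(k*dt), xi(k*dt)) recorded under the nominal policy.
    # Rg = resolvent_along_trajectory(g_samples, dt=1e-2, beta=0.5)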
Q learning - Steps towards an algorithm

Smoothed Bellman error:     E_β(θ) := ½ ‖ L^{θ,β} ‖²

Its gradient,

    ∇_θ E_β(θ) = ⟨ L^{θ,β}, ∇_θ L^{θ,β} ⟩,

is zero at an optimum, and involves terms of the form ⟨ R_β g, R_β h ⟩.

Adjoint operation:

    R_β† R_β = (1/2β) ( R_β + R_β† )

    ⟨ R_β g, R_β h ⟩ = (1/2β) ( ⟨ g, R_β† h ⟩ + ⟨ h, R_β† g ⟩ )

Adjoint realization: time-reversal,

    R_β† g (x, w) = ∫₀^∞ e^{−βt} E_{x,w}[ g(x°(−t), ξ°(−t)) ] dt,

with the expectation conditional on x°(0) = x, ξ°(0) = w.
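A short justification sketch (my addition, not on the slides) of the resolvent identity
used above, under the assumption that the generator D of the stationary joint
state-exploration process is skew-adjoint in L²(ϖ); this holds for a deterministic ergodic
flow with invariant distribution ϖ, since ∫ D(gh) dϖ = 0 and D(gh) = g Dh + h Dg.

    % Assumption: D + D^\dagger = 0 on L^2(\varpi).
    \begin{aligned}
    R_\beta^\dagger R_\beta
      &= (\beta I - D^\dagger)^{-1}(\beta I - D)^{-1} \\
      &= \tfrac{1}{2\beta}\,(\beta I - D^\dagger)^{-1}
           \bigl[(\beta I - D) + (\beta I - D^\dagger)\bigr](\beta I - D)^{-1}
         \qquad \text{(using } D + D^\dagger = 0\text{, so } 2\beta I = (\beta I - D) + (\beta I - D^\dagger)\text{)} \\
      &= \tfrac{1}{2\beta}\bigl(R_\beta^\dagger + R_\beta\bigr).
    \end{aligned}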
Q learning - Steps towards an algorithm

After Step 5: Not quite adaptive control.

[Block diagram: an ergodic input is applied; inputs drive the complex system, which
produces outputs; measured behavior is compared with desired behavior, and the comparison
is used to learn.]
Deterministic Stochastic Approximation

Based on observations, minimize the mean-square Bellman error by gradient descent:

    (d/dt) θ = −ε ⟨ L^θ, D_u ∇_θ H^θ(x) − γ ∇_θ H^θ(x, u) ⟩

This converges* to the minimizer of the mean-square Bellman error.

    * Convergence observed in experiments! For a convex re-formulation of the problem,
      see Mehta & Meyn 2009.

Recall that along the trajectory, (d/dt) h(x(t)) = D_u h(x), evaluated at x = x(t), u = u(t).
Deterministic Stochastic Approximation

Stochastic Approximation:

    (d/dt) θ = −ε_t L^θ_t ( (d/dt) ∇_θ H^θ(x°(t)) − γ ∇_θ H^θ(x°(t), u°(t)) )

    L^θ_t := (d/dt) H^θ(x°(t)) + γ ( c(x°(t), u°(t)) − H^θ(x°(t), u°(t)) )

This is the trajectory-based counterpart of the gradient descent above: the mean-square
Bellman error is replaced by its observed, instantaneous value.
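A rough discrete-time rendering (my own sketch, not the authors' code) of one parameter
update, with the time derivatives replaced by finite differences. Here H, H_bar, grad_H,
grad_H_bar, and c are hypothetical callables for the parameterized Q-function, its minimum,
their θ-gradients, and the cost.

    import numpy as np

    def sa_update(theta, x_prev, x, u, dt, step, gamma, H, H_bar, grad_H, grad_H_bar, c):
        """One stochastic-approximation step along an observed trajectory.

        Finite differences stand in for d/dt H_bar(x(t)) and d/dt grad_theta H_bar(x(t))."""
        dH_bar = (H_bar(theta, x) - H_bar(theta, x_prev)) / dt
        dgrad_H_bar = (grad_H_bar(theta, x) - grad_H_bar(theta, x_prev)) / dt
        L_t = dH_bar + gamma * (c(x, u) - H(theta, x, u))      # instantaneous Bellman error
        return theta - step * L_t * (dgrad_H_bar - gamma * grad_H(theta, x, u))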
Desired behavior

                                                               Compare                                                           Outputs


     Q learning - Local Learning                               and learn                  Inputs


                                                                                                           Complex system

                                                                                       Measured behavior




     Cubic nonlinearity:         d
                                 dt x    = −x3 + u,        c(x, u) = 1 x2 + 1 u2
                                                                     2      2


HJB:                  min ( ½ x² + ½ u² + (−x³ + u) ∇J∗(x) ) = γ J∗(x)
                       u

Basis:                H^θ(x, u) = c(x, u) + θ_x x² + θ_xu · x/(1 + 2x²) · u

[Two panels over x ∈ [−1, 1], each titled "Optimal policy", with color scales: results for the low- and high-amplitude exploring inputs]

            Low amplitude input                    High amplitude input

                  u(t) = A( sin(t) + sin(πt) + sin(e t) )
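As a concrete illustration of the whole procedure on this example, here is a hedged sketch (my own reconstruction, not the authors' code): simulate dx/dt = −x³ + u under the exploring input above, form H^θ(x, u) = c(x, u) + θ_x x² + θ_xu · x/(1 + 2x²) · u, and descend the empirical mean-square Bellman error. The discount γ, the amplitude A, the horizon, the Euler step, and the diagonal scaling of the gradient step are all illustrative choices.

# Hedged sketch (my reconstruction, not the authors' code): Q-learning for
#     dx/dt = -x^3 + u,   c(x,u) = x^2/2 + u^2/2,
# with the two-parameter approximation from the slide,
#     H_theta(x,u) = c(x,u) + th_x*x^2 + th_xu*psi(x)*u,   psi(x) = x/(1 + 2x^2).
# The parameters are fit by descending the empirical mean-square Bellman error
# along a trajectory driven by u(t) = A(sin t + sin(pi t) + sin(e t)).
import numpy as np

gamma, A, dt, T = 0.1, 1.0, 1e-3, 100.0      # illustrative values
t = np.arange(0.0, T, dt)
u = A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

x = np.zeros_like(t)                          # Euler simulation of the driven system
for k in range(len(t) - 1):
    x[k + 1] = x[k] + dt * (-x[k] ** 3 + u[k])

psi = x / (1.0 + 2.0 * x ** 2)
c = 0.5 * x ** 2 + 0.5 * u ** 2

def error_and_grad(theta):
    th_x, th_xu = theta
    # Minimizing H_theta over u gives u = -th_xu*psi(x), hence
    # Hbar(x) = x^2/2 + th_x*x^2 - th_xu^2*psi(x)^2/2.
    Hbar = 0.5 * x ** 2 + th_x * x ** 2 - 0.5 * th_xu ** 2 * psi ** 2
    H = c + th_x * x ** 2 + th_xu * psi * u
    # Bellman error along the trajectory:  L = d/dt Hbar(x(t)) + gamma*(c - H).
    L = np.diff(Hbar) / dt + gamma * (c[:-1] - H[:-1])
    dL_dthx = np.diff(x ** 2) / dt - gamma * x[:-1] ** 2
    dL_dthxu = -th_xu * np.diff(psi ** 2) / dt - gamma * psi[:-1] * u[:-1]
    grad = np.array([np.mean(L * dL_dthx), np.mean(L * dL_dthxu)])
    scale = np.array([np.mean(dL_dthx ** 2), np.mean(dL_dthxu ** 2)]) + 1e-9
    return 0.5 * np.mean(L ** 2), grad / scale      # diagonally scaled gradient

theta = np.zeros(2)
for _ in range(500):                          # gradient descent on the empirical error
    _, g = error_and_grad(theta)
    theta -= 0.2 * g

err, _ = error_and_grad(theta)
print("mean-square Bellman error: %.3g" % err)
print("learned policy:  u(x) = -%.3f * x/(1 + 2x^2)" % theta[1])

Re-running the sketch with a larger amplitude A plays the role of the high-amplitude experiment in the figure above.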
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control
Multi-agent model

M. Huang, P. E. Caines, and R. P. Malhamé. Large-population cost-coupled LQG problems with
nonuniform agents: Individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans.
Auto. Control, 52(9):1560–1571, 2007.

Huang et al.: Local optimization for global coordination
Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low dimensional LQG model

Results from five agent model:
Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low dimensional LQG model

Results from the five-agent model:

[Plot: estimated state feedback gains vs. time, with one trace for the individual-state gain and one for the ensemble-state gain]
Gains for agent 4: Q-learning sample paths and the gains predicted from the ∞-agent limit
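The slide specifies only that the basis is consistent with the low-dimensional LQG model; the sketch below is an assumption about what such a quadratic parameterization could look like for a single agent, built from the individual state x_i and the ensemble average xbar. The particular terms, the control weight r, and the numerical values are hypothetical, chosen only to show how one gain on the individual state and one on the ensemble state (the two traces in the plot above) emerge from minimizing H^θ over u_i.

# Hedged sketch (hypothetical structure, not the authors' construction): a quadratic
# Q-function approximation for agent i using its own state x_i and the ensemble
# average xbar, assuming the running cost contains a control penalty 0.5*r*u_i^2:
#   H_theta(x_i, xbar, u_i) = c(x_i, xbar, u_i)
#                             + th1*x_i**2 + th2*x_i*xbar + th3*xbar**2
#                             + th4*x_i*u_i + th5*xbar*u_i
import numpy as np

r = 1.0                                    # assumed control weight in the cost

def policy_gains(theta):
    """Gains of the minimizer u_i = -(k_ind*x_i + k_ens*xbar) of H_theta over u_i."""
    th4, th5 = theta[3], theta[4]
    # Stationarity in u_i:  r*u_i + th4*x_i + th5*xbar = 0.
    return th4 / r, th5 / r

# Illustrative parameter values only; in the algorithm these would be learned.
k_ind, k_ens = policy_gains(np.array([0.9, 0.2, 0.1, 0.8, 0.3]))
print("u_i = -(%.2f * x_i + %.2f * xbar)" % (k_ind, k_ens))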
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control


                                                          ... Conclusions
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
in the design of approximate dynamic programming algorithms
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
in the design of approximate dynamic programming algorithms

Q-learning is as fundamental as the Riccati equation - this
should be included in our first-year graduate control courses
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
in the design of approximate dynamic programming algorithms

Q-learning is as fundamental as the Riccati equation - this
should be included in our first-year graduate control courses

Current research: Algorithm analysis and improvements
                  Applications in biology and economics
                  Analysis of game-theoretic issues
                                 in coupled systems
References


      PhD thesis, University of London, London, England, 1967.

      D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. American Elsevier Pub. Co., New York, NY, 1970.

      C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.

      C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.


      V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.




      D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2):477–484, 2009.


      P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. Submitted to the 48th IEEE Conference on Decision and Control, December 16-18, 2009.

[9]   C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks.
      Preprint available at http://moallemi.com/ciamac/research-interests.php, 2008.
