Q-Learning and Pontryagin's Minimum Principle

Sean Meyn
Department of Electrical and Computer Engineering
and the Coordinated Science Laboratory
University of Illinois

Joint work with Prashant Mehta
NSF support: ECS-0523620
Outline

  Coarse models - what to do with them?

  Q-learning for nonlinear state space models
    Step 1: Recognize
    Step 2: Find a stabilizing policy
    Step 3: Optimality
    Step 4: Adjoint
    Step 5: Interpret

  Example: Local approximation

  Example: Decentralized control
Coarse Models: A rich collection of model reduction techniques

Many of today's participants have contributed to this research. A biased list:

  Fluid models: Law of Large Numbers scaling, most likely paths in large deviations
  Workload relaxation for networks
  Heavy-traffic limits
  Clustering: spectral graph theory, Markov spectral theory
  Singular perturbations
  Large population limits: interacting particle systems
Workload Relaxations

An example from CTCN:

  [Figure 7.1: Demand-driven model with routing, scheduling, and re-work.]

Workload at the two stations evolves as a two-dimensional system,
and the cost is projected onto these coordinates.

Optimal policy for the relaxation = hedging policy for the full network.

  [Figure 7.2: Optimal policies for two instances of the network shown in Figure 7.1.
   In each figure the optimal stochastic control region R_STO is compared with the
   optimal region R* obtained for the two-dimensional fluid model. Axes: workload
   coordinates w1, w2; boundary slope -(1 - ρ).]
Workload Relaxations and Simulation

An example from CTCN:

  [Figure: network with arrival rates α and service rates µ at Stations 1 and 2;
   decision making at stations 1 & 2, e.g., setting safety-stock levels.]

DP and simulations accelerated using the fluid value function for the workload relaxation:

  Value iteration algorithm (VIA) initialized with zero vs. with the fluid value function
    [Plot: average cost vs. iteration]

  Simulated mean with and without control variate
    [Plot: average cost vs. safety-stock levels]
What To Do With a Coarse Model?

Setting: we have qualitative or partial quantitative insight regarding optimal control.

The network examples relied on specific network structure. What about other models?

An answer lies in a new formulation of Q-learning.
What is Q learning?

Watkins' 1992 formulation applied to finite state space MDPs:

  Q-Learning. C. J. C. H. Watkins and P. Dayan. Machine Learning, 1992.

The idea is similar to Mayne & Jacobson's differential dynamic programming:

  Differential dynamic programming. D. H. Jacobson and D. Q. Mayne.
  American Elsevier Pub. Co., 1970.

Deterministic formulation: nonlinear system on Euclidean space,

  d/dt x(t) = f(x(t), u(t)),    t ≥ 0

Infinite-horizon discounted cost criterion,

  J*(x) = inf ∫_0^∞ e^{-γs} c(x(s), u(s)) ds,    x(0) = x

with c a non-negative cost function.
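To make the discounted-cost criterion concrete, here is a minimal Python sketch (an editorial illustration, not from the slides) that approximates the discounted cost of one fixed feedback policy by Euler integration of the dynamics and of the discounted running cost. The scalar dynamics (the cubic model that reappears later in the deck), the policy gain, and all numerical constants are assumptions.

    import numpy as np

    # Illustrative scalar example: dx/dt = -x^3 + u, quadratic cost,
    # and an assumed stabilizing linear feedback u = -0.5 x.
    f = lambda x, u: -x**3 + u
    c = lambda x, u: 0.5 * x**2 + 0.5 * u**2
    policy = lambda x: -0.5 * x

    def discounted_cost(x0, gamma=0.1, dt=1e-3, T=200.0):
        """Euler approximation of int_0^T e^{-gamma s} c(x(s), u(s)) ds."""
        x, J, disc = x0, 0.0, 1.0
        for _ in range(int(T / dt)):
            u = policy(x)
            J += disc * c(x, u) * dt      # accumulate discounted running cost
            x += f(x, u) * dt             # Euler step of the dynamics
            disc *= np.exp(-gamma * dt)   # discount factor e^{-gamma t}
        return J

    print(discounted_cost(1.0))

The infimum of such costs over all inputs defines J*(x); the Q-function machinery below avoids computing J* directly.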
Differential generator: for any smooth function h,

  D_u h(x) := (∇h(x))^T f(x, u)

HJB equation:

  min_u { c(x, u) + D_u J*(x) } = γ J*(x)

The Q-function of Q-learning is this function of two variables.
Q learning - Steps towards an algorithm

Sequence of five steps:

  Step 1: Recognize the fixed point equation for the Q-function
  Step 2: Find a stabilizing policy that is ergodic
  Step 3: Optimality criterion - minimize the Bellman error
  Step 4: Adjoint operation
  Step 5: Interpret and simulate!

Goal: seek the best approximation within a parameterized class.
Q learning - Steps towards an algorithm

Step 1: Recognize the fixed point equation for the Q-function

  Q-function:    H*(x, u) = c(x, u) + D_u J*(x)

  Its minimum:   H*(x) := min_{u∈U} H*(x, u) = γ J*(x)

  Fixed point equation:

    D_u H*(x) = -γ ( c(x, u) - H*(x, u) )

  Key observation for learning: for any input-output pair,

    D_u H*(x) = d/dt H*(x(t)),    evaluated at x = x(t), u = u(t)
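The fixed point equation and the key observation can be checked numerically on a problem where the Q-function is available in closed form. The sketch below is an editorial illustration (the scalar plant, weights, and probing input are assumptions): it uses a scalar discounted LQR problem, drives it with an arbitrary bounded input, and compares d/dt H*(x(t)) with -γ(c - H*) along the trajectory.

    import numpy as np

    # Scalar discounted LQR: dx/dt = a x + b u, c = 0.5 (q x^2 + r u^2)
    a, b, q, r, gamma = -1.0, 1.0, 1.0, 1.0, 0.1

    # p solves the discounted Riccati equation  q + 2 a p - (b^2/r) p^2 = gamma p
    p = (-(gamma - 2 * a) + np.sqrt((gamma - 2 * a)**2 + 4 * q * b**2 / r)) / (2 * b**2 / r)

    Jstar = lambda x: 0.5 * p * x**2
    Hstar = lambda x, u: 0.5 * (q * x**2 + r * u**2) + p * x * (a * x + b * u)
    Hbar  = lambda x: gamma * Jstar(x)          # min_u Hstar(x, u) = gamma * Jstar(x)

    dt, T = 1e-3, 20.0
    t = np.arange(0.0, T, dt)
    u = np.sin(t) + np.sin(np.pi * t)           # any bounded input-output pair will do
    x = np.empty_like(t); x[0] = 1.0
    for k in range(len(t) - 1):                 # Euler simulation of the linear model
        x[k + 1] = x[k] + (a * x[k] + b * u[k]) * dt

    lhs = np.gradient(Hbar(x), dt)              # d/dt Hbar(x(t)), the key observation
    cst = 0.5 * (q * x**2 + r * u**2)
    rhs = -gamma * (cst - Hstar(x, u))          # right side of the fixed point equation
    print("max deviation:", np.max(np.abs(lhs - rhs)))   # ~0 up to discretization error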
Q learning - LQR example

Linear model and quadratic cost,

  Cost:        c(x, u) = ½ x^T Q x + ½ u^T R u

  Q-function:  H*(x, u) = c(x, u) + (Ax + Bu)^T P* x
                        = c(x, u) + D_u J*(x),
               where P* solves the Riccati equation and J*(x) = ½ x^T P* x

  Q-function approximation:

    H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ_i x^T E^i x + Σ_{j=1}^{d_xu} θ_j x^T F^j u

  Minimum:

    H^θ(x) = ½ x^T ( Q + E^θ - F^{θT} R^{-1} F^θ ) x

  Minimizer:

    u^θ(x) = φ^θ(x) = -R^{-1} F^θ x
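For the LQR example the Riccati solution can be computed with standard tools, and the claims on the slide checked numerically. A hedged sketch (the system matrices are illustrative; the shift A - (γ/2)I is how the discount enters the standard continuous-time ARE under the cost conventions above):

    import numpy as np
    from scipy.linalg import solve_continuous_are

    A = np.array([[0.0, 1.0], [-1.0, -0.5]])   # illustrative system matrices
    B = np.array([[0.0], [1.0]])
    Q = np.eye(2); R = np.array([[1.0]])
    gamma = 0.1

    # Discounted ARE  Q + P A + A^T P - P B R^{-1} B^T P = gamma P
    # is the standard ARE for the shifted matrix A - (gamma/2) I.
    P = solve_continuous_are(A - 0.5 * gamma * np.eye(2), B, Q, R)

    c     = lambda x, u: 0.5 * x @ Q @ x + 0.5 * u @ R @ u
    Hstar = lambda x, u: c(x, u) + (A @ x + B @ u) @ P @ x
    Jstar = lambda x: 0.5 * x @ P @ x
    K     = np.linalg.solve(R, B.T @ P)         # minimizer u = -K x = -R^{-1} B^T P* x

    x = np.array([1.0, -2.0])
    grid = np.linspace(-10, 10, 4001)
    vals = [Hstar(x, np.array([u])) for u in grid]
    print("grid argmin:", grid[int(np.argmin(vals))], "  -K x:", (-K @ x)[0])
    print("min_u H*(x,u):", np.min(vals), "  gamma * J*(x):", gamma * Jstar(x))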
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Assume the LLN holds for continuous functions F : X × U → R. As T → ∞,

  (1/T) ∫_0^T F(x(t), u(t)) dt  →  ∫_{X×U} F(x, u) π(dx, du)

where π denotes the steady-state distribution of the state-input pair.
Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Suppose, for example, the input is scalar and the system is stable
[bounded-input/bounded-state]. One can try a linear combination of sinusoids,

  u(t) = A( sin(t) + sin(πt) + sin(et) )
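A sketch of Step 2 in code (editorial illustration): apply the quasi-periodic probing input from the slide to a bounded-input/bounded-state system, here the cubic model used in the local-learning example later in the deck, and form time averages as in the LLN. Amplitude, step size, and horizon are assumptions.

    import numpy as np

    A_amp = 1.0                                  # probing amplitude (assumed)
    u_probe = lambda t: A_amp * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))
    f = lambda x, u: -x**3 + u                   # BIBS-stable cubic example

    dt, T = 1e-3, 500.0
    t = np.arange(0.0, T, dt)
    x = np.empty_like(t); x[0] = 0.0
    for k in range(len(t) - 1):                  # Euler simulation under the probing input
        x[k + 1] = x[k] + f(x[k], u_probe(t[k])) * dt

    # Ergodic (time) averages of a few test functions F(x, u)
    u = u_probe(t)
    for name, F in [("x^2", x**2), ("u^2", u**2), ("x*u", x * u)]:
        print("time average of", name, "=", F.mean())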
Q learning - Steps towards an algorithm

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error:

  E(θ) := ½ || L^θ ||²,    where  L^θ(x, u) := D_u H^θ(x) + γ ( c(x, u) - H^θ(x, u) )

First-order condition for optimality:

  ⟨ L^θ , D_u ψ_i^θ - γ ψ_i ⟩ = 0,    with  ψ_i^θ(x) = ψ_i(x, φ^θ(x)),   1 ≤ i ≤ d

(here ψ_i = ∂H^θ/∂θ_i denotes the i-th basis function of the parameterization).
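The mean-square Bellman error can be estimated directly from recorded input-output data, using the key observation of Step 1 to replace D_u H^θ(x) by a time derivative along the trajectory. The sketch below is an editorial illustration: the two-parameter basis is a scalar analogue of the LQR parameterization above, and the system, probing input, and constants are assumptions.

    import numpy as np

    gamma = 0.1
    c = lambda x, u: 0.5 * x**2 + 0.5 * u**2

    # Assumed basis: H_theta(x,u) = c(x,u) + th1 x^2 + th2 x u,
    # so phi_theta(x) = -th2 x and min_u H_theta(x,u) = (0.5 + th1 - 0.5 th2^2) x^2.
    def bellman_error(theta, x, u, dt):
        th1, th2 = theta
        H    = c(x, u) + th1 * x**2 + th2 * x * u
        Hbar = (0.5 + th1 - 0.5 * th2**2) * x**2
        L = np.gradient(Hbar, dt) + gamma * (c(x, u) - H)   # Bellman error along the path
        return 0.5 * np.mean(L**2)                          # empirical mean-square error

    # Trajectory data: cubic system driven by the probing input of Step 2
    dt, T = 1e-3, 200.0
    t = np.arange(0.0, T, dt)
    u = np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t)
    x = np.empty_like(t); x[0] = 0.0
    for k in range(len(t) - 1):
        x[k + 1] = x[k] + (-x[k]**3 + u[k]) * dt

    print(bellman_error(np.array([0.1, -0.1]), x, u, dt))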
Q learning - Convex Reformulation

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error, with the minimum
H^θ(x) replaced by a function G^θ subject to the constraint

  G^θ(x) ≤ H^θ(x, u),    for all x, u
Q learning - LQR example

Linear model and quadratic cost,

  Cost:        c(x, u) = ½ x^T Q x + ½ u^T R u

  Q-function:  H*(x, u) = c(x, u) + (Ax + Bu)^T P* x,    where P* solves the Riccati equation

  Q-function approximation:

    H^θ(x, u) = c(x, u) + ½ Σ_{i=1}^{d_x} θ_i x^T E^i x + Σ_{j=1}^{d_xu} θ_j x^T F^j u

  Approximation to the minimum:

    G^θ(x) = ½ x^T G^θ x

  Minimizer:

    u^θ(x) = φ^θ(x) = -R^{-1} F^θ x
Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables g : X × R^w → R, the resolvent gives a new function,

  R_β g (x, w) = ∫_0^∞ e^{-βt} g(x(t), ξ(t)) dt,    β > 0,    x(0) = x, ξ(0) = w,

with the state trajectory controlled using the nominal policy

  u(t) = φ(x(t), ξ(t)),    t ≥ 0    (stabilizing & ergodic)
Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

Resolvent equation: for smooth h,

  R_β ( D_u h ) = [ βR_β - I ] h

Smoothed Bellman error:

  L^{θ,β} = R_β L^θ
          = [ βR_β - I ] H^θ + γ R_β ( c - H^θ )
Q learning - Steps towards an algorithm

Smoothed Bellman error:

  E_β(θ) := ½ || L^{θ,β} ||²

  ∇_θ E_β(θ) = ⟨ L^{θ,β} , ∇_θ L^{θ,β} ⟩  =  zero at an optimum

This involves terms of the form ⟨ R_β g , R_β h ⟩.
Q learning - Steps towards an algorithm

Adjoint operation:

  R_β† R_β = (1/2β) ( R_β + R_β† )

  ⟨ R_β g , R_β h ⟩ = (1/2β) ( ⟨ g , R_β† h ⟩ + ⟨ h , R_β† g ⟩ )

Adjoint realization: time-reversal,

  R_β† g (x, w) = ∫_0^∞ e^{-βt} E_{x,w}[ g(x°(-t), ξ°(-t)) ] dt

with the expectation conditional on x°(0) = x, ξ°(0) = w.
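In discrete time the resolvent and its adjoint reduce to exponentially weighted sums over the recorded trajectory, which is what makes the inner products above computable from data. A sketch under obvious sampling assumptions (step dt, Riemann approximation of the integrals; the test signal is arbitrary):

    import numpy as np

    def resolvent(g, beta, dt):
        """(R_beta g)(t_k) ~ sum_{j>=0} e^{-beta j dt} g[k+j] dt  (looks forward in time)."""
        y = np.zeros_like(g); a = np.exp(-beta * dt); acc = 0.0
        for k in range(len(g) - 1, -1, -1):     # backward recursion over the record
            acc = g[k] * dt + a * acc
            y[k] = acc
        return y

    def resolvent_adjoint(g, beta, dt):
        """Time-reversed realization: the same filter applied to the past of the record."""
        y = np.zeros_like(g); a = np.exp(-beta * dt); acc = 0.0
        for k in range(len(g)):                 # forward recursion = causal smoothing
            acc = g[k] * dt + a * acc
            y[k] = acc
        return y

    dt, beta = 1e-2, 1.0
    t = np.arange(0.0, 50.0, dt)
    g = np.sin(t) + 0.3 * np.sin(np.e * t)      # a signal recorded along a trajectory
    print(resolvent(g, beta, dt)[:3], resolvent_adjoint(g, beta, dt)[-3:])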
Q learning - Steps towards an algorithm

After Step 5: Not quite adaptive control.

  [Block diagram: an ergodic input is applied to the complex system; desired behavior
   and measured behavior (inputs and outputs) are compared in order to learn.]
Deterministic Stochastic Approximation

Based on observations, minimize the mean-square Bellman error.

Gradient descent:

  d/dt θ = -ε ⟨ L^θ , D_u ∇_θ H^θ(x) - γ ∇_θ H^θ(x, u) ⟩

Converges* to the minimizer of the mean-square Bellman error.

  * Convergence observed in experiments! For a convex re-formulation of the problem,
    see Mehta & Meyn 2009.

Notation: along the closed-loop trajectory,

  d/dt h(x(t)) |_{x=x(t), w=ξ(t)} = D_u h(x)
Deterministic Stochastic Approximation

Stochastic Approximation:

  d/dt θ = -ε_t L^θ_t ( d/dt ∇_θ H^θ(x°(t)) - γ ∇_θ H^θ(x°(t), u°(t)) )

  L^θ_t := d/dt H^θ(x°(t)) + γ ( c(x°(t), u°(t)) - H^θ(x°(t), u°(t)) )

Compare with the gradient descent recursion for the mean-square Bellman error,

  d/dt θ = -ε ⟨ L^θ , D_u ∇_θ H^θ(x) - γ ∇_θ H^θ(x, u) ⟩
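A discrete-time sketch of the stochastic-approximation recursion, written for the assumed two-parameter scalar basis used in the Bellman-error sketch above; time derivatives are replaced by one-step differences, and the constant step size and gains are illustrative (the slide's ε_t may be time-varying), so this is an Euler discretization of the ODE above rather than the authors' implementation.

    import numpy as np

    gamma, dt = 0.1, 1e-3
    c = lambda x, u: 0.5 * x**2 + 0.5 * u**2

    # Assumed basis: H_theta(x,u) = c(x,u) + th1 x^2 + th2 x u
    Hbar      = lambda th, x: (0.5 + th[0] - 0.5 * th[1]**2) * x**2   # min over u
    grad_Hbar = lambda th, x: np.array([x**2, -th[1] * x**2])         # d/dtheta of the minimum
    grad_H    = lambda th, x, u: np.array([x**2, x * u])              # d/dtheta of H_theta(x,u)

    T = 500.0
    t = np.arange(0.0, T, dt)
    u = np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t)              # ergodic probing input
    x = np.empty_like(t); x[0] = 0.0
    for k in range(len(t) - 1):
        x[k + 1] = x[k] + (-x[k]**3 + u[k]) * dt                      # cubic example

    theta, eps = np.array([0.0, 0.0]), 1e-2                           # illustrative step size
    for k in range(len(t) - 1):
        H_k   = c(x[k], u[k]) + theta[0] * x[k]**2 + theta[1] * x[k] * u[k]
        dHbar = (Hbar(theta, x[k + 1]) - Hbar(theta, x[k])) / dt      # d/dt Hbar(x(t))
        L_k   = dHbar + gamma * (c(x[k], u[k]) - H_k)                 # Bellman error L_t
        dgrad = (grad_Hbar(theta, x[k + 1]) - grad_Hbar(theta, x[k])) / dt
        theta = theta - eps * dt * L_k * (dgrad - gamma * grad_H(theta, x[k], u[k]))

    print("learned theta:", theta)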
Desired behavior

                                                               Compare                                                           Outputs


     Q learning - Local Learning                               and learn                  Inputs


                                                                                                           Complex system

                                                                                       Measured behavior




Cubic nonlinearity:   $\frac{d}{dt}x = -x^3 + u$,    $c(x,u) = \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2$

HJB:   $\min_u \big( \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2 + (-x^3 + u)\,\nabla J^*(x) \big) = \gamma J^*(x)$

Basis:   $H^\theta(x,u) = c(x,u) + \theta^{x}\,x^2 + \theta^{xu}\,\frac{x}{1+2x^2}\,u$
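For later reference, minimizing this $H^\theta$ over $u$ is elementary; the short derivation below is ours, written out from the basis above:

$$
\frac{\partial H^\theta}{\partial u}(x,u) = u + \theta^{xu}\,\frac{x}{1+2x^2} = 0
\;\;\Longrightarrow\;\;
\phi^\theta(x) = -\,\theta^{xu}\,\frac{x}{1+2x^2},
\qquad
\underline{H}^\theta(x) = \tfrac{1}{2}x^2 + \theta^{x}x^2 - \tfrac{1}{2}\,(\theta^{xu})^2\,\frac{x^2}{(1+2x^2)^2}.
$$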

[Figure: results for a low amplitude input (left) and a high amplitude input (right); each panel shows the optimal policy over the range −1 ≤ x ≤ 1 for comparison. Exploration input: u(t) = A(sin(t) + sin(πt) + sin(et)).]
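A hedged, end-to-end sketch of this local-learning experiment (the discount factor, amplitude, horizon, and step size below are illustrative choices, not values reported on the slides): simulate the cubic system under the sinusoidal exploration input, then run gradient descent on the mean-square Bellman error for the two weights θ^x and θ^xu.

```python
import numpy as np

# Illustrative parameters (assumptions, not taken from the slides)
gamma, A, dt, T = 0.1, 0.5, 1e-3, 100.0
t = np.arange(0.0, T, dt)
u = A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))   # exploration input

# Simulate dx/dt = -x^3 + u by forward Euler
x = np.zeros_like(t)
for k in range(len(t) - 1):
    x[k + 1] = x[k] + dt * (-x[k] ** 3 + u[k])

s = x / (1.0 + 2.0 * x ** 2)              # shape function appearing in the basis
c = 0.5 * x ** 2 + 0.5 * u ** 2           # running cost along the trajectory

def bellman_error(theta):
    """Bellman error samples L^theta and their theta-gradients along the path."""
    th_x, th_xu = theta
    H     = c + th_x * x ** 2 + th_xu * s * u                          # H^theta(x,u)
    Hbar  = 0.5 * x ** 2 + th_x * x ** 2 - 0.5 * th_xu ** 2 * s ** 2   # min_u H^theta
    gH    = np.stack([x ** 2, s * u], axis=1)                          # grad_theta H
    gHbar = np.stack([x ** 2, -th_xu * s ** 2], axis=1)                # grad_theta Hbar
    L  = np.diff(Hbar) / dt + gamma * (c[:-1] - H[:-1])
    gL = np.diff(gHbar, axis=0) / dt - gamma * gH[:-1]
    return L, gL

theta = np.zeros(2)
for _ in range(2000):                      # steepest descent on 0.5*mean(L^2)
    L, gL = bellman_error(theta)
    theta -= 0.2 * (L[:, None] * gL).mean(axis=0)

print("theta_x, theta_xu:", theta)
print("learned local policy: u = -theta_xu * x / (1 + 2 x^2)")
```

The fitted weights determine the local feedback φ^θ(x) = −θ^xu x/(1+2x²) from the derivation given earlier.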
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control
Multi-agent model

M. Huang, P. E. Caines, and R. P. Malhame. Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Auto. Control, 52(9):1560–1571, 2007.

Huang et al.: local optimization for global coordination
Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low dimensional LQG model

Results from the five-agent model:

[Figure: estimated state feedback gains on the individual state and on the ensemble state, plotted against time. Gains for agent 4: Q-learning sample paths together with the gains predicted from the ∞-agent limit.]
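The slides do not spell the basis out; the sketch below is one plausible reading of "consistent with the low-dimensional LQG model" for a single scalar agent, with illustrative cost weights q and r that are our own assumptions. Its minimizing input is linear in the agent's own state and in the ensemble average, which matches the two gains (individual state, ensemble state) tracked in the figure.

```python
# Quadratic Q-function basis in (x_i, xbar, u): the agent's own state, the
# population (ensemble) average, and the agent's input.  q, r are illustrative.
q, r = 1.0, 1.0

def cost(x_i, xbar, u):
    # individual state steered toward the population average, plus an input penalty
    return 0.5 * q * (x_i - xbar) ** 2 + 0.5 * r * u ** 2

def H_theta(theta, x_i, xbar, u):
    t1, t2, t3, t4, t5 = theta
    return (cost(x_i, xbar, u)
            + t1 * x_i ** 2 + t2 * x_i * xbar + t3 * xbar ** 2
            + (t4 * x_i + t5 * xbar) * u)

def policy(theta, x_i, xbar):
    # minimizer of H_theta over u:  r*u + t4*x_i + t5*xbar = 0
    _, _, _, t4, t5 = theta
    return -(t4 * x_i + t5 * xbar) / r   # one gain on x_i, one on xbar
```

Q-learning estimates the weights from data; the induced gains t4/r and t5/r are what the figure reports for agent 4, alongside the values predicted by the ∞-agent limit.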
Outline


?                        Coarse models - what to do with them?


Step 1: Recognize
Step 2: Find a stab...
Step 3: Optimality
                         Q-learning for nonlinear state space models
Step 4: Adjoint
Step 5: Interpret




                         Example: Local approximation


                         Example: Decentralized control


                                                          ... Conclusions
Conclusions

Coarse models give tremendous insight

They are also tremendously useful
for design in approximate dynamic programming algorithms

Q-learning is as fundamental as the Riccati equation - this
should be included in our first-year graduate control courses

Current research: Algorithm analysis and improvements
                  Applications in biology and economics
                  Analysis of game-theoretic issues in coupled systems
