Q-Learning and Pontryagin's Minimum Principle

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Favorites, Groups & Events

    Q-Learning and Pontryagin's Minimum Principle - Presentation Transcript

    1. Q-Learning and Pontryagin's Minimum Principle Sean Meyn Department of Electrical and Computer Engineering and the Coordinated Science Laboratory University of Illinois Joint work with Prashant Mehta NSF support: ECS-0523620
    2. Outline ? Coarse models - what to do with them? Step 1: Recognize Step 2: Find a stab... Step 3: Optimality Q-learning for nonlinear state space models Step 4: Adjoint Step 5: Interpret Example: Local approximation Example: Decentralized control
    3. Outline ? Coarse models - what to do with them? Step 1: Recognize Step 2: Find a stab... Step 3: Optimality Q-learning for nonlinear state space models Step 4: Adjoint Step 5: Interpret Example: Local approximation Example: Decentralized control
    4. Coarse Models: A rich collection of model reduction techniques Many of today’s participants have contributed to this research. A biased list: Fluid models: Law of Large Numbers scaling, most likely paths in large deviations Workload relaxation for networks Heavy-traffic limits Clustering: spectral graph theory Markov spectral theory Singular perturbations Large population limits: Interacting particle systems
    5. Workload Relaxations An example from CTCN: Figure 7.1: Demand-driven model with routing, scheduling, and re-work. Workload at two stations evolves as a two-dimensional system Cost is projected onto these coordinates: −(1 − ρ) −(1 − ρ) R STO R∗ R STO R∗ 50 50 w2 w2 40 40 30 20 30 20 Optimal policy for 10 0 10 0 relaxation = hedging -10 -20 -10 -20 policy for full network -20 -10 0 10 20 30 40 50 -20 -10 0 10 20 30 40 50 w1 w1 Figure 7.2: Optimal policies for two instances of the network shown in Figure 7.1. In each figure the optimal stochastic control region RSTO is compared with the optimal region R∗ obtained for the two dimensional fluid model.
    6. Workload Relaxations and Simulation α µ µ An example from CTCN: Station 1 Station 2 µ µ α Decision making at stations 1 & 2 e.g., setting safety-stock levels DP and simulations accelerated using fluid value function for workload relaxation VIA initialized with Simulated mean with and without control variate: Zero Average cost Average cost Fluid value function 10 10 20 10 20 30 20 30 40 30 40 20 40 50 30 50 100 150 200 250 300 50 40 50 60 60 50 60 60 Iteration safety-stock levels
    7. VIA initialized with Zero What To Do With a Coarse Model? Average cost Fluid value function 50 100 150 200 250 300 Iteration Setting: we have qualitative or partial quantitative insight regarding optimal control The network examples relied on specific network structure What about other models?
    8. VIA initialized with Zero What To Do With a Coarse Model? Average cost Fluid value function 50 100 150 200 250 300 Iteration Setting: we have qualitative or partial quantitative insight regarding optimal control The network examples relied on specific network structure What about other models? An answer lies in a new formulation of Q-learning
    9. Outline ? Coarse models - what to do with them? Step 1: Recognize Step 2: Find a stab... Step 3: Optimality Q-learning for nonlinear state space models Step 4: Adjoint Step 5: Interpret Example: Local approximation Example: Decentralized control
    10. VIA initialized with Zero What is Q learning? Average cost Fluid value function 50 100 150 200 250 300 Iteration Watkin’s 1992 formulation applied to finite state space MDPs Q-Learning Idea is similar to Mayne & Jacobson’s C. J. C. H. Watkins and P. Dayan Machine Learning, 1992 differential dynamic programming Differential dynamic programming D. H. Jacobson and D. Q. Mayne American Elsevier Pub. Co. 1970
    11. VIA initialized with Zero What is Q learning? Average cost Fluid value function 50 100 150 200 250 300 Iteration Watkin’s 1992 formulation applied to finite state space MDPs Q-Learning Idea is similar to Mayne & Jacobson’s C. J. C. H. Watkins and P. Dayan Machine Learning, 1992 differential dynamic programming Differential dynamic programming D. H. Jacobson and D. Q. Mayne American Elsevier Pub. Co. 1970 Deterministic formulation: Nonlinear system on Euclidean space, d dt x(t) = f (x(t), u(t)), t≥ 0 Infinite-horizon discounted cost criterion, ∞ J ∗ (x) = inf e−γs c(x(s), u(s)) ds, x(0) = x 0 with c a non-negative cost function.
    12. 1 0.08 Optimal policy 0.07 0.06 What is Q learning? 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Deterministic formulation: Nonlinear system on Euclidean space, d dt x(t) = f (x(t), u(t)), t≥ 0 Infinite-horizon discounted cost criterion, ∞ J ∗ (x) = inf e−γs c(x(s), u(s)) ds, x(0) = x 0 with c a non-negative cost function. Differential generator: For any smooth function h, Du h (x) := (∇h (x))T f (x, u)
    13. 1 0.08 Optimal policy 0.07 0.06 What is Q learning? 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Deterministic formulation: Nonlinear system on Euclidean space, d dt x(t) = f (x(t), u(t)), t≥ 0 Infinite-horizon discounted cost criterion, ∞ J ∗ (x) = inf e−γs c(x(s), u(s)) ds, x(0) = x 0 with c a non-negative cost function. Differential generator: For any smooth function h, Du h (x) := (∇h (x))T f (x, u) HJB equation: min c(x, u) + Du J ∗ (x) = γJ ∗ (x) u
    14. 1 0.08 Optimal policy 0.07 0.06 What is Q learning? 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Deterministic formulation: Nonlinear system on Euclidean space, d dt x(t) = f (x(t), u(t)), t≥ 0 Infinite-horizon discounted cost criterion, ∞ J ∗ (x) = inf e−γs c(x(s), u(s)) ds, x(0) = x 0 with c a non-negative cost function. Differential generator: For any smooth function h, Du h (x) := (∇h (x))T f (x, u) HJB equation: min c(x, u) + Du J ∗ (x) = γJ ∗ (x) u The Q-function of Q-learning is this function of two variables
    15. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Sequence of five steps: Step 1: Recognize fixed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    16. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Sequence of five steps: Step 1: Recognize fixed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate! Goal - seek the best approximation, within a parameterized class
    17. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Step 1: Recognize fixed point equation for the Q-function Q-function: H ∗(x, u) = c(x, u) + Du J ∗ (x) Its minimum: H ∗ (x) := min H ∗ (x, u) = γJ ∗ (x) u∈U Fixed point equation: Du H ∗ (x) = −γ(c(x, u) − H ∗ (x, u)) Step 1: Recognize xed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    18. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Step 1: Recognize fixed point equation for the Q-function Q-function: H ∗(x, u) = c(x, u) + Du J ∗ (x) Its minimum: H ∗ (x) := min H ∗ (x, u) = γJ ∗ (x) u∈U Fixed point equation: Du H ∗ (x) = −γ(c(x, u) − H ∗ (x, u)) Key observation for learning: For any input-output pair, Du H ∗ (x) = d dt H ∗ (x(t)) x=x(t) u=u(t) Step 1: Recognize xed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    19. 1 0.08 Optimal policy 0.07 0.06 Q learning - LQR example 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Linear model and quadratic cost, 1 T 1 T Cost: c(x, u) = 2 x Qx + 2 u Ru Q-function: H ∗(x, u) = c(x, u) + (Ax + Bu)T P ∗ x = c(x, u) + Du J ∗ (x) Solves Riccatti eqn 1 T ∗ J ∗ (x) = 2x P x Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    20. 1 0.08 Optimal policy 0.07 0.06 Q learning - LQR example 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Linear model and quadratic cost, 1 T 1 T Cost: c(x, u) = 2 x Qx + 2 u Ru Q-function: H ∗(x, u) = c(x, u) + (Ax + Bu)T P ∗ x Solves Riccatti eqn Q-function approx: dx dxu H θ (x, u) = c(x, u) + 1 2 θi xT E i x + x θj xT F i u x i=1 j=1 Minimum: θ 1 T θT H (x) = 2x Q + E − F R−1 F θ x θ Minimizer: uθ (x) = φθ (x) = −R−1 F θ x Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    21. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Step 2: Stationary policy that is ergodic? Assume the LLN holds for continuous functions F: R × R u → R As T → ∞, T 1 F (x(t), u(t)) dt −→ F (x, u) (dx, du) T 0 X×U Step 1: Recognize xed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    22. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Step 2: Stationary policy that is ergodic? Suppose for example the input is scalar, and the system is stable [Bounded-input/Bounded-state] Can try a linear combination of sinusouids Step 1: Recognize xed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    23. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Step 2: Stationary policy that is ergodic? Suppose for example the input is scalar, and the system is stable [Bounded-input/Bounded-state] 0.08 0.07 0.06 0.05 Can try a linear combination 0.04 of sinusouids 0.03 0.02 0.01 Step 1: Recognize xed point equation for the Q-function u(t) = A(sin(t) + sin(πt) + sin(et)) Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    24. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Step 3: Bellman error Based on observations, minimize the mean-square Bellman error: θ θ , First order condition for optimality: θ , Du ψ θ − γψi i θ =0 with ψ θ (x) = ψi (x, φθ (x)), i θ 1≤i≤d Step 1: Recognize xed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    25. 1 0.08 Optimal policy 0.07 0.06 Q learning - Convex Reformulation 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Step 3: Bellman error Based on observations, minimize the mean-square Bellman error: θ θ , G Gθ (x) ≤ H θ (x, u), all x, u Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    26. 1 0.08 Optimal policy 0.07 0.06 Q learning - LQR example 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 Linear model and quadratic cost, 1 T 1 T Cost: c(x, u) = 2 x Qx + 2 u Ru Q-function: H ∗ (x) = c(x, u) + (Ax + Bu)T P ∗ x Solves Riccatti eqn Q-function approx: dx dxu H θ (x, u) = c(x, u) + 1 2 θi xT E i x + x θj xT F i u x i=1 j=1 Approximation to minimum G θ (x) = 1 xT Gθ x 2 Minimizer: uθ (x) = φθ (x) = −R−1 F θ x Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    27. Q learning - Steps towards an algorithm Step 4: Causal smoothing to avoid differentiation For any function of two variables, g : R × R w → R Resolvent gives a new function, ∞ −βt Rβ g (x, w) = e g(x(t), ξ(t)) dt 0 Skip to examples
    28. Q learning - Steps towards an algorithm Step 4: Causal smoothing to avoid differentiation For any function of two variables, g : R × R w → R Resolvent gives a new function, ∞ Rβ g (x, w) = e−βt g(x(t), ξ (t)) dt , β>0 0 controlled using the nominal policy u(t) = φ(x(t), ξ(t)), t≥0 stabilizing & ergodic
    29. Q learning - Steps towards an algorithm Step 4: Causal smoothing to avoid differentiation For any function of two variables, g : R × R w → R Resolvent gives a new function, ∞ Rβ g (x, w) = e−βt g(x(t), ξ (t)) dt , β>0 0 Resolvent equation:
    30. Q learning - Steps towards an algorithm Step 4: Causal smoothing to avoid differentiation For any function of two variables, g : R × R w → R Resolvent gives a new function, ∞ Rβ g (x, w) = e−βt g(x(t), ξ (t)) dt , β>0 0 Resolvent equation: Smoothed Bellman error: Lθ,β = Rβ Lθ θ θ = [βRβ − I]H + γRβ (c − H )
    31. Q learning - Steps towards an algorithm Smoothed Bellman error: 1 θ,β 2 Eβ (θ) := 2 θ,β Eβ (θ) = , θ Lθ,β = zero at an optimum Step 4: Causal smoothing to avoid differentiation
    32. Q learning - Steps towards an algorithm Smoothed Bellman error: 1 θ,β 2 Eβ (θ) := 2 θ,β Eβ (θ) = , θ Lθ,β = zero at an optimum Involves terms of the form Rβ g,R β h Step 4: Causal smoothing to avoid differentiation
    33. Q learning - Steps towards an algorithm 1 θ,β 2 Smoothed Bellman error: Eβ (θ) := 2 θ,β θ,β Eβ (θ) = , θL Adjoint operation: † 1 † Rβ Rβ = (Rβ + Rβ ) 2β 1 † † Rβ g,R β h = g,R β h + h,R β g 2β Step 4: Causal smoothing to avoid differentiation
    34. Q learning - Steps towards an algorithm 1 θ,β 2 Smoothed Bellman error: Eβ (θ) := 2 θ,β θ,β Eβ (θ) = , θL Adjoint operation: 1 † † Rβ Rβ = (Rβ + Rβ ) 2β 1 † † Rβ g,R β h = g,R β h + h,R β g 2β Adjoint realization: time-reversal ∞ † Rβ g (x, w) = e−βt Ex, w [g(x◦ (−t), ξ ◦ (−t))] dt 0 expectation conditional on x◦ (0) = x, ξ ◦ (0) = w. Step 4: Causal smoothing to avoid differentiation
    35. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 After Step 5: Not quite adaptive control: Desired behavior Compare Outputs and learn Inputs Complex system Measured behavior Ergodic input applied Step 1: Recognize xed point equation for the Q-function Step 2: Find a stabilizing policy that is ergodic Step 3: Optimality criterion - minimize Bellman error Step 4: Adjoint operation Step 5: Interpret and simulate!
    36. 1 0.08 Optimal policy 0.07 0.06 Q learning - Steps towards an algorithm 0 0.05 0.04 0.03 0.02 0.01 −1 −1 0 1 After Step 5: Not quite adaptive control: Desired behavior Compare Outputs and learn Inputs Complex system Measured behavior Ergodic input applied Based on observations minimize the mean-square Bellman error:
    37. 1 (individual state) (ensemble state) Deterministic Stochastic Approximation 0 -1 0 1 2 3 4 5 6 7 8 9 10 Gradient descent: d dt θ = −ε θ , Du θ Hθ − γ θ Hθ Converges* to the minimizer of the mean-square Bellman error: d * Convergence observed in experiments! dt h(x(t)) x=x(t) = Du h (x) For a convex re-formulation of w=ξ(t) the problem, see Mehta & Meyn 2009
    38. 1 (individual state) (ensemble state) Deterministic Stochastic Approximation 0 -1 0 1 2 3 4 5 6 7 8 9 10 Stochastic Approximation θ d dt θ = −εt Lθ t d dt θH (x◦ (t)) − γ θH θ (x◦ (t), u◦ (t)) Lθ := dt H θ (x◦ (t)) + γ(c(x◦ (t) , u◦ (t)) − H θ (x◦ (t), u◦ (t))) t d Gradient descent: d θ θ θ dt θ = −ε , Du θH −γ θH Mean-square Bellman error: d dt h(x(t)) x=x(t) = Du h (x) w=ξ(t)
    39. Outline ? Coarse models - what to do with them? Step 1: Recognize Step 2: Find a stab... Step 3: Optimality Q-learning for nonlinear state space models Step 4: Adjoint Step 5: Interpret Example: Local approximation Example: Decentralized control
    40. Desired behavior Compare Outputs Q learning - Local Learning and learn Inputs Complex system Measured behavior Cubic nonlinearity: d dt x = −x3 + u, c(x, u) = 1 x2 + 1 u2 2 2
    41. Desired behavior Compare Outputs Q learning - Local Learning and learn Inputs Complex system Measured behavior Cubic nonlinearity: d dt x = −x3 + u, c(x, u) = 1 x2 + 1 u2 2 2 HJB: min ( 2 x2 + 1 u2 + (−x3 + u) J ∗ (x)) = γJ ∗ (x) 1 2 u
    42. Desired behavior Compare Outputs Q learning - Local Learning and learn Inputs Complex system Measured behavior Cubic nonlinearity: d dt x = −x3 + u, c(x, u) = 1 x2 + 1 u2 2 2 HJB: min ( 2 x2 + 1 u2 + (−x3 + u) J ∗ (x)) = γJ ∗ (x) 1 2 u Basis: θ x x 2 xu H (x, u) = c(x, u) + θ x + θ 2 u 1 + 2x
    43. Desired behavior Compare Outputs Q learning - Local Learning and learn Inputs Complex system Measured behavior Cubic nonlinearity: d dt x = −x3 + u, c(x, u) = 1 x2 + 1 u2 2 2 HJB: min ( 2 x2 + 1 u2 + (−x3 + u) J ∗ (x)) = γJ ∗ (x) 1 2 u x Basis: H θ (x, u) = c(x, u) + θx x2 + θxu 2 u 1 + 2x 1 1 0.08 Optimal policy Optimal policy 0.06 0.07 0.05 0.06 0.05 0.04 0 0.04 0 0.03 0.03 0.02 0.02 0.01 0.01 −1 −1 −1 0 1 −1 0 1 Low amplitude input High amplitude input u(t) = A(sin(t) + sin(πt) + sin(et))
    44. Outline ? Coarse models - what to do with them? Step 1: Recognize Step 2: Find a stab... Step 3: Optimality Q-learning for nonlinear state space models Step 4: Adjoint Step 5: Interpret Example: Local approximation Example: Decentralized control
    45. M. Huang, P. E. Caines, and R. P. Malhame. Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass Multi-agent model behavior and decentralized ε-Nash equilibria. IEEE Trans. Auto. Control, 52(9):1560–1571, 2007. Huang et. al. Local optimization for global coordination
    46. Multi-agent model Model: Linear autonomous models - global cost objective HJB: Individual state + global average Basis: Consistent with low dimensional LQG model Results from five agent model:
    47. Multi-agent model Model: Linear autonomous models - global cost objective HJB: Individual state + global average Basis: Consistent with low dimensional LQG model Results from five agent model: 1 Estimated state feedback gains 0 (individual state) (ensemble state) time -1 Gains for agent 4: Q-learning sample paths and gains predicted from ∞-agent limit
    48. Outline ? Coarse models - what to do with them? Step 1: Recognize Step 2: Find a stab... Step 3: Optimality Q-learning for nonlinear state space models Step 4: Adjoint Step 5: Interpret Example: Local approximation Example: Decentralized control ... Conclusions
    49. Conclusions Coarse models give tremendous insight They are also tremendously useful for design in approximate dynamic programming algorithms
    50. Conclusions Coarse models give tremendous insight They are also tremendously useful for design in approximate dynamic programming algorithms Q-learning is as fundamental as the Riccati equation - this should be included in our first-year graduate control courses
    51. Conclusions Coarse models give tremendous insight They are also tremendously useful for design in approximate dynamic programming algorithms Q-learning is as fundamental as the Riccati equation - this should be included in our first-year graduate control courses Current research: Algorithm analysis and improvements Applications in biology and economics Analysis of game-theoretic issues in coupled systems
    52. References . PhD thesis, University of London, London, England, 1967. . American Elsevier Pub. Co., New York, NY, 1970. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989. Machine Learning, 8(3-4):279–292, 1992. SIAM J. Control Optim., 38(2):447–469, 2000. on policy iteration. Automatica, 45(2):477 – 484, 2009. Submitted to the 48th IEEE Conference on Decision and Control, December 16-18 2009. [9] C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks. Preprint available at http://moallemi.com/ciamac/research-interests.php, 2008.
    SlideShare Zeitgeist 2009

    + Sean MeynSean Meyn Nominate

    custom

    174 views, 0 favs, 2 embeds more stats

    Abstract: Q-learning is a technique used to comput more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 174
      • 134 on SlideShare
      • 40 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 4
    Most viewed embeds
    • 39 views on http://decision.csl.illinois.edu
    • 1 views on http://black.csl.illinois.edu

    more

    All embeds
    • 39 views on http://decision.csl.illinois.edu
    • 1 views on http://black.csl.illinois.edu

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories