Lecture 2 from https://irdta.eu/deeplearn/2022su/
Covers final chapters of my new book, https://meyn.ece.ufl.edu/2021/08/01/control-systems-and-reinforcement-learning/
All about algorithm design for TD- and Q-learning in a stochastic environment.
1. Part 3: Introduction to Reinforcement Learning
Temporal Difference Techniques
Sean Meyn
Department of Electrical and Computer Engineering University of Florida
Inria International Chair, Inria, Paris
Thanks to our sponsors: NSF and ARO
3. Algorithm Design
[Figure: outcomes of the estimates θ_k and the averaged estimates ϑ_{τ_k} from M repeated runs, ρ = 0.8; histogram of √N θ̃_N^{PR}. Crank up the gain to tame transients; average to tame variance.]
4. Recap Algorithm Design
Steps to a Successful Design, with thanks to Lyapunov and Polyak!
[Figure: outcomes of θ_k and the averaged estimates ϑ_{τ_k} from M repeated runs, ρ = 0.8; crank up the gain to tame transients, average to tame variance.]
Follow these steps:
1 Design f̄ for the mean flow d/dt ϑ = f̄(ϑ)
2 Design the gain: α_n = g/(1 + n/n_e)^ρ; cranking up the gain means ρ < 1
3 Perform PR averaging
4 Repeat! Obtain the histogram of {√(N − N_0) θ̃_N^{PR}(m) : 1 ≤ m ≤ M}, with the initial conditions θ_0^{(m)} widely dispersed and N relatively small
What you will learn: how big N needs to be, and approximate confidence bounds.
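The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the toy mean flow f̄(ϑ) = −(ϑ − 1), the gain parameters, and the run length are all assumptions chosen for demonstration.

```python
import numpy as np

def sa_with_pr_averaging(f, theta0, N, g=1.0, ne=100, rho=0.8, seed=0):
    """SA recursion theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, xi_{n+1}),
    with gain alpha_n = g/(1 + n/ne)^rho, rho < 1, plus Polyak-Ruppert
    averaging of the iterates after an initial transient of N0 steps."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    N0 = N // 2                      # discard the transient before averaging
    pr_sum = np.zeros_like(theta)
    for n in range(1, N + 1):
        alpha = g / (1.0 + n / ne) ** rho
        xi = rng.standard_normal(theta.shape)  # i.i.d. noise stands in for xi_n
        theta = theta + alpha * f(theta, xi)
        if n > N0:
            pr_sum += theta
    return theta, pr_sum / (N - N0)

# Toy mean flow f_bar(v) = -(v - 1): both estimates approach theta* = 1,
# with the PR average far less noisy than the final iterate.
theta_N, theta_PR = sa_with_pr_averaging(
    lambda th, xi: -(th - 1.0) + xi, theta0=np.array([10.0]), N=20000)
```

Running this for a range of widely dispersed θ_0 and moderate N is exactly the histogram experiment described above.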
5. Recap Algorithm Design
Taming Nonlinear Dynamics
What if stability of d/dt ϑ = f̄(ϑ) is unknown? [typically the case in RL]
Newton-Raphson Flow [Smale 1976]
Idea: interpret f̄ as the "parameter": set d/dt f̄_t = V(f̄_t) and design V.
The choice V(f̄) = −f̄ gives linear dynamics:
d/dt f̄(ϑ_t) = −f̄(ϑ_t)  ⟺  d/dt ϑ_t = −[A(ϑ_t)]^{−1} f̄(ϑ_t)
SA translation: Zap Stochastic Approximation
8. Recap Algorithm Design
Zap Algorithm: designed to emulate the Newton-Raphson flow
d/dt ϑ_t = −[A(ϑ_t)]^{−1} f̄(ϑ_t), A(θ) = ∂/∂θ f̄(θ)
Zap-SA (designed to emulate deterministic Newton-Raphson):
θ_{n+1} = θ_n + α_{n+1}[−Â_{n+1}]^{−1} f(θ_n, ξ_{n+1})
Â_{n+1} = Â_n + β_{n+1}(A_{n+1} − Â_n), A_{n+1} = ∂_θ f(θ_n, ξ_{n+1})
The approximation Â_{n+1} ≈ A(θ_n) := ∂_θ f̄(θ_n) requires high gain: β_n/α_n → ∞ as n → ∞.
Can use α_n = 1/n (without averaging). The numerics that follow use this choice, with β_n = (1/n)^ρ, ρ ∈ (0.5, 1).
Stability? Virtually universal. Optimal variance, too! Based on classical theory from Ruppert and Polyak [80, 81, 79].
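The Zap-SA recursion above can be sketched as follows. This is an illustrative toy, not the lecture's code: the linear function f, its constant Jacobian observation, and the gains are assumptions chosen so that the behavior is easy to verify.

```python
import numpy as np

def zap_sa(f, jac_f, theta0, N, rho=0.85, seed=0):
    """Zap-SA sketch: two-time-scale recursion with a high-gain matrix
    estimate A_hat, using alpha_n = 1/n and beta_n = n^{-rho}, rho in (0.5, 1)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    A_hat = -np.eye(d)                       # any negative-definite initialization
    for n in range(1, N + 1):
        xi = rng.standard_normal(d)
        A_n = jac_f(theta, xi)               # noisy Jacobian observation A_{n+1}
        A_hat = A_hat + n ** (-rho) * (A_n - A_hat)
        theta = theta + (1.0 / n) * np.linalg.solve(-A_hat, f(theta, xi))
    return theta

# Toy linear root-finding problem: f(theta, xi) = A(theta - theta*) + noise,
# with A Hurwitz.  Here the Jacobian observation is the constant matrix A;
# in general it depends on theta and the noise.
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
theta_star = np.array([1.0, -1.0])
est = zap_sa(lambda th, xi: A @ (th - theta_star) + 0.1 * xi,
             lambda th, xi: A, theta0=np.zeros(2), N=5000)
```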
14. Recap Optimal Control
Keep In Mind!
[Block diagram: trajectory generation, state feedback, observer, process, and environment, with signals y, u, e, w, d, x_ref, u_fb, u_ff, x̂, reference r, disturbance ∆, noise n, and hidden state x. More feedback!]
Every control problem is multi-objective: many costs to consider, and many measures of risk.
Every control problem is multi-temporal: advance planning, periodic updates, and real-time feedback φ.
There are choices beyond φ?
See CSRL Chapter 2, and Åström and Murray [1].
17. Recap Optimal Control
MDP Model
Keep things simple: state space X and input space U are finite.
X is a stationary controlled Markov chain, with input U: for all states x and sets A,
P{X_{n+1} ∈ A | X_n = x, U_n = u, and prior history} = P_u(x, A)
c: X × U → R_+ is a cost function, and γ < 1 is the discount factor.
Some flexibility with notation: c(X_k, U_k) = c_k, etc.
Q-function:
Q*(x, u) = min over future inputs of Σ_{k≥0} γ^k E[c_k | X_0 = x, U_0 = u]
DP equation:
Q*(x, u) = c(x, u) + γ E[Q*(X_{k+1}) | X_k = x, U_k = u], with the convention H(x) = min_u H(x, u)
Optimal policy: U*_k = φ*(X*_k), where φ*(x) ∈ arg min_u Q*(x, u)
Goal of SC and RL: approximate Q* and/or φ*
Today's lecture: find the best approximation among {Q^θ : θ ∈ R^d}
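The DP equation above can be illustrated by successive approximation on a tiny MDP. The transition matrices and costs below are made-up examples, not from the lecture; value iteration here is only a reference method for computing Q*.

```python
import numpy as np

def q_value_iteration(P, c, gamma, iters=500):
    """Successive approximation for the DP equation
    Q*(x,u) = c(x,u) + gamma * E[min_u' Q*(X_{k+1}, u') | X_k = x, U_k = u].
    P[u] is the transition matrix under input u; c is the |X| x |U| cost table."""
    nX, nU = c.shape
    Q = np.zeros((nX, nU))
    for _ in range(iters):
        Qmin = Q.min(axis=1)  # the "underbar" convention: minimize over the input
        Q = c + gamma * np.stack([P[u] @ Qmin for u in range(nU)], axis=1)
    return Q

# Tiny 2-state, 2-input example (each row of each P[u] sums to 1).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.6, 0.4]])]
c = np.array([[1.0, 2.0], [0.5, 0.3]])
Q = q_value_iteration(P, c, gamma=0.9)
phi_star = Q.argmin(axis=1)  # optimal policy: phi*(x) in argmin_u Q*(x,u)
```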
24. TD Learning: Estimating the State-Action Value Function
TD Learning: given a policy φ, approximate the fixed-policy Q-function, with U(k) = φ(X(k)) for k ≥ 1:
Q^φ(x, u) = Σ_{k≥0} γ^k E[c(X_k, U_k) | X_0 = x, U_0 = u]
Motivation: policy improvement, φ⁺(x) = arg min_u Q^φ(x, u)
DP equation:
Q^φ(x, u) = c(x, u) + γ E[Q^φ(X_{k+1}) | X_k = x, U_k = u]
Q^φ(X_k, U_k) = c(X_k, U_k) + γ E[Q^φ(X_{k+1}) | F_k]
with the new definition H(x) = H(x, φ(x)).
Galerkin relaxation: θ* is a root of the function
f̄(θ) := E[{−Q^θ(X_k, U_k) + c(X_k, U_k) + γ Q^θ(X_{k+1})} ζ_k], where ζ_k ∈ R^d is the eligibility vector
Linear family {Q^θ = θᵀψ}: d/dt ϑ = f̄(ϑ) is stable (for a clever choice of ζ_k, and with the covariance Σ_ψ full rank); alternatively, compute θ* by a single matrix inversion.
How to estimate θ* for a neural-network (nonlinear) architecture? SA to the rescue.
29. TD Learning: Estimating the State-Action Value Function
TD(λ): f̄(θ) := E[{−Q^θ_k + c_k + γ Q^θ_{k+1}} ζ_k]
Choose λ ∈ [0, 1] as you please, and then for each n
ζ_n = Σ_{k=0}^n (γλ)^{n−k} ∇_θ[Q^θ_k] |_{θ=θ_n}
Interpretation for a linear function class: ∇_θ[Q^θ_k] = ψ_k = ψ(X_k, U_k), so the basis observations are passed through a low-pass linear filter.
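The filtered-eligibility recursion above can be sketched for policy evaluation on a finite chain. This is an illustrative toy, not the lecture's code: the chain, the tabular basis, the gain exponent, and the run length are assumptions.

```python
import numpy as np

def td_lambda_linear(P, c, psi, gamma, lam, N, seed=0):
    """TD(lambda) for a fixed policy on a finite Markov chain, linear class
    Q^theta = theta^T psi, eligibility zeta_{n+1} = gamma*lam*zeta_n + psi_{n+1}."""
    rng = np.random.default_rng(seed)
    nX, d = psi.shape
    theta = np.zeros(d)
    zeta = np.zeros(d)
    x = 0
    for n in range(1, N + 1):
        x_next = rng.choice(nX, p=P[x])
        zeta = gamma * lam * zeta + psi[x]                       # low-pass filter
        td = c[x] + gamma * psi[x_next] @ theta - psi[x] @ theta  # temporal difference
        theta = theta + n ** (-0.7) * td * zeta                   # alpha_n = (1/n)^rho
        x = x_next
    return theta

# 2-state chain with the complete (tabular) basis: theta approaches the
# exact value function h = (I - gamma P)^{-1} c.
P = np.array([[0.5, 0.5], [0.3, 0.7]])
c = np.array([1.0, 0.0])
theta = td_lambda_linear(P, c, psi=np.eye(2), gamma=0.5, lam=0.5, N=100000)
h_true = np.linalg.solve(np.eye(2) - 0.5 * P, c)
```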
37. TD Learning: Least-Squares Temporal Difference Learning
LSTD and Zap: f̄(θ) := E[{−Q^θ_k + c_k + γ Q^θ_{k+1}} ζ_k]
We require the linearization matrix:
A(θ, ξ_{k+1}) = ∂_θ{−Q^θ_k + γ Q^θ_{k+1}} ζ_k
             = {−ψ(X_k, U_k) + γ ψ(X_{k+1}, φ(X_{k+1}))} ζ_k   (linear function class)
A* = E[{−ψ(X_k, U_k) + γ ψ(X_{k+1}, φ(X_{k+1}))} ζ_k] is Hurwitz in this case; the only requirement is Σ_ψ ≻ 0.
LSTD(λ)-Learning:
θ_{n+1} = θ_n + α_{n+1}[−Â_{n+1}]^{−1} f(θ_n, ξ_{n+1}), f(θ_n, ξ_{n+1}) = {−Q^{θ_n}_n + c_n + γ Q^{θ_n}_{n+1}} ζ_n
Â_{n+1} = Â_n + α_{n+1}(A_{n+1} − Â_n), A_{n+1} = {−ψ(X_n, U_n) + γ ψ(X_{n+1}, φ(X_{n+1}))} ζ_n
Zap SA doesn't require a two-time-scale algorithm when A(θ) ≡ A*; the convergence rate is optimal (as obtained using PR averaging).
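The "single matrix inversion" route can be sketched as a batch variant of the same idea: estimate A* and b̄ by running averages along one sample path, then solve for θ*. The chain and run length below are again illustrative assumptions.

```python
import numpy as np

def lstd(P, c, psi, gamma, N, seed=0):
    """LSTD(0) sketch for a fixed policy on a finite chain: estimate
    A* = E[psi_k (gamma*psi_{k+1} - psi_k)^T] and b = E[c_k psi_k] by
    running averages, then solve A_hat theta* = -b_hat."""
    rng = np.random.default_rng(seed)
    nX, d = psi.shape
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    x = 0
    for n in range(1, N + 1):
        x_next = rng.choice(nX, p=P[x])
        A_n = np.outer(psi[x], gamma * psi[x_next] - psi[x])
        A_hat += (A_n - A_hat) / n           # running average of A_n
        b_hat += (c[x] * psi[x] - b_hat) / n
        x = x_next
    return np.linalg.solve(A_hat, -b_hat)    # root of f_bar(theta) = A theta + b

P = np.array([[0.5, 0.5], [0.3, 0.7]])
c = np.array([1.0, 0.0])
theta = lstd(P, c, psi=np.eye(2), gamma=0.5, N=100000)
h_true = np.linalg.solve(np.eye(2) - 0.5 * P, c)
```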
42. TD Learning: Least-Squares Temporal Difference Learning
What Is LSTD(λ) Solving? Beautiful theory for λ = 1. Eternal gratitude to John Tsitsiklis!
Answered in the dissertation of Van Roy [18]:
min_θ ‖Q^θ − Q^φ‖²_π = Σ_{x,u} (Q^θ(x, u) − Q^φ(x, u))² π(x, u)
See [21, 20] and Chapter 9 of CSRL.
This is the foundation for actor-critic algorithms in the dissertation of Konda [46]:
θ_{n+1} = θ_n − α_{n+1} ∇̆Γ(n), whose mean flow is f̄(θ) = −∇Γ(θ)
Unbiased gradient observation: ∇̆Γ(n) := Λ^{θ_n}_n Q^{θ_n}_n
Score function: Λ^θ(x, u) = ∇_θ log[φ̃^θ(u | x)]
See also [47] and Chapter 10 of CSRL.
45. TD Learning: Least-Squares Temporal Difference Learning
LSTD(1): Optimal Covariance May Be Large
Example from CTCN [33]:
[Figure: estimates of θ_1 and θ_2 over 5 × 10⁶ iterations, ρ = 0.8 and ρ = 0.9, for h^θ(x) = θ_1 x + θ_2 x², with θ*_1 = θ*_2.]
The architecture is linear, two-dimensional, and ideal in every sense, yet algorithm performance is far from ideal.
52. TD Learning: Changing Our Goals
State Weighting
Design the state weighting so that f(θ, ξ) becomes bounded in ξ:
ζ_{n+1} = γ ζ_n + (1/Ω_{n+1}) ψ_{n+1}
[Figure: the CTCN example revisited, ρ = 0.8 and ρ = 0.9, comparing the weighting Ω(x) = [1 + (µ − α)x]⁴ against Ω(x) ≡ 1.]
Far more flexible and well motivated than introducing λ.
53. TD Learning: Changing Our Goals
Relative LSTD
Far more flexible and well motivated than introducing λ?
ζ_{n+1} = γ ζ_n + (1/Ω_{n+1}) ψ(X_{n+1})
Performance will degrade as γ gets close to one. So, change your objective:
min_θ ‖H^θ − H^φ‖²_{π,Ω}, with H(x, u) = Q(x, u) − Q(x•, u•)  (some fixed normalization)
Relative LSTD: replace ψ(x, u) by ψ̃(x, u) := ψ(x, u) − ψ(x•, u•) wherever it appears.
The covariance is optimal, and uniformly bounded over 0 ≤ γ < 1 [64].
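The relative normalization is a constant shift of Q, so it leaves the greedy policy unchanged, which is why H is sufficient for the policy update. A two-line check, with arbitrary made-up Q values:

```python
import numpy as np

# H(x,u) = Q(x,u) - Q(x_ref, u_ref) is a constant shift of Q, so
# argmin_u H(x,u) = argmin_u Q(x,u) for every state x.
rng = np.random.default_rng(0)
Q = rng.random((5, 3))   # arbitrary Q-values: 5 states, 3 inputs
H = Q - Q[0, 0]          # relative Q-function with (x_ref, u_ref) = (0, 0)
```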
58. TD Learning: Changing Our Goals
Relative LSTD
[Figure 1: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem of [4]; maximal Bellman error over n ≤ 10⁵, for 1/(1 − γ) = 10³ and 10⁴, with gains g = 1/(1 − γ) and g = 1/(1 − ρ*γ). The relative Q-learning algorithm is unaffected by large discounting.]
Relative LSTD: replace ψ(x, u) by ψ̃(x, u) := ψ(x, u) − ψ(x•, u•) wherever it appears.
The covariance is optimal, and uniformly bounded over 0 ≤ γ < 1 [64].
59. Exploration: On- and Off-Policy Algorithms
Off Policy for Exploration
In TD-learning, sufficient exploration means Σ_ψ = E_π[ψ_k ψ_kᵀ] ≻ 0. This ensures convergence, solving f̄(θ*) = 0, with perhaps θ* = arg min_θ ‖Q^θ − Q^φ‖²_{π,Ω}.
[Figure: invariant density under the optimal policy, small input versus large input [57].]
Not always possible! Consider X_{n+1} = aX_n + U_n with U_n = φ(X_n) = −KX_n: if |a − K| < 1, then the steady state is supported on X_n = U_n ≡ 0.
Off Policy TD(λ):
θ_{n+1} = θ_n + α_{n+1} f(θ_n, ξ_{n+1}) = θ_n + α_{n+1}{−Q^{θ_n}_n + c_n + γ Q^{θ_n}_{n+1}} ζ_n
ζ_{n+1} = γλ ζ_n + ∇_θ[Q^θ_{n+1}] |_{θ=θ_{n+1}}
The difference:
On policy: Q^{θ_n}_{n+1} = Q^{θ_n}(X_{n+1}, U_{n+1}), with U_{n+1} = φ(X_{n+1})
Off policy: Q^{θ_n}_{n+1} = Q^{θ_n}(X_{n+1}, U_{n+1}), with U_{n+1} = φ̃(X_{n+1})
Example: U_{n+1} = φ̃(X_{n+1}) = φ(X_{n+1}) with probability 1 − ε, and a uniformly random input otherwise.
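The randomized policy φ̃ in the example above can be sketched directly; the Q values and the value of ε below are arbitrary illustrations.

```python
import numpy as np

def epsilon_greedy(Q_row, eps, rng):
    """Randomized exploration policy phi~: the greedy input argmin_u Q(x,u)
    with probability 1 - eps, otherwise a uniformly random input."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmin(Q_row))

rng = np.random.default_rng(0)
Q_row = np.array([3.0, 1.0, 2.0])   # Q(x, .) at the current state: greedy input is 1
inputs = [epsilon_greedy(Q_row, 0.1, rng) for _ in range(1000)]
```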
71. Exploration: What Are We Trying To Accomplish?
Wide World of Exploration
How would this intelligent agent explore her environment to maximize her rewards?
Exploration might lead to a meeting with the cat. She will come to the conclusion that the delicious smell is not cheese. Would she continue to meet up with the cat, and smell the poop, in order to obtain the best estimate of Q*?
As intelligent agents we, like the mouse, must not forget our goals!
76. Q Learning
[Figure 1, repeated: comparison of Q-learning and relative Q-learning for the stochastic shortest path problem of [4].]
77. Q Learning: Galerkin Relaxation
Dynamic programming: find the function Q* that solves (F_n denotes the history)
E[c(X_n, U_n) + γ Q*(X_{n+1}) − Q*(X_n, U_n) | F_n] = 0, with the convention H(x) = min_u H(x, u)
Hidden Goal of Q-Learning (including DQN)
Given {Q^θ : θ ∈ R^d}, find θ* that solves f̄(θ*) = 0, where
f̄(θ) := E[{c(X_n, U_n) + γ Q^θ(X_{n+1}) − Q^θ(X_n, U_n)} ζ_n]
The family {Q^θ} and the eligibility vectors {ζ_n} are part of algorithm design.
101. Q Learning: Watkins' Q-Learning
Watkins' Q-learning:
E[{c(X_n, U_n) + γ Q^{θ*}(X_{n+1}) − Q^{θ*}(X_n, U_n)} ζ_n] = 0
Watkins' algorithm is a special case of Q(0)-learning. The family {Q^θ} and eligibility vectors {ζ_n} in this design:
Linearly parameterized family of functions: Q^θ(x, u) = θᵀψ(x, u)
ζ_n ≡ ψ(X_n, U_n)
ψ_i(x, u) = 1{x = xⁱ, u = uⁱ} (complete basis ≡ tabular Q-learning)
Convergence of Q^{θ_n} to Q* holds under mild conditions.
However, the asymptotic covariance is infinite for γ ≥ 1/2 [31, 62]:
(1/α_n) Cov(θ̃_n) → ∞ as n → ∞, using the standard step-size rule α_n = 1/n(x, u).
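Tabular Watkins Q-learning with the local-clock step-size α_n = 1/n(x, u) can be sketched as follows. The tiny MDP, the fully randomized input, and the discount factor γ = 0.3 (chosen below 1/2 so the standard step-size rule has finite asymptotic covariance) are illustrative assumptions.

```python
import numpy as np

def watkins_q(P, c, gamma, N, seed=0):
    """Tabular Watkins Q-learning sketch: fully randomized input for
    exploration, with the local-clock step-size alpha_n = 1/n(x,u)."""
    rng = np.random.default_rng(seed)
    nU, nX = len(P), c.shape[0]
    Q = np.zeros((nX, nU))
    visits = np.zeros((nX, nU))
    x = 0
    for _ in range(N):
        u = int(rng.integers(nU))            # uniformly random input
        x_next = rng.choice(nX, p=P[u][x])
        visits[x, u] += 1
        td = c[x, u] + gamma * Q[x_next].min() - Q[x, u]
        Q[x, u] += td / visits[x, u]         # step-size 1/n(x,u)
        x = x_next
    return Q

# Tiny MDP; gamma = 0.3 keeps the variance of the recursion well behaved.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.6, 0.4]])]
c = np.array([[1.0, 2.0], [0.5, 0.3]])
Q = watkins_q(P, c, gamma=0.3, N=200000)
# Reference solution Q* by value iteration on the same MDP.
Qstar = np.zeros_like(c)
for _ in range(200):
    Qstar = c + 0.3 * np.stack([P[u] @ Qstar.min(axis=1) for u in range(2)], axis=1)
```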
103. Q Learning: Watkins' Q-Learning
Asymptotic Covariance of Watkins' Q-Learning
This is what infinite variance looks like:
σ² = lim_{n→∞} n E[‖θ_n − θ*‖²] = ∞
Wild oscillations? Not at all: the sample paths appear frozen.
[Figure: histogram of the parameter estimates θ_n(15) after n = 10⁶ iterations, compared with θ*(15) = 486.6. Example from [31, 62], 2017.]
[Figure: sample paths using a higher gain, or relative Q-learning (the Figure 1 comparison on the stochastic shortest path problem of [4]). Example from [31, 62], 2017, and [64], CSRL, 2021.]
106. Q Learning: Watkins' Q-Learning
Asymptotic Covariance of Watkins' Q-Learning
Can we do better?
[Figure 1, repeated: Q-learning vs. relative Q-learning for the stochastic shortest path problem of [4]; the relative Q-learning algorithm is unaffected by large discounting.]
Our intelligent mouse might offer clues.
109. Zap Q
Motivation
The ODE method begins with design of the ODE: d/dt ϑ = f̄(ϑ).
Challenges we have faced with Q-learning. How can we design the dynamics so that
1 Stability holds (few results outside of Watkins' tabular setting)
2 f̄(θ*) = 0 solves a relevant problem, or has a solution at all
And how can we better manage the problems introduced by 1/(1 − γ)? Relative Q-learning is one approach.
Assuming we have solved 2, approximate the Newton-Raphson flow:
d/dt f̄(ϑ_t) = −f̄(ϑ_t), giving f̄(ϑ_t) = f̄(ϑ_0) e^{−t}
113. Zap Q
Zap Algorithm: designed to emulate the Newton-Raphson flow
d/dt ϑ_t = −[A(ϑ_t)]^{−1} f̄(ϑ_t), A(θ) = ∂/∂θ f̄(θ)
Zap-SA:
θ_{n+1} = θ_n + α_{n+1} G_{n+1} f(θ_n, Φ_{n+1}), G_{n+1} = [−Â_{n+1}]^{−1}
Â_{n+1} = Â_n + β_{n+1}(A_{n+1} − Â_n), A_{n+1} = ∂_θ f(θ_n, Φ_{n+1})
Â_{n+1} ≈ A(θ_n) requires high gain: β_n/α_n → ∞ as n → ∞.
Numerics that follow: α_n = 1/n, β_n = (1/n)^ρ, ρ ∈ (0.5, 1).
Zap Q-Learning: f(θ_n, Φ_{n+1}) = {c(X_n, U_n) + γ min_u Q^{θ_n}(X_{n+1}, u) − Q^{θ_n}(X_n, U_n)} ζ_n, with ζ_n = ∇_θ Q^θ(X_n, U_n) |_{θ=θ_n}
118. Zap Q
Challenges
Q-learning: {Q^θ(x, u) : θ ∈ R^d, u ∈ U, x ∈ X}. Find θ* such that f̄(θ*) = 0, with
f̄(θ) = E[{c(X_n, U_n) + γ Q^θ(X_{n+1}) − Q^θ(X_n, U_n)} ζ_n]
What makes the theory difficult:
1 Does f̄ have a root?
2 Does the inverse of A exist?
3 SA theory is weak for a discontinuous ODE ⇐ the biggest technical challenge
⇒ Resolved by exploiting special structure, even for NN function approximation [32, 9].
120. Zap Q: Zap Along the Walk to the Cafe
Zap Q-Learning: Optimize the Walk to the Cafe
Convergence with Zap gain β_n = n^{−0.85}; infinite covariance with α_n = 1/n or 1/n(x, u).
[Figure: convergence of Zap-Q learning, discount factor γ = 0.99. Bellman error over n ≤ 10⁶ for Watkins with g = 1500 and α_n = g/n, Speedy Q-learning, Polyak-Ruppert averaging, Zap with β_n = α_n, and Zap.]
123. Zap Q: Zap Along the Walk to the Cafe
Zap Q-Learning: Optimize the Walk to the Cafe
Convergence with Zap gain β_n = n^{−0.85}.
[Figure: theoretical vs. experimental pdf of W_n = √n θ̃_n for entries #10 and #18, at n = 10⁴ and n = 10⁶; empirical histogram from 1000 trials; discount factor γ = 0.99.]
The CLT gives a good prediction of finite-n performance. Do not forget to use this in practice! Perform multiple runs to estimate the CLT variance: this requires a wide range of initial conditions, but a smaller run length as you fine-tune your algorithm.
127. Zap Q: Zap Q-Learning with Neural Networks
Zap with Neural Networks: ζ_n = ∇_θ Q^θ |_{θ=θ_n}, computed using back-propagation.
A few things to note:
As far as we know, the empirical success of plain-vanilla DQN is extraordinary (however, nobody reports failure).
Zap Q-learning is the only approach for which convergence has been established (under mild conditions).
We can expect stunning reliability and transient performance, with Q^θ parameterized by various neural networks, based on coupling with the Newton-Raphson flow.
132. Conclusions: Future Directions
Conclusions
Reinforcement learning is cursed by dimension, variance, and nonlinear (algorithm) dynamics. We have a big basket of tools for reinforcement learning; many algorithms fail due to bad design, or unrealistic goals.
Among the tools we have for taming algorithmic volatility:
Zap: a second-order method that optimizes variance and transient behavior
State weighting: more justifiable and flexible than the "λ hack"
Relative TD/Q-learning: H^φ(x, u) = Q^φ(x, u) − Q^φ(x•, u•) is sufficient for the policy update
Clever exploration: the bandits literature has vast resources [13]
139. Conclusions Future Directions
Outlook
Beyond the projected Bellman error for Q-learning [57, 58, 59, 60]
Beyond the shackles of DP equations?
Acceleration techniques (momentum and matrix momentum)
See also Zap-Zero in CSRL
[Figure: span norm ‖Qn − Q*‖sp versus iteration n (×10^4), comparing Watkins' Q-learning with gain g = 1/(1 − γ), PR averaging, Zap, and Zap-Zero]
Further variance reduction using control variates
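The span (semi)norm used to measure error in the Q-learning comparison is simply max minus min; it vanishes on constant vectors, which is the right notion when value functions are only determined up to an additive constant. A minimal sketch:

```python
import numpy as np

def span_norm(v):
    """Span seminorm: max_i v_i - min_i v_i.
    Zero on constant vectors, so it measures convergence
    of Q_n to Q* up to an additive constant."""
    v = np.asarray(v, dtype=float)
    return float(v.max() - v.min())

q_error = np.array([3.0, 1.0, -2.0])
assert span_norm(q_error) == 5.0
assert span_norm(q_error + 100.0) == 5.0  # invariant under constant shifts
```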
Thank you for your endurance!
I welcome questions and comments now and in the future
31 / 52
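The control-variate idea mentioned in the outlook can be illustrated on a toy Monte Carlo problem (a sketch of the general technique, not the lecture's RL application): subtract a correlated quantity with known mean, leaving the estimator unbiased but with smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Estimate E[exp(U)] for U ~ Uniform(0, 1); the true value is e - 1.
U = rng.random(N)
Y = np.exp(U)

# Control variate: C = U, with known mean E[U] = 1/2.
C = U
b = np.cov(Y, C)[0, 1] / np.var(C)   # empirical optimal coefficient
Y_cv = Y - b * (C - 0.5)

# Both estimators are unbiased; the control-variate version
# has much smaller variance.
plain, cv = Y.mean(), Y_cv.mean()
assert Y_cv.var() < Y.var()
```

In the RL setting the same principle applies with an approximate value function playing the role of the control variate.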
142. References
[Book covers: Control Techniques for Complex Networks, Sean Meyn; Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie]
References
32 / 52
143. References
Control Background I
[1] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and
Engineers. Princeton University Press, USA, 2008 (recent edition on-line).
[2] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 2nd edition, 1994.
[3] A. Fradkov and B. T. Polyak. Adaptive and robust control in the USSR.
IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress.
[4] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos. Nonlinear and adaptive control
design. John Wiley & Sons, Inc., 1995.
[5] K. J. Åström. Theory and applications of adaptive control—a survey. Automatica,
19(5):471–486, 1983.
[6] K. J. Åström. Adaptive control around 1960. IEEE Control Systems Magazine,
16(3):44–49, 1996.
[7] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic
Control, 22(4):551–575, 1977.
33 / 52
144. References
Control Background II
[8] N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to
reinforcement learning and back again. In Proc. of the IEEE Conf. on Dec. and Control,
pages 3724–3740, 2019.
34 / 52
145. References
RL Background I
[9] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press,
Cambridge, 2021.
[10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press. On-line
edition at http://www.cs.ualberta.ca/~sutton/book/the-book.html, Cambridge,
MA, 2nd edition, 2018.
[11] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[12] D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, Belmont,
MA, 2019.
[13] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020.
[14] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning,
16:185–202, 1994.
35 / 52
146. References
RL Background II
[17] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic
programming algorithms. Neural Computation, 6:1185–1201, 1994.
[18] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes.
PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1998.
[19] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic
programming. Mach. Learn., 22(1-3):59–94, 1996.
[20] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[21] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica,
35(11):1799–1808, 1999.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
36 / 52
147. References
RL Background III
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[28] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[29] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[30] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
37 / 52
148. References
RL Background IV
[31] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural
Information Processing Systems, pages 2232–2241, 2017.
[32] S. Chen, A. M. Devraj, F. Lu, A. Busic, and S. Meyn. Zap Q-Learning with nonlinear
function approximation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and
H. Lin, editors, Advances in Neural Information Processing Systems, and arXiv e-prints
1910.05405, volume 33, pages 16879–16890, 2020.
[33] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
DQN:
[34] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural
reinforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and
L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg,
2005. Springer Berlin Heidelberg.
[35] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement
learning, pages 45–73. Springer, 2012.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A.
Riedmiller. Playing Atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013.
38 / 52
149. References
RL Background V
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik,
I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level
control through deep reinforcement learning. Nature, 518:529–533, 2015.
[38] O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance reduction and
stabilization for deep reinforcement learning. In Proc. of ICML, pages 176–185.
JMLR.org, 2017.
Actor Critic / Policy Gradient
[39] P. J. Schweitzer. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403,
1968.
[40] C. D. Meyer, Jr. The role of the group generalized inverse in the theory of finite Markov
chains. SIAM Review, 17(3):443–464, 1975.
[41] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of
the 18th conference on Winter simulation, pages 356–365, 1986.
[42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
39 / 52
150. References
RL Background VI
[43] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially
observable Markov decision problems. In Advances in neural information processing
systems, pages 345–352, 1995.
[44] X.-R. Cao and H.-F. Chen. Perturbation realization, potentials, and sensitivity analysis of
Markov processes. IEEE Transactions on Automatic Control, 42(10):1382–1393, Oct
1997.
[45] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
[46] V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology,
2002.
[47] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural
information processing systems, pages 1008–1014, 2000.
[48] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for
reinforcement learning with function approximation. In Advances in neural information
processing systems, pages 1057–1063, 2000.
[49] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
40 / 52
151. References
RL Background VII
[50] S. M. Kakade. A natural policy gradient. In Advances in neural information processing
systems, pages 1531–1538, 2002.
[51] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach
to reinforcement learning. In Advances in Neural Information Processing Systems, pages
1800–1809, 2018.
MDPs, LPs and Convex Q:
[52] A. S. Manne. Linear programming and sequential decisions. Management Sci.,
6(3):259–267, 1960.
[53] C. Derman. Finite State Markovian Decision Processes, volume 67 of Mathematics in
Science and Engineering. Academic Press, Inc., 1970.
[54] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of
Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci.,
pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
[55] D. P. de Farias and B. Van Roy. The linear programming approach to approximate
dynamic programming. Operations Res., 51(6):850–865, 2003.
41 / 52
152. References
RL Background VIII
[56] D. P. de Farias and B. Van Roy. A cost-shaping linear program for average-cost
approximate dynamic programming with performance guarantees. Math. Oper. Res.,
31(3):597–620, 2006.
[57] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of
the IEEE Conf. on Dec. and Control, pages 3598–3605, Dec. 2009.
[58] J. Bas Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In A. Banerjee and
K. Fukumizu, editors, Proc. of The Intl. Conference on Artificial Intelligence and
Statistics, volume 130, pages 3610–3618, 13–15 Apr 2021.
[59] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control
Conf., pages 4749–4756. IEEE, 2021.
[60] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex analytic theory for convex
Q-learning. In Conference on Decision and Control–to appear. IEEE, 2022.
Gator Nation:
[61] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement
learning algorithms. In K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever,
editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision
and Control series (SSDC, volume 325). Springer, 2021.
42 / 52
153. References
RL Background IX
[62] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017
(extended version of NIPS 2017).
[63] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis,
University of Florida, 2019.
[64] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance: Large
discounting is not a barrier to fast learning. IEEE Trans Auto Control (and
arXiv:2002.10301), 2021.
[65] A. M. Devraj, A. Bušić, and S. Meyn. On matrix momentum stochastic approximation
and applications to Q-learning. In Allerton Conference on Communication, Control, and
Computing, pages 749–756, Sep 2019.
43 / 52
154. References
Stochastic Miscellanea I
[66] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57
of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2007.
[67] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation.
Ann. Probab., 24(2):916–931, 1996.
[68] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[69] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018.
44 / 52
155. References
Stochastic Approximation I
[70] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press, Delhi, India Cambridge, UK, 2008.
[71] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[72] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[73] V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for
asymptotic statistics in stochastic approximation and reinforcement learning. arXiv
e-prints:2110.14427, pages 1–50, 2021.
[74] M. Benaı̈m. Dynamics of stochastic approximation algorithms. In Séminaire de
Probabilités, XXXIII, pages 1–68. Springer, Berlin, 1999.
[75] V. Borkar and S. P. Meyn. Oja’s algorithm for graph clustering, Markov spectral
decomposition, and risk sensitive control. Automatica, 48(10):2512–2519, 2012.
[76] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression
function. Ann. Math. Statist., 23(3):462–466, 09 1952.
45 / 52
156. References
Stochastic Approximation II
[77] J. C. Spall. Introduction to stochastic search and optimization: estimation, simulation,
and control. John Wiley & Sons, 2003.
[78] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[79] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
[80] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika, 98–107, 1990 (in Russian). Translated in Automat. Remote Control, 51
1991.
[81] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[82] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[83] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
46 / 52
157. References
Stochastic Approximation III
[84] W. Mou, C. Junchi Li, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. On Linear
Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic
Concentration. arXiv e-prints, page arXiv:2004.04719, Apr. 2020.
47 / 52
158. References
Optimization and ODEs I
[85] W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's
accelerated gradient method: Theory and insights. In Advances in neural information
processing systems, pages 2510–2518, 2014.
[86] B. Shi, S. S. Du, W. Su, and M. I. Jordan. Acceleration via symplectic discretization of
high-resolution differential equations. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information
Processing Systems 32, pages 5744–5752. Curran Associates, Inc., 2019.
[87] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[88] Y. Nesterov. A method of solving a convex programming problem with convergence rate
O(1/k²). In Soviet Mathematics Doklady, 1983.
48 / 52
159. References
QSA and Extremum Seeking Control I
[89] C. K. Lauand and S. Meyn. Extremely fast convergence rates for extremum seeking
control with Polyak-Ruppert averaging. arXiv 2206.00814, 2022.
[90] C. K. Lauand and S. Meyn. Markovian foundations for quasi stochastic approximation
with applications to extremum seeking control, 2022.
[91] B. Lapeyre, G. Pagès, and K. Sab. Sequences with low discrepancy: generalisation and
application to Robbins-Monro algorithm. Statistics, 21(2):251–272, 1990.
[92] S. Laruelle and G. Pagès. Stochastic approximation with averaging innovation applied to
finance. Monte Carlo Methods and Applications, 18(1):1–51, 2012.
[93] S. Shirodkar and S. Meyn. Quasi stochastic approximation. In Proc. of the 2011 American
Control Conference (ACC), pages 2429–2435, July 2011.
[94] S. Chen, A. Devraj, A. Bernstein, and S. Meyn. Revisiting the ODE method for recursive
algorithms: Fast convergence using quasi stochastic approximation.
Journal of Systems Science and Complexity, 34(5):1681–1702, 2021.
[95] Y. Chen, A. Bernstein, A. Devraj, and S. Meyn. Model-Free Primal-Dual Methods for
Network Optimization with Application to Real-Time Optimal Power Flow. In Proc. of
the American Control Conf., pages 3140–3147, Sept. 2019.
49 / 52
160. References
QSA and Extremum Seeking Control II
[96] S. Bhatnagar and V. S. Borkar. Multiscale chaotic SPSA and smoothed functional
algorithms for simulation optimization. Simulation, 79(10):568–580, 2003.
[97] S. Bhatnagar, M. C. Fu, S. I. Marcus, and I.-J. Wang. Two-timescale simultaneous
perturbation stochastic approximation using deterministic perturbation sequences. ACM
Transactions on Modeling and Computer Simulation (TOMACS), 13(2):180–209, 2003.
[98] M. Le Blanc. Sur l'électrification des chemins de fer au moyen de courants alternatifs de
fréquence élevée [On the electrification of railways by means of alternating currents of
high frequency]. Revue Générale de l'Électricité, 12(8):275–277, 1922.
[99] Y. Tan, W. H. Moase, C. Manzie, D. Nešić, and I. M. Y. Mareels. Extremum seeking from
1922 to 2010. In Proceedings of the 29th Chinese Control Conference, pages 14–26, July
2010.
[100] P. F. Blackman. Extremum-seeking regulators. In An Exposition of Adaptive Control.
Macmillan, 1962.
[101] J. Sternby. Adaptive control of extremum systems. In H. Unbehauen, editor, Methods and
Applications in Adaptive Control, pages 151–160, Berlin, Heidelberg, 1980. Springer
Berlin Heidelberg.
50 / 52
161. References
QSA and Extremum Seeking Control III
[102] J. Sternby. Extremum control systems–an area for adaptive control? In Joint Automatic
Control Conference, number 17, page 8, 1980.
[103] K. B. Ariyur and M. Krstić. Real Time Optimization by Extremum Seeking Control. John
Wiley & Sons, Inc., New York, NY, USA, 2003.
[104] M. Krstić and H.-H. Wang. Stability of extremum seeking feedback for general nonlinear
dynamic systems. Automatica, 36(4):595–601, 2000.
[105] S. Liu and M. Krstic. Introduction to extremum seeking. In Stochastic Averaging and
Stochastic Extremum Seeking, Communications and Control Engineering. Springer,
London, 2012.
[106] O. Trollberg and E. W. Jacobsen. On the convergence rate of extremum seeking control.
In European Control Conference (ECC), pages 2115–2120, 2014.
[107] Y. Bugeaud. Linear forms in logarithms and applications, volume 28 of IRMA Lectures in
Mathematics and Theoretical Physics. EMS Press, 2018.
[108] G. Wüstholz, editor. A Panorama of Number Theory—Or—The View from Baker’s
Garden. Cambridge University Press, 2002.
51 / 52
162. References
Selected Applications I
[109] N. S. Raman, A. M. Devraj, P. Barooah, and S. P. Meyn. Reinforcement learning for
control of building HVAC systems. In American Control Conference, July 2020.
[110] K. Mason and S. Grijalva. A review of reinforcement learning for autonomous building
energy management. arXiv.org, 2019. arXiv:1903.05196.
52 / 52