Lecture 2 from https://irdta.eu/deeplearn/2022su/
Covers final chapters of my new book, https://meyn.ece.ufl.edu/2021/08/01/control-systems-and-reinforcement-learning/
All about algorithm design for TD- and Q-learning in a stochastic environment.
1. Part 3: Introduction to Reinforcement Learning
Temporal Difference Techniques
Sean Meyn
Department of Electrical and Computer Engineering University of Florida
Inria International Chair, Inria, Paris
Thanks to our sponsors: NSF and ARO
3. Algorithm Design
[Figure: outcomes of the estimates θ_k and the averaged estimates ϑ_{τ_k} from M repeated runs, ρ = 0.8; histogram of √N θ̃_N^{PR}. Crank up the gain to tame transients; average to tame variance.]
4. Recap Algorithm Design
Steps to a Successful Design, with thanks to Lyapunov and Polyak!
[Figure: outcomes of θ_k and the averaged estimates ϑ_{τ_k} from M repeated runs, ρ = 0.8; crank up the gain to tame transients, average to tame variance.]
Follow these steps:
1 Design f̄ for the mean flow d/dt ϑ = f̄(ϑ)
2 Design the gain: α_n = g/(1 + n/n_e)^ρ; cranking up the gain means ρ < 1
3 Perform PR averaging
4 Repeat! Obtain the histogram of {√(N − N_0) θ̃_N^{PR}(m) : 1 ≤ m ≤ M}, with the initial conditions θ_0^{(m)} widely dispersed and N relatively small
What you will learn: how big N needs to be, and approximate confidence bounds.
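The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the toy mean flow f̄(ϑ) = −(ϑ − 1), the gain parameters, and the run length are all assumptions chosen for demonstration.

```python
import numpy as np

def sa_with_pr_averaging(f, theta0, N, g=1.0, ne=100, rho=0.8, seed=0):
    """SA recursion theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, xi_{n+1}),
    with gain alpha_n = g/(1 + n/ne)^rho, rho < 1, plus Polyak-Ruppert
    averaging of the iterates after an initial transient of N0 steps."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    N0 = N // 2                      # discard the transient before averaging
    pr_sum = np.zeros_like(theta)
    for n in range(1, N + 1):
        alpha = g / (1.0 + n / ne) ** rho
        xi = rng.standard_normal(theta.shape)  # i.i.d. noise stands in for xi_n
        theta = theta + alpha * f(theta, xi)
        if n > N0:
            pr_sum += theta
    return theta, pr_sum / (N - N0)

# Toy mean flow f_bar(v) = -(v - 1): both estimates approach theta* = 1,
# with the PR average far less noisy than the final iterate.
theta_N, theta_PR = sa_with_pr_averaging(
    lambda th, xi: -(th - 1.0) + xi, theta0=np.array([10.0]), N=20000)
```

Running this for a range of widely dispersed θ_0 and moderate N is exactly the histogram experiment described above.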
5. Recap Algorithm Design
Taming Nonlinear Dynamics
What if stability of d/dt ϑ = f̄(ϑ) is unknown? [typically the case in RL]
Newton-Raphson Flow [Smale 1976]
Idea: interpret f̄ as the "parameter": set d/dt f̄_t = V(f̄_t) and design V.
The choice V(f̄) = −f̄ gives linear dynamics:
d/dt f̄(ϑ_t) = −f̄(ϑ_t)  ⟺  d/dt ϑ_t = −[A(ϑ_t)]^{−1} f̄(ϑ_t)
SA translation: Zap Stochastic Approximation
8. Recap Algorithm Design
Zap Algorithm: designed to emulate the Newton-Raphson flow
d/dt ϑ_t = −[A(ϑ_t)]^{−1} f̄(ϑ_t), A(θ) = ∂/∂θ f̄(θ)
Zap-SA (designed to emulate deterministic Newton-Raphson):
θ_{n+1} = θ_n + α_{n+1}[−Â_{n+1}]^{−1} f(θ_n, ξ_{n+1})
Â_{n+1} = Â_n + β_{n+1}(A_{n+1} − Â_n), A_{n+1} = ∂_θ f(θ_n, ξ_{n+1})
The approximation Â_{n+1} ≈ A(θ_n) := ∂_θ f̄(θ_n) requires high gain: β_n/α_n → ∞ as n → ∞.
Can use α_n = 1/n (without averaging). The numerics that follow use this choice, with β_n = (1/n)^ρ, ρ ∈ (0.5, 1).
Stability? Virtually universal. Optimal variance, too! Based on classical theory from Ruppert and Polyak [80, 81, 79].
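The Zap-SA recursion above can be sketched as follows. This is an illustrative toy, not the lecture's code: the linear function f, its constant Jacobian observation, and the gains are assumptions chosen so that the behavior is easy to verify.

```python
import numpy as np

def zap_sa(f, jac_f, theta0, N, rho=0.85, seed=0):
    """Zap-SA sketch: two-time-scale recursion with a high-gain matrix
    estimate A_hat, using alpha_n = 1/n and beta_n = n^{-rho}, rho in (0.5, 1)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    A_hat = -np.eye(d)                       # any negative-definite initialization
    for n in range(1, N + 1):
        xi = rng.standard_normal(d)
        A_n = jac_f(theta, xi)               # noisy Jacobian observation A_{n+1}
        A_hat = A_hat + n ** (-rho) * (A_n - A_hat)
        theta = theta + (1.0 / n) * np.linalg.solve(-A_hat, f(theta, xi))
    return theta

# Toy linear root-finding problem: f(theta, xi) = A(theta - theta*) + noise,
# with A Hurwitz.  Here the Jacobian observation is the constant matrix A;
# in general it depends on theta and the noise.
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
theta_star = np.array([1.0, -1.0])
est = zap_sa(lambda th, xi: A @ (th - theta_star) + 0.1 * xi,
             lambda th, xi: A, theta0=np.zeros(2), N=5000)
```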
14. Recap Optimal Control
Keep In Mind!
[Block diagram: trajectory generation, state feedback, observer, process, and environment, with signals y, u, e, w, d, x_ref, u_fb, u_ff, x̂, reference r, disturbance ∆, noise n, and hidden state x. More feedback!]
Every control problem is multi-objective: many costs to consider, and many measures of risk.
Every control problem is multi-temporal: advance planning, periodic updates, and real-time feedback φ.
There are choices beyond φ?
See CSRL Chapter 2, and Åström and Murray [1].
17. Recap Optimal Control
MDP Model
Keep things simple: state space X and input space U are finite.
X is a stationary controlled Markov chain, with input U: for all states x and sets A,
P{X_{n+1} ∈ A | X_n = x, U_n = u, and prior history} = P_u(x, A)
c: X × U → R_+ is a cost function, and γ < 1 is the discount factor.
Some flexibility with notation: c(X_k, U_k) = c_k, etc.
Q-function:
Q*(x, u) = min over future inputs of Σ_{k≥0} γ^k E[c_k | X_0 = x, U_0 = u]
DP equation:
Q*(x, u) = c(x, u) + γ E[Q*(X_{k+1}) | X_k = x, U_k = u], with the convention H(x) = min_u H(x, u)
Optimal policy: U*_k = φ*(X*_k), where φ*(x) ∈ arg min_u Q*(x, u)
Goal of SC and RL: approximate Q* and/or φ*
Today's lecture: find the best approximation among {Q^θ : θ ∈ R^d}
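The DP equation above can be illustrated by successive approximation on a tiny MDP. The transition matrices and costs below are made-up examples, not from the lecture; value iteration here is only a reference method for computing Q*.

```python
import numpy as np

def q_value_iteration(P, c, gamma, iters=500):
    """Successive approximation for the DP equation
    Q*(x,u) = c(x,u) + gamma * E[min_u' Q*(X_{k+1}, u') | X_k = x, U_k = u].
    P[u] is the transition matrix under input u; c is the |X| x |U| cost table."""
    nX, nU = c.shape
    Q = np.zeros((nX, nU))
    for _ in range(iters):
        Qmin = Q.min(axis=1)  # the "underbar" convention: minimize over the input
        Q = c + gamma * np.stack([P[u] @ Qmin for u in range(nU)], axis=1)
    return Q

# Tiny 2-state, 2-input example (each row of each P[u] sums to 1).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.6, 0.4]])]
c = np.array([[1.0, 2.0], [0.5, 0.3]])
Q = q_value_iteration(P, c, gamma=0.9)
phi_star = Q.argmin(axis=1)  # optimal policy: phi*(x) in argmin_u Q*(x,u)
```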
24. TD Learning: Estimating the State-Action Value Function
TD Learning: given a policy φ, approximate the fixed-policy Q-function, with U(k) = φ(X(k)) for k ≥ 1:
Q^φ(x, u) = Σ_{k≥0} γ^k E[c(X_k, U_k) | X_0 = x, U_0 = u]
Motivation: policy improvement, φ⁺(x) = arg min_u Q^φ(x, u)
DP equation:
Q^φ(x, u) = c(x, u) + γ E[Q^φ(X_{k+1}) | X_k = x, U_k = u]
Q^φ(X_k, U_k) = c(X_k, U_k) + γ E[Q^φ(X_{k+1}) | F_k]
with the new definition H(x) = H(x, φ(x)).
Galerkin relaxation: θ* is a root of the function
f̄(θ) := E[{−Q^θ(X_k, U_k) + c(X_k, U_k) + γ Q^θ(X_{k+1})} ζ_k], where ζ_k ∈ R^d is the eligibility vector
Linear family {Q^θ = θᵀψ}: d/dt ϑ = f̄(ϑ) is stable (for a clever choice of ζ_k, and with the covariance Σ_ψ full rank); alternatively, compute θ* by a single matrix inversion.
How to estimate θ* for a neural-network (nonlinear) architecture? SA to the rescue.
29. TD Learning: Estimating the State-Action Value Function
TD(λ): f̄(θ) := E[{−Q^θ_k + c_k + γ Q^θ_{k+1}} ζ_k]
Choose λ ∈ [0, 1] as you please, and then for each n
ζ_n = Σ_{k=0}^n (γλ)^{n−k} ∇_θ[Q^θ_k] |_{θ=θ_n}
Interpretation for a linear function class: ∇_θ[Q^θ_k] = ψ_k = ψ(X_k, U_k), so the basis observations are passed through a low-pass linear filter.
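The filtered-eligibility recursion above can be sketched for policy evaluation on a finite chain. This is an illustrative toy, not the lecture's code: the chain, the tabular basis, the gain exponent, and the run length are assumptions.

```python
import numpy as np

def td_lambda_linear(P, c, psi, gamma, lam, N, seed=0):
    """TD(lambda) for a fixed policy on a finite Markov chain, linear class
    Q^theta = theta^T psi, eligibility zeta_{n+1} = gamma*lam*zeta_n + psi_{n+1}."""
    rng = np.random.default_rng(seed)
    nX, d = psi.shape
    theta = np.zeros(d)
    zeta = np.zeros(d)
    x = 0
    for n in range(1, N + 1):
        x_next = rng.choice(nX, p=P[x])
        zeta = gamma * lam * zeta + psi[x]                       # low-pass filter
        td = c[x] + gamma * psi[x_next] @ theta - psi[x] @ theta  # temporal difference
        theta = theta + n ** (-0.7) * td * zeta                   # alpha_n = (1/n)^rho
        x = x_next
    return theta

# 2-state chain with the complete (tabular) basis: theta approaches the
# exact value function h = (I - gamma P)^{-1} c.
P = np.array([[0.5, 0.5], [0.3, 0.7]])
c = np.array([1.0, 0.0])
theta = td_lambda_linear(P, c, psi=np.eye(2), gamma=0.5, lam=0.5, N=100000)
h_true = np.linalg.solve(np.eye(2) - 0.5 * P, c)
```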
37. TD Learning: Least-Squares Temporal Difference Learning
LSTD and Zap: f̄(θ) := E[{−Q^θ_k + c_k + γ Q^θ_{k+1}} ζ_k]
We require the linearization matrix:
A(θ, ξ_{k+1}) = ∂_θ{−Q^θ_k + γ Q^θ_{k+1}} ζ_k
             = {−ψ(X_k, U_k) + γ ψ(X_{k+1}, φ(X_{k+1}))} ζ_k   (linear function class)
A* = E[{−ψ(X_k, U_k) + γ ψ(X_{k+1}, φ(X_{k+1}))} ζ_k] is Hurwitz in this case; the only requirement is Σ_ψ ≻ 0.
LSTD(λ)-Learning:
θ_{n+1} = θ_n + α_{n+1}[−Â_{n+1}]^{−1} f(θ_n, ξ_{n+1}), f(θ_n, ξ_{n+1}) = {−Q^{θ_n}_n + c_n + γ Q^{θ_n}_{n+1}} ζ_n
Â_{n+1} = Â_n + α_{n+1}(A_{n+1} − Â_n), A_{n+1} = {−ψ(X_n, U_n) + γ ψ(X_{n+1}, φ(X_{n+1}))} ζ_n
Zap SA doesn't require a two-time-scale algorithm when A(θ) ≡ A*; the convergence rate is optimal (as obtained using PR averaging).
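The "single matrix inversion" route can be sketched as a batch variant of the same idea: estimate A* and b̄ by running averages along one sample path, then solve for θ*. The chain and run length below are again illustrative assumptions.

```python
import numpy as np

def lstd(P, c, psi, gamma, N, seed=0):
    """LSTD(0) sketch for a fixed policy on a finite chain: estimate
    A* = E[psi_k (gamma*psi_{k+1} - psi_k)^T] and b = E[c_k psi_k] by
    running averages, then solve A_hat theta* = -b_hat."""
    rng = np.random.default_rng(seed)
    nX, d = psi.shape
    A_hat = np.zeros((d, d))
    b_hat = np.zeros(d)
    x = 0
    for n in range(1, N + 1):
        x_next = rng.choice(nX, p=P[x])
        A_n = np.outer(psi[x], gamma * psi[x_next] - psi[x])
        A_hat += (A_n - A_hat) / n           # running average of A_n
        b_hat += (c[x] * psi[x] - b_hat) / n
        x = x_next
    return np.linalg.solve(A_hat, -b_hat)    # root of f_bar(theta) = A theta + b

P = np.array([[0.5, 0.5], [0.3, 0.7]])
c = np.array([1.0, 0.0])
theta = lstd(P, c, psi=np.eye(2), gamma=0.5, N=100000)
h_true = np.linalg.solve(np.eye(2) - 0.5 * P, c)
```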
42. TD Learning: Least-Squares Temporal Difference Learning
What Is LSTD(λ) Solving? Beautiful theory for λ = 1. Eternal gratitude to John Tsitsiklis!
Answered in the dissertation of Van Roy [18]:
min_θ ‖Q^θ − Q^φ‖²_π = Σ_{x,u} (Q^θ(x, u) − Q^φ(x, u))² π(x, u)
See [21, 20] and Chapter 9 of CSRL.
This is the foundation for actor-critic algorithms in the dissertation of Konda [46]:
θ_{n+1} = θ_n − α_{n+1} ∇̆Γ(n), whose mean flow is f̄(θ) = −∇Γ(θ)
Unbiased gradient observation: ∇̆Γ(n) := Λ^{θ_n}_n Q^{θ_n}_n
Score function: Λ^θ(x, u) = ∇_θ log[φ̃^θ(u | x)]
See also [47] and Chapter 10 of CSRL.
45. TD Learning: Least-Squares Temporal Difference Learning
LSTD(1): Optimal Covariance May Be Large
Example from CTCN [33]:
[Figure: estimates of θ_1 and θ_2 over 5 × 10⁶ iterations, ρ = 0.8 and ρ = 0.9, for h^θ(x) = θ_1 x + θ_2 x², with θ*_1 = θ*_2.]
The architecture is linear, two-dimensional, and ideal in every sense, yet algorithm performance is far from ideal.
52. TD Learning: Changing Our Goals
State Weighting
Design the state weighting so that f(θ, ξ) becomes bounded in ξ:
ζ_{n+1} = γ ζ_n + (1/Ω_{n+1}) ψ_{n+1}
[Figure: the CTCN example revisited, ρ = 0.8 and ρ = 0.9, comparing the weighting Ω(x) = [1 + (µ − α)x]⁴ against Ω(x) ≡ 1.]
Far more flexible and well motivated than introducing λ.
53. TD Learning: Changing Our Goals
Relative LSTD
Far more flexible and well motivated than introducing λ?
ζ_{n+1} = γ ζ_n + (1/Ω_{n+1}) ψ(X_{n+1})
Performance will degrade as γ gets close to one. So, change your objective:
min_θ ‖H^θ − H^φ‖²_{π,Ω}, with H(x, u) = Q(x, u) − Q(x•, u•)  (some fixed normalization)
Relative LSTD: replace ψ(x, u) by ψ̃(x, u) := ψ(x, u) − ψ(x•, u•) wherever it appears.
The covariance is optimal, and uniformly bounded over 0 ≤ γ < 1 [64].
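The relative normalization is a constant shift of Q, so it leaves the greedy policy unchanged, which is why H is sufficient for the policy update. A two-line check, with arbitrary made-up Q values:

```python
import numpy as np

# H(x,u) = Q(x,u) - Q(x_ref, u_ref) is a constant shift of Q, so
# argmin_u H(x,u) = argmin_u Q(x,u) for every state x.
rng = np.random.default_rng(0)
Q = rng.random((5, 3))   # arbitrary Q-values: 5 states, 3 inputs
H = Q - Q[0, 0]          # relative Q-function with (x_ref, u_ref) = (0, 0)
```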
58. TD Learning: Changing Our Goals
Relative LSTD
[Figure 1: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem of [4]; maximal Bellman error over n ≤ 10⁵, for 1/(1 − γ) = 10³ and 10⁴, with gains g = 1/(1 − γ) and g = 1/(1 − ρ*γ). The relative Q-learning algorithm is unaffected by large discounting.]
Relative LSTD: replace ψ(x, u) by ψ̃(x, u) := ψ(x, u) − ψ(x•, u•) wherever it appears.
The covariance is optimal, and uniformly bounded over 0 ≤ γ < 1 [64].
59. Exploration: On- and Off-Policy Algorithms
Off Policy for Exploration
In TD-learning, sufficient exploration means Σ_ψ = E_π[ψ_k ψ_kᵀ] ≻ 0. This ensures convergence, solving f̄(θ*) = 0, with perhaps θ* = arg min_θ ‖Q^θ − Q^φ‖²_{π,Ω}.
[Figure: invariant density under the optimal policy, small input versus large input [57].]
Not always possible! Consider X_{n+1} = aX_n + U_n with U_n = φ(X_n) = −KX_n: if |a − K| < 1, then the steady state is supported on X_n = U_n ≡ 0.
Off Policy TD(λ):
θ_{n+1} = θ_n + α_{n+1} f(θ_n, ξ_{n+1}) = θ_n + α_{n+1}{−Q^{θ_n}_n + c_n + γ Q^{θ_n}_{n+1}} ζ_n
ζ_{n+1} = γλ ζ_n + ∇_θ[Q^θ_{n+1}] |_{θ=θ_{n+1}}
The difference:
On policy: Q^{θ_n}_{n+1} = Q^{θ_n}(X_{n+1}, U_{n+1}), with U_{n+1} = φ(X_{n+1})
Off policy: Q^{θ_n}_{n+1} = Q^{θ_n}(X_{n+1}, U_{n+1}), with U_{n+1} = φ̃(X_{n+1})
Example: U_{n+1} = φ̃(X_{n+1}) = φ(X_{n+1}) with probability 1 − ε, and a uniformly random input otherwise.
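The randomized policy φ̃ in the example above can be sketched directly; the Q values and the value of ε below are arbitrary illustrations.

```python
import numpy as np

def epsilon_greedy(Q_row, eps, rng):
    """Randomized exploration policy phi~: the greedy input argmin_u Q(x,u)
    with probability 1 - eps, otherwise a uniformly random input."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmin(Q_row))

rng = np.random.default_rng(0)
Q_row = np.array([3.0, 1.0, 2.0])   # Q(x, .) at the current state: greedy input is 1
inputs = [epsilon_greedy(Q_row, 0.1, rng) for _ in range(1000)]
```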
71. Exploration: What Are We Trying To Accomplish?
Wide World of Exploration
How would this intelligent agent explore her environment to maximize her rewards?
Exploration might lead to a meeting with the cat. She will come to the conclusion that the delicious smell is not cheese. Would she continue to meet up with the cat, and smell the poop, in order to obtain the best estimate of Q*?
As intelligent agents we, like the mouse, must not forget our goals!
76. Q Learning
[Figure 1, repeated: comparison of Q-learning and relative Q-learning for the stochastic shortest path problem of [4].]
77. Q Learning: Galerkin Relaxation
Dynamic programming: find the function Q* that solves (F_n denotes the history)
E[c(X_n, U_n) + γ Q*(X_{n+1}) − Q*(X_n, U_n) | F_n] = 0, with the convention H(x) = min_u H(x, u)
Hidden Goal of Q-Learning (including DQN)
Given {Q^θ : θ ∈ R^d}, find θ* that solves f̄(θ*) = 0, where
f̄(θ) := E[{c(X_n, U_n) + γ Q^θ(X_{n+1}) − Q^θ(X_n, U_n)} ζ_n]
The family {Q^θ} and the eligibility vectors {ζ_n} are part of algorithm design.
101. Q Learning: Watkins' Q-Learning
Watkins' Q-learning:
E[{c(X_n, U_n) + γ Q^{θ*}(X_{n+1}) − Q^{θ*}(X_n, U_n)} ζ_n] = 0
Watkins' algorithm is a special case of Q(0)-learning. The family {Q^θ} and eligibility vectors {ζ_n} in this design:
Linearly parameterized family of functions: Q^θ(x, u) = θᵀψ(x, u)
ζ_n ≡ ψ(X_n, U_n)
ψ_i(x, u) = 1{x = xⁱ, u = uⁱ} (complete basis ≡ tabular Q-learning)
Convergence of Q^{θ_n} to Q* holds under mild conditions.
However, the asymptotic covariance is infinite for γ ≥ 1/2 [31, 62]:
(1/α_n) Cov(θ̃_n) → ∞ as n → ∞, using the standard step-size rule α_n = 1/n(x, u).
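Tabular Watkins Q-learning with the local-clock step-size α_n = 1/n(x, u) can be sketched as follows. The tiny MDP, the fully randomized input, and the discount factor γ = 0.3 (chosen below 1/2 so the standard step-size rule has finite asymptotic covariance) are illustrative assumptions.

```python
import numpy as np

def watkins_q(P, c, gamma, N, seed=0):
    """Tabular Watkins Q-learning sketch: fully randomized input for
    exploration, with the local-clock step-size alpha_n = 1/n(x,u)."""
    rng = np.random.default_rng(seed)
    nU, nX = len(P), c.shape[0]
    Q = np.zeros((nX, nU))
    visits = np.zeros((nX, nU))
    x = 0
    for _ in range(N):
        u = int(rng.integers(nU))            # uniformly random input
        x_next = rng.choice(nX, p=P[u][x])
        visits[x, u] += 1
        td = c[x, u] + gamma * Q[x_next].min() - Q[x, u]
        Q[x, u] += td / visits[x, u]         # step-size 1/n(x,u)
        x = x_next
    return Q

# Tiny MDP; gamma = 0.3 keeps the variance of the recursion well behaved.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.6, 0.4]])]
c = np.array([[1.0, 2.0], [0.5, 0.3]])
Q = watkins_q(P, c, gamma=0.3, N=200000)
# Reference solution Q* by value iteration on the same MDP.
Qstar = np.zeros_like(c)
for _ in range(200):
    Qstar = c + 0.3 * np.stack([P[u] @ Qstar.min(axis=1) for u in range(2)], axis=1)
```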
103. Q Learning: Watkins' Q-Learning
Asymptotic Covariance of Watkins' Q-Learning
This is what infinite variance looks like:
σ² = lim_{n→∞} n E[‖θ_n − θ*‖²] = ∞
Wild oscillations? Not at all: the sample paths appear frozen.
[Figure: histogram of the parameter estimates θ_n(15) after n = 10⁶ iterations, compared with θ*(15) = 486.6. Example from [31, 62], 2017.]
[Figure: sample paths using a higher gain, or relative Q-learning (the Figure 1 comparison on the stochastic shortest path problem of [4]). Example from [31, 62], 2017, and [64], CSRL, 2021.]
106. Q Learning: Watkins' Q-Learning
Asymptotic Covariance of Watkins' Q-Learning
Can we do better?
[Figure 1, repeated: Q-learning vs. relative Q-learning for the stochastic shortest path problem of [4]; the relative Q-learning algorithm is unaffected by large discounting.]
Our intelligent mouse might offer clues.
109. Zap Q
Motivation
The ODE method begins with design of the ODE: d/dt ϑ = f̄(ϑ).
Challenges we have faced with Q-learning. How can we design the dynamics so that
1 Stability holds (few results outside of Watkins' tabular setting)
2 f̄(θ*) = 0 solves a relevant problem, or has a solution at all
And how can we better manage the problems introduced by 1/(1 − γ)? Relative Q-learning is one approach.
Assuming we have solved 2, approximate the Newton-Raphson flow:
d/dt f̄(ϑ_t) = −f̄(ϑ_t), giving f̄(ϑ_t) = f̄(ϑ_0) e^{−t}
113. Zap Q
Zap Algorithm: designed to emulate the Newton-Raphson flow
d/dt ϑ_t = −[A(ϑ_t)]^{−1} f̄(ϑ_t), A(θ) = ∂/∂θ f̄(θ)
Zap-SA:
θ_{n+1} = θ_n + α_{n+1} G_{n+1} f(θ_n, Φ_{n+1}), G_{n+1} = [−Â_{n+1}]^{−1}
Â_{n+1} = Â_n + β_{n+1}(A_{n+1} − Â_n), A_{n+1} = ∂_θ f(θ_n, Φ_{n+1})
Â_{n+1} ≈ A(θ_n) requires high gain: β_n/α_n → ∞ as n → ∞.
Numerics that follow: α_n = 1/n, β_n = (1/n)^ρ, ρ ∈ (0.5, 1).
Zap Q-Learning: f(θ_n, Φ_{n+1}) = {c(X_n, U_n) + γ min_u Q^{θ_n}(X_{n+1}, u) − Q^{θ_n}(X_n, U_n)} ζ_n, with ζ_n = ∇_θ Q^θ(X_n, U_n) |_{θ=θ_n}
118. Zap Q
Challenges
Q-learning: {Q^θ(x, u) : θ ∈ R^d, u ∈ U, x ∈ X}. Find θ* such that f̄(θ*) = 0, with
f̄(θ) = E[{c(X_n, U_n) + γ Q^θ(X_{n+1}) − Q^θ(X_n, U_n)} ζ_n]
What makes the theory difficult:
1 Does f̄ have a root?
2 Does the inverse of A exist?
3 SA theory is weak for a discontinuous ODE ⇐ the biggest technical challenge
⇒ Resolved by exploiting special structure, even for NN function approximation [32, 9].
120. Zap Q: Zap Along the Walk to the Cafe
Zap Q-Learning: Optimize the Walk to the Cafe
Convergence with Zap gain β_n = n^{−0.85}; infinite covariance with α_n = 1/n or 1/n(x, u).
[Figure: convergence of Zap-Q learning, discount factor γ = 0.99. Bellman error over n ≤ 10⁶ for Watkins with g = 1500 and α_n = g/n, Speedy Q-learning, Polyak-Ruppert averaging, Zap with β_n = α_n, and Zap.]
123. Zap Q: Zap Along the Walk to the Cafe
Zap Q-Learning: Optimize the Walk to the Cafe
Convergence with Zap gain β_n = n^{−0.85}.
[Figure: theoretical vs. experimental pdf of W_n = √n θ̃_n for entries #10 and #18, at n = 10⁴ and n = 10⁶; empirical histogram from 1000 trials; discount factor γ = 0.99.]
The CLT gives a good prediction of finite-n performance. Do not forget to use this in practice! Perform multiple runs to estimate the CLT variance: this requires a wide range of initial conditions, but a smaller run length as you fine-tune your algorithm.
127. Zap Q: Zap Q-Learning with Neural Networks
Zap with Neural Networks: ζ_n = ∇_θ Q^θ |_{θ=θ_n}, computed using back-propagation.
A few things to note:
As far as we know, the empirical success of plain-vanilla DQN is extraordinary (however, nobody reports failure).
Zap Q-learning is the only approach for which convergence has been established (under mild conditions).
We can expect stunning reliability and transient performance, with Q^θ parameterized by various neural networks, based on coupling with the Newton-Raphson flow.
132. Conclusions: Future Directions
Conclusions
Reinforcement learning is cursed by dimension, variance, and nonlinear (algorithm) dynamics. We have a big basket of tools for reinforcement learning; many algorithms fail due to bad design, or unrealistic goals.
Among the tools we have for taming algorithmic volatility:
Zap: a second-order method that optimizes variance and transient behavior
State weighting: more justifiable and flexible than the "λ hack"
Relative TD/Q-learning: H^φ(x, u) = Q^φ(x, u) − Q^φ(x•, u•) is sufficient for the policy update
Clever exploration: the bandits literature has vast resources [13]
139. Conclusions Future Directions
Outlook
Beyond the projected Bellman error for Q-learning [57, 58, 59, 60]
Beyond the shackles of DP equations?
Acceleration techniques (momentum and matrix momentum)
See also Zap-Zero in CSRL
[Figure: span norm ‖Qn − Q*‖sp versus iteration n (×10^4), comparing Watkins' Q-learning with gain g = 1/(1 − γ), PR averaging, Zap, and Zap-Zero]
Further variance reduction using control variates
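The span (semi)norm used to measure error in the Q-learning comparison is simply max minus min; it vanishes on constant vectors, which is the right notion when value functions are only determined up to an additive constant. A minimal sketch:

```python
import numpy as np

def span_norm(v):
    """Span seminorm: max_i v_i - min_i v_i.
    Zero on constant vectors, so it measures convergence
    of Q_n to Q* up to an additive constant."""
    v = np.asarray(v, dtype=float)
    return float(v.max() - v.min())

q_error = np.array([3.0, 1.0, -2.0])
assert span_norm(q_error) == 5.0
assert span_norm(q_error + 100.0) == 5.0  # invariant under constant shifts
```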
Thank you for your endurance!
I welcome questions and comments now and in the future
31 / 52
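The control-variate idea mentioned in the outlook can be illustrated on a toy Monte Carlo problem (a sketch of the general technique, not the lecture's RL application): subtract a correlated quantity with known mean, leaving the estimator unbiased but with smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Estimate E[exp(U)] for U ~ Uniform(0, 1); the true value is e - 1.
U = rng.random(N)
Y = np.exp(U)

# Control variate: C = U, with known mean E[U] = 1/2.
C = U
b = np.cov(Y, C)[0, 1] / np.var(C)   # empirical optimal coefficient
Y_cv = Y - b * (C - 0.5)

# Both estimators are unbiased; the control-variate version
# has much smaller variance.
plain, cv = Y.mean(), Y_cv.mean()
assert Y_cv.var() < Y.var()
```

In the RL setting the same principle applies with an approximate value function playing the role of the control variate.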
142. References
[Book covers: Control Techniques for Complex Networks, Sean Meyn; Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie]
References
32 / 52
143. References
Control Background I
[1] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and
Engineers. Princeton University Press, USA, 2008 (recent edition on-line).
[2] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 2nd edition, 1994.
[3] A. Fradkov and B. T. Polyak. Adaptive and robust control in the USSR.
IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress.
[4] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos. Nonlinear and adaptive control
design. John Wiley & Sons, Inc., 1995.
[5] K. J. Åström. Theory and applications of adaptive control—a survey. Automatica,
19(5):471–486, 1983.
[6] K. J. Åström. Adaptive control around 1960. IEEE Control Systems Magazine,
16(3):44–49, 1996.
[7] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic
Control, 22(4):551–575, 1977.
33 / 52
144. References
Control Background II
[8] N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to
reinforcement learning and back again. In Proc. of the IEEE Conf. on Dec. and Control,
pages 3724–3740, 2019.
34 / 52
145. References
RL Background I
[9] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press,
Cambridge, 2021.
[10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press. On-line
edition at http://www.cs.ualberta.ca/~sutton/book/the-book.html, Cambridge,
MA, 2nd edition, 2018.
[11] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[12] D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, Belmont,
MA, 2019.
[13] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020.
[14] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning,
16:185–202, 1994.
35 / 52
146. References
RL Background II
[17] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic
programming algorithms. Neural Computation, 6:1185–1201, 1994.
[18] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes.
PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1998.
[19] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic
programming. Mach. Learn., 22(1-3):59–94, 1996.
[20] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[21] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica,
35(11):1799–1808, 1999.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
36 / 52
147. References
RL Background III
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[28] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[29] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[30] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
37 / 52
148. References
RL Background IV
[31] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural
Information Processing Systems, pages 2232–2241, 2017.
[32] S. Chen, A. M. Devraj, F. Lu, A. Busic, and S. Meyn. Zap Q-Learning with nonlinear
function approximation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and
H. Lin, editors, Advances in Neural Information Processing Systems, and arXiv e-prints
1910.05405, volume 33, pages 16879–16890, 2020.
[33] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
DQN:
[34] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural
reinforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and
L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg,
2005. Springer Berlin Heidelberg.
[35] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement
learning, pages 45–73. Springer, 2012.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A.
Riedmiller. Playing Atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013.
38 / 52
149. References
RL Background V
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik,
I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level
control through deep reinforcement learning. Nature, 518:529–533, 2015.
[38] O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance reduction and
stabilization for deep reinforcement learning. In Proc. of ICML, pages 176–185.
JMLR.org, 2017.
Actor Critic / Policy Gradient
[39] P. J. Schweitzer. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403,
1968.
[40] C. D. Meyer, Jr. The role of the group generalized inverse in the theory of finite Markov
chains. SIAM Review, 17(3):443–464, 1975.
[41] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of
the 18th conference on Winter simulation, pages 356–365, 1986.
[42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
39 / 52
150. References
RL Background VI
[43] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially
observable Markov decision problems. In Advances in neural information processing
systems, pages 345–352, 1995.
[44] X.-R. Cao and H.-F. Chen. Perturbation realization, potentials, and sensitivity analysis of
Markov processes. IEEE Transactions on Automatic Control, 42(10):1382–1393, Oct
1997.
[45] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
[46] V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology,
2002.
[47] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural
information processing systems, pages 1008–1014, 2000.
[48] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for
reinforcement learning with function approximation. In Advances in neural information
processing systems, pages 1057–1063, 2000.
[49] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward
processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
40 / 52
151. References
RL Background VII
[50] S. M. Kakade. A natural policy gradient. In Advances in neural information processing
systems, pages 1531–1538, 2002.
[51] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach
to reinforcement learning. In Advances in Neural Information Processing Systems, pages
1800–1809, 2018.
MDPs, LPs and Convex Q:
[52] A. S. Manne. Linear programming and sequential decisions. Management Sci.,
6(3):259–267, 1960.
[53] C. Derman. Finite State Markovian Decision Processes, volume 67 of Mathematics in
Science and Engineering. Academic Press, Inc., 1970.
[54] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of
Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci.,
pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
[55] D. P. de Farias and B. Van Roy. The linear programming approach to approximate
dynamic programming. Operations Res., 51(6):850–865, 2003.
41 / 52
152. References
RL Background VIII
[56] D. P. de Farias and B. Van Roy. A cost-shaping linear program for average-cost
approximate dynamic programming with performance guarantees. Math. Oper. Res.,
31(3):597–620, 2006.
[57] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of
the IEEE Conf. on Dec. and Control, pages 3598–3605, Dec. 2009.
[58] J. Bas Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In A. Banerjee and
K. Fukumizu, editors, Proc. of The Intl. Conference on Artificial Intelligence and
Statistics, volume 130, pages 3610–3618, 13–15 Apr 2021.
[59] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control
Conf., pages 4749–4756. IEEE, 2021.
[60] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex analytic theory for convex
Q-learning. In Conference on Decision and Control–to appear. IEEE, 2022.
Gator Nation:
[61] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement
learning algorithms. In K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever,
editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision
and Control series (SSDC, volume 325). Springer, 2021.
42 / 52
153. References
RL Background IX
[62] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017
(extended version of NIPS 2017).
[63] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis,
University of Florida, 2019.
[64] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance: Large
discounting is not a barrier to fast learning. IEEE Trans Auto Control (and
arXiv:2002.10301), 2021.
[65] A. M. Devraj, A. Bušić, and S. Meyn. On matrix momentum stochastic approximation
and applications to Q-learning. In Allerton Conference on Communication, Control, and
Computing, pages 749–756, Sep 2019.
43 / 52
154. References
Stochastic Miscellanea I
[66] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57
of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2007.
[67] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation.
Ann. Probab., 24(2):916–931, 1996.
[68] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[69] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018.
44 / 52
155. References
Stochastic Approximation I
[70] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press, Delhi, India Cambridge, UK, 2008.
[71] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[72] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[73] V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for
asymptotic statistics in stochastic approximation and reinforcement learning. arXiv
e-prints:2110.14427, pages 1–50, 2021.
[74] M. Benaı̈m. Dynamics of stochastic approximation algorithms. In Séminaire de
Probabilités, XXXIII, pages 1–68. Springer, Berlin, 1999.
[75] V. Borkar and S. P. Meyn. Oja’s algorithm for graph clustering, Markov spectral
decomposition, and risk sensitive control. Automatica, 48(10):2512–2519, 2012.
[76] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression
function. Ann. Math. Statist., 23(3):462–466, 09 1952.
45 / 52
156. References
Stochastic Approximation II
[77] J. C. Spall. Introduction to stochastic search and optimization: estimation, simulation,
and control. John Wiley & Sons, 2003.
[78] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[79] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
[80] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika, 98–107, 1990 (in Russian). Translated in Automat. Remote Control, 51
1991.
[81] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[82] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[83] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
46 / 52
157. References
Stochastic Approximation III
[84] W. Mou, C. Junchi Li, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. On Linear
Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic
Concentration. arXiv e-prints, page arXiv:2004.04719, Apr. 2020.
47 / 52
158. References
Optimization and ODEs I
[85] W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's
accelerated gradient method: Theory and insights. In Advances in neural information
processing systems, pages 2510–2518, 2014.
[86] B. Shi, S. S. Du, W. Su, and M. I. Jordan. Acceleration via symplectic discretization of
high-resolution differential equations. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information
Processing Systems 32, pages 5744–5752. Curran Associates, Inc., 2019.
[87] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[88] Y. Nesterov. A method of solving a convex programming problem with convergence rate
O(1/k²). In Soviet Mathematics Doklady, 1983.
48 / 52
159. References
QSA and Extremum Seeking Control I
[89] C. K. Lauand and S. Meyn. Extremely fast convergence rates for extremum seeking
control with Polyak-Ruppert averaging. arXiv 2206.00814, 2022.
[90] C. K. Lauand and S. Meyn. Markovian foundations for quasi stochastic approximation
with applications to extremum seeking control, 2022.
[91] B. Lapeyre, G. Pagès, and K. Sab. Sequences with low discrepancy: generalisation and
application to Robbins-Monro algorithm. Statistics, 21(2):251–272, 1990.
[92] S. Laruelle and G. Pagès. Stochastic approximation with averaging innovation applied to
finance. Monte Carlo Methods and Applications, 18(1):1–51, 2012.
[93] S. Shirodkar and S. Meyn. Quasi stochastic approximation. In Proc. of the 2011 American
Control Conference (ACC), pages 2429–2435, July 2011.
[94] S. Chen, A. Devraj, A. Bernstein, and S. Meyn. Revisiting the ODE method for recursive
algorithms: Fast convergence using quasi stochastic approximation.
Journal of Systems Science and Complexity, 34(5):1681–1702, 2021.
[95] Y. Chen, A. Bernstein, A. Devraj, and S. Meyn. Model-Free Primal-Dual Methods for
Network Optimization with Application to Real-Time Optimal Power Flow. In Proc. of
the American Control Conf., pages 3140–3147, Sept. 2019.
49 / 52
160. References
QSA and Extremum Seeking Control II
[96] S. Bhatnagar and V. S. Borkar. Multiscale chaotic SPSA and smoothed functional
algorithms for simulation optimization. Simulation, 79(10):568–580, 2003.
[97] S. Bhatnagar, M. C. Fu, S. I. Marcus, and I.-J. Wang. Two-timescale simultaneous
perturbation stochastic approximation using deterministic perturbation sequences. ACM
Transactions on Modeling and Computer Simulation (TOMACS), 13(2):180–209, 2003.
[98] M. Le Blanc. Sur l'électrification des chemins de fer au moyen de courants alternatifs de
fréquence élevée [On the electrification of railways by means of alternating currents of
high frequency]. Revue Générale de l'Électricité, 12(8):275–277, 1922.
[99] Y. Tan, W. H. Moase, C. Manzie, D. Nešić, and I. M. Y. Mareels. Extremum seeking from
1922 to 2010. In Proceedings of the 29th Chinese Control Conference, pages 14–26, July
2010.
[100] P. F. Blackman. Extremum-seeking regulators. In An Exposition of Adaptive Control.
Macmillan, 1962.
[101] J. Sternby. Adaptive control of extremum systems. In H. Unbehauen, editor, Methods and
Applications in Adaptive Control, pages 151–160, Berlin, Heidelberg, 1980. Springer
Berlin Heidelberg.
50 / 52
161. References
QSA and Extremum Seeking Control III
[102] J. Sternby. Extremum control systems–an area for adaptive control? In Joint Automatic
Control Conference, number 17, page 8, 1980.
[103] K. B. Ariyur and M. Krstić. Real Time Optimization by Extremum Seeking Control. John
Wiley & Sons, Inc., New York, NY, USA, 2003.
[104] M. Krstić and H.-H. Wang. Stability of extremum seeking feedback for general nonlinear
dynamic systems. Automatica, 36(4):595–601, 2000.
[105] S. Liu and M. Krstic. Introduction to extremum seeking. In Stochastic Averaging and
Stochastic Extremum Seeking, Communications and Control Engineering. Springer,
London, 2012.
[106] O. Trollberg and E. W. Jacobsen. On the convergence rate of extremum seeking control.
In European Control Conference (ECC), pages 2115–2120, 2014.
[107] Y. Bugeaud. Linear forms in logarithms and applications, volume 28 of IRMA Lectures in
Mathematics and Theoretical Physics. EMS Press, 2018.
[108] G. Wüstholz, editor. A Panorama of Number Theory—Or—The View from Baker’s
Garden. Cambridge University Press, 2002.
51 / 52
162. References
Selected Applications I
[109] N. S. Raman, A. M. Devraj, P. Barooah, and S. P. Meyn. Reinforcement learning for
control of building HVAC systems. In American Control Conference, July 2020.
[110] K. Mason and S. Grijalva. A review of reinforcement learning for autonomous building
energy management. arXiv.org, 2019. arXiv:1903.05196.
52 / 52