Part 3: Introduction to Reinforcement Learning
Temporal Difference Techniques
Sean Meyn
Department of Electrical and Computer Engineering University of Florida
Inria International Chair Inria, Paris
Thanks to our sponsors: NSF and ARO
Part 3: Introduction to Reinforcement Learning
Outline
1 Recap
2 TD Learning
3 Exploration
4 Q Learning
5 Zap Q
6 Conclusions & Future Directions
7 References
[Figure: outcomes from M repeated runs with ρ = 0.8, comparing the SA estimates θk against the mean flow ϑτk, together with the histogram of √N θ̃N after PR averaging. Crank up the gain to tame transients; average to tame the variance.]
Algorithm Design
Recap Algorithm Design
Steps to a Successful Design With Thanks to Lyapunov and Polyak!
[Figure: outcomes from M repeated runs with ρ = 0.8, comparing θk against ϑτk, and the histogram of √(N − N0) θ̃N(PR). Crank up the gain to tame transients; average to tame the variance.]
Follow These Steps (a minimal sketch of this recipe follows the slide):
1 Design f̄ for the mean flow d/dt ϑ = f̄(ϑ)
2 Design the gain: αn = g/(1 + n/ne)^ρ. Cranking up the gain means ρ < 1
3 Perform PR averaging
4 Repeat! Obtain the histogram of √(N − N0) θ̃N(PR) over runs m = 1, …, M, with the initial conditions θ0(m) widely dispersed and N relatively small
What you will learn:
How big N needs to be
Approximate confidence bounds
1 / 52
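The recipe above can be written out as a short sketch. This is not from the lecture: it assumes a hypothetical noisy root-finding oracle f_noisy(theta) whose mean is the designed f̄, uses the gain schedule αn = g/(1 + n/ne)^ρ with ρ < 1, and applies Polyak-Ruppert averaging over the final N − N0 iterates.

```python
import numpy as np

def sa_with_pr_averaging(f_noisy, theta0, N, N0=0, g=1.0, ne=100, rho=0.8):
    """SA recursion theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, xi_{n+1}),
    with gain alpha_n = g/(1 + n/ne)^rho (rho < 1), followed by PR averaging."""
    theta = np.array(theta0, dtype=float)
    pr_sum = np.zeros_like(theta)
    for n in range(1, N + 1):
        alpha = g / (1.0 + n / ne) ** rho          # "cranked up" gain, rho < 1
        theta = theta + alpha * f_noisy(theta)     # one noisy step along f-bar
        if n > N0:
            pr_sum += theta                        # accumulate iterates for the PR average
    return theta, pr_sum / max(N - N0, 1)

# Hypothetical example: f-bar(theta) = -(theta - 1), observed in additive noise.
rng = np.random.default_rng(0)
f_noisy = lambda th: -(th - 1.0) + rng.normal(scale=0.5, size=th.shape)
theta_N, theta_pr = sa_with_pr_averaging(f_noisy, np.zeros(2), N=5000, N0=1000)
```

Repeating the last call from M dispersed initial conditions gives the histogram described on the slide.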
Recap Algorithm Design
Taming Nonlinear Dynamics
What if stability of d/dt ϑ = f̄(ϑ) is unknown? [typically the case in RL]
Newton-Raphson Flow [Smale 1976]
Idea: interpret f̄ as the “parameter”: d/dt f̄t = V(f̄t), and design V
Simplest choice, giving linear dynamics: d/dt f̄ = −f̄t
Along the flow this means d/dt f̄(ϑt) = −f̄(ϑt), equivalently d/dt ϑt = −[A(ϑt)]⁻¹ f̄(ϑt)
SA translation: Zap Stochastic Approximation
2 / 52
Recap Algorithm Design
Zap Algorithm   Designed to emulate Newton-Raphson flow
d/dt ϑt = −[A(ϑt)]⁻¹ f̄(ϑt),   A(θ) = ∂/∂θ f̄(θ)
Zap-SA (designed to emulate deterministic Newton-Raphson)
θn+1 = θn + αn+1 [−Ân+1]⁻¹ f(θn, ξn+1)
Ân+1 = Ân + βn+1 (An+1 − Ân),   An+1 = ∂θ f(θn, ξn+1)
Requires Ân+1 ≈ A(θn) := ∂θ f̄(θn), which calls for a high-gain matrix estimate: βn/αn → ∞ as n → ∞
Can use αn = 1/n (without averaging).
Numerics that follow use this choice, with βn = (1/n)^ρ, ρ ∈ (0.5, 1)
Stability? Virtually universal
Optimal variance, too!
Based on ancient theory from Ruppert & Polyak [80, 81, 79]
3 / 52
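A minimal sketch of the Zap-SA recursion above, assuming hypothetical oracles f_noisy(theta) for f(θ, ξ) and A_noisy(theta) for the noisy Jacobian ∂θ f(θ, ξ). The matrix gain is the running estimate [−Ân]⁻¹, updated on the faster time scale βn = n^(−ρ).

```python
import numpy as np

def zap_sa(f_noisy, A_noisy, theta0, N, rho=0.85):
    """Zap-SA sketch: theta step with matrix gain [-A_hat]^{-1}, where A_hat
    tracks the Jacobian on the faster time scale beta_n = n^{-rho}, rho in (0.5, 1)."""
    theta = np.array(theta0, dtype=float)
    A_hat = -np.eye(theta.size)                           # any Hurwitz initialization
    for n in range(1, N + 1):
        alpha, beta = 1.0 / n, n ** (-rho)                # beta_n / alpha_n -> infinity
        A_hat = A_hat + beta * (A_noisy(theta) - A_hat)   # A_hat ~ A(theta_n)
        G = -np.linalg.pinv(A_hat)                        # pinv in case A_hat is near-singular
        theta = theta + alpha * G @ f_noisy(theta)
    return theta
```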
Stochastic Control Problems
Recap Optimal Control
Keep In Mind!
[Figure: standard feedback-loop block diagram — trajectory generation, state feedback, process, observer, and environment, with reference xref, feedforward uff, feedback ufb, input u, output y, state estimate x̂, disturbances w, d, n, and hidden state x.]
Every Control Problem is Multi-Objective
Many costs to consider
Many measures of risk
Every Control Problem is Multi-Temporal
Advanced planning, and periodic updates
Real-time feedback φ
There are Choices Beyond φ?
See CSRL Chapter 2, and Åström and Murray [1]
4 / 52
Recap Optimal Control
MDP Model
Keep things simple: state space X and input space U are finite
MDP Model
X is a stationary controlled Markov chain, with input U
For all states x and sets A,
P{Xn+1 ∈ A | Xn = x, Un = u, and prior history} = Pu(x, A)
Cost function c : X × U → R+
Discount factor γ < 1
Q-function:
Q∗(x, u) = min_{U1∞} ∑_{k≥0} γ^k E[c(X(k), U(k)) | X(0) = x, U(0) = u]
DP equation:
Q∗(x, u) = c(x, u) + γ E[Q∗(X(k + 1)) | X(k) = x, U(k) = u]
using the convention H(x) = min_u H(x, u), so Q∗(X(k + 1)) = min_u Q∗(X(k + 1), u)
Optimal policy: U∗(k) = φ∗(X∗(k)), with φ∗(x) ∈ arg min_u Q∗(x, u)
Goal of SC and RL: approximate Q∗ and/or φ∗
Today’s lecture: find the best approximation within {Qθ : θ ∈ Rd}
5 / 52
Recap Optimal Control
MDP Model   Some flexibility with notation: c(X(k), U(k)) = ck, etc.
Keep things simple: state space X and input space U are finite
Cost function c : X × U → R+
Q-function:
Q∗(x, u) = min_{U1∞} ∑_{k≥0} γ^k E[ck | X0 = x, U0 = u]
DP equation:
Q∗(x, u) = c(x, u) + γ E[Q∗k+1 | Xk = x, Uk = u], with Q∗k+1 := min_u Q∗(Xk+1, u)
Optimal policy: U∗k = φ∗(X∗k), with φ∗(x) ∈ arg min_u Q∗(x, u)
Goal of SC and RL: approximate Q∗ and/or φ∗
Today’s lecture: find the best approximation within {Qθ : θ ∈ Rd}
6 / 52
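For a finite MDP the DP equation can be solved directly. The sketch below is not part of the slides: it runs Q-value iteration on a hypothetical two-state, two-action example, with P[u] the transition matrix under input u and c[x, u] the cost.

```python
import numpy as np

def q_value_iteration(P, c, gamma, tol=1e-8, max_iter=10_000):
    """Solve Q(x,u) = c(x,u) + gamma * sum_y P_u(x,y) min_v Q(y,v) by fixed-point
    iteration.  P has shape (U, X, X), c has shape (X, U), and gamma < 1."""
    nU, nX, _ = P.shape
    Q = np.zeros((nX, nU))
    for _ in range(max_iter):
        Q_min = Q.min(axis=1)                          # min_u Q(x, u), the DP convention
        Q_new = c + gamma * np.stack([P[u] @ Q_min for u in range(nU)], axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Hypothetical 2-state, 2-action example.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0], [0.5, 0.3]])
Q_star = q_value_iteration(P, c, gamma=0.9)
phi_star = Q_star.argmin(axis=1)      # the Q*-greedy (optimal) policy
```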
TD Learning
TD Learning Estimating the State-Action Value Function
TD Learning   Given policy φ, approximate the fixed-policy Q-function
U(k) = φ(X(k)), k ≥ 1
Qφ(x, u) = ∑_{k≥0} γ^k E[c(X(k), U(k)) | X(0) = x, U(0) = u]
Motivation: policy improvement, φ+(x) = arg min_u Qφ(x, u)
DP equation:
Qφ(x, u) = c(x, u) + γ E[Qφ(X(k + 1)) | X(k) = x, U(k) = u]
Qφ(X(k), U(k)) = c(X(k), U(k)) + γ E[Qφ(X(k + 1)) | Fk]
with the new convention H(x) = H(x, φ(x)).
Galerkin relaxation: θ∗ is a root of the function
f̄(θ) := E[{−Qθ(X(k), U(k)) + c(X(k), U(k)) + γ Qθ(X(k + 1))} ζk]
where ζk is the eligibility vector in Rd.
Linear family {Qθ = θᵀψ}: d/dt ϑ = f̄(ϑ) is stable (for a clever choice of ζk, and with the covariance Σψ full rank). Alternatively, compute θ∗ by a single matrix inversion.
How to estimate θ∗ for a neural network (nonlinear) architecture?
SA to the rescue
7 / 52
TD Learning Estimating the State-Action Value Function
TD(λ)   f̄(θ) := E[{−Qθ(Xk, Uk) + ck + γ Qθ(Xk+1, Uk+1)} ζk]
Choose λ ∈ [0, 1] as you please, and then for each n
ζn = ∑_{k=0}^{n} (γλ)^{n−k} ∇θ Qθ(Xk, Uk)|θ=θn
Interpretation for the linear function class: ∇θ Qθ(Xk, Uk) = ψk = ψ(Xk, Uk)
Basis observations passed through a low-pass linear filter
TD(λ) ≡ SA algorithm:
θn+1 = θn + αn+1 f(θn, ξn+1) = θn + αn+1 {−Qθn(Xn, Un) + cn + γ Qθn(Xn+1, Un+1)} ζn
ζn+1 = γλ ζn + ∇θ Qθ(Xn+1, Un+1)|θ=θn+1
with ξn+1 = (Xn+1, Xn, Un, ζn)
8 / 52
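A sketch of the TD(λ) recursion above for the linear class Qθ = θᵀψ. The helpers sample() (one on-policy transition), psi(x, u) (features in R^d), and phi(x) (the fixed policy) are hypothetical interfaces, not part of the lecture.

```python
import numpy as np

def td_lambda_linear(sample, psi, phi, d, gamma, lam, N, g=1.0, ne=100, rho=0.8):
    """TD(lambda) with Q_theta = theta^T psi (sketch).
    sample() is assumed to return one on-policy transition (x, u, cost, x_next)."""
    theta, zeta = np.zeros(d), np.zeros(d)
    for n in range(1, N + 1):
        alpha = g / (1.0 + n / ne) ** rho
        x, u, cost, x_next = sample()
        zeta = gamma * lam * zeta + psi(x, u)            # eligibility vector in R^d
        u_next = phi(x_next)                             # on-policy next input
        td = cost + gamma * theta @ psi(x_next, u_next) - theta @ psi(x, u)
        theta = theta + alpha * td * zeta                # one SA step toward a root of f-bar
    return theta
```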
TD Learning Least-Squares Temporal Difference Learning
LSTD and Zap   f̄(θ) := E[{−Qθ(Xk, Uk) + ck + γ Qθ(Xk+1, Uk+1)} ζk]
We require the linearization matrix:
A(θ, ξk+1) = ∂θ[{−Qθ(Xk, Uk) + γ Qθ(Xk+1, Uk+1)} ζk]
           = ζk {−ψ(Xk, Uk) + γ ψ(Xk+1, φ(Xk+1))}ᵀ   (linear function class)
A∗ = E[ζk {−ψ(Xk, Uk) + γ ψ(Xk+1, φ(Xk+1))}ᵀ] is Hurwitz in this case
Only requirement is that Σψ > 0
LSTD(λ)-Learning
θn+1 = θn + αn+1 [−Ân+1]⁻¹ f(θn, ξn+1)
f(θn, ξn+1) = {−Qθn(Xn, Un) + cn + γ Qθn(Xn+1, Un+1)} ζn
Ân+1 = Ân + αn+1 (An+1 − Ân)
An+1 = ζn {−ψ(Xn, Un) + γ ψ(Xn+1, φ(Xn+1))}ᵀ
Zap-SA does not require a two-time-scale algorithm when A(θ) ≡ A∗
Optimal convergence rate (as obtained using PR averaging)
9 / 52
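The matrix-gain (LSTD(λ)) version of the TD(λ) sketch, under the same hypothetical sample/psi/phi interfaces: Ân tracks the steady-state mean of An+1 at gain αn = 1/n, and the parameter step uses [−Ân]⁻¹.

```python
import numpy as np

def lstd_lambda(sample, psi, phi, d, gamma, lam, N):
    """LSTD(lambda) sketch: TD(lambda) update with matrix gain [-A_hat_n]^{-1}."""
    theta, zeta = np.zeros(d), np.zeros(d)
    A_hat = -1e-3 * np.eye(d)                                   # small Hurwitz initialization
    for n in range(1, N + 1):
        alpha = 1.0 / n
        x, u, cost, x_next = sample()
        zeta = gamma * lam * zeta + psi(x, u)
        psi_next = psi(x_next, phi(x_next))
        A_n = np.outer(zeta, gamma * psi_next - psi(x, u))      # A_{n+1} of the slide
        A_hat = A_hat + alpha * (A_n - A_hat)
        f_n = (cost + gamma * theta @ psi_next - theta @ psi(x, u)) * zeta
        theta = theta + alpha * (-np.linalg.pinv(A_hat)) @ f_n
    return theta
```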
TD Learning Least-Squares Temporal Difference Learning
What Is LSTD(λ) Solving? Beautiful theory for λ = 1. Eternal gratitude to John Tsitsiklis!
Answered in the dissertation of Van Roy [18]:
min_θ ‖Qθ − Qφ‖²π = ∑_{x,u} (Qθ(x, u) − Qφ(x, u))² π(x, u)
See [21, 20] and Chapter 9 of CSRL
Foundation for actor-critic algorithms in the dissertation of Konda [46]:
θn+1 = θn − αn+1 ∇̂Γ(n),   where f̄(θ) = −∇Γ(θ)
Unbiased gradient observation: ∇̂Γ(n) := Λθn(Xn, Un) Qθn(Xn, Un)
Score function: Λθ(x, u) = ∇θ log φ̃θ(u | x)
See also [47] and Chapter 10 of CSRL
10 / 52
TD Learning Least-Squares Temporal Difference Learning
LSTD(1) Optimal Covariance May Be Large
Example from CTCN [33]:
[Figure: LSTD(1) estimates of θ1 and θ2 for hθ(x) = θ1 x + θ2 x², with θ∗1 = θ∗2, plotted over 5 × 10⁶ iterations for load ρ = 0.8 and ρ = 0.9.]
Architecture linear, two dimensional, and ideal in every sense
Algorithm performance is far from ideal
11 / 52
TD Learning Changing Our Goals
State Weighting
Weighted objective:
min_θ ‖Qθ − Qφ‖²π,Ω = ∑_{x,u} (1/Ω(x, u)) (Qθ(x, u) − Qφ(x, u))² π(x, u)
Zap ≡ LSTD(1):   θn = [−Ân]⁻¹ bn
Ân+1 = Ân + (1/(n + 1)) [ −(1/Ωn+1) ψn+1 ψn+1ᵀ − Ân ]
bn+1 = bn + (1/(n + 1)) [ ζn cn − bn ]
ζn+1 = γ ζn + (1/Ωn+1) ψn+1
12 / 52
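A sketch of the state-weighted Zap ≡ LSTD(1) recursion above, assuming hypothetical helpers sample() returning one on-policy observation (Xn, Un, cn), psi(x, u) for the basis, and Omega(x, u) for the weighting.

```python
import numpy as np

def weighted_lstd1(sample, psi, Omega, d, gamma, N):
    """State-weighted LSTD(1) sketch following the recursion on the slide."""
    A_hat = -1e-3 * np.eye(d)
    b = np.zeros(d)
    zeta = np.zeros(d)
    for n in range(N):
        x, u, cost = sample()                       # one observation (X_n, U_n, c_n)
        w = 1.0 / Omega(x, u)                       # weighting 1/Omega_{n}
        psi_n = psi(x, u)
        zeta = gamma * zeta + w * psi_n             # weighted eligibility vector
        step = 1.0 / (n + 1)
        A_hat = A_hat + step * (-w * np.outer(psi_n, psi_n) - A_hat)
        b = b + step * (zeta * cost - b)
    return np.linalg.pinv(-A_hat) @ b               # theta_n = [-A_hat_n]^{-1} b_n
```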
TD Learning Changing Our Goals
State Weighting
Design the state weighting so that f(θ, ξ) becomes bounded in ξ:
[Figure: the CTCN example revisited — estimates of θ1 and θ2 for hθ(x) = θ1 x + θ2 x², with θ∗1 = θ∗2, for ρ = 0.8 and ρ = 0.9, comparing the unweighted case Ω(x) ≡ 1 with the weighting Ω(x) = [1 + (µ − α)x]⁴ over 5 × 10⁶ iterations.]
Far more flexible and well motivated than introducing λ
ζn+1 = γ ζn + (1/Ωn+1) ψn+1
13 / 52
TD Learning Changing Our Goals
Relative LSTD
Far more flexible and well motivated than introducing λ?
ζn+1 = γ ζn + (1/Ωn+1) ψ(Xn+1)
Performance will degrade as γ gets close to one
Change your objective: min_θ ‖Hθ − Hφ‖²π,Ω
with H(x, u) = Q(x, u) − Q(x•, u•) (some fixed normalization)
Relative LSTD
Replace ψ(x, u) by ψ̃(x, u) := ψ(x, u) − ψ(x•, u•) where it appears
Covariance is optimal, and uniformly bounded over 0 ≤ γ < 1 [64]
14 / 52
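Relative LSTD amounts to one line of preprocessing: re-center the basis at a fixed pair (x•, u•) and reuse any of the sketches above. The helper below is an illustration under those assumptions, not code from the lecture.

```python
def relative_features(psi, x_anchor, u_anchor):
    """Relative LSTD: psi_tilde(x, u) = psi(x, u) - psi(x_anchor, u_anchor)."""
    psi_ref = psi(x_anchor, u_anchor)
    return lambda x, u: psi(x, u) - psi_ref

# Usage: pass relative_features(psi, x_star, u_star) in place of psi in the LSTD sketches above.
```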
TD Learning Changing Our Goals
Relative LSTD
[Figure 1: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem of [4]: maximal Bellman error versus n, for 1/(1 − γ) = 10³ and 10⁴, with Q-learning gain g = 1/(1 − γ) and Relative Q-learning gain g = 1/(1 − ρ∗γ). The relative Q-learning algorithm is unaffected by large discounting.]
Relative LSTD
Replace ψ(x, u) by ψ̃(x, u) := ψ(x, u) − ψ(x•, u•) where it appears
Covariance is optimal, and uniformly bounded over 0 ≤ γ < 1 [64]
15 / 52
Exploration On and off policy algorithms
Off Policy for Exploration
In TD-learning, sufficient exploration means Σψ = Eπ[ψk ψkᵀ] > 0
This ensures convergence, solving f̄(θ∗) = 0, perhaps with θ∗ = arg min_θ ‖Qθ − Qφ‖²π,Ω
[Figure: invariant density of the state under the optimal policy, for a small input and a large input [57].]
Not always possible!
Consider Xn+1 = aXn + Un, with Un = φ(Xn) = −KXn
If |a − K| < 1 then the steady state is supported on Xn = Un ≡ 0
Off Policy TD(λ)
θn+1 = θn + αn+1 f(θn, ξn+1) = θn + αn+1 {−Qθn(Xn, Un) + cn + γ Qθn(Xn+1, Un+1)} ζn
ζn+1 = γλ ζn + ∇θ Qθ(Xn+1, Un+1)|θ=θn+1
The difference:
On policy: Qθn(Xn+1, Un+1), with Un+1 = φ(Xn+1)
Off policy: Qθn(Xn+1, Un+1), with Un+1 = φ̃(Xn+1)
Example (see the sketch after this slide):
Un+1 = φ̃(Xn+1) = φ(Xn+1) with probability 1 − ε, and an independent random input otherwise
16 / 52
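The randomized exploration policy φ̃ in the example can be sketched as follows. The wrapped policy phi and the finite input list are hypothetical interfaces.

```python
import numpy as np

def epsilon_greedy(phi, inputs, epsilon, seed=0):
    """Randomized policy phi-tilde: follow phi with probability 1 - epsilon,
    otherwise draw an input uniformly from the finite list `inputs`."""
    rng = np.random.default_rng(seed)
    def phi_tilde(x):
        if rng.random() < 1.0 - epsilon:
            return phi(x)
        return inputs[rng.integers(len(inputs))]
    return phi_tilde
```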
Exploration What Are We Trying To Accomplish
Wide World of Exploration
How would this intelligent agent explore her environment to maximize her rewards?
Exploration might lead to a meeting with the cat
She will come to the conclusion that the delicious smell is not cheese
Would she continue to meet up with the cat, and smell the poop, in order to obtain the best estimate of Q∗?
As intelligent agents we, like the mouse, must not forget our goals!
17 / 52
Q Learning
Q Learning Galerkin Relaxation
Dynamic programming
Find the function Q∗ that solves (Fn denotes the history)
E[c(Xn, Un) + γ Q∗(Xn+1) − Q∗(Xn, Un) | Fn] = 0
using the convention H(x) = min_u H(x, u)
Hidden Goal of Q-Learning (including DQN)
Given {Qθ : θ ∈ Rd}, find θ∗ that solves f̄(θ∗) = 0,
f̄(θ) := E[{c(Xn, Un) + γ Qθ(Xn+1) − Qθ(Xn, Un)} ζn]
The family {Qθ} and the eligibility vectors {ζn} are part of the algorithm design.
18 / 52
Q Learning Galerkin Relaxation
Q(0)-Learning   Hidden goal: f̄(θ∗) = 0
f̄(θ) = E[{c(Xn, Un) + γ Qθ(Xn+1) − Qθ(Xn, Un)} ζn]
Prototypical choice ζn = ∇θ Qθ(Xn, Un)|θ=θn  =⇒  prototypical Q-learning algorithm
Q(0) Learning Algorithm
Estimates obtained using SA:
θn+1 = θn + αn+1 fn+1,   fn+1 = {cn + γ Qθ(Xn+1) − Qθ(Xn, Un)} ζn |θ=θn
where Qθ(Xn+1) denotes Qθ(Xn+1, φθ(Xn+1)), with φθ(x) = arg min_u Qθ(x, u) [the Qθ-greedy policy]
Input {Un} chosen for exploration.
19 / 52
Q Learning Watkins’ Q-Learning
Q(0)-Learning   Hidden goal: f̄(θ∗) = 0
Q(0)-learning with linear function approximation
Estimates obtained using SA:
θn+1 = θn + αn+1 fn+1,   fn+1 = {cn + γ Qθ(Xn+1) − Qθ(Xn, Un)}|θ=θn ζn
Qθ(Xn+1) = Qθ(Xn+1, φθ(Xn+1))
Qθ(x, u) = θᵀ ψ(x, u),   Qθ(x) = θᵀ ψ(x, φθ(x))
ζn = ∇θ Qθ(Xn, Un)|θ=θn = ψ(Xn, Un)
In this case f̄(θ) = A(θ) θ + b̄, with
A(θ) = E[ζn {γ ψ(Xn+1, φθ(Xn+1)) − ψ(Xn, Un)}ᵀ],   b̄ := E[ζn c(Xn, Un)]
and θ∗ solves A(θ∗) θ∗ + b̄ = 0
20 / 52
Q Learning Watkins’ Q-Learning
Watkins’ Q-learning
E[{c(Xn, Un) + γ Qθ∗(Xn+1) − Qθ∗(Xn, Un)} ζn] = 0
Watkins’ algorithm: a special case of Q(0)-learning
The family {Qθ} and eligibility vectors {ζn} in this design:
Linearly parameterized family of functions: Qθ(x, u) = θᵀ ψ(x, u)
ζn ≡ ψ(Xn, Un)
ψi(x, u) = 1{x = xi, u = ui} (complete basis ≡ tabular Q-learning)
Convergence of Qθn to Q∗ holds under mild conditions
Asymptotic covariance is infinite for γ ≥ 1/2 [31, 62]:
(1/αn) Cov(θ̃n) → ∞ as n → ∞, using the standard step-size rule αn = 1/n(x, u)
(a tabular sketch with this step size follows the slide)
21 / 52
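A tabular Watkins Q-learning sketch with the step-size rule αn = 1/n(x, u) discussed above, i.e. one over the number of visits to the current state-action pair. It assumes a hypothetical simulator sample_next(x, u) drawing Xn+1 from Pu(x, ·), a cost table c, and uniformly randomized inputs for exploration.

```python
import numpy as np

def watkins_q_learning(sample_next, c, gamma, nX, nU, N, seed=0):
    """Tabular Watkins Q-learning with step size 1 / (visit count of (x, u))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((nX, nU))
    visits = np.zeros((nX, nU))
    x = 0
    for _ in range(N):
        u = rng.integers(nU)                      # exploratory (randomized) input
        x_next = sample_next(x, u)
        visits[x, u] += 1
        alpha = 1.0 / visits[x, u]                # alpha_n = 1/n(x, u)
        td = c[x, u] + gamma * Q[x_next].min() - Q[x, u]
        Q[x, u] += alpha * td
        x = x_next
    return Q
```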
Q Learning Watkins’ Q-Learning
Asymptotic Covariance of Watkins’ Q-Learning
This is what infinite variance looks like
[Figure: the six-node graph of the example.]
σ² = lim_{n→∞} n E[‖θn − θ∗‖²] = ∞.  Wild oscillations?
Not at all, the sample paths appear frozen
[Figure: histogram of the parameter estimate θn(15) after n = 10⁶ iterations, together with the limiting value θ∗(15). Example from [31, 62], 2017.]
Sample paths using a higher gain, or relative Q-learning:
[Figure 1: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem of [4]: maximal Bellman error versus n, for 1/(1 − γ) = 10³ and 10⁴, with Q-learning gain g = 1/(1 − γ) and Relative Q-learning gain g = 1/(1 − ρ∗γ). The relative Q-learning algorithm is unaffected by large discounting. Example from [31, 62], 2017, and [64], CSRL, 2021.]
22 / 52
Q Learning Watkins’ Q-Learning
Asymptotic Covariance of Watkins’ Q-Learning
Can we do better?
[Figure: the six-node graph, alongside Figure 1 — Q-learning with gain g = 1/(1 − γ) versus Relative Q-learning with gain g = 1/(1 − ρ∗γ) on the stochastic shortest path problem of [4], for 1/(1 − γ) = 10³ and 10⁴; the relative Q-learning algorithm is unaffected by large discounting.]
Our intelligent mouse might offer clues
23 / 52
Zap • Momentum • Return to Zap
Zap Q
Motivation
The ODE method begins with design of the ODE: d/dt ϑ = f̄(ϑ)
Challenges we have faced with Q-learning — how can we design dynamics for
1 Stability (few results outside of Watkins’ tabular setting)
2 f̄(θ∗) = 0 solves a relevant problem, or even has a solution
How can we better manage the problems introduced by 1/(1 − γ)?
Relative Q-learning is one approach
Assuming we have solved 2, approximate the Newton-Raphson flow:
d/dt f̄(ϑt) = −f̄(ϑt), giving f̄(ϑt) = f̄(ϑ0) e^{−t}
24 / 52
Zap Q
Zap Algorithm   Designed to emulate Newton-Raphson flow
d/dt ϑt = −[A(ϑt)]⁻¹ f̄(ϑt),   A(θ) = ∂/∂θ f̄(θ)
Zap-SA
θn+1 = θn + αn+1 Gn+1 f(θn, Φn+1),   Gn+1 = [−Ân+1]⁻¹
Ân+1 = Ân + βn+1 (An+1 − Ân),   An+1 = ∂θ f(θn, Φn+1)
Ân+1 ≈ A(θn) requires high gain: βn/αn → ∞ as n → ∞
Numerics that follow: αn = 1/n, βn = (1/n)^ρ, ρ ∈ (0.5, 1)
Zap Q-Learning (see the sketch after this slide):
f(θn, Φn+1) = {c(Xn, Un) + γ Qθn(Xn+1) − Qθn(Xn, Un)} ζn
ζn = ∇θ Qθ(Xn, Un)|θ=θn
An+1 = ζn {γ ψ(Xn+1, φθ(Xn+1)) − ψ(Xn, Un)}ᵀ,   φθ(x) = arg min_u Qθ(x, u)
25 / 52
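A sketch of the Zap Q-learning update above for a linear basis Qθ(x, u) = θᵀψ(x, u). The helpers sample() (one exploratory transition), psi(x, u), cost(x, u), and the finite input list used to evaluate the Qθ-greedy policy are hypothetical interfaces.

```python
import numpy as np

def zap_q_learning(sample, psi, cost, inputs, d, gamma, N, rho=0.85):
    """Zap Q-learning sketch, linear basis Q_theta(x, u) = theta^T psi(x, u)."""
    theta = np.zeros(d)
    A_hat = -1e-3 * np.eye(d)
    for n in range(1, N + 1):
        alpha, beta = 1.0 / n, n ** (-rho)
        x, u, x_next = sample()
        zeta = psi(x, u)                                              # gradient of Q_theta at (x, u)
        u_greedy = min(inputs, key=lambda v: theta @ psi(x_next, v))  # phi_theta(x_next)
        psi_next = psi(x_next, u_greedy)
        A_n = np.outer(zeta, gamma * psi_next - zeta)                 # A_{n+1} of the slide
        A_hat = A_hat + beta * (A_n - A_hat)                          # fast time scale
        f_n = (cost(x, u) + gamma * theta @ psi_next - theta @ zeta) * zeta
        theta = theta + alpha * (-np.linalg.pinv(A_hat)) @ f_n        # slow time scale
    return theta
```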
Zap Q
Challenges
Q-learning: {Qθ(x, u) : θ ∈ Rd, u ∈ U, x ∈ X}
Find θ∗ such that f̄(θ∗) = 0, with
f̄(θ) = E[{c(Xn, Un) + γ Qθ(Xn+1) − Qθ(Xn, Un)} ζn]
What makes the theory difficult:
1 Does f̄ have a root?
2 Does the inverse of A exist?
3 SA theory is weak for a discontinuous ODE ⇐= the biggest technical challenge
=⇒ Resolved by exploiting special structure, even for NN function approximation [32, 9]
26 / 52
Zap Q Zap Along Walk to the Cafe
Zap Q-Learning
Optimize Walk to Cafe
Convergence with Zap gain βn = n^{−0.85}
Infinite covariance with αn = 1/n or 1/n(x, u).
[Figure: the six-node graph, and convergence of Zap-Q learning with discount factor γ = 0.99 — Bellman error versus n (up to 10⁶) for Watkins with αn = g/n (g = 1500), Speedy Q-learning, Polyak-Ruppert averaging, Zap with βn = αn, and Zap with βn = n^{−0.85}.]
27 / 52
Zap Q Zap Along Walk to the Cafe
Zap Q-Learning
Optimize Walk to Cafe
Convergence with Zap gain βn = n^{−0.85}
[Figure: theoretical versus empirical pdf of Wn = √n θ̃n (entries #10 and #18), at n = 10⁴ and n = 10⁶, from 1000 trials; discount factor γ = 0.99. The CLT gives a good prediction of finite-n performance.]
Do not forget to use this in practice!
Multiple runs to estimate the CLT variance
Requires a wide range of initial conditions, but a smaller run length as you fine-tune your algorithm
28 / 52
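The practice advocated on this slide can be sketched as follows: repeat the algorithm M times from widely dispersed initial conditions and form the empirical covariance of WN = √N (θN − θ∗). The run_algorithm interface and the assumption that θ∗ is available (e.g. from a long reference run) are hypothetical.

```python
import numpy as np

def empirical_clt_covariance(run_algorithm, theta_star, M, N, seed=0):
    """Estimate the CLT covariance from M independent runs of length N."""
    rng = np.random.default_rng(seed)
    W = np.zeros((M, theta_star.size))
    for m in range(M):
        theta0 = theta_star + rng.normal(scale=10.0, size=theta_star.shape)  # dispersed start
        W[m] = np.sqrt(N) * (run_algorithm(theta0, N) - theta_star)
    return np.cov(W, rowvar=False)        # empirical covariance of W_N
```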
Zap Q Zap Q-Learning with Neural Networks
Zap with Neural Networks
0 = f̄(θ∗) = E[{c(Xn, Un) + γ Qθ∗(Xn+1) − Qθ∗(Xn, Un)} ζn]
ζn = ∇θ Qθ(Xn, Un)|θ=θn, computed using back-propagation
A few things to note:
As far as we know, the empirical success of plain-vanilla DQN is extraordinary (however, nobody reports failure)
29 / 52
More Related Content

More from Sean Meyn

Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast AlgorithmsReinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast AlgorithmsSean Meyn
 
State estimation and Mean-Field Control with application to demand dispatch
State estimation and Mean-Field Control with application to demand dispatchState estimation and Mean-Field Control with application to demand dispatch
State estimation and Mean-Field Control with application to demand dispatchSean Meyn
 
Demand-Side Flexibility for Reliable Ancillary Services
Demand-Side Flexibility for Reliable Ancillary ServicesDemand-Side Flexibility for Reliable Ancillary Services
Demand-Side Flexibility for Reliable Ancillary ServicesSean Meyn
 
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...Sean Meyn
 
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...Sean Meyn
 
Why Do We Ignore Risk in Power Economics?
Why Do We Ignore Risk in Power Economics?Why Do We Ignore Risk in Power Economics?
Why Do We Ignore Risk in Power Economics?Sean Meyn
 
Distributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power GridDistributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power GridSean Meyn
 
Ancillary service to the grid from deferrable loads: the case for intelligent...
Ancillary service to the grid from deferrable loads: the case for intelligent...Ancillary service to the grid from deferrable loads: the case for intelligent...
Ancillary service to the grid from deferrable loads: the case for intelligent...Sean Meyn
 
2012 Tutorial: Markets for Differentiated Electric Power Products
2012 Tutorial:  Markets for Differentiated Electric Power Products2012 Tutorial:  Markets for Differentiated Electric Power Products
2012 Tutorial: Markets for Differentiated Electric Power ProductsSean Meyn
 
Control Techniques for Complex Systems
Control Techniques for Complex SystemsControl Techniques for Complex Systems
Control Techniques for Complex SystemsSean Meyn
 
Tutorial for Energy Systems Week - Cambridge 2010
Tutorial for Energy Systems Week - Cambridge 2010Tutorial for Energy Systems Week - Cambridge 2010
Tutorial for Energy Systems Week - Cambridge 2010Sean Meyn
 
Panel Lecture for Energy Systems Week
Panel Lecture for Energy Systems WeekPanel Lecture for Energy Systems Week
Panel Lecture for Energy Systems WeekSean Meyn
 
The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010Sean Meyn
 
Approximate dynamic programming using fluid and diffusion approximations with...
Approximate dynamic programming using fluid and diffusion approximations with...Approximate dynamic programming using fluid and diffusion approximations with...
Approximate dynamic programming using fluid and diffusion approximations with...Sean Meyn
 
Anomaly Detection Using Projective Markov Models
Anomaly Detection Using Projective Markov ModelsAnomaly Detection Using Projective Markov Models
Anomaly Detection Using Projective Markov ModelsSean Meyn
 
Markov Tutorial CDC Shanghai 2009
Markov Tutorial CDC Shanghai 2009Markov Tutorial CDC Shanghai 2009
Markov Tutorial CDC Shanghai 2009Sean Meyn
 
Q-Learning and Pontryagin's Minimum Principle
Q-Learning and Pontryagin's Minimum PrincipleQ-Learning and Pontryagin's Minimum Principle
Q-Learning and Pontryagin's Minimum PrincipleSean Meyn
 
Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...
Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...
Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...Sean Meyn
 
Why are stochastic networks so hard to simulate?
Why are stochastic networks so hard to simulate?Why are stochastic networks so hard to simulate?
Why are stochastic networks so hard to simulate?Sean Meyn
 

More from Sean Meyn (19)

Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast AlgorithmsReinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
 
State estimation and Mean-Field Control with application to demand dispatch
State estimation and Mean-Field Control with application to demand dispatchState estimation and Mean-Field Control with application to demand dispatch
State estimation and Mean-Field Control with application to demand dispatch
 
Demand-Side Flexibility for Reliable Ancillary Services
Demand-Side Flexibility for Reliable Ancillary ServicesDemand-Side Flexibility for Reliable Ancillary Services
Demand-Side Flexibility for Reliable Ancillary Services
 
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
 
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
 
Why Do We Ignore Risk in Power Economics?
Why Do We Ignore Risk in Power Economics?Why Do We Ignore Risk in Power Economics?
Why Do We Ignore Risk in Power Economics?
 
Distributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power GridDistributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power Grid
 
Ancillary service to the grid from deferrable loads: the case for intelligent...
Ancillary service to the grid from deferrable loads: the case for intelligent...Ancillary service to the grid from deferrable loads: the case for intelligent...
Ancillary service to the grid from deferrable loads: the case for intelligent...
 
2012 Tutorial: Markets for Differentiated Electric Power Products
2012 Tutorial:  Markets for Differentiated Electric Power Products2012 Tutorial:  Markets for Differentiated Electric Power Products
2012 Tutorial: Markets for Differentiated Electric Power Products
 
Control Techniques for Complex Systems
Control Techniques for Complex SystemsControl Techniques for Complex Systems
Control Techniques for Complex Systems
 
Tutorial for Energy Systems Week - Cambridge 2010
Tutorial for Energy Systems Week - Cambridge 2010Tutorial for Energy Systems Week - Cambridge 2010
Tutorial for Energy Systems Week - Cambridge 2010
 
Panel Lecture for Energy Systems Week
Panel Lecture for Energy Systems WeekPanel Lecture for Energy Systems Week
Panel Lecture for Energy Systems Week
 
The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010
 
Approximate dynamic programming using fluid and diffusion approximations with...
Approximate dynamic programming using fluid and diffusion approximations with...Approximate dynamic programming using fluid and diffusion approximations with...
Approximate dynamic programming using fluid and diffusion approximations with...
 
Anomaly Detection Using Projective Markov Models
Anomaly Detection Using Projective Markov ModelsAnomaly Detection Using Projective Markov Models
Anomaly Detection Using Projective Markov Models
 
Markov Tutorial CDC Shanghai 2009
Markov Tutorial CDC Shanghai 2009Markov Tutorial CDC Shanghai 2009
Markov Tutorial CDC Shanghai 2009
 
Q-Learning and Pontryagin's Minimum Principle
Q-Learning and Pontryagin's Minimum PrincipleQ-Learning and Pontryagin's Minimum Principle
Q-Learning and Pontryagin's Minimum Principle
 
Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...
Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...
Efficiency and Marginal Cost Pricing in Dynamic Competitive Markets with Fric...
 
Why are stochastic networks so hard to simulate?
Why are stochastic networks so hard to simulate?Why are stochastic networks so hard to simulate?
Why are stochastic networks so hard to simulate?
 

DeepLearn2022 3. TD and Q Learning

  • 1. Part 3: Introduction to Reinforcement Learning Temporal Difference Techniques Sean Meyn Department of Electrical and Computer Engineering University of Florida Inria International Chair Inria, Paris Thanks to to our sponsors: NSF and ARO
  • 2. Part 3: Introduction to Reinforcement Learning Outline 1 Recap 2 TD Learning 3 Exploration 4 Q Learning 5 Zap Q 6 Conclusions & Future Directions 7 References
  • 3. 5 10 0 0 20 40 60 80 100 ρ = 0.8 τk Outcomes from M repeated runs -20 0 20 √ Nθ̃N Crank Up Gain: Tame Transients Average: Tame Variance PR θk ϑτk Algorithm Design
  • 4. Recap Algorithm Design Steps to a Successful Design With Thanks to Lyapunov and Polyak! 0 20 40 60 80 100 ρ = 0.8 τk 5 10 0 Outcomes from M repeated runs -20 0 20 Crank Up Gain: Tame Transients Average: Tame Variance θk ϑτk N − N0 θ̃PR N Follow These Steps: 1 Design ¯ f for mean flow d dt ϑ = ¯ f(ϑ) 2 Design gain: αn = g/(1 + n/ne)ρ Crank up gain, means ρ < 1 3 Perform PR Averaging 4 Repeat! Obtain histogram √ N − N0 θ̃PR (m) N : 1 ≤ m ≤ M with θ (m) 0 widely dispersed and N relatively small What you will learn: How big N needs to be Approximate confidence bounds 1 / 52
  • 5. Recap Algorithm Design Taming Nonlinear Dynamics What if stability of d dt ϑ = ¯ f(ϑ) is unknown? [typically the case in RL] Newton Raphson Flow [Smale 1976] Idea: Interpret ¯ f as the “parameter”: d dt ¯ ft = V( ¯ ft) and design V 2 / 52
  • 6. Recap Algorithm Design Taming Nonlinear Dynamics What if stability of d dt ϑ = ¯ f(ϑ) is unknown? [typically the case in RL] Newton Raphson Flow [Smale 1976] Idea: Interpret ¯ f as the “parameter”: d dt ¯ f = − ¯ ft Linear dynamics 2 / 52
  • 7. Recap Algorithm Design Taming Nonlinear Dynamics What if stability of d dt ϑ = ¯ f(ϑ) is unknown? [typically the case in RL] Newton Raphson Flow [Smale 1976] Idea: Interpret ¯ f as the “parameter”: d dt ¯ f = − ¯ ft d dt ¯ f(ϑt) = − ¯ f(ϑt) d dt ϑt = −[A(ϑt)]−1 ¯ f(ϑt) SA translation: Zap Stochastic Approximation 2 / 52
  • 8. Recap Algorithm Design Zap Algorithm Designed to emulate Newton-Raphson flow d dt ϑt = −[A(ϑt)]−1 ¯ f(ϑt), A(θ) = ∂ ∂θ ¯ f(θ) Zap-SA (designed to emulate deterministic Newton-Raphson) θn+1 = θn + αn+1[− b An+1]−1 f(θn, ξn+1) b An+1 = b An + βn+1(An+1 − b An), An+1 = ∂θf(θn, ξn+1) 3 / 52
  • 9. Recap Algorithm Design Zap Algorithm Designed to emulate Newton-Raphson flow d dt ϑt = −[A(ϑt)]−1 ¯ f(ϑt), A(θ) = ∂ ∂θ ¯ f(θ) Zap-SA (designed to emulate deterministic Newton-Raphson) θn+1 = θn + αn+1[− b An+1]−1 f(θn, ξn+1) b An+1 = b An + βn+1(An+1 − b An), An+1 = ∂θf(θn, ξn+1) Requires b An+1 ≈ A(θn) def = ∂θ ¯ f (θn) 3 / 52
  • 10. Recap Algorithm Design Zap Algorithm Designed to emulate Newton-Raphson flow d dt ϑt = −[A(ϑt)]−1 ¯ f(ϑt), A(θ) = ∂ ∂θ ¯ f(θ) Zap-SA (designed to emulate deterministic Newton-Raphson) θn+1 = θn + αn+1[− b An+1]−1 f(θn, ξn+1) b An+1 = b An + βn+1(An+1 − b An), An+1 = ∂θf(θn, ξn+1) b An+1 ≈ A(θn) requires high-gain, βn αn → ∞, n → ∞ 3 / 52
  • 11. Recap Algorithm Design Zap Algorithm Designed to emulate Newton-Raphson flow d dt ϑt = −[A(ϑt)]−1 ¯ f(ϑt), A(θ) = ∂ ∂θ ¯ f(θ) Zap-SA (designed to emulate deterministic Newton-Raphson) θn+1 = θn + αn+1[− b An+1]−1 f(θn, ξn+1) b An+1 = b An + βn+1(An+1 − b An), An+1 = ∂θf(θn, ξn+1) b An+1 ≈ A(θn) requires high-gain, βn αn → ∞, n → ∞ Can use αn = 1/n (without averaging). Numerics that follow: use this choice, and βn = (1/n)ρ, ρ ∈ (0.5, 1) 3 / 52
  • 12. Recap Algorithm Design Zap Algorithm Designed to emulate Newton-Raphson flow d dt ϑt = −[A(ϑt)]−1 ¯ f(ϑt), A(θ) = ∂ ∂θ ¯ f(θ) Zap-SA (designed to emulate deterministic Newton-Raphson) θn+1 = θn + αn+1[− b An+1]−1 f(θn, ξn+1) b An+1 = b An + βn+1(An+1 − b An), An+1 = ∂θf(θn, ξn+1) b An+1 ≈ A(θn) requires high-gain, βn αn → ∞, n → ∞ Can use αn = 1/n (without averaging). Numerics that follow: use this choice, and βn = (1/n)ρ, ρ ∈ (0.5, 1) Stability? Virtually universal Optimal variance, too! Based on ancient theory from Ruppert Polyak [80, 81, 79] 3 / 52
  • 14. Recap Optimal Control Keep In Mind! Trajectory Generation State Feedback Process Observer Environment Σ Σ Σ Σ y u e w d xref ufb uff − + x̂ M ore feedback r ∆ n hidden state x Every Control Problem is Multi-Objective Many costs to consider Many measures of risk See CSRL Chapter 2, and Åström and Murray [1] 4 / 52
  • 15. Recap Optimal Control Keep In Mind! Trajectory Generation State Feedback Process Observer Environment Σ Σ Σ Σ y u e w d xref ufb uff − + x̂ M ore feedback r ∆ n hidden state x Every Control Problem is Multi-Objective Many costs to consider Many measures of risk Every Control Problem is Multi-Temporal Advanced planning, and periodic updates Real time feedback φ See CSRL Chapter 2, and Åström and Murray [1] 4 / 52
  • 16. Recap Optimal Control Keep In Mind! Trajectory Generation State Feedback Process Observer Environment Σ Σ Σ Σ y u e w d xref ufb uff − + x̂ M ore feedback r ∆ n hidden state x Every Control Problem is Multi-Objective Many costs to consider Many measures of risk Every Control Problem is Multi-Temporal Advanced planning, and periodic updates Real time feedback φ There are Choices Beyond φ? See CSRL Chapter 2, and Åström and Murray [1] 4 / 52
  • 17. Recap Optimal Control MDP Model Keep things simple: state space X and input space U are finite MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{Xn+1 ∈ A | Xn = x, Un = u, and prior history} = Pu(x, A) c: X × U → R is a cost function γ 1 discount factor 5 / 52
  • 18. Recap Optimal Control MDP Model Keep things simple: state space X and input space U are finite Cost function c: X × U → R+ Q-function: Q? (x, u) = min U∞ 1 ∞ X k=0 γk E[c(X(k), U(k)) | X(0) = x , U(0) = u] 5 / 52
  • 19. Recap Optimal Control MDP Model Keep things simple: state space X and input space U are finite Cost function c: X × U → R+ Q-function: Q? (x, u) = min U∞ 1 ∞ X k=0 γk E[c(X(k), U(k)) | X(0) = x , U(0) = u] DP equation: Q? (x, u) = c(x, u) + γE[Q? (X(k + 1)) | X(k) = x , U(k) = u] H(x) = min u H(x, u) 5 / 52
  • 20. Recap Optimal Control MDP Model Keep things simple: state space X and input space U are finite Cost function c: X × U → R+ Q-function: Q? (x, u) = min U∞ 1 ∞ X k=0 γk E[c(X(k), U(k)) | X(0) = x , U(0) = u] DP equation: Q? (x, u) = c(x, u) + γE[Q? (X(k + 1)) | X(k) = x , U(k) = u] Optimal policy: U?(k) = φ?(X?(k)), φ? (x) ∈ arg min u Q? (x, u) 5 / 52
  • 21. Recap Optimal Control MDP Model Keep things simple: state space X and input space U are finite Cost function c: X × U → R+ Q-function: Q? (x, u) = min U∞ 1 ∞ X k=0 γk E[c(X(k), U(k)) | X(0) = x , U(0) = u] DP equation: Q? (x, u) = c(x, u) + γE[Q? (X(k + 1)) | X(k) = x , U(k) = u] Optimal policy: U?(k) = φ?(X?(k)), φ? (x) ∈ arg min u Q? (x, u) Goal of SC and RL: approximate Q? and/or φ? Today’s lecture: find best approximation among {Qθ : θ ∈ Rd} 5 / 52
  • 22. Recap Optimal Control MDP Model Some flexibility with notation: c(X(k), U(k)) = ck, etc. Keep things simple: state space X and input space U are finite Cost function c: X × U → R+ Q-function: Q? (x, u) = min U∞ 1 ∞ X k=0 γk E[ ck | X0 = x , U0 = u] DP equation: Q? (x, u) = c(x, u) + γE[ Q? k+1 | Xk = x , Uk = u] Optimal policy: U? k = φ?(X? k), φ? (x) ∈ arg min u Q? (x, u) Goal of SC and RL: approximate Q? and/or φ? Today’s lecture: find best approximation among {Qθ : θ ∈ Rd} 6 / 52
  • 24. TD Learning Estimating the State-Action Value Function TD Learning Given policy φ, approximate the fixed-policy Q-function U(k) = φ(X(k)), k ≥ 1 Qφ (x, u) = ∞ X k=0 γk E[c(X(k), U(k)) | X(0) = x , U(0) = u] Motivation: Policy Improvement φ+ (x) = arg min u Qφ (x, u) 7 / 52
  • 25. TD Learning Estimating the State-Action Value Function TD Learning Given policy φ, approximate the fixed-policy Q-function U(k) = φ(X(k)), k ≥ 1 DP equation: Qφ (x, u) = c(x, u) + γE[Qφ (X(k + 1)) | X(k) = x , U(k) = u] Qφ (X(k), U(k)) = c(X(k), U(k)) + γE[Qφ (X(k + 1)) | Fk] with new definition, H(x) = H(x, φ(x)). 7 / 52
  • 26. TD Learning Estimating the State-Action Value Function TD Learning Given policy φ, approximate the fixed-policy Q-function U(k) = φ(X(k)), k ≥ 1 Qφ (X(k), U(k)) = c(X(k), U(k)) + γE[Qφ (X(k + 1)) | Fk] with new definition, H(x) = H(x, φ(x)). Galerkin relaxation: θ∗ is a root of the function ¯ f(θ) def = E[{−Qθ (X(k), U(k)) + c(X(k), U(k)) + γQθ (X(k + 1))}ζk] Eligibility vector in Rd 7 / 52
  • 27. TD Learning Estimating the State-Action Value Function TD Learning Given policy φ, approximate the fixed-policy Q-function U(k) = φ(X(k)), k ≥ 1 Qφ (X(k), U(k)) = c(X(k), U(k)) + γE[Qφ (X(k + 1)) | Fk] with new definition, H(x) = H(x, φ(x)). Galerkin relaxation: θ∗ is a root of the function ¯ f(θ) def = E[{−Qθ (X(k), U(k)) + c(X(k), U(k)) + γQθ (X(k + 1))}ζk] Linear family: {Qθ = θT ψ} d dt ϑ = ¯ f(ϑ) is stable (clever choice of ζk, and with covariance Σψ full rank) Alternatively, compute θ∗ by a single matrix inversion SA to the rescue 7 / 52
  • 28. TD Learning Estimating the State-Action Value Function TD Learning Given policy φ, approximate the fixed-policy Q-function U(k) = φ(X(k)), k ≥ 1 Qφ (X(k), U(k)) = c(X(k), U(k)) + γE[Qφ (X(k + 1)) | Fk] with new definition, H(x) = H(x, φ(x)). Galerkin relaxation: θ∗ is a root of the function ¯ f(θ) def = E[{−Qθ (X(k), U(k)) + c(X(k), U(k)) + γQθ (X(k + 1))}ζk] How to estimate θ? for neural network (nonlinear) architecture? SA to the rescue 7 / 52
  • 29. TD Learning Estimating the State-Action Value Function TD(λ) ¯ f(θ) def = E[{−Qθ k + ck + γQθ k+1 }ζk] Choose λ ∈ [0, 1] as you please, and then for each n ζn = n X k=0 (γλ)n−k ∇θ[Qθ k]
  • 30.
  • 31. θ=θn Interpretation for linear function class: ∇θ[Qθ k] = ψk = ψ(Xk, Uk) Basis observations passed through a low pass linear filter 8 / 52
  • 32. TD Learning Estimating the State-Action Value Function TD(λ) ¯ f(θ) def = E[{−Qθ k + ck + γQθ k+1 }ζk] Choose λ ∈ [0, 1] as you please, and then for each n ζn = n X k=0 (γλ)n−k ∇θ[Qθ k]
  • 33.
  • 34. θ=θn TD(λ) ≡ SA algorithm: θn+1 = θn + αn+1f(θn, ξn+1) = θn + αn+1{−Qθn n + cn + γQθn n+1 }ζn ζn+1 = γλζn + ∇θ[Qθ n]
  • 35.
  • 36. θ=θn+1 with ξn+1 = (Xn+1, Xn, Un, ζn) 8 / 52
  • 37. TD Learning Least-Squares Temporal Difference Learning LSTD and Zap ¯ f(θ) def = E[{−Qθ k + ck + γQθ k+1 }ζk] We require the linearization matrix: A(θ, ξk+1) = ∂θ{−Qθ k + γQθ k+1 }ζk 9 / 52
  • 38. TD Learning Least-Squares Temporal Difference Learning LSTD and Zap ¯ f(θ) def = E[{−Qθ k + ck + γQθ k+1 }ζk] We require the linearization matrix: A(θ, ξk+1) = ∂θ{−Qθ k + γQθ k+1 }ζk = −ψ(Xk, Uk) + γψ Xk+1, φ(Xk+1) }ζk linear function class 9 / 52
  • 39. TD Learning Least-Squares Temporal Difference Learning LSTD and Zap ¯ f(θ) def = E[{−Qθ k + ck + γQθ k+1 }ζk] We require the linearization matrix: A(θ, ξk+1) = ∂θ{−Qθ k + γQθ k+1 }ζk = −ψ(Xk, Uk) + γψ Xk+1, φ(Xk+1) }ζk linear function class A∗ = E {−ψ(Xk, Uk) + γψ Xk+1, φ(Xk+1) }ζk] is Hurwitz in this case Only requirement is that Σψ 0 9 / 52
  • 40. TD Learning Least-Squares Temporal Difference Learning LSTD and Zap ¯ f(θ) def = E[{−Qθ k + ck + γQθ k+1 }ζk] We require the linearization matrix: A(θ, ξk+1) = ∂θ{−Qθ k + γQθ k+1 }ζk = −ψ(Xk, Uk) + γψ Xk+1, φ(Xk+1) }ζk linear function class θn+1 = θn + αn+1[− b An+1]−1 f(θn, ξn+1) f(θn, ξn+1) = {−Qθn n + cn + γQθn n+1 }ζn b An+1 = b An + αn+1(An+1 − b An) An+1 = {−ψ(Xn, Un) + γψ(Xn+1, φ(Xn+1)}ζn Zap SA doesn’t require two time-scale algorithm when A(θ) ≡ A∗ 9 / 52
  • 41. TD Learning Least-Squares Temporal Difference Learning LSTD and Zap ¯ f(θ) def = E[{−Qθ k + ck + γQθ k+1 }ζk] We require the linearization matrix: A(θ, ξk+1) = ∂θ{−Qθ k + γQθ k+1 }ζk = −ψ(Xk, Uk) + γψ Xk+1, φ(Xk+1) }ζk linear function class LSTD(λ)-Learning θn+1 = θn + αn+1[− b An+1]−1 f(θn, ξn+1) f(θn, ξn+1) = {−Qθn n + cn + γQθn n+1 }ζn b An+1 = b An + αn+1(An+1 − b An) An+1 = {−ψ(Xn, Un) + γψ(Xn+1, φ(Xn+1)}ζn Optimal convergence rate (as obtained using PR averaging) 9 / 52
  • 42. TD Learning Least-Squares Temporal Difference Learning What Is LSTD(λ) Solving? Beautiful Theory for λ = 1 10 / 52
  • 43. TD Learning Least-Squares Temporal Difference Learning What Is LSTD(1) Solving? Eternal gratitude to John Tsitsiklis! Answered in the dissertation of Van Roy [18]: min θ kQθ − Qφ k2 π = X x,u Qθ (x, u) − Qφ (x, u) 2 π(x, u) See [21, 20] and Chapter 9 of CSRL 10 / 52
  • 44. TD Learning Least-Squares Temporal Difference Learning What Is LSTD(1) Solving? Eternal gratitude to John Tsitsiklis! Answered in the dissertation of Van Roy [18]: min θ kQθ − Qφ k2 π = X x,u Qθ (x, u) − Qφ (x, u) 2 π(x, u) See [21, 20] and Chapter 9 of CSRL Foundation for Actor-critic algorithms in the dissertation of Konda [46] θn+1 = θn − αn+1 ˘ ∇Γ (n) ¯ f(θ) = −∇Γ (θ) Unbiased gradient observation: ˘ ∇Γ (n) def = Λθn n Qθn n Score function: Λθ (x, u) = ∇θ log[e φθ (u | x)] See also [47] and Chapter 10 of CSRL 10 / 52
  • 45. TD Learning Least-Squares Temporal Difference Learning LSTD(1) Optimal Covariance May Be Large Example from CTCN [33]: 0 10 20 30 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 11 / 52
  • 46. TD Learning Least-Squares Temporal Difference Learning LSTD(1) Optimal Covariance May Be Large Example from CTCN [33]: 0 10 20 30 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 Architecture linear, two dimensional, and ideal in every sense 11 / 52
  • 47. TD Learning Least-Squares Temporal Difference Learning LSTD(1) Optimal Covariance May Be Large Example from CTCN [33]: 0 10 20 30 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 Architecture linear, two dimensional, and ideal in every sense Algorithm performance is far from ideal 11 / 52
  • 48. TD Learning Changing Our Goals State Weighting Weighted Objective min θ kQθ − Qφ k2 π,Ω = X x,u 1 Ω(x, u) Qθ (x, u) − Qφ (x, u) 2 π(x, u) 12 / 52
  • 49. TD Learning Changing Our Goals State Weighting Weighted Objective min θ kQθ − Qφ k2 π,Ω = X x,u 1 Ω(x, u) Qθ (x, u) − Qφ (x, u) 2 π(x, u) Zap ≡ LSTD(1) θn = [− b An]−1 bn b An+1 = b An + 1 n + 1 h − 1 Ωn+1 ψn+1ψT n+1 − b An i bn+1 = bn + 1 n + 1 ζncn − bn ζn+1 = γζn + 1 Ωn+1 ψn+1 12 / 52
  • 50. TD Learning Changing Our Goals State Weighting Design state weighting so that f(θ, ξ) becomes bounded in ξ: 0 10 20 30 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 13 / 52
  • 51. TD Learning Changing Our Goals State Weighting Design state weighting so that f(θ, ξ) becomes bounded in ξ: 0 10 20 30 0 10 20 30 0 5 10 15 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 [1 + (µ − α)x] 4 Ω(x) ≡ 1 13 / 52
  • 52. TD Learning Changing Our Goals State Weighting Design state weighting so that f(θ, ξ) becomes bounded in ξ: 0 10 20 30 0 10 20 30 0 5 10 15 0 5 10 15 ρ = 0.8 ρ = 0.9 hθ (x) = θ1 θ1 x + θ2x2 θ2 θ∗ 1 = θ∗ 2 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 0 1 2 3 4 5 x 10 6 [1 + (µ − α)x] 4 Ω(x) ≡ 1 Far more flexible and well motivated than introducing λ ζn+1 = γζn + 1 Ωn+1 ψn+1 13 / 52
  • 53. TD Learning. Changing Our Goals: Relative LSTD. Far more flexible and well motivated than introducing $\lambda$? $\zeta_{n+1} = \gamma\zeta_n + \frac{1}{\Omega_{n+1}}\psi(X_{n+1})$. 14 / 52
  • 54. TD Learning. Changing Our Goals: Relative LSTD. Far more flexible and well motivated than introducing $\lambda$? $\zeta_{n+1} = \gamma\zeta_n + \frac{1}{\Omega_{n+1}}\psi(X_{n+1})$. Performance will degrade as $\gamma$ gets close to one. 14 / 52
  • 55. TD Learning. Changing Our Goals: Relative LSTD. Performance will degrade as $\gamma$ gets close to one. Change your objective: $\min_\theta \|H^\theta - H^\phi\|_{\pi,\Omega}^2$ with $H(x,u) = Q(x,u) - Q(x^\bullet,u^\bullet)$ (some fixed normalization). 14 / 52
  • 56. TD Learning. Changing Our Goals: Relative LSTD. Performance will degrade as $\gamma$ gets close to one. Change your objective: $\min_\theta \|H^\theta - H^\phi\|_{\pi,\Omega}^2$ with $H(x,u) = Q(x,u) - Q(x^\bullet,u^\bullet)$ (some fixed normalization). Relative LSTD: replace $\psi(x,u)$ by $\widetilde\psi(x,u) \overset{\text{def}}{=} \psi(x,u) - \psi(x^\bullet,u^\bullet)$ wherever it appears. 14 / 52
  • 57. TD Learning. Changing Our Goals: Relative LSTD. Change your objective: $\min_\theta \|H^\theta - H^\phi\|_{\pi,\Omega}^2$ with $H(x,u) = Q(x,u) - Q(x^\bullet,u^\bullet)$. Relative LSTD: replace $\psi(x,u)$ by $\widetilde\psi(x,u) \overset{\text{def}}{=} \psi(x,u) - \psi(x^\bullet,u^\bullet)$ wherever it appears. Covariance is optimal, and uniformly bounded over $0 \le \gamma < 1$ [64]. 14 / 52
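A minimal sketch of the substitution that defines relative LSTD: wrap the feature map so that every evaluation of $\psi$ is replaced by $\widetilde\psi(x,u) = \psi(x,u) - \psi(x^\bullet,u^\bullet)$. The feature map and reference pair below are placeholders.

```python
def relative_features(psi, x_ref, u_ref):
    """Return psi_tilde(x, u) = psi(x, u) - psi(x_ref, u_ref).
    Using psi_tilde wherever psi appears in the LSTD recursion
    gives the relative LSTD algorithm described in the slide."""
    psi_ref = psi(x_ref, u_ref)
    return lambda x, u: psi(x, u) - psi_ref

# Example usage (hypothetical reference pair x_star, u_star):
# psi_tilde = relative_features(psi, x_star, u_star)
```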
  • 58. TD Learning. Changing Our Goals: Relative LSTD. [Figure 1: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem of [4]. Maximal Bellman error vs. $n$ for $1/(1-\gamma) = 10^3$ and $10^4$; Q-learning with gain $g = 1/(1-\gamma)$, Relative Q-learning with $g = 1/(1-\rho^*\gamma)$. The relative Q-learning algorithm is unaffected by large discounting.] Relative LSTD: replace $\psi(x,u)$ by $\widetilde\psi(x,u) \overset{\text{def}}{=} \psi(x,u) - \psi(x^\bullet,u^\bullet)$ wherever it appears. Covariance is optimal, and uniformly bounded over $0 \le \gamma < 1$ [64]. 15 / 52
  • 59. Exploration. On and Off Policy Algorithms: Off Policy for Exploration. In TD-learning, sufficient exploration means $\Sigma_\psi = \mathsf{E}_\pi[\psi_k\psi_k^{\mathsf T}] > 0$. This ensures convergence, solving $\bar f(\theta^*) = 0$, perhaps with $\theta^*$ the minimizer of $\|Q^\theta - Q^\phi\|^2_{\pi,\Omega}$. [Figure: invariant density and optimal policy under a small and a large exploration input [57].] 16 / 52
  • 60. Exploration. On and Off Policy Algorithms: Off Policy for Exploration. In TD-learning, sufficient exploration means $\Sigma_\psi = \mathsf{E}_\pi[\psi_k\psi_k^{\mathsf T}] > 0$. Not always possible! Consider $X_{n+1} = aX_n + U_n$, $U_n = \phi(X_n) = -KX_n$. 16 / 52
  • 61. Exploration. On and Off Policy Algorithms: Off Policy for Exploration. In TD-learning, sufficient exploration means $\Sigma_\psi = \mathsf{E}_\pi[\psi_k\psi_k^{\mathsf T}] > 0$. Not always possible! Consider $X_{n+1} = aX_n + U_n$, $U_n = \phi(X_n) = -KX_n$. If $|a - K| < 1$ then the steady state is supported on $X_n = U_n \equiv 0$. 16 / 52
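A quick numerical illustration of the point above (a sketch, not from the slides): with $U_n = -KX_n$ and $|a - K| < 1$, the closed-loop state decays to zero, so the empirical covariance of any features of the state that vanish at the origin degenerates.

```python
import numpy as np

a, K = 0.9, 0.5                              # |a - K| = 0.4 < 1: stable closed loop, no exploration
x, N = 1.0, 10_000
Sigma_psi = np.zeros((2, 2))
for _ in range(N):
    psi = np.array([x, x**2])                # illustrative features of the state
    Sigma_psi += np.outer(psi, psi) / N
    x = a * x - K * x                        # U_n = -K X_n
print(np.linalg.eigvalsh(Sigma_psi))         # eigenvalues ~ 0: Sigma_psi is (numerically) singular
```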
  • 62. Exploration. On and Off Policy Algorithms: Off Policy for Exploration. In TD-learning, sufficient exploration means $\Sigma_\psi = \mathsf{E}_\pi[\psi_k\psi_k^{\mathsf T}] > 0$. Not always possible! Off-Policy TD($\lambda$): $\theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n,\xi_{n+1}) = \theta_n + \alpha_{n+1}\{-Q^{\theta_n}_n + c_n + \gamma Q^{\theta_n}_{n+1}\}\zeta_n$, with $\zeta_{n+1} = \gamma\lambda\,\zeta_n + \nabla_\theta Q^\theta_{n+1}\big|_{\theta=\theta_n}$. 16 / 52
  • 65. Exploration. On and Off Policy Algorithms: Off Policy for Exploration. Off-Policy TD($\lambda$): $\theta_{n+1} = \theta_n + \alpha_{n+1}\{-Q^{\theta_n}_n + c_n + \gamma Q^{\theta_n}_{n+1}\}\zeta_n$, $\zeta_{n+1} = \gamma\lambda\,\zeta_n + \nabla_\theta Q^\theta_{n+1}\big|_{\theta=\theta_n}$. The difference: on policy, $Q^{\theta_n}_{n+1} = Q^{\theta_n}(X_{n+1},U_{n+1})$ with $U_{n+1} = \phi(X_{n+1})$; off policy, $Q^{\theta_n}_{n+1} = Q^{\theta_n}(X_{n+1},U_{n+1})$ with $U_{n+1} = \widetilde\phi(X_{n+1})$. 16 / 52
  • 68. Exploration. On and Off Policy Algorithms: Off Policy for Exploration. Off-Policy TD($\lambda$), with the difference: on policy, $U_{n+1} = \phi(X_{n+1})$; off policy, $U_{n+1} = \widetilde\phi(X_{n+1})$. Example: $U_{n+1} = \widetilde\phi(X_{n+1}) = \phi(X_{n+1})$ with probability $1-\varepsilon$, and $\mathrm{rand}_{n+1}$ otherwise. 16 / 52
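The randomized exploration policy in the example above, written as a small sketch; `phi` (the current policy) and `action_space` are placeholders.

```python
import random

def epsilon_greedy(phi, action_space, epsilon):
    """Off-policy exploration: follow phi(x) with probability 1 - epsilon,
    otherwise choose an action uniformly at random."""
    def policy(x):
        if random.random() < 1.0 - epsilon:
            return phi(x)
        return random.choice(list(action_space))
    return policy
```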
  • 71. Exploration What Are We Trying To Accomplish Wide World of Exploration How would this intelligent agent explore her environment to maximize her rewards? 17 / 52
  • 72. Exploration What Are We Trying To Accomplish Wide World of Exploration How would this intelligent agent explore ...? Exploration might lead to a meeting with the cat 17 / 52
  • 73. Exploration What Are We Trying To Accomplish Wide World of Exploration How would this intelligent agent explore ...? Exploration might lead to a meeting with the cat She will come to the conclusion that the delicious smell is not cheese 17 / 52
  • 74. Exploration What Are We Trying To Accomplish Wide World of Exploration How would this intelligent agent explore ...? Exploration might lead to a meeting with the cat She will come to the conclusion that the delicious smell is not cheese Would she continue to meet up with the cat, and smell the poop, in order to obtain the best estimate of Q∗? 17 / 52
  • 75. Exploration What Are We Trying To Accomplish Wide World of Exploration How would this intelligent agent explore ...? Exploration might lead to a meeting with the cat She will come to the conclusion that the delicious smell is not cheese Would she continue to meet up with the cat, and smell the poop, in order to obtain the best estimate of Q∗? As intelligent agents we, like the mouse, must not forget our goals! 17 / 52
  • 76. Q Learning. [Figure 1: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem of [4]; maximal Bellman error vs. $n$ for $1/(1-\gamma) = 10^3$ and $10^4$. The relative Q-learning algorithm is unaffected by large discounting.]
  • 77. Q Learning. Galerkin Relaxation: Dynamic programming. Find the function $Q^*$ that solves ($\mathcal{F}_n$ denotes the history) $\mathsf{E}\big[c(X_n,U_n) + \gamma \underline{Q}^*(X_{n+1}) - Q^*(X_n,U_n) \mid \mathcal{F}_n\big] = 0$, where $\underline{H}(x) = \min_u H(x,u)$. 18 / 52
  • 78. Q Learning. Galerkin Relaxation: Dynamic programming. Find the function $Q^*$ that solves ($\mathcal{F}_n$ denotes the history) $\mathsf{E}\big[c(X_n,U_n) + \gamma \underline{Q}^*(X_{n+1}) - Q^*(X_n,U_n) \mid \mathcal{F}_n\big] = 0$. Hidden Goal of Q-Learning (including DQN): given $\{Q^\theta : \theta\in\mathbb{R}^d\}$, find $\theta^*$ that solves $\bar f(\theta^*) = 0$, where $\bar f(\theta) \overset{\text{def}}{=} \mathsf{E}\big[\big(c(X_n,U_n) + \gamma \underline{Q}^\theta(X_{n+1}) - Q^\theta(X_n,U_n)\big)\zeta_n\big]$. The family $\{Q^\theta\}$ and eligibility vectors $\{\zeta_n\}$ are part of algorithm design. 18 / 52
  • 79. Q Learning. Galerkin Relaxation: Q(0)-Learning. Hidden goal $\bar f(\theta^*) = 0$, with $\bar f(\theta) = \mathsf{E}\big[\big(c(X_n,U_n) + \gamma \underline{Q}^\theta(X_{n+1}) - Q^\theta(X_n,U_n)\big)\zeta_n\big]$. Prototypical choice: $\zeta_n = \nabla_\theta Q^\theta(X_n,U_n)\big|_{\theta=\theta_n}$ $\Longrightarrow$ the prototypical Q-learning algorithm. 19 / 52
  • 84. Q Learning. Galerkin Relaxation: Q(0) Learning Algorithm. Estimates obtained using SA: $\theta_{n+1} = \theta_n + \alpha_{n+1} f_{n+1}$, $f_{n+1} = \big(c_n + \gamma \underline{Q}^\theta_{n+1} - Q^\theta_n\big)\zeta_n\big|_{\theta=\theta_n}$, where $\underline{Q}^\theta_{n+1} = Q^\theta(X_{n+1}, \phi^\theta(X_{n+1}))$ and $\phi^\theta(x) = \arg\min_u Q^\theta(x,u)$ [the $Q^\theta$-greedy policy]. Input $\{U_n\}$ chosen for exploration. 19 / 52
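A sketch (not from the slides) of the Q(0) recursion above for a linear architecture $Q^\theta(x,u) = \theta^{\mathsf T}\psi(x,u)$, so that $\zeta_n = \psi(X_n,U_n)$. The step function, feature map, action set, and exploration rule are placeholders to be supplied by the user.

```python
import numpy as np

def q0_learning(x0, step, psi, actions, theta0, gamma, n_iter, explore):
    """Q(0)-learning with Q_theta(x, u) = theta^T psi(x, u) (sketch).
    step(x, u) -> (x_next, c_n); explore(x, theta) -> exploratory action."""
    theta = np.array(theta0, dtype=float)
    x = x0
    for n in range(1, n_iter + 1):
        u = explore(x, theta)
        x_next, c = step(x, u)
        q_next = min(theta @ psi(x_next, a) for a in actions)   # underline-Q at X_{n+1}
        td = c + gamma * q_next - theta @ psi(x, u)             # temporal-difference term
        theta += (1.0 / n) * td * psi(x, u)                     # alpha_n = 1/n, zeta_n = psi(X_n, U_n)
        x = x_next
    return theta
```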
  • 88. Q Learning. Watkins' Q-Learning: Q(0)-learning with linear function approximation. Hidden goal $\bar f(\theta^*) = 0$. Estimates obtained using SA: $\theta_{n+1} = \theta_n + \alpha_{n+1} f_{n+1}$, $f_{n+1} = \big(c_n + \gamma \underline{Q}^\theta_{n+1} - Q^\theta_n\big)\big|_{\theta=\theta_n}\,\zeta_n$, with $\underline{Q}^\theta_{n+1} = Q^\theta(X_{n+1}, \phi^\theta(X_{n+1}))$, $Q^\theta(x,u) = \theta^{\mathsf T}\psi(x,u)$, $\underline{Q}^\theta(x) = \theta^{\mathsf T}\psi(x,\phi^\theta(x))$, and $\zeta_n = \nabla_\theta Q^\theta(X_n,U_n)\big|_{\theta=\theta_n} = \psi(X_n,U_n)$. 20 / 52
  • 94. Q Learning. Watkins' Q-Learning: Q(0)-learning with linear function approximation (continued). In this case $\bar f(\theta) = A(\theta)\theta - \bar b = 0$, with $A(\theta) = \mathsf{E}\big[\zeta_n\big(\gamma\psi(X_{n+1},\phi^\theta(X_{n+1})) - \psi(X_n,U_n)\big)^{\mathsf T}\big]$ and $\bar b \overset{\text{def}}{=} \mathsf{E}\big[\zeta_n c(X_n,U_n)\big]$. 20 / 52
  • 100. Q Learning. Watkins' Q-Learning: Watkins' Q-learning. $\mathsf{E}\big[\big(c(X_n,U_n) + \gamma \underline{Q}^{\theta^*}(X_{n+1}) - Q^{\theta^*}(X_n,U_n)\big)\zeta_n\big] = 0$. 21 / 52
  • 101. Q Learning. Watkins' Q-Learning: Watkins' Q-learning. $\mathsf{E}\big[\big(c(X_n,U_n) + \gamma \underline{Q}^{\theta^*}(X_{n+1}) - Q^{\theta^*}(X_n,U_n)\big)\zeta_n\big] = 0$. Watkins' algorithm is a special case of Q(0)-learning. The family $\{Q^\theta\}$ and eligibility vectors $\{\zeta_n\}$ in this design: a linearly parameterized family of functions $Q^\theta(x,u) = \theta^{\mathsf T}\psi(x,u)$; $\zeta_n \equiv \psi(X_n,U_n)$; $\psi_i(x,u) = \mathbb{1}\{x = x^i, u = u^i\}$ (complete basis ≡ tabular Q-learning). Convergence of $Q^{\theta_n}$ to $Q^*$ holds under mild conditions. 21 / 52
  • 102. Q Learning. Watkins' Q-Learning: Watkins' Q-learning. Watkins' algorithm is a special case of Q(0)-learning, with $Q^\theta(x,u) = \theta^{\mathsf T}\psi(x,u)$, $\zeta_n \equiv \psi(X_n,U_n)$, and $\psi_i(x,u) = \mathbb{1}\{x = x^i, u = u^i\}$ (complete basis ≡ tabular Q-learning). Convergence of $Q^{\theta_n}$ to $Q^*$ holds under mild conditions. Asymptotic covariance is infinite for $\gamma \ge 1/2$ [31, 62]: $\frac{1}{\alpha_n}\mathrm{Cov}(\tilde\theta_n) \to \infty$ as $n\to\infty$, using the standard step-size rule $\alpha_n = 1/n(x,u)$. 21 / 52
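For reference, the "standard step-size rule $\alpha_n = 1/n(x,u)$" divides by the number of visits to the current state-action pair. A tabular sketch, under the assumption of finite state and action sets indexed by integers; the environment step function and exploration rule are placeholders.

```python
import numpy as np

def watkins_tabular(x0, step, n_states, n_actions, gamma, n_iter, explore):
    """Tabular Watkins Q-learning with the visit-count step size alpha_n = 1/n(x,u) (sketch)."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    x = x0
    for _ in range(n_iter):
        u = explore(x, Q)
        x_next, c = step(x, u)
        visits[x, u] += 1
        td = c + gamma * Q[x_next].min() - Q[x, u]
        Q[x, u] += td / visits[x, u]     # the step-size rule whose asymptotic covariance blows up for gamma >= 1/2
        x = x_next
    return Q
```

The slides' remedies are either a larger gain, e.g. $\alpha_n = g/n$ with $g$ of order $1/(1-\gamma)$, or the relative and Zap variants discussed next.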
  • 103. Q Learning. Watkins' Q-Learning: Asymptotic Covariance of Watkins' Q-Learning. This is what infinite variance looks like: $\sigma^2 = \lim_{n\to\infty} n\,\mathsf{E}[\|\theta_n - \theta^*\|^2] = \infty$. Wild oscillations? [Figure: six-state network model.] 22 / 52
  • 104. Q Learning. Watkins' Q-Learning: Asymptotic Covariance of Watkins' Q-Learning. This is what infinite variance looks like: $\sigma^2 = \lim_{n\to\infty} n\,\mathsf{E}[\|\theta_n - \theta^*\|^2] = \infty$. Wild oscillations? Not at all, the sample paths appear frozen. [Figure: histogram of the parameter estimate $\theta_n(15)$ after $n = 10^6$ iterations, with $\theta^*(15) \approx 486.6$ marked.] Example from [31, 62] 2017. 22 / 52
  • 105. Q Learning. Watkins' Q-Learning: Asymptotic Covariance of Watkins' Q-Learning. Wild oscillations? Not at all, the sample paths appear frozen. Sample paths using a higher gain, or relative Q-learning. [Figure 1: Comparison of Q-learning and Relative Q-learning algorithms for the stochastic shortest path problem of [4]; the relative Q-learning algorithm is unaffected by large discounting.] Example from [31, 62] 2017, and [64], CSRL, 2021. 22 / 52
  • 106. Q Learning. Watkins' Q-Learning: Asymptotic Covariance of Watkins' Q-Learning. Can we do better? [Figure: six-state network model, and the Q-learning vs. Relative Q-learning comparison of Figure 1.] 23 / 52
  • 107. Q Learning. Watkins' Q-Learning: Asymptotic Covariance of Watkins' Q-Learning. Can we do better? Our intelligent mouse might offer clues. [Figure: as on slide 106.] 23 / 52
  • 109. Zap Q. Motivation. The ODE method begins with design of the ODE: $\frac{d}{dt}\vartheta = \bar f(\vartheta)$. Challenges we have faced with Q-learning: 24 / 52
  • 110. Zap Q. Motivation. The ODE method begins with design of the ODE: $\frac{d}{dt}\vartheta = \bar f(\vartheta)$. Challenges we have faced with Q-learning: how can we design dynamics for (1) stability (few results outside of Watkins' tabular setting), and (2) so that $\bar f(\theta^*) = 0$ solves a relevant problem, or has a solution? 24 / 52
  • 111. Zap Q. Motivation. Challenges we have faced with Q-learning: (1) stability; (2) $\bar f(\theta^*) = 0$ solves a relevant problem. How can we better manage problems introduced by $1/(1-\gamma)$? Relative Q-learning is one approach. 24 / 52
  • 112. Zap Q. Motivation. Challenges we have faced with Q-learning: (1) stability; (2) $\bar f(\theta^*) = 0$ solves a relevant problem. How can we better manage problems introduced by $1/(1-\gamma)$? Relative Q-learning is one approach. Assuming we have solved (2), approximate the Newton-Raphson flow: $\frac{d}{dt}\bar f(\vartheta_t) = -\bar f(\vartheta_t)$, giving $\bar f(\vartheta_t) = \bar f(\vartheta_0)e^{-t}$. 24 / 52
  • 113. Zap Q. Zap Algorithm, designed to emulate the Newton-Raphson flow: $\frac{d}{dt}\vartheta_t = -[A(\vartheta_t)]^{-1}\bar f(\vartheta_t)$, $A(\theta) = \frac{\partial}{\partial\theta}\bar f(\theta)$. Zap-SA: $\theta_{n+1} = \theta_n + \alpha_{n+1} G_{n+1} f(\theta_n,\Phi_{n+1})$, $G_{n+1} = [-\hat A_{n+1}]^{-1}$, $\hat A_{n+1} = \hat A_n + \beta_{n+1}(A_{n+1} - \hat A_n)$, $A_{n+1} = \partial_\theta f(\theta_n,\Phi_{n+1})$. 25 / 52
  • 114. Zap Q. Zap Algorithm (continued). $\hat A_{n+1} \approx A(\theta_n)$ requires high gain: $\beta_n/\alpha_n \to \infty$ as $n \to \infty$. 25 / 52
  • 115. Zap Q. Zap Algorithm (continued). $\hat A_{n+1} \approx A(\theta_n)$ requires high gain, $\beta_n/\alpha_n \to \infty$. Numerics that follow: $\alpha_n = 1/n$, $\beta_n = (1/n)^\rho$, $\rho\in(0.5,1)$. Zap Q-Learning: $f(\theta_n,\Phi_{n+1}) = \big(c(X_n,U_n) + \gamma \underline{Q}^\theta(X_{n+1}) - Q^\theta(X_n,U_n)\big)\zeta_n\big|_{\theta=\theta_n}$, $\zeta_n = \nabla_\theta Q^\theta(X_n,U_n)\big|_{\theta=\theta_n}$, $A_{n+1} = \zeta_n\big(\gamma\psi(X_{n+1},\phi^\theta(X_{n+1})) - \psi(X_n,U_n)\big)^{\mathsf T}$, $\phi^\theta(x) = \arg\min_u Q^\theta(x,u)$. 25 / 52
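A sketch (not from the slides) of Zap Q-learning for the linear architecture above, with $\alpha_n = 1/n$ and $\beta_n = n^{-\rho}$. The small ridge term in the matrix solve is an implementation choice added here for numerical robustness, and the environment, feature map, action set, and exploration rule are placeholders.

```python
import numpy as np

def zap_q(x0, step, psi, actions, d, gamma, n_iter, explore, rho=0.85, eps=1e-6):
    """Zap Q-learning with Q_theta(x, u) = theta^T psi(x, u) (sketch).
    Step sizes: alpha_n = 1/n, beta_n = n^{-rho}, rho in (0.5, 1)."""
    theta = np.zeros(d)
    A_hat = np.zeros((d, d))
    x = x0
    for n in range(1, n_iter + 1):
        u = explore(x, theta)
        x_next, c = step(x, u)
        zeta = psi(x, u)
        u_greedy = min(actions, key=lambda a: theta @ psi(x_next, a))   # phi^theta(X_{n+1})
        A_n = np.outer(zeta, gamma * psi(x_next, u_greedy) - zeta)      # A_{n+1} in the slide
        A_hat += n ** (-rho) * (A_n - A_hat)                            # beta_n = n^{-rho}
        f = (c + gamma * theta @ psi(x_next, u_greedy) - theta @ zeta) * zeta
        # G_{n+1} = [-A_hat]^{-1}; the eps*I ridge keeps the solve well posed (implementation choice)
        theta += (1.0 / n) * np.linalg.solve(-A_hat + eps * np.eye(d), f)   # alpha_n = 1/n
        x = x_next
    return theta
```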
  • 118. Zap Q. Challenges. Q-learning: $\{Q^\theta(x,u) : \theta\in\mathbb{R}^d,\ u\in\mathsf{U},\ x\in\mathsf{X}\}$. Find $\theta^*$ such that $\bar f(\theta^*) = 0$, with $\bar f(\theta) = \mathsf{E}\big[\big(c(X_n,U_n) + \gamma \underline{Q}^\theta(X_{n+1}) - Q^\theta(X_n,U_n)\big)\zeta_n\big]$. What makes theory difficult: (1) does $\bar f$ have a root? (2) does the inverse of $A$ exist? (3) SA theory is weak for a discontinuous ODE ⇐ the biggest technical challenge. 26 / 52
  • 119. Zap Q. Challenges. What makes theory difficult: (1) does $\bar f$ have a root? (2) does the inverse of $A$ exist? (3) SA theory is weak for a discontinuous ODE ⇐ the biggest technical challenge. ⇒ Resolved by exploiting special structure, even for NN function approximation [32, 9]. 26 / 52
  • 120. Zap Q. Zap Along Walk to the Cafe: Zap Q-Learning, Optimize Walk to Cafe. Convergence with Zap gain $\beta_n = n^{-0.85}$. [Figure: Bellman error vs. $n$, convergence of Zap-Q learning compared with Watkins, Speedy Q-learning, and Polyak-Ruppert averaging; discount factor $\gamma = 0.99$.] 27 / 52
  • 121. Zap Q. Zap Along Walk to the Cafe: Zap Q-Learning, Optimize Walk to Cafe. Convergence with Zap gain $\beta_n = n^{-0.85}$. Infinite covariance with $\alpha_n = 1/n$ or $1/n(x,u)$. [Figure: as on slide 120, also showing Zap with $\beta_n = \alpha_n$.] 27 / 52
  • 122. Zap Q. Zap Along Walk to the Cafe: Zap Q-Learning, Optimize Walk to Cafe. Convergence with Zap gain $\beta_n = n^{-0.85}$. Infinite covariance with $\alpha_n = 1/n$ or $1/n(x,u)$. [Figure: as on slide 120, also showing Watkins with gain $g = 1500$, $\alpha_n = g/n$, and Zap with $\beta_n = \alpha_n$.] 27 / 52
  • 123. Zap Q. Zap Along Walk to the Cafe: Zap Q-Learning, Optimize Walk to Cafe. Convergence with Zap gain $\beta_n = n^{-0.85}$. [Figure: theoretical vs. empirical pdfs of $W_n = \sqrt{n}\,\tilde\theta_n$ (1000 trials), entries #10 and #18, at $n = 10^4$ and $n = 10^6$.] The CLT gives a good prediction of finite-$n$ performance. Discount factor: $\gamma = 0.99$. 28 / 52
  • 124. Zap Q. Zap Along Walk to the Cafe: Zap Q-Learning, Optimize Walk to Cafe. The CLT gives a good prediction of finite-$n$ performance. Do not forget to use this in practice! Use multiple runs to estimate the CLT variance; this requires a wide range of initial conditions, but a smaller run length as you fine-tune your algorithm. Discount factor: $\gamma = 0.99$. 28 / 52
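A sketch of the recipe in the slide: run the algorithm from $M$ widely dispersed initial conditions, form $W_N = \sqrt{N}\,\tilde\theta_N$, and use the empirical covariance of the outcomes as an approximate confidence bound. Here `run_algorithm` is a placeholder for whichever recursion is being tuned, and $\theta^*$ is approximated by the average of the runs.

```python
import numpy as np

def empirical_clt_covariance(run_algorithm, initial_conditions, n_iter):
    """Estimate the CLT covariance from M independent runs (sketch).
    run_algorithm(theta0, n_iter) -> final parameter estimate theta_N."""
    finals = np.array([run_algorithm(theta0, n_iter) for theta0 in initial_conditions])
    theta_star_hat = finals.mean(axis=0)          # surrogate for the unknown theta*
    W = np.sqrt(n_iter) * (finals - theta_star_hat)
    return np.cov(W, rowvar=False)                # approximate asymptotic covariance
```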
  • 125. Zap Q. Zap Q-Learning with Neural Networks: Zap with Neural Networks. $0 = \bar f(\theta^*) = \mathsf{E}\big[\big(c(X_n,U_n) + \gamma \underline{Q}^{\theta^*}(X_{n+1}) - Q^{\theta^*}(X_n,U_n)\big)\zeta_n\big]$, with $\zeta_n = \nabla_\theta Q^\theta(X_n,U_n)\big|_{\theta=\theta_n}$ computed using back-propagation. A few things to note: as far as we know, the empirical success of plain vanilla DQN is extraordinary (however, nobody reports failure). 29 / 52
  • 128. Zap Q. Zap Q-Learning with Neural Networks: Zap with Neural Networks. $0 = \bar f(\theta^*) = \mathsf{E}\big[\big(c(X_n,U_n) + \gamma \underline{Q}^{\theta^*}(X_{n+1}) - Q^{\theta^*}(X_n,U_n)\big)\zeta_n\big]$, with $\zeta_n = \nabla_\theta Q^\theta(X_n,U_n)\big|_{\theta=\theta_n}$ computed using back-propagation. A few things to note: as far as we know, the empirical success of plain vanilla DQN is extraordinary (however, nobody reports failure). Zap Q-learning is the only approach for which convergence has been established (under mild conditions). We can expect stunning transient performance, based on coupling with the Newton-Raphson flow. 29 / 52
  • 131. Zap Q. Zap Q-Learning with Neural Networks: Zap with Neural Networks. Stunning reliability with $Q^\theta$ parameterized by various neural networks: reliability and stunning transient performance, from coupling with the Newton-Raphson flow. 29 / 52
  • 132. Conclusions Future Directions Conclusions Reinforcement Learning is cursed by dimension, variance, and nonlinear (algorithm) dynamics We have a big basket of tools for reinforcement learning. Many algorithms fail due to bad design, or unrealistic goals 30 / 52
  • 133. Conclusions Future Directions Conclusions Reinforcement Learning is cursed by dimension, variance, and nonlinear (algorithm) dynamics We have a big basket of tools for reinforcement learning. Many algorithms fail due to bad design, or unrealistic goals Among the tools we have for taming algorithmic volatility: Zap: a second order method that optimizes variance and transient behavior 30 / 52
  • 134. Conclusions Future Directions Conclusions Reinforcement Learning is cursed by dimension, variance, and nonlinear (algorithm) dynamics We have a big basket of tools for reinforcement learning. Many algorithms fail due to bad design, or unrealistic goals Among the tools we have for taming algorithmic volatility: Zap: a second order method that optimizes variance and transient behavior State weighting: more justifiable and flexible than the “λ hack” 30 / 52
  • 135. Conclusions Future Directions Conclusions Reinforcement Learning is cursed by dimension, variance, and nonlinear (algorithm) dynamics We have a big basket of tools for reinforcement learning. Many algorithms fail due to bad design, or unrealistic goals Among the tools we have for taming algorithmic volatility: Zap: a second order method that optimizes variance and transient behavior State weighting: more justifiable and flexible than the “λ hack” Relative TD/Q-learning: Hφ(x, u) = Qφ(x, u) − Qφ(x•, u•) is sufficient for policy update 30 / 52
  • 136. Conclusions Future Directions Conclusions Reinforcement Learning is cursed by dimension, variance, and nonlinear (algorithm) dynamics We have a big basket of tools for reinforcement learning. Many algorithms fail due to bad design, or unrealistic goals Among the tools we have for taming algorithmic volatility: Zap: a second order method that optimizes variance and transient behavior State weighting: more justifiable and flexible than the “λ hack” Relative TD/Q-learning: Hφ(x, u) = Qφ(x, u) − Qφ(x•, u•) is sufficient for policy update Clever exploration: Bandits literature has vast resources [13] 30 / 52
  • 137. Conclusions Future Directions Outlook Beyond the projected Bellman error for Q-learning [57, 58, 59, 60] Beyond the shackles of DP equations? 31 / 52
  • 138. Conclusions. Future Directions: Outlook. Beyond the projected Bellman error for Q-learning [57, 58, 59, 60]. Beyond the shackles of DP equations? Acceleration techniques (momentum and matrix momentum); see also Zap-Zero in CSRL. [Figure: span norm $\|Q^{\theta_n} - Q^*\|_{\mathrm{sp}}$ vs. $n$ for Watkins with $g = 1/(1-\gamma)$, PR averaging, Zap, and Zap-Zero.] 31 / 52
  • 139. Conclusions. Future Directions: Outlook. Beyond the projected Bellman error for Q-learning [57, 58, 59, 60]. Beyond the shackles of DP equations? Acceleration techniques (momentum and matrix momentum); see also Zap-Zero in CSRL. [Figure: as on slide 138.] Further variance reduction using control variates. 31 / 52
  • 140. Conclusions Future Directions Outlook Beyond the projected Bellman error for Q-learning [57, 58, 59, 60] Beyond the shackles of DP equations? Acceleration techniques (momentum and matrix momentum) See also Zap-Zero in CSRL Further variance reduction using control variates Thank you for your endurance! I welcome questions and comments now and in the future 31 / 52
  • 142. References. [Cover images: Control Techniques for Complex Networks, Sean Meyn (Cambridge University Press), and Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie.] 32 / 52
  • 143. References Control Background I [1] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, USA, 2008 (recent edition on-line). [2] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1994. [3] A. Fradkov and B. T. Polyak. Adaptive and robust control in the USSR. IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21th IFAC World Congress. [4] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos. Nonlinear and adaptive control design. John Wiley Sons, Inc., 1995. [5] K. J. Åström. Theory and applications of adaptive control—a survey. Automatica, 19(5):471–486, 1983. [6] K. J. Åström. Adaptive control around 1960. IEEE Control Systems Magazine, 16(3):44–49, 1996. [7] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 1977. 33 / 52
  • 144. References Control Background II [8] N. Matni, A. Proutiere, A. Rantzer, and S. Tu. From self-tuning regulators to reinforcement learning and back again. In Proc. of the IEEE Conf. on Dec. and Control, pages 3724–3740, 2019. 34 / 52
  • 145. References RL Background I [9] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press, Cambridge, 2021. [10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press. On-line edition at http://www.cs.ualberta.ca/~sutton/book/the-book.html, Cambridge, MA, 2nd edition, 2018. [11] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [12] D. P. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, Belmont, MA, 2019. [13] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020. [14] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988. [15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. [16] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. 35 / 52
  • 146. References RL Background II [17] T. Jaakola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994. [18] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1998. [19] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Mach. Learn., 22(1-3):59–94, 1996. [20] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997. [21] J. N. Tsitsiklis and B. V. Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999. [23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006. 36 / 52
  • 147. References RL Background III [24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996. [25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002. [26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003. [27] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, 1064–1070. MIT Press, 1997. [28] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003. [29] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [30] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011. 37 / 52
  • 148. References RL Background IV [31] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017. [32] S. Chen, A. M. Devraj, F. Lu, A. Busic, and S. Meyn. Zap Q-Learning with nonlinear function approximation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, and arXiv e-prints 1910.05405, volume 33, pages 16879–16890, 2020. [33] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See last chapter on simulation and average-cost TD learning DQN: [34] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. [35] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer, 2012. [36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013. 38 / 52
  • 149. References RL Background V [37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015. [38] O. Anschel, N. Baram, and N. Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proc. of ICML, pages 176–185. JMLR.org, 2017. Actor Critic / Policy Gradient [39] P. J. Schweitzer. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403, 1968. [40] C. D. Meyer, Jr. The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review, 17(3):443–464, 1975. [41] P. W. Glynn. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 18th conference on Winter simulation, pages 356–365, 1986. [42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992. 39 / 52
  • 150. References RL Background VI [43] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in neural information processing systems, pages 345–352, 1995. [44] X.-R. Cao and H.-F. Chen. Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Transactions on Automatic Control, 42(10):1382–1393, Oct 1997. [45] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001. [46] V. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology, 2002. [47] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000. [48] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000. [49] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001. 40 / 52
  • 151. References RL Background VII [50] S. M. Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002. [51] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. In Advances in Neural Information Processing Systems, pages 1800–1809, 2018. MDPs, LPs and Convex Q: [52] A. S. Manne. Linear programming and sequential decisions. Management Sci., 6(3):259–267, 1960. [53] C. Derman. Finite State Markovian Decision Processes, volume 67 of Mathematics in Science and Engineering. Academic Press, Inc., 1970. [54] V. S. Borkar. Convex analytic methods in Markov decision processes. In Handbook of Markov decision processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002. [55] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Res., 51(6):850–865, 2003. 41 / 52
  • 152. References RL Background VIII [56] D. P. de Farias and B. Van Roy. A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res., 31(3):597–620, 2006. [57] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In Proc. of the IEEE Conf. on Dec. and Control, pages 3598–3605, Dec. 2009. [58] J. Bas Serrano, S. Curi, A. Krause, and G. Neu. Logistic Q-learning. In A. Banerjee and K. Fukumizu, editors, Proc. of The Intl. Conference on Artificial Intelligence and Statistics, volume 130, pages 3610–3618, 13–15 Apr 2021. [59] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex Q-learning. In American Control Conf., pages 4749–4756. IEEE, 2021. [60] F. Lu, P. G. Mehta, S. P. Meyn, and G. Neu. Convex analytic theory for convex Q-learning. In Conference on Decision and Control–to appear. IEEE, 2022. Gator Nation: [61] A. M. Devraj, A. Bušić, and S. Meyn. Fundamental design principles for reinforcement learning algorithms. In K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision and Control series (SSDC, volume 325). Springer, 2021. 42 / 52
  • 153. References RL Background IX [62] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv , July 2017 (extended version of NIPS 2017). [63] A. M. Devraj. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis, University of Florida, 2019. [64] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. IEEE Trans Auto Control (and arXiv:2002.10301), 2021. [65] A. M. Devraj, A. Bušić, and S. Meyn. On matrix momentum stochastic approximation and applications to Q-learning. In Allerton Conference on Communication, Control, and Computing, pages 749–756, Sep 2019. 43 / 52
  • 154. References Stochastic Miscellanea I [66] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57 of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2007. [67] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996. [68] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. [69] R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018. 44 / 52
  • 155. References Stochastic Approximation I [70] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press, Delhi, India Cambridge, UK, 2008. [71] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson. [72] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. [73] V. Borkar, S. Chen, A. Devraj, I. Kontoyiannis, and S. Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. arXiv e-prints:2110.14427, pages 1–50, 2021. [74] M. Benaı̈m. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités, XXXIII, pages 1–68. Springer, Berlin, 1999. [75] V. Borkar and S. P. Meyn. Oja’s algorithm for graph clustering, Markov spectral decomposition, and risk sensitive control. Automatica, 48(10):2512–2519, 2012. [76] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466, 09 1952. 45 / 52
  • 156. References Stochastic Approximation II [77] J. C. Spall. Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley Sons, 2003. [78] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985. [79] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. [80] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika, 98–107, 1990 (in Russian). Translated in Automat. Remote Control, 51 1991. [81] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992. [82] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004. [83] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, 451–459. Curran Associates, Inc., 2011. 46 / 52
  • 157. References Stochastic Approximation III [84] W. Mou, C. Junchi Li, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration. arXiv e-prints, page arXiv:2004.04719, Apr. 2020. 47 / 52
  • 158. References Optimization and ODEs I [85] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in neural information processing systems, pages 2510–2518, 2014. [86] B. Shi, S. S. Du, W. Su, and M. I. Jordan. Acceleration via symplectic discretization of high-resolution differential equations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5744–5752. Curran Associates, Inc., 2019. [87] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. [88] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2). In Soviet Mathematics Doklady, 1983. 48 / 52
  • 159. References QSA and Extremum Seeking Control I [89] C. K. Lauand and S. Meyn. Extremely fast convergence rates for extremum seeking control with Polyak-Ruppert averaging. arXiv 2206.00814, 2022. [90] C. K. Lauand and S. Meyn. Markovian foundations for quasi stochastic approximation with applications to extremum seeking control, 2022. [91] B. Lapeybe, G. Pages, and K. Sab. Sequences with low discrepancy generalisation and application to Robbins-Monro algorithm. Statistics, 21(2):251–272, 1990. [92] S. Laruelle and G. Pagès. Stochastic approximation with averaging innovation applied to finance. Monte Carlo Methods and Applications, 18(1):1–51, 2012. [93] S. Shirodkar and S. Meyn. Quasi stochastic approximation. In Proc. of the 2011 American Control Conference (ACC), pages 2429–2435, July 2011. [94] S. Chen, A. Devraj, A. Bernstein, and S. Meyn. Revisiting the ODE method for recursive algorithms: Fast using quasi stochastic approximation. Journal of Systems Science and Complexity, 34(5):1681–1702, 2021. [95] Y. Chen, A. Bernstein, A. Devraj, and S. Meyn. Model-Free Primal-Dual Methods for Network Optimization with Application to Real-Time Optimal Power Flow. In Proc. of the American Control Conf., pages 3140–3147, Sept. 2019. 49 / 52
  • 160. References QSA and Extremum Seeking Control II [96] S. Bhatnagar and V. S. Borkar. Multiscale chaotic spsa and smoothed functional algorithms for simulation optimization. Simulation, 79(10):568–580, 2003. [97] S. Bhatnagar, M. C. Fu, S. I. Marcus, and I.-J. Wang. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(2):180–209, 2003. [98] M. Le Blanc. Sur l’electrification des chemins de fer au moyen de courants alternatifs de frequence elevee [On the electrification of railways by means of alternating currents of high frequency]. Revue Generale de l’Electricite, 12(8):275–277, 1922. [99] Y. Tan, W. H. Moase, C. Manzie, D. Nešić, and I. M. Y. Mareels. Extremum seeking from 1922 to 2010. In Proceedings of the 29th Chinese Control Conference, pages 14–26, July 2010. [100] P. F. Blackman. Extremum-seeking regulators. In An Exposition of Adaptive Control. Macmillan, 1962. [101] J. Sternby. Adaptive control of extremum systems. In H. Unbehauen, editor, Methods and Applications in Adaptive Control, pages 151–160, Berlin, Heidelberg, 1980. Springer Berlin Heidelberg. 50 / 52
  • 161. References QSA and Extremum Seeking Control III [102] J. Sternby. Extremum control systems–an area for adaptive control? In Joint Automatic Control Conference, number 17, page 8, 1980. [103] K. B. Ariyur and M. Krstić. Real Time Optimization by Extremum Seeking Control. John Wiley Sons, Inc., New York, NY, USA, 2003. [104] M. Krstić and H.-H. Wang. Stability of extremum seeking feedback for general nonlinear dynamic systems. Automatica, 36(4):595 – 601, 2000. [105] S. Liu and M. Krstic. Introduction to extremum seeking. In Stochastic Averaging and Stochastic Extremum Seeking, Communications and Control Engineering. Springer, London, 2012. [106] O. Trollberg and E. W. Jacobsen. On the convergence rate of extremum seeking control. In European Control Conference (ECC), pages 2115–2120. 2014. [107] Y. Bugeaud. Linear forms in logarithms and applications, volume 28 of IRMA Lectures in Mathematics and Theoretical Physics. EMS Press, 2018. [108] G. Wüstholz, editor. A Panorama of Number Theory—Or—The View from Baker’s Garden. Cambridge University Press, 2002. 51 / 52
  • 162. References Selected Applications I [109] N. S. Raman, A. M. Devraj, P. Barooah, and S. P. Meyn. Reinforcement learning for control of building HVAC systems. In American Control Conference, July 2020. [110] K. Mason and S. Grijalva. A review of reinforcement learning for autonomous building energy management. arXiv.org, 2019. arXiv:1903.05196. 52 / 52