Seminar at the IEEE Computational Intelligence Society, Singapore Chapter, School of Electrical and Electronic Engineering, NTU, Singapore, 20 February 2019
Unbiased Hamiltonian Monte Carlo
Jeremy Heng
Information Systems, Decision Sciences and Statistics (IDS) Department, ESSEC
Joint work with Pierre Jacob
Department of Statistics, Harvard University
Nanyang Technological University
20 February 2019
Outline
1 MCMC, burn-in bias and parallel computing
2 Couplings of MCMC algorithms
Setting
• Target distribution
$\pi(dx) = \pi(x)\,dx, \quad x \in \mathbb{R}^d$
• For Bayesian inference, the target is the posterior distribution of the parameters $x$ given data $y$:
$\pi(x) = p(x \mid y) \propto \underbrace{p(x)}_{\text{prior}}\,\underbrace{p(y \mid x)}_{\text{likelihood}}$
• Objective: compute the expectation
$\mathbb{E}_\pi[h(X)] = \int_{\mathbb{R}^d} h(x)\,\pi(x)\,dx$
for some test function $h : \mathbb{R}^d \to \mathbb{R}$
• Monte Carlo method: sample $X_0, \ldots, X_T \sim \pi$ and compute
$\frac{1}{T+1}\sum_{t=0}^{T} h(X_t) \to \mathbb{E}_\pi[h(X)] \quad \text{as } T \to \infty$
Markov chain Monte Carlo (MCMC)
• An MCMC algorithm defines a $\pi$-invariant Markov kernel $K$
• Initialize $X_0 \sim \pi_0$ (with $\pi_0 \neq \pi$ in general) and iterate
$X_t \sim K(X_{t-1}, \cdot) \quad \text{for } t = 1, \ldots, T$
• Compute
$\frac{1}{T-b+1}\sum_{t=b}^{T} h(X_t) \to \mathbb{E}_\pi[h(X)] \quad \text{as } T \to \infty,$
where $b \geq 0$ iterations are discarded as burn-in
MCMC trajectory
[Figure: one RWMH trajectory. $\pi = N(0, 1)$, $\pi_0 = N(10, 3^2)$, $K$ = RWMH with proposal std 0.5]
MCMC trajectories
[Figure: several RWMH trajectories, same setting]
MCMC marginal distributions
[Figure: marginal distributions of the chain across iterations, same setting]
Burn-in bias and parallel computing
• Since $\pi_0 \neq \pi$, the bias
$\mathbb{E}\left[\frac{1}{T-b+1}\sum_{t=b}^{T} h(X_t)\right] - \mathbb{E}_\pi[h(X)] \neq 0$
for any fixed $b, T$
• The bias converges to zero only if $b$ is fixed and $T \to \infty$
• Naive parallelization: generate $R$ chains $(X_t^{(r)})_{r=1}^{R}$ and compute
$\frac{1}{R}\sum_{r=1}^{R} \frac{1}{T-b+1}\sum_{t=b}^{T} h(X_t^{(r)})$
• This estimator is not consistent as $R \to \infty$ for fixed $b, T$
• But it is consistent as $T \to \infty$ for fixed $b, R$
Proposed methodology
• Each processor runs two coupled chains $X = (X_t)$ and $Y = (Y_t)$
• Each pair terminates at a random time that depends on their meeting time
• Each pair returns an unbiased estimator $H_{k:m}$ of $\mathbb{E}_\pi[h(X)]$
• Average over $R$ processors: $\frac{1}{R}\sum_{r=1}^{R} H_{k:m}^{(r)} \to \mathbb{E}_\pi[h(X)]$ as $R \to \infty$
• Efficiency depends on the expected compute cost and the variance of $H_{k:m}$
[Figure: parallel processors, each running a pair of coupled MCMC chains]
Debiasing idea (Glynn and Rhee 2014)
• Ergodicity of the Markov chain implies
$\lim_{t \to \infty} \mathbb{E}[h(X_t)] = \mathbb{E}_\pi[h(X)]$
• Writing the limit as a telescoping sum (starting from $k \geq 0$)
$\lim_{t \to \infty} \mathbb{E}[h(X_t)] = \mathbb{E}[h(X_k)] + \sum_{t=k+1}^{\infty} \mathbb{E}[h(X_t) - h(X_{t-1})]$
• If interchanging summation and expectation is valid
$\mathbb{E}\left[h(X_k) + \sum_{t=k+1}^{\infty} \{h(X_t) - h(X_{t-1})\}\right] = \mathbb{E}_\pi[h(X)]$
• If we construct another Markov chain $(Y_t)$ such that
$X_t \stackrel{d}{=} Y_t \quad \text{and} \quad X_t = Y_{t-1} \text{ for } t \geq \tau,$
then, since each $X_{t-1}$ can be replaced by $Y_{t-1}$ without changing the expectations and the terms with $t \geq \tau$ vanish,
$\mathbb{E}\left[h(X_k) + \sum_{t=k+1}^{\tau-1} \{h(X_t) - h(Y_{t-1})\}\right] = \mathbb{E}_\pi[h(X)]$
Unbiased estimators
$H_k(X, Y) = h(X_k) + \sum_{t=k+1}^{\tau-1} \{h(X_t) - h(Y_{t-1})\} \quad \text{for any } k \geq 0,$
with the convention $\sum_{t=k+1}^{\tau-1}\{\cdot\} = 0$ if $\tau - 1 < k + 1$, is an unbiased estimator of $\mathbb{E}_\pi[h(X)]$, with finite variance and finite expected cost
(Glynn and Rhee 2014, Vihola 2017, Jacob et al. 2017), provided:
1 Convergence of the marginal chain:
$\lim_{t \to \infty} \mathbb{E}[h(X_t)] = \mathbb{E}_\pi[h(X)]$ and $\sup_{t \geq 0} \mathbb{E}|h(X_t)|^{2+\delta} < \infty$ for some $\delta > 0$
2 The meeting time $\tau = \inf\{t \geq 1 : X_t = Y_{t-1}\}$ has geometric tails:
$\mathbb{P}(\tau > t) \leq C \rho^t$ for some $C < \infty$, $\rho \in (0, 1)$
3 Faithfulness: $X_t = Y_{t-1}$ for $t \geq \tau$
Unbiased estimators
• For any tuning parameter $k \in \mathbb{N}$,
$H_k(X, Y) = h(X_k) + \sum_{t=k+1}^{\tau-1} \{h(X_t) - h(Y_{t-1})\}$
is unbiased
• The first term $h(X_k)$ is biased; the second term corrects for the bias (and is zero if $k \geq \tau - 1$)
• As $k \to \infty$, $H_k(X, Y) = h(X_k)$ with increasing probability, so $\mathbb{V}[H_k(X, Y)] \to \mathbb{V}_\pi[h(X)]$
• The cost of computing $H_k(X, Y)$ is roughly $2(\tau - 1) + \max(1, k + 1 - \tau)$ applications of $K$
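A minimal Python sketch of $H_k$, assuming the coupled trajectories and the meeting time have already been stored; the function name and array layout are illustrative conventions, not taken from the accompanying R package.

```python
import numpy as np

def H_k(h, X, Y, tau, k):
    """Unbiased estimator H_k(X, Y) = h(X_k) + sum over t in {k+1, ..., tau-1}
    of h(X_t) - h(Y_{t-1}).

    X: states X_0, ..., X_n with n >= max(k, tau - 1)
    Y: states Y_0, ..., Y_{n-1}
    tau: meeting time inf{t >= 1 : X_t = Y_{t-1}}
    """
    est = h(X[k])
    # Bias-correction term; the sum is empty when tau - 1 < k + 1
    for t in range(k + 1, tau):
        est += h(X[t]) - h(Y[t - 1])
    return est
```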
Time-averaged estimators
• Since $H_k(X, Y)$ is unbiased for all $k \geq 0$, the time-averaged estimator
$H_{k:m}(X, Y) = \frac{1}{m-k+1}\sum_{t=k}^{m} H_t(X, Y) \quad \text{for any } k \leq m$
is also unbiased
• Rewrite the estimator as
$\frac{1}{m-k+1}\sum_{t=k}^{m} h(X_t) + \sum_{t=k+1}^{\tau-1} \min\left(1, \frac{t-k}{m-k+1}\right) \{h(X_t) - h(Y_{t-1})\}$
• The first term is a standard MCMC average; the second term is a bias correction (zero if $k \geq \tau - 1$)
[Figure: coupled trajectories $X_t$ and $Y_{t-1}$ and their differences $\Delta_t$ on the state space, plotted against iteration, illustrating $k = 5$, meeting time $\tau = 10$, and $m = 20$]
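The rewritten form translates directly into code; a sketch under the same illustrative conventions as the $H_k$ snippet above.

```python
import numpy as np

def H_km(h, X, Y, tau, k, m):
    """Time-averaged estimator H_{k:m}: MCMC average plus bias correction."""
    mcmc_average = np.mean([h(X[t]) for t in range(k, m + 1)])
    # Correction term; empty when k >= tau - 1
    correction = sum(
        min(1.0, (t - k) / (m - k + 1)) * (h(X[t]) - h(Y[t - 1]))
        for t in range(k + 1, tau)
    )
    return mcmc_average + correction
```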
41. Proposed methodology
• Each processor runs two
coupled chains X = (Xt) and
Y = (Yt)
• Terminates at some random
time which involves their
meeting time
• Returns unbiased estimator
Hk:m of Eπ [h(X)]
• Average over R processors:
1
R
R
r=1 H
(r)
k:m → Eπ [h(X)] as
R → ∞
• Efficiency depends on
expected compute cost and
variance of Hk:m
Parallel MCMC
processors
#
1
Jeremy Heng Unbiased HMC 16/ 48
Efficiency
• Following Glynn and Whitt (1992), the asymptotic inefficiency of $H_{k:m}(X, Y)$ as the compute budget $\to \infty$ is
$\underbrace{\mathbb{E}[2(\tau - 1) + \max(1, m + 1 - \tau)]}_{\text{expected cost}} \times \underbrace{\mathbb{V}[H_{k:m}(X, Y)]}_{\text{variance}}$
• Bias removal leads to variance inflation
• The variance inflation can be mitigated by increasing $k$ and $m$
• As $k \to \infty$, $H_{k:m}(X, Y)$ is a standard MCMC average with increasing probability, so its variance should be similar
• If $\tau \ll k \ll m$, the asymptotic inefficiency is approximately
$m \times \frac{\sigma^2(h)}{m-k+1} \approx \sigma^2(h),$
the asymptotic variance of the marginal chain
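In practice, the Glynn–Whitt product can be estimated from $R$ independent replicates of $(\tau, H_{k:m})$; a sketch with hypothetical input arrays.

```python
import numpy as np

def inefficiency(taus, estimates, m):
    """Empirical asymptotic inefficiency: mean cost times variance of H_{k:m}.

    taus: array of R meeting times
    estimates: array of R independent realizations of H_{k:m}
    """
    costs = 2 * (taus - 1) + np.maximum(1, m + 1 - taus)
    return np.mean(costs) * np.var(estimates)
```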
Coupled chains
• To compute $H_{k:m}(X, Y)$:
1 Initialize $(X_0, Y_0) \sim \bar{\pi}_0$ from a coupling with $\pi_0$ as marginals, i.e. $X_0 \sim \pi_0$ and $Y_0 \sim \pi_0$
2 Sample $X_1 \sim K(X_0, \cdot)$
3 For $t = 1, \ldots, \max(m, \tau)$ sample
$(X_{t+1}, Y_t) \sim \bar{K}((X_t, Y_{t-1}), \cdot)$
from a coupled kernel $\bar{K}$ that admits $K$ as marginals, i.e. $X_{t+1} \sim K(X_t, \cdot)$ and $Y_t \sim K(Y_{t-1}, \cdot)$
• Note that $X_t \stackrel{d}{=} Y_t$ for $t \geq 0$
• Need to design $\bar{K}$ so that $X_\tau = Y_{\tau-1}$ (the chains meet) and $X_t = Y_{t-1}$ for $t \geq \tau$ (they are faithful)
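A sketch of this recursion, with the initial coupling and the two kernels passed in as functions; all names are hypothetical and the loop is capped for safety.

```python
import numpy as np

def coupled_chains(pi0_couple, K, Kbar, m, max_iter=10**5):
    """Run the coupled chains needed for H_{k:m} and record the meeting time.

    pi0_couple(): returns (X_0, Y_0), a coupling with pi_0 as marginals
    K(x):         one step of the marginal kernel
    Kbar(x, y):   one step of the coupled kernel, returns (x', y')
    """
    x0, y0 = pi0_couple()
    X, Y = [x0, K(x0)], [y0]           # X runs one step ahead of Y
    tau = np.inf
    t = 1
    while t <= max(m, tau) and t < max_iter:
        x_next, y_next = Kbar(X[t], Y[t - 1])
        X.append(x_next)
        Y.append(y_next)
        if np.all(x_next == y_next) and not np.isfinite(tau):
            tau = t + 1                # X_{t+1} = Y_t, i.e. tau = t + 1
        t += 1
    return X, Y, tau
```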
Outline
1 MCMC, burn-in bias and parallel computing
2 Couplings of MCMC algorithms
Couplings
• Given distributions $p(x)$ and $q(y)$ on $\mathbb{R}^d$, a coupling $c(x, y)$ is a joint distribution on $\mathbb{R}^d \times \mathbb{R}^d$ such that
$p(x) = \int_{\mathbb{R}^d} c(x, y)\,dy \quad \text{and} \quad q(y) = \int_{\mathbb{R}^d} c(x, y)\,dx$
• $(X, Y) \sim c$ implies $X \sim p$ and $Y \sim q$
• There are infinitely many couplings of $p$ and $q$
• Independent coupling: $X \sim p$ and $Y \sim q$ independently
• Optimal coupling: minimizes $\mathbb{E}|X - Y|^2$
• Maximal coupling: maximizes $\mathbb{P}(X = Y)$
Maximal coupling: algorithm
Sampling $(X, Y)$ from the maximal coupling of $p$ and $q$:
1 Sample $X \sim p$ and $U \sim \mathcal{U}([0, 1])$. If $U \leq q(X)/p(X)$, output $(X, X)$
2 Otherwise, sample $Y \sim q$ and $U' \sim \mathcal{U}([0, 1])$ until $U' > p(Y)/q(Y)$, and output $(X, Y)$
[Figure: densities of $p$ and $q$ with their overlap highlighted]
Thorisson, Coupling, Stationarity, and Regeneration (2000)
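The two steps above transcribe directly into code; a sketch that multiplies the acceptance ratios through to avoid dividing by the densities, with an illustrative example coupling $N(0, 1)$ and $N(1, 1)$.

```python
import numpy as np
from scipy.stats import norm

def maximal_coupling(p_rvs, p_pdf, q_rvs, q_pdf, rng):
    """Sample (X, Y) from the maximal coupling of p and q."""
    X = p_rvs(rng)
    if rng.uniform() * p_pdf(X) <= q_pdf(X):       # U <= q(X)/p(X)
        return X, X                                # drawn from the overlap
    while True:
        Y = q_rvs(rng)
        if rng.uniform() * q_pdf(Y) > p_pdf(Y):    # U > p(Y)/q(Y)
            return X, Y

# Example: maximal coupling of p = N(0, 1) and q = N(1, 1)
rng = np.random.default_rng(1)
pairs = [maximal_coupling(lambda r: r.normal(0, 1), norm(0, 1).pdf,
                          lambda r: r.normal(1, 1), norm(1, 1).pdf, rng)
         for _ in range(10_000)]
# Fraction of meetings approximates 1 - TV(p, q)
print(np.mean([x == y for x, y in pairs]))
```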
Maximal coupling: remarks
• Step 1 samples from the overlap $\min\{p(x), q(x)\}$
• Maximality follows from the coupling inequality:
$\mathbb{P}(X = Y) = \int_{\mathbb{R}^d} \min\{p(x), q(x)\}\,dx = 1 - \mathrm{TV}(p, q)$
• The expected cost does not depend on $p$ and $q$
Metropolis–Hastings (kernel $K$)
At iteration $t - 1$, the Markov chain is at state $X_{t-1}$
1 Propose $X^* \sim q(X_{t-1}, \cdot)$, e.g.
for RWMH, $X^* \sim N(X_{t-1}, \sigma^2 I_d)$;
for MALA, $X^* \sim N(X_{t-1} + \frac{\sigma^2}{2} \nabla \log \pi(X_{t-1}), \sigma^2 I_d)$
2 Sample $U \sim \mathcal{U}([0, 1])$
3 If
$U \leq \min\left(1, \frac{\pi(X^*)\,q(X^*, X_{t-1})}{\pi(X_{t-1})\,q(X_{t-1}, X^*)}\right),$
set $X_t = X^*$, otherwise set $X_t = X_{t-1}$
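For concreteness, a minimal sketch of one RWMH step of $K$; since the Gaussian random-walk proposal is symmetric, the proposal ratio cancels in the acceptance probability.

```python
import numpy as np

def rwmh_step(x, log_pi, sigma, rng):
    """One random-walk Metropolis-Hastings step targeting pi."""
    proposal = x + sigma * rng.standard_normal(x.shape)
    # Symmetric proposal: accept with probability min(1, pi(x*)/pi(x))
    if np.log(rng.uniform()) <= log_pi(proposal) - log_pi(x):
        return proposal
    return x
```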
Coupled Metropolis–Hastings (kernel $\bar{K}$)
At iteration $t - 1$, the two Markov chains are at states $X_{t-1}$ and $Y_{t-1}$
1 Propose $(X^*, Y^*)$ from the maximal coupling of $q(X_{t-1}, \cdot)$ and $q(Y_{t-1}, \cdot)$
2 Sample a common $U \sim \mathcal{U}([0, 1])$
3 If
$U \leq \min\left(1, \frac{\pi(X^*)\,q(X^*, X_{t-1})}{\pi(X_{t-1})\,q(X_{t-1}, X^*)}\right),$
set $X_t = X^*$, otherwise set $X_t = X_{t-1}$
If
$U \leq \min\left(1, \frac{\pi(Y^*)\,q(Y^*, Y_{t-1})}{\pi(Y_{t-1})\,q(Y_{t-1}, Y^*)}\right),$
set $Y_t = Y^*$, otherwise set $Y_t = Y_{t-1}$
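A sketch of one coupled RWMH step combining both ingredients: maximally coupled Gaussian proposals (inlined here, in log form) and a single shared uniform for the two accept/reject decisions. The function name and signature are illustrative.

```python
import numpy as np

def coupled_rwmh_step(x, y, log_pi, sigma, rng):
    """One step of the coupled RWMH kernel; returns (X_t, Y_t)."""
    log_q = lambda z, mean: -0.5 * np.sum((z - mean) ** 2) / sigma ** 2
    # Maximal coupling of the proposals N(x, sigma^2 I) and N(y, sigma^2 I)
    xp = x + sigma * rng.standard_normal(x.shape)
    if np.log(rng.uniform()) + log_q(xp, x) <= log_q(xp, y):
        yp = xp                                    # proposals meet
    else:
        while True:
            yp = y + sigma * rng.standard_normal(y.shape)
            if np.log(rng.uniform()) + log_q(yp, y) > log_q(yp, x):
                break
    # Common uniform couples the two accept/reject decisions
    log_u = np.log(rng.uniform())
    x_new = xp if log_u <= log_pi(xp) - log_pi(x) else x
    y_new = yp if log_u <= log_pi(yp) - log_pi(y) else y
    return x_new, y_new
```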
RWMH on Gaussian target: trajectories
[Figure: coupled RWMH trajectories. $\pi = N(0, 1)$, $\pi_0 = N(10, 3^2)$, $K$ = RWMH with proposal std 0.5]
RWMH on Gaussian target: meetings
[Figure: pairs of coupled chains meeting exactly, same setting]
RWMH on Gaussian target: meeting times
[Figure: histogram of meeting times, roughly supported on 0 to 200 iterations, same setting]
RWMH on Gaussian target: scaling with dimension
$\pi = \pi_0 = N(0, I_d)$, $\bar{K}$ = coupled RWMH with proposal std $C d^{-1/2}$
[Figure: average meeting time against dimension ($d = 2$ to $10$) for $C = 1.0, 1.5, 2.0$; meeting times grow into the thousands]
HMC on Gaussian target: scaling with dimension
$\pi = \pi_0 = N(0, I_d)$, $\bar{K}$ = coupled HMC with step size $C d^{-1/4}$
[Figure: average meeting time against dimension ($d = 2000$ to $10000$) for $C = 1.0, 1.5, 2.0$; meeting times stay between roughly 40 and 55]
Hamiltonian Monte Carlo (HMC)
• Define the potential energy $U(q) = -\log \pi(q)$ and the Hamiltonian $E(q, p) = U(q) + \frac{1}{2}|p|^2$
• Hamiltonian dynamics $(q(t), p(t)) \in \mathbb{R}^d \times \mathbb{R}^d$, for $t \geq 0$:
$\frac{d}{dt} q(t) = \nabla_p E(q(t), p(t)) = p(t)$
$\frac{d}{dt} p(t) = -\nabla_q E(q(t), p(t)) = -\nabla U(q(t))$
• Ideal algorithm defining a $\pi$-invariant $K$: at iteration $t - 1$, the Markov chain is at state $X_{t-1}$
1 Set $q(0) = X_{t-1}$ and sample $p(0) \sim N(0, I_d)$
2 Solve the dynamics over a time length $T$ to get $(q(T), p(T))$
3 Set $X_t = q(T)$
Hamiltonian Monte Carlo (HMC)
• Solving the Hamiltonian dynamics exactly is typically intractable
• Leap-frog integrator:
1 Set $q_0 = X_{t-1}$ and sample $p_0 \sim N(0, I_d)$
2 For $\ell = 0, \ldots, L - 1$, compute
$p_{\ell+1/2} = p_\ell - \frac{\varepsilon}{2} \nabla U(q_\ell)$
$q_{\ell+1} = q_\ell + \varepsilon\, p_{\ell+1/2}$
$p_{\ell+1} = p_{\ell+1/2} - \frac{\varepsilon}{2} \nabla U(q_{\ell+1})$
3 Sample $U \sim \mathcal{U}([0, 1])$
4 If
$U \leq \min\{1, \exp[E(q_0, p_0) - E(q_L, p_L)]\},$
set $X_t = q_L$, otherwise set $X_t = X_{t-1}$
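A sketch of one HMC step built from this leap-frog scheme; `U` and `grad_U` stand for the potential and $\nabla U$, and all names are illustrative.

```python
import numpy as np

def hmc_step(x, U, grad_U, eps, L, rng):
    """One HMC step: leap-frog integration plus Metropolis correction."""
    q, p = x.copy(), rng.standard_normal(x.shape)
    energy0 = U(x) + 0.5 * np.sum(p ** 2)
    for _ in range(L):
        p = p - 0.5 * eps * grad_U(q)   # half step for momentum
        q = q + eps * p                 # full step for position
        p = p - 0.5 * eps * grad_U(q)   # half step for momentum
    energy1 = U(q) + 0.5 * np.sum(p ** 2)
    if np.log(rng.uniform()) <= energy0 - energy1:
        return q
    return x
```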
Coupled Hamiltonian dynamics
• Consider coupling two particles $(q^i(t), p^i(t))$, $i = 1, 2$, each following the Hamiltonian dynamics
• For a Gaussian target $\pi = N(\mu, \sigma^2)$
$q^1(t) - q^2(t) = \cos(t/\sigma)\,(q^1(0) - q^2(0)) + \sigma \sin(t/\sigma)\,(p^1(0) - p^2(0)),$
therefore if $p^1(0) = p^2(0)$ then
$|q^1(t) - q^2(t)| = |\cos(t/\sigma)|\,|q^1(0) - q^2(0)|$
• The difference $\Delta(t) = q^1(t) - q^2(t)$ satisfies
$\frac{1}{2} \frac{d}{dt} |\Delta(t)|^2 = \Delta(t)^T \{p^1(t) - p^2(t)\},$
therefore if $p^1(0) = p^2(0)$ then $t \mapsto |\Delta(t)|^2$ has a stationary point at $t = 0$
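Because the Gaussian case has an explicit flow, the contraction identity can be checked numerically; a small self-contained check with illustrative values.

```python
import numpy as np

# Exact Hamiltonian flow for pi = N(mu, sigma^2):
# q(t) = mu + (q(0) - mu) cos(t/sigma) + sigma p(0) sin(t/sigma)
mu, sigma = 0.0, 1.0
q1, q2, p = 3.0, -2.0, 0.7                 # shared initial momentum
for t in (0.5, 1.0, np.pi / 2):
    flow = lambda q: (mu + (q - mu) * np.cos(t / sigma)
                      + sigma * p * np.sin(t / sigma))
    lhs = abs(flow(q1) - flow(q2))
    rhs = abs(np.cos(t / sigma)) * abs(q1 - q2)
    assert np.isclose(lhs, rhs)            # contraction factor |cos(t/sigma)|
```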
Coupled Hamiltonian dynamics
• To characterize the stationary point:
$\frac{1}{2} \frac{d^2}{dt^2} |\Delta(0)|^2 = -\Delta(0)^T \{\nabla U(q^1(0)) - \nabla U(q^2(0))\} \leq -\alpha |\Delta(0)|^2$
if $q^1(0), q^2(0) \in S$, where $U$ is $\alpha$-strongly convex on $S$
• Since $t = 0$ is a strict local maximum point, there exists $T > 0$ such that for any $t \in (0, T]$
$|q^1(t) - q^2(t)| \leq \rho_t |q^1(0) - q^2(0)|, \quad \rho_t \in [0, 1)$
Logistic regression: distance against integration time
[Figure: distance between coupled chains (roughly 15 to 35) against integration time from 0 to 1]
Coupled Hamiltonian dynamics
• Assuming $\nabla U$ is $\beta$-Lipschitz, we established contraction using a Taylor expansion around $t = 0$ (Lemma 1)
• More quantitative results by Mangoubi and Smith (2017, Theorem 6) and Bou-Rabee et al. (2018, Theorem 2.1) give
$T = \frac{\sqrt{\alpha}}{\beta} \quad \text{and} \quad \rho_t = 1 - \frac{1}{2}\alpha t^2$
• The coupling can be effective in high dimensions if the problem is well-conditioned
Coupled HMC kernel ($\bar{K}_{\varepsilon,L}$)
At iteration $t - 1$, the two Markov chains are at states $X_{t-1}$ and $Y_{t-1}$
1 Set $q_0^1 = X_{t-1}$, $q_0^2 = Y_{t-1}$ and sample a common $p_0 \sim N(0, I_d)$
2 Perform leap-frog integration to obtain $(q_L^i, p_L^i)$, $i = 1, 2$
3 Sample a common $U \sim \mathcal{U}([0, 1])$
4 If
$U \leq \min\{1, \exp[E(q_0^1, p_0) - E(q_L^1, p_L^1)]\},$
set $X_t = q_L^1$, otherwise set $X_t = X_{t-1}$
5 If
$U \leq \min\{1, \exp[E(q_0^2, p_0) - E(q_L^2, p_L^2)]\},$
set $Y_t = q_L^2$, otherwise set $Y_t = Y_{t-1}$
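A sketch of one step of $\bar{K}_{\varepsilon,L}$, reusing the leap-frog scheme from the HMC sketch above: both chains share the initial momentum $p_0$ and the uniform $U$.

```python
import numpy as np

def coupled_hmc_step(x, y, U, grad_U, eps, L, rng):
    """One step of the coupled HMC kernel; returns (X_t, Y_t)."""
    def leapfrog(q, p):
        for _ in range(L):
            p = p - 0.5 * eps * grad_U(q)
            q = q + eps * p
            p = p - 0.5 * eps * grad_U(q)
        return q, p
    p0 = rng.standard_normal(x.shape)      # common initial momentum
    qx, px = leapfrog(x.copy(), p0)
    qy, py = leapfrog(y.copy(), p0)
    log_u = np.log(rng.uniform())          # common uniform
    x_new = qx if log_u <= (U(x) + 0.5 * p0 @ p0) - (U(qx) + 0.5 * px @ px) else x
    y_new = qy if log_u <= (U(y) + 0.5 * p0 @ p0) - (U(qy) + 0.5 * py @ py) else y
    return x_new, y_new
```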
Logistic regression: distance after 1000 iterations
[Figure: distance between coupled chains after 1000 iterations (log scale, $10^{-12}$ to $10^{0}$) against integration time from 0.25 to 1.25, for $L = 10, 20, 30$]
Mixture of coupled kernels (kernel $\bar{K}$)
• To enable exact meetings, we consider, for $\gamma \in (0, 1)$,
$\bar{K} = (1 - \gamma) \underbrace{\bar{K}_{\varepsilon,L}}_{\text{coupled HMC}} + \gamma \underbrace{\bar{K}_\sigma}_{\text{coupled RWMH}}$
• Choice of the RWMH proposal std $\sigma$: distance between chains $< \sigma <$ spread of $\pi$
• We advocate a small RWMH probability $\gamma$ to minimize the inefficiency
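The mixture is straightforward to implement on top of the two coupled kernels sketched earlier; `gamma` plays the role of $\gamma$ and the kernel arguments are hypothetical functions.

```python
import numpy as np

def mixture_step(x, y, coupled_hmc, coupled_rwmh, gamma, rng):
    """One step of Kbar = (1 - gamma) coupled HMC + gamma coupled RWMH."""
    if rng.uniform() < gamma:
        return coupled_rwmh(x, y, rng)   # can trigger an exact meeting
    return coupled_hmc(x, y, rng)        # contracts the distance
```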
Geometric tails of meeting time
• To ensure validity of the unbiased estimators, we need:
1 Convergence of the marginal chain (inherited from HMC)
2 Meeting time with geometric tails (Theorem 2)
3 Faithfulness (by construction)
• Main assumptions:
1 $\nabla U$ is globally Lipschitz
2 $U$ is strongly convex on $S \subset \mathbb{R}^d$
3 A geometric drift condition on the HMC kernel
• (Theorem 2) The meeting time has geometric tails if $(\varepsilon, L, \sigma, \gamma)$ are small enough
• The assumptions can be verified for Gaussian targets and Bayesian logistic regression, relying on Durmus et al. (2017, Theorem 9)
Concluding remarks
• Bou-Rabee et al. (2018) introduced another coupling for HMC
• One could combine the synchronous coupling ($L = 1$) with the maximal coupling for MALA
• Extension to other variants of HMC:
1 No-U-Turn Sampler (Hoffman and Gelman, 2014)
2 Partial momentum refreshment (Horowitz, 1991)
3 Different choices of kinetic energy (Livingstone et al., 2017)
4 Hamiltonian bouncy particle sampler (Vanetti et al., 2017)
References
• J. Heng and P. Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika (to appear), arXiv:1709.00404, 2019.
• R package: https://github.com/pierrejacob/debiasedhmc
• P. Jacob, J. O'Leary, Y. Atchadé. Unbiased Markov chain Monte Carlo with couplings. arXiv:1708.03625, 2017.
• P. Jacob, F. Lindsten, T. Schön. Smoothing with couplings of conditional particle filters. JASA, 2018.
• L. Middleton, G. Deligiannidis, A. Doucet, P. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. arXiv:1807.08691, 2018.