Markov chain Monte Carlo (MCMC) methods are commonly used to approximate properties of target probability distributions. However, MCMC estimators are generally biased for any fixed number of iterations. These slides discuss techniques for constructing unbiased estimators from MCMC output, including regeneration, sequential Monte Carlo samplers, and coupled Markov chains. In particular, running two coupled Markov chains and combining the differences of their values up to the time they meet yields an unbiased estimator, provided certain conditions hold.
Markov chain Monte Carlo methods and some attempts at parallelizing them
1. Markov chain Monte Carlo methods and some attempts at parallelizing them
Pierre E. Jacob
Department of Statistics, Harvard University
(and many fantastic collaborators!)
MIT IDS.190, October 2019
blog: https://statisfaction.wordpress.com
2. Setting
Continuous or discrete space of dimension d.
Target probability distribution π, with probability density/mass function x ↦ π(x).
Goal: approximate π, e.g. compute
Eπ[h(X)] = ∫ h(x) π(x) dx = π(h),
for a class of "test" functions h.
3. Monte Carlo
Originates from physics, and is still very much a research topic in physics, e.g. K. Binder et al., Monte Carlo methods in statistical physics, 2012.
Often state-of-the-art for numerical integration, e.g. E. Novak, Some results on the complexity of numerical integration, 2016.
Plays an important role in Bayesian inference, e.g. P. Green et al., Bayesian computation: a summary of the current state, and samples backwards and forwards, 2015.
Can be useful for many other tasks in statistics, e.g. J. Besag, MCMC for Statistical Inference, 2001.
See also P. Diaconis, The MCMC revolution, 2009.
4. Outline
1 Monte Carlo and bias
2 Sequential Monte Carlo samplers
3 Regeneration
4 Unbiased estimators from coupled Markov chains
5 Bonus: new convergence diagnostics for MCMC
5. Outline
1 Monte Carlo and bias
2 Sequential Monte Carlo samplers
3 Regeneration
4 Unbiased estimators from coupled Markov chains
5 Bonus: new convergence diagnostics for MCMC
6. Markov chain Monte Carlo
Initially X0 ∼ π0, then Xt | Xt−1 ∼ P(Xt−1, ·) for t = 1, …, T.
Estimator:
(1/(T − b)) Σ_{t=b+1}^{T} h(Xt),
where b iterations are discarded as burn-in.
Might converge to Eπ[h(X)] as T → ∞ by the ergodic theorem.
Biased for any fixed b, T, unless π0 is equal to π.
Averaging independent copies of such estimators, for fixed b and T, would not provide a consistent estimator of Eπ[h(X)] as the number of independent copies goes to infinity.
7. Example: Metropolis–Hastings kernel P
With the Markov chain at state Xt,
1 propose X′ ∼ q(Xt, ·),
2 sample U ∼ Uniform(0, 1),
3 if U ≤ π(X′)q(X′, Xt) / (π(Xt)q(Xt, X′)), set Xt+1 = X′; otherwise set Xt+1 = Xt.
Hastings, Monte Carlo sampling methods using Markov chains and their applications, 1970.
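A minimal Python sketch of this kernel, assuming the running example of these slides (target π = N(0, 1), Normal random-walk proposal with std 0.5, initialization from π0 = N(10, 3²)); with a symmetric proposal the q ratio cancels. The final lines compute the burn-in estimator of slide 6.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(x):
    # log-density of the target N(0, 1), up to an additive constant
    return -0.5 * x**2

def rwmh(n_iters, proposal_std=0.5):
    xs = np.empty(n_iters + 1)
    xs[0] = rng.normal(10.0, 3.0)  # X_0 ~ pi_0 = N(10, 3^2)
    for t in range(n_iters):
        prop = xs[t] + proposal_std * rng.normal()  # symmetric proposal
        # accept if U <= pi(X')/pi(X_t); the q ratio cancels for a random walk
        if np.log(rng.uniform()) <= log_pi(prop) - log_pi(xs[t]):
            xs[t + 1] = prop
        else:
            xs[t + 1] = xs[t]
    return xs

chain = rwmh(n_iters=10_000)
b = 100  # burn-in; the average below remains biased for any fixed b, T
print(chain[b + 1:].mean())
```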
8. MCMC trace
[Trace plot of the chain.] π = N(0, 1), RWMH with Normal proposal std = 0.5, π0 = N(10, 3²).
9. MCMC marginal distributions
[Marginal distributions of Xt across iterations.] π = N(0, 1), RWMH with Normal proposal std = 0.5, π0 = N(10, 3²).
10. Independent replicates and MCMC
The bias is the difference |E[h(Xt)] − Eπ[h(X)]| for fixed t.
The bias has always been recognized as an obstacle on the way to parallelizing Monte Carlo calculations, e.g.
"When running parallel Monte Carlo with many computers, it is more important to start with an unbiased (or low-bias) estimate than with a low-variance estimate."
Rosenthal, Parallel computing and Monte Carlo algorithms, 2000.
For general statistical estimators, mean squared error is often the preferred measure of accuracy.
In Monte Carlo, variance can be both quantified and arbitrarily reduced with independent runs, but neither is true for the bias.
11. Outline
1 Monte Carlo and bias
2 Sequential Monte Carlo samplers
3 Regeneration
4 Unbiased estimators from coupled Markov chains
5 Bonus: new convergence diagnostics for MCMC
12. Importance sampling
Importance sampling relies on a proposal distribution q, chosen by the user to approximate π.
1 Sample X1:N ∼ q, independently.
2 Weight w(Xn) = π(Xn)/q(Xn).
3 Normalize the weights to obtain W1:N.
The procedure yields
π̂N(·) = Σ_{n=1}^{N} Wn δ_{Xn}(·),
which approximates π as N → ∞ under conditions on q and π.
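A sketch of these three steps as self-normalized importance sampling, assuming an illustrative pair of distributions (target N(0, 1), proposal N(1, 2²)):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N = 100_000
xs = rng.normal(1.0, 2.0, size=N)                    # 1. X_{1:N} ~ q = N(1, 2^2)
w = norm.pdf(xs, 0.0, 1.0) / norm.pdf(xs, 1.0, 2.0)  # 2. w(X_n) = pi(X_n)/q(X_n)
W = w / w.sum()                                      # 3. normalized weights W_{1:N}
print(np.sum(W * xs))                                # estimates E_pi[X] = 0
```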
13. Importance sampling with MCMC proposals
Finding a proposal q that approximates π might be difficult. Can we use MCMC as an importance sampling proposal?
Something that would look like:
1 Sample X1:N by running N chains for T steps.
2 Weight w(Xn) somehow (?).
3 Normalize the weights to obtain W1:N.
An immediate difficulty is that the marginal distributions of MCMC chains are generally intractable, so importance weights seem hard to compute.
14. Annealed importance sampling
For instance, sample X0 ∼ π0 and X1 | X0 ∼ P(X0, ·).
Problem: the marginal distribution of X1 is intractable.
Introduce the backward kernel L(x1, x0) = P(x0, x1)π(x0)/π(x1).
Then consider
proposal distribution q̄(x0, x1) = π0(x0)P(x0, x1),
target distribution π̄(x0, x1) = π(x1)L(x1, x0).
Writing down the importance sampling procedure leads to
tractable weights ∝ π(x0)/π0(x0),
desired marginal distribution: ∫ π̄(x0, x1) dx0 = π(x1).
Neal, Annealed importance sampling, 2001.
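A sketch of this one-step construction, assuming illustrative choices π0 = N(1, 2²), π = N(0, 1), and a single random-walk MH move as the kernel P; the weights depend only on x0, and the weighted x1 samples approximate π. (With π0 far from π the weights degenerate, which is what the tempered sequences of the next slide address.)

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N = 100_000
x0 = rng.normal(1.0, 2.0, size=N)                     # X_0 ~ pi_0
logw = norm.logpdf(x0, 0, 1) - norm.logpdf(x0, 1, 2)  # log pi(x_0)/pi_0(x_0)

# one MH move X_1 | X_0 ~ P(X_0, .); P leaves pi invariant, weights unchanged
prop = x0 + 0.5 * rng.normal(size=N)
accept = np.log(rng.uniform(size=N)) <= norm.logpdf(prop, 0, 1) - norm.logpdf(x0, 0, 1)
x1 = np.where(accept, prop, x0)

W = np.exp(logw - logw.max())
W /= W.sum()
print(np.sum(W * x1))                                 # estimates E_pi[X] = 0
```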
15. Sequential Monte Carlo samplers
Del Moral, Doucet & Jasra, SMC samplers, 2006.
AIS and SMC samplers work by introducing a sequence of target distributions πt, for t = 0, …, T, and a sequence of MCMC kernels Pt targeting πt.
Then N chains start from π0 and
move through the specified Markov kernels,
are weighted using ratios of successive target distributions,
are resampled according to the weights (in SMC samplers).
At the final step T, the weighted samples approximate π.
The resampling steps induce interaction between the chains, which possibly means communication between machines.
Whiteley, Lee & Heine, On the role of interaction in sequential Monte Carlo algorithms, 2016.
16. Sequential Monte Carlo samplers
[Illustration.] π = N(0, 1), adaptive SMC sampler with MH moves, π0 = N(10, 3²).
17. Outline
1 Monte Carlo and bias
2 Sequential Monte Carlo samplers
3 Regeneration
4 Unbiased estimators from coupled Markov chains
5 Bonus: new convergence diagnostics for MCMC
18. Regeneration in Markov chain samplers
Mykland, Tierney & Yu, Regeneration in Markov chain samplers, 1995.
[Trace plot: x versus iteration, 0 to 200.]
We might be able to identify regeneration times (Tn)n≥1 such that the tours (X_{T_{n−1}}, …, X_{T_n − 1}) are i.i.d., and such that
Σ_{n=1}^{N} Σ_{t=T_{n−1}}^{T_n − 1} h(Xt) / Σ_{n=1}^{N} (Tn − Tn−1) → Eπ[h(X)] almost surely as N → ∞,
…but it might be difficult to identify these times.
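A sketch of this ratio estimator, assuming hypothetical inputs: the chain values xs and an array T of identified regeneration times, so that the tours run from T[0] up to (but excluding) T[-1].

```python
import numpy as np

def regenerative_estimate(h, xs, T):
    # sum of h over the complete tours, divided by the total tour length
    num = sum(h(xs[t]) for t in range(T[0], T[-1]))
    den = T[-1] - T[0]
    return num / den
```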
19. Brockwell and Kadane's regeneration technique
Design a new chain such that regeneration is easier to identify.
State space E ∪ {α}, Markov kernel P̃ on E ∪ {α} that targets π̃, such that π̃ is equal to π on E.
Set π̃(α) (to be chosen), and design a "re-entry" proposal φ on E.
If Xt = α, propose X′ ∼ φ on E, with acceptance probability min(1, π(X′)/(π̃(α)φ(X′)));
if Xt ∈ E, propose a move to α, with acceptance probability min(1, π̃(α)φ(Xt)/π(Xt)).
Perform these moves with probability ω; otherwise sample Xt+1 ∼ P(Xt, ·) if Xt ∈ E, and set Xt+1 = α if Xt = α.
With the new chain, every re-entry into E is a regeneration.
20. Illustration of regeneration technique
π = N(0, 1), MH with Normal proposal std = 0.5, π0 = N(10, 3²).
Set π̃(α) = 1, φ = N(2, 1), ω = 0.1.
[Trace plot: x versus iteration, 0 to 200.]
Brockwell & Kadane, Identification of regeneration times in MCMC simulation, with application to adaptive schemes, 2005.
See also Nummelin, MC's for MCMC'ists, 2002.
21. Outline
1 Monte Carlo and bias
2 Sequential Monte Carlo samplers
3 Regeneration
4 Unbiased estimators from coupled Markov chains
5 Bonus: new convergence diagnostics for MCMC
22. Coupled chains
Glynn & Rhee, Exact estimation for MC equilibrium expectations, 2014.
Generate two chains (Xt) and (Yt) as follows:
sample X0 and Y0 from π0 (independently, or not),
sample X1 | X0 ∼ P(X0, ·),
for t ≥ 1, sample (Xt+1, Yt) | (Xt, Yt−1) ∼ P̄((Xt, Yt−1), ·).
P̄ must be such that
Xt+1 | Xt ∼ P(Xt, ·) and Yt | Yt−1 ∼ P(Yt−1, ·) (thus Xt and Yt have the same distribution for all t ≥ 0),
there exists a random time τ such that Xt = Yt−1 for all t ≥ τ (the chains meet and remain "faithful").
23. Metropolis on Normal target: coupled paths
[Coupled trace plots: x versus iteration, 0 to 200.]
π = N(0, 1), RWMH with Normal proposal std = 0.5, π0 = N(10, 3²).
24. Metropolis on Normal target: coupled paths
[Coupled trace plots: x versus iteration, 0 to 200.]
π = N(0, 1), RWMH with Normal proposal std = 0.5, π0 = N(10, 3²).
25. Debiasing idea (one slide version)
Limit as a telescoping sum, for all k ≥ 0:
Eπ[h(X)] = lim_{t→∞} E[h(Xt)] = E[h(Xk)] + Σ_{t=k+1}^{∞} E[h(Xt) − h(Xt−1)].
Since Xt and Yt have the same distribution for all t ≥ 0,
Eπ[h(X)] = E[h(Xk)] + Σ_{t=k+1}^{∞} E[h(Xt) − h(Yt−1)].
If we can swap expectation and limit,
Eπ[h(X)] = E[h(Xk) + Σ_{t=k+1}^{∞} (h(Xt) − h(Yt−1))].
The random variable inside the above expectation is unbiased for Eπ[h(X)].
26. Unbiased estimators
An unbiased estimator, for any user-chosen k, is given by
Hk(X, Y) = h(Xk) + Σ_{t=k+1}^{τ−1} (h(Xt) − h(Yt−1)),
with the convention Σ_{t=k+1}^{τ−1} {·} = 0 if τ − 1 < k + 1.
h(Xk) alone is biased; the other terms correct for the bias.
Cost: τ − 1 calls to P̄ and 1 + max(0, k − τ) calls to P.
Glynn & Rhee, Exact estimation for Markov chain equilibrium expectations, 2014. Also Agapiou, Roberts & Vollmer, Unbiased Monte Carlo: Posterior estimation for intractable/infinite-dimensional models, 2018.
Note: the same reasoning would work with arbitrary lags L ≥ 1.
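A direct transcription of Hk, assuming the coupled chains are stored as arrays xs = (X0, …, XT) and ys = (Y0, …, YT−1) with meeting time tau, so that xs[t] == ys[t - 1] for t ≥ tau:

```python
def H_k(h, xs, ys, tau, k=0):
    est = h(xs[k])
    # bias-correction terms; the range is empty when tau - 1 < k + 1
    for t in range(k + 1, tau):
        est += h(xs[t]) - h(ys[t - 1])
    return est
```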
27. Conditions
Jacob, O'Leary, Atchadé, Unbiased MCMC with couplings, 2019.
1 The marginal chain converges: E[h(Xt)] → Eπ[h(X)], and h(Xt) has a finite (2 + η)-th moment for all t.
2 The meeting time τ has geometric tails: ∃ C < +∞, ∃ δ ∈ (0, 1), ∀ t ≥ 0: P(τ > t) ≤ Cδ^t.
3 The chains stay together: Xt = Yt−1 for all t ≥ τ.
Condition 2 is itself implied by e.g. a geometric drift condition.
Under these conditions, Hk(X, Y) is unbiased and has finite expected cost and finite variance, for all k.
28. Metropolis on Normal target: meeting times
[Histogram of meeting times, 0 to 200.]
π = N(0, 1), RWMH with Normal proposal std = 0.5, π0 = N(10, 3²).
29. Metropolis on Normal target: estimators of Eπ[X]
[Histogram of estimators, roughly −1000 to 1000.] k = 0.
E[2τ] ≈ 96, V[H0(X, Y)] ≈ 65,000.
30. Asymptotic inefficiency
Final estimator: average of R independent estimators.
In a given computing time, more estimators can be produced if each estimator is cheaper.
An appropriate measure of performance is
[expected cost] × [variance],
called the asymptotic inefficiency.
Glynn & Whitt, Asymptotic efficiency of simulation estimators, 1992.
Glynn & Heidelberger, Bias properties of budget constrained simulations, 1990.
31. Metropolis on Normal target: estimators of Eπ[X]
[Histogram of estimators, roughly −200 to 100.] k = 100.
E[max(k + τ, 2τ)] ≈ 148, V[Hk(X, Y)] ≈ 100.
32. Metropolis on Normal target: estimators of Eπ[X]
[Histogram of estimators, roughly −4 to 4.] k = 200.
E[max(k + τ, 2τ)] ≈ 248, V[Hk(X, Y)] ≈ 1.
33. Time-averaged unbiased estimators
Efficiency matters, thus in practice we recommend a variation of the previous estimator, defined for integers k ≤ m as
Hk:m(X, Y) = (1/(m − k + 1)) Σ_{t=k}^{m} Ht(X, Y),
which can also be written
(1/(m − k + 1)) Σ_{t=k}^{m} h(Xt) + Σ_{t=k+1}^{τ−1} min(1, (t − k)/(m − k + 1)) (h(Xt) − h(Yt−1)),
i.e. a standard MCMC average + a bias correction term.
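With the same array convention as before (xs long enough to cover max(m, τ − 1)), a short sketch of Hk:m as the MCMC average plus the weighted correction term:

```python
import numpy as np

def H_km(h, xs, ys, tau, k, m):
    mcmc_average = np.mean([h(xs[t]) for t in range(k, m + 1)])
    bias_correction = sum(
        min(1.0, (t - k) / (m - k + 1)) * (h(xs[t]) - h(ys[t - 1]))
        for t in range(k + 1, tau)
    )
    return mcmc_average + bias_correction
```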
34. Metropolis on Normal target: time-averaged estimators
[Histogram of estimators, roughly −0.4 to 0.4.] k = 200, m = 1000.
E[max(m + τ, 2τ)] ≈ 1048, V[Hk:m(X, Y)] ≈ 0.028.
35. How to design appropriate coupled chains?
To implement the proposed unbiased estimators, we need to sample from a Markov kernel P̄ such that, when (Xt+1, Yt) is sampled from P̄((Xt, Yt−1), ·),
marginally Xt+1 | Xt ∼ P(Xt, ·) and Yt | Yt−1 ∼ P(Yt−1, ·),
it is possible that Xt+1 = Yt exactly for some t ≥ 0,
if Xt = Yt−1, then Xt+1 = Yt almost surely.
36. Couplings of MCMC algorithms
We can find many couplings in the literature…
Propp & Wilson, Exact sampling with coupled Markov chains and applications to statistical mechanics, Random Structures & Algorithms, 1996.
Johnson, Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths, JASA, 1996.
Neal, Circularly-coupled Markov chain sampling, UoT tech report, 1999.
Pinto & Neal, Improving Markov chain Monte Carlo estimators by coupling to an approximating chain, UoT tech report, 2001.
Glynn & Rhee, Exact estimation for Markov chain equilibrium expectations, Journal of Applied Probability, 2014.
37. Couplings of MCMC algorithms
Conditional particle filters: Jacob, Lindsten, Schön, Smoothing with Couplings of Conditional Particle Filters, 2019.
Metropolis–Hastings, Gibbs samplers, parallel tempering: Jacob, O'Leary, Atchadé, Unbiased MCMC with couplings, 2019.
Hamiltonian Monte Carlo: Heng & Jacob, Unbiased HMC with couplings, 2019.
Pseudo-marginal MCMC, exchange algorithm: Middleton, Deligiannidis, Doucet, Jacob, Unbiased MCMC for intractable target distributions, 2018.
Particle independent Metropolis–Hastings: Middleton, Deligiannidis, Doucet, Jacob, Unbiased Smoothing using Particle Independent Metropolis-Hastings, 2019.
38. Maximal couplings
(X, Y) follows a coupling of p and q if X ∼ p and Y ∼ q.
The coupling inequality states that
P(X = Y) ≤ 1 − ‖p − q‖TV,
for any coupling, with ‖p − q‖TV = (1/2) ∫ |p(x) − q(x)| dx.
Maximal couplings achieve the bound.
40. Maximal coupling: algorithm
Requires: evaluations of p and q, sampling from p and q.
1 Sample X ∼ p and W ∼ Uniform(0, 1). If W ≤ q(X)/p(X), set Y = X and output (X, Y).
2 Otherwise, sample Y′ ∼ q and W′ ∼ Uniform(0, 1) until W′ > p(Y′)/q(Y′), then set Y = Y′ and output (X, Y).
Output: a pair (X, Y) such that X ∼ p, Y ∼ q, and P(X = Y) is maximal.
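A sketch of this algorithm for two univariate Normals p and q (illustrative choices); it uses only density evaluations and draws from p and q, on the log scale for numerical stability:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def max_coupling(mu_p, mu_q, sigma):
    log_p = lambda z: norm.logpdf(z, mu_p, sigma)
    log_q = lambda z: norm.logpdf(z, mu_q, sigma)
    x = rng.normal(mu_p, sigma)                      # step 1: X ~ p
    if np.log(rng.uniform()) <= log_q(x) - log_p(x):
        return x, x                                  # W <= q(X)/p(X): set Y = X
    while True:                                      # step 2: Y' ~ q until W' > p(Y')/q(Y')
        y = rng.normal(mu_q, sigma)
        if np.log(rng.uniform()) > log_p(y) - log_q(y):
            return x, y
```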
41. Back to Metropolis–Hastings (kernel P)
At each iteration t, with the Markov chain at state Xt,
1 propose X′ ∼ q(Xt, ·),
2 sample U ∼ Uniform(0, 1),
3 if U ≤ π(X′)q(X′, Xt) / (π(Xt)q(Xt, X′)), set Xt+1 = X′; otherwise set Xt+1 = Xt.
How can we propagate two MH chains from states Xt and Yt−1 such that {Xt+1 = Yt} can happen?
42. Coupling of Metropolis–Hastings (kernel P̄)
At each iteration t, with the two Markov chains at states Xt and Yt−1,
1 propose (X′, Y′) from a maximal coupling of q(Xt, ·) and q(Yt−1, ·),
2 sample U ∼ Uniform(0, 1),
3 if U ≤ π(X′)q(X′, Xt) / (π(Xt)q(Xt, X′)), set Xt+1 = X′; otherwise set Xt+1 = Xt;
if U ≤ π(Y′)q(Y′, Yt−1) / (π(Yt−1)q(Yt−1, Y′)), set Yt = Y′; otherwise set Yt = Yt−1.
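Combining the pieces, a sketch of one transition of P̄ for the random-walk case, reusing log_pi and max_coupling from the earlier sketches; a single shared uniform U drives both accept/reject steps:

```python
def coupled_mh_step(x, y, proposal_std=0.5):
    # coupled proposals from a maximal coupling of q(x, .) and q(y, .)
    xp, yp = max_coupling(x, y, sigma=proposal_std)
    log_u = np.log(rng.uniform())  # common uniform for both chains
    x_next = xp if log_u <= log_pi(xp) - log_pi(x) else x
    y_next = yp if log_u <= log_pi(yp) - log_pi(y) else y
    return x_next, y_next
```

Once the chains are at the same state, the maximal coupling returns identical proposals and the shared U produces identical decisions, so the chains remain faithful.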
43. Scaling with dimension (not doing so well)
With a naive maximal coupling of the proposals…
[Average meeting time versus dimension (1 to 5), log scale; initialization: target offset.]
44. Scaling with dimension (much better)
With "reflection-maximal" couplings of the proposals…
[Average meeting time versus dimension (1 to 50); initialization: target offset.]
45. Hamiltonian Monte Carlo
Introduce the potential energy U(q) = −log π(q), and the total energy E(q, p) = U(q) + (1/2)|p|².
Hamiltonian dynamics for (q(s), p(s)), where s ≥ 0:
(d/ds) q(s) = ∇p E(q(s), p(s)),
(d/ds) p(s) = −∇q E(q(s), p(s)).
Solving the Hamiltonian dynamics exactly is not feasible, but discretization + a Metropolis–Hastings correction ensure that π remains invariant.
Common random numbers can make two HMC chains contract, under assumptions on the target such as strong log-concavity.
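A sketch of HMC with leapfrog discretization and an MH correction, plus a coupled step that feeds both chains the same momentum draw and the same accept/reject uniform (common random numbers); the standard-Normal target and the step-size/trajectory-length values are illustrative choices, not fixed by the slide:

```python
import numpy as np

rng = np.random.default_rng(5)

def U(q):            # potential U(q) = -log pi(q), here pi = N(0, I_d)
    return 0.5 * np.dot(q, q)

def grad_U(q):
    return q

def leapfrog(q, p, step, n_steps):
    p = p - 0.5 * step * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + step * p
        p = p - step * grad_U(q)
    q = q + step * p
    p = p - 0.5 * step * grad_U(q)
    return q, p

def hmc_step(q, p, log_u, step=0.1, n_steps=10):
    q_new, p_new = leapfrog(q, p, step, n_steps)
    # MH correction: accept with probability exp(E(q, p) - E(q', p'))
    log_accept = (U(q) + 0.5 * np.dot(p, p)) - (U(q_new) + 0.5 * np.dot(p_new, p_new))
    return q_new if log_u <= log_accept else q

def coupled_hmc_step(x, y):
    p = rng.normal(size=x.shape)   # shared momentum at every step
    log_u = np.log(rng.uniform())  # shared accept/reject uniform
    return hmc_step(x, p, log_u), hmc_step(y, p, log_u)
```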
46. Coupling of Hamiltonian Monte Carlo
Mangoubi & Smith, Rapid mixing of HMC on strongly log-concave distributions, 2017.
Bou-Rabee, Eberle & Zimmer, Coupling and Convergence for Hamiltonian Monte Carlo, 2018.
Heng & Jacob, Unbiased HMC with couplings, 2019.
47. Coupling of Hamiltonian Monte Carlo
Figure 2 of Mangoubi & Smith, Rapid mixing of HMC on strongly log-concave distributions, 2017: coupling two copies X1, X2, … (blue) and Y1, Y2, … (green) of HMC by choosing the same momentum pi at every step.
48. Scaling of Hamiltonian Monte Carlo
[Average meeting time versus dimension (10 to 300); initialization: target offset.]
49. Outline
1 Monte Carlo and bias
2 Sequential Monte Carlo samplers
3 Regeneration
4 Unbiased estimators from coupled Markov chains
5 Bonus: new convergence diagnostics for MCMC
50. Assessing finite-time bias of MCMC
Total variation distance between Xk ∼ πk and π = lim_{k→∞} πk:
‖πk − π‖TV = (1/2) sup_{h: |h|≤1} |E[h(Xk)] − Eπ[h(X)]|
= (1/2) sup_{h: |h|≤1} |E[Σ_{t=k+1}^{τ−1} (h(Xt) − h(Yt−1))]|
≤ E[max(0, τ − k − 1)].
[Left: histogram of meeting times, 0 to 200. Right: resulting upper bound on ‖πk − π‖TV as a function of k, log scale.]
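A sketch of this bound as a Monte Carlo estimate, assuming taus is an array of independent meeting times from repeated coupled runs (lag 1):

```python
import numpy as np

def tv_upper_bound(taus, k):
    # estimate of E[max(0, tau - k - 1)], an upper bound on ||pi_k - pi||_TV
    return np.mean(np.maximum(0, taus - k - 1))
```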
51. Assessing finite-time bias of MCMC
With L-lag couplings, τ(L) = inf{t ≥ L : Xt = Yt−L}, and
‖πk − π‖TV ≤ E[max(0, ⌈(τ(L) − L − k)/L⌉)].
[Upper bounds on dTV versus iterations, for SSG and PT.]
Biswas, Jacob & Vanetti, Estimating Convergence of Markov chains with L-Lag Couplings, 2019.
52. Discussion
Perfect samplers, which sample i.i.d. from π, would yield the same benefits and more. Is any of this helping create perfect samplers?
If the underlying MCMC "doesn't work", the proposed unbiased estimators will have large cost and/or large variance.
Choice of tuning parameters? Choice of lag? Why couple only two chains?
Lack of bias is useful beyond parallel computation.
So far we have used Markovian couplings: can we do better?
Thank you for listening!
Funding provided by the National Science Foundation, grants DMS-1712872 and DMS-1844695.