Couplings of Markov chains and the Poisson equation
Pierre E. Jacob
Department of Statistics, Harvard University
March 22, 2021
Outline
1 Context
2 Couplings
  General idea
  Donkey walk
  Conditional Bernoulli
  Empirical rates of convergence
3 Poisson equation
  Definition
  Asymptotic variance estimation
Thank you!
First I want to thank these fantastic co-authors whose work will be mentioned in this talk:
Yves Atchadé, Anirban Bhattacharya, Niloy Biswas, Arthur P. Dempster, Randal Douc, Paul Edlefsen, Ruobin Gong, Jeremy Heng, James Johndrow, Nianqiao (Phyllis) Ju, Anthony Lee, John O'Leary, Natesh Pillai, Emilia Pompe, Maxime Rischard, Paul Vanetti, Dootika Vats, Guanyang Wang.
Dr. Arianna Wright Rosenbluth (1927–2020)
From https://www.nytimes.com/2021/02/09/science/arianna-wright-dead.html, by Katie Hafner.
Setting
Target probability distribution π. Markov chain Monte Carlo:
X0 ∼ π0, then Xt | Xt−1 ∼ P(Xt−1, ·) for t = 1, 2, …
Notation:
πt = πt−1P = ∫ πt−1(dxt−1) P(xt−1, ·),   π(h) = ∫ h(x) π(dx).
Convergence of marginals: ‖πt − π‖ → 0.
Central limit theorem:
√t ( t⁻¹ Σ_{s=0}^{t−1} h(Xs) − π(h) ) → N(0, v(P, h)).
How can we choose t? How can we estimate v(P, h)?
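To fix ideas, here is a minimal sketch (my own toy example, not from the talk) of a chain and its ergodic average: a random-walk Metropolis chain targeting N(0, 1), with an arbitrary proposal scale.

```python
import numpy as np

# Minimal sketch (not from the talk): a random-walk Metropolis chain
# targeting pi = N(0, 1). The ergodic average below is the quantity
# whose fluctuations the CLT describes; `step` is an arbitrary choice.
rng = np.random.default_rng(1)

def log_pi(x):
    return -0.5 * x ** 2  # log-density of N(0, 1), up to a constant

def rwm_chain(x0, t, step=2.0):
    xs = np.empty(t + 1)
    xs[0] = x = x0
    for s in range(1, t + 1):
        y = x + step * rng.normal()                      # propose
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y                                        # accept
        xs[s] = x
    return xs

xs = rwm_chain(x0=10.0, t=10_000)
print("ergodic average of h(x) = x:", xs.mean())  # estimates pi(h) = 0
```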
How many iterations are enough?
Charles Geyer: "If you can't get a good answer with one long run, then you can't get a good answer with many short runs either."
An anonymous source: "I still remember fondly (?!) my first Valencia Bayesian Statistics meeting in I think 1991 when Adrian Smith and Andrew Gelman had a bit of a stand-up argument about MCMC implementation with multiple or single chains! It's 30 years since then but many of the issues are still unresolved."
From C. McCartan & K. Imai: "[…] Pegden ran an MCMC algorithm for one trillion steps […]".
In Stan, the default is 2,000 iterations. In NIMBLE, the user must specify that number.
It would be simpler if we could just specify a "tolerance" parameter, or a time limit.
Outline
Reminders on couplings of Markov chains to obtain convergence rates.
These might work "out of the box" (e.g. the donkey walk) or might require some extra care (e.g. conditional Bernoulli).
Couplings are implementable too, and provide useful empirical assessments.
We will discuss connections to another mainstay of Markov chain analysis, the Poisson equation, leading to a new asymptotic variance estimator.
Couplings
A technique to study the convergence of Markov chains.
Construct a joint process (Xt, Yt) such that Yt ∼ π for all t ≥ 0, and marginally both chains evolve according to the same kernel P.
Suppose that there exists a random variable τ such that Xt = Yt for all t ≥ τ. Then
‖πt − π‖TV = ‖L(Xt) − L(Yt)‖TV ≤ P(Xt ≠ Yt) = P(τ > t),
where ‖·‖TV is the total variation distance.
Bru & Yor, Comments on the life and mathematical legacy of Wolfgang Doeblin, 2002.
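As an illustration of the inequality ‖πt − π‖TV ≤ P(τ > t), here is a sketch (a toy example of mine, with an assumed 3-state kernel) that couples two copies of a finite-state chain through a maximal coupling of their next-step distributions; once the chains meet they stay together, and the empirical tail of τ bounds the TV distance when Y0 ∼ π.

```python
import numpy as np

# Sketch (toy example, not from the talk): coupling two copies of a
# 3-state chain via a maximal coupling of their next-step laws, then
# estimating P(tau > t), which bounds ||pi_t - pi||_TV when Y0 ~ pi.
rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])  # assumed toy kernel

def maximal_coupling(p, q):
    """Sample (x, y) with x ~ p, y ~ q, maximizing P(x = y)."""
    overlap = np.minimum(p, q)
    w = overlap.sum()
    if rng.uniform() < w:
        x = rng.choice(len(p), p=overlap / w)
        return x, x
    x = rng.choice(len(p), p=(p - overlap) / (1 - w))
    y = rng.choice(len(q), p=(q - overlap) / (1 - w))
    return x, y

def meeting_time(x, y):
    t = 0
    while x != y:
        x, y = maximal_coupling(P[x], P[y])
        t += 1
    return t

taus = np.array([meeting_time(0, 2) for _ in range(5_000)])
print("estimated bound P(tau > 20):", (taus > 20).mean())
```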
Couplings
Coupling techniques have proved very successful, in some cases giving precise rates of convergence.
Jerrum, Mathematical foundations of the MCMC method, 1998.
Eberle, Reflection couplings and contraction rates for diffusions, PTRF, 2016.
Pillai & Smith, Kac's walk on n-sphere mixes in n log n steps, AoAP, 2017.
Dieuleveut, Durmus & Bach, Bridging the gap between constant step size stochastic gradient descent and Markov chains, AoS, 2020.
Coupling techniques also provide bounds in metrics other than TV:
‖πt − π‖W1 = inf_{γ ∈ Γ(πt, π)} Eγ[d(X, Y)] ≤ E[d(Xt, Yt)].
All of this appears theoretical, since we cannot sample Y0 ∼ π.
Example motivated by Dempster–Shafer inference
Pierre E. Jacob, Ruobin Gong, Paul T. Edlefsen & Arthur P. Dempster, A Gibbs sampler for a class of random convex polytopes, forthcoming discussion paper at JASA.
Consider two categories, and N0 + N1 = N counts, x1, …, xN.
Model: xn = 1(un ≤ θ) for all n, with un ∼ Uniform(0, 1).
The Dempster–Shafer framework asks for
F(u) = {θ ∈ (0, 1) : ∀n, xn = 1(un ≤ θ)}, given F(u) ≠ ∅.
We can work out the exact distribution of F(u), but here we consider a Gibbs sampler which can be generalized to arbitrary numbers of categories.
Example motivated by Dempster–Shafer inference
Denote Ik = {n : xn = k}. Conditionals:
{un : n ∈ I1} | {un : n ∈ I0} ∼ Uniform(0, min_{n∈I0} un),
{un : n ∈ I0} | {un : n ∈ I1} ∼ Uniform(max_{n∈I1} un, 1).
Example with N0 = 2, N1 = 3: [figure omitted].
Donkey walk
We calculate the conditional distributions of Y = max_{n∈I1} un and Z = min_{n∈I0} un, and the Gibbs sampler simplifies to:
Zt = B1(1 − B0)Zt−1 + B0,
where B1 ∼ Beta(N1, 1) and B0 ∼ Beta(1, N0) are independent.
Letac, Donkey walk and Dirichlet distributions, Statistics & Probability Letters, 2002.
Donkey walk
A "common random numbers" coupling,
Zt = B1(1 − B0)Zt−1 + B0,
Z̃t = B1(1 − B0)Z̃t−1 + B0,
leads to
‖πt − π‖W1 ≤ ( N0/(N0 + 1) × N1/(N1 + 1) )^t E[|Z0 − Z̃0|].
By Kantorovich–Rubinstein duality, and considering h : x ↦ ±x, we can obtain a lower bound with the same rate, as was pointed out by Guanyang Wang (Rutgers).
Here we obtain practical guidance on the choice of the number of iterations t to perform; this is not often the case.
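The contraction can be checked numerically; below is a minimal sketch of the common random numbers coupling, with N0, N1, the horizon and the starting points chosen arbitrarily for illustration.

```python
import numpy as np

# Sketch of the common-random-numbers coupling of the donkey walk,
# Z_t = B1 (1 - B0) Z_{t-1} + B0. The contraction of E|Z_t - Z~_t| at
# rate (N0/(N0+1)) * (N1/(N1+1)) per step can be checked empirically.
# N0, N1, t_max and the starting points are illustrative choices.
rng = np.random.default_rng(7)
N0, N1, t_max, reps = 2, 3, 20, 100_000

z = np.zeros(reps)        # chains started at Z_0 = 0
z_tilde = np.ones(reps)   # coupled chains started at Z~_0 = 1
rate = (N0 / (N0 + 1)) * (N1 / (N1 + 1))
for t in range(1, t_max + 1):
    b1 = rng.beta(N1, 1, size=reps)   # shared random numbers
    b0 = rng.beta(1, N0, size=reps)
    z = b1 * (1 - b0) * z + b0
    z_tilde = b1 * (1 - b0) * z_tilde + b0
print("E|Z_t - Z~_t| approx:", np.abs(z - z_tilde).mean())
print("predicted rate^t * |Z_0 - Z~_0|:", rate ** t_max * 1.0)
```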
Example: Conditional Bernoulli
Jeremy Heng, Pierre E. Jacob & Nianqiao Ju, A simple Markov chain for independent Bernoulli variables conditioned on their sum, on arXiv.
Let p = (p1, …, pN) ∈ (0, 1)^N and define wn = pn/(1 − pn), the associated odds.
Let X = (X1, …, XN) ∈ {0, 1}^N such that Xn ∼ Bernoulli(pn), independently.
The conditional distribution of X given Σ_{n=1}^N Xn = S is called "conditional Bernoulli", denoted by CBernoulli(p, S).
Exact sampling costs O(S · N) operations. We assume S ∝ N.
Chen & Liu, Statistical applications of the Poisson-Binomial and conditional Bernoulli distributions, Statistica Sinica, 1997.
Example: Conditional Bernoulli
A Rosenbluth–Hastings transition goes as follows:
independently sample i0 ∈ I0 = {n : xn = 0} and i1 ∈ I1 = {n : xn = 1} uniformly;
construct the proposed state y by the swap i0 ↔ i1;
accept y as the next state with probability min{1, wi0/wi1}.
Chen, Dempster & Liu, Weighted finite population sampling to maximize entropy, Biometrika, 1994.
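As a complement to the description above, here is a minimal sketch of this swap kernel (with arbitrary p and S for illustration; not the authors' code).

```python
import numpy as np

# Sketch of the swap transition for CBernoulli(p, S) described above
# (a Metropolis-style kernel; toy p and S are illustrative choices).
rng = np.random.default_rng(3)
N, S = 10, 4
p = rng.uniform(size=N)
w = p / (1 - p)                      # odds

x = np.zeros(N, dtype=int)
x[:S] = 1                            # any state with sum S works as a start

def swap_step(x):
    i1 = rng.choice(np.flatnonzero(x == 1))   # uniform over ones
    i0 = rng.choice(np.flatnonzero(x == 0))   # uniform over zeros
    if rng.uniform() < min(1.0, w[i0] / w[i1]):
        x[i1], x[i0] = 0, 1                   # accept the swap
    return x

for _ in range(1000):
    x = swap_step(x)
print(x, x.sum())                    # the sum S is preserved
```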
Relevance
Identical success probabilities (pn):
the chain obtained by successive swaps is known as the Bernoulli-Laplace diffusion model;
the chain has been thoroughly studied; if S = N/2, mixing occurs in (N/8) log N iterations (+ cutoff phenomenon).
Diaconis & Shahshahani, Time to reach stationarity in the Bernoulli-Laplace diffusion model, SIAM Journal on Mathematical Analysis, 1987.
Non-identical (pn): arises in various contexts in statistics, and occurred in our research on agent-based models:
Nianqiao Ju, Jeremy Heng & Pierre E. Jacob, Sequential Monte Carlo algorithms for agent-based models of disease transmission, on arXiv.
Assumptions
(Condition on the odds.) The odds (wn) are such that there exist ζ > 0, 0 < l < r < ∞ and η > 0 such that, for all N large enough,
P(|{n ∈ [N] : wn ∉ (l, r)}| ≤ ζN) ≥ 1 − exp(−ηN).
(Condition on S.) There exist 0 < ξ ≤ 1/2 and η′ > 0 such that, for all N large enough,
P(ξN ≤ S) ≥ 1 − exp(−η′N).
We will work under these assumptions, with ζ < ξ.
We also assume S ≤ N/2 without loss of generality.
Convergence rate from couplings
Introduce two chains (x(t)) and (x̃(t)) evolving according to a coupled kernel P̄, with x(0) ∼ π(0) and x̃(0) ∼ π.
Hamming distance: d(x, x̃) = Σ_{n=1}^N 1(xn ≠ x̃n).
Total variation distance:
‖π(t) − π‖TV ≤ E[d(x(t), x̃(t))].
We start from d(0) = d(x(0), x̃(0)) ≤ N.
Contraction:
E[d(t+1) | x(t), x̃(t)] ≤ (1 − cN)d(t)
implies E[d(t)] ≤ (1 − cN)^t N, hence, for any ε ∈ (0, 1),
‖π(t) − π‖TV ≤ ε for all t ≥ log(N/ε) / (− log(1 − cN)).
We want cN to be at least of order N⁻¹.
Convergence rate from couplings
Path coupling argument (Bubley & Dyer, 1997): we can focus on contraction from adjacent states, i.e. d(x, x̃) = 2.
Let x, x̃ ∈ {0, 1}^N be adjacent: they differ at locations a and b. Assume xa = 0, xb = 1, x̃a = 1, x̃b = 0 and wa ≤ wb.
Contraction rate from a maximal coupling strategy:
c(x, x̃) = P(d(x′, x̃′) = 0 | x, x̃)
= [ 1 − wa/wb + Σ_{i1 ∈ I1∩Ĩ1} min(1, wa/wi1) + Σ_{i0 ∈ I0∩Ĩ0} min(1, wi0/wb) ] / ((N − S)S).
Summary of problem and way forward
When pn ∼ Uniform(0, 1), with wa = minn wn and wb = maxn wn, the contraction rate is of order N⁻².
However, by the assumptions, for most pairs of adjacent states wa, wb are of constant order. Starting from these states, chains can meet with probability of order N⁻¹.
Thankfully, chains can move from 'unfavorable' to 'favorable' states quickly enough.
Favorable and unfavorable pairs
We can define ξF→D, ξU→F, ξF→U, ν > 0 and 0 < wlo < whi < ∞ such that, for all N large enough, with probability at least 1 − exp(−νN), the sets defined as
X̄U = {(x, x̃) ∈ X̄adj : wa < wlo and wb > whi},
X̄F = {(x, x̃) ∈ X̄adj : wa ≥ wlo or wb ≤ whi},
X̄D = {(x, x̃) ∈ X² : x = x̃},
satisfy the following statements:
∀(x, x̃) ∈ X̄F, P̄((x, x̃), X̄D) ≥ ξF→D/N,
∀(x, x̃) ∈ X̄U, P̄((x, x̃), X̄F) ≥ ξU→F/N,
∀(x, x̃) ∈ X̄F, P̄((x, x̃), X̄U) ≤ ξF→U/N.
A three-state process specified by pairs of chains
Considering adjacent or identical states, define
B(x, x̃) = 1 if (x, x̃) ∈ X̄U (unfavorable),
           2 if (x, x̃) ∈ X̄F (favorable),
           3 if (x, x̃) ∈ X̄D (x = x̃).
The process B(x(t), x̃(t)) can be coupled with a Markov chain B̃(t) with transition matrix
( 1 − ξU→F/N    ξU→F/N                   0
  ξF→U/N        1 − (ξF→U + ξF→D)/N      ξF→D/N
  0             0                        1 ),
which converges to the absorbing state 3 in O(N) steps.
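To illustrate the O(N) absorption claim, here is a sketch simulating the dominating chain B̃(t); the ξ constants are arbitrary illustrative values, not those of the paper.

```python
import numpy as np

# Sketch: simulate the three-state chain with the transition matrix
# above and check that the absorption time to state 3 grows linearly
# in N. The xi_* constants are arbitrary illustrative values.
rng = np.random.default_rng(5)
xi_UF, xi_FU, xi_FD = 1.0, 0.5, 1.0

def absorption_time(N):
    P = np.array([
        [1 - xi_UF / N, xi_UF / N,               0.0],
        [xi_FU / N,     1 - (xi_FU + xi_FD) / N, xi_FD / N],
        [0.0,           0.0,                     1.0]])
    state, t = 0, 0               # start in state 1 (index 0, unfavorable)
    while state != 2:             # state 3 is index 2
        state = rng.choice(3, p=P[state])
        t += 1
    return t

for N in (50, 100, 200):
    times = [absorption_time(N) for _ in range(500)]
    print(N, np.mean(times))      # roughly proportional to N
```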
Chasing chain and main result
We construct B̃(t) ∈ {1, 2, 3} such that
B̃(t) converges to 3 in O(N) steps,
B̃(t) ≤ B(x(t), x̃(t)) at each time t,
thus {B̃(t) = 3} ⇒ {x(t) = x̃(t)}.
There exist κ > 0, ν > 0 and N0 ∈ N, independent of N, such that for any ε ∈ (0, 1) and all N ≥ N0, with probability at least 1 − exp(−νN),
‖L(x(t)) − CBernoulli(p, S)‖TV ≤ ε for all t ≥ κN log(N/ε).
This Markov chain provides samples at a cheaper cost than exact sampling when N is large: N log N versus N².
The constants in our bounds are not helpful.
Upper bounds using couplings without stationarity
Generate (Xt, Yt) such that
Xt and Yt follow πt,
Xt = Yt−L for t ≥ τ.
Then
‖πt − π‖TV ≤ E[max(0, ⌈(τ − L − t)/L⌉)].
Jacob, O'Leary & Atchadé, Unbiased MCMC with couplings, JRSS B (with discussion), 2020, and Biswas, Jacob & Vanetti, Estimating Convergence of Markov chains with L-Lag Couplings, NeurIPS, 2019.
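In practice, meeting times from repeated coupled runs are plugged into this expectation. A sketch, with synthetic meeting times standing in for real ones (the geometric tail is purely illustrative):

```python
import numpy as np

# Sketch: turn meeting times of L-lag coupled chains into the TV bound
# E[max(0, ceil((tau - L - t) / L))]. Here `taus` stands in for meeting
# times collected from coupled runs (synthetic values for illustration).
rng = np.random.default_rng(11)
L = 5
taus = L + rng.geometric(p=0.05, size=10_000)  # placeholder meeting times

def tv_upper_bound(t, taus, L):
    return np.maximum(0, np.ceil((taus - L - t) / L)).mean()

for t in (0, 25, 50, 100):
    print(t, tv_upper_bound(t, taus, L))  # may exceed 1 for small t
```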
Improved bounds
Define Jt,L = max(0, ⌈(τ − L − t)/L⌉).
Previous bounds: ‖πt − π‖TV ≤ E[Jt,L].
Improved bounds:
‖πt − π‖TV ≤ Σ_{j≥1} min{P(Jt,L ≥ j), P(Jt,L ≤ j)}.
Equation (2.10) in Craiu & Meng, Double Happiness: Enhancing the Coupled Gains of L-lag Coupling via Control Variates, Statistica Sinica, 2021.
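The improved bound can be computed from the same meeting times; here is a sketch implementing the displayed formula verbatim (with the same kind of synthetic τ's as above).

```python
import numpy as np

# Sketch: the improved bound, computed verbatim from the formula above,
# using synthetic meeting times for illustration.
rng = np.random.default_rng(17)
L = 5
taus = L + rng.geometric(p=0.05, size=10_000)  # placeholder meeting times

def improved_bound(t, taus, L):
    J = np.maximum(0, np.ceil((taus - L - t) / L))
    bound, j = 0.0, 1
    while (J >= j).any():                      # terms vanish beyond max(J)
        bound += min((J >= j).mean(), (J <= j).mean())
        j += 1
    return bound

print(improved_bound(25, taus, L))
```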
Couplings of MCMC algorithms
Can we generate a chain (Xt, Yt) such that Xt ∼ πt, Yt ∼ πt, and for all t ≥ τ, Xt = Yt−L?
On the Rosenbluth–Teller–Metropolis–Hastings algorithm:
Valen Johnson, Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths, JASA, 1996.
John O'Leary, Guanyang Wang & Pierre E. Jacob, Maximal couplings of the Metropolis-Hastings algorithm, oral presentation at AISTATS 2021.
John O'Leary & Guanyang Wang, Transition kernel couplings of the Metropolis-Hastings algorithm, on arXiv.
John O'Leary, Couplings of the Random-Walk Metropolis algorithm, on arXiv.
Example: large-scale Bayesian regression
Niloy Biswas, Anirban Bhattacharya, Pierre E. Jacob & James Johndrow, Coupled Markov chain Monte Carlo for high-dimensional regression with Half-t(ν) priors, on arXiv.
Linear regression setting, n rows, p columns, with p ≫ n.
Y ∼ N(Xβ, σ²In),
σ² ∼ InverseGamma(a0/2, b0/2),
ξ^(−1/2) ∼ Cauchy+,
for j = 1, …, p: βj ∼ N(0, σ²/(ξηj)), ηj^(−1/2) ∼ t(ν)+.
Global precision ξ, local precisions ηj for j = 1, …, p.
Example: large-scale Bayesian regression
Gibbs sampler:
For j = 1, …, p, ηj given β, ξ, σ² can be sampled exactly or by slice sampling.
Given η, we can sample β, ξ, σ²:
ξ given η using an MH step,
σ² given η, ξ from an InverseGamma,
β given η, ξ, σ² from a p-dimensional Normal.
The algorithm has O(n²p) cost per iteration.
The coupling strategy involves maximal couplings and common random numbers, combined in a bespoke way for each update.
Genome-wide association study with n = 2,266 and p = 98,385.
Outcome: average number of days for silk emergence in maize.
Covariates: single nucleotide polymorphisms of maize.
Example: large-scale Bayesian regression
Meeting times of lagged chains, with L = 750.
[Figure: density of the meeting times τ, ranging up to about 600 iterations.]
Example: large-scale Bayesian regression
Meeting times can be turned into upper bounds on the TV distance to stationarity.
[Figure: upper bound on the total variation distance to stationarity as a function of t, for t up to 1,000.]
The equation
Write Ph(x) = ∫ P(x, dx′)h(x′) = E[h(X1) | X0 = x].
A function h̃ in L1(π) is said to be a solution of the Poisson equation associated with h and P if
h̃ − Ph̃ = h − π(h).
For brevity we say that h̃ is fishy.
If Σ_{t≥0} ‖P^t{h − π(h)}‖L1(π) < ∞, then the function
x ↦ Σ_{t=0}^∞ P^t{h − π(h)}(x)
is fishy.
Marie Duflo, Opérateurs potentiels des chaînes et des processus de Markov irréductibles, 1970.
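To see why this series solves the equation (a standard telescoping argument, implicit on the slide): applying I − P term by term gives

```latex
(I - P)\,\tilde h
  = \sum_{t=0}^{\infty} P^{t}\{h - \pi(h)\}
  - \sum_{t=0}^{\infty} P^{t+1}\{h - \pi(h)\}
  = P^{0}\{h - \pi(h)\} = h - \pi(h).
```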
Central limit theorem
Aiming for a CLT for Markov chain ergodic averages, write
Σ_{s=0}^{t−1} {h(Xs) − π(h)} = Σ_{s=1}^{t} {h̃(Xs) − Ph̃(Xs−1)} + h̃(X0) − h̃(Xt).
Spot the martingale.
Then apply the central limit theorem for martingale difference sequences, leading to the asymptotic variance
v(P, h) = Eπ[{h̃(X1) − Ph̃(X0)}²].
Chapter 21 in Douc, Moulines, Priouret & Soulier, Markov chains, 2018.
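The rewriting uses the Poisson equation h − π(h) = h̃ − Ph̃ and a re-indexing of the sum:

```latex
\sum_{s=0}^{t-1}\{h(X_s) - \pi(h)\}
  = \sum_{s=0}^{t-1}\{\tilde h(X_s) - P\tilde h(X_s)\}
  = \sum_{s=1}^{t}\{\tilde h(X_s) - P\tilde h(X_{s-1})\}
    + \tilde h(X_0) - \tilde h(X_t).
```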
Unbiased estimation of fishy functions
Choose an arbitrary y ∈ X. The function
x ↦ h̃(x) = Σ_{t=0}^∞ {P^t h(x) − P^t h(y)}
is fishy, and it lends itself to estimation with coupled Markov chains.
If we set X0 = x, Y0 = y, and generate Xt, Yt such that
Xt | Xt−1 ∼ P(Xt−1, ·), Yt | Yt−1 ∼ P(Yt−1, ·), and Xt = Yt for all t ≥ τ,
then
H̃(x) = Σ_{t=0}^{τ−1} {h(Xt) − h(Yt)}
has expectation equal to h̃(x).
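A minimal sketch of this estimator on a toy 3-state chain (the kernel and test function are assumptions for illustration, not from the talk):

```python
import numpy as np

# Sketch: unbiased estimator H~(x) of the fishy function, using a pair
# of chains started at x and y, coupled so that they meet and then stay
# together. Toy kernel P and test function h are illustrative choices.
rng = np.random.default_rng(13)
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
h = np.array([0.0, 1.0, 4.0])        # arbitrary test function

def coupled_step(x, y):
    """Maximal coupling of P[x] and P[y]; chains stay equal once met."""
    o = np.minimum(P[x], P[y])
    w = o.sum()
    if rng.uniform() < w:
        z = rng.choice(3, p=o / w)
        return z, z
    return (rng.choice(3, p=(P[x] - o) / (1 - w)),
            rng.choice(3, p=(P[y] - o) / (1 - w)))

def H_tilde(x, y):
    total, (cx, cy) = 0.0, (x, y)
    while cx != cy:                   # sums over t = 0, ..., tau - 1
        total += h[cx] - h[cy]
        cx, cy = coupled_step(cx, cy)
    return total

est = np.mean([H_tilde(0, 2) for _ in range(20_000)])
print("estimate of h~(0) with y = 2:", est)
```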
Unbiased estimation of fishy functions: illustration
Target distribution: π = ½ N(−2, 1) + ½ N(5, (1/2)²).
[Figure: density of π over (−10, 10).]
Test function: h : x ↦ x, with π(h) = 1.5.
[Figure: h(x) = x over (−10, 10).]
Unbiased estimation of fishy functions: illustration
P: Rosenbluth–Hastings with random walk proposals N(x, 2²).
Fishy function, choosing y = 0.
[Figure: estimate of the fishy function h̃(x) for x ∈ (−10, 10).]
Unbiased estimation of the asymptotic variance
We start from
v(P, h) = 2π({h − π(h)}h̃) − π(h²) + π(h)².
We can obtain unbiased signed measure approximations π̂ of π, and we can estimate h̃ unbiasedly, point-wise.
Estimating v(P, h) is an exercise in "nested Monte Carlo".
Emilia Pompe, Maxime Rischard, Pierre E. Jacob & Natesh Pillai, Estimation of nested expectations with couplings (?), forthcoming.
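This identity follows from v(P, h) = Eπ[{h̃(X1) − Ph̃(X0)}²] and the Poisson equation; writing h̄ = h − π(h), a derivation (standard, not displayed on the slides) is:

```latex
v(P,h) = \mathbb{E}_\pi\big[\{\tilde h(X_1) - P\tilde h(X_0)\}^2\big]
       = \pi(\tilde h^2) - \pi\big(\{P\tilde h\}^2\big)
\quad\text{since } \mathbb{E}_\pi\big[\tilde h(X_1)\,P\tilde h(X_0)\big]
       = \pi\big(\{P\tilde h\}^2\big);
\\
\text{then, using } \tilde h = \bar h + P\tilde h,\quad
\pi(\tilde h^2) - \pi\big(\{P\tilde h\}^2\big)
  = \pi(\bar h^2) + 2\pi(\bar h\,P\tilde h)
  = 2\pi(\bar h\,\tilde h) - \pi(\bar h^2)
  = 2\pi(\{h - \pi(h)\}\tilde h) - \pi(h^2) + \pi(h)^2 .
```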
Unbiased estimation of the asymptotic variance
1 Obtain π̂(1) and π̂(2), two independent approximations of π.
2 Write π̂(1)(·) = Σ_{n=1}^N ωn δZn. For r = 1, …, R:
  sample ℓ(r) ∼ (ξ1, …, ξN),
  generate H̃(r) with expectation h̃(Zℓ(r)).
3 Estimate
  2π({h − π(h)}h̃) with 2R⁻¹ Σ_{r=1}^R (ωℓ(r)/ξℓ(r)) (h(Zℓ(r)) − π̂(2)(h)) H̃(r),
  −π(h²) with −{½ π̂(1)(h²) + ½ π̂(2)(h²)},
  +π(h)² with +π̂(1)(h) × π̂(2)(h).
Randal Douc, Pierre E. Jacob, Anthony Lee & Dootika Vats, Estimation of fishy functions with couplings (?), forthcoming.
Unbiased estimation of the asymptotic variance
Numerical results for various choices of R (number of sub-sampled atoms in each run), y = 0, 10⁴ independent repeats for the proposed method.
Cost is measured in number of Markov transitions, and inefficiency is variance × cost.
We compare with asymptotic variance estimators implemented in various R packages, based on 10³ runs of length 5 × 10⁵ with a burn-in of 10³ iterations.
Unbiased estimation of the asymptotic variance

method               v̂(P, h)   σ̂     mean cost   inefficiency
proposed, R = 1      3166       59     5262        1.8e+11
proposed, R = 5      3086       29     5777        4.8e+10
proposed, R = 10     3076       22     6416        3.1e+10
proposed, R = 20     3046       18     7695        2.6e+10
batchmeans::bm       2539       3      500000      5.1e+09
coda::spectrum0      3149       19     500000      1.9e+11
coda::spectrum0ar    3052       3      500000      3.5e+09
mcmc::initseq        3106       6      500000      2.0e+10
mcmcse               3291       13     500000      8.7e+10
Discussion
Some basic questions about MCMC are still largely open.
Theoretical analysis of MCMC progresses rapidly, but still rarely translates into practical guidelines.
Couplings are powerful for theoretical analysis but also (often?) implementable.
One way or another, we will need a way of saving and parallelizing computation. There's work to do!