Recent developments on
unbiased Markov chain Monte Carlo
Pierre E. Jacob (Harvard)
joint work with
John O’Leary (Harvard), Yves Atchadé (Boston University)
Machine Learning Seminar
Duke University
December 5, 2018
Pierre E. Jacob Unbiased MCMC
Outline
1 Setting: MCMC and bias
2 Unbiased estimators from Markov kernels
3 Illustrations on standard MCMC settings
4 Beyond the standard MCMC setting
Pierre E. Jacob Unbiased MCMC
Outline
1 Setting: MCMC and bias
2 Unbiased estimators from Markov kernels
3 Illustrations on standard MCMC settings
4 Beyond the standard MCMC setting
Pierre E. Jacob Unbiased MCMC
Setting
Continuous or discrete space of dimension d.
Target probability distribution π,
with probability density function x → π(x).
Goal: approximate π, e.g.
for a class of functions h, approximate Eπ[h(X)],
defined as ∫ h(x)π(x) dx and also written π(h).
Pierre E. Jacob Unbiased MCMC
Markov chain Monte Carlo
Initially, X0 ∼ π0, then Xt|Xt−1 ∼ P(Xt−1, ·) for t = 1, . . . , T.
Estimator:
    (1/(T − b)) Σ_{t=b+1}^{T} h(Xt),
where b iterations are discarded as burn-in.
Might converge to Eπ[h(X)] as T → ∞ by the ergodic theorem.
An MCMC algorithm ≡ a description of how to sample from
π0 and from a Markov kernel P leaving π invariant.
Pierre E. Jacob Unbiased MCMC
Bias and variance
The bias goes to zero as T → ∞:
    E[(1/(T − b)) Σ_{t=b+1}^{T} h(Xt)] − Eπ[h(X)] → 0,
but the bias is not zero for any fixed b, T, since π0 ≠ π,
i.e. because the chain does not start at stationarity.
Therefore, obtaining many such estimators in parallel, for fixed
T, b, and averaging them, would not provide a consistent
estimator of Eπ[h(X)].
Pierre E. Jacob Unbiased MCMC
Preview
In the proposed approach, each run will involve two coupled
chains (Xt) and (Yt),
until some event occurs, which will happen after a random
number of iterations denoted by τ.
At that point an estimator later denoted by Hk:m will be
returned,
with the guarantee that E[Hk:m] = Eπ[h(X)] (unbiasedness).
Removing the bias will come at a price in terms of computing
cost and variance.
Pierre E. Jacob Unbiased MCMC
Why do we care about unbiased estimators?
In classical point estimation, unbiasedness is not crucial.
Larry Wasserman in “All of Statistics” (2003) writes:
Unbiasedness used to receive much attention but these
days is considered less important.
On the other hand, Jeff Rosenthal in “Parallel computing and
Monte Carlo algorithms” (2000) writes
When running parallel Monte Carlo with many
computers, it is more important to start with an
unbiased (or low-bias) estimate than with a
low-variance estimate.
Pierre E. Jacob Unbiased MCMC
Why do we care about unbiased estimators?
Pierre E. Jacob Unbiased MCMC
Parallel computing
Each among R processors generates unbiased estimators.
The number of estimators completed by time t is random.
There are different ways of aggregating estimators at time t,
either discarding on-going calculations or not.
We might want central limit theorems
as R → ∞ for fixed t,
as t → ∞ for fixed R,
as t and R both go to infinity.
This has been thoroughly studied already, e.g.
Glynn & Heidelberger, Analysis of parallel replicated
simulations under a completion time constraint, 1991.
Pierre E. Jacob Unbiased MCMC
Outline
1 Setting: MCMC and bias
2 Unbiased estimators from Markov kernels
3 Illustrations on standard MCMC settings
4 Beyond the standard MCMC setting
Pierre E. Jacob Unbiased MCMC
Coupled chains
Generate two Markov chains (Xt) and (Yt) as follows,
sample X0 and Y0 from π0 (independently or not),
sample X1|X0 ∼ P(X0, ·),
for t ≥ 1, sample (Xt+1, Yt)|(Xt, Yt−1) ∼ ¯P ((Xt, Yt−1), ·),
¯P is such that Xt+1|Xt ∼ P(Xt, ·) and Yt|Yt−1 ∼ P(Yt−1, ·),
so each chain marginally evolves according to P.
Xt and Yt have the same distribution for all t ≥ 0.
¯P is also such that ∃τ such that Xτ = Yτ−1, and Xt = Yt−1 for
all t ≥ τ (the chains are faithful).
Pierre E. Jacob Unbiased MCMC
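As an aside, a minimal Python sketch of how such a pair of chains could be generated up to the meeting time; `sample_pi0`, `kernel_P` and `coupled_kernel_Pbar` are placeholder names for user-supplied samplers of π0, P and ¯P, not code from the referenced papers.

```python
import numpy as np

def sample_coupled_chains(rng, sample_pi0, kernel_P, coupled_kernel_Pbar, max_iter=10**6):
    # X_0, Y_0 ~ pi_0, then X_1 | X_0 ~ P(X_0, .),
    # then (X_{t+1}, Y_t) | (X_t, Y_{t-1}) ~ Pbar((X_t, Y_{t-1}), .) until X_t = Y_{t-1}.
    X = [sample_pi0(rng)]
    Y = [sample_pi0(rng)]
    X.append(kernel_P(rng, X[0]))
    tau = None
    for t in range(1, max_iter):
        x_next, y_next = coupled_kernel_Pbar(rng, X[t], Y[t - 1])
        X.append(x_next)
        Y.append(y_next)
        if np.all(x_next == y_next):
            tau = t + 1          # smallest t such that X_t = Y_{t-1}
            break
    return X, Y, tau
```

If later estimators require m ≥ τ iterations, the X chain can simply be extended with P alone, since by faithfulness the correction terms vanish after the meeting.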
Debiasing idea
Marginal convergence of each chain:
Eπ[h(X)] = lim_{t→∞} E[h(Xt)].
Limit as a telescopic sum, for all k ≥ 0,
    lim_{t→∞} E[h(Xt)] = E[h(Xk)] + Σ_{t=k+1}^{∞} E[h(Xt) − h(Xt−1)].
Since for all t ≥ 0, Xt and Yt have the same distribution,
    Eπ[h(X)] = E[h(Xk)] + Σ_{t=k+1}^{∞} E[h(Xt) − h(Yt−1)].
Pierre E. Jacob Unbiased MCMC
Debiasing idea
If we can swap expectation and limit,
Eπ[h(X)] = E[h(Xk) + Σ_{t=k+1}^{∞} (h(Xt) − h(Yt−1))],
the random variable
    h(Xk) + Σ_{t=k+1}^{∞} (h(Xt) − h(Yt−1))
is an unbiased estimator of Eπ[h(X)].
We can compute the infinite sum since Xt = Yt−1 for all t ≥ τ.
Pierre E. Jacob Unbiased MCMC
Unbiased estimators
Unbiased estimator is given by
Hk(X, Y) = h(Xk) + Σ_{t=k+1}^{τ−1} (h(Xt) − h(Yt−1)),
with the convention Σ_{t=k+1}^{τ−1}{·} = 0 if τ − 1 < k + 1.
Note: h(Xk) alone would be biased; the other terms correct for
the bias.
Cost: τ − 1 calls to ¯P and 1 + max(0, k − τ) calls to P.
Glynn & Rhee, Exact estimation for Markov chain equilibrium
expectations, 2014.
Pierre E. Jacob Unbiased MCMC
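As an illustration (not code from the referenced papers), a minimal Python sketch of Hk(X, Y), assuming the coupled trajectories and the meeting time have already been stored in lists X, Y and an integer tau:

```python
def H_k(h, X, Y, tau, k):
    # H_k(X, Y) = h(X_k) + sum over t = k+1, ..., tau-1 of (h(X_t) - h(Y_{t-1}));
    # the sum is empty whenever tau - 1 < k + 1.
    estimate = h(X[k])
    for t in range(k + 1, tau):
        estimate += h(X[t]) - h(Y[t - 1])
    return estimate
```

Here X must contain at least max(k, τ − 1) + 1 states and Y at least τ − 1 states.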
Conditions
1 Marginal chain converges:
E[h(Xt)] → Eπ[h(X)],
and h(Xt) has (2 + η)-finite moments for all t.
2 Meeting time τ has geometric tails:
∃C < +∞, ∃δ ∈ (0, 1), ∀t ≥ 0: P(τ > t) ≤ C δ^t.
3 Chains stay together: Xt = Yt−1 for all t ≥ τ.
Condition 2 itself implied by e.g. geometric drift condition.
Under these conditions, Hk(X, Y ) is unbiased, has finite
expected cost and finite variance, for all k.
Pierre E. Jacob Unbiased MCMC
Conditions: update
Middleton, Deligiannidis, Doucet, J., 2018.
1 Marginal chain converges:
E[h(Xt)] → Eπ[h(X)],
and h(Xt) has (2 + η)-finite moments for all t.
2 Meeting time τ has polynomial tails:
∃C < +∞, ∃δ > 2(2η^{−1} + 1), ∀t ≥ 0: P(τ > t) ≤ C t^{−δ}.
3 Chains stay together: Xt = Yt−1 for all t ≥ τ.
Condition 2 itself implied by e.g. polynomial drift condition.
Under these conditions, Hk(X, Y ) is unbiased, has finite
expected cost and finite variance, for all k.
Pierre E. Jacob Unbiased MCMC
Time-averaged estimators
Since Hk(X, Y ) is unbiased for all k, the “time-averaged”
estimator, for all m ≥ k,
Hk:m(X, Y) = (1/(m − k + 1)) Σ_{ℓ=k}^{m} Hℓ(X, Y),
is also unbiased, and can be written
    (1/(m − k + 1)) Σ_{t=k}^{m} h(Xt) + Σ_{t=k+1}^{τ−1} min(1, (t − k)/(m − k + 1)) (h(Xt) − h(Yt−1)),
i.e. standard MCMC average + bias correction term.
As k → ∞ and m ≥ k, Hk:m(X, Y ) becomes MCMC average
with increasing probability; thus variance should be
comparable?
Pierre E. Jacob Unbiased MCMC
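In the same spirit, a hedged sketch of the time-averaged estimator Hk:m, again assuming stored trajectories X, Y (of length at least max(m, τ − 1) + 1 and τ − 1 respectively) and the meeting time tau; this only transcribes the formula above and is not the authors' implementation:

```python
def H_km(h, X, Y, tau, k, m):
    # standard MCMC average over t = k, ..., m
    mcmc_average = sum(h(X[t]) for t in range(k, m + 1)) / (m - k + 1)
    # bias correction term, empty whenever tau - 1 < k + 1
    bias_correction = 0.0
    for t in range(k + 1, tau):
        weight = min(1.0, (t - k) / (m - k + 1))
        bias_correction += weight * (h(X[t]) - h(Y[t - 1]))
    return mcmc_average + bias_correction
```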
Efficiency
Writing Hk:m(X, Y ) = MCMCk:m + BCk:m,
then, denoting the MSE of MCMCk:m by MSEk:m,
V[Hk:m(X, Y)] ≤ MSEk:m + 2 √(MSEk:m E[BC²k:m]) + E[BC²k:m].
Under geometric drift condition and regularity assumptions on
h, for some δ < 1, C < ∞,
E[BC²k:m] ≤ C δ^k / (m − k + 1)²,
and δ is directly related to tails of the meeting time.
Similarly under polynomial drift, we obtain a term of the form
k^{−δ} in the numerator instead of δ^k.
Pierre E. Jacob Unbiased MCMC
Signed measure estimator
Replacing function evaluations by delta masses leads to
ˆπ(·) = (1/(m − k + 1)) Σ_{t=k}^{m} δ_{Xt}(·)
        + Σ_{t=k+1}^{τ−1} min(1, (t − k)/(m − k + 1)) (δ_{Xt}(·) − δ_{Yt−1}(·)),
which is of the form
    ˆπ(·) = Σ_{s=1}^{N} ωs δ_{Zs}(·),
where Σ_{s=1}^{N} ωs = 1 but some ωs might be negative.
Pierre E. Jacob Unbiased MCMC
Designing coupled chains
To implement the proposed unbiased estimators,
we need to sample from a Markov kernel ¯P,
such that, when (Xt+1, Yt) is sampled from ¯P ((Xt, Yt−1), ·),
marginally Xt+1|Xt ∼ P(Xt, ·), and Yt|Yt−1 ∼ P(Yt−1, ·),
it is possible that Xt+1 = Yt exactly for some t ≥ 0,
if Xt = Yt−1, then Xt+1 = Yt almost surely.
Pierre E. Jacob Unbiased MCMC
Couplings of MCMC algorithms
Propp & Wilson, Exact sampling with coupled Markov chains
and applications to statistical mechanics, 1996.
Johnson, Studying convergence of Markov chain Monte Carlo
algorithms using coupled sample paths, 1996.
** Neal, Circularly-coupled Markov chain sampling, 1999.
Pinto & Neal, Improving Markov chain Monte Carlo estimators
by coupling to an approximating chain, 2001.
** Glynn & Rhee, Exact estimation for Markov chain
equilibrium expectations, 2014.
Pierre E. Jacob Unbiased MCMC
Couplings
(X, Y ) follows a coupling of p and q if X ∼ p and Y ∼ q,
i.e. a coupling of p and q is a joint distribution such that
first marginal is p, second marginal is q.
There are infinitely many couplings of p and q.
If X ∼ p and Y ∼ q are independent,
(X, Y ) follows an independent coupling of p and q.
A maximal coupling maximizes P(X = Y ),
over all couplings of p and q.
Pierre E. Jacob Unbiased MCMC
Independent coupling of Gamma and Normal
Pierre E. Jacob Unbiased MCMC
Maximal coupling of Gamma and Normal
Pierre E. Jacob Unbiased MCMC
Maximal coupling: algorithm
Input: p and q.
Output: pairs (X, Y ) from max coupling of p and q.
1 Sample X ∼ p and W ∼ U([0, 1]).
If W ≤ q(X)/p(X), output (X, X).
2 Otherwise, sample Y ∼ q and W ∼ U([0, 1])
until W > p(Y )/q(Y ), and output (X, Y ).
e.g. Thorisson, Coupling, stationarity, and regeneration, 2000,
Chapter 1, Section 4.5.
Pierre E. Jacob Unbiased MCMC
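A minimal Python sketch of this rejection scheme, working on the log scale for numerical stability; the function names and the Normal example are illustrative choices, not part of the references.

```python
import numpy as np
from scipy.stats import norm

def maximal_coupling(rng, sample_p, logpdf_p, sample_q, logpdf_q):
    X = sample_p(rng)
    # step 1: accept X as a common value with probability min(1, q(X)/p(X))
    if np.log(rng.uniform()) + logpdf_p(X) <= logpdf_q(X):
        return X, X
    # step 2: otherwise sample Y from the residual part of q
    while True:
        Y = sample_q(rng)
        if np.log(rng.uniform()) + logpdf_q(Y) > logpdf_p(Y):
            return X, Y

# example: maximal coupling of N(0, 1) and N(1, 1)
rng = np.random.default_rng(2018)
x, y = maximal_coupling(rng,
                        lambda r: r.normal(0.0, 1.0), lambda v: norm.logpdf(v, 0.0, 1.0),
                        lambda r: r.normal(1.0, 1.0), lambda v: norm.logpdf(v, 1.0, 1.0))
```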
Metropolis–Hastings (kernel P)
At each iteration t, Markov chain at state Xt,
1 propose X′ ∼ k(Xt, ·), e.g. X′ ∼ N(Xt, V),
2 sample U ∼ U([0, 1]),
3 if
    U ≤ min(1, π(X′)k(X′, Xt) / (π(Xt)k(Xt, X′))),
set Xt+1 = X′, otherwise set Xt+1 = Xt.
How to construct two MH chains at states Xt and Yt
such that {Xt+1 = Yt+1} can happen?
Pierre E. Jacob Unbiased MCMC
Coupling of Metropolis–Hastings (kernel ¯P)
At each iteration t, two Markov chains at states Xt, Yt,
1 propose (X′, Y′) from max coupling of k(Xt, ·) and k(Yt, ·),
2 sample U ∼ U([0, 1]),
3 if
    U ≤ min(1, π(X′)k(X′, Xt) / (π(Xt)k(Xt, X′))),
set Xt+1 = X′, otherwise set Xt+1 = Xt,
if
    U ≤ min(1, π(Y′)k(Y′, Yt) / (π(Yt)k(Yt, Y′))),
set Yt+1 = Y′, otherwise set Yt+1 = Yt.
Pierre E. Jacob Unbiased MCMC
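A sketch of one transition of such a coupled kernel ¯P, here for a symmetric Gaussian random-walk proposal (so the k terms cancel in the acceptance ratios); `log_target` stands for log π up to an additive constant, and the snippet is illustrative rather than the papers' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def coupled_rwmh_step(rng, log_target, x, y, cov):
    # propose (X', Y') from a maximal coupling of N(x, cov) and N(y, cov)
    xp = rng.multivariate_normal(x, cov)
    if np.log(rng.uniform()) + mvn.logpdf(xp, mean=x, cov=cov) <= mvn.logpdf(xp, mean=y, cov=cov):
        yp = xp
    else:
        while True:
            yp = rng.multivariate_normal(y, cov)
            if np.log(rng.uniform()) + mvn.logpdf(yp, mean=y, cov=cov) > mvn.logpdf(yp, mean=x, cov=cov):
                break
    # common uniform for both accept/reject decisions
    logu = np.log(rng.uniform())
    x_new = xp if logu <= log_target(xp) - log_target(x) else x
    y_new = yp if logu <= log_target(yp) - log_target(y) else y
    return x_new, y_new
```

Once the two chains have met, the shared proposal and the shared uniform keep them together, which is the faithfulness property required earlier.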
Coupling of Gibbs
Gibbs sampler: update component i of the chain,
leaving π(dxi|x1, . . . , xi−1, xi+1, . . . , xd) invariant.
For instance, we can propose X′ ∼ k(X^i_t, ·) to replace X^i_t,
and accept or not with a Metropolis–Hastings step.
These proposals can be maximally coupled across two chains,
at each component update.
The chains meet when all components have met.
Pierre E. Jacob Unbiased MCMC
Hamiltonian Monte Carlo
Introduce potential energy U(q) = − log π(q),
and total energy E(q, p) = U(q) + ½|p|².
Hamiltonian dynamics for (q(s), p(s)), where s ≥ 0:
    dq(s)/ds = ∇p E(q(s), p(s)) = p(s),
    dp(s)/ds = −∇q E(q(s), p(s)) = −∇U(q(s)).
Ideal algorithm: at step t, given position q(0) = Xt,
sample p(0) ∼ N(0, I),
solve dynamics over time length S to get (q(S), p(S)),
set Xt+1 = q(S).
then the Markov chain (Xt) would converge to π.
Pierre E. Jacob Unbiased MCMC
Hamiltonian Monte Carlo
Solving Hamiltonian dynamics exactly is not feasible,
but many discretizations have been proposed.
Starting from position q0, sample p0 ∼ N(0, I),
and for ℓ = 0, . . . , L − 1, compute
    p_{ℓ+1/2} = p_ℓ − (ε/2) ∇U(q_ℓ),
    q_{ℓ+1} = q_ℓ + ε p_{ℓ+1/2},
    p_{ℓ+1} = p_{ℓ+1/2} − (ε/2) ∇U(q_{ℓ+1}),
accept or not qL as the new state of the chain,
according to a Metropolis–Hastings ratio
min(1, exp (E(q0, p0) − E(qL, pL))).
Pierre E. Jacob Unbiased MCMC
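A compact sketch of the leapfrog scheme and the resulting HMC transition, for a vector-valued state x; `U` and `grad_U` denote the potential and its gradient, both assumed supplied by the user.

```python
import numpy as np

def leapfrog(grad_U, q0, p0, eps, L):
    # L leapfrog steps; consecutive half momentum steps are merged into full steps
    q, p = np.array(q0, dtype=float), p0 - 0.5 * eps * grad_U(q0)
    for step in range(L):
        q = q + eps * p
        if step < L - 1:
            p = p - eps * grad_U(q)
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def hmc_step(rng, U, grad_U, x, eps, L):
    p0 = rng.normal(size=np.shape(x))
    qL, pL = leapfrog(grad_U, x, p0, eps, L)
    # Metropolis-Hastings correction based on the change in total energy
    if np.log(rng.uniform()) < (U(x) + 0.5 * p0 @ p0) - (U(qL) + 0.5 * pL @ pL):
        return qL
    return x
```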
Coupling of HMC
Common random numbers can make two HMC chains contract,
under assumptions on the target such as strong log-concavity,
i.e. the potential energy is strongly convex.
Using mixtures of HMC and RWMH kernels,
two chains can then exactly meet.
Heng & J., Unbiased HMC with couplings, 2018.
Mangoubi & Smith, Rapid mixing of HMC on strongly log-concave
distributions, 2017.
In non-convex cases, see Bou-Rabee, Eberle & Zimmer, Coupling
and Convergence for Hamiltonian Monte Carlo, 2018.
Pierre E. Jacob Unbiased MCMC
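A sketch of the common-random-numbers coupling, reusing the `leapfrog` function from the previous snippet: the two chains share the initial momentum and the acceptance uniform, which makes them contract under strong log-concavity; exact meetings then come from occasionally mixing in the coupled random-walk MH kernel. Illustrative only, not the paper's code.

```python
import numpy as np

def coupled_hmc_step(rng, U, grad_U, x, y, eps, L):
    p0 = rng.normal(size=np.shape(x))      # shared momentum
    logu = np.log(rng.uniform())           # shared acceptance uniform
    new_states = []
    for q0 in (x, y):
        qL, pL = leapfrog(grad_U, q0, p0, eps, L)   # leapfrog as in the previous sketch
        accept = logu < (U(q0) + 0.5 * p0 @ p0) - (U(qL) + 0.5 * pL @ pL)
        new_states.append(qL if accept else q0)
    return new_states[0], new_states[1]
```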
Couplings of other MCMC algorithms
Conditional particle filters, with or without ancestor sampling:
J., Fredrik Lindsten, Thomas Schön
Smoothing with Couplings of Conditional Particle Filters, 2018.
Pseudo-marginal methods, including particle marginal
Metropolis–Hastings (PMMH), and exchange algorithm:
Middleton, Deligiannidis, Doucet, J.,
Unbiased Markov chain Monte Carlo for intractable target
distributions, 2018.
Pierre E. Jacob Unbiased MCMC
Outline
1 Setting: MCMC and bias
2 Unbiased estimators from Markov kernels
3 Illustrations on standard MCMC settings
4 Beyond the standard MCMC setting
Pierre E. Jacob Unbiased MCMC
Logistic regression
X1, . . . , Xp ∈ R^n: covariates, with n = 1000, p = 300 (German
credit data).
Y ∈ R^n: binary response variable.
Model: P(Yi = 1) = {1 + exp(−a − bᵀxi)}^{−1}.
Parameters: intercept a ∈ R, coefficients b ∈ R^300.
Prior: a|s² ∼ N(0, s²), b|s² ∼ N(0, s²I) and s² ∼ Exp(0.01).
Target π is the distribution of (a, b, log s²) given Y, X, on R^302.
Heng & J., Unbiased Hamiltonian Monte Carlo with couplings,
2018.
Pierre E. Jacob Unbiased MCMC
Logistic regression
k m Cost Relative inefficiency
1 k 436 1989.07
1 5k 436 1671.93
1 10k 436 1403.28
90% quantile(τ) k 553 38.11
90% quantile(τ) 5k 1868 1.23
90% quantile(τ) 10k 3518 1.05
Inefficiencies are averaged over test functions yielding all first
and second moments of the target distribution.
Pierre E. Jacob Unbiased MCMC
Variable selection
Integers p, n, with possibly n < p.
X1, . . . , Xp ∈ R^n: covariates.
Y ∈ R^n: response variable.
γ ∈ {0, 1}p represents which covariates to select.
|γ| = Σ_{i=1}^{p} γi: number of selected variables.
Pierre E. Jacob Unbiased MCMC
Variable selection
Xγ is the n × |γ| matrix of covariates selected by γ.
Model:
Y = Xγβγ + w, where w ∼ N(0, σ²In).
Conjugate prior leads to
    π(Y |X, γ) ∝ (1 + g)^{−|γ|/2} / (1 + g(1 − R²γ))^{n/2},
where
    R²γ = YᵀXγ(XγᵀXγ)^{−1}XγᵀY / (YᵀY).
The prior on γ is set as π(γ) ∝ p^{−κ|γ|} 1(|γ| ≤ s0).
Yang, Wainwright & Jordan, On the computational complexity
of high-dimensional Bayesian variable selection, 2016.
Pierre E. Jacob Unbiased MCMC
Variable selection
MCMC algorithm randomly alternates between
flipping a component, from 1 to 0 or from 0 to 1,
accept or not with MH step;
randomly swapping a 0 and a 1,
accept or not with MH step.
Maximal couplings of these proposals can be implemented.
Pierre E. Jacob Unbiased MCMC
Variable selection
Generate n = 500 rows, with covariates X
generated as independent standard Normals.
Outcome variable Y generated from the model, with a
coefficient β whose first ten entries are non-zero.
Other parameters: σ² = σ²0 = 1, s0 = 100, g = p³, κ = 2.
Pierre E. Jacob Unbiased MCMC
Variable selection: scaling of the meeting times with p
[Figure: meeting times divided by p (vertical axis, 0 to 40), plotted against p ∈ {100, 250, 500, 750, 1000}.]
Pierre E. Jacob Unbiased MCMC
Variable selection: inclusion probabilities
[Figure: estimated inclusion probabilities (0 to 1) for the first 20 variables, for three values of κ.]
For p = 1,000, three values of κ, using k = 75,000 and m = 2k. Error
bars generated on 600 processors for 2 hours. Lines are obtained with
10^6 iterations of standard MCMC.
Pierre E. Jacob Unbiased MCMC
Outline
1 Setting: MCMC and bias
2 Unbiased estimators from Markov kernels
3 Illustrations on standard MCMC settings
4 Beyond the standard MCMC setting
Pierre E. Jacob Unbiased MCMC
Unbiased property
Unbiased estimators can be used to tackle problems for which
MCMC methods are ill-suited.
For instance, problems where one needs to approximate an
integral of a function which itself is defined as an integral:
∫ ( ∫ h(x1, x2) π2(x2|x1) dx2 ) π1(x1) dx1,
or equivalently
E1 [E2[h(X1, X2)|X1]] .
Rainforth, Cornish, Yang, Warrington, On nesting Monte Carlo
estimators, 2018.
Pierre E. Jacob Unbiased MCMC
Example 1: cut distribution
First model, parameter θ1, data Y1,
posterior π1(θ1|Y1), expectation E1[·].
Second model, parameter θ2, data Y2, posterior:
π2(θ2|Y2, θ1) = p2(θ2|θ1) p2(Y2|θ1, θ2) / p2(Y2|θ1), expectation E2[·|θ1].
Pierre E. Jacob Unbiased MCMC
Cut distribution
How to do inference on θ2
while taking the uncertainty of θ1 into account?
Full posterior under the joint model:
π2(θ1, θ2|Y1, Y2).
Can be approximated via MCMC,
but misspecification of any part “contaminates” the ensemble.
Liu, Bayarri, Berger, Modularization in Bayesian Analysis, with
Emphasis on Analysis of Computer Models, 2009.
J., Murray, Holmes, Robert, Better together? Statistical learning in
models made of modules, 2017
Pierre E. Jacob Unbiased MCMC
Cut distribution
Cut distribution:
πcut(θ1, θ2) = π1(θ1|Y1)π2(θ2|Y2, θ1).
Different than full posterior = π1(θ1|Y1, Y2)π2(θ2|Y2, θ1).
The marginal of θ1 is the first model’s posterior.
Probabilistic counterpart to two-step estimators.
But how to sample from the cut distribution?
Plummer, Cuts in Bayesian graphical models, 2015.
Pierre E. Jacob Unbiased MCMC
Cut distribution
Express expectation with respect to cut distribution
Ecut[h(θ1, θ2)]
using the tower property, i.e.
E1[E2[h(θ1, θ2)|θ1]].
Can be unbiasedly estimated using the tower property of
expectations, and
unbiased estimators of E2[·|θ1] for all θ1,
and unbiased estimators of E1[·].
Pierre E. Jacob Unbiased MCMC
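Schematically, and assuming two user-supplied routines that the slides describe only abstractly, a hypothetical `signed_measure_1()` returning the weights and atoms (ωs, θ1,s) of an unbiased signed-measure approximation of π1(θ1|Y1), and a hypothetical `unbiased_E2(h, theta1)` returning an unbiased estimate of E2[h(θ1, θ2)|θ1]:

```python
def unbiased_cut_expectation(h, signed_measure_1, unbiased_E2):
    # E_cut[h] = E_1[ E_2[h(theta1, theta2) | theta1] ]: apply the outer signed measure
    # to inner unbiased estimates, one per atom theta1_s.
    weights, atoms = signed_measure_1()
    return sum(w * unbiased_E2(h, theta1) for w, theta1 in zip(weights, atoms))
```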
Example: epidemiological study
Model of virus prevalence
∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),
Zi is number of women infected with high-risk HPV in a
sample of size Ni in country i.
Impact of prevalence onto cervical cancer occurrence
∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,
Yi is number of cancer cases arising from Ti woman-years of
follow-up in country i.
Plummer, Cuts in Bayesian graphical models, 2015.
Pierre E. Jacob Unbiased MCMC
Example: epidemiological study
Approximations of the cut distribution marginals:
[Figure: estimated marginal densities of θ2,1 and θ2,2 under the cut distribution.]
Red curves were obtained by long MCMC run targeting
π2(θ2|Y2, θ1), for draws θ1 from first posterior π1(θ1|Y1).
Black bars were obtained with unbiased MCMC.
Pierre E. Jacob Unbiased MCMC
Example 2: normalizing constants
We are interested in normalizing constant of π, for which we
can compute unnormalized density x → exp(−U(x)).
Introduce
    ∀λ ∈ [0, 1], πλ(x) = Zλ^{−1} exp(−Uλ(x)), where Zλ = ∫ exp(−Uλ(x)) dx.
Thermodynamic integration or path sampling identity:
    log(Z1/Z0) = − ∫_0^1 ( ∫ ∂λUλ(x) πλ(dx) / q(λ) ) q(dλ),
where q(λ) is a distribution supported on [0, 1].
Kirkwood, Statistical mechanics of fluid mixtures, 1935.
Pierre E. Jacob Unbiased MCMC
Normalizing constants
Proposed procedure:
sample λ ∼ q(dλ),
produce an unbiased estimator ˆE(λ) of − ∫ ∂λUλ(x) πλ(dx),
return ˆr01 = ˆE(λ)/q(λ).
The resulting variable ˆr01 has expectation log(Z1/Z0).
The cost and variance of ˆr01 can be estimated, and leveraged to
guide the choice of q(dλ).
Rischard, J., Pillai, Unbiased estimation of log normalizing constants
with applications to Bayesian cross-validation, 2018.
Pierre E. Jacob Unbiased MCMC
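In pseudocode-like Python, with `sample_q`, `pdf_q` and `unbiased_E` standing in for user-supplied routines (the last one an unbiased estimator of −∫ ∂λUλ(x) πλ(dx), for instance built from coupled chains targeting πλ), the procedure could read as follows; names are placeholders.

```python
def unbiased_log_ratio(rng, sample_q, pdf_q, unbiased_E):
    # single-draw unbiased estimator of log(Z1/Z0)
    lam = sample_q(rng)
    return unbiased_E(lam) / pdf_q(lam)

# e.g. with q = Uniform(0, 1): sample_q = lambda r: r.uniform(), pdf_q = lambda l: 1.0
```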
Example 3: Bayesian cross-validation
Data y = {y1, . . . , yn}, made of n units.
Partition y into training T and validation V sets.
For instance, by sampling nT indices out of n at random
without replacement to form T.
For each split y = {T, V }, one might want to assess predictive
performance with
log p(V |T) = log ∫ p(V |θ, T) p(dθ|T),
i.e. using the logarithmic scoring rule.
Then log p(V |T) is a log-ratio of normalizing constants, p(T, V )
and p(T), so that the proposed estimator ˆr01 can be applied.
Pierre E. Jacob Unbiased MCMC
Bayesian cross-validation
One might want to approximate the criterion
CV = − (n choose nT)^{−1} Σ_{T,V} log p(V |T),
where the sum is over all splits (T, V) of size (nT, n − nT).
Proposed procedure:
randomly sample a split (T, V ) of size (nT , n − nT ),
introduce a “path” between p(dθ|T) and p(dθ|T, V ),
sample ˆr01, an unbiased estimator of log p(V |T),
return −ˆr01.
The resulting variable has expectation equal to CV.
Pierre E. Jacob Unbiased MCMC
Example 4: bagging Bayesian predictions?
Consider “bagging” (Breiman, 1994), applied to
predictors obtained by Bayesian procedures.
obtain y′, a bootstrap sample of the data y,
approximate the resulting posterior distribution p(dθ|y′),
perform posterior predictions based on the approximation.
This is another example of nested Monte Carlo problem.
Bühlmann, Discussion of Big Bayes Stories and BayesBag,
2014.
Pierre E. Jacob Unbiased MCMC
Discussion
Perfect samplers would yield same benefits and more.
Proposed estimators are perhaps more applicable,
no need for extra knowledge about π,
but require coming up with coupled kernel ¯P.
Cannot work if underlying MCMC doesn’t work. . .
but parallelism allows for more expensive MCMC kernels.
Optimal tuning of underlying MCMC algorithms is not
necessarily optimal for the proposed estimators.
Pierre E. Jacob Unbiased MCMC
Links – thank you for listening!
** P.J., John O’Leary, Yves F. Atchadé
Unbiased Markov chain Monte Carlo with couplings, 2019+
P.J., Fredrik Lindsten, Thomas Schön
Smoothing with Couplings of Conditional Particle Filters, 2018.
Jeremy Heng, P.J.
Unbiased Hamiltonian Monte Carlo with couplings, 2018.
Lawrence Middleton, George Deligiannidis, Arnaud Doucet, P.J.
Unbiased Markov chain Monte Carlo for intractable target
distributions, 2019+.
Maxime Rischard, P.J., Natesh Pillai,
Unbiased estimation of log normalizing constants with applications to
Bayesian cross-validation, 2019+.
Code on Github: github.com/pierrejacob/
Blog: statisfaction.wordpress.com/
Pierre E. Jacob Unbiased MCMC