Many mathematical models use a large number of poorly known parameters as inputs. Quantifying the influence of each of these parameters is one of the aims of sensitivity analysis. Global Sensitivity Analysis is an important paradigm for understanding model behavior, characterizing uncertainty, improving model calibration, etc. Input uncertainty is modeled by a probability distribution, and various sensitivity measures have been built within that paradigm. This tutorial focuses on the so-called Sobol' indices, based on functional variance analysis. Estimation procedures will be presented, and the choice of the designs of experiments these procedures are based on will be discussed. As Sobol' indices have no clear interpretation in the presence of statistical dependence between inputs, it also seems promising to measure sensitivity with Shapley effects, based on the notion of Shapley value, a solution concept in cooperative game theory.
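To make the variance-based idea concrete, here is a minimal pick-freeze sketch (not from the tutorial) of a first-order Sobol' index estimator; the function names, the particular estimator variant, and the sample size are illustrative assumptions.

```python
import numpy as np

def first_order_sobol(model, sample_inputs, i, n=10_000, rng=None):
    """Pick-freeze estimate of the first-order Sobol' index S_i = Var(E[Y|X_i]) / Var(Y).

    model         : maps an (n, d) array of inputs to n scalar outputs
    sample_inputs : draws an (n, d) array of independent inputs from their distribution
    i             : index of the input whose influence is measured
    """
    rng = np.random.default_rng() if rng is None else rng
    A = sample_inputs(n, rng)
    B = sample_inputs(n, rng)
    AB = B.copy()
    AB[:, i] = A[:, i]              # keep ("freeze") input i, resample all the others
    yA, yAB = model(A), model(AB)
    numer = np.mean(yA * yAB) - yA.mean() * yAB.mean()
    return numer / yA.var()
```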
The generation of Gaussian random fields over a physical domain is a challenging problem in computational mathematics, especially when the correlation length is short and the field is rough. The traditional approach is to make use of a truncated Karhunen-Loeve (KL) expansion, but the generation of even a single realisation of the field may then be effectively beyond reach (especially for 3-dimensional domains) if the need is to obtain an expected L2 error of say 5%, because of the potentially very slow convergence of the KL expansion. In this talk, based on joint work with Ivan Graham, Frances Kuo, Dirk Nuyens, and Rob Scheichl, a completely different approach is used, in which the field is initially generated at a regular grid on a 2- or 3-dimensional rectangle that contains the physical domain, and then possibly interpolated to obtain the field at other points. In that case there is no need for any truncation. Rather the main problem becomes the factorisation of a large dense matrix. For this we use circulant embedding and FFT ideas. Quasi-Monte Carlo integration is then used to evaluate the expected value of some functional of the finite-element solution of an elliptic PDE with a random field as input.
We will describe and analyze accurate and efficient numerical algorithms to interpolate multivariate functions and approximate their integrals. The algorithms can be applied when we are given function values at an arbitrarily positioned, and usually small, existing sparse set of sample points, and additional samples are impossible or difficult (e.g. expensive) to obtain. The methods are based on local and global tensor-product sparse quasi-interpolation methods that are exact for a class of sparse multivariate orthogonal polynomials.
A fundamental numerical problem in many sciences is to compute integrals. These integrals can often be expressed as expectations and then approximated by sampling methods. Monte Carlo sampling is very competitive in high dimensions, but has a slow rate of convergence. One reason for this slowness is that the MC points form clusters and gaps. Quasi-Monte Carlo methods greatly reduce such clusters and gaps, and under modest smoothness demands on the integrand they can greatly improve accuracy. This can even take place in problems of surprisingly high dimension. This talk will introduce the basics of QMC and randomized QMC. It will include discrepancy and the Koksma-Hlawka inequality, some digital constructions and some randomized QMC methods that allow error estimation and sometimes bring improved accuracy.
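As a small illustration of the contrast between plain MC and randomized QMC described above, here is a sketch (not part of the tutorial) using SciPy's scrambled Sobol' sequences; the test integrand and the number of randomizations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import qmc

d, n = 8, 2 ** 12
# Toy integrand on [0,1]^d whose exact integral is 1.
f = lambda x: np.prod(1.0 + (x - 0.5) / np.arange(1, d + 1), axis=1)

rng = np.random.default_rng(0)
mc_estimate = f(rng.random((n, d))).mean()          # plain Monte Carlo

# Randomized QMC: independently scrambled Sobol' point sets give several
# unbiased estimates, whose spread yields an error estimate.
reps = [f(qmc.Sobol(d, scramble=True, seed=s).random(n)).mean() for s in range(10)]
rqmc_estimate = np.mean(reps)
rqmc_stderr = np.std(reps, ddof=1) / np.sqrt(len(reps))
```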
Sequential quasi-Monte Carlo (SQMC) is a quasi-Monte Carlo (QMC) version of sequential Monte Carlo (or particle filtering), a popular class of Monte Carlo techniques used to carry out inference in state space models. In this talk I will first review the SQMC methodology as well as some theoretical results. Although SQMC converges faster than the usual Monte Carlo error rate, its performance deteriorates quickly as the dimension of the hidden variable increases. However, I will show with an example that SQMC may perform well for some "high" dimensional problems. I will conclude this talk with some open problems and potential applications of SQMC in complicated settings.
We present recent results on the numerical analysis of Quasi-Monte Carlo quadrature methods applied to forward and inverse uncertainty quantification for elliptic and parabolic PDEs. Particular attention will be placed on higher-order QMC, the stable and efficient generation of interlaced polynomial lattice rules, and the numerical analysis of multilevel QMC finite element discretizations with applications to computational uncertainty quantification.
In this talk, we discuss some recent advances in probabilistic schemes for high-dimensional PIDEs. It is known that traditional PDE solvers, e.g., finite element and finite difference methods, do not scale well as the dimension increases. The idea of probabilistic schemes is to link a wide class of nonlinear parabolic PIDEs to stochastic Lévy processes via a nonlinear version of the Feynman-Kac theory. As such, the solution of the PIDE can be represented by a conditional expectation (i.e., a high-dimensional integral) with respect to a stochastic dynamical system driven by Lévy processes. In other words, we can solve the PIDEs by performing high-dimensional numerical integration, and a variety of quadrature methods can be applied, including MC, QMC, sparse grids, etc. The probabilistic schemes have been used in many application problems, e.g., particle transport in plasmas (e.g., Vlasov-Fokker-Planck equations), nonlinear filtering (e.g., Zakai equations), and option pricing.
The standard Galerkin formulation of acoustic wave propagation, governed by the Helmholtz partial differential equation (PDE), is indefinite for large wavenumbers, even though the Helmholtz PDE itself is in general not indefinite. The lack of coercivity (indefiniteness) is one of the major difficulties for approximation and simulation of heterogeneous-media wave propagation models, including applications to Quasi-Monte Carlo (QMC) analysis of stochastic wave propagation. We will present a new class of sign-definite continuous and discrete preconditioned FEM Helmholtz wave propagation models.
Multidimensional integrals may be approximated by weighted averages of integrand values. Quasi-Monte Carlo (QMC) methods are more accurate than simple Monte Carlo methods because they carefully choose where to evaluate the integrand. This tutorial focuses on how quickly QMC methods converge to the correct answer as the number of integrand values increases. The answer may depend on the smoothness of the integrand and the sophistication of the QMC method. QMC error analysis may assume that the integrand belongs to a reproducing kernel Hilbert space or that the integrand is an instance of a stochastic process with known covariance structure. These two approaches have interesting parallels. This tutorial also explores how the computational cost of achieving a good approximation to the integral depends on the dimension of the domain of the integrand. Finally, this tutorial explores methods for determining how many integrand values are needed to satisfy the error tolerance. Relevant software is described.
One of the central tasks in computational mathematics and statistics is to accurately approximate unknown target functions. This is typically done with the help of data — samples of the unknown functions. The emergence of Big Data presents both opportunities and challenges. On one hand, big data introduces more information about the unknowns and, in principle, allows us to create more accurate models. On the other hand, data storage and processing become highly challenging. In this talk, we present a set of sequential algorithms for function approximation in high dimensions with large data sets. The algorithms are of iterative nature and involve only vector operations. They use one data sample at each step and can handle dynamic/stream data. We present both the numerical algorithms, which are easy to implement, as well as rigorous analysis for their theoretical foundation.
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop, Parallel Markov Chain Monte Carlo - Scott Schmidler, Dec 11, 2017
1. Parallel Markov Chain Monte Carlo
Scott C. Schmidler∗
Department of Statistical Science
Duke University
SAMSI Workshop
December 11, 2017
∗ joint work with Doug VanDerwerken
2. Markov chain Monte Carlo integration
A general problem in (esp Bayesian) statistics and statistical
mechanics is calculation of integrals of the form:
h̄_π = E_π[h(X)] = ∫_X h(x) π(dx)
A common, powerful approach is Monte Carlo integration:
h̄ ≈ (1/n) ∑_{i=1}^n h(X_i)   for X_1, X_2, . . . , X_n ∼ π
When sampling from π directly is difficult, one can construct a Markov chain with stationary distribution π (MCMC).
Many ways to do so: Metropolis-Hastings, Gibbs sampling, Langevin & Hamiltonian methods, etc.
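A minimal random-walk Metropolis sketch of this idea (illustrative only; the proposal scale and function names are assumptions, not from the slides):

```python
import numpy as np

def rw_metropolis(log_pi, theta0, n_steps, scale=1.0, rng=None):
    """Random-walk Metropolis chain targeting pi (sketch).

    log_pi : log of the (possibly unnormalized) target density
    scale  : standard deviation of the Gaussian proposal increments
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float)).copy()
    logp = log_pi(theta)
    draws = np.empty((n_steps, theta.size))
    for t in range(n_steps):
        proposal = theta + scale * rng.standard_normal(theta.size)
        logp_prop = log_pi(proposal)
        if np.log(rng.random()) < logp_prop - logp:   # Metropolis accept/reject
            theta, logp = proposal, logp_prop
        draws[t] = theta
    return draws

# The ergodic average of h over the draws approximates E_pi[h(X)]:
# h_bar = h(draws).mean(axis=0)
```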
3. Problem: MCMC can be slow
When X_0, X_1, X_2, . . . , X_n come from a Markov chain, the ergodic averages
µ̂_h = (1/n) ∑_{i=1}^n h(X_i)
can converge very slowly.
Mixing time:
τ(ε) = sup_{π_0} min{ n : ||π_{n'} − π||_TV < ε for all n' ≥ n },
where
||π_n − π||_TV = sup_{A ⊂ X} |π_n(A) − π(A)|.
In problems with multimodality, high dimensions, or simply strong dependence, mixing times can be very, very long.
4. Rapid and slow mixing
One way to characterize this is rapid mixing.
Let (X^(d), F^(d), λ^(d)) be a sequence of measure spaces, and π^(d) densities w.r.t. λ^(d), for d ∈ N the problem size.
P is rapidly mixing if τ(d) is bounded above by a polynomial in d.
P is torpidly mixing if τ(d) is bounded below by an exponential in d.
Even if the chain is “rapidly” mixing, τ may be impractically large.
5. Computation is changing
At the same time, the computing landscape has shifted
dramatically.
Moore’s law (exponential growth of processor speed) is dead.
Future growth must come through parallelism:
Multi-core platforms
Cluster computing
Massive parallelism (GPUs)
Cloud computing
6. Parallel algorithms
Basic idea: Break a problem into pieces that can be solved
independently - preferably asynchronously - and recombined into a
full solution.
Integration (w.r.t. probability measure π):
∫_Θ h(θ) π(dθ)
One possibility:
partition the space: Θ = ∪_{j=1}^J Θ_j
integrate within each element Θ_j : µ_j = ∫_{Θ_j} h(θ) π(θ) dθ
sum the results: µ = ∑_j µ_j
Easily done for grid-based quadrature, but . . .
For fixed accuracy ε, # evals grows exponentially in dim(Θ).
In contrast, Monte Carlo integration “spends” function evals only in relevant parts of Θ. (Hence preferred for d > 8.)
7. Parallelization
Our goal: Combine the best of both worlds: expend computation
only in regions of significant probability, while enabling parallel
evaluation in distinct regions.
Quandary: MCMC is an inherently serial algorithm, and number of
steps may be exponential in dim(Θ).
8. MCMC is a serial algorithm
MCMC is inherently serial:
Cannot compute X_t without first computing X_1, X_2, . . . , X_{t−1}.
⇒ incompatible with parallelization
What we can do:
Parallelize individual steps (e.g. expensive likelihood calcs)
Propose moves in parallel, or precompute acceptance ratios, for individual steps
Markov chains with natural parallel structure
Parallel tempering
Population MCMC
but . . . such chains have inherent limitations on number of
processors; cannot parallelize component chains
Split ’big data’ and recombine results in ad hoc ways
Particle filtering/SMC
9. MCMC is a serial algorithm
Moreover, these approaches all require processor synchronization.
Achievable only on dedicated clusters, with high-speed
connectivity
Without this, parallelization may slow down compared to a single processor.
Finally, all require the component (or joint) chains to reach equilibrium for valid inference.
⇒ Cannot reduce the number of serial steps required.
e.g. Parallel Tempering:
may speed convergence vs single-temperature, but . . .
increasing # processors > # temps doesn’t help.
When mixing is slow, e.g. in the presence of multimodality, may not help (e.g. Woodard, S., Huber 2009).
These algorithms are fundamentally limited by the mixing time of the joint process.
10. Goal of this work
Goal: A procedure that can be applied to any Markov chain Monte
Carlo algorithm (including above methods) to make it parallel, with
the ability to take advantage of as many processors as available:
Asynchronously parallel.
Ideally, linear speedup in # processors.
Not limited by the mixing time of the component chain(s).
11. Basic idea (not quite what we do)
Given a partition Θ = ∪_{j=1}^J Θ_j.
For each j, run an MCMC chain θ_1^(j), . . . , θ_{n_j}^(j) on the target distribution restricted to Θ_j :
π_j(θ) := π(θ) 1_{Θ_j}(θ) / w_j ,   where w_j = ∫_{Θ_j} π(θ) ν(dθ).
Then for ergodic averages µ̂_{j,n} = n_j^{-1} ∑_{i=1}^{n_j} f(θ_j^(i)) we have
µ̂_{j,n} −→ E_{π_j}(f) = ∫_{Θ_j} f(θ) π_j(θ) ν(dθ)
as n_j → ∞, for each j ∈ {1, . . . , J}.
12. Combining the chains
If we can also construct estimators for the weights:
ŵ_{j,n} → w_j
Then the combined estimator
µ̂_n = ∑_{j=1}^J ŵ_{j,n} µ̂_{j,n} −→ µ = E_π(f)
If the µ̂_{j,n}'s and ŵ_{j,n}'s are unbiased and independent, then µ̂_n is unbiased.
Notice:
Need only the µ̂_{j,n}'s and ŵ_{j,n}'s to converge, not the chains!
Requires only that each chain mix locally
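A minimal sketch of this combination step (illustrative; the argument names and data layout are assumptions): given per-element MCMC output and estimated weights, the pooled estimate is simply the weighted sum of within-element averages.

```python
import numpy as np

def combined_estimate(draws_by_element, weight_estimates, f):
    """Combine per-element MCMC output into one estimate of E_pi[f].

    draws_by_element : list of arrays; entry j holds the draws associated with
                       partition element Theta_j
    weight_estimates : estimated element weights w_hat_j (should sum to ~1)
    f                : vectorized function whose expectation is wanted
    """
    mu_hat = np.array([np.mean(f(theta)) for theta in draws_by_element])
    return float(np.dot(weight_estimates, mu_hat))
```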
13. Estimating the weights
Let g(θ) be the unnormalized target density, i.e. π(θ) = g(θ)/c.
Estimating c_j = ∫_{Θ_j} g(θ) ν(dθ) is equivalent to estimating the normalizing constant of the target density g_j(θ) = g(θ) 1_{Θ_j}(θ).
Many techniques available (but requires care).
Then form
ŵ_{j,n} = ∑_{i=1}^n ĉ_j^(i) / ∑_{i=1}^n ∑_{k=1}^J ĉ_k^(i)
which is consistent (but not unbiased) for w_j.
Other ratio estimators may improve efficiency (Tin 1965), allowing reduction in n.
14. Estimating the weights
Approach 1: Markov chain output
Estimate c_j directly from MCMC trajectories.
HME (Newton & Raftery 1994, Raftery et al. 2007)
Chib’s method (Chib 1995, Chib & Jeliazkov 2001)
Bridge/path sampling (Meng & Wong 1996, Gelman & Meng 1998, Meng & Schilling 2002).
Note: restriction to Θ_j helps avoid problems (e.g. Wolpert & S. 2012).
15. Estimating the weights
Approach 2: Adaptive importance sampling
Construct approximation q_j to π_j from MCMC draws:
t(m_j , S_j) distn for sample mean m_j , covar S_j
Adaptive mixture of t-distributions (Ji & S. 2013, Wang & S. 2013)
Draw θ_t ∼ q_j i.i.d. to get unbiased IS estimate
ĉ_j = T^{-1} ∑_{t=1}^T g(θ_t) 1_{Θ_j}(θ_t) / q_j(θ_t)
Again, q_j need only approximate π locally on Θ_j , so λ*_j = sup π_j / q_j much smaller
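A minimal sketch of this importance-sampling estimate of the ĉ_j's, together with the normalization into weights from the earlier slide (illustrative; the proposal object and function names are assumptions — any fitted local distribution exposing rvs/logpdf, e.g. a multivariate t, would do):

```python
import numpy as np

def c_hat(log_g, in_element, proposal, T=1000, rng=None):
    """One unbiased IS estimate of c_j = integral of g over Theta_j.

    log_g      : log of the unnormalized target density g (vectorized)
    in_element : indicator of Theta_j, returning 0/1 per draw
    proposal   : fitted local proposal q_j with .rvs and .logpdf methods
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = proposal.rvs(size=T, random_state=rng)
    log_ratio = log_g(theta) - proposal.logpdf(theta)
    return float(np.mean(np.exp(log_ratio) * in_element(theta)))

def weight_estimates(c_hats):
    """Normalize element-wise estimates c_hat_j into weights w_hat_j."""
    c = np.asarray(c_hats, dtype=float)
    return c / c.sum()
```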
16. Estimating the weights
Approach 2: Adaptive importance sampling (cont’d)
More generally, may use a sequence of distributions q_{j,t}:
Markov chain θ_t | θ_{t−1} ∼ q_j(θ_t | θ_{t−1})
Adaptive MIS chain (Ji & S. 2013, Wang & S. 2013)
‘Sample’ (‘trajectory’) denotes independent (conditional) draws.
Averaging n independent ĉ_j's decreases variance as n^{-1}.
Pseudo-marginal approach (Andrieu & Roberts 2009) using these techniques is significantly less efficient.
17. Mixture of normals
Consider a simple mixture of two normals:
π(z) = (1/2) N_M(z; −1_M, σ_1^2 I_M) + (1/2) N_M(z; 1_M, σ_2^2 I_M)
Upper bounds on the spectral gap (WSH07a,b) yield:
Thm: RW-MH is torpidly mixing.
Thm: Tempering is torpidly mixing for σ_1 ≠ σ_2.
Lower bounds on hitting times obtained by (SW10) yield:
Thm: Equi-energy sampler torpidly mixing for σ_1 ≠ σ_2.
Thm: Haario adaptive RW kernel torpidly mixing for σ_1 ≠ σ_2.
18. Towards some theory
However, if the partition Θ = ∪_{j=1}^J Θ_j is such that:
the Θ_j's are convex
π_j is log-concave for j = 1, . . . , J,
then
π_j can be sampled in polynomial time (Frieze, Kannan, et al.)
c_j can be estimated in polynomial time (Lovasz, Vempala)
+ some additional technical restrictions gives:
⇒ we can sample π and approximate E_π(h(x)) in polynomial time
. . . assuming we can initialize within the basins of attraction in poly time!
(VanDerwerken & S., 2015)
19. FPRAS for mixture-of-normals
Theorem
Under the above conditions, the PMCMC algorithm returns a sample in time O(poly(d)) from a distribution π̂ for which ||π̂ − π*||_TV ≤ ε, with probability at least 1 − δ.
HPD region of modes sampled in poly-time
Use samples to estimate an HPD hyperellipsoid B_j at each mode, where π is log-concave on B_j.
Apply log-concave integration
A similar result allows construction of a rapidly mixing MIS chain using adaptive mixture IS instead (VanDerwerken & S., 2015).
20. FPRAS for mixture-of-normals
Note: exponentially faster than estimating transition matrix as
in MD
Shows problem difficulty is finding modes, not mixing between
them. (Hard even in normal problem?)
Currently exploring limits of generalizability.
21. Problems with Approach #1
This approach has some shortcomings:
1. Requires # chains (processors) equal to the partition size, which could be exponential in dim(Θ).
2. Where does the partition come from?
3. Restriction to π_j requires rejection; makes evaluating the transition density hard for the ŵ_j's.
4. Restriction could slow down mixing of the chains.
22. Solution
No need for 1-to-1 correspondence between chains and estimators.
For L independent chains, let
µ̂_{j,n} = n_j^{-1} ∑_{l=1}^L ∑_{k=1}^{K_l} f(θ_{lk}) 1_{Θ_j}(θ_{lk})
where n_j = ∑_{l=1}^L ∑_{k=1}^{K_l} 1_{Θ_j}(θ_{lk}) is the # of draws in Θ_j from any chain.
L can be much smaller; need not be exponential in dim(Θ).
⇒ Chains unrestricted, can cross between partition elements.
Partition imposed on samples after the fact.
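A minimal sketch of this "partition after the fact" bookkeeping (illustrative; names are assumptions): pool the draws from all unrestricted chains, label each draw by its partition element, and form per-element averages and counts.

```python
import numpy as np

def element_averages(pooled_draws, element_of, f, J):
    """Per-element ergodic averages mu_hat_j and draw counts n_j from pooled chains.

    pooled_draws : draws from all L unrestricted chains, stacked together
    element_of   : maps each draw to its partition element label in {0, ..., J-1}
    f            : vectorized function whose expectation is wanted
    J            : number of partition elements
    """
    labels = element_of(pooled_draws)
    fx = np.asarray(f(pooled_draws), dtype=float)
    mu_hat = np.full(J, np.nan)
    counts = np.zeros(J, dtype=int)
    for j in range(J):
        mask = labels == j
        counts[j] = mask.sum()
        if counts[j] > 0:
            mu_hat[j] = fx[mask].mean()
    return mu_hat, counts
```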
24. Adaptive partitioning
Still need a partition.
Key: Must not grow exponentially in dim(Θ).
PACE clustering algorithm (VanDerwerken & S., 2013)
Let x_t^(j) denote draw t from chain j, and X_i the set of draws available at iteration i.
1. Define x*_i = argmax_{x_t^(j) ∈ X_i} { log π(x_t^(j)) }
2. Assign all draws lying in B_ε(x*_i) to C_i, and set X_{i+1} = X_i \ C_i.
3. Repeat (1)-(2) until 1 − α of the draws are clustered (e.g. 98%).
4. Reallocate all draws to the nearest cluster (Voronoi).
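A minimal sketch of this greedy clustering step (illustrative; Euclidean ε-balls and the stopping fraction are assumptions about details not fully legible on the slide):

```python
import numpy as np

def pace_partition(draws, log_pi, eps, alpha=0.02):
    """Greedy PACE-style clustering of pooled MCMC draws (sketch).

    draws  : (n, d) array of pooled draws from all chains
    log_pi : vectorized log target density
    eps    : ball radius used to absorb draws around each density peak
    alpha  : stop once a fraction 1 - alpha of the draws has been clustered
    """
    n = draws.shape[0]
    logp = log_pi(draws)
    unassigned = np.ones(n, dtype=bool)
    centers = []
    while unassigned.mean() > alpha:
        # 1. highest-density unassigned draw becomes a new cluster center
        idx = np.flatnonzero(unassigned)
        i_star = idx[np.argmax(logp[idx])]
        centers.append(draws[i_star])
        # 2. absorb all unassigned draws within the eps-ball of the center
        dist = np.linalg.norm(draws[idx] - draws[i_star], axis=1)
        unassigned[idx[dist <= eps]] = False
    centers = np.array(centers)
    # 4. reallocate every draw to its nearest center (Voronoi partition)
    d2 = ((draws[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return centers, labels
```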
26. Multimodal example
Langevin diffusion:
dθ_t = (σ^2/2) ∇ log π(θ_t) dt + σ dW_t
10 chains initialized uniformly
25k iterations each, in parallel on 10 processors
Cluster first 1k draws after 250 burn-in ⇒ 7-element partition.
In parallel, 1 processor per element (7 total) each generated:
n ≈ 5000 trajectories of length T = 5, and corresponding ĉ_j's,
initialized i.i.d. ∼ t_4(m_j, J^{-1}(θ))
t_4 perturbations instead of Gaussian to ensure var(ĉ_j) < ∞
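The slides simulate this diffusion directly; as an illustration, a simple Euler-Maruyama discretization looks like the following sketch (the step size and function names are assumptions, and the discretization introduces a bias that the exact diffusion does not have):

```python
import numpy as np

def langevin_trajectory(grad_log_pi, theta0, n_steps, step=0.01, sigma=1.0, rng=None):
    """Euler-Maruyama discretization of d(theta_t) = (sigma^2/2) grad log pi dt + sigma dW_t."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float)).copy()
    path = np.empty((n_steps, theta.size))
    for t in range(n_steps):
        drift = 0.5 * sigma ** 2 * grad_log_pi(theta)
        theta = theta + drift * step + sigma * np.sqrt(step) * rng.standard_normal(theta.size)
        path[t] = theta
    return path
```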
27. Multimodal example
Clustering of 10 chains initialized uniformly within dashed lines.
Ellipses show 95% contours for component densities of target.
28. Multimodal example
Estimated weights: [0.02, 0.23, 0.20, 0.55].
(True weights: [0.02, 0.20, 0.20, 0.58])
Using the AIS approach instead:
5000 ĉ_j's from samples of size T = 5 from a t_4 distn
requires 18s vs. 90s for simulating diffusions.
Clustering + IS takes < 1/2 the time of the parallel 25k chains, so weights are estimated in parallel before sampling is complete.
Estimated partition weights:
ŵ_7 = [.378, .201, .201, .105, .020, .093, .002]
Estimated component weights (nearly exact):
ŵ = [.020, .201, .201, .578]
29. Multimodal example: higher dimensions
Harder example:
p = 10 dimensions
4 component means drawn uniformly on (−10, 10)^p
Random covar matrices L^T L with L ∼ MN_{p×p}(0, I_p, I_p)
Weights ∼ Dirichlet(1, 1, 1, 1)
- 20 parallel r.w. Metropolis chains, 100k iterations each
- Proposal scales tuned adaptively during the first 1k iterations
- Next 49k draws clustered ⇒ 4 partition elements.
- IS using t_4(m_j, S_j) for cluster center m_j, empirical covar S_j, T = 100, n = 1000
Results:
d_TV(ŵ, w) = .0024,   ||µ̂ − E(X)||_{L1} = 0.17
30. Multimodal example: higher dimensions
Sensitivity to partitioning:
Repeating with a different clustering radius gives an 8-element partition:
d_TV(ŵ, w) = .0074,   ||µ̂ − E(X)||_{L1} = 0.12
More, smaller weights to estimate, but better mixing within (smaller) partition elements.
Since then, successfully repeated in 50 and 100 dimensions.
31. Multimodal example: higher dimensions
p = 50:
2 components: w = [0.1, 0.9]
Random means ∼ U(−10, 10)^p; correlations L^T L for L ∼ MN_{p×p}(0, I_p, I_p).
Parallel MCMC:
14 chains, initialized uniformly
Normal RW-MH with adaptive covar tuned during 100k-iteration burn-in
2M post-burn-in draws each, thinned to 1000 draws.
Partition size: 2 ( 2 = 2p)
AIS using t_4(m_j, Σ_j) (5M draws): ŵ = [.101, .899]
Pooling chains directly gives w̃ = [.210, .790], as 3 chains happen to get stuck in mode 1 and 11 in mode 2.
32. Beyond multimodality
Parallelization easily visualized for multimodal problems, but our
approach is completely general.
What about other types of slowly-mixing chains?
E.g. component-wise chains with strong dependence between dimensions (such as correlated Gibbs samplers).
33. Example: Probit regression
Probit regression model:
Assigns probs 1 − Φ(βX), Φ(βX) to response Y ∈ {0, 1} for covariate X.
Posterior:
π_0(β) ∏_{i=1}^n Φ(βX_i)^{y_i} {1 − Φ(βX_i)}^{1−y_i}
Data: N = 2000 pairs simulated, X ∼ Bern(1/2), β = 5/√2.
Diffuse prior (π_0(β) = N(0, 10^2)).
Model also studied by Nobile (1998), Imai & van Dyk (2005).
Traditional Gibbs sampler (Albert & Chib 1993) mixes slowly: autocorr ρ > 0.999.
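For reference, a minimal sketch of the Albert & Chib (1993) data-augmentation Gibbs sampler mentioned above, for this one-coefficient model (illustrative; the prior variance and function names are assumptions, and this is the slowly mixing serial sampler, not the parallel scheme):

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(x, y, n_iter=10_000, prior_var=100.0, rng=None):
    """Data-augmentation Gibbs sampler for probit regression with one coefficient,
    prior beta ~ N(0, prior_var)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    beta = 0.0
    draws = np.empty(n_iter)
    post_var = 1.0 / (1.0 / prior_var + np.sum(x ** 2))
    for t in range(n_iter):
        mu = x * beta
        # latent utilities z_i: truncated normals with sign constrained by y_i
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # conjugate Gaussian update for beta given the latent z
        post_mean = post_var * np.sum(x * z)
        beta = rng.normal(post_mean, np.sqrt(post_var))
        draws[t] = beta
    return draws
```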
34. Probit regression: Parallel MCMC
10 parallel chains initialized U(0, 20), run for 50k iterations
Partition formed by Voronoi cells with centers at the deciles of the 500k pooled draws.
Weights estimated via AIS with:
q_j = N(m_j, 2s_j) for mean m_j and s.d. s_j of the draws in Θ_j
n = 500, T = 10
TV distance to “truth” (200k independent rejection-sampling draws), calculated on a fine discretization, gives d_TV = .075.
d_TV for the serial Gibbs sampler reaches .075 at ∼1.2 million iterations.
⇒ Parallelized Gibbs sampler: same accuracy with < 1/2 as many draws, and more than 20× speedup due to parallelization.
35. Probit regression: higher dimensional
N = 500 points for p = 8 covariates drawn from:
(1, Bern(1/2), U(0,1), N(0,1), Exp(1), N(5,1), Pois(10), N(20,25))
with β = (0.25, 5, 1, −1.5, −0.1, 0, 0, 0).
Compare:
1M iterations of the serial Gibbs sampler
300k iterations each for 10 parallel chains
Partitioning: 2 = p for normalized dimensions
Weights: AIS with q_j = t_4(m*_j, S_j) for empirical mode m*_j and covariance S_j in element j, using n = 500, T = 50.
β_2 is much slower to converge (ρ > 0.999) than the others (ρ < 0.95).
So compare the marginal distribution for β_2 with “truth” (5M MH samples, ρ < 0.95) using d_TV calculated by discretization.
36. Multivariate Probit Regression: Parallel vs serial Gibbs
[Figure: total variation distance versus thousands of iterations for the parallel and serial Gibbs samplers.]
Using PACE convergence threshold 0.10 (VDW & S., 2013),
parallel Gibbs sampler converges ∼ 20× faster.
37. Example: Loss of Heterozygosity
Data from Seattle Barrett Esophagus project.
LOH is a genetic change undergone by cancer cells; chromosomal regions with high loss rates may contain regulatory genes.
Loss frequencies modeled by mixture (Desai & Emond, 2004):
(also studied by Craiu et al. 2009, 2011)
X_i ∼ η Bin(N_i, π_1) + (1 − η) Beta-Bin(N_i, π_2, γ),
where γ controls the beta-binomial overdispersion.
Likelihood:
∏_{i=1}^{40} [ η (n_i choose x_i) π_1^{x_i} (1 − π_1)^{n_i − x_i} + (1 − η) (n_i choose x_i) B(x_i + π_2/ω_2, n_i − x_i + (1 − π_2)/ω_2) / B(π_2/ω_2, (1 − π_2)/ω_2) ],
for ω_2 = e^γ / (2(1 + e^γ)) and beta function B.
38. Example: Loss of Heterozygosity
8 parallel chains initialized at logit(u) for u ∼ U[0, 1]^4
Clustering in logistic space (to choose 2 = 0.1) yields 7 clusters.
Weight estimation via AIS using:
t_4(m_j, S_j) for cluster mean m_j, empirical covar S_j
n = 10000 and T = 100.
Results agree with previous analyses, except γ slightly smaller.
Our results confirmed 4 times by i.i.d. importance sampling using a 3-component t_4 mixture, overdispersed covariances, n = 500,000.

               η            π_1          π_2          γ
Parallel MCMC  .816 (.001)  .299 (.001)  .678 (.002)  9.49 (.51)
IS             .814 (.001)  .299 (.001)  .676 (.001)  9.84 (.06)
39. Conclusions
A general scheme for parallelizing any MCMC algorithm
Requires approximating normalizing constants, but only on local regions
Requires MCMC to mix locally only
Doesn’t solve all problems, e.g. hitting modes in the first
place (which can be provably intractable)
Potentially powerful. Bigger applications in progress
40. References
VanDerwerken, D. N. and Schmidler, S. C. (2013). Parallel Markov Chain Monte Carlo. arXiv:1312.7479.
VanDerwerken, D. N. and Schmidler, S. C. (2017). Parallel Markov Chain Monte Carlo (revised and expanded version).