Stratified Monte Carlo for fast ABC using resampling

Umberto Picchini
@uPicchini
Chalmers University of Technology
and University of Gothenburg
Sweden

Joint work with Richard Everitt (Reading, UK) [1].
I am going to talk about ABC-MCMC, and specifically pseudomarginal
ABC-MCMC.
The goal is to accelerate this (typically) expensive procedure using
resampling techniques.
Resampling induces a bias, and the resulting posterior has (too)
large a variance.
We reduce the bias using stratified Monte Carlo.
[1] Picchini and Everitt (2019). Stratified sampling and resampling for approximate
Bayesian computation, arXiv:1905.07976.

• We are interested in Bayesian inference for parameters θ of a
model having an intractable likelihood function;
• that is the likelihood p(xobs|θ) for data xobs is analytically
unavailable;
• however we assume the ability to simulate pseudo-data x∗
from a simulator of a stochastic model.
• this is the same as writing x∗ ∼ p(x|θ);
• we use an ABC approach (approximate Bayesian computation);
• the key idea in ABC is to accept parameters θ∗ generating
x∗ ≈ xobs, i.e. ||x∗ − xobs|| < δ;
• the above is very inefficient (the acceptance probability is
tiny);
• typically much better to introduce low-dimensional summary
statistics S(·) so that sobs ≡ S(xobs), s∗ ≡ S(x∗);

ABC rejection sampler
This is the most basic (inefficient) ABC sampler.
1 Sample from the prior θ∗ ∼ p(θ)
2 Plug θ∗ into the simulator, simulate x∗ ∼ p(x|θ∗)
3 compute summary stats S(x∗)
4 accept θ∗ if ||S(x∗) − S(xobs)|| < δ
5 go back to 1 and repeat
The collection of accepted θ∗ is an ensemble of draws from the
approximate posterior πδ(θ|S(xobs)).
This is very inefficient (because proposals come from the prior).
Better samplers rely on SMC or MCMC, where θ∗ ∼ q(θ|θ#), with θ# the current parameter value.
Comprehensive monograph (many chapters on arXiv):
Sisson, Fan, Beaumont. (2018). Handbook of approximate Bayesian
computation. Chapman and Hall/CRC.
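The five steps of the rejection sampler can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes a toy model with N(θ, 1) data, a N(0, prior_sd²) prior and the sample mean as summary, and the function name `abc_rejection` and all numerical settings are hypothetical.

```python
import numpy as np

def abc_rejection(x_obs, n_draws, delta, prior_sd=1.0, seed=0):
    """Toy ABC rejection sampler for the mean of N(theta, 1) data (assumed model)."""
    rng = np.random.default_rng(seed)
    s_obs = x_obs.mean()                                   # summary S(x_obs)
    accepted = []
    for _ in range(n_draws):
        theta = rng.normal(0.0, prior_sd)                  # 1. sample from the prior
        x_sim = rng.normal(theta, 1.0, size=x_obs.size)    # 2. simulate pseudo-data
        s_sim = x_sim.mean()                               # 3. compute the summary
        if abs(s_sim - s_obs) < delta:                     # 4. accept if close enough
            accepted.append(theta)
    return np.array(accepted)                              # 5. loop = "go back to 1"

rng = np.random.default_rng(1)
x_obs = rng.normal(0.0, 1.0, size=1000)
draws = abc_rejection(x_obs, n_draws=5000, delta=0.05)
print(len(draws), draws.mean())
```

Only a few percent of the prior draws survive in this toy, which is the inefficiency the slide refers to.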

Constructing appropriate summary statistics opens a Pandora's box
of additional issues that I am not going to talk about [2].
Let’s just assume we have some “informative” summaries for θ.
This way we can sample from the approximate posterior
$$\pi_\delta(\theta \mid s_{obs}) \propto \pi(\theta) \int \mathbb{I}_{\{\|s^* - s_{obs}\| < \delta\}} \, p(s^* \mid \theta) \, ds^*$$
We have that
$$\pi_\delta(\theta \mid s_{obs}) \to \pi(\theta \mid s_{obs}) \quad (\delta \to 0)$$
It is rather unrealistic to assume that S(·) is a sufficient statistic, but if that
happens to be the case:
$$\pi_\delta(\theta \mid s_{obs}) \equiv \pi_\delta(\theta \mid x_{obs})$$
[2] A review is Prangle, D. (2015). Summary statistics in approximate
Bayesian computation. arXiv:1512.05633.

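More generally, in place of the indicator function we can consider a
kernel function Kδ(s∗, sobs) and write
$$\pi_\delta(\theta \mid s_{obs}) \propto \pi(\theta) \underbrace{\int K_\delta(s^*, s_{obs}) \, p(s^* \mid \theta) \, ds^*}_{\text{ABC likelihood}}$$
For example, we can use a Gaussian kernel
$$K_\delta(s^*, s_{obs}) \propto \exp\left(-\frac{1}{2\delta^2} (s^* - s_{obs})^\top \Sigma^{-1} (s^* - s_{obs})\right)$$
Other kernels are possible, e.g. the Epanechnikov kernel.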

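The ABC likelihood: ∫ Kδ(s∗, sobs) p(s∗|θ) ds∗.
This can trivially be approximated unbiasedly via Monte Carlo as [3]
$$\int K_\delta(s^*, s_{obs}) \, p(s^* \mid \theta) \, ds^* \approx \frac{1}{M} \sum_{r=1}^{M} K_\delta(s^{*r}, s_{obs}), \qquad s^{*r} \overset{iid}{\sim} p(s^* \mid \theta),$$
and plugged into a Metropolis-Hastings ABC-MCMC algorithm,
proposing a move θ∗ ∼ q(θ|θ#) and accepting it with probability
$$1 \wedge \frac{\frac{1}{M} \sum_{r=1}^{M} K_\delta(s^{*r}, s_{obs})}{\frac{1}{M} \sum_{r=1}^{M} K_\delta(s^{\#r}, s_{obs})} \cdot \frac{\pi(\theta^*)}{\pi(\theta^\#)} \cdot \frac{q(\theta^\# \mid \theta^*)}{q(\theta^* \mid \theta^\#)}$$
Problem: if the model simulator is computationally expensive,
having M large is infeasible.
[3] Lee, Andrieu, Doucet (2012): Discussion of Prangle 2012 JRSS-B.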
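A minimal sketch of the pseudomarginal ABC-MCMC step described above, under stated assumptions: a toy N(θ, 1) model with the sample mean as summary, a N(0, prior_sd²) prior, a symmetric Gaussian random-walk proposal (so the q-ratio equals 1) and an unnormalised Gaussian kernel with Σ = 1. All function names and tuning values are hypothetical.

```python
import numpy as np

def gaussian_kernel(s_sim, s_obs, delta):
    """Unnormalised Gaussian ABC kernel for scalar summaries (Sigma = 1 assumed)."""
    return np.exp(-0.5 * ((s_sim - s_obs) / delta) ** 2)

def abc_lik_hat(theta, s_obs, n_obs, M, delta, rng):
    """Monte Carlo estimate of the ABC likelihood from M simulated datasets."""
    x_sim = rng.normal(theta, 1.0, size=(M, n_obs))    # M independent datasets
    s_sim = x_sim.mean(axis=1)                         # one summary per dataset
    return gaussian_kernel(s_sim, s_obs, delta).mean()

def pm_abc_mcmc(s_obs, n_obs, n_iter, M, delta, step=0.05, prior_sd=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = s_obs                                      # hypothetical starting value
    lik = abc_lik_hat(theta, s_obs, n_obs, M, delta, rng)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()             # symmetric proposal
        lik_prop = abc_lik_hat(prop, s_obs, n_obs, M, delta, rng)
        prior_ratio = np.exp(-0.5 * (prop**2 - theta**2) / prior_sd**2)
        # accept with probability 1 ∧ (lik_prop / lik) * prior_ratio,
        # written without the division so that lik = 0 causes no error
        if rng.uniform() * lik < lik_prop * prior_ratio:
            theta, lik = prop, lik_prop                # the estimate travels with the state
        chain[i] = theta
    return chain

rng = np.random.default_rng(1)
x_obs = rng.normal(0.0, 1.0, size=1000)
chain = pm_abc_mcmc(x_obs.mean(), n_obs=1000, n_iter=2000, M=5, delta=0.05)
print(chain.mean(), chain.std())
```

Note that the likelihood estimate at the current state is stored and reused rather than recomputed; this recycling is what makes the scheme pseudomarginal.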

So we have the approximated ABC posterior (up to a constant)
$$\pi_\delta(\theta \mid s_{obs}) \approx \pi(\theta) \cdot \underbrace{\frac{1}{M} \sum_{r=1}^{M} K_\delta(s^{*r}, s_{obs})}_{\text{unbiased}}, \qquad s^{*r} \overset{iid}{\sim} p(s^* \mid \theta) \ (M \text{ times}).$$
• No matter the value of M, ABC-MCMC will sample exactly
from πδ(θ|sobs), because the likelihood estimator is unbiased;
• this makes the algorithm an instance of pseudomarginal MCMC
[Andrieu, Roberts 2009] [4];
• typically ABC-MCMC is computationally intensive and a small M
is chosen, say M = 1;
• the lower the M, the higher the variance of the estimate of the ABC
likelihood ∫ Kδ(s∗, sobs) p(s∗|θ) ds∗;
• the larger the variance, the worse the mixing of the chain (due to
occasional overestimation of the likelihood).
[4] Andrieu, C., and Roberts, G. O. (2009). The pseudo-marginal approach for
efficient Monte Carlo computations. The Annals of Statistics, 37(2), 697-725.

Dilemma:
a small M will decrease the runtime considerably; however, it will
increase the chance of overestimating the likelihood → possibly
high rejection rates.
Question: is it worth having M > 1 to reduce the variance of the
ABC likelihood, given the higher computational cost?
Bornn et al. (2017) [5] found that no, it is not worth it, and M = 1 is
just fine (when using a uniform kernel).
Basically, using M = 1 is so much faster to run that the decreased
variance obtained with M > 1 does not justify the higher
computational cost.
[5] L. Bornn, N. S. Pillai, A. Smith, and D. Woodard. The use of a single
pseudo-sample in approximate Bayesian computation. Statistics and
Computing, 27(3):583-590, 2017.

Data resampling
In a similar context (based on synthetic likelihood approaches), Everitt
(2017) [6] used the following approach.
At any proposed θ:
• simulate, say, M = 1 dataset x∗ ∼ p(x|θ);
• sample with replacement from the elements of x∗ to obtain a
resampled dataset (with dimension dim(xobs));
• repeat the resampling to obtain R resampled datasets x∗1, ..., x∗R
from x∗;
• compute the summaries s∗1, ..., s∗R, one for each resampled dataset.
This is cheap compared to producing M independent summaries from the
model when the simulator is computationally intensive.
It reduces the variance of the ABC likelihood compared to using
M = 1 without resampling.
[6] Everitt (2017). Bootstrapped synthetic likelihood. arXiv:1711.05825.
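The resampling steps above can be sketched as below for iid data. The helper name `resampled_summaries`, the Gaussian stand-in simulator and the choice of the sample mean as summary are assumptions made for illustration only.

```python
import numpy as np

def resampled_summaries(x_sim, R, n_obs, rng, summary=np.mean):
    """Bootstrap R datasets of size n_obs from one simulated dataset; summarise each."""
    idx = rng.integers(0, x_sim.size, size=(R, n_obs))   # sampling with replacement
    return np.array([summary(x_sim[row]) for row in idx])

rng = np.random.default_rng(0)
theta = 0.0
x_sim = rng.normal(theta, 1.0, size=1000)    # a single (M = 1) call to the simulator
s_star = resampled_summaries(x_sim, R=100, n_obs=1000, rng=rng)
print(s_star.shape, s_star.mean(), s_star.std())
```

The only expensive operation is the single simulator call; the R summaries cost just index draws and summary evaluations.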

The problem with using the bootstrap procedure within ABC is
that it biases the estimation of the ABC likelihood considerably.
Example: data are 1000 iid observations from N(θ = 0, 1).
Set a Gaussian prior on θ → known analytic posterior.
• Left: pseudo-marginal ABC with M = 100 independent
datasets and sufficient S(xobs) = x̄obs;
• Right: M = 1 and R = 100 resampled datasets.

Stratified Monte Carlo
Stratified Monte Carlo is a variance reduction technique.
In full generality: we want to approximate
$$\mu = \int_D f(x) \, p(x) \, dx$$
over some space D, for some function f and density (or probability
mass) function p.
Now partition D into J “strata” D1, ..., DJ:
• $\bigcup_{j=1}^{J} D_j = D$
• $D_j \cap D_{j'} = \emptyset, \quad j \neq j'$

Samples from bivariate N2(0, I2)
6 concentric rings and 7 equally probable strata.
Each stratum has exactly 3 points sampled from within it.
Better to oversample from the most important slices, where the
integrand has higher mass.

Ideally the statistician should decide how many Monte Carlo draws to
sample from each stratum Dj.
• Call this number ñj;
• define ωj := P(X ∈ Dj).
The probabilities ωj should be known.
Then we approximate µ = ∫_D f(x)p(x)dx with
$$\hat\mu_{strat} = \sum_{j=1}^{J} \frac{\omega_j}{\tilde n_j} \sum_{x^* \in D_j} f(x^*), \qquad x^* \sim p(x \mid x \in D_j)$$
This is the (unbiased) stratified MC estimator.
Variance reduction compared to the vanilla MC estimator can be obtained if
we know how many draws ñj to sample from each stratum (e.g. the “proportional
allocation method” [7]).
[7] Art Owen (2013), Monte Carlo theory, methods and examples.
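A small sketch of the stratified estimator for the bivariate Gaussian picture shown earlier: 7 equally probable concentric strata with 3 draws each. It relies on the fact that ‖X‖² is exponential with mean 2 when X ∼ N2(0, I2), so each ring can be sampled directly by inverting that CDF. The test function f and the function name are hypothetical; f is chosen so that E[f(X)] = 1/2 exactly.

```python
import numpy as np

def stratified_rings(f, J, n_per_stratum, rng):
    """Stratified MC estimate of E[f(X)], X ~ N2(0, I2), over J equiprobable rings."""
    omega = np.full(J, 1.0 / J)                  # known stratum probabilities
    estimate = 0.0
    for j in range(J):
        # ||X||^2 ~ Exp(mean 2): invert its CDF on the j-th slice of (0, 1)
        u = rng.uniform(j / J, (j + 1) / J, size=n_per_stratum)
        r = np.sqrt(-2.0 * np.log1p(-u))
        angle = rng.uniform(0.0, 2.0 * np.pi, size=n_per_stratum)
        x = np.column_stack((r * np.cos(angle), r * np.sin(angle)))
        estimate += omega[j] / n_per_stratum * f(x).sum()
    return estimate

def f(x):
    return np.exp(-0.5 * (x ** 2).sum(axis=1))   # E[f(X)] = 1/2 for X ~ N2(0, I2)

rng = np.random.default_rng(0)
mu_strat = stratified_rings(f, J=7, n_per_stratum=3, rng=rng)
mu_plain = f(rng.standard_normal((21, 2))).mean()   # vanilla MC with the same budget
print(mu_strat, mu_plain)
```

With equal stratum probabilities and equal draws per stratum, this is an instance of proportional allocation.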

However, in our setting we cannot assume the ability to simulate
from within a given stratum, so we cannot decide ñj.
Nor can we assume that we know anything about ωj := P(X ∈ Dj).
We use a “post-stratification” approach (e.g. Owen 2013) [8]:
• first generate many x∗ ∼ p(x) (i.e. from the model simulator);
• count the number of x∗ ending up in each stratum Dj;
• call these frequencies nj.
So these frequencies are known after the simulation is done, not
before.
However, we still do not know anything about the ωj = P(X ∈ Dj).
We are going to address this soon within an ABC framework.
[8] Art Owen (2013), Monte Carlo theory, methods and examples.

Define strata for ABC
Suppose we have an ns-dimensional summary, i.e. ns = dim(sobs),
and consider the Gaussian kernel
$$K_\delta(s^*, s_{obs}) = \frac{1}{\delta^{n_s}} \exp\left(-\frac{1}{2\delta^2} (s^* - s_{obs})^\top \Sigma^{-1} (s^* - s_{obs})\right).$$
In ABC the µ to approximate via stratified MC is the likelihood
$$\int_D K_\delta(s^*, s_{obs}) \, p(s^* \mid \theta) \, ds^*$$
So let's partition D...

Define strata for ABC
Example defining three strata:
• D1 = {s∗ s.t. ||s∗ − sobs|| < δ/2}
• D2 = {s∗ s.t. ||s∗ − sobs|| < δ} \ D1
• D3 = D \ {D1 ∪ D2}
And more explicitly:
• $D_1 = \{s^* \text{ s.t. } (s^* - s_{obs})^\top \Sigma^{-1} (s^* - s_{obs}) \in (0, \delta/2]\}$
• $D_2 = \{s^* \text{ s.t. } (s^* - s_{obs})^\top \Sigma^{-1} (s^* - s_{obs}) \in (\delta/2, \delta]\}$
• $D_3 = \{s^* \text{ s.t. } (s^* - s_{obs})^\top \Sigma^{-1} (s^* - s_{obs}) \in (\delta, \infty)\}$.
Because of our resampling approach, for every θ we have R ≫ 1
simulated summaries, say R = 100.
We just need to count how many summaries fall into D1,
how many into D2 and how many into D3.
This gives us n1, n2 and n3 = R − (n1 + n2).

How about the strata probabilities?
We still need to estimate the strata probabilities ωj = P(s∗ ∈ Dj).
This is easy because $\omega_j = \int_{D_j} p(s^* \mid \theta) \, ds^*$, which we estimate by
another MC simulation.
So
1 simulate once from the model x∗ ∼ p(x|θ)
2 resample R times from x∗ to obtain x∗1, ..., x∗R
3 compute summaries s∗1, ..., s∗R
4 obtain distances $d_r := (s^{*r} - s_{obs})^\top \Sigma^{-1} (s^{*r} - s_{obs})$
Then set
$$\hat\omega_1 := \frac{1}{R} \sum_{r=1}^{R} \mathbb{I}_{\{d_r \le \delta/2\}}, \qquad \hat\omega_2 := \frac{1}{R} \sum_{r=1}^{R} \mathbb{I}_{\{\delta/2 < d_r \le \delta\}}, \qquad \hat\omega_3 := 1 - \sum_{j=1}^{2} \hat\omega_j.$$
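Steps 1 to 4 and the resulting estimates of the strata probabilities can be sketched as follows. The Gaussian toy simulator, the sample-mean summary, the value of δ and the scaling Σ = 1/n_obs (the variance of the sample mean) are all hypothetical choices, made only so that the three strata are populated in this illustration.

```python
import numpy as np

def strata_probabilities(s_star, s_obs, Sigma_inv, delta):
    """Estimate (omega_1, omega_2, omega_3) from R summaries (the rows of s_star)."""
    diff = s_star - s_obs
    d = np.einsum("ri,ij,rj->r", diff, Sigma_inv, diff)   # d_r, one per summary
    omega1 = np.mean(d <= delta / 2)
    omega2 = np.mean((d > delta / 2) & (d <= delta))
    omega3 = np.mean(d > delta)                  # equals 1 - omega1 - omega2
    return np.array([omega1, omega2, omega3]), d

rng = np.random.default_rng(0)
n_obs, R, delta = 1000, 100, 1.0                 # hypothetical settings
x_obs = rng.normal(0.0, 1.0, size=n_obs)
s_obs = np.array([x_obs.mean()])
Sigma_inv = np.array([[float(n_obs)]])           # assumed scaling: Sigma = 1 / n_obs

x_sim = rng.normal(0.0, 1.0, size=n_obs)                 # 1. simulate once
idx = rng.integers(0, n_obs, size=(R, n_obs))            # 2. resample R times
s_star = x_sim[idx].mean(axis=1, keepdims=True)          # 3. summaries, shape (R, 1)
omega_hat, d = strata_probabilities(s_star, s_obs, Sigma_inv, delta)  # 4. distances
print(omega_hat)
```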

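We finally have a (biased) estimator of the ABC likelihood using J
strata:
$$\hat{\hat\mu}_{strat} = \sum_{j=1}^{J} \frac{\hat\omega_j}{n_j} \sum_{r:\, s^{*r} \in D_j} K_\delta(s^{*r}, s_{obs}).$$
The bias is due both to resampling and to stratification with estimated ωj.
Notice the above is not quite OK. What if some nj = 0
(a neglected stratum)?
In our ABC-MCMC we reject the proposal θ∗ as soon as some nj = 0, so
we actually use
$$\hat{\hat\mu}_{strat} = \sum_{j=1}^{J} \frac{\hat\omega_j}{n_j} \sum_{r:\, s^{*r} \in D_j} K_\delta(s^{*r}, s_{obs}) \; \mathbb{I}_{\{n_j > 0,\ \forall j\}}$$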
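A hedged sketch of this estimator with three strata. It reads "another MC simulation" as a second, independent simulate-and-resample pass: one set of summaries gives the ω̂j, the other gives the counts nj and the kernel sums. That reading is an assumption; note that if the same summaries were used for both, ω̂j = nj/R and the estimator would reduce to the plain average of the kernel values. Function names, the toy simulator and the scaling Σ = 1/n_obs are hypothetical.

```python
import numpy as np

def quad_dist(s_star, s_obs, Sigma_inv):
    """d_r = (s*r - s_obs)' Sigma^{-1} (s*r - s_obs) for each row of s_star."""
    diff = s_star - s_obs
    return np.einsum("ri,ij,rj->r", diff, Sigma_inv, diff)

def strat_abc_lik(s_omega, s_kernel, s_obs, Sigma_inv, delta):
    """Stratified ABC likelihood estimate, strata (0, d/2], (d/2, d], (d, inf)."""
    edges = np.array([delta / 2, delta])
    j_omega = np.searchsorted(edges, quad_dist(s_omega, s_obs, Sigma_inv))
    d_k = quad_dist(s_kernel, s_obs, Sigma_inv)
    j_kernel = np.searchsorted(edges, d_k)             # stratum index 0, 1 or 2
    omega_hat = np.bincount(j_omega, minlength=3) / len(s_omega)
    n = np.bincount(j_kernel, minlength=3)             # counts n_1, n_2, n_3
    if np.any(n == 0):                                 # neglected stratum: indicator is 0
        return 0.0
    kern = np.exp(-0.5 * d_k / delta**2) / delta**s_obs.size   # Gaussian kernel
    sums = np.bincount(j_kernel, weights=kern, minlength=3)    # per-stratum kernel sums
    return float(np.sum(omega_hat / n * sums))

def simulate_and_resample(theta, n_obs, R, rng):
    """One simulator call, then R bootstrap summaries (sample means)."""
    x_sim = rng.normal(theta, 1.0, size=n_obs)
    idx = rng.integers(0, n_obs, size=(R, n_obs))
    return x_sim[idx].mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_obs, R, delta = 1000, 100, 1.0                 # hypothetical settings
s_obs = np.array([0.0])                          # hypothetical observed summary
Sigma_inv = np.array([[float(n_obs)]])           # assumed scaling: Sigma = 1 / n_obs
for theta in (0.0, 0.5):
    s_omega = simulate_and_resample(theta, n_obs, R, rng)
    s_kernel = simulate_and_resample(theta, n_obs, R, rng)
    print(theta, strat_abc_lik(s_omega, s_kernel, s_obs, Sigma_inv, delta))
```

Returning 0.0 for a neglected stratum makes the Metropolis-Hastings ratio zero, so the proposal is rejected automatically.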

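Stratified MC within ABC-MCMC
As usual, we accept a proposal using a MH step:
propose θ∗ ∼ q(θ|θ#) and accept with probability
$$1 \wedge \frac{\hat{\hat\mu}_{strat}(\theta^*)}{\hat{\hat\mu}_{strat}(\theta^\#)} \cdot \frac{\pi(\theta^*)}{\pi(\theta^\#)} \cdot \frac{q(\theta^\# \mid \theta^*)}{q(\theta^* \mid \theta^\#)}$$
If we accept, set $\theta^\# := \theta^*$ and $\hat{\hat\mu}_{strat}(\theta^\#) := \hat{\hat\mu}_{strat}(\theta^*)$.
Repeat a few thousand times.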

Reprising the Gaussian example
This is a super-trivial study, but it is still instructive.
Data: 1000 iid observations ∼ N(θ = 0, 1). Gaussian prior →
exact posterior
Red: exact posterior.
Blue: different types of ABC-MCMC posteriors.
[Figure: three panels, horizontal axis from −0.2 to 0.2, vertical axis from 0 to 14. (a) pseudomarginal ABC, M = 100; (b) M = 1 and R = 100 resamples; (c) M = 1, R = 100 and stratification.]
With stratification and only M = 1 we get results as good as with
M = 100 (compare left and right).

Time series
Our methodology is not restricted to iid data.
The next example uses the “block bootstrap” [9], where we resample blocks
of observations for stationary time series:
• blocks are chosen to be sufficiently large that they retain
the short-range dependence structure of the data;
• so that a resampled time series, constructed by concatenating
resampled blocks, has similar statistical properties to the real data.
$$\mathcal{B} = \Big\{ \underbrace{(1 : B)}_{\text{block } 1},\ \underbrace{(B + 1 : 2B)}_{\text{block } 2},\ \ldots,\ \underbrace{(n_{obs} - B + 1 : n_{obs})}_{\text{block } n_{obs}/B} \Big\}.$$
At each θ we resample blocks of indices of simulated data.
[9] Kunsch, H. R. (1989). The jackknife and the bootstrap for general
stationary observations. The Annals of Statistics, 1217-1241.
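The block resampling of indices can be sketched as below, assuming non-overlapping blocks and a series length that is a multiple of the block size B, as in the set of blocks written above. The function name and the white-noise placeholder series are hypothetical.

```python
import numpy as np

def block_bootstrap_indices(n_obs, B, rng):
    """Indices of a resampled series built from non-overlapping blocks of length B."""
    if n_obs % B != 0:
        raise ValueError("this sketch assumes n_obs is a multiple of B")
    n_blocks = n_obs // B
    blocks = np.arange(n_obs).reshape(n_blocks, B)       # row k holds block k + 1
    chosen = rng.integers(0, n_blocks, size=n_blocks)    # blocks drawn with replacement
    return blocks[chosen].ravel()                        # concatenate the chosen blocks

rng = np.random.default_rng(0)
x_sim = rng.normal(size=32)            # placeholder for a simulated time series
idx = block_bootstrap_indices(n_obs=32, B=8, rng=rng)
x_resampled = x_sim[idx]
print(idx)
```

Within each block the original time ordering is preserved, which is what retains the short-range dependence.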

2D Lotka-Volterra time series
A predator-prey model with an intractable likelihood (Markov
jump process).
Two interacting species: X1 (# predators) and X2 (# prey).
Populations evolve according to three interactions:
• A prey may be born, with rate θ1X2, increasing X2 by one.
• The predator-prey interaction in which X1 increases by one
and X2 decreases by one, with rate θ2X1X2.
• A predator may die, with rate θ3X1, decreasing X1 by one.
Trajectories of the process may be simulated exactly using the “Gillespie
algorithm” [10].
[10] Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical
reactions. The Journal of Physical Chemistry, 81(25), 2340-2361.
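A minimal sketch of Gillespie's algorithm for the three reactions listed above, with the state ordered as (X1, X2) = (# predators, # prey). The initial state, the time horizon and the event cap are hypothetical; the rates used in the demo are the "true parameters" quoted later in the results table.

```python
import numpy as np

def gillespie_lv(theta, x0, t_end, rng, max_events=1_000_000):
    """Exact simulation of the Lotka-Volterra jump process up to time t_end."""
    th1, th2, th3 = theta
    # state changes (dX1, dX2) for: prey birth, predation, predator death
    changes = np.array([[0, 1], [1, -1], [-1, 0]])
    t, x = 0.0, np.array(x0, dtype=int)
    times, states = [t], [x.copy()]
    for _ in range(max_events):
        rates = np.array([th1 * x[1], th2 * x[0] * x[1], th3 * x[0]], dtype=float)
        total = rates.sum()
        if total == 0.0:                       # all rates are zero: no further events
            break
        t += rng.exponential(1.0 / total)      # waiting time to the next event
        if t > t_end:
            break
        k = rng.choice(3, p=rates / total)     # which reaction fires
        x = x + changes[k]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

rng = np.random.default_rng(0)
times, states = gillespie_lv(theta=(1.0, 0.005, 0.6), x0=(50, 100), t_end=2.0, rng=rng)
print(len(times), states[-1])
```

The output is the full event history; observations on a fixed time grid would be read off from it, and the grid itself is not specified here.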

We have 32 observations for each species simulated via Gillespie’s
algorithm.
At each θ we simulate and resample 4 blocks each having size
B = 8.
We want inference for reaction rates (θ1, θ2, θ3).
Set vague priors log θj ∼ U(−6, 2)

We run several experiments:
• standard ABC-MCMC with M = 1 independent dataset for each θ;
• ABC-MCMC with M = 1, R = 100 resampled datasets, allocated
across three strata:
  • D1 = {s∗ s.t. distance ∈ (0, δ/2]}
  • D2 = {s∗ s.t. distance ∈ (δ/2, δ]}
  • D3 = {s∗ s.t. distance ∈ (δ, ∞)}

                          θ1                 θ2                     θ3                   IAT  accept. rate (%)
true parameters           1                  0.005                  0.6
standard ABC (M = 1)      1.011 [0.93,1.13]  0.005 [0.0046,0.0055]  0.575 [0.504,0.627]  145  2.5
stratified ABC (R = 100)  0.989 [0.88,1.11]  0.005 [0.0044,0.0056]  0.577 [0.479,0.668]  114  6.5
Table: Mean and 95% posterior intervals for θ, with integrated autocorrelation time (IAT) and acceptance rate.
For stratified ABC we used a δ five times larger than for standard ABC!
With stratification: similar inference, but better acceptance rate and
lower IAT.

Conclusions
• stratified Monte Carlo is straightforward to implement and
effective in reducing resampling bias;
• Allows for precise ABC while using a larger δ;
• smaller variance ABC likelihood → better mixing MCMC;
• Downside: neglected strata may increase rejection rate;
• more research needed for constructing optimal strata;
• Ongoing work: comments most welcome!
• more examples at:
P and Everitt (2019). Stratified sampling and resampling for
approximate Bayesian computation,
https://arxiv.org/abs/1806.05982
picchini@chalmers.se
@uPicchini

ABC loglikelihoods for Gaussian example
[Figure: horizontal axis from −0.1 to 0.1, vertical axis from −1 to 8. Legend: standard ABC M=100; resampling M=1, R=100; stratification M=1, R=100.]
Figure: 1D Gauss model: loglikelihood estimated via standard ABC (red),
resampling ABC (blue), resampling + stratification (magenta). Solid lines are
mean values over 500 estimations. Dashed lines are 2.5 and 97.5 percentiles.
