Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

An [under]view of Monte Carlo methods, from
importance sampling to MCMC, to ABC
(& kudos to Bernoulli)
Christian P. Robert
Universit´e Paris-Dauphine, University of Warwick, & CREST, Paris
2013 WSC, Hong Kong
bayesianstatistics@gmail.com

Outline
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation
(ABC)

Bernoulli as founding father of Monte Carlo methods
The weak law of large numbers (or Bernoulli’s [Golden] theorem)
provides the justiﬁcation for Monte Carlo approximations:
if x1, . . . , xn are i.i.d. rv’s with density f ,
lim
n→∞
h(x1) + . . . + h(xn)
n
=
X
h(x)f (x) dx
Stigler’s Law of Eponimy: Cardano (1501–1576) ﬁrst stated the
result

...and indeed
h(x1) + . . . + h(xn)
n
converges to
I =
X
h(x)f (x) dx

...and indeed
h(x1) + . . . + h(xn)
n
converges to
I =
X
h(x)f (x) dx
...meaning that provided we can simulate xi ∼ f (·) long and fast
“enough”, the empirical mean will be a good “enough”
approximation to I

Early implementations of the LLN
While Jakob Bernoulli
himself apparently did not
engage in simulation,
Buﬀon (1707–1788) resorted
to a (not-yet-Monte-Carlo)
experiment in 1735 to
estimate the value of the
Saint Petersburg game
(even though he did not
perform a similar experiment
for estimating π)
[Stigler, STS, 1991; Stigler, JRSS A, 2010]

De Forest (1834–1888)
found the median of a
log-Cauchy distribution,
using normal simulations
approximated to the second
digit (in 1876)

followed closely by the
ubuquitous Galton using
“normal” dice in 1890, after
developping the Quincunx,
used both for checking the
CLT and simulating from a
posterior distribution as
early as 1877

Importance Sampling
When focussing on integral approximation, very loose principle in
that proposal distribution with pdf q(·) leads to alternative
representation
I =
X
h(x){f /q}(x) q(x) dx
Principle of importance
Generate an iid sample x1, . . . , xn ∼ q(·) and estimate I by
^IIS
= n−1
n
i=1
h(xi ){f /q}(xi ).
...provided q is positive on the right set

things aren’t all rosy...
LLN not sufficient to justify Monte
Carlo methods: if
n−1
n
i=1
h(xi ){f /q}(xi )
has an infinite variance, the estimator
ÎIS is useless Importance sampling estimation of
P(2 Z 6) Z is Cauchy and
importance is normal, compared
with exact value, 0.095.

The harmonic mean estimator
Bayesian posterior distribution deﬁned as
π(θ|x) = π(θ)L(θ|x)/m(x)
When θi ∼ π(θ|x),
1
T
T
t=1
1
L(θt|x)
is an unbiased estimator of 1/m(x)
[Gelfand & Dey, 1994; Newton & Raftery, 1994]
Highly hazardous material: Most often leads to an inﬁnite
variance!!!

“The Worst Monte Carlo Method Ever”
“The good news is that the Law of Large Numbers guarantees that this
estimator is consistent ie, it will very likely be very close to the correct
answer if you use a suﬃciently large number of points from the posterior
distribution.
The bad news is that the number of points required for this estimator to
get close to the right answer will often be greater than the number of
atoms in the observable universe. The even worse news is that it’s easy
for people to not realize this, and to na¨ıvely accept estimates that are
nowhere close to the correct value of the marginal likelihood.”
[Radford Neal’s blog, Aug. 23, 2008]

Comparison with regular importance sampling
Harmonic mean: Constraint opposed to usual importance sampling
constraints: proposal ϕ(·) must have lighter (rather than fatter)
tails than π(·)L(·) for the approximation
1
1
T
T
t=1
ϕ(θt)
πk(θt)L(θt)
θt ∼ ϕ(·)
to have a ﬁnite variance.
E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ

HPD indicator as ϕ
Use the convex hull of MCMC simulations (θt)t=1,...,T
corresponding to the 10% HPD region (easily derived!) and ϕ as
indicator:
ϕ(θ) =
10
T
t∈HPD
Id(θ,θt )
[X & Wraith, 2009]

Bayesian computing (R)evolution
(ABC)

computational jam
In the 1970’s and early 1980’s, theoretical foundations of Bayesian
statistics were sound, but methodology was lagging for lack of
computing tools.
restriction to conjugate priors
limited complexity of models
small sample sizes
The ﬁeld was desperately in need of a new computing paradigm!
[X & Casella, STS, 2012]

MCMC as in Markov Chain Monte Carlo
Notion that i.i.d. simulation is deﬁnitely not necessary, all that
matters is the ergodic theorem
Realization that Markov chains could be used in a wide variety of
situations only came to mainstream statisticians with Gelfand and
Smith (1990) despite earlier publications in the statistical literature
like Hastings (1970) and growing awareness in spatial statistics
(Besag, 1986)
Reasons:
lack of computing machinery
lack of background on Markov chains
lack of trust in the practicality of the method

pre-Gibbs/pre-Hastings era
Early 1970’s, Hammersley, Clifford, and Besag were working on the
specification of joint distributions from conditional distributions
and on necessary and sufficient conditions for the conditional
distributions to be compatible with a joint distribution.
[Hammersley and Clifford, 1971]

pre-Gibbs/pre-Hastings era
Early 1970’s, Hammersley, Clifford, and Besag were working on the
specification of joint distributions from conditional distributions
and on necessary and sufficient conditions for the conditional
distributions to be compatible with a joint distribution.
“What is the most general form of the conditional
probability functions that define a coherent joint
function? And what will the joint look like?”
[Besag, 1972]

Hammersley-Cliﬀord[-Besag] theorem
Theorem (Hammersley-Cliﬀord)
Joint distribution of vector associated with a dependence graph
must be represented as product of functions over the cliques of the
graphs, i.e., of functions depending only on the components
indexed by the labels in the clique.
[Cressie, 1993; Lauritzen, 1996]

A probability distribution P with positive and continuous density f
satisﬁes the pairwise Markov property with respect to an
undirected graph G if and only if it factorizes according to G, i.e.,
(F) ≡ (G)

Under the positivity condition, the joint distribution g satisﬁes
g(y1, . . . , yp) ∝
p
j=1
g j
(y j
|y 1 , . . . , y j−1
, y j+1
, . . . , y p
)
g j
(y j
|y 1 , . . . , y j−1
, y j+1
, . . . , y p
)
for every permutation on {1, 2, . . . , p} and every y ∈ Y.

Clicking in
After Peskun (1973), MCMC mostly dormant in mainstream
statistical world for about 10 years, then several papers/books
highlighted its usefulness in speciﬁc settings:
Geman and Geman (1984)
Besag (1986)
Strauss (1986)
Ripley (Stochastic Simulation, 1987)
Tanner and Wong (1987)
Younes (1988)

[Re-]Enters the Gibbs sampler
Geman and Geman (1984), building on
Metropolis et al. (1953), Hastings (1970), and
Peskun (1973), constructed a Gibbs sampler
for optimisation in a discrete image processing
problem with a Gibbs random ﬁeld without
completion.
Back to Metropolis et al., 1953: the Gibbs
sampler is already in use therein and ergodicity
is proven on the collection of global maxima

Removing the jam
In early 1990s, researchers found that Gibbs and then Metropolis -
Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991;
Wang & al., 1993, 1994)
generalized linear mixed models (Albert & Chib, 1993)
mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994;
Escobar & West, 1993)
changepoint analysis (Carlin & al., 1992)
point processes (Grenander & Møller, 1994)
&tc

Removing the jam
In early 1990s, researchers found that Gibbs and then Metropolis -
Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
genomics (Stephens & Smith, 1993; Lawrence & al., 1993;
Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly,
2000)
ecology (George & X, 1992)
variable selection in regression (George & mcCulloch, 1993; Green,
1995; Chen & al., 2000)
spatial statistics (Raftery & Banﬁeld, 1991; Besag & Green, 1993))
longitudinal studies (Lange & al., 1992)
&tc

MCMC and beyond
reversible jump MCMC which impacted considerably Bayesian model
choice (Green, 1995)
adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal,
2009)
exact approximations to targets (Tanner & Wong, 1987; Beaumont,
2003; Andrieu & Roberts, 2009)
comp’al stats catching up with comp’al physics: free energy sampling
(e.g., Wang-Landau), Hamilton Monte Carlo (Girolami & Calderhead,
2011)
sequential Monte Carlo (SMC) for non-sequential problems (Chopin,
2002; Neal, 2001; Del Moral et al 2006)
retrospective sampling
intractability: EP – GIMH – PMCMC – SMC2
– INLA
QMC[MC] (Owen, 2011)

Particles
Iterating/sequential importance sampling is about as old as Monte
Carlo methods themselves!
[Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955]
Found in the molecular simulation literature of the 50’s with
self-avoiding random walks and signal processing
[Marshall, 1965; Handschin and Mayne, 1969]
Use of the term “particle” dates back to Kitagawa (1996), and Carpenter
et al. (1997) coined the term “particle ﬁlter”.

pMC & pMCMC
Recycling of past simulations legitimate to build better
importance sampling functions as in population Monte Carlo
[Iba, 2000; Capp´e et al, 2004; Del Moral et al., 2007]
synthesis by Andrieu, Doucet, and Hollenstein (2010) using
particles to build an evolving MCMC kernel ^pθ(y1:T ) in state
space models p(x1:T )p(y1:T |x1:T )
importance sampling on discretely observed diﬀusions
[Beskos et al., 2006; Fearnhead et al., 2008, 2010]

Reinterpretation and
Rao-Blackwellisation
Russian roulette
(ABC)

Metropolis Hastings algorithm
1. We wish to approximate
I =
h(x)π(x)dx
π(x)dx
= h(x)¯π(x)dx
2. π(x) is known but not π(x)dx.
3. Approximate I with δ = 1
n
n
t=1 h(x(t)) where (x(t)) is a
Markov chain with limiting distribution ¯π.
4. Convergence obtained from Law of Large Numbers or CLT for
Markov chains.

Metropolis Hasting Algorithm
Suppose that x(t) is drawn.
1. Simulate yt ∼ q(·|x(t)).
2. Set x(t+1) = yt with probability
α(x(t)
, yt) = min 1,
π(yt)
π(x(t))
q(x(t)|yt)
q(yt|x(t))
Otherwise, set x(t+1) = x(t).
3. α is such that the detailed balance equation is satisﬁed: ¯π
is the stationary distribution of (x(t)).
The accepted candidates are simulated with the rejection
algorithm.

Metropolis Hasting Algorithm
Suppose that x(t) is drawn.
1. Simulate yt ∼ q(·|x(t)).
2. Set x(t+1) = yt with probability
α(x(t)
, yt) = min 1,
π(yt)
π(x(t))
q(x(t)|yt)
q(yt|x(t))
Otherwise, set x(t+1) = x(t).
3. α is such that the detailed balance equation is satisﬁed:
π(x)q(y|x)α(x, y) = π(y)q(x|y)α(y, x).
¯π is the stationary distribution of (x(t)).
The accepted candidates are simulated with the rejection
algorithm.

Some properties of the HM algorithm
Alternative representation of the estimator δ is
δ =
1
n
n
t=1
h(x(t)
) =
1
n
Mn
i=1
ni h(zi ) ,
where
zi ’s are the accepted yj ’s,
Mn is the number of accepted yj ’s till time n,
ni is the number of times zi appears in the sequence (x(t))t.

”accepted” Markov chain
Lemma (Douc & X., AoS, 2011)
The sequence (zi , ni ) satisﬁes
1. (zi , ni )i is a Markov chain;
2. zi+1 and ni are independent given zi ;
3. ni is distributed as a geometric random variable with
probability parameter
p(zi ) := α(zi , y) q(y|zi ) dy ; (1)
4. (zi )i is a Markov chain with transition kernel
˜Q(z, dy) = ˜q(y|z)dy and stationary distribution ˜π such that
˜q(·|z) ∝ α(z, ·) q(·|z) and ˜π(·) ∝ π(·)p(·) .

Importance sampling perspective
1. A natural idea:
δ∗
=
1
n
Mn
i=1
h(zi )
p(zi )
,

1. A natural idea:
δ∗
Mn
i=1
h(zi )
p(zi )
Mn
i=1
1
p(zi )
=
Mn
i=1
π(zi )
˜π(zi )
h(zi )
Mn
i=1
π(zi )
˜π(zi )
.

1. A natural idea:
δ∗
Mn
i=1
h(zi )
p(zi )
Mn
i=1
1
p(zi )
=
Mn
i=1
π(zi )
˜π(zi )
h(zi )
Mn
i=1
π(zi )
˜π(zi )
.
2. But p not available in closed form.

1. A natural idea:
δ∗
Mn
i=1
h(zi )
p(zi )
Mn
i=1
1
p(zi )
=
Mn
i=1
π(zi )
˜π(zi )
h(zi )
Mn
i=1
π(zi )
˜π(zi )
.
2. But p not available in closed form.
3. The geometric ni is the replacement, an obvious solution that
is used in the original Metropolis–Hastings estimate since
E[ni ] = 1/p(zi ).

The Bernoulli factory
The crude estimate of 1/p(zi ),
ni = 1 +
∞
j=1 j
I {u α(zi , y )} ,
can be improved:
Lemma (Douc & X., AoS, 2011)
If (yj )j is an iid sequence with distribution q(y|zi ), the quantity
^ξi = 1 +
∞
j=1 j
{1 − α(zi , y )}
is an unbiased estimator of 1/p(zi ) which variance, conditional on
zi , is lower than the conditional variance of ni , {1 − p(zi )}/p2(zi ).

Rao-Blackwellised, for sure?
^ξi = 1 +
∞
j=1 j
{1 − α(zi , y )}
1. Infinite sum but finite with at least positive probability:
α(x(t)
, yt) = min 1,
π(yt)
π(x(t))
q(x(t)|yt)
q(yt|x(t))
For example: take a symmetric random walk as a proposal.
2. What if we wish to be sure that the sum is finite?
Finite horizon k version:
^ξk
i = 1 +
∞
j=1 1 k∧j
1 − α(zi , yj )
k+1 j
I {u α(zi , y )}

which Bernoulli factory?!
Not the spice warehouse of Leon Bernoulli!
Query:
Given an algorithm delivering iid B(p) rv’s, is it possible to derive
an algorithm delivering iid B(p) rv’s when f is known and p
unknown?
[von Neumann, 1951; Keane & O’Brien, 1994]
existence (e.g., impossible for f (p) = min(2p, 1))
condition: for some n,
min{f (p), 1 − f (p)} min{p, 1 − p}n
implementation (polynomial vs. exponential time)
use of sandwiching polynomials/power series

Variance improvement
Theorem (Douc & X., AoS, 2011)
If (yj )j is an iid sequence with distribution q(y|zi ) and (uj )j is an
iid uniform sequence, for any k 0, the quantity
^ξk
i = 1 +
∞
j=1 1 k∧j
1 − α(zi , yj )
k+1 j
I {u α(zi , y )}
is an unbiased estimator of 1/p(zi ) with an almost sure ﬁnite
number of terms.

^ξk
i = 1 +
∞
j=1 1 k∧j
1 − α(zi , yj )
k+1 j
I {u α(zi , y )}
number of terms. Moreover, for k 1,
V ^ξk
i zi =
1 − p(zi )
p2(zi )
−
1 − (1 − 2p(zi ) + r(zi ))k
2p(zi ) − r(zi )
2 − p(zi )
p2(zi )
(p(zi ) − r(zi )) ,
where p(zi ) := α(zi , y) q(y|zi ) dy. and r(zi ) := α2(zi , y) q(y|zi ) dy.

^ξk
i = 1 +
∞
j=1 1 k∧j
1 − α(zi , yj )
k+1 j
I {u α(zi , y )}
number of terms. Therefore, we have
V ^ξi zi V ^ξk
i zi V ^ξ0
i zi = V [ni | zi ] .

B motivation for Russian roulette
drior π(θ), data density p(y|θ) = f (y; θ)/Z(θ) with
Z(θ) = f (x; θ)dx
intractable (e.g., Ising spin model, MRF, diﬀusion processes,
networks, &tc)
doubly-intractable posterior follows as
π(θ|y) = p(y|θ) × π(θ) ×
1
Z(y)
=
f (y; θ)
Z(θ)
× π(θ) ×
1
Z(y)
where Z(y) = p(y|θ)π(θ)dθ
both Z(θ) and Z(y) are intractable with massively diﬀerent
consequences
[thanks to Mark Girolami for his Russian slides!]

B motivation for Russian roulette
If Z(θ) is intractable, Metropolis–Hasting acceptance
probability
α(θ , θ) = min 1,
f (y; θ )π(θ )
f (y; θ)π(θ)
×
q(θ|θ )
q(θ |θ)
×
Z(θ)
Z(θ )
is not available
Use instead biased approximations e.g. pseudo-likelihoods,
plugin ^Z(θ ) estimates without sacriﬁcing exactness of MCMC

Existing solution
Unbiased plugin estimate
Z(θ)
Z(θ )
≈
f (x; θ)
f (x; θ )
where x ∼
f (x; θ )
Z(θ )
[Møller et al, Bka, 2006; Murray et al 2006]
auxiliary variable method
removes Z(θ) Z(θ ) from the picture
require simulations from the model (e.g., via perfect sampling)

Exact approximate methods
Pseudo-Marginal construction that allows for the use of unbiased,
positive estimates of target in acceptance probability
α(θ , θ) = min 1,
^π(θ |y)
^π(θ|y)
×
q(θ|θ )
q(θ |θ)
[Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al, 2012]
Transition kernel has invariant distribution with exact target
density π(θ|y)

Infinite series estimator
For each (θ, y), construct rv’s {V
(j)
θ , j 0} such that
^π(θ, {V
(j)
θ }|y) :=
∞
j=0
V
(j)
θ
is a.s. finite with finite expectation
E ^π(θ, {V
(j)
θ } |y) = π(θ|y)

(j)
θ , j 0} such that
^π(θ, {V
(j)
θ }|y) :=
∞
j=0
V
(j)
θ
E ^π(θ, {V
(j)
θ } |y) = π(θ|y)
Introduce a random stopping time τθ, such that with
ξ := (τθ, {V
(j)
θ , 0 j τθ}) the estimate
^π(θ, ξ|y) :=
τθ
j=0
V
(j)
θ
satisﬁes
E ^π(θ, ξ|y)|{V
(j)
θ , j 0} = ^π(θ, {V
(j)
θ }|y)

(j)
θ , j 0} such that
^π(θ, {V
(j)
θ }|y) :=
∞
j=0
V
(j)
θ
E ^π(θ, {V
(j)
θ } |y) = π(θ|y)
Warning: unbiased estimate ^π(θ, ξ|y) using series
construction no general guarantee of positivity

Russian roulette
Method that requires unbiased truncation of a series
S(θ) =
∞
i=0
φi (θ)
Russian roulette employed extensively in simulation of neutron
scattering and computer graphics
Assign probabilities {qj , j 1} qj ∈ (0, 1] and generate
U(0, 1) i.i.d. r.v’s {Uj , j 1}
Find the ﬁrst time k 1 such that Uk qk
Russian roulette estimate of S(θ) is
^S(θ) =
k
j=0
φj (θ)
j−1
i=1
qi ,
[Girolami, Lyne, Strathman, Simpson, & Atchad´e, arXiv:1306.4032]

Russian roulette
S(θ) =
∞
i=0
φi (θ)
U(0, 1) i.i.d. r.v’s {Uj , j 1}
^S(θ) =
k
j=0
φj (θ)
j−1
i=1
qi ,
If limn→∞
n
j=1 qj = 0, Russian roulette terminates with
probability one

Russian roulette
S(θ) =
∞
i=0
φi (θ)
U(0, 1) i.i.d. r.v’s {Uj , j 1}
^S(θ) =
k
j=0
φj (θ)
j−1
i=1
qi ,
E{^S(θ)} = S(θ)
variance ﬁnite under certain known conditions

towards ever more complexity
(ABC)

New challenges
Novel statisticial issues that forces a diﬀerent Bayesian answer:
very large datasets
complex or unknown dependence structures with maybe p n
multiple and involved random eﬀects
missing data structures containing most of the information
sequential structures involving most of the above

New paradigm?
“Surprisingly, the conﬁdent prediction of the previous
generation that Bayesian methods would ultimately supplant
frequentist methods has given way to a realization that Markov
chain Monte Carlo (MCMC) may be too slow to handle
modern data sets. Size matters because large data sets stress
computer storage and processing power to the breaking point.
The most successful compromises between Bayesian and
frequentist methods now rely on penalization and
optimization.”
[Lange at al., ISR, 2013]

New paradigm?
sad reality constraint that
size does matter
focus on much smaller
dimensions and on sparse
summaries
many (fast if non-Bayesian)
ways of producing those
summaries
Bayesian inference can kick
in almost automatically at
this stage

Approximate Bayesian computation (ABC)
Case of a well-deﬁned statistical model where the likelihood
function
(θ|y) = f (y1, . . . , yn|θ)
is out of reach!
Empirical approximations to the original
Bayesian inference problem
Degrading the data precision down
to a tolerance ε
Replacing the likelihood with a
non-parametric approximation
Summarising/replacing the data
with insuﬃcient statistics

ABC methodology
Bayesian setting: target is π(θ)f (x|θ)
When likelihood f (x|θ) not in closed form, likelihood-free rejection
technique:
Foundation
For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps
jointly simulating
θ ∼ π(θ) , z ∼ f (z|θ ) ,
until the auxiliary variable z is equal to the observed value, z = y,
then the selected
θ ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griﬃth et al., 1997]

ABC algorithm
In most implementations, degree of approximation:
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
repeat
generate θ from the prior distribution π(·)
generate z from the likelihood f (·|θ )
until ρ{η(z), η(y)}
set θi = θ
end for
where η(y) deﬁnes a (not necessarily suﬃcient) statistic

Comments
role of distance paramount
(because = 0)
scaling of components of η(y) also
capital
matters little if “small enough”
representative of “curse of
dimensionality”
small is beautiful!, i.e. data as a
whole may be weakly informative
for ABC
non-parametric method at core

ABC simulation advances
Simulating from the prior is often poor in eﬃciency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Marjoram et al, 2003; Beaumont et al., 2009, Del Moral et al., 2012]
...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger
[Beaumont et al., 2002; Blum & Fran¸cois, 2010; Biau et al., 2013]
.....or even by including in the inferential framework [ABCµ]
[Ratmann et al., 2009]

ABC as an inference machine
Starting point is summary statistic
η(y), either chosen for computational
realism or imposed by external
constraints
ABC can produce a distribution on the parameter of interest
conditional on this summary statistic η(y)
inference based on ABC may be consistent or not, so it needs
to be validated on its own
the choice of the tolerance level is dictated by both
computational and convergence constraints

How Bayesian aBc is..?
At best, ABC approximates π(θ|η(y)):
approximation error unknown (w/o massive simulation)
pragmatic or empirical Bayes (there is no other solution!)
many calibration issues (tolerance, distance, statistics)
the NP side should be incorporated into the whole Bayesian
picture
the approximation error should also be part of the Bayesian
inference

Noisy ABC
ABC approximation error (under non-zero tolerance ) replaced
with exact simulation from a controlled approximation to the
target, convolution of true posterior with kernel function
π (θ, z|y) =
π(θ)f (z|θ)K (y − z)
π(θ)f (z|θ)K (y − z)dzdθ
,
with K kernel parameterised by bandwidth .
[Wilkinson, 2013]
Theorem
The ABC algorithm based on a randomised observation y = ˜y + ξ,
ξ ∼ K , and an acceptance probability of
K (y − z)/M
gives draws from the posterior distribution π(θ|y).

Which summary?
Fundamental difficulty of the choice of the summary statistic when
there is no non-trivial sufficient statistics [except when done by the
experimenters in the field]

Which summary?
Loss of statistical information balanced against gain in data
roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation
borrowing tools from data analysis (LDA) machine learning
[Estoup et al., ME, 2012]

Which summary?
may be imposed for external/practical reasons
may gather several non-B point estimates
we can learn about eﬃcient combination
distance can be provided by estimation techniques

Which summary for model choice?
‘This is also why focus on model discrimination typically (...)
proceeds by (...) accepting that the Bayes Factor that one obtains
is only derived from the summary statistics and may in no way
correspond to that of the full model.’
[S. Sisson, Jan. 31, 2011, xianblog]
Depending on the choice of η(·), the Bayes factor based on this
insuﬃcient statistic,
Bη
12(y) =
π1(θ1)f η
1 (η(y)|θ1) dθ1
π2(θ2)f η
2 (η(y)|θ2) dθ2
,
is either consistent or not
[X et al., PNAS, 2012]

Which summary for model choice?
Depending on the choice of η(·), the Bayes factor based on this
insuﬃcient statistic,
Bη
12(y) =
π1(θ1)f η
1 (η(y)|θ1) dθ1
π2(θ2)f η
2 (η(y)|θ2) dθ2
,
is either consistent or not
[X et al., PNAS, 2012]
q
q
q
q
q
q
q
q
q
q
q
Gauss Laplace
0.00.10.20.30.40.50.60.7
n=100
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
Gauss Laplace
0.00.20.40.60.81.0
n=100

Selecting proper summaries
Consistency only depends on the range of
µi (θ) = Ei [η(y)]
under both models against the asymptotic mean µ0 of η(y)
Theorem
If Pn belongs to one of the two models and if µ0 cannot be
attained by the other one :
0 = min (inf{|µ0 − µi (θi )|; θi ∈ Θi }, i = 1, 2)
< max (inf{|µ0 − µi (θi )|; θi ∈ Θi }, i = 1, 2) ,
then the Bayes factor Bη
12 is consistent
[Marin et al., JRSS B, 2013]

Selecting proper summaries
Consistency only depends on the range of
µi (θ) = Ei [η(y)]
under both models against the asymptotic mean µ0 of η(y)
q
M1 M2
0.30.40.50.60.7
q
q
q
q
M1 M2
0.30.40.50.60.7
M1 M2
0.30.40.50.60.7
q
q
q
q
q
q
q
q
M1 M2
0.00.20.40.60.8
q
qq
q
q
q
q
q
q
q
q
qq
q
q
q
q
q
q
M1 M2
0.00.20.40.60.81.0
q
q
q
q
q
q
q
q
M1 M2
0.00.20.40.60.81.0
q
q
q
q
q
q
q
M1 M2
0.00.20.40.60.8
q
q
qq
q
q
q
q
qq
q
q
q
q
q
q
q
M1 M2
0.00.20.40.60.81.0
q
q
qq
q
qq
M1 M2
0.00.20.40.60.81.0
[Marin et al., JRSS B, 2013]

Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Similar to Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013 (20)

More from Christian Robert

More from Christian Robert (20)

Recently uploaded

Recently uploaded (20)

Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013