Likelihood-free Methods




                           Likelihood-free Methods

                                  Christian P. Robert

                             Université Paris-Dauphine & CREST
                          http://www.ceremade.dauphine.fr/~xian


                                  February 16, 2012
Likelihood-free Methods




Outline
Introduction

The Metropolis-Hastings Algorithm

simulation-based methods in Econometrics

Genetics of ABC

Approximate Bayesian computation

ABC for model choice

ABC model choice consistency
Likelihood-free Methods
   Introduction




Approximate Bayesian computation

       Introduction
           Monte Carlo basics
           Population Monte Carlo
           AMIS

       The Metropolis-Hastings Algorithm

       simulation-based methods in Econometrics

       Genetics of ABC

       Approximate Bayesian computation
Likelihood-free Methods
   Introduction
     Monte Carlo basics


General purpose



       Given a density π known up to a normalizing constant, and an
       integrable function h, compute

              Π(h) = ∫ h(x) π(x) µ(dx) = ∫ h(x) π̃(x) µ(dx) / ∫ π̃(x) µ(dx)

       when ∫ h(x) π̃(x) µ(dx) is intractable.
Likelihood-free Methods
   Introduction
     Monte Carlo basics


Monte Carlo basics

       Generate an iid sample x_1, . . . , x_N from π and estimate Π(h) by

              Π̂_N^MC(h) = N^{−1} Σ_{i=1}^N h(x_i).

       LLN: Π̂_N^MC(h) −→ Π(h) a.s.
       If Π(h²) = ∫ h²(x) π(x) µ(dx) < ∞,

       CLT: √N (Π̂_N^MC(h) − Π(h)) → N(0, Π{[h − Π(h)]²}) in distribution.


       Caveat
       Often impossible or inefficient to simulate directly from Π
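
       A minimal R sketch of this estimator (the N(0,1) target and h(x) = x² below are
       illustrative choices, not part of the talk):

       set.seed(1)
       N <- 1e4
       x <- rnorm(N)                     # iid sample from pi = N(0,1)
       h <- function(x) x^2
       mc <- mean(h(x))                  # Monte Carlo estimate of Pi(h) = 1
       se <- sd(h(x))/sqrt(N)            # CLT-based standard error
       c(mc, se)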
Likelihood-free Methods
   Introduction
     Monte Carlo basics


Importance Sampling


       For Q a proposal distribution such that Q(dx) = q(x) µ(dx),
       alternative representation

              Π(h) = ∫ h(x) {π/q}(x) q(x) µ(dx).


       Principle
       Generate an iid sample x_1, . . . , x_N ∼ Q and estimate Π(h) by

              Π̂_{Q,N}^IS(h) = N^{−1} Σ_{i=1}^N h(x_i) {π/q}(x_i).
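
       A minimal R sketch of this estimator, assuming for illustration a N(0,1) target π,
       a Cauchy proposal Q and h(x) = x²:

       set.seed(1)
       N <- 1e4
       x <- rcauchy(N)                   # iid sample from the proposal Q
       w <- dnorm(x)/dcauchy(x)          # importance ratios pi/q
       est <- mean(x^2 * w)              # IS estimate of Pi(h) for h(x) = x^2
       est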
Likelihood-free Methods
   Introduction
     Monte Carlo basics


Importance Sampling (convergence)


       Then
       LLN: Π̂_{Q,N}^IS(h) −→ Π(h) a.s., and if Q((hπ/q)²) < ∞,

       CLT: √N (Π̂_{Q,N}^IS(h) − Π(h)) → N(0, Q{(hπ/q − Π(h))²}) in distribution.


       Caveat
       If the normalizing constant is unknown, impossible to use Π̂_{Q,N}^IS

       Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ) π(θ).
Likelihood-free Methods
   Introduction
     Monte Carlo basics


Self-Normalised Importance Sampling


       Self normalized version

              Π̂_{Q,N}^SNIS(h) = { Σ_{i=1}^N {π/q}(x_i) }^{−1} Σ_{i=1}^N h(x_i) {π/q}(x_i).

       LLN: Π̂_{Q,N}^SNIS(h) −→ Π(h) a.s.
       and if Π((1 + h²)(π/q)) < ∞,

       CLT: √N (Π̂_{Q,N}^SNIS(h) − Π(h)) → N(0, π{(π/q)(h − Π(h))²}) in distribution.

       © The quality of the SNIS approximation depends on the
       choice of Q
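
       A minimal R sketch of the self-normalised version, assuming the target is only known
       up to a constant (here an unnormalised N(0,1), with a Cauchy proposal and h(x) = x²):

       set.seed(1)
       N <- 1e4
       ptilde <- function(x) exp(-x^2/2)    # unnormalised target
       x <- rcauchy(N)
       w <- ptilde(x)/dcauchy(x)            # unnormalised importance weights
       snis <- sum(w * x^2)/sum(w)          # self-normalised estimate of Pi(h)
       snis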
Likelihood-free Methods
   Introduction
     Monte Carlo basics


Iterated importance sampling

       Introduction of an algorithmic temporal dimension:

              x_i^(t) ∼ q_t(x | x_i^(t−1)),      i = 1, . . . , n,   t = 1, . . .

       and
              Î_t = (1/n) Σ_{i=1}^n ϱ_i^(t) h(x_i^(t))

       is still unbiased for Π_t(h) when the importance weights are

              ϱ_i^(t) = π_t(x_i^(t)) / q_t(x_i^(t) | x_i^(t−1)),      i = 1, . . . , n
Likelihood-free Methods
   Introduction
     Population Monte Carlo



       PMC: Population Monte Carlo Algorithm
       At time t = 0
                  Generate (x_{i,0})_{1≤i≤N} iid ∼ Q_0
                  Set ω_{i,0} = {π/q_0}(x_{i,0})
                  Generate (J_{i,0})_{1≤i≤N} iid ∼ M(1, (ω̄_{i,0})_{1≤i≤N})
                  Set x̃_{i,0} = x_{J_{i,0},0}
       At time t (t = 1, . . . , T),
                  Generate x_{i,t} ind ∼ Q_{i,t}(x̃_{i,t−1}, ·)
                  Set ω_{i,t} = π(x_{i,t}) / q_{i,t}(x̃_{i,t−1}, x_{i,t})
                  Generate (J_{i,t})_{1≤i≤N} iid ∼ M(1, (ω̄_{i,t})_{1≤i≤N})
                  Set x̃_{i,t} = x_{J_{i,t},t}.

                  [Cappé, Douc, Guillin, Marin, & CPR, 2009, Stat. & Comput.]
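
       A minimal R sketch of one PMC run, assuming for illustration a univariate N(0,1)
       target, a uniform Q_0, Gaussian kernels Q_{i,t} centred at the resampled particles,
       and a crude common scale per iteration:

       set.seed(1)
       N <- 1000; Tmax <- 10
       target <- function(x) dnorm(x)                  # toy target pi
       x <- runif(N, -10, 10)                          # t = 0: draws from Q0 = U(-10,10)
       w <- target(x)/dunif(x, -10, 10)
       xt <- sample(x, N, replace = TRUE, prob = w)    # multinomial resampling step
       scale <- 2
       for (t in 1:Tmax) {
         x <- rnorm(N, mean = xt, sd = scale)          # move each resampled particle
         w <- target(x)/dnorm(x, mean = xt, sd = scale)
         xt <- sample(x, N, replace = TRUE, prob = w)
         scale <- max(sd(xt), .1)                      # crude adaptation of the kernel scale
       }
       sum(w * x)/sum(w)                               # weighted estimate at the last iteration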
Likelihood-free Methods
   Introduction
     Population Monte Carlo


Notes on PMC


       After T iterations of PMC, the PMC estimator of Π(h) is given by

              Π̄_{N,T}^PMC(h) = (1/T) Σ_{t=1}^T Σ_{i=1}^N ω̄_{i,t} h(x_{i,t}).


          1. ω̄_{i,t} means normalising over the whole sequence of simulations
          2. the Q_{i,t}'s are chosen arbitrarily under a support constraint
          3. the Q_{i,t}'s may depend on the whole sequence of simulations
          4. natural solution for ABC
Likelihood-free Methods
   Introduction
     Population Monte Carlo


Improving quality
       The efficiency of the SNIS approximation depends on the choice of
       Q, ranging from optimal

              q(x) ∝ |h(x) − Π(h)| π(x)

       to useless

              var Π̂_{Q,N}^SNIS(h) = +∞


       Example (PMC = adaptive importance sampling)
       Population Monte Carlo produces a sequence of proposals Q_t
       aiming at improving efficiency

              Kull(π, q_t) ≤ Kull(π, q_{t−1})   or   var Π̂_{Q_t,∞}^SNIS(h) ≤ var Π̂_{Q_{t−1},∞}^SNIS(h)

                          [Cappé, Douc, Guillin, Marin, Robert, 04, 07a, 07b, 08]
Likelihood-free Methods
   Introduction
     AMIS


Multiple Importance Sampling

       Recycling: given several proposals Q_1, . . . , Q_T, for 1 ≤ t ≤ T
       generate an iid sample

              x_1^t, . . . , x_N^t ∼ Q_t

       and estimate Π(h) by

              Π̂_{Q,N}^MIS(h) = T^{−1} Σ_{t=1}^T N^{−1} Σ_{i=1}^N h(x_i^t) ω_i^t

        where
              ω_i^t = π(x_i^t) / q_t(x_i^t)                              correct...
Likelihood-free Methods
   Introduction
     AMIS


Multiple Importance Sampling

       Recycling: given several proposals Q_1, . . . , Q_T, for 1 ≤ t ≤ T
       generate an iid sample

              x_1^t, . . . , x_N^t ∼ Q_t

       and estimate Π(h) by

              Π̂_{Q,N}^MIS(h) = T^{−1} Σ_{t=1}^T N^{−1} Σ_{i=1}^N h(x_i^t) ω_i^t

        where
              ω_i^t = π(x_i^t) / { T^{−1} Σ_{ℓ=1}^T q_ℓ(x_i^t) }                    still correct!
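
       A minimal R sketch contrasting the two weightings above, assuming a N(0,1) target,
       h(x) = x² and Gaussian proposals Q_t = N(0, t²), t = 1, ..., T:

       set.seed(1)
       N <- 1000; Tp <- 5
       target <- function(x) dnorm(x)
       est <- matrix(NA, Tp, 2)
       for (t in 1:Tp) {
         x <- rnorm(N, sd = t)                                   # sample from Q_t = N(0, t^2)
         w1 <- target(x)/dnorm(x, sd = t)                        # standard weights pi/q_t
         qmix <- rowMeans(sapply(1:Tp, function(l) dnorm(x, sd = l)))
         w2 <- target(x)/qmix                                    # deterministic mixture weights
         est[t, ] <- c(mean(x^2 * w1), mean(x^2 * w2))
       }
       colMeans(est)                                             # both estimate Pi(h) = 1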
Likelihood-free Methods
   Introduction
     AMIS


Mixture representation


       Deterministic mixture correction of the weights proposed by Owen
       and Zhou (JASA, 2000)
                  The corresponding estimator is still unbiased [if not
                  self-normalised]
                  All particles are on the same weighting scale rather than their
                  own
                  Large variance proposals Qt do not take over
                  Variance reduction thanks to weight stabilization & recycling
                  [K.o.] removes the randomness in the component choice
                  [=Rao-Blackwellisation]
Likelihood-free Methods
   Introduction
     AMIS


Global adaptation


       Global Adaptation
       At iteration t = 1, · · · , T ,
          1. For 1 ≤ i ≤ N_1, generate x_i^t ∼ T_3(µ̂^{t−1}, Σ̂^{t−1})
          2. Calculate the mixture importance weight of particle x_i^t

                                 ω_i^t = π(x_i^t) / δ_i^t

             where
                                 δ_i^t = Σ_{l=0}^{t−1} q_{T(3)}(x_i^t; µ̂^l, Σ̂^l)
Likelihood-free Methods
   Introduction
     AMIS




       Backward reweighting
          3. If t ≥ 2, actualize the weights of all past particles x_i^l,
             1 ≤ l ≤ t − 1,
                                 ω_i^l = π(x_i^l) / δ_i^l
             where
                                 δ_i^l ← δ_i^l + q_{T(3)}(x_i^l; µ̂^{t−1}, Σ̂^{t−1})

          4. Compute IS estimates of the target mean and variance, µ̂^t and Σ̂^t,
             where
                                 µ̂_j^t = Σ_{l=1}^t Σ_{i=1}^{N_1} ω_i^l (x_j)_i^l / Σ_{l=1}^t Σ_{i=1}^{N_1} ω_i^l . . .
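
       A hedged one-dimensional R sketch of the scheme above (the N(3, 2²) target known up
       to a constant is an illustrative choice; the t3 proposals and notation mirror the slides):

       set.seed(1)
       target <- function(x) exp(-(x - 3)^2/8)            # unnormalised N(3, 2^2)
       dst3 <- function(x, m, s) dt((x - m)/s, df = 3)/s  # t3 location-scale density
       Titer <- 10; N1 <- 500
       mu.hat <- 0; sig.hat <- 5                          # initial proposal parameters
       xs <- NULL; mus <- sigs <- numeric(0)
       for (t in 1:Titer) {
         mus <- c(mus, mu.hat); sigs <- c(sigs, sig.hat)
         xs <- c(xs, mu.hat + sig.hat * rt(N1, df = 3))   # step 1: draws from the current t3
         ## steps 2-3: deterministic-mixture denominator recomputed for ALL particles
         deltas <- sapply(xs, function(x) mean(dst3(x, mus, sigs)))
         ws <- target(xs)/deltas
         ## step 4: IS estimates of the target mean and standard deviation
         mu.hat <- sum(ws * xs)/sum(ws)
         sig.hat <- sqrt(sum(ws * (xs - mu.hat)^2)/sum(ws))
       }
       c(mu.hat, sig.hat)                                 # should approach (3, 2)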
Likelihood-free Methods
   Introduction
     AMIS


A toy example



       [Figure: contour plot of the banana-shaped benchmark target]

       Banana shape benchmark: marginal distribution of (x_1, x_2) for the parameters
       σ_1² = 100 and b = 0.03. Contours represent 60% (red), 90% (black) and 99.9%
       (blue) confidence regions in the marginal space.
Likelihood-free Methods
   Introduction
     AMIS


A toy example

       [Figure: boxplots of effective sample sizes for p = 5, 10, 20]

       Banana shape example: boxplots of 10 replicate ESSs for the AMIS scheme
       (left) and the NOT-AMIS scheme (right) for p = 5, 10, 20.
Likelihood-free Methods
   The Metropolis-Hastings Algorithm




The Metropolis-Hastings Algorithm

       Introduction

       The Metropolis-Hastings Algorithm
          Monte Carlo Methods based on Markov Chains
          The Metropolis–Hastings algorithm
          The random walk Metropolis-Hastings algorithm

       simulation-based methods in Econometrics

       Genetics of ABC

       Approximate Bayesian computation
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     Monte Carlo Methods based on Markov Chains


Running Monte Carlo via Markov Chains


       Epiphany! It is not necessary to use a sample from the distribution
       f to approximate the integral

              I = ∫ h(x) f(x) dx ,


       Principle: Obtain X_1, . . . , X_n ∼ f (approx) without directly
       simulating from f, using an ergodic Markov chain with stationary
       distribution f
                                                     [Metropolis et al., 1953]
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     Monte Carlo Methods based on Markov Chains


Running Monte Carlo via Markov Chains (2)

       Idea
       For an arbitrary starting value x^(0), an ergodic chain (X^(t)) is
       generated using a transition kernel with stationary distribution f


              Ensures the convergence in distribution of (X^(t)) to a random
              variable from f.
              For a “large enough” T_0, X^(T_0) can be considered as
              distributed from f
              Produces a dependent sample X^(T_0), X^(T_0+1), . . ., which is
              generated from f, sufficient for most approximation purposes.
       Problem: How can one build a Markov chain with a given
       stationary distribution?
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


The Metropolis–Hastings algorithm



       Basics
       The algorithm uses the objective (target) density

                                           f

       and a conditional density
                                         q(y |x)
       called the instrumental (or proposal) distribution
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


The MH algorithm

       Algorithm (Metropolis–Hastings)
       Given x^(t),
         1. Generate Y_t ∼ q(y | x^(t)).
         2. Take

              X^(t+1) = Y_t      with prob. ρ(x^(t), Y_t),
                        x^(t)    with prob. 1 − ρ(x^(t), Y_t),

              where
                       ρ(x, y) = min{ [f(y)/f(x)] [q(x|y)/q(y|x)] , 1 }.
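
       A minimal R sketch of the algorithm, assuming for illustration a Gamma(2.3, 1)
       target known up to a constant and an independent Exp(1) proposal q(y|x) = q(y):

       set.seed(1)
       ltarget <- function(x) ifelse(x > 0, 1.3*log(x) - x, -Inf)   # log f, up to a constant
       niter <- 1e4
       chain <- numeric(niter); x <- 1
       for (t in 1:niter) {
         y <- rexp(1)                                               # proposal draw
         lr <- ltarget(y) - ltarget(x) + dexp(x, log = TRUE) - dexp(y, log = TRUE)
         if (log(runif(1)) < lr) x <- y                             # MH acceptance step
         chain[t] <- x
       }
       mean(chain)                                                  # approx. 2.3, the Gamma mean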
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Features



              Independent of normalizing constants for both f and q(·|x)
              (i.e., those constants independent of x)
              Never move to values with f (y ) = 0
              The chain (x (t) )t may take the same value several times in a
              row, even though f is a density wrt Lebesgue measure
              The sequence (yt )t is usually not a Markov chain
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Convergence properties

       The M-H Markov chain is reversible, with invariant/stationary
       density f since it satisfies the detailed balance condition

              f(y) K(y, x) = f(x) K(x, y)

       If
              q(y|x) > 0 for every (x, y),

       the chain is Harris recurrent and

              lim_{T→∞} (1/T) Σ_{t=1}^T h(X^(t)) = ∫ h(x) f(x) dx      a.e. f.

              lim_{n→∞} ‖ ∫ K^n(x, ·) µ(dx) − f ‖_TV = 0

       for every initial distribution µ
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm


Random walk Metropolis–Hastings



       Use of a local perturbation as proposal

                                               Yt = X (t) + εt ,

        where εt ∼ g , independent of X (t) .
       The instrumental density is now of the form g (y − x) and the
       Markov chain is a random walk if we take g to be symmetric
       g (x) = g (−x)
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm




       Algorithm (Random walk Metropolis)
       Given x^(t)
         1. Generate Y_t ∼ g(y − x^(t))
         2. Take

              X^(t+1) = Y_t      with prob. min{ 1, f(Y_t)/f(x^(t)) },
                        x^(t)    otherwise.
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm


RW-MH on mixture posterior distribution

       [Figure: random walk MCMC sample in the (µ_1, µ_2) plane]

       Random walk MCMC output for .7N (µ1 , 1) + .3N (µ2 , 1)
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm


Acceptance rate


       A high acceptance rate is not an indication of efficiency since the
       random walk may be moving “too slowly” on the target surface

       If x^(t) and y_t are “too close”, i.e. f(x^(t)) ≈ f(y_t), then y_t is accepted
       with probability

              min{ f(y_t)/f(x^(t)) , 1 } ≈ 1,

       and the acceptance rate is high

       If the average acceptance rate is low, the proposed values f(y_t) tend to
       be small w.r.t. f(x^(t)), i.e. the random walk [not the algorithm!]
       moves quickly on the target surface, often reaching its boundaries
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm


Rule of thumb




       In small dimensions, aim at an average acceptance rate of 50%. In
       large dimensions, at an average acceptance rate of 25%.
                                        [Gelman, Gilks and Roberts, 1995]
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm


Noisy AR(1)



       Target distribution of x given x_1, x_2 and y is

              exp{ −(1/2τ²) [ (x − ϕx_1)² + (x_2 − ϕx)² + (τ²/σ²)(y − x²)² ] } .

       For a Gaussian random walk with scale ω small enough, the
       random walk never jumps to the other mode. But if the scale ω is
       sufficiently large, the Markov chain explores both modes and gives a
       satisfactory approximation of the target distribution.
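
       A hedged R sketch of the random walk MH sampler on this target; the values of ϕ, τ,
       σ, x_1, x_2 and y below are illustrative choices, not those used in the talk:

       set.seed(1)
       phi <- .9; tau <- 1; sigma <- .1; x1 <- 0; x2 <- 0; y <- 1
       ltarget <- function(x)                                     # log target, up to a constant
         -((x - phi*x1)^2 + (x2 - phi*x)^2 + (tau^2/sigma^2)*(y - x^2)^2)/(2*tau^2)
       rwmh <- function(omega, niter = 1e4, x0 = 1) {
         chain <- numeric(niter); x <- x0
         for (t in 1:niter) {
           prop <- x + omega*rnorm(1)                              # Gaussian random walk proposal
           if (log(runif(1)) < ltarget(prop) - ltarget(x)) x <- prop
           chain[t] <- x
         }
         chain
       }
       small <- rwmh(omega = .1)          # stays trapped near one of the modes x ≈ ±1
       large <- rwmh(omega = .5)          # jumps between the two modes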
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm


Noisy AR(2)




       Markov chain based on a random walk with scale ω = .1.
Likelihood-free Methods
   The Metropolis-Hastings Algorithm
     The random walk Metropolis-Hastings algorithm


Noisy AR(3)




       Markov chain based on a random walk with scale ω = .5.
Likelihood-free Methods
   simulation-based methods in Econometrics




Meanwhile, in a Gaulish village...



       Similar exploration of simulation-based techniques in Econometrics
              Simulated method of moments
              Method of simulated moments
              Simulated pseudo-maximum-likelihood
              Indirect inference
                                              [Gouriéroux & Monfort, 1996]
Likelihood-free Methods
   simulation-based methods in Econometrics




Simulated method of moments

       Given observations y^o_{1:n} from a model

              y_t = r(y_{1:(t−1)}, ε_t, θ) ,      ε_t ∼ g(·)

       simulate ε*_{1:n}, derive

              y_t(θ) = r(y_{1:(t−1)}, ε*_t, θ)

       and estimate θ by

              arg min_θ Σ_{t=1}^n (y^o_t − y_t(θ))²
Likelihood-free Methods
   simulation-based methods in Econometrics




Simulated method of moments

       Given observations y^o_{1:n} from a model

              y_t = r(y_{1:(t−1)}, ε_t, θ) ,      ε_t ∼ g(·)

       simulate ε*_{1:n}, derive

              y_t(θ) = r(y_{1:(t−1)}, ε*_t, θ)

       and estimate θ by

              arg min_θ { Σ_{t=1}^n y^o_t − Σ_{t=1}^n y_t(θ) }²
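
       A minimal R sketch of this criterion on a toy location model y_t = θ + ε_t (an
       illustrative choice, not the talk's example); the simulated innovations are held
       fixed across the values of θ:

       set.seed(2)
       n <- 500; theta0 <- 2.5
       yobs <- theta0 + rnorm(n)                      # observed sample
       eps.star <- rnorm(n)                           # simulated innovations, kept fixed
       ysim <- function(theta) theta + eps.star
       crit <- function(theta) (sum(yobs) - sum(ysim(theta)))^2
       optimize(crit, c(-10, 10))$minimum             # close to theta0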
Likelihood-free Methods
   simulation-based methods in Econometrics




Method of simulated moments

       Given a statistic vector K(y) with

              E_θ[K(Y_t) | y_{1:(t−1)}] = k(y_{1:(t−1)}; θ)

       find an unbiased estimator of k(y_{1:(t−1)}; θ),

              k̃(ε_t, y_{1:(t−1)}; θ)

       Estimate θ by

              arg min_θ ‖ Σ_{t=1}^n { K(y_t) − Σ_{s=1}^S k̃(ε^s_t, y_{1:(t−1)}; θ)/S } ‖

                                                                [Pakes & Pollard, 1989]
Likelihood-free Methods
   simulation-based methods in Econometrics




Indirect inference



                                                       ˆ
       Minimise (in θ) the distance between estimators β based on
       pseudo-models for genuine observations and for observations
       simulated under the true model and the parameter θ.

                                              [Gouriéroux, Monfort, & Renault, 1993;
                                              Smith, 1993; Gallant & Tauchen, 1996]
Likelihood-free Methods
   simulation-based methods in Econometrics




Indirect inference (PML vs. PSE)


       Example of the pseudo-maximum-likelihood (PML)

              β̂(y) = arg max_β Σ_t log f(y_t | β, y_{1:(t−1)})

       leading to

              arg min_θ ‖ β̂(y^o) − β̂(y_1(θ), . . . , y_S(θ)) ‖²

       when
              y_s(θ) ∼ f(y|θ) ,      s = 1, . . . , S
Likelihood-free Methods
   simulation-based methods in Econometrics




Indirect inference (PML vs. PSE)


       Example of the pseudo-score estimator (PSE)

              β̂(y) = arg min_β Σ_t [ ∂ log f/∂β (y_t | β, y_{1:(t−1)}) ]²

       leading to

              arg min_θ ‖ β̂(y^o) − β̂(y_1(θ), . . . , y_S(θ)) ‖²

       when
              y_s(θ) ∼ f(y|θ) ,      s = 1, . . . , S
Likelihood-free Methods
   simulation-based methods in Econometrics




Consistent indirect inference



                  ...in order to get a unique solution the dimension of
              the auxiliary parameter β must be larger than or equal to
              the dimension of the initial parameter θ. If the problem is
              just identified the different methods become easier...

       Consistency depending on the criterion and on the asymptotic
       identifiability of θ
                                      [Gouriéroux & Monfort, 1996, p. 66]
Likelihood-free Methods
   simulation-based methods in Econometrics




AR(2) vs. MA(1) example

       true (MA) model
                                  y_t = ε_t − θ ε_{t−1}

       and [wrong!] auxiliary (AR) model

                                  y_t = β_1 y_{t−1} + β_2 y_{t−2} + u_t

       R code
       x=eps=rnorm(250)                   # simulate an MA(1) sample with theta=0.5
       x[2:250]=x[2:250]-0.5*x[1:249]
       simeps=rnorm(250)                  # innovations reused for every candidate theta
       propeta=seq(-.99,.99,le=199)       # grid of candidate theta values
       dist=rep(0,199)
       bethat=as.vector(arima(x,c(2,0,0),incl=FALSE)$coef)   # auxiliary AR(2) fit on the data
       for (t in 1:199)                   # AR(2) fit on each simulated MA(1) path
         dist[t]=sum((as.vector(arima(c(simeps[1],simeps[2:250]-propeta[t]*
         simeps[1:249]),c(2,0,0),incl=FALSE)$coef)-bethat)^2)
Likelihood-free Methods
   simulation-based methods in Econometrics




AR(2) vs. MA(1) example
       One sample:

       [Figure: distance criterion as a function of the candidate θ]
Likelihood-free Methods
   simulation-based methods in Econometrics




AR(2) vs. MA(1) example
       Many samples:

       [Figure: results over many replications of the same experiment]
Likelihood-free Methods
   simulation-based methods in Econometrics




Choice of pseudo-model



       Pick model such that
         1. β̂(θ) not flat
            (i.e. sensitive to changes in θ)
         2. β̂(θ) not dispersed (i.e. robust against changes in y_s(θ))
                                              [Frigessi & Heggland, 2004]
Likelihood-free Methods
   simulation-based methods in Econometrics




ABC using indirect inference


                  We present a novel approach for developing summary statistics
              for use in approximate Bayesian computation (ABC) algorithms by
              using indirect inference(...) In the indirect inference approach to
              ABC the parameters of an auxiliary model fitted to the data become
              the summary statistics. Although applicable to any ABC technique,
              we embed this approach within a sequential Monte Carlo algorithm
              that is completely adaptive and requires very little tuning(...)

                                               [Drovandi, Pettitt & Faddy, 2011]

              © Indirect inference provides summary statistics for ABC...
Likelihood-free Methods
   simulation-based methods in Econometrics




ABC using indirect inference



                  ...the above result shows that, in the limit as h → 0, ABC will
              be more accurate than an indirect inference method whose auxiliary
              statistics are the same as the summary statistic that is used for
              ABC(...) Initial analysis showed that which method is more
              accurate depends on the true value of θ.

                                                   [Fearnhead and Prangle, 2012]

      © Indirect inference provides estimates rather than global inference...
Likelihood-free Methods
   Genetics of ABC




Genetic background of ABC



       ABC is a recent computational technique that only requires being
       able to sample from the likelihood f (·|θ)
       This technique stemmed from population genetics models, about
       15 years ago, and population geneticists still contribute
       significantly to methodological developments of ABC.
                                 [Griffith & al., 1997; Tavaré & al., 1999]
Likelihood-free Methods
   Genetics of ABC




Population genetics

                          [Part derived from the teaching material of Raphael Leblois, ENS Lyon, November 2010]


              Describe the genotypes, estimate the alleles frequencies,
              determine their distribution among individuals, populations
              and between populations;

              Predict and understand the evolution of gene frequencies in
              populations as a result of various factors.

        © Analyses the effect of various evolutionary forces (mutation, drift,
       migration, selection) on the evolution of gene frequencies in time
       and space.
Likelihood-free Methods
   Genetics of ABC




A[B]ctors

       Towards an explanation of the mechanisms of evolutionary
       changes, mathematical models of evolution started to be developed
       in the years 1920-1930.




             Ronald A Fisher (1890-1962)   Sewall Wright (1889-1988)   John BS Haldane (1892-1964)
Likelihood-free Methods
   Genetics of ABC




Wright-Fisher model

              A population of constant size, in which individuals
              reproduce at the same time.
              Each gene in a generation is a copy of a gene of the
              previous generation.
              In the absence of mutation and selection, allele
              frequencies drift inevitably until the fixation of an allele.
              Drift therefore leads to the loss of genetic variation
              within populations.
Likelihood-free Methods
   Genetics of ABC




Coalescent theory
                                 [Kingman, 1982; Tajima, Tavaré, &tc]




       Coalescence theory is interested in the genealogy of a sample of
       genes back in time to the common ancestor of the sample.
Likelihood-free Methods
   Genetics of ABC




Common ancestor

       [Figure: gene genealogy with the time of coalescence (T) marked]

       The different lineages merge (coalesce) as one goes back in the past,
       up to the common ancestor of a sample of genes.
Likelihood-free Methods
   Genetics of ABC
Neutral mutations

       Under the assumption of neutrality of the genetic markers, the
       mutations are independent of the genealogy, i.e. the genealogy
       only depends on the demographic processes.

       The genealogy is thus constructed according to the demographic
       parameters (e.g. N), then the mutations are added a posteriori on
       the different branches, from the MRCA down to the leaves of the tree.
       This produces polymorphism data under the demographic and
       mutational models under consideration.
Likelihood-free Methods
   Genetics of ABC




Demo-genetic inference



       Each model is characterized by a set of parameters θ that cover
       historical (divergence time, admixture time, ...), demographic
       (population sizes, admixture rates, migration rates, ...) and genetic
       (mutation rate, ...) factors
       The goal is to estimate these parameters from a dataset of
       polymorphism (DNA sample) y observed at the present time
       Problem: most of the time, we cannot calculate the likelihood of
       the polymorphism data f (y|θ).
Likelihood-free Methods
   Genetics of ABC




Untractable likelihood



       Missing (too missing!) data structure:

              f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG

       The genealogies are considered as nuisance parameters.
       This problem thus differs from the phylogenetic approach,
       where the tree is the parameter of interest.
Likelihood-free Methods
   Genetics of ABC




A genuine example of application

       [Figure: the pygmy and non-pygmy populations under study]

       Pygmy populations: do they have a common origin? Is there a
       lot of exchange between pygmy and non-pygmy populations?
Likelihood-free Methods
   Genetics of ABC

Alternative scenarios

       Different possible scenarios; scenario choice by ABC

       [Figure: the competing demographic scenarios]

                                                                  Verdu et al. 2009
Likelihood-free Methods
   Genetics of ABC




Simulation results

       Different possible scenarios; scenario choice by ABC

       [Figure: ABC comparison of the competing scenarios]

        © Scenario 1A is chosen: scenario 1a is strongly supported over the
       others, arguing for a common origin of the West African pygmy
       populations.
                                                                  Verdu et al. 2009
Likelihood-free Methods
   Genetics of ABC




Most likely scenario

       [Figure: the selected evolutionary scenario]

       Evolutionary scenario: a story is “told” from these inferences.

                                      Verdu et al. 2009
Likelihood-free Methods
   Approximate Bayesian computation




Approximate Bayesian computation

       Introduction

       The Metropolis-Hastings Algorithm

       simulation-based methods in Econometrics

       Genetics of ABC

       Approximate Bayesian computation
          ABC basics
          Alphabet soup
          Automated summary statistic selection
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Untractable likelihoods



       Cases when the likelihood function f (y|θ) is unavailable and when
       the completion step

                                      f (y|θ) =       f (y, z|θ) dz
                                                  Z

       is impossible or too costly because of the dimension of z
        © MCMC cannot be implemented!
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Untractable likelihoods
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Illustrations



       Example
       Stochastic volatility model: for t = 1, . . . , T,

              y_t = exp(z_t) ε_t ,      z_t = a + b z_{t−1} + σ η_t ,

       T very large makes it difficult to include z within the simulated
       parameters

       [Figure: highest weight trajectories]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Illustrations



       Example
       Potts model: if y takes values on a grid Y of size k^n and

              f(y|θ) ∝ exp{ θ Σ_{l∼i} I_{y_l = y_i} }

       where l∼i denotes a neighbourhood relation, n moderately large
       prohibits the computation of the normalising constant
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Illustrations




       Example
       Inference on CMB: in cosmology, study of the Cosmic Microwave
       Background via likelihoods immensely slow to compute (e.g.
       WMAP, Planck), because of numerically costly spectral transforms
       [Data is a Fortran program]
                                        [Kilbinger et al., 2010, MNRAS]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Illustrations


       Example
   Phylogenetic tree: in population
   genetics, reconstitution of a common
   ancestor from a sample of genes via
   a phylogenetic tree that is close to
   impossible to integrate out
   [100 processor days with 4
   parameters]
                                    [Cornuet et al., 2009, Bioinformatics]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


The ABC method

       Bayesian setting: target is π(θ)f (x|θ)
       When the likelihood f (x|θ) is not in closed form, likelihood-free rejection
       technique:
       ABC algorithm
       For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly
       simulating
              θ' ∼ π(θ) ,   z ∼ f (z|θ') ,
       until the auxiliary variable z is equal to the observed value, z = y.

                                                        [Tavaré et al., 1997]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Why does it work?!



       The proof is trivial:

              f(θ_i) ∝ Σ_{z∈D} π(θ_i) f(z|θ_i) I_y(z)
                     ∝ π(θ_i) f(y|θ_i)
                     = π(θ_i|y) .

                                                                 [Accept–Reject 101]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Earlier occurrence


              ‘Bayesian statistics and Monte Carlo methods are ideally
              suited to the task of passing many models over one
              dataset’
                                      [Don Rubin, Annals of Statistics, 1984]

       Note: Rubin (1984) does not promote this algorithm for
       likelihood-free simulation but as a frequentist intuition on posterior
       distributions: parameters from posteriors are more likely to be
       those that could have generated the data.
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


A as A...pproximative


       When y is a continuous random variable, the equality z = y is replaced
       with a tolerance condition,

              ρ(y, z) ≤ ε

       where ρ is a distance
       Output distributed from

              π(θ) P_θ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε)

                                                         [Pritchard et al., 1999]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


ABC algorithm


       Algorithm 1 Likelihood-free rejection sampler
         for i = 1 to N do
           repeat
              generate θ' from the prior distribution π(·)
              generate z from the likelihood f (·|θ')
           until ρ{η(z), η(y)} ≤ ε
           set θ_i = θ'
         end for

       where η(y) defines a (not necessarily sufficient) statistic
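
       A minimal R sketch of this rejection sampler on a toy normal-mean problem; the model
       y ∼ N(θ, 1) with n = 10, the prior θ ∼ N(0, 10), η = sample mean and ε = 0.05 are
       all illustrative choices:

       set.seed(3)
       n <- 10; yobs <- rnorm(n, mean = 2)
       eta <- mean(yobs)                          # summary statistic
       eps <- .05; N <- 1e3
       post <- numeric(N)
       for (i in 1:N) {
         repeat {
           theta <- rnorm(1, 0, sqrt(10))         # generate from the prior
           z <- rnorm(n, theta)                   # generate pseudo-data from the likelihood
           if (abs(mean(z) - eta) <= eps) break   # accept when rho(eta(z), eta(y)) <= eps
         }
         post[i] <- theta
       }
       quantile(post, c(.025, .5, .975))          # ABC approximation of the posterior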
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Output

       The likelihood-free algorithm samples from the marginal in z of:

              π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ,

       where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}.
       The idea behind ABC is that the summary statistics coupled with a
       small tolerance should provide a good approximation of the
       posterior distribution:

              π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y) .
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Convergence of ABC (first attempt)




       What happens when ε → 0?
       If f (·|θ) is continuous in y, uniformly in θ [!], given an arbitrary
       δ > 0, there exists ε_0 such that ε < ε_0 implies

              π(θ) ∫ f(z|θ) I_{A_{ε,y}}(z) dz / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ
                 ∈  [ π(θ) f(y|θ) (1 ∓ δ) µ(B_ε) ] / [ ∫_Θ π(θ) f(y|θ) dθ (1 ± δ) µ(B_ε) ]

       and the µ(B_ε) terms cancel.

                               [Proof extends to other continuous-in-0 kernels K ]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Convergence of ABC (second attempt)

       What happens when ε → 0?
       For B ⊂ Θ, we have

              ∫_B [ ∫_{A_{ε,y}} f(z|θ) dz / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] π(θ) dθ
                 = ∫_{A_{ε,y}} [ ∫_B f(z|θ) π(θ) dθ / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] dz
                 = ∫_{A_{ε,y}} [ ∫_B f(z|θ) π(θ) dθ / m(z) ] [ m(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] dz
                 = ∫_{A_{ε,y}} π(B|z) [ m(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] dz

       which indicates convergence for a continuous π(B|z).
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Probit modelling on Pima Indian women
       Example (R benchmark)
       200 Pima Indian women with observed variables
              plasma glucose concentration in oral glucose tolerance test
              diastolic blood pressure
              diabetes pedigree function
              presence/absence of diabetes
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Probit modelling on Pima Indian women
       Example (R benchmark)
       200 Pima Indian women with observed variables
              plasma glucose concentration in oral glucose tolerance test
              diastolic blood pressure
              diabetes pedigree function
              presence/absence of diabetes


       Probability of diabetes function of above variables
                               P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Probit modelling on Pima Indian women
       Example (R benchmark)
       200 Pima Indian women with observed variables
              plasma glucose concentration in oral glucose tolerance test
              diastolic blood pressure
              diabetes pedigree function
              presence/absence of diabetes


       Probability of diabetes function of above variables
                               P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,
       Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a
       g-prior modelling:
                                  β ∼ N3(0, n(X^T X)^{−1})
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Probit modelling on Pima Indian women
       Example (R benchmark)
       200 Pima Indian women with observed variables
              plasma glucose concentration in oral glucose tolerance test
              diastolic blood pressure
              diabetes pedigree function
              presence/absence of diabetes


       Probability of diabetes function of above variables
                               P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,
       Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a
       g-prior modelling:
                                  β ∼ N3(0, n(X^T X)^{−1})
       Use of importance function inspired from the MLE estimate
       distribution
                                  β ∼ N(β̂, Σ̂)
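
       A minimal R sketch of ABC rejection for this probit model, using the
       Pima.tr data shipped with the MASS package; it proposes from the
       MLE-based normal approximation above but, for brevity, ignores the
       g-prior importance-weight correction (variable names are illustrative):

        library(MASS)                              # provides Pima.tr and mvrnorm
        y=as.integer(Pima.tr$type=="Yes")
        X=as.matrix(Pima.tr[,c("glu","bp","ped")])
        fit=glm(y~X-1,family=binomial("probit"))   # MLE and its covariance
        N=1e4
        thetas=mvrnorm(N,coef(fit),vcov(fit))      # proposals from N(bhat,Sigmahat)
        sobs=crossprod(X,y)                        # summary statistic X'y
        dist=apply(thetas,1,function(b){
          z=rbinom(nrow(X),1,pnorm(X%*%b))         # pseudo-data from the probit model
          sqrt(sum((crossprod(X,z)-sobs)^2))})
        eps=quantile(dist,.01)                     # keep the closest 1%
        abc=thetas[dist<eps,]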
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Pima Indian benchmark


       Figure: Comparison between density estimates of the marginals on β1
       (left), β2 (center) and β3 (right) from ABC rejection samples (red) and
       MCMC samples (black)
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


MA example


       Back to the MA(q) model

                              x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i}

       Simple prior: uniform over the inverse [real and complex] roots in

                              Q(u) = 1 − Σ_{i=1}^q ϑ_i u^i

       under the identifiability conditions
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


MA example



       Back to the MA(q) model

                              x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i}

       Simple prior: uniform prior over the identifiability zone, e.g.
       triangle for MA(2)
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


MA example (2)
       ABC algorithm thus made of
         1. picking a new value (ϑ1 , ϑ2 ) in the triangle
         2. generating an iid sequence (ε_t)_{−q<t≤T}
         3. producing a simulated series (xt )1≤t≤T
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


MA example (2)
       ABC algorithm thus made of
         1. picking a new value (ϑ1 , ϑ2 ) in the triangle
         2. generating an iid sequence (ε_t)_{−q<t≤T}
         3. producing a simulated series (xt )1≤t≤T
       Distance: basic distance between the series

                   ρ((x_t)_{1≤t≤T}, (x'_t)_{1≤t≤T}) = Σ_{t=1}^T (x_t − x'_t)²

       or distance between summary statistics like the q autocorrelations

                              τ_j = Σ_{t=j+1}^T x_t x_{t−j}
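
       A hedged R sketch of the corresponding ABC rejection sampler for an
       MA(2) model, with the autocorrelation-type summaries above and a
       uniform prior over the invertibility triangle (under the "+" sign
       convention used by arima.sim; all names are illustrative):

        Tlen=100
        xobs=arima.sim(n=Tlen,model=list(ma=c(.6,.2)))      # stand-in observed series
        tau=function(x,q=2){n=length(x)
          sapply(1:q,function(j) sum(x[(j+1):n]*x[1:(n-j)]))}
        sobs=tau(xobs)
        N=2e4
        th1=runif(N,-2,2);th2=runif(N,-1,1)
        ok=(th1+th2>-1)&(th1-th2<1)                         # MA(2) triangle
        th1=th1[ok];th2=th2[ok]
        dist=mapply(function(a,b){
          z=arima.sim(n=Tlen,model=list(ma=c(a,b)))         # pseudo-series
          sqrt(sum((tau(z)-sobs)^2))},th1,th2)
        eps=quantile(dist,.001)                             # 0.1% tolerance
        abc=cbind(th1,th2)[dist<eps,]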
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Comparison of distance impact




       Evaluation of the tolerance on the ABC sample against both
       distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Comparison of distance impact

                [Figure: ABC output for θ1 (left) and θ2 (right)]

       Evaluation of the tolerance on the ABC sample against both
       distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


Homonymy
       The ABC algorithm is not to be confused with the ABC algorithm
              The Artificial Bee Colony algorithm is a swarm based meta-heuristic
              algorithm that was introduced by Karaboga in 2005 for optimizing
              numerical problems. It was inspired by the intelligent foraging
              behavior of honey bees. The algorithm is specifically based on the
              model proposed by Tereshko and Loengarov (2005) for the foraging
              behaviour of honey bee colonies. The model consists of three
              essential components: employed and unemployed foraging bees, and
              food sources. The first two components, employed and unemployed
              foraging bees, search for rich food sources (...) close to their hive.
              The model also defines two leading modes of behaviour (...):
              recruitment of foragers to rich food sources resulting in positive
              feedback and abandonment of poor sources by foragers causing
              negative feedback.

                                                           [Karaboga, Scholarpedia]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


ABC advances

       Simulating from the prior is often poor in efficiency
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


ABC advances

       Simulating from the prior is often poor in efficiency
       Either modify the proposal distribution on θ to increase the density
       of x’s within the vicinity of y ...
            [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


ABC advances

       Simulating from the prior is often poor in efficiency
       Either modify the proposal distribution on θ to increase the density
       of x’s within the vicinity of y ...
            [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]

       ...or by viewing the problem as one of conditional density estimation
       and by developing techniques that allow for a larger ε
                                                   [Beaumont et al., 2002]
Likelihood-free Methods
   Approximate Bayesian computation
     ABC basics


ABC advances

       Simulating from the prior is often poor in efficiency
       Either modify the proposal distribution on θ to increase the density
       of x’s within the vicinity of y ...
            [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]

       ...or by viewing the problem as one of conditional density estimation
       and by developing techniques that allow for a larger ε
                                                   [Beaumont et al., 2002]

       ...or even by including ε in the inferential framework [ABCµ]
                                                           [Ratmann et al., 2009]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP
       Better usage of [prior] simulations by
       adjustment: instead of throwing away
       θ such that ρ(η(z), η(y)) > ε, replace
       θ’s with locally regressed transforms
          (use with BIC)


               θ∗ = θ − {η(z) − η(y)}^T β̂
                                                 [Csilléry et al., TEE, 2010]

       where β̂ is obtained by [NP] weighted least square regression on
       (η(z) − η(y)) with weights

                               Kδ {ρ(η(z), η(y))}

                                              [Beaumont et al., 2002, Genetics]
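
       A hedged R sketch of this local-linear adjustment, with an
       Epanechnikov-type kernel for Kδ; theta (accepted draws, one column per
       parameter), S (matrix of simulated summaries η(z)), sobs (η(y)) and the
       bandwidth delta are assumed to exist:

        abc.adjust=function(theta,S,sobs,delta){
          d=sweep(S,2,sobs)                          # η(z)-η(y)
          rho=sqrt(rowSums(d^2))
          w=ifelse(rho<delta,1-(rho/delta)^2,0)      # kernel weights Kδ
          keep=w>0
          fit=lm.wfit(cbind(1,d[keep,,drop=FALSE]),
                      theta[keep,,drop=FALSE],w[keep])
          beta=fit$coefficients[-1,,drop=FALSE]      # local regression slopes
          theta[keep,,drop=FALSE]-d[keep,,drop=FALSE]%*%beta  # θ*=θ-(η(z)-η(y))'β̂
        }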
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP (regression)


       Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012):
       weight each simulation directly by

                               Kδ {ρ(η(z(θ)), η(y))}

       or
                        (1/S) Σ_{s=1}^S Kδ {ρ(η(z_s(θ)), η(y))}


                                                           [consistent estimate of f (η|θ)]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP (regression)


       Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012):
       weight each simulation directly by

                               Kδ {ρ(η(z(θ)), η(y))}

       or
                        (1/S) Σ_{s=1}^S Kδ {ρ(η(z_s(θ)), η(y))}


                                           [consistent estimate of f (η|θ)]
       Curse of dimensionality: poor estimate when d = dim(η) is large...
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP (density estimation)


       Use of the kernel weights

                                          Kδ {ρ(η(z(θ)), η(y))}

       leads to the NP estimate of the posterior expectation

             Σ_i θ_i Kδ {ρ(η(z(θ_i)), η(y))} / Σ_i Kδ {ρ(η(z(θ_i)), η(y))}

                                                                       [Blum, JASA, 2010]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP (density estimation)


       Use of the kernel weights

                                         Kδ {ρ(η(z(θ)), η(y))}

       leads to the NP estimate of the posterior conditional density
             Σ_i K̃_b(θ_i − θ) Kδ {ρ(η(z(θ_i)), η(y))} / Σ_i Kδ {ρ(η(z(θ_i)), η(y))}

                                                                   [Blum, JASA, 2010]
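
       In R, this weighted kernel density estimate can be sketched directly
       with density(); theta (accepted draws for one parameter) and w (the
       kernel weights above) are assumed:

        w=w/sum(w)                      # normalise: density() expects weights summing to one
        plot(density(theta,weights=w),main="ABC-NP posterior density estimate")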
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP (density estimations)


       Other versions incorporating regression adjustments
             Σ_i K̃_b(θ_i* − θ) Kδ {ρ(η(z(θ_i)), η(y))} / Σ_i Kδ {ρ(η(z(θ_i)), η(y))}
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP (density estimations)


       Other versions incorporating regression adjustments
             Σ_i K̃_b(θ_i* − θ) Kδ {ρ(η(z(θ_i)), η(y))} / Σ_i Kδ {ρ(η(z(θ_i)), η(y))}

       In all cases, error

            E[ĝ(θ|y)] − g(θ|y) = c b² + c δ² + O_P(b² + δ²) + O_P(1/nδ^d)

                    var(ĝ(θ|y)) = c/(n b δ^d) (1 + o_P(1))
                                                                   [Blum, JASA, 2010]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NP (density estimations)


       Other versions incorporating regression adjustments
             Σ_i K̃_b(θ_i* − θ) Kδ {ρ(η(z(θ_i)), η(y))} / Σ_i Kδ {ρ(η(z(θ_i)), η(y))}

       In all cases, error

            E[ĝ(θ|y)] − g(θ|y) = c b² + c δ² + O_P(b² + δ²) + O_P(1/nδ^d)

                    var(ĝ(θ|y)) = c/(n b δ^d) (1 + o_P(1))
                                                             [standard NP calculations]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NCH

       Incorporating non-linearities and heteroscedasticities:

                 θ∗ = m̂(η(y)) + [θ − m̂(η(z))] σ̂(η(y)) / σ̂(η(z))
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NCH

       Incorporating non-linearities and heteroscedasticities:

                 θ∗ = m̂(η(y)) + [θ − m̂(η(z))] σ̂(η(y)) / σ̂(η(z))

       where
              m̂(η) estimated by non-linear regression (e.g., neural network)
              σ̂(η) estimated by non-linear regression on the residuals

                      log{θ_i − m̂(η_i)}² = log σ²(η_i) + ξ_i

                                                    [Blum & François, 2009]
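
       A hedged R sketch of this adjustment, using the nnet package for both
       non-linear regressions; theta (a vector of accepted draws), S (their
       simulated summaries, as a matrix) and sobs (the observed summaries) are
       assumed to exist:

        library(nnet)
        mfit=nnet(S,theta,size=5,linout=TRUE,trace=FALSE)      # m̂(η)
        mz=drop(predict(mfit,S))
        my=drop(predict(mfit,matrix(sobs,nrow=1)))
        vfit=nnet(S,log((theta-mz)^2),size=5,linout=TRUE,trace=FALSE)  # log σ̂²(η)
        sz=exp(drop(predict(vfit,S))/2)                        # σ̂(η(z))
        sy=exp(drop(predict(vfit,matrix(sobs,nrow=1)))/2)      # σ̂(η(y))
        thetastar=my+(theta-mz)*sy/sz                          # adjusted draws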
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NCH (2)


       Why neural network?
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-NCH (2)


       Why neural network?
              fights curse of dimensionality
              selects relevant summary statistics
              provides automated dimension reduction
              offers a model choice capability
              improves upon multinomial logistic
                                                    [Blum & François, 2009]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-MCMC


       Markov chain (θ^(t)) created via the transition function

        θ^(t+1) = θ'      if θ' ∼ Kω(θ'|θ^(t)), x ∼ f (x|θ') is such that x = y,
                           and u ∼ U(0, 1) ≤ π(θ') Kω(θ^(t)|θ') / π(θ^(t)) Kω(θ'|θ^(t)) ,
                  θ^(t)   otherwise,
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-MCMC


       Markov chain (θ^(t)) created via the transition function

        θ^(t+1) = θ'      if θ' ∼ Kω(θ'|θ^(t)), x ∼ f (x|θ') is such that x = y,
                           and u ∼ U(0, 1) ≤ π(θ') Kω(θ^(t)|θ') / π(θ^(t)) Kω(θ'|θ^(t)) ,
                  θ^(t)   otherwise,

       has the posterior π(θ|y ) as stationary distribution
                                                      [Marjoram et al, 2003]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-MCMC (2)

       Algorithm 2 Likelihood-free MCMC sampler
           Use Algorithm 1 to get (θ^(0), z^(0))
           for t = 1 to N do
             Generate θ' from Kω(·|θ^(t−1)),
             Generate z' from the likelihood f (·|θ'),
             Generate u from U[0,1],
             if u ≤ [π(θ') Kω(θ^(t−1)|θ') / π(θ^(t−1)) Kω(θ'|θ^(t−1))] I_{A_{ε,y}}(z') then
                set (θ^(t), z^(t)) = (θ', z')
             else
                (θ^(t), z^(t)) = (θ^(t−1), z^(t−1)),
             end if
           end for
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Why does it work?

       Acceptance probability does not involve calculating the likelihood
       and

           π_ε(θ', z'|y) / π_ε(θ^(t−1), z^(t−1)|y)
               × q(θ^(t−1)|θ') f (z^(t−1)|θ^(t−1)) / q(θ'|θ^(t−1)) f (z'|θ')

             = [ π(θ') f (z'|θ') I_{A_{ε,y}}(z') / π(θ^(t−1)) f (z^(t−1)|θ^(t−1)) I_{A_{ε,y}}(z^(t−1)) ]
               × [ q(θ^(t−1)|θ') f (z^(t−1)|θ^(t−1)) / q(θ'|θ^(t−1)) f (z'|θ') ]
                                                   [the f (z'|θ') terms cancel]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Why does it work?

       Acceptance probability does not involve calculating the likelihood
       and

           π_ε(θ', z'|y) / π_ε(θ^(t−1), z^(t−1)|y)
               × q(θ^(t−1)|θ') f (z^(t−1)|θ^(t−1)) / q(θ'|θ^(t−1)) f (z'|θ')

             = [ π(θ') f (z'|θ') I_{A_{ε,y}}(z') / π(θ^(t−1)) f (z^(t−1)|θ^(t−1)) I_{A_{ε,y}}(z^(t−1)) ]
               × [ q(θ^(t−1)|θ') f (z^(t−1)|θ^(t−1)) / q(θ'|θ^(t−1)) f (z'|θ') ]
                        [the f (z'|θ') and f (z^(t−1)|θ^(t−1)) terms cancel]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Why does it work?

       Acceptance probability does not involve calculating the likelihood
       and

           π_ε(θ', z'|y) / π_ε(θ^(t−1), z^(t−1)|y)
               × q(θ^(t−1)|θ') f (z^(t−1)|θ^(t−1)) / q(θ'|θ^(t−1)) f (z'|θ')

             = [ π(θ') f (z'|θ') I_{A_{ε,y}}(z') / π(θ^(t−1)) f (z^(t−1)|θ^(t−1)) I_{A_{ε,y}}(z^(t−1)) ]
               × [ q(θ^(t−1)|θ') f (z^(t−1)|θ^(t−1)) / q(θ'|θ^(t−1)) f (z'|θ') ]

             = π(θ') q(θ^(t−1)|θ') / π(θ^(t−1)) q(θ'|θ^(t−1)) × I_{A_{ε,y}}(z')
                    [after cancelling the f terms and I_{A_{ε,y}}(z^(t−1)) = 1]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example



       Case of
                           x ∼ 0.5 N (θ, 1) + 0.5 N (−θ, 1)
       under prior θ ∼ N (0, 10)
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example



       Case of
                           x ∼ 0.5 N (θ, 1) + 0.5 N (−θ, 1)
       under prior θ ∼ N (0, 10)

       ABC sampler
        thetas=rnorm(N,sd=10)                       # simulate from the prior
        zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1)   # pseudo-data
        eps=quantile(abs(zed-x),.01)                # tolerance = 1% quantile
        abc=thetas[abs(zed-x)<eps]                  # accepted parameter values
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example


       Case of
                           x ∼ 0.5 N (θ, 1) + 0.5 N (−θ, 1)
       under prior θ ∼ N (0, 10)

       ABC-MCMC sampler
        metas=rep(0,N)                              # reuses eps from the ABC sampler above
        metas[1]=rnorm(1,sd=10)
        zed[1]=x
        for (t in 2:N){
          metas[t]=rnorm(1,mean=metas[t-1],sd=5)               # random-walk proposal
          zed[t]=rnorm(1,mean=(1-2*(runif(1)<.5))*metas[t],sd=1)  # pseudo-data
          if ((abs(zed[t]-x)>eps)||(runif(1)>dnorm(metas[t],sd=10)/dnorm(metas[t-1],sd=10))){
           metas[t]=metas[t-1]                                 # reject: keep current state
           zed[t]=zed[t-1]}
          }
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example


                                      x =2
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example


                                   x = 2

                [Figure: density estimate of the ABC sample against θ]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example


                                      x = 10
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example


                                   x = 10

                [Figure: density estimate of the ABC sample against θ]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example


                                      x = 50
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A toy example


                                   x = 50

                [Figure: density estimate of the ABC sample against θ]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABCµ


                          [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]

       Use of a joint density

                     f (θ, ε|y) ∝ ξ(ε|y, θ) × π_θ(θ) × π_ε(ε)

       where y is the data, and ξ(ε|y, θ) is the prior predictive density of
       ρ(η(z), η(y)) given θ and x when z ∼ f (z|θ)
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABCµ


                          [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]

       Use of a joint density

                     f (θ, ε|y) ∝ ξ(ε|y, θ) × π_θ(θ) × π_ε(ε)

       where y is the data, and ξ(ε|y, θ) is the prior predictive density of
       ρ(η(z), η(y)) given θ and x when z ∼ f (z|θ)
       Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel
       approximation.
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABCµ details

       Multidimensional distances ρ_k (k = 1, . . . , K ) and errors
       ε_k = ρ_k(η_k(z), η_k(y)), with

         ε_k ∼ ξ_k(ε|y, θ) ≈ ξ̂_k(ε|y, θ) = (1/Bh_k) Σ_b K[{ε_k − ρ_k(η_k(z_b), η_k(y))}/h_k]

       then used in replacing ξ(ε|y, θ) with min_k ξ̂_k(ε|y, θ)
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABCµ details

       Multidimensional distances ρ_k (k = 1, . . . , K ) and errors
       ε_k = ρ_k(η_k(z), η_k(y)), with

         ε_k ∼ ξ_k(ε|y, θ) ≈ ξ̂_k(ε|y, θ) = (1/Bh_k) Σ_b K[{ε_k − ρ_k(η_k(z_b), η_k(y))}/h_k]

       then used in replacing ξ(ε|y, θ) with min_k ξ̂_k(ε|y, θ)
       ABCµ involves acceptance probability

         [π(θ', ε') q(θ', θ) q(ε', ε) min_k ξ̂_k(ε'|y, θ')] / [π(θ, ε) q(θ, θ') q(ε, ε') min_k ξ̂_k(ε|y, θ)]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABCµ multiple errors




                                      [© Ratmann et al., PNAS, 2009]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABCµ for model choice




                                      [© Ratmann et al., PNAS, 2009]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Questions about ABCµ


       For each model under comparison, marginal posterior on ε used to
       assess the fit of the model (HPD includes 0 or not).
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Questions about ABCµ


       For each model under comparison, marginal posterior on ε used to
       assess the fit of the model (HPD includes 0 or not).
              Is the data informative about ε? [Identifiability]
              How is the prior π(ε) impacting the comparison?
              How is using both ξ(ε|x0, θ) and π_ε(ε) compatible with a
              standard probability model? [remindful of Wilkinson ]
              Where is the penalisation for complexity in the model
              comparison?
                                       [X, Mengersen & Chen, 2010, PNAS]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-PRC

       Another sequential version producing a sequence of Markov
       transition kernels Kt and of samples (θ_1^(t), . . . , θ_N^(t)) (1 ≤ t ≤ T )
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-PRC

       Another sequential version producing a sequence of Markov
       transition kernels Kt and of samples (θ_1^(t), . . . , θ_N^(t)) (1 ≤ t ≤ T )

       ABC-PRC Algorithm
         1. Select a θ' at random from the previous θ_i^(t−1)'s with probabilities
            ω_i^(t−1) (1 ≤ i ≤ N).
         2. Generate
                      θ_i^(t) ∼ Kt(θ|θ') ,   x ∼ f (x|θ_i^(t)) ,
         3. Check that ρ(x, y) < ε, otherwise start again.

                                                       [Sisson et al., 2007]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Why PRC?


       Partial rejection control: Resample from a population of weighted
       particles by pruning away particles with weights below threshold C ,
       replacing them by new particles obtained by propagating an
       existing particle by an SMC step and modifying the weights
       accordingly.
                                                                [Liu, 2001]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Why PRC?


       Partial rejection control: Resample from a population of weighted
       particles by pruning away particles with weights below threshold C ,
       replacing them by new particles obtained by propagating an
       existing particle by an SMC step and modifying the weights
       accordingly.
                                                                [Liu, 2001]
       PRC justification in ABC-PRC:
              Suppose we then implement the PRC algorithm for some
              c > 0 such that only identically zero weights are smaller
              than c
       Trouble is, there is no such c...
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-PRC weight


       Probability ω_i^(t) computed as

              ω_i^(t) ∝ π(θ_i^(t)) L_{t−1}(θ'|θ_i^(t)) {π(θ') Kt(θ_i^(t)|θ')}^{−1} ,

        where L_{t−1} is an arbitrary transition kernel.
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-PRC weight


       Probability ω_i^(t) computed as

              ω_i^(t) ∝ π(θ_i^(t)) L_{t−1}(θ'|θ_i^(t)) {π(θ') Kt(θ_i^(t)|θ')}^{−1} ,

        where L_{t−1} is an arbitrary transition kernel.
       In case
                     L_{t−1}(θ'|θ) = Kt(θ|θ') ,
       all weights are equal under a uniform prior.
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-PRC weight


       Probability ω_i^(t) computed as

              ω_i^(t) ∝ π(θ_i^(t)) L_{t−1}(θ'|θ_i^(t)) {π(θ') Kt(θ_i^(t)|θ')}^{−1} ,

        where L_{t−1} is an arbitrary transition kernel.
       In case
                     L_{t−1}(θ'|θ) = Kt(θ|θ') ,
       all weights are equal under a uniform prior.
       Inspired from Del Moral et al. (2006), who use backward kernels
       L_{t−1} in SMC to achieve unbiasedness
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-PRC bias

                            Lack of unbiasedness of the method
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-PRC bias

                                      Lack of unbiasedness of the method
       Joint density of the accepted pair (θ^(t−1), θ^(t)) proportional to

              π(θ^(t−1)|y) Kt(θ^(t)|θ^(t−1)) f (y|θ^(t)) ,

        For an arbitrary function h(θ), E[ω_t h(θ^(t))] proportional to

         ∫ h(θ^(t)) [π(θ^(t)) L_{t−1}(θ^(t−1)|θ^(t)) / π(θ^(t−1)) Kt(θ^(t)|θ^(t−1))]
               × π(θ^(t−1)|y) Kt(θ^(t)|θ^(t−1)) f (y|θ^(t)) dθ^(t−1) dθ^(t)
         ∝ ∫ h(θ^(t)) [π(θ^(t)) L_{t−1}(θ^(t−1)|θ^(t)) / π(θ^(t−1)) Kt(θ^(t)|θ^(t−1))] π(θ^(t−1)) f (y|θ^(t−1))
               × Kt(θ^(t)|θ^(t−1)) f (y|θ^(t)) dθ^(t−1) dθ^(t)
         ∝ ∫ h(θ^(t)) π(θ^(t)|y) [∫ L_{t−1}(θ^(t−1)|θ^(t)) f (y|θ^(t−1)) dθ^(t−1)] dθ^(t) .
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A mixture example (1)

       Toy model of Sisson et al. (2007): if

               θ ∼ U(−10, 10) ,       x|θ ∼ 0.5 N (θ, 1) + 0.5 N (θ, 1/100) ,

       then the posterior distribution associated with y = 0 is the normal
       mixture
                    θ|y = 0 ∼ 0.5 N (0, 1) + 0.5 N (0, 1/100)
       restricted to [−10, 10].
       Furthermore, true target available as

        π(θ | |x| < ε) ∝ Φ(ε − θ) − Φ(−ε − θ) + Φ(10(ε − θ)) − Φ(−10(ε + θ)) .
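
       For reference, this unnormalised target is immediate to code in R
       (here eps denotes the tolerance ε):

        target=function(theta,eps)
          pnorm(eps-theta)-pnorm(-eps-theta)+
          pnorm(10*(eps-theta))-pnorm(-10*(eps+theta))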
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


“Ugly, squalid graph...”
             [Figure: two rows of five posterior density estimates over θ ∈ (−3, 3)]
                         Comparison of τ = 0.15 and τ = 1/0.15 in Kt
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A PMC version
       Use of the same kernel idea as ABC-PRC but with IS correction
       [check PMC]
       Generate a sample at iteration t by

              π̂_t(θ^(t)) ∝ Σ_{j=1}^N ω_j^(t−1) Kt(θ^(t)|θ_j^(t−1))

       modulo acceptance of the associated x_t, and use an importance
       weight associated with an accepted simulation θ_i^(t)

              ω_i^(t) ∝ π(θ_i^(t)) / π̂_t(θ_i^(t)) .

        c Still likelihood free
                                                                          [Beaumont et al., 2009]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Our ABC-PMC algorithm
       Given a decreasing sequence of approximation levels ε_1 ≥ . . . ≥ ε_T ,

          1. At iteration t = 1,
                 For i = 1, ..., N
                        Simulate θ_i^(1) ∼ π(θ) and x ∼ f (x|θ_i^(1)) until ρ(x, y) < ε_1
                        Set ω_i^(1) = 1/N
                 Take τ² as twice the empirical variance of the θ_i^(1)'s
          2. At iteration 2 ≤ t ≤ T ,
                 For i = 1, ..., N, repeat
                        Pick θ_i* from the θ_j^(t−1)'s with probabilities ω_j^(t−1)
                        generate θ_i^(t)|θ_i* ∼ N (θ_i*, σ_t²) and x ∼ f (x|θ_i^(t))
                 until ρ(x, y) < ε_t
                 Set ω_i^(t) ∝ π(θ_i^(t)) / Σ_{j=1}^N ω_j^(t−1) ϕ( σ_t^{−1} (θ_i^(t) − θ_j^(t−1)) )
                 Take τ_{t+1}² as twice the weighted empirical variance of the θ_i^(t)'s
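
       A hedged R sketch of this ABC-PMC scheme for a scalar parameter;
       rprior, dprior, the simulator rdata (pseudo-data given θ), the distance
       rho (to the observed data) and the tolerance sequence eps are all
       assumed to be supplied by the user:

        abc.pmc=function(N,eps,rprior,dprior,rdata,rho){
          theta=numeric(N);w=rep(1/N,N)
          for (i in 1:N)                             # t=1: ABC rejection from the prior
            repeat{theta[i]=rprior(1);if (rho(rdata(theta[i]))<=eps[1]) break}
          tau2=2*var(theta)
          for (t in 2:length(eps)){
            old=theta;oldw=w
            for (i in 1:N){
              repeat{                                # move a picked particle until acceptance
                theta[i]=rnorm(1,old[sample(N,1,prob=oldw)],sqrt(tau2))
                if (rho(rdata(theta[i]))<=eps[t]) break}
              w[i]=dprior(theta[i])/sum(oldw*dnorm(theta[i],old,sqrt(tau2)))}
            w=w/sum(w)
            tau2=2*sum(w*(theta-sum(w*theta))^2)     # twice the weighted variance
          }
          list(theta=theta,weights=w)
        }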
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Sequential Monte Carlo

       SMC is a simulation technique to approximate a sequence of
       related probability distributions π_n with π_0 “easy” and π_T as
       target.
       Iterated IS as PMC: particles moved from time n−1 to time n via
       kernel K_n and use of a sequence of extended targets π̃_n

              π̃_n(z_{0:n}) = π_n(z_n) Π_{j=0}^{n−1} L_j(z_{j+1}, z_j)

       where the L_j’s are backward Markov kernels [check that π_n(z_n) is a
       marginal]
                               [Del Moral, Doucet & Jasra, Series B, 2006]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Sequential Monte Carlo (2)

       Algorithm 3 SMC sampler
           sample z_i^(0) ∼ γ_0(x) (i = 1, . . . , N)
           compute weights w_i^(0) = π_0(z_i^(0))/γ_0(z_i^(0))
           for t = 1 to N do
             if ESS(w^(t−1)) < N_T then
                resample N particles z^(t−1) and set weights to 1
             end if
             generate z_i^(t) ∼ Kt(z_i^(t−1), ·) and set weights to

                w_i^(t) = w_i^(t−1) π_t(z_i^(t)) L_{t−1}(z_i^(t), z_i^(t−1)) / π_{t−1}(z_i^(t−1)) Kt(z_i^(t−1), z_i^(t))

           end for
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-SMC
                                       [Del Moral, Doucet & Jasra, 2009]

       True derivation of an SMC-ABC algorithm
       Use of a kernel K_n associated with target π_{ε_n} and derivation of the
       backward kernel

              L_{n−1}(z, z') = π_{ε_n}(z') K_n(z', z) / π_{ε_n}(z)

       Update of the weights

              w_{in} ∝ w_{i(n−1)} Σ_{m=1}^M I_{A_{ε_n}}(x_{in}^m) / Σ_{m=1}^M I_{A_{ε_{n−1}}}(x_{i(n−1)}^m)

       when x_{in}^m ∼ K(x_{i(n−1)}, ·)
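
       In R, this weight update is just the ratio of acceptance proportions
       over the M replicates; rho.new and rho.old (distances of the M
       pseudo-data sets to y at levels n and n−1) are assumed:

        w.update=function(wprev,rho.new,rho.old,eps.new,eps.old)
          wprev*mean(rho.new<=eps.new)/mean(rho.old<=eps.old)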
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-SMCM



       Modification: makes M repeated simulations of the pseudo-data z
       given the parameter, rather than using a single [M = 1] simulation,
       leading to a weight proportional to the number of accepted z_i's,

                      ω(θ) = (1/M) Σ_{i=1}^M I_{ρ(η(y),η(z_i)) < ε}

       [limit in M means exact simulation from (tempered) target]
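
       In R, this weight is the acceptance frequency over the M replicate
       pseudo-data sets; rdata, the summary eta, the distance rho and the
       observed y are assumed:

        smcm.weight=function(theta,M,eps,rdata,eta,rho,y)
          mean(replicate(M,rho(eta(rdata(theta)),eta(y))<=eps))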
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Properties of ABC-SMC


       The ABC-SMC method properly uses a backward kernel L(z, z ) to
       simplify the importance weight and to remove the dependence on
       the unknown likelihood from this weight. Update of importance
       weights is reduced to the ratio of the proportions of surviving
       particles
       Major assumption: the forward kernel K is supposed to be invariant
       against the true target [tempered version of the true posterior]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Properties of ABC-SMC


       The ABC-SMC method properly uses a backward kernel L(z, z ) to
       simplify the importance weight and to remove the dependence on
       the unknown likelihood from this weight. Update of importance
       weights is reduced to the ratio of the proportions of surviving
       particles
       Major assumption: the forward kernel K is supposed to be invariant
       against the true target [tempered version of the true posterior]
       Adaptivity in ABC-SMC algorithm only found in on-line
       construction of the thresholds ε_t, slowly enough to keep a large
       number of accepted transitions
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A mixture example (2)
       Recovery of the target, whether using a fixed standard deviation of
       τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt ’s.
             [Figure: five panels of posterior density estimates over θ ∈ (−3, 3)]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Yet another ABC-PRC
       Another version of ABC-PRC called PRC-ABC with a proposal
       distribution Mt , S replications of the pseudo-data, and a PRC step:
       PRC-ABC Algorithm
         1. Select θ' at random from the θ_i^(t−1)'s with probabilities ω_i^(t−1)
         2. Generate θ_i^(t) ∼ Kt, x_1, . . . , x_S ∼ iid f (x|θ_i^(t−1)), set

                   ω_i^(t) = Σ_s Kt{η(x_s) − η(y)} / Kt(θ_i^(t)) ,

            and accept with probability p(i) = ω_i^(t)/c_{t,N}
         3. Set ω_i^(t) ∝ ω_i^(t)/p(i) (1 ≤ i ≤ N)

                              [Peters, Sisson & Fan, Stat & Computing, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Yet another ABC-PRC



       Another version of ABC-PRC called PRC-ABC with a proposal
       distribution Mt , S replications of the pseudo-data, and a PRC step:
              Use of a NP estimate Mt(θ) = Σ_i ω_i^(t−1) ψ(θ|θ_i^(t−1)) (with a
              fixed variance?!)
              S = 1 recommended for “greatest gains in sampler
              performance”
              ct,N upper quantile of non-zero weights at time t
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Wilkinson’s exact BC
       ABC approximation error (i.e. non-zero tolerance) replaced with
       exact simulation from a controlled approximation to the target,
       convolution of true posterior with kernel function

              π_ε(θ, z|y) = π(θ) f (z|θ) K_ε(y − z) / ∫ π(θ) f (z|θ) K_ε(y − z) dz dθ ,

       with K_ε kernel parameterised by bandwidth ε.
                                                        [Wilkinson, 2008]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Wilkinson’s exact BC
       ABC approximation error (i.e. non-zero tolerance) replaced with
       exact simulation from a controlled approximation to the target,
       convolution of true posterior with kernel function
                π_ε(θ, z|y) = π(θ) f(z|θ) K_ε(y − z) / ∫∫ π(θ) f(z|θ) K_ε(y − z) dz dθ ,

       with K_ε a kernel parameterised by bandwidth ε.
                                                         [Wilkinson, 2008]
       Theorem
       The ABC algorithm based on the assumption of a randomised
       observation ỹ = y + ξ, ξ ∼ K_ε, and an acceptance probability of

                                      K_ε(y − z)/M

       gives draws from the posterior distribution π(θ|y).
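
       A minimal Python sketch of this exact-BC rejection step, under assumed toy
       choices (Gaussian prior and likelihood, Gaussian kernel K_ε bounded by
       M = K_ε(0)); every name and setting below is illustrative, not part of
       Wilkinson's formulation.

       import numpy as np

       rng = np.random.default_rng(0)
       y_obs, eps, n_draws = 0.5, 0.3, 2000

       def k_eps(u):                        # kernel K_eps, here a N(0, eps^2) density
           return np.exp(-0.5 * (u / eps) ** 2) / (eps * np.sqrt(2 * np.pi))

       M = k_eps(0.0)                       # bound on K_eps, attained at u = 0
       draws = []
       while len(draws) < n_draws:
           theta = rng.normal(0.0, 1.0)     # theta ~ pi(theta)
           z = rng.normal(theta, 1.0)       # pseudo-data z ~ f(z | theta)
           if rng.uniform() < k_eps(y_obs - z) / M:   # accept w.p. K_eps(y - z)/M
               draws.append(theta)

       # draws are exact from the convolved target pi_eps(theta | y), i.e. the
       # posterior under the assumption y = z + xi with xi ~ K_eps
       print(np.mean(draws), np.std(draws))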
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


How exact a BC?




              “Using ε to represent measurement error is
              straightforward, whereas using ε to model the model
              discrepancy is harder to conceptualize and not as
              commonly used”
                                                [Richard Wilkinson, 2008]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


How exact a BC?
       Pros
              Pseudo-data from true model and observed data from noisy
              model
              Interesting perspective in that outcome is completely
              controlled
              Link with ABCµ and assuming y is observed with a
              measurement error with density K_ε
              Relates to the theory of model approximation
                                                    [Kennedy & O’Hagan, 2001]
       Cons
              Requires K to be bounded by M
              True approximation error never assessed
              Requires a modification of the standard ABC algorithm
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC for HMMs


       Specific case of a hidden         Markov model



                                      Xt+1 ∼ Qθ (Xt , ·)
                                      Yt+1 ∼ gθ (·|xt )
       where only y^0_{1:n} is observed.
                                            [Dean, Singh, Jasra & Peters, 2011]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC for HMMs


       Specific case of a hidden          Markov model



                                       Xt+1 ∼ Qθ (Xt , ·)
                                       Yt+1 ∼ gθ (·|xt )
       where only y^0_{1:n} is observed.
                                             [Dean, Singh, Jasra & Peters, 2011]
       Use of specific constraints, adapted to the Markov structure:
                              y_1 ∈ B(y^0_1, ε) × · · · × y_n ∈ B(y^0_n, ε)
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-MLE for HMMs

       ABC-MLE defined by
                      θ̂_n = arg max_θ P_θ( Y_1 ∈ B(y^0_1, ε), . . . , Y_n ∈ B(y^0_n, ε) )
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-MLE for HMMs

       ABC-MLE defined by
                      θ̂_n = arg max_θ P_θ( Y_1 ∈ B(y^0_1, ε), . . . , Y_n ∈ B(y^0_n, ε) )


       Exact MLE for the likelihood                      same basis as Wilkinson!

                                  p^ε_θ(y^0_1, . . . , y^0_n)

       corresponding to the perturbed process

                          (x_t, y_t + εz_t)_{1≤t≤n} ,      z_t ∼ U(B(0, 1))

                                               [Dean, Singh, Jasra & Peters, 2011]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-MLE is biased


              ABC-MLE is asymptotically (in n) biased with target

                             l_ε(θ) = E_{θ∗}[ log p^ε_θ(Y_1 | Y_{−∞:0}) ]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


ABC-MLE is biased


              ABC-MLE is asymptotically (in n) biased with target

                             l_ε(θ) = E_{θ∗}[ log p^ε_θ(Y_1 | Y_{−∞:0}) ]


              but ABC-MLE converges to the true value in the sense

                                      l_{ε_n}(θ_n) → l_ε(θ)

              for all sequences (θ_n) converging to θ and ε_n
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Noisy ABC-MLE


       Idea: Modify instead the data from the start
                              (y^0_1 + ζ_1, . . . , y^0_n + ζ_n)

                                                              [see Fearnhead-Prangle]
       noisy ABC-MLE estimate

              arg max_θ P_θ( Y_1 ∈ B(y^0_1 + ζ_1, ε), . . . , Y_n ∈ B(y^0_n + ζ_n, ε) )

                                                [Dean, Singh, Jasra & Peters, 2011]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Consistent noisy ABC-MLE



              Degrading the data improves the estimation performances:
                      Noisy ABC-MLE is asymptotically (in n) consistent
                      under further assumptions, the noisy ABC-MLE is
                      asymptotically normal
                      increase in variance of order ε^{−2}
              likely degradation in precision or computing time due to the
              lack of summary statistic
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


SMC for ABC likelihood

       Algorithm 4 SMC ABC for HMMs
         Given θ
         for k = 1, . . . , n do
            generate proposals (x^1_k, y^1_k), . . . , (x^N_k, y^N_k) from the model
            weigh each proposal with ω^l_k = I_{B(y^0_k + ζ_k, ε)}(y^l_k)
            renormalise the weights and sample the x^l_k's accordingly
          end for
          approximate the likelihood by

                                 ∏_{k=1}^n ( Σ_{l=1}^N ω^l_k / N )



                                           [Jasra, Singh, Martin & McCoy, 2010]
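
       A short sketch of this SMC approximation to the ABC likelihood, on an assumed
       toy Gaussian random-walk HMM (the jitter ζ_k of the noisy version is omitted
       here); the model, tolerance and particle number are illustrative only.

       import numpy as np

       rng = np.random.default_rng(1)

       def abc_loglik(theta, y_obs, eps, N=500):
           """SMC approximation of the ABC (log-)likelihood for a toy HMM."""
           x = np.zeros(N)                              # particles for the hidden state
           loglik = 0.0
           for k in range(len(y_obs)):
               x = x + rng.normal(0.0, theta, N)        # X_{k+1} ~ Q_theta(X_k, .)
               y = x + rng.normal(0.0, 1.0, N)          # Y_{k+1} ~ g_theta(. | x)
               w = (np.abs(y - y_obs[k]) < eps).astype(float)   # indicator weights
               if w.sum() == 0.0:
                   return -np.inf                       # every particle misses the ball
               loglik += np.log(w.mean())               # factor (1/N) sum_l w_k^l
               x = x[rng.choice(N, N, p=w / w.sum())]   # resample surviving x_k^l's
           return loglik

       y_obs = np.cumsum(rng.normal(0.0, 1.0, 20)) + rng.normal(0.0, 1.0, 20)
       print(abc_loglik(1.0, y_obs, eps=0.5))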
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Which summary?


       Fundamental difficulty of the choice of the summary statistic when
       there is no non-trivial sufficient statistic [except when done by the
       experimenters in the field]
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Which summary?


       Fundamental difficulty of the choice of the summary statistic when
       there is no non-trivial sufficient statistic [except when done by the
       experimenters in the field]
       Starting from a large collection of available summary statistics,
       Joyce and Marjoram (2008) consider the sequential inclusion into
       the ABC target, with a stopping rule based on a likelihood ratio
       test.
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


Which summary?


       Fundamental difficulty of the choice of the summary statistic when
       there is no non-trivial sufficient statistic [except when done by the
       experimenters in the field]
       Starting from a large collection of available summary statistics,
       Joyce and Marjoram (2008) consider the sequential inclusion into
       the ABC target, with a stopping rule based on a likelihood ratio
       test.
              Does not take into account the sequential nature of the tests
              Depends on parameterisation
              Order of inclusion matters.
Likelihood-free Methods
   Approximate Bayesian computation
     Alphabet soup


A connected Monte Carlo study



              Repeating simulations of the pseudo-data per simulated
              parameter does not improve the approximation
              Tolerance level ε does not seem to be highly influential
              Choice of distance / summary statistics / calibration factors
              are paramount to successful approximation
              ABC-SMC outperforms ABC-MCMC
                                                      [McKinley, Cook, Deardon, 2009]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Semi-automatic ABC

       Fearnhead and Prangle (2010) study ABC and the selection of the
       summary statistic in close proximity to Wilkinson’s proposal
       ABC then considered from a purely inferential viewpoint and
       calibrated for estimation purposes.
       Use of a randomised (or ‘noisy’) version of the summary statistics

                                              η̃(y) = η(y) + τε

       Derivation of a well-calibrated version of ABC, i.e. an algorithm
       that gives proper predictions for the distribution associated with
       this randomised summary statistic.
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Semi-automatic ABC

       Fearnhead and Prangle (2010) study ABC and the selection of the
       summary statistic in close proximity to Wilkinson’s proposal
       ABC then considered from a purely inferential viewpoint and
       calibrated for estimation purposes.
       Use of a randomised (or ‘noisy’) version of the summary statistics

                                              η̃(y) = η(y) + τε

       Derivation of a well-calibrated version of ABC, i.e. an algorithm
       that gives proper predictions for the distribution associated with
       this randomised summary statistic. [calibration constraint: ABC
       approximation with same posterior mean as the true randomised
       posterior.]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Summary statistics



       Optimality of the posterior expectation E[θ|y] of the parameter of
       interest as summary statistics η(y)!
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Summary statistics



       Optimality of the posterior expectation E[θ|y] of the parameter of
       interest as summary statistics η(y)!
       Use of the standard quadratic loss function

                                              (θ − θ_0)ᵀ A (θ − θ_0) .
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Details on Fearnhead and Prangle (FP) ABC

       Use of a summary statistic S(·), an importance proposal g (·), a
       kernel K(·) ≤ 1 and a bandwidth h > 0 such that

                                         (θ, ysim ) ∼ g (θ)f (ysim |θ)

       is accepted with probability (hence the bound)

                                             K [{S(ysim ) − sobs }/h]

       and the corresponding importance weight defined by

                                              π(θ)/g(θ)

                                                              [Fearnhead & Prangle, 2012]
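
       A minimal sketch of this importance-sampling form of ABC (uniform kernel K
       bounded by 1, Gaussian toy model and proposal); all names and settings are
       assumptions made for illustration only.

       import numpy as np

       rng = np.random.default_rng(2)
       s_obs, h, T = 0.8, 0.2, 20000

       def norm_pdf(x, m, s):
           return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

       thetas, weights = [], []
       for _ in range(T):
           theta = rng.normal(1.0, 1.0)                 # theta ~ g(theta), the proposal
           y_sim = rng.normal(theta, 1.0, size=10)      # y_sim ~ f(. | theta)
           s_sim = y_sim.mean()                         # summary S(y_sim)
           if abs(s_sim - s_obs) <= h:                  # accept w.p. K[{S(y_sim) - s_obs}/h]
               thetas.append(theta)
               weights.append(norm_pdf(theta, 0.0, 2.0) / norm_pdf(theta, 1.0, 1.0))  # pi/g

       thetas, weights = np.array(thetas), np.array(weights)
       print(len(thetas), np.sum(weights * thetas) / np.sum(weights))  # weighted mean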
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Errors, errors, and errors

       Three levels of approximation
              π(θ|yobs ) by π(θ|sobs ) loss of information
                                                                                [ignored]
              π(θ|sobs ) by

                 π_ABC(θ|sobs) = ∫ π(s) K[{s − sobs}/h] π(θ|s) ds  /  ∫ π(s) K[{s − sobs}/h] ds

              noisy observations
              πABC (θ|sobs ) by importance Monte Carlo based on N
              simulations, represented by var(a(θ)|sobs )/Nacc [expected
              number of acceptances]
                                                               [M. Twain/B. Disraeli]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Average acceptance asymptotics


       For the average acceptance probability/approximate likelihood

                  p(θ|sobs) = ∫ f(ysim|θ) K[{S(ysim) − sobs}/h] dysim ,

       overall acceptance probability

                    p(sobs) = ∫ p(θ|sobs) π(θ) dθ = π(sobs) h^d + o(h^d)

                                                                              [FP, Lemma 1]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Optimal importance proposal


       Best choice of importance proposal in terms of effective sample size

                                  g(θ|sobs) ∝ π(θ) p(θ|sobs)^{1/2}

                                                    [Not particularly useful in practice]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Optimal importance proposal


       Best choice of importance proposal in terms of effective sample size

                                  g(θ|sobs) ∝ π(θ) p(θ|sobs)^{1/2}

                                                    [Not particularly useful in practice]

              note that p(θ|sobs ) is an approximate likelihood
              reminiscent of parallel tempering
              could be approximately achieved by attrition of half of the
              data
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Calibration of h

              “This result gives insight into how S(·) and h affect the Monte
              Carlo error. To minimize Monte Carlo error, we need h^d to be not
              too small. Thus ideally we want S(·) to be a low dimensional
              summary of the data that is sufficiently informative about θ that
              π(θ|sobs ) is close, in some sense, to π(θ|yobs )” (FP, p.5)


              turns h into an absolute value while it should be
              context-dependent and user-calibrated
              only addresses one term in the approximation error and
              acceptance probability (“curse of dimensionality”)
              h large prevents π_ABC(θ|sobs) from being close to π(θ|sobs)
              d small prevents π(θ|sobs) from being close to π(θ|yobs) (“curse of
              [dis]information”)
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Calibrating ABC

              “If πABC is calibrated, then this means that probability statements
              that are derived from it are appropriate, and in particular that we
              can use πABC to quantify uncertainty in estimates” (FP, p.5)
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Calibrating ABC

              “If πABC is calibrated, then this means that probability statements
              that are derived from it are appropriate, and in particular that we
              can use πABC to quantify uncertainty in estimates” (FP, p.5)


       Definition
       For 0 < q < 1 and subset A, event Eq(A) made of sobs such that
       PrABC (θ ∈ A|sobs ) = q. Then ABC is calibrated if

                                             Pr(θ ∈ A|Eq (A)) = q



              unclear meaning of conditioning on Eq (A)
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Calibrated ABC



       Theorem (FP)
       Noisy ABC, where

                                  sobs = S(yobs) + hε ,     ε ∼ K(·)

       is calibrated
                                                                       [Wilkinson, 2008]
                                       no condition on h!!
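
       In practice the noisy-ABC step only requires jittering the observed summary
       once before the usual accept/reject loop; a toy sketch (Gaussian model,
       uniform kernel K, illustrative settings) follows.

       import numpy as np

       rng = np.random.default_rng(3)
       h = 0.2
       y_obs = rng.normal(1.0, 1.0, 10)
       s_obs = y_obs.mean()                               # S(y_obs)
       s_noisy = s_obs + h * rng.uniform(-1.0, 1.0)       # s_obs + h*eps with eps ~ K

       accepted = []
       for _ in range(50000):
           theta = rng.normal(0.0, 2.0)                   # theta ~ pi(theta)
           s_sim = rng.normal(theta, 1.0, 10).mean()
           if abs(s_sim - s_noisy) <= h:                  # same kernel K and bandwidth h
               accepted.append(theta)

       print(len(accepted), np.mean(accepted))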
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Calibrated ABC




       Consequence: when h = ∞

       Theorem (FP)
       The prior distribution is always calibrated

       is this a relevant property then?
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


More about calibrated ABC


              “Calibration is not universally accepted by Bayesians. It is even more
              questionable here as we care how statements we make relate to the
              real world, not to a mathematically defined posterior.” R. Wilkinson


              Same reluctance about the prior being calibrated
              Property depending on prior, likelihood, and summary
              Calibration is a frequentist property (almost a p-value!)
              More sensible to account for the simulator’s imperfections
              than using noisy-ABC against a meaningless ε-based measure
                                                                    [Wilkinson, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Converging ABC


       Theorem (FP)
       For noisy ABC, the expected noisy-ABC log-likelihood,

           E{log[p(θ|sobs)]} = ∫∫ log[p(θ|S(yobs) + ε)] π(yobs|θ0) K(ε) dyobs dε ,

       has its maximum at θ = θ0 .

       True for any choice of summary statistic? even ancillary statistics?!
                                        [Imposes at least identifiability...]
       Relevant in asymptotia and not for the data
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Converging ABC


       Corollary
       For noisy ABC, the ABC posterior converges onto a point mass on
       the true parameter value as m → ∞.

       For standard ABC, not always the case (unless h goes to zero).
       Strength of regularity conditions (c1) and (c2) in Bernardo
        Smith, 1994?
                     [out-of-reach constraints on likelihood and posterior]
       Again, there must be conditions imposed upon summary
       statistics...
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Loss motivated statistic
       Under quadratic loss function,

       Theorem (FP)
         (i) The minimal posterior error E[L(θ, θ̂)|yobs] occurs when
             θ̂ = E(θ|yobs) (!)
        (ii) When h → 0, E_ABC(θ|sobs) converges to E(θ|yobs)
       (iii) If S(yobs) = E[θ|yobs] then for θ̂ = E_ABC[θ|sobs]

                  E[L(θ, θ̂)|yobs] = trace(AΣ) + h² ∫ xᵀA x K(x) dx + o(h²).

       measure-theoretic difficulties?
       dependence of sobs on h makes me uncomfortable: inherent to noisy
       ABC
       Relevant for choice of K ?
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Optimal summary statistic
              “We take a different approach, and weaken the requirement for
              πABC to be a good approximation to π(θ|yobs ). We argue for πABC
              to be a good approximation solely in terms of the accuracy of
              certain estimates of the parameters.” (FP, p.5)

       From this result, FP
              derive their choice of summary statistic,

                                             S(y) = E(θ|y)

                                                                      [almost sufficient]
              suggest

                            h = O(N −1/(2+d) ) and h = O(N −1/(4+d) )

              as optimal bandwidths for noisy and standard ABC.
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Optimal summary statistic
              “We take a different approach, and weaken the requirement for
              πABC to be a good approximation to π(θ|yobs ). We argue for πABC
              to be a good approximation solely in terms of the accuracy of
              certain estimates of the parameters.” (FP, p.5)

       From this result, FP
              derive their choice of summary statistic,

                                             S(y) = E(θ|y)

                                                    [wow! EABC [θ|S(yobs )] = E[θ|yobs ]]
              suggest

                            h = O(N −1/(2+d) ) and h = O(N −1/(4+d) )

              as optimal bandwidths for noisy and standard ABC.
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Caveat


       Since E(θ|yobs) is usually unavailable, FP suggest
         (i) use a pilot run of ABC to determine a region of non-negligible
             posterior mass;
        (ii) simulate sets of parameter values and data;
       (iii) use the simulated sets of parameter values and data to
             estimate the summary statistic; and
       (iv) run ABC with this choice of summary statistic.
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Caveat


       Since E(θ|yobs) is usually unavailable, FP suggest
         (i) use a pilot run of ABC to determine a region of non-negligible
             posterior mass;
        (ii) simulate sets of parameter values and data;
       (iii) use the simulated sets of parameter values and data to
             estimate the summary statistic; and
       (iv) run ABC with this choice of summary statistic.
       where is the assessment of the first stage error?
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Approximating the summary statistic



       As in Beaumont et al. (2002) and Blum and François (2010), FP
       use a linear regression to approximate E(θ|yobs):

                          θ_i = β_0^(i) + β^(i) f(yobs) + ε_i

       with f made of standard transforms
                                                                 [Further justifications?]
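
       A sketch of the pilot regression behind the semi-automatic summary, with
       made-up transforms f(y) (mean, variance, third moment); the fitted linear
       predictor then serves as S(y) ≈ E[θ|y] in a second ABC run. Settings are
       illustrative only.

       import numpy as np

       rng = np.random.default_rng(4)
       M = 5000
       theta = rng.normal(0.0, 2.0, M)                    # pilot draws from the prior
       y = rng.normal(theta[:, None], 1.0, (M, 10))       # matching pseudo-data sets

       # transforms f(y): intercept, mean, variance and third empirical moment
       F = np.column_stack([np.ones(M), y.mean(1), y.var(1), (y ** 3).mean(1)])
       beta, *_ = np.linalg.lstsq(F, theta, rcond=None)   # least-squares regression

       def S(y_new):
           """Estimated summary statistic: linear approximation of E[theta | y]."""
           f = np.array([1.0, y_new.mean(), y_new.var(), (y_new ** 3).mean()])
           return f @ beta

       y_obs = rng.normal(1.0, 1.0, 10)
       print(S(y_obs))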
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


[my]questions about semi-automatic ABC

              dependence on h and S(·) in the early stage
              reduction of Bayesian inference to point estimation
              approximation error in step (i) not accounted for
              not parameterisation invariant
              practice shows that proper approximation to genuine posterior
              distributions stems from using a (much) larger number of
              summary statistics than the dimension of the parameter
              the validity of the approximation to the optimal summary
              statistic depends on the quality of the pilot run
              important inferential issues like model choice are not covered
              by this approach.
                                                              [Robert, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


More about semi-automatic ABC
                                             [End of section derived from comments on Read Paper, to appear]

              “The apparently arbitrary nature of the choice of summary statistics
              has always been perceived as the Achilles heel of ABC.” M.
              Beaumont
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


More about semi-automatic ABC
                                             [End of section derived from comments on Read Paper, to appear]

              “The apparently arbitrary nature of the choice of summary statistics
              has always been perceived as the Achilles heel of ABC.” M.
              Beaumont

              “Curse of dimensionality” linked with the increase of the
              dimension of the summary statistic
              Connection with principal component analysis
                                                                                       [Itan et al., 2010]
              Connection with partial least squares
                                                                               [Wegman et al., 2009]
              Beaumont et al. (2002) postprocessed output is used as input
              by FP to run a second ABC
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Wood’s alternative
       Instead of a non-parametric kernel approximation to the likelihood
                          (1/R) Σ_r K_ε{η(y_r) − η(yobs)}

       Wood (2010) suggests a normal approximation

                                             η(y(θ)) ∼ Nd (µθ , Σθ )

       whose parameters can be approximated based on the R simulations
       (for each value of θ).
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Wood’s alternative
       Instead of a non-parametric kernel approximation to the likelihood
                          (1/R) Σ_r K_ε{η(y_r) − η(yobs)}

       Wood (2010) suggests a normal approximation

                                             η(y(θ)) ∼ Nd (µθ , Σθ )

       whose parameters can be approximated based on the R simulations
       (for each value of θ).
              Parametric versus non-parametric rate [Uh?!]
              Automatic weighting of components of η(·) through Σθ
              Dependence on normality assumption (pseudo-likelihood?)
                                                   [Cornebise, Girolami & Kosmidis, 2012]
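
       A sketch of this normal (synthetic-likelihood) alternative: for a given θ,
       simulate R replicates, fit N_d(µ_θ, Σ_θ) to their summaries η(·), and evaluate
       the fitted log-density at η(yobs); the model and summaries below are assumed
       toy choices, not Wood's original example.

       import numpy as np

       rng = np.random.default_rng(5)

       def eta(y):                                        # d-dimensional summary
           return np.array([y.mean(), y.std(), np.abs(np.diff(y)).mean()])

       def synthetic_loglik(theta, y_obs, R=200, n=50):
           sims = np.array([eta(rng.normal(theta, 1.0, n)) for _ in range(R)])
           mu = sims.mean(axis=0)                         # estimate of mu_theta
           Sigma = np.cov(sims, rowvar=False)             # estimate of Sigma_theta
           diff = eta(y_obs) - mu
           _, logdet = np.linalg.slogdet(Sigma)
           quad = diff @ np.linalg.solve(Sigma, diff)
           return -0.5 * (logdet + quad + len(mu) * np.log(2 * np.pi))

       y_obs = rng.normal(1.0, 1.0, 50)
       for t in (0.0, 0.5, 1.0, 1.5):
           print(t, synthetic_loglik(t, y_obs))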
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Reinterpretation and extensions
       Reinterpretation of ABC output as joint simulation from

                              π̄(x, y|θ) = f(x|θ) π̄_{Y|X}(y|x)

       where
                                  π̄_{Y|X}(y|x) = K_ε(y − x)
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Reinterpretation and extensions
       Reinterpretation of ABC output as joint simulation from

                              π̄(x, y|θ) = f(x|θ) π̄_{Y|X}(y|x)

       where
                                  π̄_{Y|X}(y|x) = K_ε(y − x)

       Reinterpretation of noisy ABC
       if ȳ | yobs ∼ π̄_{Y|X}(·|yobs), then marginally

                                     ȳ ∼ π̄_{Y|θ}(·|θ0)

      c Explains the consistency of Bayesian inference based on ȳ and π̄
                                           [Lee, Andrieu & Doucet, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Reinterpretation and extensions
       Also fits within the pseudo-marginal of Andrieu and Roberts (2009)

        Algorithm 5 Rejuvenating GIMH-ABC
          At time t, given θt = θ:
          Sample θ′ ∼ g(·|θ).
          Sample z_{1:N} ∼ π^{⊗N}_{Y|Θ}(·|θ′).
          Sample x_{1:N−1} ∼ π^{⊗(N−1)}_{Y|Θ}(·|θ).
          With probability

             min{ 1, [π(θ′)g(θ|θ′) / π(θ)g(θ′|θ)] × [Σ_{i=1}^N I_{B_h(z_i)}(y)] / [1 + Σ_{i=1}^{N−1} I_{B_h(x_i)}(y)] }

          set θ_{t+1} = θ′.
          Otherwise set θ_{t+1} = θ.
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Reinterpretation and extensions
       Also fits within the pseudo-marginal of Andrieu and Roberts (2009)

        Algorithm 6 One-hit MCMC-ABC
          At time t, with θt = θ:
          Sample θ′ ∼ g(·|θ).
          With probability 1 − min{ 1, π(θ′)g(θ|θ′) / π(θ)g(θ′|θ) }, set θ_{t+1} = θ
          and go to time t + 1.
          Sample z_i ∼ π_{Y|Θ}(·|θ′) and x_i ∼ π_{Y|Θ}(·|θ) for i = 1, . . . until
          y ∈ B_h(z_i) and/or y ∈ B_h(x_i).
          If y ∈ B_h(z_i), set θ_{t+1} = θ′ and go to time t + 1.
          If y ∈ B_h(x_i), set θ_{t+1} = θ and go to time t + 1.


                                              [Lee, Andrieu & Doucet, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Reinterpretation and extensions


       Also fits within the pseudo-marginal of Andrieu and Roberts (2009)

       Validation
        c The invariant distribution of θ is

                                              π_{θ|Ȳ}(·|ȳ)

       for both algorithms.

                                                      [Lee, Andrieu & Doucet, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


ABC for Markov chains

       Rewriting the posterior as

                     π(θ)^{1−n} π(θ|x_1) ∏_{t=2}^n π(θ|x_{t−1}, x_t)

       where π(θ|x_{t−1}, x_t) ∝ f(x_t|x_{t−1}, θ) π(θ)
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


ABC for Markov chains

       Rewriting the posterior as

                     π(θ)^{1−n} π(θ|x_1) ∏_{t=2}^n π(θ|x_{t−1}, x_t)

       where π(θ|x_{t−1}, x_t) ∝ f(x_t|x_{t−1}, θ) π(θ)
              Allows for a stepwise ABC, replacing each π(θ|xt−1 , xt ) by an
              ABC approximation
              Similarity with FP’s multiple sources of data (and also with
                Dean et al., 2011 )


                                                         [White et al., 2010, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Back to sufficiency
       Difference between regular sufficiency, equivalent to
                                             π(θ|y) = π(θ|η(y))
       for all θ’s and all priors π, and
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Back to sufficiency
       Difference between regular sufficiency, equivalent to
                                             π(θ|y) = π(θ|η(y))
       for all θ’s and all priors π, and
       marginal sufficiency, stated as
                                        π(µ(θ)|y) = π(µ(θ)|η(y))
       for all θ’s, the given prior π and a subvector µ(θ)
                                                                   [Basu, 1977]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Back to sufficiency
       Difference between regular sufficiency, equivalent to
                                             π(θ|y) = π(θ|η(y))
       for all θ’s and all priors π, and
       marginal sufficiency, stated as
                                        π(µ(θ)|y) = π(µ(θ)|η(y))
       for all θ’s, the given prior π and a subvector µ(θ)
                                                             [Basu, 1977]
       Relates to F & P’s main result, but could even be reduced to
       conditional sufficiency
                                   π(µ(θ)|yobs ) = π(µ(θ)|η(yobs ))
       (if feasible at all...)
                                                                      [Dawson, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


ABC and BCa’s




       Arguments in favour of a comparison of ABC with bootstrap in the
       event π(θ) is not defined from prior information but as a
       conventional non-informative prior.
                                                            [Efron, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Predictive performances



       Instead of posterior means, other aspects of posterior to explore.
       E.g., look at minimising loss of information

            ∫ p(θ, y) log[ p(θ, y) / p(θ)p(y) ] dθ dy  −  ∫ p(θ, η(y)) log[ p(θ, η(y)) / p(θ)p(η(y)) ] dθ dη(y)

       for selection of summary statistics.
                                           [Filippi, Barnes & Stumpf, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Auxiliary variables



       Auxiliary variable method avoids computations of the intractable
       constant in the likelihood

                                    f(y|θ) = Z_θ f̃(y|θ)
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Auxiliary variables

       Auxiliary variable method avoids computations of the intractable
       constant in the likelihood

                                    f(y|θ) = Z_θ f̃(y|θ)

       Introduce pseudo-data z with artificial target g (z|θ, y)
       Generate θ′ ∼ K(θ, θ′) and z′ ∼ f(z|θ′)
       Accept with probability

            [ π(θ′) f(y|θ′) g(z′|θ′, y) K(θ′, θ) f(z|θ) ] / [ π(θ) f(y|θ) g(z|θ, y) K(θ, θ′) f(z′|θ′) ]  ∧  1

                                      [Møller, Pettitt, Berthelsen & Reeves, 2006]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Auxiliary variables

       Auxiliary variable method avoids computations of the intractable
       constant in the likelihood

                                    f(y|θ) = Z_θ f̃(y|θ)

       Introduce pseudo-data z with artificial target g (z|θ, y)
       Generate θ′ ∼ K(θ, θ′) and z′ ∼ f(z|θ′)
       Accept with probability

            [ π(θ′) f̃(y|θ′) g(z′|θ′, y) K(θ′, θ) f̃(z|θ) ] / [ π(θ) f̃(y|θ) g(z|θ, y) K(θ, θ′) f̃(z′|θ′) ]  ∧  1

                                      [Møller, Pettitt, Berthelsen & Reeves, 2006]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Auxiliary variables


       Auxiliary variable method avoids computations of the intractable
       constant in the likelihood

                                    f(y|θ) = Z_θ f̃(y|θ)

       Introduce pseudo-data z with artificial target g (z|θ, y)
       Generate θ ∼ K (θ, θ ) and z ∼ f (z|θ )

       For      Gibbs random fields    , existence of a genuine sufficient statistic η(y).
                                    [Møller, Pettitt, Berthelsen & Reeves, 2006]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Auxiliary variables and ABC


       Special case of ABC when
              g(z|θ, y) = K_ε(η(z) − η(y))
              f̃(y|θ′)f̃(z|θ) / f̃(y|θ)f̃(z′|θ′) replaced by one [or not?!]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


Auxiliary variables and ABC


       Special case of ABC when
              g(z|θ, y) = K_ε(η(z) − η(y))
              f̃(y|θ′)f̃(z|θ) / f̃(y|θ)f̃(z′|θ′) replaced by one [or not?!]
       Consequences
              likelihood-free (ABC) versus constant-free (AVM)
              in ABC, K_ε(·) should be allowed to depend on θ
              for Gibbs random fields, the auxiliary approach should be
              preferred to ABC
                                                                  [Møller, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


ABC and BIC



       Idea of applying BIC during the       local regression   :
              Run regular ABC
              Select summary statistics during local regression
              Recycle the prior simulation sample (reference table) with
              those summary statistics
              Rerun the corresponding local regression (low cost)
                                                             [Pudlo & Sedki, 2012]
Likelihood-free Methods
   Approximate Bayesian computation
     Automated summary statistic selection


A Brave New World?!
Likelihood-free Methods
   ABC for model choice




ABC for model choice

       Introduction

       The Metropolis-Hastings Algorithm

       simulation-based methods in Econometrics

       Genetics of ABC

       Approximate Bayesian computation

       ABC for model choice
         Model choice
         Gibbs random fields
Likelihood-free Methods
   ABC for model choice
     Model choice


Bayesian model choice



       Several models M1 , M2 , . . . are considered simultaneously for a
       dataset y and the model index M is part of the inference.
       Use of a prior distribution π(M = m), plus a prior distribution on
       the parameter conditional on the value m of the model index,
       πm (θ m )
       Goal is to derive the posterior distribution of M, challenging
       computational target when models are complex.
Likelihood-free Methods
   ABC for model choice
     Model choice


Generic ABC for model choice



       Algorithm 7 Likelihood-free model choice sampler (ABC-MC)
         for t = 1 to T do
           repeat
              Generate m from the prior π(M = m)
              Generate θ m from the prior πm (θ m )
              Generate z from the model fm (z|θ m )
            until ρ{η(z), η(y)} ≤ ε
           Set m(t) = m and θ (t) = θ m
         end for
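
       A sketch of this likelihood-free model choice sampler for an assumed toy pair
       of models, Poisson(λ) versus geometric(p), with η(z) = Σ z_i (the same pair
       reappears in the Poisson/geometric example later); priors, tolerance and
       sample size are illustrative.

       import numpy as np

       rng = np.random.default_rng(6)
       y_obs = rng.poisson(3.0, 30)
       eta_obs, eps, T = y_obs.sum(), 5, 1000

       kept = []
       for _ in range(T):
           while True:
               m = rng.integers(2)                        # m ~ pi(M = m), uniform on {0, 1}
               if m == 0:
                   lam = rng.exponential(5.0)             # prior on lambda
                   z = rng.poisson(lam, 30)
               else:
                   p = rng.uniform()                      # prior on p
                   z = rng.geometric(p, 30) - 1           # geometric on {0, 1, ...}
               if abs(z.sum() - eta_obs) <= eps:          # rho{eta(z), eta(y)} <= eps
                   break
           kept.append(m)

       kept = np.array(kept)
       print("P(M = Poisson | y) approx:", (kept == 0).mean())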
Likelihood-free Methods
   ABC for model choice
     Model choice


ABC estimates

       Posterior probability π(M = m|y) approximated by the frequency
       of acceptances from model m
                                  (1/T) Σ_{t=1}^T I_{m^(t) = m} .

       Issues with implementation:
              should tolerances ε be the same for all models?
              should summary statistics vary across models (incl. their
              dimension)?
              should the distance measure ρ vary as well?
Likelihood-free Methods
   ABC for model choice
     Model choice


ABC estimates



       Posterior probability π(M = m|y) approximated by the frequency
       of acceptances from model m
                                  (1/T) Σ_{t=1}^T I_{m^(t) = m} .

       Extension to a weighted polychotomous logistic regression estimate
       of π(M = m|y), with non-parametric kernel weights
                                         [Cornuet et al., DIYABC, 2009]
Likelihood-free Methods
   ABC for model choice
     Model choice


The Great ABC controversy
       On-going controversy in phylogeographic genetics about the validity
       of using ABC for testing




   Against: Templeton, 2008,
   2009, 2010a, 2010b, 2010c
   argues that nested hypotheses
   cannot have higher probabilities
   than nesting hypotheses (!)
Likelihood-free Methods
   ABC for model choice
     Model choice


The Great ABC controversy



      On-going controversy in phylogeographic genetics about the validity
      of using ABC for testing

    Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that
    nested hypotheses cannot have higher probabilities than nesting
    hypotheses (!)

    Replies: Fagundes et al., 2008, Beaumont et al., 2010, Berger et
    al., 2010, Csilléry et al., 2010 point out that the criticisms are
    addressed at [Bayesian] model-based inference and have nothing to
    do with ABC...
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Gibbs random fields


       Gibbs distribution
       The rv y = (y1 , . . . , yn ) is a Gibbs random field associated with
       the graph G if

                          f(y) = (1/Z) exp{ − Σ_{c∈C} V_c(y_c) } ,

       where Z is the normalising constant, C is the set of cliques of G
       and V_c is any function, also called potential; the sufficient
       statistic U(y) = Σ_{c∈C} V_c(y_c) is the energy function
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Gibbs random fields


       Gibbs distribution
       The rv y = (y1 , . . . , yn ) is a Gibbs random field associated with
       the graph G if

                          f(y) = (1/Z) exp{ − Σ_{c∈C} V_c(y_c) } ,

       where Z is the normalising constant, C is the set of cliques of G
       and V_c is any function, also called potential; the sufficient
       statistic U(y) = Σ_{c∈C} V_c(y_c) is the energy function

        c Z is usually unavailable in closed form
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Potts model
       Potts model
       V_c(y) is of the form

                          V_c(y) = θ S(y) = θ Σ_{l∼i} δ_{y_l = y_i}

       where l∼i denotes a neighbourhood structure
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Potts model
       Potts model
       V_c(y) is of the form

                          V_c(y) = θ S(y) = θ Σ_{l∼i} δ_{y_l = y_i}

       where l∼i denotes a neighbourhood structure

       In most realistic settings, the summation

                               Z_θ = Σ_{x∈X} exp{θᵀ S(x)}

       involves too many terms to be manageable and numerical
       approximations cannot always be trusted
                               [Cucala, Marin, CPR & Titterington, 2009]
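
       The sufficient statistic S(x) of the Potts model is cheap to compute even
       though Z_θ is not; a sketch for a 2D grid with a 4-nearest-neighbour
       structure (an assumed toy configuration) follows.

       import numpy as np

       rng = np.random.default_rng(7)
       y = rng.integers(0, 3, size=(20, 20))      # a 3-colour Potts configuration

       def potts_stat(y):
           """S(y) = number of neighbouring pairs sharing the same colour."""
           horiz = (y[:, :-1] == y[:, 1:]).sum()  # right neighbours
           vert = (y[:-1, :] == y[1:, :]).sum()   # down neighbours
           return int(horiz + vert)

       print(potts_stat(y))
       # Z_theta = sum_x exp{theta * S(x)} would require 3**(20*20) terms, hence ABC.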
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Bayesian Model Choice



       Comparing a model with potential S0 taking values in Rp0 versus a
       model with potential S1 taking values in Rp1 can be done through
       the Bayes factor corresponding to the priors π0 and π1 on each
       parameter space

              B_{m0/m1}(x) = ∫ exp{θ_0ᵀ S_0(x)}/Z_{θ_0,0} π_0(dθ_0)  /  ∫ exp{θ_1ᵀ S_1(x)}/Z_{θ_1,1} π_1(dθ_1)
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Bayesian Model Choice



       Comparing a model with potential S0 taking values in Rp0 versus a
       model with potential S1 taking values in Rp1 can be done through
       the Bayes factor corresponding to the priors π0 and π1 on each
       parameter space

              B_{m0/m1}(x) = ∫ exp{θ_0ᵀ S_0(x)}/Z_{θ_0,0} π_0(dθ_0)  /  ∫ exp{θ_1ᵀ S_1(x)}/Z_{θ_1,1} π_1(dθ_1)

       Use of Jeffreys’ scale to select most appropriate model
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Neighbourhood relations



       Choice to be made between M neighbourhood relations
                            i ∼_m i′       (0 ≤ m ≤ M − 1)

       with
                            S_m(x) = Σ_{i ∼_m i′} I{x_i = x_{i′}}

       driven by the posterior probabilities of the models.
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Model index



       Formalisation via a model index M that appears as a new
       parameter with prior distribution π(M = m) and
       π(θ|M = m) = πm (θm )
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Model index



       Formalisation via a model index M that appears as a new
       parameter with prior distribution π(M = m) and
       π(θ|M = m) = πm (θm )
       Computational target:

                  P(M = m|x) ∝ ∫_{Θm} f_m(x|θ_m) π_m(θ_m) dθ_m  π(M = m) ,
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Sufficient statistics



       By definition, if S(x) sufficient statistic for the joint parameters
       (M, θ0 , . . . , θM−1 ),

                          P(M = m|x) = P(M = m|S(x)) .
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Sufficient statistics



       By definition, if S(x) sufficient statistic for the joint parameters
       (M, θ0 , . . . , θM−1 ),

                          P(M = m|x) = P(M = m|S(x)) .

       For each model m, own sufficient statistic Sm (·) and
       S(·) = (S0 (·), . . . , SM−1 (·)) also sufficient.
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Sufficient statistics in Gibbs random fields


       For Gibbs random fields,
                    x|M = m ∼ f_m(x|θ_m) = f¹_m(x|S(x)) f²_m(S(x)|θ_m)
                                          = (1/n(S(x))) f²_m(S(x)|θ_m)

       where
                           n(S(x)) = ♯{x̃ ∈ X : S(x̃) = S(x)}
        c S(x) is therefore also sufficient for the joint parameters
                                       [Specific to Gibbs random fields!]
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


ABC model choice Algorithm



       ABC-MC
              Generate m∗ from the prior π(M = m).
              Generate θ∗_{m∗} from the prior π_{m∗}(·).
              Generate x∗ from the model f_{m∗}(·|θ∗_{m∗}).
              Compute the distance ρ(S(x0), S(x∗)).
              Accept (θ∗_{m∗}, m∗) if ρ(S(x0), S(x∗)) < ε.


       Note When ε = 0 the algorithm is exact
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


ABC approximation to the Bayes factor

       Frequency ratio:
                B̂F_{m0/m1}(x0) = P̂(M = m0|x0)/P̂(M = m1|x0) × π(M = m1)/π(M = m0)
                                = ♯{m_{i∗} = m0}/♯{m_{i∗} = m1} × π(M = m1)/π(M = m0) ,
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


ABC approximation to the Bayes factor

       Frequency ratio:
                B̂F_{m0/m1}(x0) = P̂(M = m0|x0)/P̂(M = m1|x0) × π(M = m1)/π(M = m0)
                                = ♯{m_{i∗} = m0}/♯{m_{i∗} = m1} × π(M = m1)/π(M = m0) ,

       replaced with

                B̂F_{m0/m1}(x0) = (1 + ♯{m_{i∗} = m0})/(1 + ♯{m_{i∗} = m1}) × π(M = m1)/π(M = m0)

       to avoid indeterminacy (also Bayes estimate).
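
       The corrected frequency ratio is a one-liner once the accepted model indices
       are stored; a sketch (uniform model prior assumed by default) follows.

       import numpy as np

       def abc_bayes_factor(kept_m, prior_m0=0.5, prior_m1=0.5):
           """Corrected frequency-ratio estimate of BF_{m0/m1} from ABC-MC output."""
           n0 = int(np.sum(kept_m == 0))
           n1 = int(np.sum(kept_m == 1))
           return (1 + n0) / (1 + n1) * (prior_m1 / prior_m0)   # avoids division by zero

       kept_m = np.array([0, 0, 1, 0, 1, 0, 0, 1])              # toy acceptance record
       print(abc_bayes_factor(kept_m))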
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Toy example


       iid Bernoulli model versus two-state first-order Markov chain, i.e.
                 f_0(x|θ_0) = exp{ θ_0 Σ_{i=1}^n I{x_i = 1} } / {1 + exp(θ_0)}^n ,

       versus

                 f_1(x|θ_1) = (1/2) exp{ θ_1 Σ_{i=2}^n I{x_i = x_{i−1}} } / {1 + exp(θ_1)}^{n−1} ,

       with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase
       transition” boundaries).
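
       A sketch of the two models and their respective sufficient statistics,
       S_0 = Σ I{x_i = 1} and S_1 = Σ I{x_i = x_{i−1}}, as they would feed the
       distance in ABC-MC; simulation settings are illustrative.

       import numpy as np

       rng = np.random.default_rng(8)

       def simulate_m0(theta0, n):     # iid Bernoulli, success prob. exp(theta0)/(1+exp(theta0))
           p = np.exp(theta0) / (1 + np.exp(theta0))
           return rng.binomial(1, p, n)

       def simulate_m1(theta1, n):     # two-state Markov chain, staying prob. exp(theta1)/(1+exp(theta1))
           p_stay = np.exp(theta1) / (1 + np.exp(theta1))
           x = np.empty(n, dtype=int)
           x[0] = rng.integers(2)
           for t in range(1, n):
               x[t] = x[t - 1] if rng.uniform() < p_stay else 1 - x[t - 1]
           return x

       def S0(x): return int(np.sum(x == 1))            # sufficient under model 0
       def S1(x): return int(np.sum(x[1:] == x[:-1]))   # sufficient under model 1

       x = simulate_m1(2.0, 100)
       print(S0(x), S1(x))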
Likelihood-free Methods
   ABC for model choice
     Gibbs random fields


Toy example (2)




       [Figure: two scatterplots of B̂F_{m0/m1}(x0) against BF_{m0/m1}(x0), both
       on the log scale.]

       (left) Comparison of the true BF_{m0/m1}(x0) with B̂F_{m0/m1}(x0) (in
       logs) over 2,000 simulations and 4·10⁶ proposals from the prior.
       (right) Same when using tolerance ε corresponding to the 1%
       quantile on the distances.
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Back to sufficiency



              ‘Sufficient statistics for individual models are unlikely to
              be very informative for the model probability.’
                                         [Scott Sisson, Jan. 31, 2011, X.’Og]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Back to sufficiency



              ‘Sufficient statistics for individual models are unlikely to
              be very informative for the model probability.’
                                         [Scott Sisson, Jan. 31, 2011, X.’Og]

       If η1 (x) sufficient statistic for model m = 1 and parameter θ1 and
       η2 (x) sufficient statistic for model m = 2 and parameter θ2 ,
       (η1 (x), η2 (x)) is not always sufficient for (m, θm )
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Back to sufficiency



              ‘Sufficient statistics for individual models are unlikely to
              be very informative for the model probability.’
                                         [Scott Sisson, Jan. 31, 2011, X.’Og]

       If η1 (x) sufficient statistic for model m = 1 and parameter θ1 and
       η2 (x) sufficient statistic for model m = 2 and parameter θ2 ,
       (η1 (x), η2 (x)) is not always sufficient for (m, θm )
        c Potential loss of information at the testing level
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Limiting behaviour of B12 (T → ∞)

       ABC approximation
            B_12(y) = Σ_{t=1}^T I_{m_t = 1} I_{ρ{η(z_t), η(y)} ≤ ε}  /  Σ_{t=1}^T I_{m_t = 2} I_{ρ{η(z_t), η(y)} ≤ ε} ,

       where the (mt , z t )’s are simulated from the (joint) prior
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Limiting behaviour of B12 (T → ∞)

       ABC approximation
            B_12(y) = Σ_{t=1}^T I_{m_t = 1} I_{ρ{η(z_t), η(y)} ≤ ε}  /  Σ_{t=1}^T I_{m_t = 2} I_{ρ{η(z_t), η(y)} ≤ ε} ,

       where the (mt , z t )’s are simulated from the (joint) prior
       As T goes to infinity, limit

          B_12(y) = ∫ I_{ρ{η(z), η(y)} ≤ ε} π_1(θ_1) f_1(z|θ_1) dz dθ_1  /  ∫ I_{ρ{η(z), η(y)} ≤ ε} π_2(θ_2) f_2(z|θ_2) dz dθ_2

                  = ∫ I_{ρ{η, η(y)} ≤ ε} π_1(θ_1) f^η_1(η|θ_1) dη dθ_1  /  ∫ I_{ρ{η, η(y)} ≤ ε} π_2(θ_2) f^η_2(η|θ_2) dη dθ_2 ,

       where f^η_1(η|θ_1) and f^η_2(η|θ_2) are the distributions of η(z)
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Limiting behaviour of B12 ( → 0)




       When ε goes to zero,

              B^η_12(y) = ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1  /  ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2 ,
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Limiting behaviour of B12 ( → 0)




       When ε goes to zero,

              B^η_12(y) = ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1  /  ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2 ,

                  c Bayes factor based on the sole observation of η(y)
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Limiting behaviour of B12 (under sufficiency)

       If η(y) sufficient statistic for both models,

                          f_i(y|θ_i) = g_i(y) f^η_i(η(y)|θ_i)

       Thus

          B_12(y) = ∫_{Θ1} π(θ_1) g_1(y) f^η_1(η(y)|θ_1) dθ_1  /  ∫_{Θ2} π(θ_2) g_2(y) f^η_2(η(y)|θ_2) dθ_2

                  = [g_1(y)/g_2(y)] × ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2  =  [g_1(y)/g_2(y)] B^η_12(y) .

                                         [Didelot, Everitt, Johansen & Lawson, 2011]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Limiting behaviour of B12 (under sufficiency)

       If η(y) sufficient statistic for both models,

                          f_i(y|θ_i) = g_i(y) f^η_i(η(y)|θ_i)

       Thus

          B_12(y) = ∫_{Θ1} π(θ_1) g_1(y) f^η_1(η(y)|θ_1) dθ_1  /  ∫_{Θ2} π(θ_2) g_2(y) f^η_2(η(y)|θ_2) dθ_2

                  = [g_1(y)/g_2(y)] × ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2  =  [g_1(y)/g_2(y)] B^η_12(y) .

                                         [Didelot, Everitt, Johansen & Lawson, 2011]
                    c No discrepancy only when cross-model sufficiency
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Poisson/geometric example

       Sample
                                 x = (x_1, \ldots, x_n)
       from either a Poisson P(λ) or from a geometric G(p). Then

                                 S = \sum_{i=1}^{n} x_i = \eta(x)

       is a sufficient statistic for either model, but not for both simultaneously.
       Discrepancy ratio

                                 \frac{g_1(x)}{g_2(x)} = \frac{S!\, n^{-S} \big/ \prod_i x_i!}{1 \big/ \binom{n+S-1}{S}}
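
       A quick numerical check of this ratio (a sketch; the log-scale evaluation through gammaln
       and the simulated sample are implementation choices, not part of the slides):

```python
import numpy as np
from scipy.special import gammaln

def log_g1_over_g2(x):
    """Log of the discrepancy ratio g1(x)/g2(x) for the Poisson/geometric pair."""
    x = np.asarray(x)
    n, S = x.size, x.sum()
    # g1(x) = S! n^{-S} / prod_i x_i!      (Poisson residual term)
    log_g1 = gammaln(S + 1) - S * np.log(n) - gammaln(x + 1).sum()
    # g2(x) = 1 / C(n+S-1, S)              (geometric residual term)
    log_g2 = -(gammaln(n + S) - gammaln(S + 1) - gammaln(n))
    return log_g1 - log_g2

x = np.random.default_rng(0).poisson(2.0, size=20)
print(np.exp(log_g1_over_g2(x)))
```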
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Poisson/geometric discrepancy

       Range of B_{12}^\eta(x) versus B_{12}(x): the values produced have
       nothing in common.
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Formal recovery



       Creating an encompassing exponential family
       f(x|\theta_1, \theta_2, \alpha_1, \alpha_2) \propto \exp\{\theta_1^T \eta_1(x) + \theta_2^T \eta_2(x) + \alpha_1 t_1(x) + \alpha_2 t_2(x)\}

       leads to a sufficient statistic (\eta_1(x), \eta_2(x), t_1(x), t_2(x))
                             [Didelot, Everitt, Johansen & Lawson, 2011]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Formal recovery



       Creating an encompassing exponential family
       f(x|\theta_1, \theta_2, \alpha_1, \alpha_2) \propto \exp\{\theta_1^T \eta_1(x) + \theta_2^T \eta_2(x) + \alpha_1 t_1(x) + \alpha_2 t_2(x)\}

       leads to a sufficient statistic (\eta_1(x), \eta_2(x), t_1(x), t_2(x))
                             [Didelot, Everitt, Johansen & Lawson, 2011]

       In the Poisson/geometric case, if \prod_i x_i! is added to S, there is no
       discrepancy
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Formal recovery



       Creating an encompassing exponential family
       f(x|\theta_1, \theta_2, \alpha_1, \alpha_2) \propto \exp\{\theta_1^T \eta_1(x) + \theta_2^T \eta_2(x) + \alpha_1 t_1(x) + \alpha_2 t_2(x)\}

       leads to a sufficient statistic (\eta_1(x), \eta_2(x), t_1(x), t_2(x))
                             [Didelot, Everitt, Johansen & Lawson, 2011]

       Only applies in genuine sufficiency settings...
       c Inability to evaluate loss brought by summary statistics
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Meaning of the ABC-Bayes factor

              ‘This is also why focus on model discrimination typically
              (...) proceeds by (...) accepting that the Bayes Factor
              that one obtains is only derived from the summary
              statistics and may in no way correspond to that of the
              full model.’
                                        [Scott Sisson, Jan. 31, 2011, X.’Og]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Meaning of the ABC-Bayes factor

              ‘This is also why focus on model discrimination typically
              (...) proceeds by (...) accepting that the Bayes Factor
              that one obtains is only derived from the summary
              statistics and may in no way correspond to that of the
              full model.’
                                          [Scott Sisson, Jan. 31, 2011, X.’Og]

       In the Poisson/geometric case, if E[y_i] = θ0 > 0,

                                 \lim_{n\to\infty} B_{12}^\eta(y) = \frac{(\theta_0+1)^2}{\theta_0}\, e^{-\theta_0}
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


MA(q) divergence




       [Figure: frequencies of visits to models 1 and 2 at four tolerance levels]




       Evolution [against ε] of ABC Bayes factor, in terms of frequencies of
       visits to models MA(1) (left) and MA(2) (right) when ε is equal to the
       10, 1, .1, .01% quantiles on insufficient autocovariance distances. Sample
       of 50 points from an MA(2) with θ1 = 0.6, θ2 = 0.2. True Bayes factor
       equal to 17.71.
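
       A rough sketch of the kind of experiment summarised in this figure, with the first two
       empirical autocovariances as (insufficient) summaries and the tolerance set as a distance
       quantile; the priors, distance and seeds below are assumptions of the sketch rather than
       the exact settings behind the figure:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

def ma_sample(theta, n, rng):
    """Simulate n observations from an MA(q) model x_t = e_t + sum_j theta_j e_{t-j}."""
    q = len(theta)
    e = rng.standard_normal(n + q)
    x = e[q:].copy()
    for j, th in enumerate(theta, start=1):
        x += th * e[q - j:q - j + n]
    return x

def autocov(x, lags=(1, 2)):
    """Empirical autocovariances at the given lags (the insufficient summaries)."""
    xc = x - x.mean()
    return np.array([np.sum(xc[l:] * xc[:-l]) / len(x) for l in lags])

def prior_draw(m, rng):
    """Draw MA coefficients from (assumed) uniform priors."""
    if m == 1:
        return np.array([rng.uniform(-1, 1)])
    while True:  # uniform over the MA(2) invertibility triangle
        t1, t2 = rng.uniform(-2, 2), rng.uniform(-1, 1)
        if t1 + t2 > -1 and t1 - t2 < 1:
            return np.array([t1, t2])

y = ma_sample(np.array([0.6, 0.2]), n, rng)   # pseudo-observed MA(2) sample
s_y = autocov(y)

T = 50_000
models = np.empty(T, dtype=int)
dists = np.empty(T)
for t in range(T):
    m = int(rng.integers(1, 3))               # uniform prior on the model index
    z = ma_sample(prior_draw(m, rng), n, rng)
    models[t], dists[t] = m, np.sum((autocov(z) - s_y) ** 2)

for level in (0.10, 0.01):                    # tolerance = distance quantile
    keep = models[dists <= np.quantile(dists, level)]
    print(f"eps at {level:.0%} quantile: freq(MA(1)) = {np.mean(keep == 1):.3f}, "
          f"freq(MA(2)) = {np.mean(keep == 2):.3f}")
```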
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


MA(q) divergence




       [Figure: frequencies of visits to models 1 and 2 at four tolerance levels]




       Evolution [against ε] of ABC Bayes factor, in terms of frequencies of
       visits to models MA(1) (left) and MA(2) (right) when ε is equal to the
       10, 1, .1, .01% quantiles on insufficient autocovariance distances. Sample
       of 50 points from an MA(1) model with θ1 = 0.6. True Bayes factor B21
       equal to .004.
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Further comments


               ‘There should be the possibility that for the same model,
               but different (non-minimal) [summary] statistics (so
               different η’s: η1 and η1∗) the ratio of evidences may no
               longer be equal to one.’
                                      [Michael Stumpf, Jan. 28, 2011, ’Og]


       Using different summary statistics [on different models] may
       indicate the loss of information brought by each set but agreement
       does not lead to trustworthy approximations.
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


A population genetics evaluation


       Population genetics example with
              3 populations
               2 scenarios
              15 individuals
              5 loci
              single mutation parameter
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


A population genetics evaluation


       Population genetics example with
              3 populations
               2 scenarios
              15 individuals
              5 loci
              single mutation parameter
              24 summary statistics
               2 million ABC proposals
              importance [tree] sampling alternative
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Stability of importance sampling


       [Figure: stability of the importance sampling estimates (five panels on the 0 to 1 scale)]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Comparison with ABC
       Use of 24 summary statistics and DIY-ABC logistic correction

       [Figure: ABC estimates of P(M=1|y) plotted against importance sampling estimates of P(M=1|y)]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Comparison with ABC
       Use of 15 summary statistics and DIY-ABC logistic correction

       [Figure: ABC estimates of P(M=1|y) plotted against importance sampling estimates of P(M=1|y)]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


Comparison with ABC
       Use of 24 summary statistics and DIY-ABC logistic correction

       [Figure: ABC estimates of P(M=1|y) plotted against importance sampling estimates of P(M=1|y)]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


The only safe cases???



       Besides specific models like Gibbs random fields,
       using distances over the data itself escapes the discrepancy...
                                [Toni & Stumpf, 2010; Sousa & al., 2009]
Likelihood-free Methods
   ABC for model choice
     Generic ABC model choice


The only safe cases???



       Besides specific models like Gibbs random fields,
       using distances over the data itself escapes the discrepancy...
                                [Toni & Stumpf, 2010; Sousa & al., 2009]

       ...and so does the use of more informal model fitting measures
                                                  [Ratmann & al., 2009]
Likelihood-free Methods
   ABC model choice consistency




ABC model choice consistency

       Introduction

       The Metropolis-Hastings Algorithm

       simulation-based methods in Econometrics

       Genetics of ABC

       Approximate Bayesian computation

       ABC for model choice

       ABC model choice consistency
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


The starting point



       Central question to the validation of ABC for model choice:

       When is a Bayes factor based on an insufficient statistic T(y)
                               consistent?
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


The starting point



       Central question to the validation of ABC for model choice:

       When is a Bayes factor based on an insufficient statistic T(y)
                               consistent?

       Note: c drawn on T(y) through B_{12}^{T}(y) necessarily differs from c
       drawn on y through B_{12}(y)
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


A benchmark, if toy, example




       Comparison suggested by referee of PNAS paper [thanks]:
                                 [X, Cornuet, Marin, & Pillai, Aug. 2011]
       Model M1 : y ∼ N(θ1 , 1) opposed to model M2 : y ∼ L(θ2 , 1/√2), the
       Laplace distribution with mean θ2 and scale parameter 1/√2
       (variance one).
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


A benchmark, if toy, example


       Comparison suggested by referee of PNAS paper [thanks]:
                                 [X, Cornuet, Marin, & Pillai, Aug. 2011]
       Model M1 : y ∼ N(θ1 , 1) opposed to model M2 : y ∼ L(θ2 , 1/√2), the
       Laplace distribution with mean θ2 and scale parameter 1/√2
       (variance one).
       Four possible statistics (a simulation sketch follows below):
         1. sample mean ȳ (sufficient for M1 but not for M2 );
         2. sample median med(y) (insufficient);
         3. sample variance var(y) (ancillary);
         4. median absolute deviation mad(y) = med(|y − med(y)|);
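
       A minimal sketch of running ABC model choice with each of these four statistics in turn;
       the Laplace pseudo-data, the N(0, 4) prior on the location and the 1% tolerance quantile
       are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
y = rng.laplace(0.0, 1 / np.sqrt(2), size=n)         # pseudo-data from the Laplace model M2

stats = {
    "mean":   np.mean,
    "median": np.median,
    "var":    np.var,
    "mad":    lambda x: np.median(np.abs(x - np.median(x))),
}

T = 20_000
draws = []
for _ in range(T):
    m = int(rng.integers(1, 3))                       # model index from a uniform prior
    theta = rng.normal(0.0, 2.0)                      # assumed N(0, 4) prior on the location
    z = rng.normal(theta, 1.0, n) if m == 1 else rng.laplace(theta, 1 / np.sqrt(2), n)
    draws.append((m, z))

for name, stat in stats.items():
    s_y = stat(y)
    d = np.array([abs(stat(z) - s_y) for _, z in draws])
    keep = np.array([m for m, _ in draws])[d <= np.quantile(d, 0.01)]
    print(f"{name:>6}: ABC estimate of P(M1 | y) = {np.mean(keep == 1):.3f}")
```

       The point is that the resulting model probability estimates can differ markedly from one
       statistic to the next, as the following figures illustrate.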
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


A benchmark, if toy, example
       Comparison suggested by referee of PNAS paper [thanks]:
                                 [X, Cornuet, Marin, & Pillai, Aug. 2011]
       Model M1 : y ∼ N(θ1 , 1) opposed to model M2 : y ∼ L(θ2 , 1/√2), the
       Laplace distribution with mean θ2 and scale parameter 1/√2
       (variance one).
       [Figure: density of the posterior probabilities]
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


A benchmark, if toy, example
       Comparison suggested by referee of PNAS paper [thanks]:
                                 [X, Cornuet, Marin, & Pillai, Aug. 2011]
       Model M1 : y ∼ N(θ1 , 1) opposed to model M2 : y ∼ L(θ2 , 1/√2), the
       Laplace distribution with mean θ2 and scale parameter 1/√2
       (variance one).
       [Figure: boxplots of posterior probabilities under the Gauss and Laplace models, n = 100]
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


A benchmark, if toy, example
       Comparison suggested by referee of PNAS paper [thanks]:
                                 [X, Cornuet, Marin, & Pillai, Aug. 2011]
       Model M1 : y ∼ N(θ1 , 1) opposed to model M2 : y ∼ L(θ2 , 1/√2), the
       Laplace distribution with mean θ2 and scale parameter 1/√2
       (variance one).
       [Figure: densities of the posterior probabilities (left panel) and probabilities (right panel)]
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


A benchmark, if toy, example
       Comparison suggested by referee of PNAS paper [thanks]:
                                 [X, Cornuet, Marin, & Pillai, Aug. 2011]
       Model M1 : y ∼ N(θ1 , 1) opposed to model M2 : y ∼ L(θ2 , 1/√2), the
       Laplace distribution with mean θ2 and scale parameter 1/√2
       (variance one).
       [Figure: boxplots under the Gauss and Laplace models, n = 100 (two panels)]
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


Framework

       Starting from the observed sample

                                          y = (y_1, \ldots, y_n),

       not necessarily iid, with true distribution

                                                y ∼ P_n

       Summary statistics

                          T(y) = Tn = (T1 (y), T2 (y), · · · , Td (y)) ∈ Rd

       with true distribution Tn ∼ Gn .
Likelihood-free Methods
   ABC model choice consistency
     Formalised framework


Framework



        c Comparison of
           – under M1 , y ∼ F1,n (·|θ1 ) where θ1 ∈ Θ1 ⊂ Rp1
           – under M2 , y ∼ F2,n (·|θ2 ) where θ2 ∈ Θ2 ⊂ Rp2
       turned into
           – under M1 , T(y) ∼ G1,n (·|θ1 ), and θ1 |T(y) ∼ π1 (·|Tn )
           – under M2 , T(y) ∼ G2,n (·|θ2 ), and θ2 |T(y) ∼ π2 (·|Tn )
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Assumptions


       A collection of asymptotic “standard” assumptions:
       [A1] There exist a sequence {v_n} converging to +∞,
       an a.c. distribution Q with continuous bounded density q(·),
       a symmetric, d × d positive definite matrix V_0
       and a vector µ_0 ∈ R^d such that

              v_n V_0^{-1/2} (T_n - \mu_0) \xrightarrow{\ n\to\infty\ } Q, \quad \text{under } G_n

       and for all M > 0,

              \sup_{v_n|t-\mu_0| \le M} \left| |V_0|^{1/2}\, v_n^{-d}\, g_n(t) - q\!\left(v_n V_0^{-1/2}\{t-\mu_0\}\right) \right| = o(1)
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Assumptions




       A collection of asymptotic “standard” assumptions:
       [A2] For i = 1, 2, there exist d × d symmetric positive definite matrices
       V_i(\theta_i) and vectors \mu_i(\theta_i) ∈ R^d such that

              v_n V_i(\theta_i)^{-1/2} (T_n - \mu_i(\theta_i)) \xrightarrow{\ n\to\infty\ } Q, \quad \text{under } G_{i,n}(\cdot|\theta_i)\,.
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Assumptions


       A collection of asymptotic “standard” assumptions:
       [A3] For i = 1, 2, there exist sets F_{n,i} ⊂ Θ_i and constants ε_i, τ_i, α_i > 0
       such that for all τ > 0,

              \sup_{\theta_i \in F_{n,i}} G_{i,n}\!\left[\, |T_n - \mu_i(\theta_i)| > \tau\, |\mu_i(\theta_i)-\mu_0| \wedge \epsilon_i \,\middle|\, \theta_i \right]
                     \lesssim v_n^{-\alpha_i} \sup_{\theta_i \in F_{n,i}} \left(|\mu_i(\theta_i)-\mu_0| \wedge \epsilon_i\right)^{-\alpha_i}

       with
              \pi_i(F_{n,i}^c) = o(v_n^{-\tau_i})\,.
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Assumptions



       A collection of asymptotic “standard” assumptions:
       [A4] For u > 0,

              S_{n,i}(u) = \left\{\theta_i \in F_{n,i};\ |\mu_i(\theta_i) - \mu_0| \le u\, v_n^{-1}\right\}

       and if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, there exist constants di < τi ∧ αi − 1
       such that
              \pi_i(S_{n,i}(u)) \sim u^{d_i} v_n^{-d_i}, \quad \forall\, u \lesssim v_n
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Assumptions

       A collection of asymptotic “standard” assumptions:
       [A5] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, there exists U > 0 such that for
       any M > 0,

              \sup_{v_n|t-\mu_0|\le M}\ \sup_{\theta_i \in S_{n,i}(U)} \left| |V_i(\theta_i)|^{1/2}\, v_n^{-d}\, g_i(t|\theta_i)
                     - q\!\left(v_n V_i(\theta_i)^{-1/2}(t - \mu_i(\theta_i))\right) \right| = o(1)

       and

              \lim_{M\to\infty}\ \limsup_n\ \frac{\pi_i\!\left(S_{n,i}(U) \cap \left\{\|V_i(\theta_i)^{-1}\| + \|V_i(\theta_i)\| > M\right\}\right)}{\pi_i(S_{n,i}(U))} = 0\,.
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Assumptions


       A collection of asymptotic “standard” assumptions:
       [A1]–[A2] are standard central limit theorems ([A1] redundant
       when one model is “true”)
       [A3] controls the large deviations of the estimator Tn from the
       estimand µ(θ)
       [A4] is the standard prior mass condition found in Bayesian
       asymptotics (di effective dimension of the parameter)
       [A5] controls convergence more tightly, especially when µi is not
       one-to-one
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Effective dimension




       [A4] Understanding d1 , d2 : defined only when
       µ0 ∈ {µi (θi ), θi ∈ Θi },

                           \pi_i\!\left(\theta_i : |\mu_i(\theta_i) - \mu_0| < n^{-1/2}\right) = O(n^{-d_i/2}),

       where d_i is the effective dimension of the model Θi around µ0
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Asymptotic marginals

       Asymptotically, under [A1]–[A5],

                                 m_i(t) = \int_{\Theta_i} g_i(t|\theta_i)\, \pi_i(\theta_i)\, d\theta_i

       is such that
       (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,

                                 C_l\, v_n^{d-d_i} \le m_i(T_n) \le C_u\, v_n^{d-d_i}

       and
       (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0,

                                 m_i(T_n) = o_{P_n}\!\left[v_n^{d-\tau_i} + v_n^{d-\alpha_i}\right].
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Within-model consistency



       Under the same assumptions, if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, the
       posterior distribution of µi(θi) given Tn is consistent at rate 1/vn,
       provided αi ∧ τi > di.
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Within-model consistency



       Under the same assumptions, if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, the
       posterior distribution of µi(θi) given Tn is consistent at rate 1/vn,
       provided αi ∧ τi > di.
       Note: di can truly be seen as an effective dimension of the model
       under the posterior πi(·|Tn), since if µ0 ∈ {µi(θi); θi ∈ Θi},

                                 m_i(T_n) \sim v_n^{d-d_i}
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Between-model consistency




       A consequence of the above is that the asymptotic behaviour of the Bayes
       factor is driven by the asymptotic mean value of Tn under both
       models. And only by this mean value!
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Between-model consistency

       A consequence of the above is that the asymptotic behaviour of the Bayes
       factor is driven by the asymptotic mean value of Tn under both
       models. And only by this mean value!
       Indeed, if

           inf{|µ0 − µ2 (θ2 )|; θ2 ∈ Θ2 } = inf{|µ0 − µ1 (θ1 )|; θ1 ∈ Θ1 } = 0

       then

              C_l\, v_n^{-(d_1-d_2)} \le m_1(T_n) \big/ m_2(T_n) \le C_u\, v_n^{-(d_1-d_2)}\,,

       where Cl , Cu = OPn (1), irrespective of the true model.
        c Only depends on the difference d1 − d2
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Between-model consistency


       A consequence of the above is that the asymptotic behaviour of the Bayes
       factor is driven by the asymptotic mean value of Tn under both
       models. And only by this mean value!
       Else, if

           inf{|µ0 − µ2 (θ2 )|; θ2 ∈ Θ2 } > inf{|µ0 − µ1 (θ1 )|; θ1 ∈ Θ1 } = 0

       then

              m_1(T_n) \big/ m_2(T_n) \ge C_u \min\!\left(v_n^{-(d_1-\alpha_2)},\ v_n^{-(d_1-\tau_2)}\right)
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Consistency theorem



       If

            inf{|µ0 − µ2 (θ2 )|; θ2 ∈ Θ2 } = inf{|µ0 − µ1 (θ1 )|; θ1 ∈ Θ1 } = 0,

       the Bayes factor satisfies

                                 B_{12}^{T} = O\!\left(v_n^{-(d_1-d_2)}\right)

       irrespective of the true model. It is consistent iff Pn is within the
       model with the smallest dimension
Likelihood-free Methods
   ABC model choice consistency
     Consistency results


Consistency theorem



       If Pn belongs to one of the two models and if µ0 cannot be
       attained by the other one, i.e.

              0 = \min\left(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\right)
                < \max\left(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\right),

       then the Bayes factor B_{12}^{T} is consistent
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Consequences on summary statistics



       The Bayes factor is driven by the means µi (θi ) and the relative position of
       µ0 wrt both sets {µi (θi ); θi ∈ Θi }, i = 1, 2.
       For ABC, this implies the most likely statistics Tn are ancillary
       statistics with different mean values under both models.
       Else, if Tn asymptotically depends on some of the parameters of
       the models, it is quite likely that there exists θi ∈ Θi such that
       µi (θi ) = µ0 even though model M1 is misspecified
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [1]


       If
              T_n = n^{-1} \sum_{i=1}^{n} X_i^4, \qquad \mu_1(\theta) = 3 + \theta^4 + 6\theta^2, \qquad \mu_2(\theta) = 6 + \cdots

        and the true distribution is Laplace with mean θ0 = 1, under the
       Gaussian model the value θ∗ = 2√3 − 3 leads to µ0 = µ(θ∗)
                                                   [here d1 = d2 = d = 1]
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [1]


       If
              T_n = n^{-1} \sum_{i=1}^{n} X_i^4, \qquad \mu_1(\theta) = 3 + \theta^4 + 6\theta^2, \qquad \mu_2(\theta) = 6 + \cdots

        and the true distribution is Laplace with mean θ0 = 1, under the
       Gaussian model the value θ∗ = 2√3 − 3 leads to µ0 = µ(θ∗)
                                                   [here d1 = d2 = d = 1]
       c a Bayes factor associated with such a statistic is inconsistent
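
       A quick numerical illustration of this inconsistency (a sketch: rather than relying on a
       closed-form θ∗, whose exact value depends on conventions, the code solves µ1(θ∗) = Tn
       numerically, showing that the Gaussian model can always match the Laplace value of the
       fourth-moment statistic):

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 1.0
y = rng.laplace(theta0, 1 / np.sqrt(2), size=10**6)   # "true" model: Laplace with unit variance
t_n = np.mean(y**4)                                    # summary T_n = empirical fourth moment

# Under N(theta, 1): mu1(theta) = 3 + 6*theta^2 + theta^4.
# Solving mu1(theta*) = T_n is a quadratic in theta^2 and succeeds whenever T_n >= 3,
# so the Gaussian model reproduces the asymptotic mean of T_n even though the data is Laplace.
theta_star = np.sqrt(-3 + np.sqrt(6 + t_n))
mu1_star = 3 + 6 * theta_star**2 + theta_star**4
print(f"T_n = {t_n:.3f},  theta* = {theta_star:.3f},  mu1(theta*) = {mu1_star:.3f}")
```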
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [1]
       If
              T_n = n^{-1} \sum_{i=1}^{n} X_i^4, \qquad \mu_1(\theta) = 3 + \theta^4 + 6\theta^2, \qquad \mu_2(\theta) = 6 + \cdots

       [Figure: fourth moment, density of the probabilities]
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [1]
       If
              T_n = n^{-1} \sum_{i=1}^{n} X_i^4, \qquad \mu_1(\theta) = 3 + \theta^4 + 6\theta^2, \qquad \mu_2(\theta) = 6 + \cdots

       [Figure: fourth moment, boxplots of posterior probabilities under M1 and M2, n = 1000]
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [1]

       If
              T_n = n^{-1} \sum_{i=1}^{n} X_i^4, \qquad \mu_1(\theta) = 3 + \theta^4 + 6\theta^2, \qquad \mu_2(\theta) = 6 + \cdots

       Caption: Comparison of the distributions of the posterior
       probabilities that the data is from a normal model (as opposed to a
       Laplace model) with unknown mean when the data is made of
       n = 1000 observations either from a normal (M1) or Laplace (M2)
       distribution with mean one and when the summary statistic in the
       ABC algorithm is restricted to the empirical fourth moment. The
       ABC algorithm uses proposals from the prior N (0, 4) and selects
       the tolerance as the 1% distance quantile.
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [0]

       When
                                T(y) = ( ȳ_n^(4) , ȳ_n^(6) )

       and the true distribution is Laplace with mean θ0 = 0, then
        µ0 = 6, µ1(θ1∗) = 6 with θ1∗ = √(2√3 − 3)
                                                    [d1 = 1 and d2 = 1/2]
       thus
                          B12 ∼ n^(−1/4) → 0 : consistent
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [0]

       When
                                T(y) = ( ȳ_n^(4) , ȳ_n^(6) )

       and the true distribution is Laplace with mean θ0 = 0, then
        µ0 = 6, µ1(θ1∗) = 6 with θ1∗ = √(2√3 − 3)
                                                    [d1 = 1 and d2 = 1/2]
       thus
                          B12 ∼ n^(−1/4) → 0 : consistent
       Under the Gaussian model µ0 = 3 and µ2(θ2) ≥ 6 > 3 ∀θ2

                                  B12 → +∞ :     consistent
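
       As a check of the pseudo-true value quoted above (a computation of mine,
       using the µ1 given earlier): solving µ1(θ1∗) = 3 + θ1∗^4 + 6θ1∗^2 = 6
       gives θ1∗^2 = 2√3 − 3, hence θ1∗ = √(2√3 − 3) ≈ 0.68.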
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Toy example: Laplace versus Gauss [0]
       When
                                T(y) = ( ȳ_n^(4) , ȳ_n^(6) )



       [Figure: density of the posterior probabilities — Fourth and sixth moments]
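
       The same sketch given earlier can be reused with the consistent statistic
       of this slide by swapping in the fourth and sixth empirical moments (the
       helper name stat46 is mine, not from the slides):

       stat46 <- function(z) c(mean(z^4), mean(z^6))   # fourth and sixth empirical moments
       # e.g. abc_postprob(rnorm(1000, 0, 1), stat = stat46)   # data from M1 with mean zero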
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Embedded models




       When M1 is a submodel of M2 and the true distribution belongs to
       the smaller model M1 , the Bayes factor is of order

                                  vn^(−(d1−d2))
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Embedded models



       When M1 is a submodel of M2 and the true distribution belongs to
       the smaller model M1 , the Bayes factor is of order

                                  vn^(−(d1−d2))

       If the summary statistic is only informative about a parameter that is
       the same under both models, i.e. if d1 = d2 , then
        c the Bayes factor is not consistent
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics


Embedded models



       When M1 is a submodel of M2 and the true distribution belongs to
       the smaller model M1 , the Bayes factor is of order

                                  vn^(−(d1−d2))

       Else, d1 < d2 and the Bayes factor is consistent under M1 . If the true
       distribution is not in M1 , then
        c Bayes factor is consistent only if µ1 = µ2 = µ0
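
       For instance, with the usual vn = √n rate (an assumption, not stated on
       this slide) and (d1 , d2 ) = (1, 1/2), the order vn^(−(d1−d2)) works out
       to n^(−1/4), the same scaling as in the consistent toy example [0] above.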

ABC in Roma

  • 1.
    Likelihood-free Methods Likelihood-free Methods Christian P. Robert Universit´ Paris-Dauphine & CREST e http://www.ceremade.dauphine.fr/~xian February 16, 2012
  • 2.
    Likelihood-free Methods Outline Introduction The Metropolis-HastingsAlgorithm simulation-based methods in Econometrics Genetics of ABC Approximate Bayesian computation ABC for model choice ABC model choice consistency
  • 3.
    Likelihood-free Methods Introduction Approximate Bayesian computation Introduction Monte Carlo basics Population Monte Carlo AMIS The Metropolis-Hastings Algorithm simulation-based methods in Econometrics Genetics of ABC Approximate Bayesian computation
  • 4.
    Likelihood-free Methods Introduction Monte Carlo basics General purpose Given a density π known up to a normalizing constant, and an integrable function h, compute h(x)˜ (x)µ(dx) π Π(h) = h(x)π(x)µ(dx) = π (x)µ(dx) ˜ when h(x)˜ (x)µ(dx) is intractable. π
  • 5.
    Likelihood-free Methods Introduction Monte Carlo basics Monte Carlo basics Generate an iid sample x1 , . . . , xN from π and estimate Π(h) by N ΠMC (h) = N −1 ˆ N h(xi ). i=1 ˆ as LLN: ΠMC (h) −→ Π(h) N If Π(h2 ) = h2 (x)π(x)µ(dx) < ∞, √ L CLT: ˆ N ΠMC (h) − Π(h) N N 0, Π [h − Π(h)]2 .
  • 6.
    Likelihood-free Methods Introduction Monte Carlo basics Monte Carlo basics Generate an iid sample x1 , . . . , xN from π and estimate Π(h) by N ΠMC (h) = N −1 ˆ N h(xi ). i=1 ˆ as LLN: ΠMC (h) −→ Π(h) N If Π(h2 ) = h2 (x)π(x)µ(dx) < ∞, √ L CLT: ˆ N ΠMC (h) − Π(h) N N 0, Π [h − Π(h)]2 . Caveat Often impossible or inefficient to simulate directly from Π
  • 7.
    Likelihood-free Methods Introduction Monte Carlo basics Importance Sampling For Q proposal distribution such that Q(dx) = q(x)µ(dx), alternative representation Π(h) = h(x){π/q}(x)q(x)µ(dx).
  • 8.
    Likelihood-free Methods Introduction Monte Carlo basics Importance Sampling For Q proposal distribution such that Q(dx) = q(x)µ(dx), alternative representation Π(h) = h(x){π/q}(x)q(x)µ(dx). Principle Generate an iid sample x1 , . . . , xN ∼ Q and estimate Π(h) by N ΠIS (h) = N −1 ˆ Q,N h(xi ){π/q}(xi ). i=1
  • 9.
    Likelihood-free Methods Introduction Monte Carlo basics Importance Sampling (convergence) Then ˆ as LLN: ΠIS (h) −→ Π(h) Q,N and if Q((hπ/q)2 ) < ∞, √ L CLT: ˆ N(ΠIS (h) − Π(h)) Q,N N 0, Q{(hπ/q − Π(h))2 } .
  • 10.
    Likelihood-free Methods Introduction Monte Carlo basics Importance Sampling (convergence) Then ˆ as LLN: ΠIS (h) −→ Π(h) Q,N and if Q((hπ/q)2 ) < ∞, √ L CLT: ˆ N(ΠIS (h) − Π(h)) Q,N N 0, Q{(hπ/q − Π(h))2 } . Caveat ˆ If normalizing constant unknown, impossible to use ΠIS Q,N Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).
  • 11.
    Likelihood-free Methods Introduction Monte Carlo basics Self-Normalised Importance Sampling Self normalized version N −1 N ˆ ΠSNIS (h) = {π/q}(xi ) h(xi ){π/q}(xi ). Q,N i=1 i=1
  • 12.
    Likelihood-free Methods Introduction Monte Carlo basics Self-Normalised Importance Sampling Self normalized version N −1 N ˆ ΠSNIS (h) = {π/q}(xi ) h(xi ){π/q}(xi ). Q,N i=1 i=1 ˆ as LLN : ΠSNIS (h) −→ Π(h) Q,N and if Π((1 + h2 )(π/q)) < ∞, √ L CLT : ˆ N(ΠSNIS (h) − Π(h)) Q,N N 0, π {(π/q)(h − Π(h)}2 ) .
  • 13.
    Likelihood-free Methods Introduction Monte Carlo basics Self-Normalised Importance Sampling Self normalized version N −1 N ˆ ΠSNIS (h) = {π/q}(xi ) h(xi ){π/q}(xi ). Q,N i=1 i=1 ˆ as LLN : ΠSNIS (h) −→ Π(h) Q,N and if Π((1 + h2 )(π/q)) < ∞, √ L CLT : ˆ N(ΠSNIS (h) − Π(h)) Q,N N 0, π {(π/q)(h − Π(h)}2 ) . c The quality of the SNIS approximation depends on the choice of Q
  • 14.
    Likelihood-free Methods Introduction Monte Carlo basics Iterated importance sampling Introduction of an algorithmic temporal dimension : (t) (t−1) xi ∼ qt (x|xi ) i = 1, . . . , n, t = 1, . . . and n ˆ 1 (t) (t) It = i h(xi ) n i=1 is still unbiased for (t) (t) πt (xi ) i = (t) (t−1) , i = 1, . . . , n qt (xi |xi )
  • 15.
    Likelihood-free Methods Introduction Population Monte Carlo PMC: Population Monte Carlo Algorithm At time t = 0 iid Generate (xi,0 )1≤i≤N ∼ Q0 Set ωi,0 = {π/q0 }(xi,0 ) iid Generate (Ji,0 )1≤i≤N ∼ M(1, (¯ i,0 )1≤i≤N ) ω Set xi,0 = xJi ,0 ˜
  • 16.
    Likelihood-free Methods Introduction Population Monte Carlo PMC: Population Monte Carlo Algorithm At time t = 0 iid Generate (xi,0 )1≤i≤N ∼ Q0 Set ωi,0 = {π/q0 }(xi,0 ) iid Generate (Ji,0 )1≤i≤N ∼ M(1, (¯ i,0 )1≤i≤N ) ω Set xi,0 = xJi ,0 ˜ At time t (t = 1, . . . , T ), ind Generate xi,t ∼ Qi,t (˜i,t−1 , ·) x Set ωi,t = {π(xi,t )/qi,t (˜i,t−1 , xi,t )} x iid Generate (Ji,t )1≤i≤N ∼ M(1, (¯ i,t )1≤i≤N ) ω Set xi,t = xJi,t ,t . ˜ [Capp´, Douc, Guillin, Marin, & CPR, 2009, Stat.& Comput.] e
  • 17.
    Likelihood-free Methods Introduction Population Monte Carlo Notes on PMC After T iterations of PMC, PMC estimator of Π(h) given by T N ¯ 1 ΠPMC (h) = N,T ¯ ωi,t h(xi,t ). T t=1 i=1
  • 18.
    Likelihood-free Methods Introduction Population Monte Carlo Notes on PMC After T iterations of PMC, PMC estimator of Π(h) given by T N ¯ 1 ΠPMC (h) = N,T ¯ ωi,t h(xi,t ). T t=1 i=1 1. ωi,t means normalising over whole sequence of simulations ¯ 2. Qi,t ’s chosen arbitrarily under support constraint 3. Qi,t ’s may depend on whole sequence of simulations 4. natural solution for ABC
  • 19.
    Likelihood-free Methods Introduction Population Monte Carlo Improving quality The efficiency of the SNIS approximation depends on the choice of Q, ranging from optimal q(x) ∝ |h(x) − Π(h)|π(x) to useless ˆ var ΠSNIS (h) = +∞ Q,N
  • 20.
    Likelihood-free Methods Introduction Population Monte Carlo Improving quality The efficiency of the SNIS approximation depends on the choice of Q, ranging from optimal q(x) ∝ |h(x) − Π(h)|π(x) to useless ˆ var ΠSNIS (h) = +∞ Q,N Example (PMC=adaptive importance sampling) Population Monte Carlo is producing a sequence of proposals Qt aiming at improving efficiency ˆ ˆ Kull(π, qt ) ≤ Kull(π, qt−1 ) or var ΠSNIS (h) ≤ var ΠSNIS ,∞ (h) Qt ,∞ Qt−1 [Capp´, Douc, Guillin, Marin, Robert, 04, 07a, 07b, 08] e
  • 21.
    Likelihood-free Methods Introduction AMIS Multiple Importance Sampling Reycling: given several proposals Q1 , . . . , QT , for 1 ≤ t ≤ T generate an iid sample t t x1 , . . . , xN ∼ Qt and estimate Π(h) by T N ΠMIS (h) = T −1 ˆ Q,N N −1 h(xit )ωit t=1 i=1 where π(xit ) ωit = correct... qt (xit )
  • 22.
    Likelihood-free Methods Introduction AMIS Multiple Importance Sampling Reycling: given several proposals Q1 , . . . , QT , for 1 ≤ t ≤ T generate an iid sample t t x1 , . . . , xN ∼ Qt and estimate Π(h) by T N ΠMIS (h) = T −1 ˆ Q,N N −1 h(xit )ωit t=1 i=1 where π(xit ) ωit = T still correct! T −1 =1 q (xit )
  • 23.
    Likelihood-free Methods Introduction AMIS Mixture representation Deterministic mixture correction of the weights proposed by Owen and Zhou (JASA, 2000) The corresponding estimator is still unbiased [if not self-normalised] All particles are on the same weighting scale rather than their own Large variance proposals Qt do not take over Variance reduction thanks to weight stabilization & recycling [K.o.] removes the randomness in the component choice [=Rao-Blackwellisation]
  • 24.
    Likelihood-free Methods Introduction AMIS Global adaptation Global Adaptation At iteration t = 1, · · · , T , ˆ 1. For 1 ≤ i ≤ N1 , generate xit ∼ T3 (ˆt−1 , Σt−1 ) µ 2. Calculate the mixture importance weight of particle xit ωit = π xit δit where t−1 δit = ˆ ˆ qT (3) xit ; µl , Σl l=0
  • 25.
    Likelihood-free Methods Introduction AMIS Backward reweighting 3. If t ≥ 2, actualize the weights of all past particles, xil 1≤l ≤t −1 ωil = π xit δil where ˆ δil = δil + qT (3) xil ; µt−1 , Σt−1 ˆ ˆ 4. Compute IS estimates of target mean and variance µt and Σt , ˆ where t N1 t N1 µt ˆj = ωil (xj )li ωil . . . l=1 i=1 l=1 i=1
  • 26.
    Likelihood-free Methods Introduction AMIS A toy example 10 0 −10 −20 −30 −40 −20 0 20 40 Banana shape benchmark: marginal distribution of (x1 , x2 ) for the parameters 2 σ1 = 100 and b = 0.03. Contours represent 60% (red), 90% (black) and 99.9% (blue) confidence regions in the marginal space.
  • 27.
    Likelihood-free Methods Introduction AMIS A toy example p=5 p=10 p=20 q 1e+05 20000 40000 60000 80000 60000 8e+04 6e+04 20000 q 4e+04 q q q q 0 Banana shape example: boxplots of 10 replicate ESSs for the AMIS scheme (left) and the NOT-AMIS scheme (right) for p = 5, 10, 20.
  • 28.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The Metropolis-Hastings Algorithm Introduction The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains The Metropolis–Hastings algorithm The random walk Metropolis-Hastings algorithm simulation-based methods in Econometrics Genetics of ABC Approximate Bayesian computation
  • 29.
    Likelihood-free Methods The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains Running Monte Carlo via Markov Chains Epiphany! It is not necessary to use a sample from the distribution f to approximate the integral I= h(x)f (x)dx ,
  • 30.
    Likelihood-free Methods The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains Running Monte Carlo via Markov Chains Epiphany! It is not necessary to use a sample from the distribution f to approximate the integral I= h(x)f (x)dx , Principle: Obtain X1 , . . . , Xn ∼ f (approx) without directly simulating from f , using an ergodic Markov chain with stationary distribution f [Metropolis et al., 1953]
  • 31.
    Likelihood-free Methods The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains Running Monte Carlo via Markov Chains (2) Idea For an arbitrary starting value x (0) , an ergodic chain (X (t) ) is generated using a transition kernel with stationary distribution f
  • 32.
    Likelihood-free Methods The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains Running Monte Carlo via Markov Chains (2) Idea For an arbitrary starting value x (0) , an ergodic chain (X (t) ) is generated using a transition kernel with stationary distribution f Insures the convergence in distribution of (X (t) ) to a random variable from f . For a “large enough” T0 , X (T0 ) can be considered as distributed from f Produce a dependent sample X (T0 ) , X (T0 +1) , . . ., which is generated from f , sufficient for most approximation purposes.
  • 33.
    Likelihood-free Methods The Metropolis-Hastings Algorithm Monte Carlo Methods based on Markov Chains Running Monte Carlo via Markov Chains (2) Idea For an arbitrary starting value x (0) , an ergodic chain (X (t) ) is generated using a transition kernel with stationary distribution f Insures the convergence in distribution of (X (t) ) to a random variable from f . For a “large enough” T0 , X (T0 ) can be considered as distributed from f Produce a dependent sample X (T0 ) , X (T0 +1) , . . ., which is generated from f , sufficient for most approximation purposes. Problem: How can one build a Markov chain with a given stationary distribution?
  • 34.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The Metropolis–Hastings algorithm The Metropolis–Hastings algorithm Basics The algorithm uses the objective (target) density f and a conditional density q(y |x) called the instrumental (or proposal) distribution
  • 35.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The Metropolis–Hastings algorithm The MH algorithm Algorithm (Metropolis–Hastings) Given x (t) , 1. Generate Yt ∼ q(y |x (t) ). 2. Take Yt with prob. ρ(x (t) , Yt ), X (t+1) = x (t) with prob. 1 − ρ(x (t) , Yt ), where f (y ) q(x|y ) ρ(x, y ) = min ,1 . f (x) q(y |x)
  • 36.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The Metropolis–Hastings algorithm Features Independent of normalizing constants for both f and q(·|x) (ie, those constants independent of x) Never move to values with f (y ) = 0 The chain (x (t) )t may take the same value several times in a row, even though f is a density wrt Lebesgue measure The sequence (yt )t is usually not a Markov chain
  • 37.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The Metropolis–Hastings algorithm Convergence properties The M-H Markov chain is reversible, with invariant/stationary density f since it satisfies the detailed balance condition f (y ) K (y , x) = f (x) K (x, y )
  • 38.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The Metropolis–Hastings algorithm Convergence properties The M-H Markov chain is reversible, with invariant/stationary density f since it satisfies the detailed balance condition f (y ) K (y , x) = f (x) K (x, y ) If q(y |x) > 0 for every (x, y ), the chain is Harris recurrent and T 1 lim h(X (t) ) = h(x)df (x) a.e. f . T →∞ T t=1 lim K n (x, ·)µ(dx) − f =0 n→∞ TV for every initial distribution µ
  • 39.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Random walk Metropolis–Hastings Use of a local perturbation as proposal Yt = X (t) + εt , where εt ∼ g , independent of X (t) . The instrumental density is now of the form g (y − x) and the Markov chain is a random walk if we take g to be symmetric g (x) = g (−x)
  • 40.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Algorithm (Random walk Metropolis) Given x (t) 1. Generate Yt ∼ g (y − x (t) ) 2. Take f (Yt )  Y with prob. min 1, , (t+1) t X = f (x (t) )  (t) x otherwise.
  • 41.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm RW-MH on mixture posterior distribution 3 2 µ2 1 0 X −1 −1 0 1 2 3 µ1 Random walk MCMC output for .7N (µ1 , 1) + .3N (µ2 , 1)
  • 42.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Acceptance rate A high acceptance rate is not indication of efficiency since the random walk may be moving “too slowly” on the target surface
  • 43.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Acceptance rate A high acceptance rate is not indication of efficiency since the random walk may be moving “too slowly” on the target surface If x (t) and yt are “too close”, i.e. f (x (t) ) f (yt ), yt is accepted with probability f (yt ) min ,1 1. f (x (t) ) and acceptance rate high
  • 44.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Acceptance rate A high acceptance rate is not indication of efficiency since the random walk may be moving “too slowly” on the target surface If average acceptance rate low, the proposed values f (yt ) tend to be small wrt f (x (t) ), i.e. the random walk [not the algorithm!] moves quickly on the target surface often reaching its boundaries
  • 45.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Rule of thumb In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%. [Gelman,Gilks and Roberts, 1995]
  • 46.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Noisy AR(1) Target distribution of x given x1 , x2 and y is −1 τ2 exp (x − ϕx1 )2 + (x2 − ϕx)2 + (y − x 2 )2 . 2τ 2 σ2 For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and give a satisfactory approximation of the target distribution.
  • 47.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Noisy AR(2) Markov chain based on a random walk with scale ω = .1.
  • 48.
    Likelihood-free Methods The Metropolis-Hastings Algorithm The random walk Metropolis-Hastings algorithm Noisy AR(3) Markov chain based on a random walk with scale ω = .5.
  • 49.
    Likelihood-free Methods simulation-based methods in Econometrics Meanwhile, in a Gaulish village... Similar exploration of simulation-based techniques in Econometrics Simulated method of moments Method of simulated moments Simulated pseudo-maximum-likelihood Indirect inference [Gouri´roux & Monfort, 1996] e
  • 50.
    Likelihood-free Methods simulation-based methods in Econometrics Simulated method of moments o Given observations y1:n from a model yt = r (y1:(t−1) , t , θ) , t ∼ g (·) simulate 1:n , derive yt (θ) = r (y1:(t−1) , t , θ) and estimate θ by n arg min (yto − yt (θ))2 θ t=1
  • 51.
    Likelihood-free Methods simulation-based methods in Econometrics Simulated method of moments o Given observations y1:n from a model yt = r (y1:(t−1) , t , θ) , t ∼ g (·) simulate 1:n , derive yt (θ) = r (y1:(t−1) , t , θ) and estimate θ by n n 2 arg min yto − yt (θ) θ t=1 t=1
  • 52.
    Likelihood-free Methods simulation-based methods in Econometrics Method of simulated moments Given a statistic vector K (y ) with Eθ [K (Yt )|y1:(t−1) ] = k(y1:(t−1) ; θ) find an unbiased estimator of k(y1:(t−1) ; θ), ˜ k( t , y1:(t−1) ; θ) Estimate θ by n S arg min K (yt ) − ˜ t k( s , y1:(t−1) ; θ)/S θ t=1 s=1 [Pakes & Pollard, 1989]
  • 53.
    Likelihood-free Methods simulation-based methods in Econometrics Indirect inference ˆ Minimise (in θ) the distance between estimators β based on pseudo-models for genuine observations and for observations simulated under the true model and the parameter θ. [Gouri´roux, Monfort, & Renault, 1993; e Smith, 1993; Gallant & Tauchen, 1996]
  • 54.
    Likelihood-free Methods simulation-based methods in Econometrics Indirect inference (PML vs. PSE) Example of the pseudo-maximum-likelihood (PML) ˆ β(y) = arg max log f (yt |β, y1:(t−1) ) β t leading to ˆ ˆ arg min ||β(yo ) − β(y1 (θ), . . . , yS (θ))||2 θ when ys (θ) ∼ f (y|θ) s = 1, . . . , S
  • 55.
    Likelihood-free Methods simulation-based methods in Econometrics Indirect inference (PML vs. PSE) Example of the pseudo-score-estimator (PSE) 2 ˆ ∂ log f β(y) = arg min (yt |β, y1:(t−1) ) β t ∂β leading to ˆ ˆ arg min ||β(yo ) − β(y1 (θ), . . . , yS (θ))||2 θ when ys (θ) ∼ f (y|θ) s = 1, . . . , S
  • 56.
    Likelihood-free Methods simulation-based methods in Econometrics Consistent indirect inference ...in order to get a unique solution the dimension of the auxiliary parameter β must be larger than or equal to the dimension of the initial parameter θ. If the problem is just identified the different methods become easier...
  • 57.
    Likelihood-free Methods simulation-based methods in Econometrics Consistent indirect inference ...in order to get a unique solution the dimension of the auxiliary parameter β must be larger than or equal to the dimension of the initial parameter θ. If the problem is just identified the different methods become easier... Consistency depending on the criterion and on the asymptotic identifiability of θ [Gouri´roux, Monfort, 1996, p. 66] e
  • 58.
    Likelihood-free Methods simulation-based methods in Econometrics AR(2) vs. MA(1) example true (AR) model yt = t −θ t−1 and [wrong!] auxiliary (MA) model yt = β1 yt−1 + β2 yt−2 + ut R code x=eps=rnorm(250) x[2:250]=x[2:250]-0.5*x[1:249] simeps=rnorm(250) propeta=seq(-.99,.99,le=199) dist=rep(0,199) bethat=as.vector(arima(x,c(2,0,0),incl=FALSE)$coef) for (t in 1:199) dist[t]=sum((as.vector(arima(c(simeps[1],simeps[2:250]-propeta[t]* simeps[1:249]),c(2,0,0),incl=FALSE)$coef)-bethat)^2)
  • 59.
    Likelihood-free Methods simulation-based methods in Econometrics AR(2) vs. MA(1) example One sample: 0.8 0.6 distance 0.4 0.2 0.0 −1.0 −0.5 0.0 0.5 1.0
  • 60.
    Likelihood-free Methods simulation-based methods in Econometrics AR(2) vs. MA(1) example Many samples: 6 5 4 3 2 1 0 0.2 0.4 0.6 0.8 1.0
  • 61.
    Likelihood-free Methods simulation-based methods in Econometrics Choice of pseudo-model Pick model such that ˆ 1. β(θ) not flat (i.e. sensitive to changes in θ) ˆ 2. β(θ) not dispersed (i.e. robust agains changes in ys (θ)) [Frigessi & Heggland, 2004]
  • 62.
    Likelihood-free Methods simulation-based methods in Econometrics ABC using indirect inference We present a novel approach for developing summary statistics for use in approximate Bayesian computation (ABC) algorithms by using indirect inference(...) In the indirect inference approach to ABC the parameters of an auxiliary model fitted to the data become the summary statistics. Although applicable to any ABC technique, we embed this approach within a sequential Monte Carlo algorithm that is completely adaptive and requires very little tuning(...) [Drovandi, Pettitt & Faddy, 2011] c Indirect inference provides summary statistics for ABC...
  • 63.
    Likelihood-free Methods simulation-based methods in Econometrics ABC using indirect inference ...the above result shows that, in the limit as h → 0, ABC will be more accurate than an indirect inference method whose auxiliary statistics are the same as the summary statistic that is used for ABC(...) Initial analysis showed that which method is more accurate depends on the true value of θ. [Fearnhead and Prangle, 2012] c Indirect inference provides estimates rather than global inference...
  • 64.
    Likelihood-free Methods Genetics of ABC Genetic background of ABC ABC is a recent computational technique that only requires being able to sample from the likelihood f (·|θ) This technique stemmed from population genetics models, about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC. [Griffith & al., 1997; Tavar´ & al., 1999] e
  • 65.
    Likelihood-free Methods Genetics of ABC Population genetics [Part derived from the teaching material of Raphael Leblois, ENS Lyon, November 2010] Describe the genotypes, estimate the alleles frequencies, determine their distribution among individuals, populations and between populations; Predict and understand the evolution of gene frequencies in populations as a result of various factors. c Analyses the effect of various evolutive forces (mutation, drift, migration, selection) on the evolution of gene frequencies in time and space.
  • 66.
    Likelihood-free Methods Genetics of ABC A[B]ctors Les mécanismes de l’évolution Towards an explanation of the mechanisms of evolutionary changes, mathematical models ofl’évolution ont to be developed •! Les mathématiques de evolution started in the years 1920-1930. commencé à être développés dans les années 1920-1930 Ronald A Fisher (1890-1962) Sewall Wright (1889-1988) John BS Haldane (1892-1964)
  • 67.
    Likelihood-free Methods Genetics of ABC Wright-Fisher model de Le modèle Wright-Fisher •! En l’absence de mutation et de sélection,A population of constant les fréquences alléliquessize, in which individuals dérivent (augmentent et diminuent) inévitablement reproduce at the same time. jusqu’à la fixation d’un allèle Each gene in a generation is a copy of a gene of the •! La dérivepreviousdonc à la conduit generation. perte de variation génétique à l’intérieur In the absence of mutation des populations and selection, allele frequencies derive inevitably until the fixation of an allele.
  • 68.
    Likelihood-free Methods Genetics of ABC Coalescent theory [Kingman, 1982; Tajima, Tavar´, &tc] e !"#$%&'(('")**+$,-'".'"/010234%'".'5"*$*%()23$15"6" Coalescence theory interested in the genealogy of a sample of !!7**+$,-'",()5534%'" common ancestor of the sample. " genes back in time to the " "!"7**+$,-'"8",$)('5,'1,'"9"
  • 69.
    Likelihood-free Methods Genetics of ABC Common ancestor Les différentes lignées fusionnent (coalescent) au fur et à mesure que l’on remonte vers le Time of coalescence passé (T) Modélisation du processus de dérive génétique The different lineages mergeen “remontant dans lein the past. when we go back temps” jusqu’à l’ancêtre commun d’un échantillon de gènes 6
  • 70.
    Likelihood-free Methods Genetics of ABC Sous l’hypothèse de neutralité des marqueurs génétiques étudiés, les mutations sont indépendantes de la généalogie Neutral généalogie ne dépend que des processus démographiques i.e. la mutations On construit donc la généalogie selon les paramètres démographiques (ex. N), puis on ajoute aassumption les Under the posteriori of mutations sur les différentes neutrality, the mutations branches,independentau feuilles de are du MRCA of the l’arbregenealogy. On obtient ainsi des données de We construct the genealogy polymorphisme sous les modèles according to the démographiques et mutationnels demographic parameters, considérés then we add a posteriori the mutations. 20
  • 71.
    Likelihood-free Methods Genetics of ABC Demo-genetic inference Each model is characterized by a set of parameters θ that cover historical (time divergence, admixture time ...), demographics (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time Problem: most of the time, we can not calculate the likelihood of the polymorphism data f (y|θ).
  • 72.
    Likelihood-free Methods Genetics of ABC Untractable likelihood Missing (too missing!) data structure: f (y|θ) = f (y|G , θ)f (G |θ)dG G The genealogies are considered as nuisance parameters. This problematic thus differs from the phylogenetic approach where the tree is the parameter of interesst.
  • 73.
    Likelihood-free Methods Genetics of ABC A genuine !""#$%&'()*+,(-*.&(/+0$'"1)()&$/+2!,03! example of application 1/+*%*'"4*+56(""4&7()&$/.+.1#+4*.+8-9':*.+ 94 Pygmies populations: do they have a common origin? Is there a lot of exchanges between pygmies and non-pygmies populations?
  • 74.
    Likelihood-free Methods Genetics of ABC !""#$%&'()*+,(-*.&(/+0$'"1)()&$/+2!,03! 1/+*%*'"4*+56(""4&7()&$/.+.1#+4*.+8-9':*.+ Alternative scenarios Différents scénarios possibles, choix de scenario par ABC Verdu et al. 2009 96
  • 75.
    1/+*%*'"4*+56(""4&7()&$/.+.1#+4*.+ Likelihood-free Methods !""#$%&'()*+,(-*.&(/+0$'"1)()&$/+2!,03 Genetics of ABC 1/+*%*'"4*+56(""4&7()&$/.+.1#+4*.+8-9':*. Différentsresults Simulation scénarios possibles, choix de scenari Différents scénarios possibles, choix de scenario par ABC c Scenario 1A is chosen. largement soutenu par rapport aux Le scenario 1a est autres ! plaide pour une origine commune des Le scenario 1a est largement soutenu par populations pygmées d’Afrique de l’Ouest rap Verdu e
  • 76.
    Likelihood-free Methods Genetics of ABC !""#$%&'()*+,(-*.&(/+0$'"1)()&$/+2!,03! Most likely scenario 1/+*%*'"4*+56(""4&7()&$/.+.1#+4*.+8-9':*.+ Scénario évolutif : on « raconte » une histoire à partir de ces inférences Verdu et al. 2009 99
  • 77.
    Likelihood-free Methods Approximate Bayesian computation Approximate Bayesian computation Introduction The Metropolis-Hastings Algorithm simulation-based methods in Econometrics Genetics of ABC Approximate Bayesian computation ABC basics Alphabet soup Automated summary statistic selection
  • 78.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Untractable likelihoods Cases when the likelihood function f (y|θ) is unavailable and when the completion step f (y|θ) = f (y, z|θ) dz Z is impossible or too costly because of the dimension of z c MCMC cannot be implemented!
  • 79.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Untractable likelihoods
  • 80.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Stochastic volatility model: for Highest weight trajectories t = 1, . . . , T , 0.4 0.2 yt = exp(zt ) t , zt = a+bzt−1 +σηt , 0.0 −0.2 T very large makes it difficult to −0.4 include z within the simulated 0 200 400 t 600 800 1000 parameters
  • 81.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Potts model: if y takes values on a grid Y of size k n and f (y|θ) ∝ exp θ Iyl =yi l∼i where l∼i denotes a neighbourhood relation, n moderately large prohibits the computation of the normalising constant
  • 82.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Inference on CMB: in cosmology, study of the Cosmic Microwave Background via likelihoods immensely slow to computate (e.g WMAP, Plank), because of numerically costly spectral transforms [Data is a Fortran program] [Kilbinger et al., 2010, MNRAS]
  • 83.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Phylogenetic tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out [100 processor days with 4 parameters] [Cornuet et al., 2009, Bioinformatics]
  • 84.
    Likelihood-free Methods Approximate Bayesian computation ABC basics The ABC method Bayesian setting: target is π(θ)f (x|θ)
  • 85.
    Likelihood-free Methods Approximate Bayesian computation ABC basics The ABC method Bayesian setting: target is π(θ)f (x|θ) When likelihood f (x|θ) not in closed form, likelihood-free rejection technique:
  • 86.
    Likelihood-free Methods Approximate Bayesian computation ABC basics The ABC method Bayesian setting: target is π(θ)f (x|θ) When likelihood f (x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , z ∼ f (z|θ ) , until the auxiliary variable z is equal to the observed value, z = y. [Tavar´ et al., 1997] e
  • 87.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Why does it work?! The proof is trivial: f (θi ) ∝ π(θi )f (z|θi )Iy (z) z∈D ∝ π(θi )f (y|θi ) = π(θi |y) . [Accept–Reject 101]
  • 88.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Earlier occurrence ‘Bayesian statistics and Monte Carlo methods are ideally suited to the task of passing many models over one dataset’ [Don Rubin, Annals of Statistics, 1984] Note Rubin (1984) does not promote this algorithm for likelihood-free simulation but frequentist intuition on posterior distributions: parameters from posteriors are more likely to be those that could have generated the data.
  • 89.
    Likelihood-free Methods Approximate Bayesian computation ABC basics A as A...pproximative When y is a continuous random variable, equality z = y is replaced with a tolerance condition, (y, z) ≤ where is a distance
  • 90.
    Likelihood-free Methods Approximate Bayesian computation ABC basics A as A...pproximative When y is a continuous random variable, equality z = y is replaced with a tolerance condition, (y, z) ≤ where is a distance Output distributed from π(θ) Pθ { (y, z) < } ∝ π(θ| (y, z) < ) [Pritchard et al., 1999]
  • 91.
    Likelihood-free Methods Approximate Bayesian computation ABC basics ABC algorithm Algorithm 1 Likelihood-free rejection sampler 2 for i = 1 to N do repeat generate θ from the prior distribution π(·) generate z from the likelihood f (·|θ ) until ρ{η(z), η(y)} ≤ set θi = θ end for where η(y) defines a (not necessarily sufficient) statistic
  • 92.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Output The likelihood-free algorithm samples from the marginal in z of: π(θ)f (z|θ)IA ,y (z) π (θ, z|y) = , A ,y ×Θ π(θ)f (z|θ)dzdθ where A ,y = {z ∈ D|ρ(η(z), η(y)) < }.
  • 93.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Output The likelihood-free algorithm samples from the marginal in z of: π(θ)f (z|θ)IA ,y (z) π (θ, z|y) = , A ,y ×Θ π(θ)f (z|θ)dzdθ where A ,y = {z ∈ D|ρ(η(z), η(y)) < }. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π (θ|y) = π (θ, z|y)dz ≈ π(θ|y) .
  • 94.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Convergence of ABC (first attempt) What happens when → 0?
  • 95.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Convergence of ABC (first attempt) What happens when → 0? If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary δ > 0, there exists 0 such that < 0 implies π(θ) f (z|θ)IA ,y (z) dz π(θ)f (y|θ)(1 δ)µ(B ) ∈ A ,y ×Θ π(θ)f (z|θ)dzdθ Θ π(θ)f (y|θ)dθ(1 ± δ)µ(B )
  • 96.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Convergence of ABC (first attempt) What happens when → 0? If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary δ > 0, there exists 0 such that < 0 implies π(θ) f (z|θ)IA ,y (z) dz π(θ)f (y|θ)(1 δ) XX ) µ(BX ∈ A ,y ×Θ π(θ)f (z|θ)dzdθ Θ π(θ)f (y|θ)dθ(1 ± δ) µ(B ) XXX
  • 97.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Convergence of ABC (first attempt) What happens when → 0? If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary δ 0, there exists 0 such that 0 implies π(θ) f (z|θ)IA ,y (z) dz π(θ)f (y|θ)(1 δ) XX ) µ(BX ∈ A ,y ×Θ π(θ)f (z|θ)dzdθ Θ π(θ)f (y|θ)dθ(1 ± δ) X µ(B ) X X [Proof extends to other continuous-in-0 kernels K ]
  • 98.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Convergence of ABC (second attempt) What happens when → 0?
  • 99.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Convergence of ABC (second attempt) What happens when → 0? For B ⊂ Θ, we have A f (z|θ)dz f (z|θ)π(θ)dθ ,y B π(θ)dθ = dz B A ,y ×Θ π(θ)f (z|θ)dzdθ A ,y A ,y ×Θ π(θ)f (z|θ)dzdθ B f (z|θ)π(θ)dθ m(z) = dz A ,y m(z) A ,y ×Θ π(θ)f (z|θ)dzdθ m(z) = π(B|z) dz A ,y A ,y ×Θ π(θ)f (z|θ)dzdθ which indicates convergence for a continuous π(B|z).
  • 100.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes
  • 101.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes function of above variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,
  • 102.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes function of above variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) , Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a g -prior modelling: β ∼ N3 (0, n XT X)−1
  • 103.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes function of above variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) , Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a g -prior modelling: β ∼ N3 (0, n XT X)−1 Use of importance function inspired from the MLE estimate distribution ˆ ˆ β ∼ N (β, Σ)
  • 104.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Pima Indian benchmark 80 100 1.0 80 60 0.8 60 0.6 Density Density Density 40 40 0.4 20 20 0.2 0.0 0 0 −0.005 0.010 0.020 0.030 −0.05 −0.03 −0.01 −1.0 0.0 1.0 2.0 Figure: Comparison between density estimates of the marginals on β1 (left), β2 (center) and β3 (right) from ABC rejection samples (red) and MCMC samples (black) .
  • 105.
    Likelihood-free Methods Approximate Bayesian computation ABC basics MA example Back to the MA(q) model q xt = t + ϑi t−i i=1 Simple prior: uniform over the inverse [real and complex] roots in q Q(u) = 1 − ϑi u i i=1 under the identifiability conditions
  • 106.
    Likelihood-free Methods Approximate Bayesian computation ABC basics MA example Back to the MA(q) model q xt = t + ϑi t−i i=1 Simple prior: uniform prior over the identifiability zone, e.g. triangle for MA(2)
  • 107.
    Likelihood-free Methods Approximate Bayesian computation ABC basics MA example (2) ABC algorithm thus made of 1. picking a new value (ϑ1 , ϑ2 ) in the triangle 2. generating an iid sequence ( t )−qt≤T 3. producing a simulated series (xt )1≤t≤T
  • 108.
    Likelihood-free Methods Approximate Bayesian computation ABC basics MA example (2) ABC algorithm thus made of 1. picking a new value (ϑ1 , ϑ2 ) in the triangle 2. generating an iid sequence ( t )−qt≤T 3. producing a simulated series (xt )1≤t≤T Distance: basic distance between the series T ρ((xt )1≤t≤T , (xt )1≤t≤T ) = (xt − xt )2 t=1 or distance between summary statistics like the q autocorrelations T τj = xt xt−j t=j+1
  • 109.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Comparison of distance impact Evaluation of the tolerance on the ABC sample against both distances ( = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 110.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Comparison of distance impact 4 1.5 3 1.0 2 0.5 1 0.0 0 0.0 0.2 0.4 0.6 0.8 −2.0 −1.0 0.0 0.5 1.0 1.5 θ1 θ2 Evaluation of the tolerance on the ABC sample against both distances ( = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 111.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Comparison of distance impact 4 1.5 3 1.0 2 0.5 1 0.0 0 0.0 0.2 0.4 0.6 0.8 −2.0 −1.0 0.0 0.5 1.0 1.5 θ1 θ2 Evaluation of the tolerance on the ABC sample against both distances ( = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 112.
    Likelihood-free Methods Approximate Bayesian computation ABC basics Homonomy The ABC algorithm is not to be confused with the ABC algorithm The Artificial Bee Colony algorithm is a swarm based meta-heuristic algorithm that was introduced by Karaboga in 2005 for optimizing numerical problems. It was inspired by the intelligent foraging behavior of honey bees. The algorithm is specifically based on the model proposed by Tereshko and Loengarov (2005) for the foraging behaviour of honey bee colonies. The model consists of three essential components: employed and unemployed foraging bees, and food sources. The first two components, employed and unemployed foraging bees, search for rich food sources (...) close to their hive. The model also defines two leading modes of behaviour (...): recruitment of foragers to rich food sources resulting in positive feedback and abandonment of poor sources by foragers causing negative feedback. [Karaboga, Scholarpedia]
  • 113.
    Likelihood-free Methods Approximate Bayesian computation ABC basics ABC advances Simulating from the prior is often poor in efficiency
  • 114.
    Likelihood-free Methods Approximate Bayesian computation ABC basics ABC advances Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
  • 115.
    Likelihood-free Methods Approximate Bayesian computation ABC basics ABC advances Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002]
  • 116.
    Likelihood-free Methods Approximate Bayesian computation ABC basics ABC advances Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002] .....or even by including in the inferential framework [ABCµ ] [Ratmann et al., 2009]
  • 117.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP Better usage of [prior] simulations by adjustement: instead of throwing away θ such that ρ(η(z), η(y)) , replace θ’s with locally regressed transforms (use with BIC) θ∗ = θ − {η(z) − η(y)}T β ˆ [Csill´ry et al., TEE, 2010] e ˆ where β is obtained by [NP] weighted least square regression on (η(z) − η(y)) with weights Kδ {ρ(η(z), η(y))} [Beaumont et al., 2002, Genetics]
  • 118.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP (regression) Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012) : weight directly simulation by Kδ {ρ(η(z(θ)), η(y))} or S 1 Kδ {ρ(η(zs (θ)), η(y))} S s=1 [consistent estimate of f (η|θ)]
  • 119.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP (regression) Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012) : weight directly simulation by Kδ {ρ(η(z(θ)), η(y))} or S 1 Kδ {ρ(η(zs (θ)), η(y))} S s=1 [consistent estimate of f (η|θ)] Curse of dimensionality: poor estimate when d = dim(η) is large...
  • 120.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP (density estimation) Use of the kernel weights Kδ {ρ(η(z(θ)), η(y))} leads to the NP estimate of the posterior expectation i θi Kδ {ρ(η(z(θi )), η(y))} i Kδ {ρ(η(z(θi )), η(y))} [Blum, JASA, 2010]
  • 121.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP (density estimation) Use of the kernel weights Kδ {ρ(η(z(θ)), η(y))} leads to the NP estimate of the posterior conditional density ˜ Kb (θi − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))} [Blum, JASA, 2010]
  • 122.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP (density estimations) Other versions incorporating regression adjustments ˜ Kb (θi∗ − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))}
  • 123.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP (density estimations) Other versions incorporating regression adjustments ˜ Kb (θi∗ − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))} In all cases, error E[ˆ (θ|y)] − g (θ|y) = cb 2 + cδ 2 + OP (b 2 + δ 2 ) + OP (1/nδ d ) g c var(ˆ (θ|y)) = g (1 + oP (1)) nbδ d [Blum, JASA, 2010]
  • 124.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP (density estimations) Other versions incorporating regression adjustments ˜ Kb (θi∗ − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))} In all cases, error E[ˆ (θ|y)] − g (θ|y) = cb 2 + cδ 2 + OP (b 2 + δ 2 ) + OP (1/nδ d ) g c var(ˆ (θ|y)) = g (1 + oP (1)) nbδ d [standard NP calculations]
  • 125.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NCH Incorporating non-linearities and heterocedasticities: σ (η(y)) ˆ θ∗ = m(η(y)) + [θ − m(η(z))] ˆ ˆ σ (η(z)) ˆ
  • 126.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NCH Incorporating non-linearities and heterocedasticities: σ (η(y)) ˆ θ∗ = m(η(y)) + [θ − m(η(z))] ˆ ˆ σ (η(z)) ˆ where m(η) estimated by non-linear regression (e.g., neural network) ˆ σ (η) estimated by non-linear regression on residuals ˆ log{θi − m(ηi )}2 = log σ 2 (ηi ) + ξi ˆ [Blum Fran¸ois, 2009] c
  • 127.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NCH (2) Why neural network?
  • 128.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NCH (2) Why neural network? fights curse of dimensionality selects relevant summary statistics provides automated dimension reduction offers a model choice capability improves upon multinomial logistic [Blum Fran¸ois, 2009] c
  • 129.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MCMC Markov chain (θ(t) ) created via the transition function  θ ∼ Kω (θ |θ(t) ) if x ∼ f (x|θ ) is such that x = y   π(θ )Kω (t) |θ ) θ (t+1) = and u ∼ U(0, 1) ≤ π(θ(t) )K (θ |θ(t) ) ,  ω (θ  (t) θ otherwise,
  • 130.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MCMC Markov chain (θ(t) ) created via the transition function  θ ∼ Kω (θ |θ(t) ) if x ∼ f (x|θ ) is such that x = y   π(θ )Kω (t) |θ ) θ (t+1) = and u ∼ U(0, 1) ≤ π(θ(t) )K (θ |θ(t) ) ,  ω (θ  (t) θ otherwise, has the posterior π(θ|y ) as stationary distribution [Marjoram et al, 2003]
  • 131.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MCMC (2) Algorithm 2 Likelihood-free MCMC sampler Use Algorithm 1 to get (θ(0) , z(0) ) for t = 1 to N do Generate θ from Kω ·|θ(t−1) , Generate z from the likelihood f (·|θ ), Generate u from U[0,1] , π(θ )Kω (θ(t−1) |θ ) if u ≤ I π(θ(t−1) Kω (θ |θ(t−1) ) A ,y (z ) then set (θ(t) , z(t) ) = (θ , z ) else (θ(t) , z(t) )) = (θ(t−1) , z(t−1) ), end if end for
  • 132.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Why does it work? Acceptance probability does not involve calculating the likelihood and π (θ , z |y) q(θ (t−1) |θ )f (z(t−1) |θ (t−1) ) × π (θ (t−1) , z(t−1) |y) q(θ |θ (t−1) )f (z |θ ) π(θ ) XX|θ IA ,y (z ) ) f (z X X = π(θ (t−1) ) f (z(t−1) |θ (t−1) ) IA ,y (z(t−1) ) q(θ (t−1) |θ ) f (z(t−1) |θ (t−1) ) × q(θ |θ (t−1) ) X|θ X ) f (z XX
  • 133.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Why does it work? Acceptance probability does not involve calculating the likelihood and π (θ , z |y) q(θ (t−1) |θ )f (z(t−1) |θ (t−1) ) × π (θ (t−1) , z(t−1) |y) q(θ |θ (t−1) )f (z |θ ) π(θ ) XX|θ IA ,y (z ) ) f (z X X = hh ( π(θ (t−1) ) ((hhhhh) IA ,y (z(t−1) ) f (z(t−1) |θ (t−1) ((( h (( hh ( q(θ (t−1) |θ ) ((hhhhh) f (z(t−1) |θ (t−1) (( ((( h × q(θ |θ (t−1) ) X|θ X ) f (z XX
  • 134.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Why does it work? Acceptance probability does not involve calculating the likelihood and π (θ , z |y) q(θ (t−1) |θ )f (z(t−1) |θ (t−1) ) × π (θ (t−1) , z(t−1) |y) q(θ |θ (t−1) )f (z |θ ) π(θ ) X|θ IA ,y (z ) X ) f (z X X = hh ( π(θ (t−1) ) ((hhhhh) XX(t−1) ) f (z(t−1) |θ (t−1) IAX (( X ((( h ,y (zX X hh ( q(θ (t−1) |θ ) ((hhhhh) f (z(t−1) |θ (t−1) (( ((( h × q(θ |θ (t−1) ) X|θ X ) f (z XX π(θ )q(θ (t−1) |θ ) = IA ,y (z ) π(θ (t−1) q(θ |θ (t−1) )
  • 135.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example Case of 1 1 x ∼ N (θ, 1) + N (θ, 1) 2 2 under prior θ ∼ N (0, 10)
  • 136.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example Case of 1 1 x ∼ N (θ, 1) + N (θ, 1) 2 2 under prior θ ∼ N (0, 10) ABC sampler thetas=rnorm(N,sd=10) zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1) eps=quantile(abs(zed-x),.01) abc=thetas[abs(zed-x)eps]
  • 137.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example Case of 1 1 x ∼ N (θ, 1) + N (θ, 1) 2 2 under prior θ ∼ N (0, 10) ABC-MCMC sampler metas=rep(0,N) metas[1]=rnorm(1,sd=10) zed[1]=x for (t in 2:N){ metas[t]=rnorm(1,mean=metas[t-1],sd=5) zed[t]=rnorm(1,mean=(1-2*(runif(1).5))*metas[t],sd=1) if ((abs(zed[t]-x)eps)||(runif(1)dnorm(metas[t],sd=10)/dnorm(metas[t-1],sd=10))){ metas[t]=metas[t-1] zed[t]=zed[t-1]} }
  • 138.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example x =2
  • 139.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example x =2 0.30 0.25 0.20 0.15 0.10 0.05 0.00 −4 −2 0 2 4 θ
  • 140.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example x = 10
  • 141.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example x = 10 0.25 0.20 0.15 0.10 0.05 0.00 −10 −5 0 5 10 θ
  • 142.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example x = 50
  • 143.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup A toy example x = 50 0.10 0.08 0.06 0.04 0.02 0.00 −40 −20 0 20 40 θ
  • 144.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS] Use of a joint density f (θ, |y) ∝ ξ( |y, θ) × πθ (θ) × π ( ) where y is the data, and ξ( |y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and x when z ∼ f (z|θ)
  • 145.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS] Use of a joint density f (θ, |y) ∝ ξ( |y, θ) × πθ (θ) × π ( ) where y is the data, and ξ( |y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and x when z ∼ f (z|θ) Warning! Replacement of ξ( |y, θ) with a non-parametric kernel approximation.
  • 146.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ details Multidimensional distances ρk (k = 1, . . . , K ) and errors k = ρk (ηk (z), ηk (y)), with ˆ 1 k ∼ ξk ( |y, θ) ≈ ξk ( |y, θ) = K [{ k −ρk (ηk (zb ), ηk (y))}/hk ] Bhk b ˆ then used in replacing ξ( |y, θ) with mink ξk ( |y, θ)
  • 147.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ details Multidimensional distances ρk (k = 1, . . . , K ) and errors k = ρk (ηk (z), ηk (y)), with ˆ 1 k ∼ ξk ( |y, θ) ≈ ξk ( |y, θ) = K [{ k −ρk (ηk (zb ), ηk (y))}/hk ] Bhk b ˆ then used in replacing ξ( |y, θ) with mink ξk ( |y, θ) ABCµ involves acceptance probability ˆ π(θ , ) q(θ , θ)q( , ) mink ξk ( |y, θ ) ˆ π(θ, ) q(θ, θ )q( , ) mink ξk ( |y, θ)
  • 148.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ multiple errors [ c Ratmann et al., PNAS, 2009]
  • 149.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ for model choice [ c Ratmann et al., PNAS, 2009]
  • 150.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Questions about ABCµ For each model under comparison, marginal posterior on used to assess the fit of the model (HPD includes 0 or not).
  • 151.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Questions about ABCµ For each model under comparison, marginal posterior on used to assess the fit of the model (HPD includes 0 or not). Is the data informative about ? [Identifiability] How is the prior π( ) impacting the comparison? How is using both ξ( |x0 , θ) and π ( ) compatible with a standard probability model? [remindful of Wilkinson ] Where is the penalisation for complexity in the model comparison? [X, Mengersen Chen, 2010, PNAS]
  • 152.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC Another sequential version producing a sequence of Markov (t) (t) transition kernels Kt and of samples (θ1 , . . . , θN ) (1 ≤ t ≤ T )
  • 153.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC Another sequential version producing a sequence of Markov (t) (t) transition kernels Kt and of samples (θ1 , . . . , θN ) (1 ≤ t ≤ T ) ABC-PRC Algorithm (t−1) 1. Select θ at random from previous θi ’s with probabilities (t−1) ωi (1 ≤ i ≤ N). 2. Generate (t) (t) θi ∼ Kt (θ|θ ) , x ∼ f (x|θi ) , 3. Check that (x, y ) , otherwise start again. [Sisson et al., 2007]
  • 154.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Why PRC? Partial rejection control: Resample from a population of weighted particles by pruning away particles with weights below threshold C , replacing them by new particles obtained by propagating an existing particle by an SMC step and modifying the weights accordinly. [Liu, 2001]
  • 155.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Why PRC? Partial rejection control: Resample from a population of weighted particles by pruning away particles with weights below threshold C , replacing them by new particles obtained by propagating an existing particle by an SMC step and modifying the weights accordinly. [Liu, 2001] PRC justification in ABC-PRC: Suppose we then implement the PRC algorithm for some c 0 such that only identically zero weights are smaller than c Trouble is, there is no such c...
  • 156.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC weight (t) Probability ωi computed as (t) (t) (t) (t) ωi ∝ π(θi )Lt−1 (θ |θi ){π(θ )Kt (θi |θ )}−1 , where Lt−1 is an arbitrary transition kernel.
  • 157.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC weight (t) Probability ωi computed as (t) (t) (t) (t) ωi ∝ π(θi )Lt−1 (θ |θi ){π(θ )Kt (θi |θ )}−1 , where Lt−1 is an arbitrary transition kernel. In case Lt−1 (θ |θ) = Kt (θ|θ ) , all weights are equal under a uniform prior.
  • 158.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC weight (t) Probability ωi computed as (t) (t) (t) (t) ωi ∝ π(θi )Lt−1 (θ |θi ){π(θ )Kt (θi |θ )}−1 , where Lt−1 is an arbitrary transition kernel. In case Lt−1 (θ |θ) = Kt (θ|θ ) , all weights are equal under a uniform prior. Inspired from Del Moral et al. (2006), who use backward kernels Lt−1 in SMC to achieve unbiasedness
  • 159.
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC bias Lack of unbiasedness of the method
  • 160.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC bias Lack of unbiasedness of the method Joint density of the accepted pair (θ^{(t−1)}, θ^{(t)}) proportional to π(θ^{(t−1)}|y) K_t(θ^{(t)}|θ^{(t−1)}) f(y|θ^{(t)}). For an arbitrary function h(θ), E[ω_t h(θ^{(t)})] is proportional to ∫ h(θ^{(t)}) [π(θ^{(t)}) L_{t−1}(θ^{(t−1)}|θ^{(t)}) / π(θ^{(t−1)}) K_t(θ^{(t)}|θ^{(t−1)})] π(θ^{(t−1)}|y) K_t(θ^{(t)}|θ^{(t−1)}) f(y|θ^{(t)}) dθ^{(t−1)} dθ^{(t)} ∝ ∫ h(θ^{(t)}) [π(θ^{(t)}) L_{t−1}(θ^{(t−1)}|θ^{(t)}) / π(θ^{(t−1)}) K_t(θ^{(t)}|θ^{(t−1)})] π(θ^{(t−1)}) f(y|θ^{(t−1)}) × K_t(θ^{(t)}|θ^{(t−1)}) f(y|θ^{(t)}) dθ^{(t−1)} dθ^{(t)} ∝ ∫ h(θ^{(t)}) π(θ^{(t)}|y) [ ∫ L_{t−1}(θ^{(t−1)}|θ^{(t)}) f(y|θ^{(t−1)}) dθ^{(t−1)} ] dθ^{(t)}.
  • 161.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup A mixture example (1) Toy model of Sisson et al. (2007): if θ ∼ U(−10, 10), x|θ ∼ 0.5 N(θ, 1) + 0.5 N(θ, 1/100), then the posterior distribution associated with y = 0 is the normal mixture θ|y = 0 ∼ 0.5 N(0, 1) + 0.5 N(0, 1/100) restricted to [−10, 10]. Furthermore, the true target is available as π(θ | |x| < ε) ∝ Φ(ε − θ) − Φ(−ε − θ) + Φ(10(ε − θ)) − Φ(−10(ε + θ)).
  • 162.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup “Ugly, squalid graph...” [Figure: ABC-PRC output densities over θ ∈ (−3, 3) across iterations] Comparison of τ = 0.15 and τ = 1/0.15 in K_t
  • 163.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup A PMC version Use of the same kernel idea as ABC-PRC but with IS correction (check PMC). Generate a sample at iteration t by π̂_t(θ^{(t)}) ∝ Σ_{j=1}^N ω_j^{(t−1)} K_t(θ^{(t)}|θ_j^{(t−1)}), modulo acceptance of the associated x_t, and use an importance weight associated with an accepted simulation θ_i^{(t)}, ω_i^{(t)} ∝ π(θ_i^{(t)}) / π̂_t(θ_i^{(t)}). © Still likelihood free [Beaumont et al., 2009]
  • 164.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Our ABC-PMC algorithm Given a decreasing sequence of approximation levels ε_1 ≥ ... ≥ ε_T, 1. At iteration t = 1, for i = 1, ..., N: simulate θ_i^{(1)} ∼ π(θ) and x ∼ f(x|θ_i^{(1)}) until ρ(x, y) < ε_1; set ω_i^{(1)} = 1/N. Take τ_2^2 as twice the empirical variance of the θ_i^{(1)}'s. 2. At iteration 2 ≤ t ≤ T, for i = 1, ..., N: repeat (pick θ_i* from the θ_j^{(t−1)}'s with probabilities ω_j^{(t−1)}, generate θ_i^{(t)}|θ_i* ∼ N(θ_i*, τ_t^2) and x ∼ f(x|θ_i^{(t)})) until ρ(x, y) < ε_t. Set ω_i^{(t)} ∝ π(θ_i^{(t)}) / Σ_{j=1}^N ω_j^{(t−1)} φ(τ_t^{−1}(θ_i^{(t)} − θ_j^{(t−1)})). Take τ_{t+1}^2 as twice the weighted empirical variance of the θ_i^{(t)}'s.
  • 165.
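A minimal Python sketch of this ABC-PMC scheme, run on the toy normal-mixture example of Sisson et al. described above. The tolerance schedule, number of particles and prior bounds are illustrative choices, not values taken from the slides.

import numpy as np

rng = np.random.default_rng(0)

def simulate(theta):
    # toy model: x | theta ~ 0.5 N(theta, 1) + 0.5 N(theta, 1/100)
    sd = 1.0 if rng.random() < 0.5 else 0.1
    return rng.normal(theta, sd)

y_obs = 0.0
N = 1000
eps = [2.0, 0.5, 0.1, 0.025]       # decreasing tolerance schedule (illustrative)

# t = 1: plain ABC rejection from the prior U(-10, 10)
theta = np.empty(N)
for i in range(N):
    while True:
        th = rng.uniform(-10, 10)
        if abs(simulate(th) - y_obs) < eps[0]:
            theta[i] = th
            break
w = np.full(N, 1.0 / N)
tau2 = 2 * np.var(theta)           # kernel variance: twice the empirical variance

# t >= 2: move particles with a Gaussian kernel and correct by importance weights
for e in eps[1:]:
    new_theta = np.empty(N)
    for i in range(N):
        while True:
            th_star = rng.choice(theta, p=w)
            th = rng.normal(th_star, np.sqrt(tau2))
            if not -10 < th < 10:          # outside the prior support
                continue
            if abs(simulate(th) - y_obs) < e:
                new_theta[i] = th
                break
    # flat prior, so weight = 1 / (mixture-of-kernels density); constants cancel
    kern = np.exp(-(new_theta[:, None] - theta[None, :]) ** 2 / (2 * tau2))
    new_w = 1.0 / (kern * w[None, :]).sum(axis=1)
    new_w /= new_w.sum()
    theta, w = new_theta, new_w
    tau2 = 2 * float(np.cov(theta, aweights=w))

print("ABC-PMC weighted posterior mean:", float(np.sum(w * theta)))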
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Sequential Monte Carlo SMC is a simulation technique to approximate a sequence of related probability distributions π_n with π_0 “easy” and π_T as target. Iterated IS as PMC: particles moved from time n−1 to time n via kernel K_n, and use of a sequence of extended targets π̃_n, π̃_n(z_{0:n}) = π_n(z_n) ∏_{j=0}^{n−1} L_j(z_{j+1}, z_j), where the L_j's are backward Markov kernels [check that π_n(z_n) is a marginal] [Del Moral, Doucet & Jasra, Series B, 2006]
  • 166.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Sequential Monte Carlo (2) Algorithm 3 SMC sampler: sample z_i^{(0)} ∼ γ_0(x) (i = 1, ..., N); compute weights w_i^{(0)} = π_0(z_i^{(0)})/γ_0(z_i^{(0)}); for t = 1 to N do: if ESS(w^{(t−1)}) < N_T then resample N particles z^{(t−1)} and set weights to 1 end if; generate z_i^{(t)} ∼ K_t(z_i^{(t−1)}, ·) and set weights to w_i^{(t)} = w_i^{(t−1)} π_t(z_i^{(t)}) L_{t−1}(z_i^{(t)}, z_i^{(t−1)}) / [π_{t−1}(z_i^{(t−1)}) K_t(z_i^{(t−1)}, z_i^{(t)})]; end for
  • 167.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-SMC [Del Moral, Doucet & Jasra, 2009] True derivation of an SMC-ABC algorithm: use of a kernel K_n associated with target π_{ε_n} and derivation of the backward kernel L_{n−1}(z, z') = π_{ε_n}(z') K_n(z', z) / π_{ε_n}(z). Update of the weights w_{in} ∝ w_{i(n−1)} Σ_{m=1}^M I_{A_{ε_n}}(x_{in}^m) / Σ_{m=1}^M I_{A_{ε_{n−1}}}(x_{i(n−1)}^m) when x_{in}^m ∼ K(x_{i(n−1)}, ·)
  • 168.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-SMCM Modification: makes M repeated simulations of the pseudo-data z given the parameter, rather than using a single [M = 1] simulation, leading to a weight proportional to the number of accepted z_i's, ω(θ) = (1/M) Σ_{i=1}^M I_{ρ(η(y),η(z_i)) < ε} [limit in M means exact simulation from the (tempered) target]
  • 169.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Properties of ABC-SMC The ABC-SMC method properly uses a backward kernel L(z, z') to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. Update of importance weights is reduced to the ratio of the proportions of surviving particles. Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior]
  • 170.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Properties of ABC-SMC The ABC-SMC method properly uses a backward kernel L(z, z') to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. Update of importance weights is reduced to the ratio of the proportions of surviving particles. Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior] Adaptivity in the ABC-SMC algorithm is only found in the on-line construction of the thresholds ε_t, slowly enough to keep a large number of accepted transitions
  • 171.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup A mixture example (2) Recovery of the target, whether using a fixed standard deviation of τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τ_t's. [Figure: estimated densities over θ ∈ (−3, 3)]
  • 172.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Yet another ABC-PRC Another version of ABC-PRC called PRC-ABC with a proposal distribution M_t, S replications of the pseudo-data, and a PRC step: PRC-ABC Algorithm 1. Select θ* at random from the θ_i^{(t−1)}'s with probabilities ω_i^{(t−1)}. 2. Generate θ_i^{(t)} ∼ M_t, x_1, ..., x_S iid ∼ f(x|θ_i^{(t)}), set ω_i^{(t)} = Σ_s K_t{η(x_s) − η(y)} / M_t(θ_i^{(t)}), and accept with probability p^{(i)} = ω_i^{(t)}/c_{t,N}. 3. Set ω_i^{(t)} ∝ ω_i^{(t)}/p^{(i)} (1 ≤ i ≤ N) [Peters, Sisson & Fan, Stat Computing, 2012]
  • 173.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Yet another ABC-PRC Another version of ABC-PRC called PRC-ABC with a proposal distribution M_t, S replications of the pseudo-data, and a PRC step: Use of a NP estimate M_t(θ) = Σ_i ω_i^{(t−1)} ψ(θ|θ_i^{(t−1)}) (with a fixed variance?!) S = 1 recommended for “greatest gains in sampler performance” c_{t,N} upper quantile of non-zero weights at time t
  • 174.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Wilkinson’s exact BC ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, convolution of the true posterior with a kernel function, π_ε(θ, z|y) = π(θ) f(z|θ) K_ε(y − z) / ∫ π(θ) f(z|θ) K_ε(y − z) dz dθ, with K_ε a kernel parameterised by bandwidth ε. [Wilkinson, 2008]
  • 175.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Wilkinson’s exact BC ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, convolution of the true posterior with a kernel function, π_ε(θ, z|y) = π(θ) f(z|θ) K_ε(y − z) / ∫ π(θ) f(z|θ) K_ε(y − z) dz dθ, with K_ε a kernel parameterised by bandwidth ε. [Wilkinson, 2008] Theorem The ABC algorithm based on the assumption of a randomised observation ỹ = y + ξ, ξ ∼ K_ε, and an acceptance probability of K_ε(y − z)/M gives draws from the posterior distribution π(θ|y).
  • 176.
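A minimal Python sketch of the kernel-based acceptance step underlying this reinterpretation: accept (θ, z) with probability K_ε(y − z)/M, so that the output targets the convolution π_ε(θ|y) above. The Gaussian kernel, toy normal model and numerical values are illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng(1)

def K_eps(u, eps):
    # Gaussian kernel, bounded by 1, with bandwidth eps (illustrative choice)
    return np.exp(-0.5 * (u / eps) ** 2)

def kernel_abc(y_obs, n_draws, eps, prior_draw, simulate):
    # accept (theta, z) with probability K_eps(y_obs - z) / M, here M = 1
    kept = []
    while len(kept) < n_draws:
        theta = prior_draw()
        z = simulate(theta)
        if rng.random() < K_eps(y_obs - z, eps):
            kept.append(theta)
    return np.array(kept)

# toy setting: theta ~ N(0, 10), y | theta ~ N(theta, 1), observed y = 1.5
post = kernel_abc(
    y_obs=1.5, n_draws=2000, eps=0.3,
    prior_draw=lambda: rng.normal(0, np.sqrt(10)),
    simulate=lambda th: rng.normal(th, 1),
)
print("mean and sd of the kernel-ABC sample:", post.mean(), post.std())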
Likelihood-free Methods Approximate Bayesian computation Alphabet soup How exact a BC? “Using ε to represent measurement error is straightforward, whereas using ε to model the model discrepancy is harder to conceptualize and not as commonly used” [Richard Wilkinson, 2008]
  • 177.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup How exact a BC? Pros: pseudo-data from the true model and observed data from the noisy model; interesting perspective in that the outcome is completely controlled; link with ABCµ and with the assumption that y is observed with a measurement error with density K_ε; relates to the theory of model approximation [Kennedy & O’Hagan, 2001]. Cons: requires K_ε to be bounded by M; true approximation error never assessed; requires a modification of the standard ABC algorithm
  • 178.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC for HMMs Specific case of a hidden Markov model, X_{t+1} ∼ Q_θ(X_t, ·), Y_{t+1} ∼ g_θ(·|x_{t+1}), where only y_{1:n}^0 is observed. [Dean, Singh, Jasra & Peters, 2011]
  • 179.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC for HMMs Specific case of a hidden Markov model, X_{t+1} ∼ Q_θ(X_t, ·), Y_{t+1} ∼ g_θ(·|x_{t+1}), where only y_{1:n}^0 is observed. [Dean, Singh, Jasra & Peters, 2011] Use of specific constraints, adapted to the Markov structure: (y_1, ..., y_n) ∈ B(y_1^0, ε) × · · · × B(y_n^0, ε)
  • 180.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MLE for HMMs ABC-MLE defined by θ̂_n = arg max_θ P_θ( Y_1 ∈ B(y_1^0, ε), ..., Y_n ∈ B(y_n^0, ε) )
  • 181.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MLE for HMMs ABC-MLE defined by θ̂_n = arg max_θ P_θ( Y_1 ∈ B(y_1^0, ε), ..., Y_n ∈ B(y_n^0, ε) ) Exact MLE for the likelihood (same basis as Wilkinson!) p_θ^ε(y_1^0, ..., y_n^0), corresponding to the perturbed process (x_t, y_t + ε z_t)_{1 ≤ t ≤ n}, z_t ∼ U(B(0, 1)) [Dean, Singh, Jasra & Peters, 2011]
  • 182.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MLE is biased ABC-MLE is asymptotically (in n) biased, with target l_ε(θ) = E_{θ*}[log p_θ^ε(Y_1|Y_{−∞:0})]
  • 183.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MLE is biased ABC-MLE is asymptotically (in n) biased, with target l_ε(θ) = E_{θ*}[log p_θ^ε(Y_1|Y_{−∞:0})] but ABC-MLE converges to the true value in the sense l_{ε_n}(θ_n) → l_ε(θ) for all sequences (θ_n) converging to θ and (ε_n) converging to ε
  • 184.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Noisy ABC-MLE Idea: modify instead the data from the start, (y_1^0 + εζ_1, ..., y_n^0 + εζ_n) [see Fearnhead-Prangle]; noisy ABC-MLE estimate arg max_θ P_θ( Y_1 ∈ B(y_1^0 + εζ_1, ε), ..., Y_n ∈ B(y_n^0 + εζ_n, ε) ) [Dean, Singh, Jasra & Peters, 2011]
  • 185.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Consistent noisy ABC-MLE Degrading the data improves the estimation performances: noisy ABC-MLE is asymptotically (in n) consistent; under further assumptions, the noisy ABC-MLE is asymptotically normal; increase in variance of order ε^{−2}; likely degradation in precision or computing time due to the lack of summary statistic
  • 186.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup SMC for ABC likelihood Algorithm 4 SMC ABC for HMMs: Given θ, for k = 1, ..., n do: generate proposals (x_k^1, y_k^1), ..., (x_k^N, y_k^N) from the model; weigh each proposal with ω_k^l = I_{B(y_k^0 + ζ_k, ε)}(y_k^l); renormalise the weights and sample the x_k^l's accordingly; end for; approximate the likelihood by ∏_{k=1}^n ( Σ_{l=1}^N ω_k^l / N ) [Jasra, Singh, Martin & McCoy, 2010]
  • 187.
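A minimal Python sketch of this SMC approximation of the (noisy) ABC likelihood, following the weighting and resampling structure of Algorithm 4. The AR(1)-plus-noise hidden Markov model, the tolerance, and the perturbations ζ_k are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

def smc_abc_loglik(theta, y0, zeta, eps, N=500):
    # Hidden chain: X_{k+1} = theta * X_k + noise; observation Y_k = X_k + noise.
    # Only the ABC weighting / resampling scheme follows the slides.
    n = len(y0)
    x = rng.normal(0, 1, N)                     # initial hidden states
    loglik = 0.0
    for k in range(n):
        x = theta * x + rng.normal(0, 1, N)     # propagate hidden states
        y = x + rng.normal(0, 1, N)             # propose pseudo-observations
        w = (np.abs(y - (y0[k] + zeta[k])) < eps).astype(float)
        if w.sum() == 0:
            return -np.inf                      # all proposals rejected
        loglik += np.log(w.mean())              # factor sum_l omega_k^l / N
        # renormalise the weights and resample the hidden states accordingly
        idx = rng.choice(N, N, p=w / w.sum())
        x = x[idx]
    return loglik

y0 = rng.normal(0, 1.5, 50)                     # pretend observed series
zeta = np.zeros(50)                             # zeta = 0 recovers plain ABC-MLE
print(smc_abc_loglik(0.5, y0, zeta, eps=1.0))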
    Likelihood-free Methods Approximate Bayesian computation Alphabet soup Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistics [except when done by the experimenters in the field]
  • 188.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field] Starting from a large collection of summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test.
  • 189.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field] Starting from a large collection of summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test. Does not take into account the sequential nature of the tests Depends on parameterisation Order of inclusion matters.
  • 190.
Likelihood-free Methods Approximate Bayesian computation Alphabet soup A connected Monte Carlo study On the number of pseudo-data per simulated parameter: repeating simulations does not improve the approximation; the tolerance level does not seem to be highly influential; the choice of distance / summary statistics / calibration factors is paramount to a successful approximation; ABC-SMC outperforms ABC-MCMC [McKinley, Cook & Deardon, 2009]
  • 191.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Semi-automatic ABC Fearnhead and Prangle (2010) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal. ABC is then considered from a purely inferential viewpoint and calibrated for estimation purposes. Use of a randomised (or ‘noisy’) version of the summary statistics, η̃(y) = η(y) + τε. Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic.
  • 192.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Semi-automatic ABC Fearnhead and Prangle (2010) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal. ABC is then considered from a purely inferential viewpoint and calibrated for estimation purposes. Use of a randomised (or ‘noisy’) version of the summary statistics, η̃(y) = η(y) + τε. Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic. [calibration constraint: ABC approximation with the same posterior mean as the true randomised posterior.]
  • 193.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Summary statistics Optimality of the posterior expectation E[θ|y] of the parameter of interest as summary statistics η(y)!
  • 194.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Summary statistics Optimality of the posterior expectation E[θ|y] of the parameter of interest as summary statistics η(y)! Use of the standard quadratic loss function (θ − θ0 )T A(θ − θ0 ) .
  • 195.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Details on Fearnhead and Prangle (FP) ABC Use of a summary statistic S(·), an importance proposal g(·), a kernel K(·) ≤ 1 and a bandwidth h > 0 such that (θ, y_sim) ∼ g(θ) f(y_sim|θ) is accepted with probability (hence the bound) K[{S(y_sim) − s_obs}/h], with the corresponding importance weight defined by π(θ)/g(θ) [Fearnhead & Prangle, 2012]
  • 196.
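A minimal Python sketch of this acceptance/weighting mechanism: draw (θ, y_sim) from g(θ) f(·|θ), accept with probability K[{S(y_sim) − s_obs}/h] (K bounded by 1), and keep the importance weight π(θ)/g(θ). The normal model, the uniform kernel, the proposal g, and all numerical values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

def gauss_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def K(u):
    # uniform kernel on [-1, 1], bounded by 1 as required
    return float(abs(u) <= 1)

def fp_abc(s_obs, h, n_keep, prior_pdf, g_draw, g_pdf, simulate, S):
    # accept with probability K[{S(y_sim) - s_obs}/h]; keep weight pi(theta)/g(theta)
    thetas, weights = [], []
    while len(thetas) < n_keep:
        theta = g_draw()
        y_sim = simulate(theta)
        if rng.random() < K((S(y_sim) - s_obs) / h):
            thetas.append(theta)
            weights.append(prior_pdf(theta) / g_pdf(theta))
    w = np.array(weights)
    return np.array(thetas), w / w.sum()

# toy setting: theta ~ N(0, 4) prior, y | theta = 20 draws from N(theta, 1),
# summary S(y) = sample mean, importance proposal g = N(0, 9)
th, w = fp_abc(
    s_obs=0.8, h=0.1, n_keep=1000,
    prior_pdf=lambda t: gauss_pdf(t, 0.0, 2.0),
    g_draw=lambda: rng.normal(0.0, 3.0),
    g_pdf=lambda t: gauss_pdf(t, 0.0, 3.0),
    simulate=lambda t: rng.normal(t, 1.0, 20),
    S=lambda y: y.mean(),
)
print("ABC posterior mean of theta:", float(np.sum(w * th)))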
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Errors, errors, and errors Three levels of approximation: π(θ|y_obs) by π(θ|s_obs), loss of information [ignored]; π(θ|s_obs) by π_ABC(θ|s_obs) = ∫ π(s) K[{s − s_obs}/h] π(θ|s) ds / ∫ π(s) K[{s − s_obs}/h] ds, noisy observations; π_ABC(θ|s_obs) by importance Monte Carlo based on N simulations, represented by var(a(θ)|s_obs)/N_acc [expected number of acceptances] [M. Twain/B. Disraeli]
  • 197.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Average acceptance asymptotics For the average acceptance probability/approximate likelihood p(θ|s_obs) = ∫ f(y_sim|θ) K[{S(y_sim) − s_obs}/h] dy_sim, the overall acceptance probability is p(s_obs) = ∫ p(θ|s_obs) π(θ) dθ = π(s_obs) h^d + o(h^d) [FP, Lemma 1]
  • 198.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Optimal importance proposal Best choice of importance proposal in terms of effective sample size: g(θ|s_obs) ∝ π(θ) p(θ|s_obs)^{1/2} [Not particularly useful in practice]
  • 199.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Optimal importance proposal Best choice of importance proposal in terms of effective sample size: g(θ|s_obs) ∝ π(θ) p(θ|s_obs)^{1/2} [Not particularly useful in practice] note that p(θ|s_obs) is an approximate likelihood; reminiscent of parallel tempering; could be approximately achieved by attrition of half of the data
  • 200.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Calibration of h “This result gives insight into how S(·) and h affect the Monte Carlo error. To minimize Monte Carlo error, we need h^d to be not too small. Thus ideally we want S(·) to be a low dimensional summary of the data that is sufficiently informative about θ that π(θ|sobs ) is close, in some sense, to π(θ|yobs )” (FP, p.5) turns h into an absolute value while it should be context-dependent and user-calibrated; only addresses one term in the approximation error and acceptance probability (“curse of dimensionality”); h large prevents π_ABC(θ|s_obs) from being close to π(θ|s_obs); d small prevents π(θ|s_obs) from being close to π(θ|y_obs) (“curse of [dis]information”)
  • 201.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Calibrating ABC “If πABC is calibrated, then this means that probability statements that are derived from it are appropriate, and in particular that we can use πABC to quantify uncertainty in estimates” (FP, p.5)
  • 202.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Calibrating ABC “If πABC is calibrated, then this means that probability statements that are derived from it are appropriate, and in particular that we can use πABC to quantify uncertainty in estimates” (FP, p.5) Definition For 0 < q < 1 and a subset A, the event E_q(A) is made of the s_obs such that Pr_ABC(θ ∈ A|s_obs) = q. Then ABC is calibrated if Pr(θ ∈ A|E_q(A)) = q; unclear meaning of conditioning on E_q(A)
  • 203.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Calibrated ABC Theorem (FP) Noisy ABC, where s_obs = S(y_obs) + hε, ε ∼ K(·), is calibrated [Wilkinson, 2008] no condition on h!!
  • 204.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Calibrated ABC Consequence: when h = ∞, Theorem (FP) The prior distribution is always calibrated. Is this a relevant property then?
  • 205.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection More about calibrated ABC “Calibration is not universally accepted by Bayesians. It is even more questionable here as we care how statements we make relate to the real world, not to a mathematically defined posterior.” R. Wilkinson Same reluctance about the prior being calibrated; property depending on prior, likelihood, and summary; calibration is a frequentist property (almost a p-value!); more sensible to account for the simulator’s imperfections than to use noisy-ABC against a meaningless ε-based measure [Wilkinson, 2012]
  • 206.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Converging ABC Theorem (FP) For noisy ABC, the expected noisy-ABC log-likelihood, E{log[p(θ|s_obs)]} = ∫∫ log[p(θ|S(y_obs) + ε)] π(y_obs|θ_0) K(ε) dy_obs dε, has its maximum at θ = θ_0. True for any choice of summary statistic? even ancillary statistics?! [Imposes at least identifiability...] Relevant in asymptotia and not for the data
  • 207.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Converging ABC Corollary For noisy ABC, the ABC posterior converges onto a point mass on the true parameter value as m → ∞. For standard ABC, this is not always the case (unless h goes to zero). Strength of regularity conditions (c1) and (c2) in Bernardo & Smith, 1994? [out-of-reach constraints on likelihood and posterior] Again, there must be conditions imposed upon summary statistics...
  • 208.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Loss motivated statistic Under the quadratic loss function, Theorem (FP) (i) The minimal posterior error E[L(θ, θ̂)|y_obs] occurs when θ̂ = E(θ|y_obs) (!) (ii) When h → 0, E_ABC(θ|s_obs) converges to E(θ|y_obs) (iii) If S(y_obs) = E[θ|y_obs] then for θ̂ = E_ABC[θ|s_obs], E[L(θ, θ̂)|y_obs] = trace(AΣ) + h^2 ∫ x^T A x K(x) dx + o(h^2). measure-theoretic difficulties? dependence of s_obs on h makes me uncomfortable, inherent to noisy ABC. Relevant for the choice of K?
  • 209.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Optimal summary statistic “We take a different approach, and weaken the requirement for πABC to be a good approximation to π(θ|yobs ). We argue for πABC to be a good approximation solely in terms of the accuracy of certain estimates of the parameters.” (FP, p.5) From this result, FP derive their choice of summary statistic, S(y) = E(θ|y) [almost sufficient] suggest h = O(N −1/(2+d) ) and h = O(N −1/(4+d) ) as optimal bandwidths for noisy and standard ABC.
  • 210.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Optimal summary statistic “We take a different approach, and weaken the requirement for πABC to be a good approximation to π(θ|yobs ). We argue for πABC to be a good approximation solely in terms of the accuracy of certain estimates of the parameters.” (FP, p.5) From this result, FP derive their choice of summary statistic, S(y) = E(θ|y) [wow! EABC [θ|S(yobs )] = E[θ|yobs ]] suggest h = O(N −1/(2+d) ) and h = O(N −1/(4+d) ) as optimal bandwidths for noisy and standard ABC.
  • 211.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Caveat Since E(θ|yobs ) is most usually unavailable, FP suggest (i) use a pilot run of ABC to determine a region of non-negligible posterior mass; (ii) simulate sets of parameter values and data; (iii) use the simulated sets of parameter values and data to estimate the summary statistic; and (iv) run ABC with this choice of summary statistic.
  • 212.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Caveat Since E(θ|yobs ) is most usually unavailable, FP suggest (i) use a pilot run of ABC to determine a region of non-negligible posterior mass; (ii) simulate sets of parameter values and data; (iii) use the simulated sets of parameter values and data to estimate the summary statistic; and (iv) run ABC with this choice of summary statistic. where is the assessment of the first stage error?
  • 213.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Approximating the summary statistic As in Beaumont et al. (2002) and Blum and François (2010), FP use a linear regression to approximate E(θ|y_obs): θ^{(i)} = β_0^{(i)} + β^{(i)} f(y_obs) + ε_i, with f made of standard transforms [Further justifications?]
  • 214.
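A minimal Python sketch of the regression construction behind steps (i)-(iv): simulate (θ, y) pairs with θ drawn from a region of non-negligible posterior mass, regress θ on transforms f(y), and use the fitted linear predictor as the summary statistic S(y) ≈ E[θ|y]. The pilot region, the normal model, and the set of transforms are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)

# pilot region for theta (assumed obtained from a preliminary ABC run)
lo, hi = -2.0, 2.0

# (ii) simulate parameter values and data sets
M, n = 5000, 30
theta = rng.uniform(lo, hi, M)
data = rng.normal(theta[:, None], 1.0, (M, n))

def transforms(y):
    # f(y): standard transforms of the data (illustrative choice)
    return np.array([y.mean(), np.median(y), y.var(), (y ** 3).mean()])

# (iii) least-squares fit of theta = beta_0 + beta' f(y) + error
X = np.column_stack([np.ones(M), np.array([transforms(y) for y in data])])
beta, *_ = np.linalg.lstsq(X, theta, rcond=None)

def S(y):
    # (iv) fitted linear predictor, used as the semi-automatic summary statistic
    return beta[0] + beta[1:] @ transforms(y)

y_obs = rng.normal(0.7, 1.0, n)
print("estimated E[theta | y_obs]:", float(S(y_obs)))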
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection [my] questions about semi-automatic ABC dependence on h and S(·) in the early stage; reduction of Bayesian inference to point estimation; approximation error in step (i) not accounted for; not parameterisation invariant; practice shows that proper approximation to genuine posterior distributions stems from using a (much) larger number of summary statistics than the dimension of the parameter; the validity of the approximation to the optimal summary statistic depends on the quality of the pilot run; important inferential issues like model choice are not covered by this approach. [Robert, 2012]
  • 215.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection More about semi-automatic ABC [End of section derived from comments on Read Paper, to appear] “The apparently arbitrary nature of the choice of summary statistics has always been perceived as the Achilles heel of ABC.” M. Beaumont
  • 216.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection More about semi-automatic ABC [End of section derived from comments on Read Paper, to appear] “The apparently arbitrary nature of the choice of summary statistics has always been perceived as the Achilles heel of ABC.” M. Beaumont “Curse of dimensionality” linked with the increase of the dimension of the summary statistic Connection with principal component analysis [Itan et al., 2010] Connection with partial least squares [Wegman et al., 2009] Beaumont et al. (2002) postprocessed output is used as input by FP to run a second ABC
  • 217.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Wood’s alternative Instead of a non-parametric kernel approximation to the likelihood, (1/R) Σ_r K{η(y_r) − η(y_obs)}, Wood (2010) suggests a normal approximation η(y(θ)) ∼ N_d(µ_θ, Σ_θ) whose parameters can be approximated based on the R simulations (for each value of θ).
  • 218.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Wood’s alternative Instead of a non-parametric kernel approximation to the likelihood, (1/R) Σ_r K{η(y_r) − η(y_obs)}, Wood (2010) suggests a normal approximation η(y(θ)) ∼ N_d(µ_θ, Σ_θ) whose parameters can be approximated based on the R simulations (for each value of θ). Parametric versus non-parametric rate [Uh?!] Automatic weighting of components of η(·) through Σ_θ Dependence on normality assumption (pseudo-likelihood?) [Cornebise, Girolami & Kosmidis, 2012]
  • 219.
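A minimal Python sketch of this synthetic-likelihood idea: for each θ, simulate R replicate summary vectors, fit a multivariate normal, and evaluate the observed summary under it. The MA(1) model, the choice of summaries, and R are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)

def summaries(y):
    # eta(y): a few autocovariance-type summaries (illustrative)
    return np.array([y.var(), np.mean(y[1:] * y[:-1]), np.mean(y[2:] * y[:-2])])

def simulate(theta, n=200):
    # MA(1) model y_t = e_t + theta * e_{t-1} (illustrative)
    e = rng.normal(0, 1, n + 1)
    return e[1:] + theta * e[:-1]

def synthetic_loglik(theta, eta_obs, R=200):
    # normal approximation eta(y(theta)) ~ N_d(mu_theta, Sigma_theta)
    sims = np.array([summaries(simulate(theta)) for _ in range(R)])
    mu = sims.mean(axis=0)
    Sigma = np.cov(sims, rowvar=False)
    diff = eta_obs - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (diff @ np.linalg.solve(Sigma, diff) + logdet)

y_obs = simulate(0.6)
eta_obs = summaries(y_obs)
grid = np.linspace(-0.9, 0.9, 19)
ll = [synthetic_loglik(t, eta_obs) for t in grid]
print("synthetic-likelihood MLE over the grid:", grid[int(np.argmax(ll))])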
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Reinterpretation and extensions Reinterpretation of the ABC output as joint simulation from π̄(x, ȳ|θ) = f(x|θ) π̄_{Y|X}(ȳ|x), where π̄_{Y|X}(ȳ|x) = K_ε(ȳ − x)
  • 220.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Reinterpretation and extensions Reinterpretation of the ABC output as joint simulation from π̄(x, ȳ|θ) = f(x|θ) π̄_{Y|X}(ȳ|x), where π̄_{Y|X}(ȳ|x) = K_ε(ȳ − x) Reinterpretation of noisy ABC: if ȳ|y_obs ∼ π̄_{Y|X}(·|y_obs), then marginally ȳ ∼ π̄_{Y|θ}(·|θ_0) © explains the consistency of Bayesian inference based on ȳ and π̄ [Lee, Andrieu & Doucet, 2012]
  • 221.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Reinterpretation and extensions Also fits within the pseudo-marginal approach of Andrieu and Roberts (2009) Algorithm 5 Rejuvenating GIMH-ABC At time t, given θ_t = θ: Sample θ' ∼ g(·|θ). Sample z_{1:N} ∼ π_{Y|Θ}^{⊗N}(·|θ'). Sample x_{1:N−1} ∼ π_{Y|Θ}^{⊗(N−1)}(·|θ). With probability min{ 1, [π(θ') g(θ|θ') / π(θ) g(θ'|θ)] × [Σ_{i=1}^N I_{B_h(z_i)}(y)] / [1 + Σ_{i=1}^{N−1} I_{B_h(x_i)}(y)] } set θ_{t+1} = θ'. Otherwise set θ_{t+1} = θ.
  • 222.
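A minimal Python sketch of the rejuvenating GIMH-ABC step above, using a symmetric random-walk proposal for g (so the g terms cancel in the ratio). The toy normal model, prior, N, h, and chain length are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)

def gimh_abc(y, n_iter, N, h, log_prior, simulate, theta0, prop_sd=0.5):
    # pseudo-marginal GIMH-ABC chain with symmetric random-walk proposal
    theta = theta0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        theta_p = theta + prop_sd * rng.normal()
        z = np.array([simulate(theta_p) for _ in range(N)])      # N hits for theta'
        x = np.array([simulate(theta) for _ in range(N - 1)])    # N-1 hits for theta
        num = np.sum(np.abs(z - y) < h)
        den = 1 + np.sum(np.abs(x - y) < h)
        log_ratio = (log_prior(theta_p) - log_prior(theta)
                     + np.log(num) - np.log(den)) if num > 0 else -np.inf
        if np.log(rng.random()) < log_ratio:
            theta = theta_p
        chain[t] = theta
    return chain

# toy example: theta ~ N(0, 10), y | theta ~ N(theta, 1)
chain = gimh_abc(
    y=1.2, n_iter=5000, N=50, h=0.2,
    log_prior=lambda t: -t ** 2 / 20,
    simulate=lambda t: rng.normal(t, 1),
    theta0=0.0,
)
print("posterior mean estimate:", chain[1000:].mean())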
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Reinterpretation and extensions Also fits within the pseudo-marginal approach of Andrieu and Roberts (2009) Algorithm 6 One-hit MCMC-ABC At time t, with θ_t = θ: Sample θ' ∼ g(·|θ). With probability 1 − min{1, π(θ') g(θ|θ') / π(θ) g(θ'|θ)}, set θ_{t+1} = θ and go to time t + 1. Sample z_i ∼ π_{Y|Θ}(·|θ') and x_i ∼ π_{Y|Θ}(·|θ) for i = 1, ... until y ∈ B_h(z_i) and/or y ∈ B_h(x_i). If y ∈ B_h(z_i), set θ_{t+1} = θ' and go to time t + 1. If y ∈ B_h(x_i), set θ_{t+1} = θ and go to time t + 1. [Lee, Andrieu & Doucet, 2012]
  • 223.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Reinterpretation and extensions Also fits within the pseudo-marginal approach of Andrieu and Roberts (2009) Validation © The invariant distribution of θ is π̄_{θ|Y}(·|y) for both algorithms. [Lee, Andrieu & Doucet, 2012]
  • 224.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection ABC for Markov chains Rewriting the posterior as π(θ)^{1−n} π(θ|x_1) ∏_{t=2}^n π(θ|x_{t−1}, x_t), where π(θ|x_{t−1}, x_t) ∝ f(x_t|x_{t−1}, θ) π(θ)
  • 225.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection ABC for Markov chains Rewriting the posterior as π(θ)^{1−n} π(θ|x_1) ∏_{t=2}^n π(θ|x_{t−1}, x_t), where π(θ|x_{t−1}, x_t) ∝ f(x_t|x_{t−1}, θ) π(θ) Allows for a stepwise ABC, replacing each π(θ|x_{t−1}, x_t) by an ABC approximation. Similarity with FP’s multiple sources of data (and also with Dean et al., 2011) [White et al., 2010, 2012]
  • 226.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Back to sufficiency Difference between regular sufficiency, equivalent to π(θ|y) = π(θ|η(y)) for all θ’s and all priors π, and
  • 227.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Back to sufficiency Difference between regular sufficiency, equivalent to π(θ|y) = π(θ|η(y)) for all θ’s and all priors π, and marginal sufficiency, stated as π(µ(θ)|y) = π(µ(θ)|η(y)) for all θ’s, the given prior π and a subvector µ(θ) [Basu, 1977]
  • 228.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Back to sufficiency Difference between regular sufficiency, equivalent to π(θ|y) = π(θ|η(y)) for all θ’s and all priors π, and marginal sufficiency, stated as π(µ(θ)|y) = π(µ(θ)|η(y)) for all θ’s, the given prior π and a subvector µ(θ) [Basu, 1977] Relates to FP’s main result, but could even be reduced to conditional sufficiency π(µ(θ)|y_obs) = π(µ(θ)|η(y_obs)) (if feasible at all...) [Dawson, 2012]
  • 229.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection ABC and BCa’s Arguments in favour of a comparison of ABC with bootstrap in the event π(θ) is not defined from prior information but as conventional non-informative prior. [Efron, 2012]
  • 230.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Predictive performances Instead of posterior means, other aspects of the posterior to explore. E.g., look at minimising the loss of information ∫ p(θ, y) log[ p(θ, y) / p(θ)p(y) ] dθ dy − ∫ p(θ, η(y)) log[ p(θ, η(y)) / p(θ)p(η(y)) ] dθ dη(y) for the selection of summary statistics. [Filippi, Barnes & Stumpf, 2012]
  • 231.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Auxiliary variables The auxiliary variable method avoids computation of the intractable constant in the likelihood f(y|θ) = f̃(y|θ)/Z_θ
  • 232.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Auxiliary variables The auxiliary variable method avoids computation of the intractable constant in the likelihood f(y|θ) = f̃(y|θ)/Z_θ Introduce pseudo-data z with artificial target g(z|θ, y) Generate θ' ∼ K(θ, θ') and z' ∼ f(z'|θ') Accept with probability [π(θ') f(y|θ') g(z'|θ', y) K(θ', θ) f(z|θ)] / [π(θ) f(y|θ) g(z|θ, y) K(θ, θ') f(z'|θ')] ∧ 1 [Møller, Pettitt, Berthelsen & Reeves, 2006]
  • 233.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Auxiliary variables The auxiliary variable method avoids computation of the intractable constant in the likelihood f(y|θ) = f̃(y|θ)/Z_θ Introduce pseudo-data z with artificial target g(z|θ, y) Generate θ' ∼ K(θ, θ') and z' ∼ f(z'|θ') Accept with probability [π(θ') f̃(y|θ') g(z'|θ', y) K(θ', θ) f̃(z|θ)] / [π(θ) f̃(y|θ) g(z|θ, y) K(θ, θ') f̃(z'|θ')] ∧ 1 [Møller, Pettitt, Berthelsen & Reeves, 2006]
  • 234.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Auxiliary variables The auxiliary variable method avoids computation of the intractable constant in the likelihood f(y|θ) = f̃(y|θ)/Z_θ Introduce pseudo-data z with artificial target g(z|θ, y) Generate θ' ∼ K(θ, θ') and z' ∼ f(z'|θ') For Gibbs random fields, existence of a genuine sufficient statistic η(y). [Møller, Pettitt, Berthelsen & Reeves, 2006]
  • 235.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Auxiliary variables and ABC Special case of ABC when g(z|θ, y) = K_ε(η(z) − η(y)) and f̃(y|θ') f̃(z|θ) / f̃(y|θ) f̃(z'|θ') is replaced by one [or not?!]
  • 236.
Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection Auxiliary variables and ABC Special case of ABC when g(z|θ, y) = K_ε(η(z) − η(y)) and f̃(y|θ') f̃(z|θ) / f̃(y|θ) f̃(z'|θ') is replaced by one [or not?!] Consequences: likelihood-free (ABC) versus constant-free (AVM); in ABC, K(·) should be allowed to depend on θ; for Gibbs random fields, the auxiliary approach should be preferred to ABC [Møller, 2012]
  • 237.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection ABC and BIC Idea of applying BIC during the local regression : Run regular ABC Select summary statistics during local regression Recycle the prior simulation sample (reference table) with those summary statistics Rerun the corresponding local regression (low cost) [Pudlo Sedki, 2012]
  • 238.
    Likelihood-free Methods Approximate Bayesian computation Automated summary statistic selection A Brave New World?!
  • 239.
    Likelihood-free Methods ABC for model choice ABC for model choice Introduction The Metropolis-Hastings Algorithm simulation-based methods in Econometrics Genetics of ABC Approximate Bayesian computation ABC for model choice Model choice Gibbs random fields
  • 240.
Likelihood-free Methods ABC for model choice Model choice Bayesian model choice Several models M1 , M2 , . . . are considered simultaneously for a dataset y and the model index M is part of the inference. Use of a prior distribution π(M = m), plus a prior distribution on the parameter conditional on the value m of the model index, πm (θ m ). Goal is to derive the posterior distribution of M, a challenging computational target when models are complex.
  • 241.
Likelihood-free Methods ABC for model choice Model choice Generic ABC for model choice Algorithm 7 Likelihood-free model choice sampler (ABC-MC) for t = 1 to T do repeat Generate m from the prior π(M = m) Generate θ m from the prior πm (θ m ) Generate z from the model fm (z|θ m ) until ρ{η(z), η(y)} < ε Set m(t) = m and θ (t) = θ m end for
  • 242.
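A minimal Python sketch of Algorithm 7 for two generic models. The Poisson-versus-geometric pair, the priors, the summary (sample mean) and the tolerance are illustrative assumptions chosen to echo an example appearing later in these slides.

import numpy as np

rng = np.random.default_rng(7)

def abc_model_choice(y_obs, T, eps, prior_m, prior_theta, simulate, eta, rho):
    # likelihood-free model choice sampler (rejection version)
    out = []
    for _ in range(T):
        while True:
            m = rng.choice(len(prior_m), p=prior_m)   # model index from its prior
            theta = prior_theta[m]()                  # parameter from prior of model m
            z = simulate[m](theta)                    # pseudo-data from model m
            if rho(eta(z), eta(y_obs)) < eps:
                out.append((m, theta))
                break
    return out

# toy comparison: M0 = Poisson(lambda), M1 = geometric(p), summary = sample mean
n = 40
y_obs = rng.poisson(3.0, n)
draws = abc_model_choice(
    y_obs, T=2000, eps=0.2,
    prior_m=[0.5, 0.5],
    prior_theta=[lambda: rng.exponential(5.0), lambda: rng.uniform(0, 1)],
    simulate=[lambda lam: rng.poisson(lam, n), lambda p: rng.geometric(p, n) - 1],
    eta=lambda x: np.array([x.mean()]),
    rho=lambda a, b: np.abs(a - b).max(),
)
freq = np.mean([m == 0 for m, _ in draws])
print("ABC estimate of P(M = Poisson | y):", freq)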
Likelihood-free Methods ABC for model choice Model choice ABC estimates Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m, (1/T) Σ_{t=1}^T I_{m^{(t)} = m}. Issues with implementation: should tolerances be the same for all models? should summary statistics vary across models (incl. their dimension)? should the distance measure ρ vary as well?
  • 243.
Likelihood-free Methods ABC for model choice Model choice ABC estimates Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m, (1/T) Σ_{t=1}^T I_{m^{(t)} = m}. Extension to a weighted polychotomous logistic regression estimate of π(M = m|y), with non-parametric kernel weights [Cornuet et al., DIYABC, 2009]
  • 244.
Likelihood-free Methods ABC for model choice Model choice The Great ABC controversy On-going controversy in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!)
  • 245.
Likelihood-free Methods ABC for model choice Model choice The Great ABC controversy On-going controversy in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) Replies: Fagundes et al., 2008, Beaumont et al., 2010, Berger et al., 2010, Csilléry et al., 2010 point out that the criticisms are addressed at [Bayesian] model-based inference and have nothing to do with ABC...
  • 246.
Likelihood-free Methods ABC for model choice Gibbs random fields Gibbs random fields Gibbs distribution The rv y = (y1 , . . . , yn ) is a Gibbs random field associated with the graph G if f(y) = (1/Z) exp{ − Σ_{c∈C} V_c(y_c) }, where Z is the normalising constant, C is the set of cliques of G and V_c is any function also called potential. U(y) = Σ_{c∈C} V_c(y_c) is the energy function
  • 247.
Likelihood-free Methods ABC for model choice Gibbs random fields Gibbs random fields Gibbs distribution The rv y = (y1 , . . . , yn ) is a Gibbs random field associated with the graph G if f(y) = (1/Z) exp{ − Σ_{c∈C} V_c(y_c) }, where Z is the normalising constant, C is the set of cliques of G and V_c is any function also called potential. U(y) = Σ_{c∈C} V_c(y_c) is the energy function © Z is usually unavailable in closed form
  • 248.
Likelihood-free Methods ABC for model choice Gibbs random fields Potts model Potts model V_c(y) is of the form V_c(y) = θ S(y) = θ Σ_{l∼i} δ_{y_l = y_i}, where l∼i denotes a neighbourhood structure
  • 249.
Likelihood-free Methods ABC for model choice Gibbs random fields Potts model Potts model V_c(y) is of the form V_c(y) = θ S(y) = θ Σ_{l∼i} δ_{y_l = y_i}, where l∼i denotes a neighbourhood structure In most realistic settings, the summation Z_θ = Σ_{x∈X} exp{θ^T S(x)} involves too many terms to be manageable and numerical approximations cannot always be trusted [Cucala, Marin, CPR & Titterington, 2009]
  • 250.
Likelihood-free Methods ABC for model choice Gibbs random fields Bayesian Model Choice Comparing a model with potential S0 taking values in R^{p0} versus a model with potential S1 taking values in R^{p1} can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space, B_{m0/m1}(x) = ∫ exp{θ_0^T S_0(x)}/Z_{θ_0,0} π_0(dθ_0) / ∫ exp{θ_1^T S_1(x)}/Z_{θ_1,1} π_1(dθ_1)
  • 251.
Likelihood-free Methods ABC for model choice Gibbs random fields Bayesian Model Choice Comparing a model with potential S0 taking values in R^{p0} versus a model with potential S1 taking values in R^{p1} can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space, B_{m0/m1}(x) = ∫ exp{θ_0^T S_0(x)}/Z_{θ_0,0} π_0(dθ_0) / ∫ exp{θ_1^T S_1(x)}/Z_{θ_1,1} π_1(dθ_1) Use of Jeffreys’ scale to select the most appropriate model
  • 252.
Likelihood-free Methods ABC for model choice Gibbs random fields Neighbourhood relations Choice to be made between M neighbourhood relations i ∼_m i' (0 ≤ m ≤ M − 1) with S_m(x) = Σ_{i ∼_m i'} I_{x_i = x_{i'}}, driven by the posterior probabilities of the models.
  • 253.
    Likelihood-free Methods ABC for model choice Gibbs random fields Model index Formalisation via a model index M that appears as a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm (θm )
  • 254.
Likelihood-free Methods ABC for model choice Gibbs random fields Model index Formalisation via a model index M that appears as a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm (θm ) Computational target: P(M = m|x) ∝ ∫_{Θ_m} f_m(x|θ_m) π_m(θ_m) dθ_m π(M = m),
  • 255.
    Likelihood-free Methods ABC for model choice Gibbs random fields Sufficient statistics By definition, if S(x) sufficient statistic for the joint parameters (M, θ0 , . . . , θM−1 ), P(M = m|x) = P(M = m|S(x)) .
  • 256.
    Likelihood-free Methods ABC for model choice Gibbs random fields Sufficient statistics By definition, if S(x) sufficient statistic for the joint parameters (M, θ0 , . . . , θM−1 ), P(M = m|x) = P(M = m|S(x)) . For each model m, own sufficient statistic Sm (·) and S(·) = (S0 (·), . . . , SM−1 (·)) also sufficient.
  • 257.
Likelihood-free Methods ABC for model choice Gibbs random fields Sufficient statistics in Gibbs random fields For Gibbs random fields, x|M = m ∼ f_m(x|θ_m) = f_m^1(x|S(x)) f_m^2(S(x)|θ_m) = [1/n(S(x))] f_m^2(S(x)|θ_m), where n(S(x)) = #{x̃ ∈ X : S(x̃) = S(x)} © S(x) is therefore also sufficient for the joint parameters [Specific to Gibbs random fields!]
  • 258.
Likelihood-free Methods ABC for model choice Gibbs random fields ABC model choice Algorithm ABC-MC Generate m∗ from the prior π(M = m). Generate θ*_{m∗} from the prior π_{m∗}(·). Generate x∗ from the model f_{m∗}(·|θ*_{m∗}). Compute the distance ρ(S(x^0), S(x∗)). Accept (θ*_{m∗}, m∗) if ρ(S(x^0), S(x∗)) < ε. Note: when ε = 0 the algorithm is exact
  • 259.
Likelihood-free Methods ABC for model choice Gibbs random fields ABC approximation to the Bayes factor Frequency ratio: BF_{m0/m1}(x^0) = [P̂(M = m0|x^0) / P̂(M = m1|x^0)] × [π(M = m1) / π(M = m0)] = [#{m_i∗ = m0} / #{m_i∗ = m1}] × [π(M = m1) / π(M = m0)],
  • 260.
Likelihood-free Methods ABC for model choice Gibbs random fields ABC approximation to the Bayes factor Frequency ratio: BF_{m0/m1}(x^0) = [P̂(M = m0|x^0) / P̂(M = m1|x^0)] × [π(M = m1) / π(M = m0)] = [#{m_i∗ = m0} / #{m_i∗ = m1}] × [π(M = m1) / π(M = m0)], replaced with BF_{m0/m1}(x^0) = [1 + #{m_i∗ = m0}] / [1 + #{m_i∗ = m1}] × [π(M = m1) / π(M = m0)] to avoid indeterminacy (also a Bayes estimate).
  • 261.
Likelihood-free Methods ABC for model choice Gibbs random fields Toy example iid Bernoulli model versus two-state first-order Markov chain, i.e. f_0(x|θ_0) = exp{ θ_0 Σ_{i=1}^n I_{x_i=1} } / {1 + exp(θ_0)}^n, versus f_1(x|θ_1) = (1/2) exp{ θ_1 Σ_{i=2}^n I_{x_i = x_{i−1}} } / {1 + exp(θ_1)}^{n−1}, with priors θ_0 ∼ U(−5, 5) and θ_1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
  • 262.
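A minimal Python sketch of ABC model choice on this Bernoulli-versus-Markov toy example, using the joint statistic (S0, S1) and the regularised (1 + counts) frequency-ratio estimate of the Bayes factor described above. The observed data, the number of prior proposals, and the tolerance are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(8)

n = 100

def sim_bernoulli(th0):
    p = np.exp(th0) / (1 + np.exp(th0))        # P(x_i = 1)
    return (rng.random(n) < p).astype(int)

def sim_markov(th1):
    q = np.exp(th1) / (1 + np.exp(th1))        # P(x_i = x_{i-1})
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    for i in range(1, n):
        x[i] = x[i - 1] if rng.random() < q else 1 - x[i - 1]
    return x

def S(x):
    # joint sufficient statistic (S0, S1) for the two models
    return np.array([x.sum(), np.sum(x[1:] == x[:-1])])

x0 = sim_markov(1.0)                           # pretend observed data
s0 = S(x0)

T, eps = 20_000, 2
counts = np.zeros(2)
for _ in range(T):
    m = rng.integers(2)
    x = sim_bernoulli(rng.uniform(-5, 5)) if m == 0 else sim_markov(rng.uniform(0, 6))
    if np.abs(S(x) - s0).max() <= eps:
        counts[m] += 1

# regularised frequency-ratio estimate of BF_{0/1} (equal prior model weights)
print("hat BF_{0/1} =", (1 + counts[0]) / (1 + counts[1]))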
Likelihood-free Methods ABC for model choice Gibbs random fields Toy example (2) [Figure] (left) Comparison of the true BF_{m0/m1}(x^0) with the ABC approximation (in logs) over 2,000 simulations and 4·10^6 proposals from the prior. (right) Same when using a tolerance corresponding to the 1% quantile on the distances.
  • 263.
    Likelihood-free Methods ABC for model choice Generic ABC model choice Back to sufficiency ‘Sufficient statistics for individual models are unlikely to be very informative for the model probability.’ [Scott Sisson, Jan. 31, 2011, X.’Og]
  • 264.
    Likelihood-free Methods ABC for model choice Generic ABC model choice Back to sufficiency ‘Sufficient statistics for individual models are unlikely to be very informative for the model probability.’ [Scott Sisson, Jan. 31, 2011, X.’Og] If η1 (x) sufficient statistic for model m = 1 and parameter θ1 and η2 (x) sufficient statistic for model m = 2 and parameter θ2 , (η1 (x), η2 (x)) is not always sufficient for (m, θm )
  • 265.
    Likelihood-free Methods ABC for model choice Generic ABC model choice Back to sufficiency ‘Sufficient statistics for individual models are unlikely to be very informative for the model probability.’ [Scott Sisson, Jan. 31, 2011, X.’Og] If η1 (x) sufficient statistic for model m = 1 and parameter θ1 and η2 (x) sufficient statistic for model m = 2 and parameter θ2 , (η1 (x), η2 (x)) is not always sufficient for (m, θm ) c Potential loss of information at the testing level
  • 266.
Likelihood-free Methods ABC for model choice Generic ABC model choice Limiting behaviour of B12 (T → ∞) ABC approximation B̂_12(y) = Σ_{t=1}^T I_{m_t=1} I_{ρ{η(z_t),η(y)} ≤ ε} / Σ_{t=1}^T I_{m_t=2} I_{ρ{η(z_t),η(y)} ≤ ε}, where the (m_t, z_t)'s are simulated from the (joint) prior
  • 267.
Likelihood-free Methods ABC for model choice Generic ABC model choice Limiting behaviour of B12 (T → ∞) ABC approximation B̂_12(y) = Σ_{t=1}^T I_{m_t=1} I_{ρ{η(z_t),η(y)} ≤ ε} / Σ_{t=1}^T I_{m_t=2} I_{ρ{η(z_t),η(y)} ≤ ε}, where the (m_t, z_t)'s are simulated from the (joint) prior As T goes to infinity, limit B_12^ε(y) = ∫ I_{ρ{η(z),η(y)} ≤ ε} π_1(θ_1) f_1(z|θ_1) dz dθ_1 / ∫ I_{ρ{η(z),η(y)} ≤ ε} π_2(θ_2) f_2(z|θ_2) dz dθ_2 = ∫ I_{ρ{η,η(y)} ≤ ε} π_1(θ_1) f_1^η(η|θ_1) dη dθ_1 / ∫ I_{ρ{η,η(y)} ≤ ε} π_2(θ_2) f_2^η(η|θ_2) dη dθ_2, where f_1^η(η|θ_1) and f_2^η(η|θ_2) are the distributions of η(z)
  • 268.
Likelihood-free Methods ABC for model choice Generic ABC model choice Limiting behaviour of B12 (ε → 0) When ε goes to zero, B_12^η(y) = ∫ π_1(θ_1) f_1^η(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f_2^η(η(y)|θ_2) dθ_2,
  • 269.
Likelihood-free Methods ABC for model choice Generic ABC model choice Limiting behaviour of B12 (ε → 0) When ε goes to zero, B_12^η(y) = ∫ π_1(θ_1) f_1^η(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f_2^η(η(y)|θ_2) dθ_2, © Bayes factor based on the sole observation of η(y)
  • 270.
Likelihood-free Methods ABC for model choice Generic ABC model choice Limiting behaviour of B12 (under sufficiency) If η(y) is a sufficient statistic for both models, f_i(y|θ_i) = g_i(y) f_i^η(η(y)|θ_i). Thus B_12(y) = ∫_{Θ_1} π(θ_1) g_1(y) f_1^η(η(y)|θ_1) dθ_1 / ∫_{Θ_2} π(θ_2) g_2(y) f_2^η(η(y)|θ_2) dθ_2 = [g_1(y)/g_2(y)] × ∫ π_1(θ_1) f_1^η(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f_2^η(η(y)|θ_2) dθ_2 = [g_1(y)/g_2(y)] B_12^η(y). [Didelot, Everitt, Johansen & Lawson, 2011]
  • 271.
Likelihood-free Methods ABC for model choice Generic ABC model choice Limiting behaviour of B12 (under sufficiency) If η(y) is a sufficient statistic for both models, f_i(y|θ_i) = g_i(y) f_i^η(η(y)|θ_i). Thus B_12(y) = ∫_{Θ_1} π(θ_1) g_1(y) f_1^η(η(y)|θ_1) dθ_1 / ∫_{Θ_2} π(θ_2) g_2(y) f_2^η(η(y)|θ_2) dθ_2 = [g_1(y)/g_2(y)] × ∫ π_1(θ_1) f_1^η(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f_2^η(η(y)|θ_2) dθ_2 = [g_1(y)/g_2(y)] B_12^η(y). [Didelot, Everitt, Johansen & Lawson, 2011] © No discrepancy only when cross-model sufficiency
  • 272.
Likelihood-free Methods ABC for model choice Generic ABC model choice Poisson/geometric example Sample x = (x1 , . . . , xn ) from either a Poisson P(λ) or from a geometric G(p). Then S = Σ_{i=1}^n x_i = η(x) is a sufficient statistic for either model but not simultaneously. Discrepancy ratio g_1(x)/g_2(x) = [S! n^{−S} / ∏_i x_i!] / [1 / C(n+S−1, S)]
  • 273.
Likelihood-free Methods ABC for model choice Generic ABC model choice Poisson/geometric discrepancy Range of B_12^η(x) versus B_12(x): the values produced have nothing in common.
  • 274.
Likelihood-free Methods ABC for model choice Generic ABC model choice Formal recovery Creating an encompassing exponential family f(x|θ_1, θ_2, α_1, α_2) ∝ exp{ θ_1^T η_1(x) + θ_2^T η_2(x) + α_1 t_1(x) + α_2 t_2(x) } leads to a sufficient statistic (η_1(x), η_2(x), t_1(x), t_2(x)) [Didelot, Everitt, Johansen & Lawson, 2011]
  • 275.
Likelihood-free Methods ABC for model choice Generic ABC model choice Formal recovery Creating an encompassing exponential family f(x|θ_1, θ_2, α_1, α_2) ∝ exp{ θ_1^T η_1(x) + θ_2^T η_2(x) + α_1 t_1(x) + α_2 t_2(x) } leads to a sufficient statistic (η_1(x), η_2(x), t_1(x), t_2(x)) [Didelot, Everitt, Johansen & Lawson, 2011] In the Poisson/geometric case, if ∏_i x_i! is added to S, there is no discrepancy
  • 276.
Likelihood-free Methods ABC for model choice Generic ABC model choice Formal recovery Creating an encompassing exponential family f(x|θ_1, θ_2, α_1, α_2) ∝ exp{ θ_1^T η_1(x) + θ_2^T η_2(x) + α_1 t_1(x) + α_2 t_2(x) } leads to a sufficient statistic (η_1(x), η_2(x), t_1(x), t_2(x)) [Didelot, Everitt, Johansen & Lawson, 2011] Only applies in genuine sufficiency settings... © Inability to evaluate the loss brought by summary statistics
  • 277.
    Likelihood-free Methods ABC for model choice Generic ABC model choice Meaning of the ABC-Bayes factor ‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’ [Scott Sisson, Jan. 31, 2011, X.’Og]
  • 278.
Likelihood-free Methods ABC for model choice Generic ABC model choice Meaning of the ABC-Bayes factor ‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’ [Scott Sisson, Jan. 31, 2011, X.’Og] In the Poisson/geometric case, if E[y_i] = θ_0 > 0, lim_{n→∞} B_12^η(y) = [(θ_0 + 1)^2 / θ_0] e^{−θ_0}
  • 279.
Likelihood-free Methods ABC for model choice Generic ABC model choice MA(q) divergence [Figure] Evolution [against ε] of the ABC Bayes factor, in terms of frequencies of visits to models MA(1) (left) and MA(2) (right) when ε equals the 10, 1, .1, .01% quantiles on insufficient autocovariance distances. Sample of 50 points from a MA(2) with θ1 = 0.6, θ2 = 0.2. True Bayes factor equal to 17.71.
  • 280.
Likelihood-free Methods ABC for model choice Generic ABC model choice MA(q) divergence [Figure] Evolution [against ε] of the ABC Bayes factor, in terms of frequencies of visits to models MA(1) (left) and MA(2) (right) when ε equals the 10, 1, .1, .01% quantiles on insufficient autocovariance distances. Sample of 50 points from a MA(1) model with θ1 = 0.6. True Bayes factor B21 equal to .004.
  • 281.
Likelihood-free Methods ABC for model choice Generic ABC model choice Further comments ‘There should be the possibility that for the same model, but different (non-minimal) [summary] statistics (so different η’s: η_1 and η_1^∗) the ratio of evidences may no longer be equal to one.’ [Michael Stumpf, Jan. 28, 2011, ’Og] Using different summary statistics [on different models] may indicate the loss of information brought by each set, but agreement does not lead to trustworthy approximations.
  • 282.
Likelihood-free Methods ABC for model choice Generic ABC model choice A population genetics evaluation Population genetics example with 3 populations, 2 scenarios, 15 individuals, 5 loci, single mutation parameter
  • 283.
Likelihood-free Methods ABC for model choice Generic ABC model choice A population genetics evaluation Population genetics example with 3 populations, 2 scenarios, 15 individuals, 5 loci, single mutation parameter, 24 summary statistics, 2 million ABC proposals, importance [tree] sampling alternative
  • 284.
Likelihood-free Methods ABC for model choice Generic ABC model choice Stability of importance sampling [Figure]
  • 285.
Likelihood-free Methods ABC for model choice Generic ABC model choice Comparison with ABC Use of 24 summary statistics and DIY-ABC logistic correction [Figure: ABC estimates of P(M=1|y) against importance sampling estimates of P(M=1|y)]
  • 286.
Likelihood-free Methods ABC for model choice Generic ABC model choice Comparison with ABC Use of 15 summary statistics and DIY-ABC logistic correction [Figure: ABC estimates of P(M=1|y) against importance sampling estimates of P(M=1|y)]
  • 287.
Likelihood-free Methods ABC for model choice Generic ABC model choice Comparison with ABC Use of 24 summary statistics and DIY-ABC logistic correction [Figure: ABC estimates of P(M=1|y) against importance sampling estimates of P(M=1|y)]
  • 288.
Likelihood-free Methods ABC for model choice Generic ABC model choice The only safe cases??? Besides specific models like Gibbs random fields, using distances over the data itself escapes the discrepancy... [Toni & Stumpf, 2010; Sousa et al., 2009]
  • 289.
Likelihood-free Methods ABC for model choice Generic ABC model choice The only safe cases??? Besides specific models like Gibbs random fields, using distances over the data itself escapes the discrepancy... [Toni & Stumpf, 2010; Sousa et al., 2009] ...and so does the use of more informal model fitting measures [Ratmann et al., 2009]
  • 290.
    Likelihood-free Methods ABC model choice consistency ABC model choice consistency Introduction The Metropolis-Hastings Algorithm simulation-based methods in Econometrics Genetics of ABC Approximate Bayesian computation ABC for model choice ABC model choice consistency
  • 291.
    Likelihood-free Methods ABC model choice consistency Formalised framework The starting point Central question to the validation of ABC for model choice: When is a Bayes factor based on an insufficient statistic T(y) consistent?
  • 292.
Likelihood-free Methods ABC model choice consistency Formalised framework The starting point Central question to the validation of ABC for model choice: When is a Bayes factor based on an insufficient statistic T(y) consistent? Note: the conclusion drawn on T(y) through B_12^T(y) necessarily differs from the conclusion drawn on y through B_12(y)
  • 293.
Likelihood-free Methods ABC model choice consistency Formalised framework A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin & Pillai, Aug. 2011] Model M1: y ∼ N(θ_1, 1) opposed to model M2: y ∼ L(θ_2, 1/√2), Laplace distribution with mean θ_2 and scale parameter 1/√2 (variance one).
  • 294.
Likelihood-free Methods ABC model choice consistency Formalised framework A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin & Pillai, Aug. 2011] Model M1: y ∼ N(θ_1, 1) opposed to model M2: y ∼ L(θ_2, 1/√2), Laplace distribution with mean θ_2 and scale parameter 1/√2 (variance one). Four possible statistics: 1. sample mean ȳ (sufficient for M1 if not M2); 2. sample median med(y) (insufficient); 3. sample variance var(y) (ancillary); 4. median absolute deviation mad(y) = med(|y − med(y)|);
  • 295.
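A minimal Python sketch of ABC model choice on this normal-versus-Laplace benchmark, running the rejection sampler separately with each of the four candidate statistics and comparing the resulting estimates of P(M = Normal | s_obs). The sample size, the common location prior, and the 1% distance-quantile tolerance are illustrative assumptions not specified on the slides.

import numpy as np

rng = np.random.default_rng(9)

n, T = 100, 50_000
y_obs = rng.normal(0.0, 1.0, n)            # data actually from the normal model

def mad(x):
    return np.median(np.abs(x - np.median(x)))

stats = {"mean": np.mean, "median": np.median, "variance": np.var, "mad": mad}

def simulate(m, theta):
    if m == 0:                              # M1: N(theta, 1)
        return rng.normal(theta, 1.0, n)
    return rng.laplace(theta, 1 / np.sqrt(2), n)   # M2: Laplace, variance one

for name, stat in stats.items():
    s_obs = stat(y_obs)
    dists, models = np.empty(T), np.empty(T, dtype=int)
    for t in range(T):
        m = rng.integers(2)
        theta = rng.normal(0.0, 2.0)        # illustrative common prior on location
        models[t] = m
        dists[t] = abs(stat(simulate(m, theta)) - s_obs)
    keep = dists <= np.quantile(dists, 0.01)   # tolerance = 1% distance quantile
    p1 = (models[keep] == 0).mean()
    print(name, "P(M = Normal | s_obs) approx", round(p1, 2))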
Likelihood-free Methods ABC model choice consistency Formalised framework A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin & Pillai, Aug. 2011] Model M1: y ∼ N(θ_1, 1) opposed to model M2: y ∼ L(θ_2, 1/√2), Laplace distribution with mean θ_2 and scale parameter 1/√2 (variance one). [Figure: density of the posterior probability]
  • 296.
Likelihood-free Methods ABC model choice consistency Formalised framework A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin & Pillai, Aug. 2011] Model M1: y ∼ N(θ_1, 1) opposed to model M2: y ∼ L(θ_2, 1/√2), Laplace distribution with mean θ_2 and scale parameter 1/√2 (variance one). [Figure: boxplots under the Gauss and Laplace models, n = 100]
  • 297.
Likelihood-free Methods ABC model choice consistency Formalised framework A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin & Pillai, Aug. 2011] Model M1: y ∼ N(θ_1, 1) opposed to model M2: y ∼ L(θ_2, 1/√2), Laplace distribution with mean θ_2 and scale parameter 1/√2 (variance one). [Figure: densities of the posterior probability (left) and probability (right)]
  • 298.
Likelihood-free Methods ABC model choice consistency Formalised framework A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin & Pillai, Aug. 2011] Model M1: y ∼ N(θ_1, 1) opposed to model M2: y ∼ L(θ_2, 1/√2), Laplace distribution with mean θ_2 and scale parameter 1/√2 (variance one). [Figure: boxplots under the Gauss and Laplace models, n = 100, two panels]
  • 299.
    Likelihood-free Methods ABC model choice consistency Formalised framework Framework Starting from sample y = (y1 , . . . , yn ) the observed sample, not necessarily iid with true distribution y ∼ Pn Summary statistics T(y) = Tn = (T1 (y), T2 (y), · · · , Td (y)) ∈ Rd with true distribution Tn ∼ Gn .
  • 300.
Framework

Comparison of
  – under M1, y ~ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ R^{p1}
  – under M2, y ~ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ R^{p2}
turned into
  – under M1, T(y) ~ G1,n(·|θ1), and θ1|T(y) ~ π1(·|T^n)
  – under M2, T(y) ~ G2,n(·|θ2), and θ2|T(y) ~ π2(·|T^n)
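Written out in this notation, the two Bayes factors being contrasted are (a restatement for later reference, with f_{i,n} and g_{i,n} denoting the densities of F_{i,n} and G_{i,n}, and m_i the marginals used in the consistency results below):
\[
B^T_{12}(y)
  = \frac{\int_{\Theta_1} g_{1,n}\{T(y)\mid\theta_1\}\,\pi_1(\theta_1)\,\mathrm d\theta_1}
         {\int_{\Theta_2} g_{2,n}\{T(y)\mid\theta_2\}\,\pi_2(\theta_2)\,\mathrm d\theta_2}
  = \frac{m_1(T^n)}{m_2(T^n)}\,,
\qquad
B_{12}(y)
  = \frac{\int_{\Theta_1} f_{1,n}(y\mid\theta_1)\,\pi_1(\theta_1)\,\mathrm d\theta_1}
         {\int_{\Theta_2} f_{2,n}(y\mid\theta_2)\,\pi_2(\theta_2)\,\mathrm d\theta_2}\,.
\]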
Likelihood-free Methods
   ABC model choice consistency
     Consistency results

Assumptions

A collection of asymptotic “standard” assumptions:

[A1] There exist a sequence {v_n} converging to +∞, an absolutely continuous distribution Q with continuous, bounded density q(·), a symmetric d × d positive definite matrix V0 and a vector µ0 ∈ R^d such that
\[
v_n\,V_0^{-1/2}(T^n - \mu_0) \;\rightsquigarrow\; Q \quad\text{under } G_n ,
\]
and, for all M > 0,
\[
\sup_{v_n|t-\mu_0|\le M}\;\Bigl|\,|V_0|^{1/2} v_n^{-d}\,g_n(t) - q\bigl(v_n V_0^{-1/2}\{t-\mu_0\}\bigr)\Bigr| = o(1) .
\]
Assumptions

[A2] For i = 1, 2, there exist d × d symmetric positive definite matrices Vi(θi) and vectors µi(θi) ∈ R^d such that
\[
v_n\,V_i(\theta_i)^{-1/2}\bigl(T^n - \mu_i(\theta_i)\bigr) \;\rightsquigarrow\; Q \quad\text{under } G_{i,n}(\cdot\mid\theta_i) .
\]
Assumptions

[A3] For i = 1, 2, there exist sets Fn,i ⊂ Θi and constants εi, τi, αi > 0 such that, for all τ > 0,
\[
\sup_{\theta_i\in F_{n,i}} G_{i,n}\Bigl[\,|T^n-\mu_i(\theta_i)| > \tau\,\{|\mu_i(\theta_i)-\mu_0|\wedge\epsilon_i\}\;\Bigm|\;\theta_i\Bigr]
\;\lesssim\; v_n^{-\alpha_i}\,\sup_{\theta_i\in F_{n,i}}\bigl(|\mu_i(\theta_i)-\mu_0|\wedge\epsilon_i\bigr)^{-\alpha_i}
\]
with
\[
\pi_i\bigl(F_{n,i}^{\,c}\bigr) = o\bigl(v_n^{-\tau_i}\bigr) .
\]
Assumptions

[A4] For u > 0, let
\[
S_{n,i}(u) = \bigl\{\theta_i\in F_{n,i}\,;\ |\mu_i(\theta_i)-\mu_0| \le u\,v_n^{-1}\bigr\} .
\]
If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, there exist constants di < τi ∧ αi − 1 such that
\[
\pi_i\bigl(S_{n,i}(u)\bigr) \sim u^{d_i} v_n^{-d_i}, \qquad \forall\, u \lesssim v_n .
\]
Assumptions

[A5] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, there exists U > 0 such that, for any M > 0,
\[
\sup_{v_n|t-\mu_0|\le M}\ \sup_{\theta_i\in S_{n,i}(U)}
\Bigl|\,|V_i(\theta_i)|^{1/2} v_n^{-d}\, g_i(t\mid\theta_i)
 - q\bigl(v_n V_i(\theta_i)^{-1/2}\{t-\mu_i(\theta_i)\}\bigr)\Bigr| = o(1)
\]
and
\[
\lim_{M\to\infty}\ \limsup_n\
\frac{\pi_i\bigl(S_{n,i}(U)\cap\{\|V_i(\theta_i)^{-1}\|+\|V_i(\theta_i)\| > M\}\bigr)}
     {\pi_i\bigl(S_{n,i}(U)\bigr)} = 0 .
\]
Assumptions

Reading the collection of asymptotic “standard” assumptions:
  – [A1]–[A2] are standard central limit theorems ([A1] is redundant when one of the models is “true”);
  – [A3] controls the large deviations of the estimator T^n from the estimand µ(θ);
  – [A4] is the standard prior mass condition found in Bayesian asymptotics (di is the effective dimension of the parameter);
  – [A5] controls the convergence of the density of T^n more tightly, especially when µi is not one-to-one.
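As a sanity check of [A1]–[A2] (an illustration added here, not on the original slides), take the benchmark example with T^n = ȳ_n the sample mean, so d = 1 and v_n = √n:
\[
\sqrt n\,(\bar y_n - \theta_1) \rightsquigarrow \mathcal N(0,1) \ \text{under } M_1,
\qquad
\sqrt n\,(\bar y_n - \theta_2) \rightsquigarrow \mathcal N(0,1) \ \text{under } M_2,
\]
since both models have variance one, so that Q = N(0, 1), µ1(θ1) = θ1, µ2(θ2) = θ2 and V1 = V2 = 1. Both mean maps cover the whole real line, so µ0 can be attained under either model whatever the truth: by the results below, this is precisely the situation in which a Bayes factor based on ȳ_n alone cannot be consistent.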
Effective dimension

Understanding d1, d2 in [A4]: defined only when µ0 ∈ {µi(θi), θi ∈ Θi},
\[
\pi_i\bigl(\theta_i : |\mu_i(\theta_i)-\mu_0| < n^{-1/2}\bigr) = O\bigl(n^{-d_i/2}\bigr)
\]
so that di is the effective dimension of the model Θi around µ0.
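One way to read this definition (a sketch under smoothness assumptions not stated on the slide): if µ_i is one-to-one with a non-degenerate Jacobian near θ_i* = µ_i^{-1}(µ_0) and π_i has a positive continuous density there, then
\[
\pi_i\bigl(\theta_i : |\mu_i(\theta_i)-\mu_0| < n^{-1/2}\bigr) \asymp n^{-p_i/2},
\]
so that d_i = p_i, the dimension of the parameter. In the benchmark example with T^n = ȳ_n and µ_i(θ_i) = θ_i, this gives d_1 = d_2 = 1.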
Asymptotic marginals

Asymptotically, under [A1]–[A5], the marginal
\[
m_i(t) = \int_{\Theta_i} g_i(t\mid\theta_i)\,\pi_i(\theta_i)\,\mathrm d\theta_i
\]
is such that
(i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
\[
C_l\,v_n^{d-d_i} \le m_i(T^n) \le C_u\,v_n^{d-d_i}
\]
and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0,
\[
m_i(T^n) = o_{P_n}\bigl[v_n^{d-\tau_i} + v_n^{d-\alpha_i}\bigr] .
\]
Within-model consistency

Under the same assumptions, if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, the posterior distribution of µi(θi) given T^n is consistent at rate 1/v_n provided αi ∧ τi > di.

Note: di can truly be seen as an effective dimension of the model under the posterior πi(·|T^n), since, if µ0 ∈ {µi(θi); θi ∈ Θi},
\[
m_i(T^n) \sim v_n^{d-d_i} .
\]
Between-model consistency

Consequence of the above: the asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of T^n under both models. And only by this mean value!

Indeed, if
   inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0,
then
\[
C_l\,v_n^{-(d_1-d_2)} \le \frac{m_1(T^n)}{m_2(T^n)} \le C_u\,v_n^{-(d_1-d_2)} ,
\]
where Cl, Cu = O_{Pn}(1), irrespective of the true model.
Hence the behaviour only depends on the difference d1 − d2.

Else, if
   inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0,
then
\[
\frac{m_1(T^n)}{m_2(T^n)} \ge C_u\,\min\bigl(v_n^{-(d_1-\alpha_2)},\, v_n^{-(d_1-\tau_2)}\bigr) .
\]
Consistency theorem

If inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, then the Bayes factor
\[
B^T_{12} = O\bigl(v_n^{-(d_1-d_2)}\bigr)
\]
irrespective of the true model. It is consistent iff Pn is within the model with the smallest dimension.
Consistency theorem

If Pn belongs to one of the two models and if µ0 cannot be attained by the other one, i.e.
\[
0 = \min\bigl(\inf\{|\mu_0-\mu_i(\theta_i)|;\,\theta_i\in\Theta_i\},\ i=1,2\bigr)
  < \max\bigl(\inf\{|\mu_0-\mu_i(\theta_i)|;\,\theta_i\in\Theta_i\},\ i=1,2\bigr) ,
\]
then the Bayes factor B^T_12 is consistent.
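A concrete instance of this second case on the benchmark example (a computation added here, not on the original slides): the median absolute deviation mad(y) is ancillary under both location models, so its asymptotic mean does not depend on θi, and the two limits differ,
\[
\mu_1(\theta_1) \equiv \Phi^{-1}(3/4) \approx 0.6745 \ \text{(Gauss)},
\qquad
\mu_2(\theta_2) \equiv \frac{\log 2}{\sqrt 2} \approx 0.4901 \ \text{(Laplace)} .
\]
Whichever model generated the data, µ0 equals one of these two constants and cannot be attained by the other model, so the theorem gives consistency of the Bayes factor based on mad(y).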
Likelihood-free Methods
   ABC model choice consistency
     Summary statistics

Consequences on summary statistics

The Bayes factor is driven by the means µi(θi) and by the relative position of µ0 with respect to both sets {µi(θi); θi ∈ Θi}, i = 1, 2.

For ABC, this implies that the statistics T^n most likely to work are ancillary statistics with different mean values under both models.

Else, if T^n asymptotically depends on some of the parameters of the models, it is quite likely that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified.
Toy example: Laplace versus Gauss [1]

If
\[
T^n = n^{-1}\sum_{i=1}^n X_i^4 ,\qquad
\mu_1(\theta) = 3 + \theta^4 + 6\theta^2 ,\qquad
\mu_2(\theta) = 6 + \cdots
\]
and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ* = (2√3 − 3)^{1/2} leads to µ0 = µ(θ*) [here d1 = d2 = d = 1].
Hence a Bayes factor associated with such a statistic is inconsistent.
[Figures: density of the posterior probabilities based on the fourth moment, and boxplots of these probabilities for n = 1000 samples from M1 and M2.]
Toy example: Laplace versus Gauss [1]

Caption: Comparison of the distributions of the posterior probabilities that the data is from a normal model (as opposed to a Laplace model) with unknown mean, when the data is made of n = 1000 observations either from a normal (M1) or Laplace (M2) distribution with mean one, and when the summary statistic in the ABC algorithm is restricted to the empirical fourth moment. The ABC algorithm uses proposals from the prior N(0, 4) and selects the tolerance as the 1% distance quantile.
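The procedure in the caption can be sketched in a few lines of Python; the snippet below is an illustrative reimplementation (not the code behind the figures). The uniform prior over the two models and the 20,000 prior simulations are arbitrary choices; the N(0, 4) prior on the mean and the 1% tolerance quantile are taken from the caption. Replicating the figures would require repeating this over many simulated datasets.

import numpy as np

rng = np.random.default_rng(1)

def fourth_moment(x):
    # the only summary statistic retained here
    return np.mean(x ** 4)

def abc_prob_M1(y, n_sim=20_000, quantile=0.01):
    """ABC estimate of the posterior probability of the normal model M1,
    with model index and mean drawn from the prior and the tolerance set
    to the 1% quantile of the simulated distances."""
    n, s_obs = len(y), fourth_moment(y)
    models = rng.integers(1, 3, size=n_sim)        # uniform prior over {M1, M2}
    thetas = rng.normal(0.0, 2.0, size=n_sim)      # prior N(0, 4) on the mean
    dist = np.empty(n_sim)
    for i in range(n_sim):
        if models[i] == 1:                         # M1: Gaussian, variance one
            z = rng.normal(thetas[i], 1.0, size=n)
        else:                                      # M2: Laplace, scale 1/sqrt(2)
            z = rng.laplace(thetas[i], 1.0 / np.sqrt(2.0), size=n)
        dist[i] = abs(fourth_moment(z) - s_obs)
    keep = dist <= np.quantile(dist, quantile)
    return np.mean(models[keep] == 1)

# n = 1000 observations with mean one from each model, as in the caption
y_gauss = rng.normal(1.0, 1.0, size=1000)
y_laplace = rng.laplace(1.0, 1.0 / np.sqrt(2.0), size=1000)
print("P(M1 | Gauss data)   =", abc_prob_M1(y_gauss))
print("P(M1 | Laplace data) =", abc_prob_M1(y_laplace))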
Toy example: Laplace versus Gauss [0]

When
\[
T(y) = \bigl(\bar y_n^{(4)},\, \bar y_n^{(6)}\bigr)
\]
(the empirical fourth and sixth moments) and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ1*) = 6 with θ1* = (2√3 − 3)^{1/2} [d1 = 1 and d2 = 1/2], thus
\[
B^T_{12} \sim n^{-1/4} \to 0 \;:\ \text{consistent.}
\]
Under the Gaussian model, µ0 = 3 while µ2(θ2) ≥ 6 > 3 for all θ2, so
\[
B^T_{12} \to +\infty \;:\ \text{consistent.}
\]
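For the record, the n^{-1/4} rate matches the v_n^{-(d1−d2)} order given earlier, assuming the usual CLT scaling v_n = √n (arithmetic added here):
\[
v_n^{-(d_1-d_2)} = (\sqrt n)^{-(1-1/2)} = n^{-1/4} .
\]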
[Figure: density of the posterior probabilities based on the fourth and sixth moments.]
Embedded models

When M1 is a submodel of M2 and the true distribution belongs to the smaller model M1, the Bayes factor is of order
\[
v_n^{-(d_1-d_2)} .
\]
If the summary statistic is only informative on a parameter that is the same under both models, i.e. if d1 = d2, then the Bayes factor is not consistent.

Else, d1 < d2 and the Bayes factor is consistent under M1. If the true distribution is not in M1, then the Bayes factor is consistent only if µ1 = µ2 = µ0.