Presentation by Professor Pedro Grande, from the UFRGS branch of the Instituto Nacional de Engenharia de Superfície. Invited talk at the Surface Engineering Symposium of the X SBPMAT Meeting, held on September 26, 2011 in Gramado (RS), Brazil.
This document summarizes the state of knowledge regarding age-related dielectric properties of tissues and their relevance for assessing children's exposure to electromagnetic fields. It outlines that Gabriel et al. established the first extensive dielectric database in 1996 through a literature review and experimental measurements spanning 10 Hz to 20 GHz. This database is still widely used but was expanded upon in a recent study that measured dielectric properties in vivo for 58 porcine tissues. Key findings included variability in properties like grey matter and bone with age and differences between in vivo and in vitro measurements.
The document discusses the Kalman filter, an algorithm used to estimate unknown variables using measurements observed over time that contain noise. It provides three key points:
1) The Kalman filter is an optimal estimator that recursively infers parameters from indirect, noisy measurements by fusing predictions with new measurements.
2) It is conceptualized using an example of estimating a boat's position over time based on noisy sextant and GPS measurements.
3) The filter works by predicting the next state, taking a new measurement, and updating the estimate by weighing the prediction and measurement based on their uncertainties.
The document summarizes a physics project on the 2D kinematics of the mobile game "Angry Birds". It discusses:
1) The project was done by Vu Nguyen, Brandon McGinnis and Helina Mekuria.
2) They modeled the birds' motion using 2D kinematic equations and measured the range of motion for different launch angles using a ping-pong cannon experiment.
3) Their results showed that maximum range is achieved at a launch angle of 45 degrees, and ranges are the same for complementary launch angles.
Lesson 12: Linear Approximation and Differentials (Section 21 handout), by Matthew Leingang
The line tangent to a curve is also the line which best "fits" the curve near that point. So derivatives can be used for approximating complicated functions with simple linear ones. Differentials are another set of notation for the same problem.
The document discusses the status of the CMS SM Higgs search. It notes that the Standard Model has been confirmed to better than 1% uncertainty by precision measurements, with the Higgs boson being the only missing piece. The search has eliminated about 475 GeV of the possible Higgs mass range, combining previous data from the Tevatron and the LHC. With more data being collected at 8 TeV, CMS will be able to further probe the remaining mass range by exploiting multiple production and decay modes of the Higgs.
Lesson 12: Linear Approximation (Section 41 handout), by Matthew Leingang
The line tangent to a curve is also the line which best "fits" the curve near that point. So derivatives can be used for approximating complicated functions with simple linear ones. Differentials are another set of notation for the same problem.
This document contains data on waist size and weight for 30 individuals. It performs a regression analysis to determine the correlation between waist size and weight. It finds a strong positive correlation (R=0.718) between the two variables, with waist size explaining 51.5% of the variation in weight. The regression equation calculates that for every 1 unit increase in waist size, weight increases by 4.919 units on average.
This document discusses the measurement of yield strength using spherical indentation. It summarizes previous studies on this technique from 1995 to 2004. The document also presents experimental data from indenting an aluminum 6061-T6 sample with a 385 nm radius diamond sphere to determine the material's modulus and yield strength. It finds these values match the literature values for this material. However, it notes that surface roughness prevents accurately determining small stresses and strains with this method.
The document discusses the normal distribution and its properties. It summarizes data from a 2008 report on the heights of 4,482 adult males in the US. It finds the heights are normally distributed with a mean of 68.7 inches and standard deviation of 3.7 inches. It also discusses how changing the mean and standard deviation of a normal distribution affects its shape. Finally, it demonstrates how to standardize scores to compare values from different normal distributions.
Affinity: the meaningful trait-based alternative to the obsolete obfuscation ..., by Lanimal
1) The document presents an argument that affinity, represented by the variable "a", is a superior metric to the half-saturation constant "Ks" for modeling nutrient uptake kinetics in aquatic systems. Affinity separates the traits relevant for uptake at high versus low nutrient concentrations in a clearer way.
2) Analysis of multiple data sets shows relationships between maximum uptake rate (Vmax) and affinity, but these relationships do not necessarily indicate a physiological trade-off as relationships between Vmax and Ks had been interpreted. In some cases there was a strong positive correlation between Vmax and affinity.
3) Adopting affinity over Ks allows models to be more easily tuned and better reveals relationships between kinetic parameters.
The document discusses the process of data collection and processing for X-ray crystallography. This includes mounting the crystal, evaluating diffraction quality, auto-indexing to determine unit cell dimensions and space group, integrating images, scaling images together, and checking for issues like splitting or diffuse reflections in the images. It also discusses optimizing parameters like mosaicity during refinement to ensure full rather than partial reflections are recorded. Spot profiles are averaged and printed to analyze reflection shapes.
Presentation at "Emerging problems in particle phenomenology" workshop held at CUNY on April 11, 2010. Has sensitivity of Jets+MET searches for 7 TeV LHC.
This document presents the results of a study that examined the effects of using comic magazines to improve reading comprehension among 8th grade students in Cilegon, Indonesia. It shows pretest and posttest scores for 30 students, with average scores increasing from 56.0667 to 62.2667 after the intervention. Statistical analysis found a strong positive correlation between pretest and posttest scores (r=0.932) and that posttest scores can reliably predict pretest scores.
The document describes fitting a simple linear regression model to predict a student's Calculus score based on their Mathematics score. It provides the steps to perform the analysis using the NCSS statistical software. The key results are that the linear regression model is significant with a slope of 0.7656 and R-squared of 0.7052, indicating Mathematics score explains over 70% of the variation in Calculus score. Predictions using this model for Mathematics scores of 50 and 60 are also provided. Bootstrapping methods are used to estimate properties of the population model from the sample data.
The document discusses the field of magnetism from 1990-2010, including topics such as quantum magnetism, single-domain particles, molecular magnets, magnetic deflagration, and the rotational Doppler effect in magnetic resonance systems which can be used to detect the rotation of nanoparticles.
The document appears to present results from a regression analysis examining the relationship between pretest and posttest scores. It finds a strong positive correlation between pretest and posttest scores (r=0.932). The regression model using posttest scores to predict pretest scores explains 86.4% of the variance in pretest scores. Posttest scores were a statistically significant predictor of pretest scores.
Affinity: the meaningful trait-based alternative to the half-saturation constant, by Lanimal
1) The document discusses two common equations used to model nutrient uptake rates: the affinity-based equation and the Michaelis-Menten/Monod equation.
2) While the equations appear different, they actually describe the same curve relationship between uptake rate and nutrient concentration.
3) The key difference is that the affinity-based equation defines competitive ability using the parameter "a", whereas the Michaelis-Menten equation uses the "Ks" parameter. However, "a" and "Ks" are mathematically related.
Effect of thermomechanical process on the austenite transformation in Nb-Mo m..., by Pello Uranga
The document summarizes a study on the effect of composition and thermomechanical processing on the austenite transformation in Nb-Mo microalloyed steels. Seven steel compositions containing 0.05% C and varying amounts of Nb and Mo were subjected to two different thermomechanical cycles, involving deformation at different temperatures and cooling rates. Microstructural characterization showed the transformed phases depended on composition and processing. Dilatometry curves and continuous cooling transformation diagrams were produced to analyze the austenite transformation kinetics and phase stability regions.
Dynamic Recrystallization of a Nb bearing Al-Si TRIP steel, by Pello Uranga
This document studies the dynamic recrystallization behavior of a TRIP steel microalloyed with niobium and aluminum when subjected to hot compression tests. It was found that aluminum addition decreases grain boundary energy and increases the driving force for boundary migration, accelerating dynamic recrystallization. Peak strain was related to the Zener-Hollomon parameter and initial grain size. Dynamically recrystallized grain size decreased with increasing strain rate. The kinetics of dynamic recrystallization were characterized for the TRIP steel and compared to other microalloyed steels.
Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data, by Christian Robert
This document discusses methods for Monte Carlo data, including importance sampling and bridge sampling. It notes that for importance sampling, maximizing the likelihood does not result in an estimator for the estimand of interest, as the likelihood is independent of the estimand. Bridge sampling provides an estimating equation approach where the data from both sampling distributions are relevant to inferring the ratio of normalizing constants.
Approximate Bayesian computation (ABC) is a computational technique for Bayesian inference when the likelihood function is intractable or impossible to compute directly. ABC approximates the likelihood by simulating data under different parameter values and comparing simulated and observed data using summary statistics. ABC produces a parameter sample without evaluating the full likelihood function, thus allowing Bayesian inference when likelihoods are unavailable or difficult to compute.
The document discusses Approximate Bayesian Computation (ABC), a computational technique for Bayesian inference when the likelihood function is intractable. ABC allows sampling from the likelihood and making inferences based on simulated data without calculating the actual likelihood. The technique originated in population genetics models where likelihoods for genetic polymorphism data cannot be calculated in closed form. ABC is presented as both an inference machine with its own legitimacy compared to classical Bayesian approaches, as well as a way to address computational issues with intractable likelihoods.
This document discusses sampling-based approaches for calculating marginal densities from conditional distributions. It introduces substitution algorithms, substitution sampling, Gibbs sampling, and importance sampling. Substitution algorithms iteratively estimate marginal densities by substituting conditional distributions. Substitution sampling generates samples by iteratively drawing from conditional distributions. Gibbs sampling repeatedly draws values from conditional distributions to estimate joint and marginal distributions.
The document summarizes a reading seminar on Markov chain Monte Carlo (MCMC) methods. It introduces MCMC by discussing how Monte Carlo methods can be used to approximate integrals and distributions using Markov chains. It then reviews the key concepts of Markov chains and how MCMC constructs transition matrices to make the target distribution the stationary distribution. This allows using Markov chains to generate dependent samples that still converge to the target distribution. The document outlines the seminar topics which include the Monte Carlo principle, Markov chain theory, specific MCMC methods and algorithms.
The document describes several bootstrap methods for estimating parameters from sample data when the underlying distribution is unknown. It outlines the bootstrap procedure, which involves resampling the original data with replacement to create bootstrap samples and estimating the parameter from each resample. Three methods for calculating the bootstrap distribution are described: direct theoretical calculation, simulation-based resampling, and Bayesian approaches. The document also provides an example of using the bootstrap to estimate the median from a sample.
This document discusses approximate Bayesian computation (ABC) techniques for performing Bayesian inference when the likelihood function is not available in closed form. It covers the basic ABC algorithm and discusses challenges with high-dimensional data. It also summarizes recent advances in ABC that incorporate nonparametric regression, reproducing kernel Hilbert spaces, and neural networks to help address these challenges.
This document discusses differentially private distributed Bayesian linear regression with Markov chain Monte Carlo (MCMC) methods. It proposes adding noise to the summaries (S) and coefficients (z) of local linear regression models on different devices to provide differential privacy. Gibbs sampling is used to simulate the genuine posterior distribution over the linear model parameters (theta, sigma_y, Sigma_x, z1:J, S1:J) in a distributed manner while maintaining privacy. Alternative approaches like exploiting approximate posteriors from all devices or learning iteratively are also mentioned.
This document discusses mixture models and approximations to computing model evidence. It contains:
1) An overview of mixtures of distributions and common priors used for mixtures.
2) Approximations to computing marginal likelihoods or model evidence using Chib's representation and Rao-Blackwellization. Permutations are used to address label switching issues.
3) Methods for more efficient sampling for computing model evidence, including iterative bridge sampling and dual importance sampling with approximations to reduce the number of permutations considered.
Sequential Monte Carlo is also briefly mentioned as an alternative approach.
This document describes the adaptive restore algorithm, a non-reversible Markov chain Monte Carlo method. It begins with an overview of the restore process, which takes regenerations from an underlying diffusion or jump process to construct a reversible Markov chain with a target distribution. The adaptive restore process enriches this by allowing the regeneration distribution to adapt over time. It converges almost surely to the minimal regeneration distribution. Parameters like the initial regeneration distribution and rates are discussed. Examples are provided for the adaptive Brownian restore algorithm and calibrating the parameters.
This document summarizes techniques for approximating marginal likelihoods and Bayes factors, which are important quantities in Bayesian inference. It discusses Geyer's 1994 logistic regression approach, links to bridge sampling, and how mixtures can be used as importance sampling proposals. Specifically, it shows how optimizing the logistic pseudo-likelihood relates to the bridge sampling optimal estimator. It also discusses non-parametric maximum likelihood estimation based on simulations.
This document discusses Bayesian restricted likelihood methods for situations where the likelihood cannot be fully trusted. It presents several approaches including empirical likelihood, Bayesian empirical likelihood, using insufficient statistics, approximate Bayesian computation (ABC), and MCMC on manifolds. The key ideas are developing Bayesian tools that are robust to model misspecification by questioning the likelihood, prior, and other assumptions.
This document discusses various methods for approximating marginal likelihoods and Bayes factors, including:
1. Geyer's 1994 logistic regression approach for approximating marginal likelihoods using importance sampling.
2. Bridge sampling and its connection to Geyer's approach. Optimal bridge sampling requires knowledge of unknown normalizing constants.
3. Using mixtures of importance distributions and the target distribution as proposals to estimate marginal likelihoods through Rao-Blackwellization. This connects to bridge sampling estimates.
4. The document discusses various methods for approximating marginal likelihoods and comparing hypotheses using Bayes factors. It outlines the historical development and connections between different approximation techniques.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
The document summarizes Approximate Bayesian Computation (ABC). It discusses how ABC provides a way to approximate Bayesian inference when the likelihood function is intractable or too computationally expensive to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data according to a distance measure and tolerance level. Key points discussed include:
- ABC provides an approximation to the posterior distribution by sampling from simulations that fall within a tolerance of the observed data.
- Summary statistics are often used to reduce the dimension of the data and improve the signal-to-noise ratio when applying the tolerance criterion.
- Random forests can help select informative summary statistics and provide semi-automated ABC
This document describes a new method called component-wise approximate Bayesian computation (ABCG or ABC-Gibbs) that combines approximate Bayesian computation (ABC) with Gibbs sampling. ABCG aims to more efficiently explore parameter spaces when the number of parameters is large. It works by alternately sampling each parameter from its ABC-approximated conditional distribution given current values of other parameters. The document provides theoretical analysis showing ABCG converges to a stationary distribution under certain conditions. It also presents examples demonstrating ABCG can better separate estimates from the prior compared to simple ABC, especially for hierarchical models.
ABC stands for approximate Bayesian computation. It is a method for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC produces samples from an approximate posterior distribution by simulating parameter and summary statistic values that match the observed summary statistics within a tolerance level. The choice of summary statistics is important but difficult, as there is typically no sufficient statistic. Several strategies have been developed for selecting good summary statistics, including using random forests or the Lasso to evaluate and select from a large set of potential summaries.
The document describes a new method, component-wise approximate Bayesian computation, that combines ABC with Gibbs sampling. It aims to improve ABC's ability to efficiently explore parameter spaces when the number of parameters is large. The method works by alternating sampling from each parameter's ABC posterior conditional distribution given current values of other parameters and the observed data. The method is proven to converge to a stationary distribution under certain assumptions, especially for hierarchical models where conditional distributions are often simplified. Numerical experiments on toy examples demonstrate the method can provide a better approximation of the true posterior than vanilla ABC.
1) Likelihood-free Bayesian experimental design is discussed as an intractable likelihood optimization problem, where the goal is to find the optimal design d that minimizes expected loss without using the full posterior distribution.
2) Several Bayesian tools are proposed to make the design problem more Bayesian, including Bayesian non-parametrics, annealing algorithms, and placing a posterior on the design d.
3) Gaussian processes are a default modeling choice for complex unknown functions in these problems, but their accuracy is difficult to assess and they may incur a dimension curse.
Chapter-wise All Notes of First Year Basic Civil Engineering.pptx, by Denish Jangid
Chapter-wise All Notes of First Year Basic Civil Engineering
Syllabus
Chapter 1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instruments used; Object of levelling; Methods of levelling in brief; and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid Waste. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. Noise Pollution: Harmful effects of noise pollution, control of noise pollution. Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
Temple of Asclepius in Thrace. Excavation results, by Krassimira Luka
The temple and the surrounding sanctuary were dedicated to Asklepios Zmidrenus. This name has been known since 1875, when an inscription dedicated to him was discovered in Rome. The inscription is dated to 227 AD and was left by soldiers originating from the city of Philippopolis (modern Plovdiv).
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx, by EduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
This document provides an overview of wound healing, its functions, stages, mechanisms, factors affecting it, and complications.
A wound is a break in the integrity of the skin or tissues, which may be associated with disruption of the structure and function.
Healing is the body’s response to injury in an attempt to restore normal structure and functions.
Healing can occur in two ways: Regeneration and Repair
There are 4 phases of wound healing: hemostasis, inflammation, proliferation, and remodeling. This document also describes the mechanism of wound healing. Factors that affect healing include infection, uncontrolled diabetes, poor nutrition, age, anemia, the presence of foreign bodies, etc.
Complications of wound healing include infection, hyperpigmentation of the scar, contractures, and keloid formation.
This presentation was provided by Racquel Jemison, Ph.D., Christina MacLaughlin, Ph.D., and Paulomi Majumder, Ph.D., all of the American Chemical Society, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two, 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
Jemison, MacLaughlin, and Majumder "Broadening Pathways for Editors and Authors"
Physics seminar at Besançon, Nov. 22, 2012
1. MCMC and likelihood-free methods
MCMC and likelihood-free methods
Christian P. Robert
Université Paris-Dauphine, IUF, & CREST
Université de Besançon, November 22, 2012
2. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Approximate Bayesian computation
3. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Statistical problems in cosmology
Potentially high dimensional parameter space [not considered here]
Immensely slow computation of likelihoods, e.g. WMAP, CMB, because of numerically costly spectral transforms [data is a Fortran program]
Nonlinear dependence and degeneracies between parameters introduced by physical constraints or theoretical assumptions
4. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Cosmological data
Posterior distribution of cosmological parameters for recent observational data of CMB anisotropies (differences in temperature from directions) [WMAP], SNIa, and cosmic shear.
Combination of three likelihoods, some of which are available as public (Fortran) code, and of a uniform prior on a hypercube.
5. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Cosmology parameters
Parameters for the cosmology likelihood (C = CMB, S = SNIa, L = lensing):

Symbol   Description                    Minimum   Maximum   Experiment
Ωb       Baryon density                 0.01      0.1       C, L
Ωm       Total matter density           0.01      1.2       C, S, L
w        Dark-energy eq. of state       -3.0      0.5       C, S, L
ns       Primordial spectral index      0.7       1.4       C, L
∆²R      Normalization (large scales)   -         -         C
σ8       Normalization (small scales)   -         -         C, L
h        Hubble constant                -         -         C, L
τ        Optical depth                  -         -         C
M        Absolute SNIa magnitude        -         -         S
α        Colour response                -         -         S
β        Stretch response               -         -         S
a, b, c  Galaxy z-distribution fit      -         -         L

For WMAP5, σ8 is a deduced quantity that depends on the other parameters.
6. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Adaptation of importance function
[Benabed et al., MNRAS, 2010]
7. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Estimates
Parameter   PMC                      MCMC
Ωb          0.0432 +0.0027/−0.0024   0.0432 +0.0026/−0.0023
Ωm          0.254 +0.018/−0.017      0.253 +0.018/−0.016
τ           0.088 +0.018/−0.016      0.088 +0.019/−0.015
w           −1.011 ± 0.060           −1.010 +0.059/−0.060
ns          0.963 +0.015/−0.014      0.963 +0.015/−0.014
10⁹ ∆²R     2.413 +0.098/−0.093      2.414 +0.098/−0.092
h           0.720 +0.022/−0.021      0.720 +0.023/−0.021
a           0.648 +0.040/−0.041      0.649 +0.043/−0.042
b           9.3 +1.4/−0.9            9.3 +1.7/−0.9
c           0.639 +0.084/−0.070      0.639 +0.082/−0.070
−M          19.331 ± 0.030           19.332 +0.029/−0.031
α           1.61 +0.15/−0.14         1.62 +0.16/−0.14
−β          −1.82 +0.17/−0.16        −1.82 ± 0.16
σ8          0.795 +0.028/−0.030      0.795 +0.030/−0.027

Means and 68% credible intervals using lensing, SNIa and CMB.
8. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Evidence/Marginal likelihood/Integrated Likelihood ...
Central quantity of interest in (Bayesian) model choice:
E = ∫ π(x) dx = ∫ [π(x)/q(x)] q(x) dx,
expressed as an expectation under any density q with large enough support.
9. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Evidence/Marginal likelihood/Integrated Likelihood ...
Central quantity of interest in (Bayesian) model choice:
E = ∫ π(x) dx = ∫ [π(x)/q(x)] q(x) dx,
expressed as an expectation under any density q with large enough support.
Importance sampling provides a sample x1, . . . , xN ∼ q and an approximation of the above integral,
E ≈ N⁻¹ Σ_{n=1}^{N} wn,
where the wn = π(xn)/q(xn) are the (unnormalised) importance weights.
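As a concrete illustration of this estimator, here is a minimal Python sketch (not part of the original slides): the unnormalised target is a toy density with known normalising constant Z = 2.5, so the importance-sampling average of the weights should recover Z. The Gaussian target and the Student-t proposal are assumptions of this sketch, not the cosmology likelihoods discussed above.

```python
import numpy as np
from scipy import stats

# Toy check of E ≈ N⁻¹ Σ_n w_n with w_n = π(x_n)/q(x_n):
# π is an unnormalised Gaussian whose true integral is Z = 2.5.
def pi_unnorm(x):
    return 2.5 * np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

q = stats.t(df=3)                    # heavy-tailed proposal, support covers π
rng = np.random.default_rng(0)
N = 100_000
x = q.rvs(size=N, random_state=rng)  # x_1, ..., x_N ~ q
w = pi_unnorm(x) / q.pdf(x)          # unnormalised importance weights
print(w.mean(), w.std(ddof=1) / np.sqrt(N))  # ≈ 2.5, with its Monte Carlo error
```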
10. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Back to cosmology questions
Standard cosmology successful in explaining recent observations, such as CMB, SNIa, galaxy clustering, cosmic shear, galaxy cluster counts, and Lyα forest clustering.
Flat ΛCDM model with only six free parameters (Ωm, Ωb, h, ns, τ, σ8)
11. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Back to cosmology questions
Standard cosmology successful in explaining recent observations, such as CMB, SNIa, galaxy clustering, cosmic shear, galaxy cluster counts, and Lyα forest clustering.
Flat ΛCDM model with only six free parameters (Ωm, Ωb, h, ns, τ, σ8)
Extensions to ΛCDM may be based on independent evidence (massive neutrinos from oscillation experiments), predicted by compelling hypotheses (primordial gravitational waves from inflation) or reflect ignorance about fundamental physics (dynamical dark energy).
Testing for dark energy, curvature, and inflationary models
12. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Extended models
Focus on the dark energy equation-of-state parameter, modeled as
w = −1                    (ΛCDM)
w = w0                    (wCDM)
w = w0 + w1(1 − a)        (w(z)CDM)
In addition, the curvature parameter ΩK for each of the above is either ΩK = 0 ('flat') or ΩK ≠ 0 ('curved').
This choice of models represents the simplest models beyond a "cosmological constant" model able to explain the observed, recent accelerated expansion of the Universe.
13. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Cosmology priors
Prior ranges for dark energy and curvature models. In the case of w(a) models, the prior on w1 depends on w0.

Parameter   Description                 Min.        Max.
Ωm          Total matter density        0.15        0.45
Ωb          Baryon density              0.01        0.08
h           Hubble parameter            0.5         0.9
ΩK          Curvature                   −1          1
w0          Constant dark-energy par.   −1          −1/3
w1          Linear dark-energy par.     −1 − w0     (−1/3 − w0)/(1 − a_acc)
14. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Results
In most cases evidence in favour of the standard model, especially when more datasets/experiments are combined.
Largest evidence is ln B12 = 1.8, for the w(z)CDM model and CMB alone. This is a case where a large part of the prior range is still allowed by the data, and a region of comparable size is excluded. Hence weak evidence that both w0 and w1 are required, but excluded when adding SNIa and BAO datasets.
Results on the curvature are compatible with current findings: non-flat Universe(s) strongly disfavoured for the three dark-energy cases.
16. MCMC and likelihood-free methods
Computational issues in Bayesian cosmology
Posterior outcome
Posterior on dark-energy parameters w0 and w1 as 68%- and 95%-credible regions for WMAP (solid blue lines), WMAP+SNIa (dashed green) and WMAP+SNIa+BAO (dotted red curves). Allowed prior range shown as red straight lines.
17. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Computational issues in Bayesian cosmology
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Approximate Bayesian computation
18. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo basics
General purpose
A major computational issue in Bayesian statistics:
Given a density π̃ known up to a normalizing constant, and an integrable function h, compute
Π(h) = ∫ h(x) π(x) µ(dx) = ∫ h(x) π̃(x) µ(dx) / ∫ π̃(x) µ(dx)
when ∫ h(x) π̃(x) µ(dx) is intractable.
19. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo basics
Monte Carlo 101
Generate an iid sample x1, . . . , xN from π and estimate Π(h) by
Π̂N^MC(h) = N⁻¹ Σ_{i=1}^{N} h(xi).
LLN: Π̂N^MC(h) → Π(h) almost surely.
If Π(h²) = ∫ h²(x) π(x) µ(dx) < ∞,
CLT: √N (Π̂N^MC(h) − Π(h)) ⇝ N(0, Π([h − Π(h)]²)).
20. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo basics
Monte Carlo 101
Generate an iid sample x1, . . . , xN from π and estimate Π(h) by
Π̂N^MC(h) = N⁻¹ Σ_{i=1}^{N} h(xi).
LLN: Π̂N^MC(h) → Π(h) almost surely.
If Π(h²) = ∫ h²(x) π(x) µ(dx) < ∞,
CLT: √N (Π̂N^MC(h) − Π(h)) ⇝ N(0, Π([h − Π(h)]²)).
Caveat leading to MCMC: it is often impossible or inefficient to simulate directly from Π
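Before moving to Markov chains, a minimal Python sketch of the plain Monte Carlo estimator and its CLT error bar may help; the target π = N(0, 1) and test function h(x) = x² are illustrative choices (with Π(h) = 1 exactly), not taken from the slides.

```python
import numpy as np

# Monte Carlo 101: Π̂_N(h) = N⁻¹ Σ_i h(x_i) with an iid sample from π.
rng = np.random.default_rng(1)
N = 10_000
x = rng.standard_normal(N)          # iid sample from π = N(0, 1)
h = x**2                            # h(x) = x², so Π(h) = 1 exactly
est = h.mean()                      # LLN: converges to Π(h)
se = h.std(ddof=1) / np.sqrt(N)     # CLT: Monte Carlo standard error
print(f"estimate = {est:.4f} +/- {se:.4f} (truth: 1.0)")
```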
21. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (MCMC)
It is not necessary to use a sample from the distribution f to approximate the integral
I = ∫ h(x) f(x) dx.
22. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (MCMC)
It is not necessary to use a sample from the distribution f to approximate the integral
I = ∫ h(x) f(x) dx.
[Notation warning: π turned into f!]
23. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (MCMC)
It is not necessary to use a sample from the distribution f to approximate the integral
I = ∫ h(x) f(x) dx.
We can obtain X1, . . . , Xn ∼ f (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
24. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (MCMC)
It is not necessary to use a sample from the distribution f to approximate the integral
I = ∫ h(x) f(x) dx.
We can obtain X1, . . . , Xn ∼ f (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
[Photo: Andreï Markov]
25. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f.
26. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f.
An irreducible Markov chain with stationary distribution f is ergodic with limiting distribution f under weak conditions, hence convergence in distribution of (X(t)) to a random variable from f.
For T0 "large enough", X(T0) is distributed from f.
The Markov sequence is a dependent sample X(T0), X(T0+1), . . . generated from f.
Birkhoff's ergodic theorem extends the LLN, sufficient for most approximation purposes.
27. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f.
Problem: How can one build a Markov chain with a given stationary distribution?
28. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
The Metropolis–Hastings algorithm
Arguments: the algorithm uses the objective (target) density f and a conditional density q(y|x), called the instrumental (or proposal) distribution.
[Photo: Nicholas Metropolis]
29. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
The MH algorithm
Algorithm (Metropolis–Hastings)
Given x(t):
1. Generate Yt ∼ q(y|x(t)).
2. Take X(t+1) = Yt with prob. ρ(x(t), Yt), and X(t+1) = x(t) with prob. 1 − ρ(x(t), Yt),
where
ρ(x, y) = min{ [f(y)/f(x)] [q(x|y)/q(y|x)], 1 }.
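A minimal Python sketch of this generic step, assuming an independence proposal q(y|x) = q(y) and a standard Cauchy target known only up to a constant; both choices are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy import stats

def f_unnorm(x):
    # target f known up to a normalising constant (standard Cauchy)
    return 1.0 / (1.0 + x**2)

q = stats.norm(0.0, 2.0)             # independence proposal q(y|x) = q(y)
rng = np.random.default_rng(2)

x, chain = 0.0, []
for _ in range(20_000):
    y = q.rvs(random_state=rng)      # 1. generate Y_t ~ q
    # 2. accept with rho(x, y) = min{ [f(y)/f(x)] [q(x|y)/q(y|x)], 1 }
    rho = min(1.0, f_unnorm(y) * q.pdf(x) / (f_unnorm(x) * q.pdf(y)))
    if rng.uniform() < rho:
        x = y
    chain.append(x)
print(np.median(chain))              # ≈ 0, the Cauchy median
```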
30. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Features
Independent of normalizing constants for both f and q(·|x) (i.e., those constants independent of x)
Never move to values with f(y) = 0
The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure
The sequence (yt)t is usually not a Markov chain
31. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f, since it satisfies the detailed balance condition f(y) K(y, x) = f(x) K(x, y).
32. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f, since it satisfies the detailed balance condition f(y) K(y, x) = f(x) K(x, y).
2. As f is a probability measure, the chain is positive recurrent.
33. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f, since it satisfies the detailed balance condition f(y) K(y, x) = f(x) K(x, y).
2. As f is a probability measure, the chain is positive recurrent.
3. If
Pr[ f(Yt) q(X(t)|Yt) / {f(X(t)) q(Yt|X(t))} ≥ 1 ] < 1,   (1)
that is, if the event {X(t+1) = X(t)} is possible, then the chain is aperiodic.
34. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
Random walk Metropolis–Hastings
Use of a local perturbation as proposal:
Yt = X(t) + εt,
where εt ∼ g, independent of X(t).
The instrumental density is of the form g(y − x), and the Markov chain is a random walk if we take g to be symmetric: g(x) = g(−x).
35. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
Random walk Metropolis–Hastings [code]
Algorithm (Random walk Metropolis)
Given x(t):
1. Generate Yt ∼ g(y − x(t)).
2. Take X(t+1) = Yt with prob. min{1, f(Yt)/f(x(t))}, and X(t+1) = x(t) otherwise.
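The slide title announces code; since none survives in the transcript, here is a hedged Python sketch of the random-walk step for an assumed bimodal target (a two-component normal mixture) with a Gaussian g of scale ω, so the acceptance probability reduces to min{1, f(Yt)/f(x(t))}.

```python
import numpy as np

def f(x):
    # illustrative bimodal target: mixture of N(-2, 1) and N(2, 1)
    return 0.5 * np.exp(-0.5 * (x + 2.0)**2) + 0.5 * np.exp(-0.5 * (x - 2.0)**2)

def rw_metropolis(omega, n_iter=50_000, x0=0.0, seed=3):
    rng = np.random.default_rng(seed)
    x, chain, accepted = x0, np.empty(n_iter), 0
    for t in range(n_iter):
        y = x + omega * rng.standard_normal()        # Y_t ~ g(y - x(t)), g symmetric
        if rng.uniform() < min(1.0, f(y) / f(x)):    # accept w.p. min{1, f(y)/f(x)}
            x, accepted = y, accepted + 1
        chain[t] = x
    return chain, accepted / n_iter

chain, rate = rw_metropolis(omega=2.5)
print(f"acceptance rate = {rate:.2f}, sample mean = {chain.mean():.2f}")  # mean ≈ 0
```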
36. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Langevin Algorithms
Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation
dLt = dBt + (1/2) ∇ log f(Lt) dt,
where Bt is the standard Brownian motion.
Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
37. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Discretization
Instead, consider the sequence
x(t+1) = x(t) + (σ²/2) ∇ log f(x(t)) + σ εt,   εt ∼ Np(0, Ip),
where σ² corresponds to the discretization step.
38. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Discretization
Instead, consider the sequence
x(t+1) = x(t) + (σ²/2) ∇ log f(x(t)) + σ εt,   εt ∼ Np(0, Ip),
where σ² corresponds to the discretization step.
Unfortunately, the discretized chain may be transient, for instance when
lim_{x→±∞} σ² |∇ log f(x)| |x|⁻¹ > 1.
39. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
MH correction
Accept the new value Yt with probability
f(Yt)/f(x(t)) · [ exp{−‖x(t) − Yt − (σ²/2) ∇ log f(Yt)‖²/2σ²} / exp{−‖Yt − x(t) − (σ²/2) ∇ log f(x(t))‖²/2σ²} ] ∧ 1.
Choice of the scaling factor σ: it should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated).
[Roberts & Rosenthal, 1998; Girolami & Calderhead, 2011]
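A hedged Python sketch of this Metropolis-adjusted Langevin step, for an assumed one-dimensional target f ∝ exp(−x⁴/4) chosen so that ∇ log f(x) = −x³ is available in closed form; σ is the discretisation scale discussed above.

```python
import numpy as np

def log_f(x):          # log target, up to an additive constant
    return -x**4 / 4.0

def grad_log_f(x):     # ∇ log f
    return -x**3

def mala(sigma, n_iter=50_000, x0=0.0, seed=4):
    rng = np.random.default_rng(seed)
    x, chain = x0, np.empty(n_iter)
    for t in range(n_iter):
        mean_x = x + 0.5 * sigma**2 * grad_log_f(x)
        y = mean_x + sigma * rng.standard_normal()   # Langevin proposal
        mean_y = y + 0.5 * sigma**2 * grad_log_f(y)
        # log MH correction: log f(y) - log f(x) + log q(x|y) - log q(y|x)
        log_rho = (log_f(y) - log_f(x)
                   - (x - mean_y)**2 / (2.0 * sigma**2)
                   + (y - mean_x)**2 / (2.0 * sigma**2))
        if np.log(rng.uniform()) < log_rho:
            x = y
        chain[t] = x
    return chain

chain = mala(sigma=0.9)
print(chain.mean())    # ≈ 0 by symmetry of f
```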
40. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Optimizing the Acceptance Rate
Problem of choosing the transition kernel q from a practical point of view. Most common solutions:
(a) a fully automated algorithm like ARMS; [Gilks & Wild, 1992]
(b) an instrumental density g which approximates f, such that f/g is bounded, for uniform ergodicity to apply;
(c) a random walk.
In both cases (b) and (c), the choice of g is critical.
41. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk
Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is moving correctly, since it indicates that the random walk is moving too slowly on the surface of f.
42. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk
Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is moving correctly, since it indicates that the random walk is moving too slowly on the surface of f.
If x(t) and yt are close, i.e. f(x(t)) ≃ f(yt), then y is accepted with probability
min{ f(yt)/f(x(t)), 1 } ≃ 1.
For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.
43. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f(yt) tend to be small compared with f(x(t)), which means that the random walk moves quickly on the surface of f since it often reaches the "borders" of the support of f.
44. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, aim at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
45. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, aim at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
Warning: rule to be taken with a pinch of salt!
47. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Role of scale
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
x_{t+1} = ϕ x_t + ε_{t+1} ,   ε_t ∼ N(0, τ²)
and observables
y_t | x_t ∼ N(x_t², σ²)
The distribution of x_t given x_{t−1}, x_{t+1} and y_t is proportional to
exp{ −[ (x_t − ϕ x_{t−1})² + (x_{t+1} − ϕ x_t)² ] / 2τ² − (y_t − x_t²)² / 2σ² } .
48. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Role of scale
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the
random walk never jumps to the other mode. But if the scale ω is
sufficiently large, the Markov chain explores both modes and give a
satisfactory approximation of the target distribution.
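A hedged R sketch of this effect: random-walk Metropolis on the full conditional of x_t above, with illustrative values for ϕ, τ, σ, x_{t−1}, x_{t+1} and y_t chosen so that the conditional is bimodal (all numerical values are assumptions made for the example, not from the slides).
## Full conditional of x_t in the noisy AR(1) model (illustrative values)
phi <- 0.9; tau <- 0.5; sigma <- 0.1
x_prev <- 0; x_next <- 0; y <- 1           # y close to x_t^2: modes near +1 and -1
logpi <- function(x)
  -((x - phi * x_prev)^2 + (x_next - phi * x)^2) / (2 * tau^2) -
  (y - x^2)^2 / (2 * sigma^2)
rw_chain <- function(niter, omega, x0 = 1) {
  x <- numeric(niter); x[1] <- x0
  for (t in 2:niter) {
    z <- x[t - 1] + omega * rnorm(1)       # random-walk proposal with scale omega
    x[t] <- if (log(runif(1)) < logpi(z) - logpi(x[t - 1])) z else x[t - 1]
  }
  x
}
range(rw_chain(1e4, omega = 0.1))  # typically trapped near the starting mode
range(rw_chain(1e4, omega = 0.5))  # larger scale can reach the mode near -1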
49. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Role of scale
Markov chain based on a random walk with scale ω = .1.
50. MCMC and likelihood-free methods
The Metropolis-Hastings Algorithm
Extensions
Role of scale
Markov chain based on a random walk with scale ω = .5.
51. MCMC and likelihood-free methods
The Gibbs Sampler
The Gibbs Sampler
Computational issues in Bayesian
cosmology
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Approximate Bayesian computation
54. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
General Principles
A very specific simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f1 , . . . , fp from f
2. Start with the random variable X = (X1 , . . . , Xp )
3. Simulate from the conditional densities,
X_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_p ∼ f_i(x_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_p)
for i = 1, 2, . . . , p.
55. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
Gibbs code
Algorithm (Gibbs sampler)
Given x^(t) = (x_1^(t), . . . , x_p^(t)), generate
1. X_1^(t+1) ∼ f_1(x_1 | x_2^(t), . . . , x_p^(t));
2. X_2^(t+1) ∼ f_2(x_2 | x_1^(t+1), x_3^(t), . . . , x_p^(t));
...
p. X_p^(t+1) ∼ f_p(x_p | x_1^(t+1), . . . , x_{p−1}^(t+1))
X^(t+1) → X ∼ f
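As a minimal concrete instance of this algorithm, here is an R sketch of the two-stage Gibbs sampler for a bivariate normal target with correlation ρ; the target and the value ρ = 0.8 are illustrative assumptions.
## Two-stage Gibbs sampler for a bivariate normal target
rho <- 0.8; niter <- 1000
x <- matrix(0, niter, 2)
for (t in 2:niter) {
  # X1 | x2 ~ N(rho * x2, 1 - rho^2)
  x[t, 1] <- rho * x[t - 1, 2] + sqrt(1 - rho^2) * rnorm(1)
  # X2 | x1 ~ N(rho * x1, 1 - rho^2), using the freshly updated x1
  x[t, 2] <- rho * x[t, 1] + sqrt(1 - rho^2) * rnorm(1)
}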
56. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
Properties
The full conditional densities f1 , . . . , fp are the only densities used
for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate
58. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
toy example: iid N(µ, σ2 ) variates
When Y_1, . . . , Y_n ∼ N(µ, σ²) i.i.d., with both µ and σ unknown, the
posterior in (µ, σ²) is conjugate but outside a standard family
But...
µ | Y_{1:n}, σ² ∼ N( (1/n) Σ_{i=1}^n Y_i , σ²/n )
σ² | Y_{1:n}, µ ∼ IG( n/2 − 1 , (1/2) Σ_{i=1}^n (Y_i − µ)² )
assuming constant (improper) priors on both µ and σ²
Hence we may use the Gibbs sampler for simulating from the
posterior of (µ, σ²)
59. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
toy example: R code
Gibbs Sampler for Gaussian posterior
n = length(Y)       # Y: observed sample, e.g. Y = rnorm(10) as in the example below
S = sum(Y)
mu = S/n
for (i in 1:500) {
  S2 = sum((Y - mu)^2)
  sigma2 = 1/rgamma(1, n/2 - 1, S2/2)   # sigma2 | Y, mu ~ IG(n/2 - 1, S2/2)
  mu = S/n + sqrt(sigma2/n)*rnorm(1)    # mu | Y, sigma2 ~ N(S/n, sigma2/n)
}
60. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the
N(0, 1) distribution
Number of iterations 1, 2, 3, 4, 5, 10, 25, 50, 100, 500
73. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with
acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. is, by construction, multidimensional
4. does not apply to problems where the number of parameters
varies, as the resulting chain is then not irreducible.
74. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
A wee problem
[Figure: Gibbs sample for a bimodal target, plotted in (µ1, µ2); Gibbs started at random]
75. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
A wee problem
[Figures: Gibbs started at random (left) vs. Gibbs stuck at the wrong mode (right), plotted in (µ1, µ2)]
77. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
Slice sampler as generic Gibbs
If f(θ) can be written as a product
f(θ) = ∏_{i=1}^k f_i(θ) ,
it can be completed as
∏_{i=1}^k I_{0 ≤ ω_i ≤ f_i(θ)} ,
leading to the following Gibbs algorithm:
78. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
Slice sampler (code)
Algorithm (Slice sampler)
Simulate
1. ω_1^(t+1) ∼ U[0, f_1(θ^(t))];
...
k. ω_k^(t+1) ∼ U[0, f_k(θ^(t))];
k+1. θ^(t+1) ∼ U_{A^(t+1)}, with
A^(t+1) = {y : f_i(y) ≥ ω_i^(t+1), i = 1, . . . , k}.
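A minimal R sketch of this algorithm with a single term (k = 1), targeting the N(−3, 1) density truncated to [0, 1] used in the next example; the unit-interval support and all tuning values are assumptions read off the figures. Here the slice A is an interval available in closed form.
## Slice sampler for f(x) proportional to exp(-(x+3)^2/2) on [0,1]
f <- function(x) exp(-(x + 3)^2 / 2)
niter <- 1000
x <- numeric(niter); x[1] <- 0.5
for (t in 2:niter) {
  omega <- runif(1, 0, f(x[t - 1]))           # 1. omega ~ U[0, f(x^(t))]
  # 2. x ~ U on A = {y in [0,1] : f(y) >= omega}, here the interval below
  upper <- min(1, -3 + sqrt(-2 * log(omega)))
  x[t] <- runif(1, 0, upper)
}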
79. MCMC and likelihood-free methods
The Gibbs Sampler
General Principles
Example of results with a truncated N(−3, 1) distribution
[Figure: slice sampler output on [0, 1] after 2, 3, 4, 5, 10, 50 and 100 iterations]
86. MCMC and likelihood-free methods
Approximate Bayesian computation
Approximate Bayesian computation
Computational issues in Bayesian
cosmology
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Approximate Bayesian computation
87. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Regular Bayesian computation issues
Recap: when faced with a non-standard posterior distribution
π(θ|y) ∝ π(θ)L(θ|y)
the standard solution is to use simulation (Monte Carlo) to
produce a sample
θ1 , . . . , θT
from π(θ|y) (or approximately by Markov chain Monte Carlo
methods)
[Robert & Casella, 2004]
88. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Intractable likelihoods
Cases when the likelihood function f(y|θ) is unavailable (in
analytic and numerical senses) and when the completion step
f(y|θ) = ∫_Z f(y, z|θ) dz
is impossible or too costly because of the dimension of z
⇒ MCMC cannot be implemented!
89. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Illustration
Phylogenetic tree: in population
genetics, reconstitution of a common
ancestor from a sample of genes via
a phylogenetic tree that is close to
impossible to integrate out
[100 processor days with 4
parameters]
[Cornuet et al., 2009, Bioinformatics]
90. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Illustration
Different possible scenarios; choice of scenario by ABC
demo-genetic inference
Genetic model of evolution from a
common ancestor (MRCA)
characterized by a set of parameters
that cover historical, demographic, and
genetic factors
Dataset of polymorphism (DNA sample)
observed at the present time
Scenario 1a is strongly supported over the others: it argues
for a common origin of the pygmy populations of West Africa
[Verdu et al., 2009]
91. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Illustration
!""#$%&'()*+,(-*.&(/+0$'"1)()&$/+2!,03!
1/+*%*'"4*+56(""4&7()&$/.+.1#+4*.+8-9':*.+
Pygmies population demo-genetics
Pygmies populations: do they
have a common origin? when
and how did they split from
non-pygmies populations? were
there more recent interactions
between pygmies and
non-pygmies populations?
94
94. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
The ABC method
Bayesian setting: target is π(θ)f(x|θ)
When likelihood f(x|θ) not in closed form, likelihood-free rejection
technique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly
simulating
θ' ∼ π(θ) ,   z ∼ f(z|θ') ,
until the auxiliary variable z is equal to the observed value, z = y.
[Tavaré et al., 1997]
95. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Why does it work?!
The proof is trivial:
f(θ_i) ∝ Σ_{z∈D} π(θ_i) f(z|θ_i) I_y(z)
∝ π(θ_i) f(y|θ_i)
= π(θ_i | y) .
[Accept–Reject 101]
97. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
A as approximative
When y is a continuous random variable, equality z = y is
replaced with a tolerance condition,
ρ(y, z) ≤ ε
where ρ is a distance
Output distributed from
π(θ) P_θ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε)
98. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
ABC algorithm
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
repeat
generate θ' from the prior distribution π(·)
generate z from the likelihood f(·|θ')
until ρ{η(z), η(y)} ≤ ε
set θ_i = θ'
end for
where η(y) defines a (not necessarily sufficient) statistic
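A hedged R sketch of this sampler on a toy problem: the data are 50 draws from a normal with unknown mean θ, the prior is N(0, 10²), the summary statistic η is the sample mean and ρ the absolute difference; every one of these concrete choices (and ε = 0.1) is an assumption made for illustration.
## ABC rejection sampler (toy normal-mean problem)
y <- rnorm(50, mean = 2)                   # observed data (illustrative)
eta_y <- mean(y)                           # summary statistic eta(y)
N <- 1000; eps <- 0.1
theta <- numeric(N)
for (i in 1:N) {
  repeat {
    th <- rnorm(1, 0, 10)                  # theta' ~ prior pi
    z <- rnorm(50, mean = th)              # z ~ f(.|theta')
    if (abs(mean(z) - eta_y) <= eps) break # rho{eta(z), eta(y)} <= eps
  }
  theta[i] <- th
}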
100. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Output
The likelihood-free algorithm samples from the marginal in z of:
π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ,
where A_{ε,y} = {z ∈ D : ρ(η(z), η(y)) < ε}.
The idea behind ABC is that the summary statistics coupled with a
small tolerance should provide a good approximation of the
posterior distribution:
π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|η(y)) .
101. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Pima Indian benchmark
[Figure: Comparison between density estimates of the marginals on β1
(left), β2 (center) and β3 (right) from ABC rejection samples (red) and
MCMC samples (black)]
105. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
ABC advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]
...or by viewing the problem as a conditional density estimation
problem and by developing techniques to allow for a larger ε
[Beaumont et al., 2002]
...or even by including ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
107. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
ABC-MCMC
Markov chain (θ^(t)) created via the transition function
θ^(t+1) = θ'   if θ' ∼ K_ω(θ'|θ^(t)), x ∼ f(x|θ') is such that x = y,
and u ∼ U(0, 1) ≤ π(θ') K_ω(θ^(t)|θ') / π(θ^(t)) K_ω(θ'|θ^(t)) ;
θ^(t+1) = θ^(t)   otherwise,
has the posterior π(θ|y) as stationary distribution
[Marjoram et al, 2003]
108. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
ABC-MCMC (2)
Algorithm 2 Likelihood-free MCMC sampler
Use Algorithm 1 to get (θ^(0), z^(0))
for t = 1 to N do
Generate θ' from K_ω(·|θ^(t−1)),
Generate z' from the likelihood f(·|θ'),
Generate u from U[0,1],
if u ≤ π(θ') K_ω(θ^(t−1)|θ') / π(θ^(t−1)) K_ω(θ'|θ^(t−1)) × I_{A_{ε,y}}(z') then
set (θ^(t), z^(t)) = (θ', z')
else
(θ^(t), z^(t)) = (θ^(t−1), z^(t−1))
end if
end for
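Continuing the toy normal-mean problem above, a minimal R sketch of this sampler with a Gaussian random walk as K_ω (symmetric, so the K_ω terms cancel in the acceptance ratio); the starting value, scale ω and tolerance ε are illustrative assumptions.
## ABC-MCMC sampler (toy normal-mean problem, symmetric K_omega)
y <- rnorm(50, mean = 2); eta_y <- mean(y)
eps <- 0.1; omega <- 0.5; N <- 5000
logprior <- function(th) dnorm(th, 0, 10, log = TRUE)
theta <- numeric(N); theta[1] <- mean(y)   # in practice, an Algorithm 1 draw
for (t in 2:N) {
  th <- theta[t - 1] + omega * rnorm(1)    # theta' ~ K_omega(.|theta^(t-1))
  z <- rnorm(50, mean = th)                # z' ~ f(.|theta')
  ok <- abs(mean(z) - eta_y) <= eps &&     # indicator I_{A_eps,y}(z')
    log(runif(1)) <= logprior(th) - logprior(theta[t - 1])
  theta[t] <- if (ok) th else theta[t - 1]
}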
109. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Sequential Monte Carlo
SMC is a simulation technique to approximate a sequence of
related probability distributions πn with π0 “easy” and πT as
target.
Iterated IS as PMC: particles moved from time n − 1 to time n via
kernel K_n and use of a sequence of extended targets π̃_n
π̃_n(z_{0:n}) = π_n(z_n) ∏_{j=0}^{n−1} L_j(z_{j+1}, z_j)
where the L_j ’s are backward Markov kernels [check that π_n(z_n) is
a marginal]
[Del Moral, Doucet & Jasra, Series B, 2006]
110. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
Sequential Monte Carlo (2)
Algorithm 3 SMC sampler [Del Moral, Doucet & Jasra, Series B,
2006]
sample z_i^(0) ∼ γ_0(x) (i = 1, . . . , N)
compute weights w_i^(0) = π_0(z_i^(0)) / γ_0(z_i^(0))
for t = 1 to T do
if ESS(w^(t−1)) < N_T then
resample N particles z^(t−1) and set weights to 1
end if
generate z_i^(t) ∼ K_t(z_i^(t−1), ·) and set weights to
w_i^(t) = w_i^(t−1) π_t(z_i^(t)) L_{t−1}(z_i^(t), z_i^(t−1)) / [ π_{t−1}(z_i^(t−1)) K_t(z_i^(t−1), z_i^(t)) ]
end for
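For concreteness, an R sketch of one standard instance of this sampler (not the exact setting of the slides): the bridge runs from π_0 = N(0, 4²) to π_T = N(0, 1), K_t is a π_t-invariant random-walk Metropolis kernel and L_{t−1} its time reversal, in which case the incremental weight reduces to π_t(z^(t−1))/π_{t−1}(z^(t−1)) and is applied before the resampling and move steps; all targets and tuning constants are assumptions.
## SMC sampler sketch: bridging from N(0, 4^2) to N(0, 1)
ess <- function(w) sum(w)^2 / sum(w^2)         # effective sample size
N <- 1000; Tmax <- 10; NT <- N / 2
s <- seq(4, 1, length.out = Tmax + 1)          # scales of the bridging targets
z <- rnorm(N, 0, s[1]); w <- rep(1, N)         # gamma_0 = pi_0: equal weights
for (t in 1:Tmax) {
  # incremental weight pi_t(z^(t-1)) / pi_(t-1)(z^(t-1)) (reversal choice of L)
  w <- w * dnorm(z, 0, s[t + 1]) / dnorm(z, 0, s[t])
  if (ess(w) < NT) {                           # resample and reset the weights
    z <- sample(z, N, replace = TRUE, prob = w); w <- rep(1, N)
  }
  zprop <- z + 0.5 * rnorm(N)                  # K_t: Metropolis move for pi_t
  accept <- log(runif(N)) < dnorm(zprop, 0, s[t + 1], log = TRUE) -
    dnorm(z, 0, s[t + 1], log = TRUE)
  z <- ifelse(accept, zprop, z)
}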
111. MCMC and likelihood-free methods
Approximate Bayesian computation
ABC basics
ABC-SMC
[Del Moral, Doucet & Jasra, 2009]
True derivation of an SMC-ABC algorithm
Use of a kernel K_n associated with target π_{ε_n} and derivation of the
backward kernel
L_{n−1}(z, z') = π_{ε_n}(z') K_n(z', z) / π_{ε_n}(z)
Update of the weights
w_{i,n} ∝ w_{i,(n−1)} Σ_{m=1}^M I_{A_{ε_n}}(x_{i,n}^m) / Σ_{m=1}^M I_{A_{ε_{n−1}}}(x_{i,(n−1)}^m)
when x_{i,n}^m ∼ K(x_{i,(n−1)}, ·)