This document provides an introduction to advanced Markov chain Monte Carlo (MCMC) methods. It begins with a motivating example using mixture models that have latent variables, making the likelihood intractable. This introduces challenges for Bayesian computation. The document then describes the Metropolis-Hastings algorithm, which allows generating samples from a target distribution using an ergodic Markov chain, even when direct sampling is impossible. Several extensions and properties of the Metropolis-Hastings algorithm are discussed.
This document provides an overview of Markov chain Monte Carlo (MCMC) methods. It begins with motivations for using MCMC, such as computational difficulties that arise in models with latent variables like mixture models. It then discusses likelihood-based and Bayesian approaches, noting limitations of maximum likelihood methods. Conjugate priors are described that allow tractable Bayesian inference for some simple models. However, conjugate priors are not available for more complex models, motivating the use of MCMC methods which can approximate integrals and distributions of interest for more complex models.
This document provides an overview of Markov chain Monte Carlo (MCMC) methods. It begins with motivations for using MCMC, such as dealing with latent variable models where the likelihood function is intractable. It then covers random variable generation techniques before introducing the key MCMC algorithms: the Metropolis-Hastings algorithm and the Gibbs sampler. The document outlines the remaining topics to be covered, which include Monte Carlo integration, notions of Markov chains, and further advanced topics.
Those are the slides for my Master course on Monte Carlo Statistical Methods given in conjunction with the Monte Carlo Statistical Methods book with George Casella.
This document discusses Markov chain Monte Carlo (MCMC) methods. It begins with an outline of the Metropolis-Hastings algorithm, which is a generic MCMC method for obtaining a sequence of random samples from a probability distribution when direct sampling is difficult. The document then provides details on the Metropolis-Hastings algorithm, including its convergence properties. It also discusses the independent Metropolis-Hastings algorithm as a special case and provides an example to illustrate it.
This document discusses computational issues in Bayesian statistics and introduces Markov chain Monte Carlo (MCMC) methods. It covers the following key points:
MCMC methods like the Metropolis-Hastings algorithm and Gibbs sampler are introduced as techniques to sample from posterior distributions when direct calculation is intractable, as is often the case with latent variable models and mixture models. Latent variable and mixture models are provided as examples where the likelihood function contains too many terms to compute directly, necessitating approximate Bayesian computational methods. The autoregressive (AR) model is also presented as involving intractable likelihoods due to the inclusion of multiple lag terms that increase the number of parameters and terms to compute as the lag length grows.
This document discusses Markov chain Monte Carlo (MCMC) and likelihood-free methods. It begins with an introduction to computational issues in Bayesian statistics and an overview of MCMC methods like the Metropolis-Hastings algorithm and Gibbs sampler. Latent variable models are discussed as an example where computation becomes difficult due to integration over latent variables. Mixture models are provided as a specific example where the likelihood involves a prohibitive number of terms for direct computation in large samples.
Monte Carlo Simulations, Sampling and Markov Chain Monte Carlo (Xin-She Yang)
Pseudorandom
The document discusses Monte Carlo methods and Markov chain Monte Carlo (MCMC). It provides examples of using Monte Carlo simulations to estimate pi and solve Buffon's needle problem. It also discusses random walks in Markov chains, the PageRank algorithm used by Google, and challenges with high-dimensional integrals and distributions that do not have a closed-form inverse. MCMC methods are presented as a way to address these challenges.
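The pi estimate mentioned in the summary above can be sketched in a few lines. A minimal illustration (the function name and sample count are our own, not from the deck): sample points uniformly in the unit square and count the fraction landing inside the quarter circle, whose area is pi/4.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi by uniform sampling in the unit square and counting
    the fraction of points inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (points inside) / (total points) approximates pi/4
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # close to 3.14159 for large n
```

The error of this estimator shrinks as 1/sqrt(n), which is exactly the slow convergence that motivates the variance-reduction and MCMC techniques discussed elsewhere on this page.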
The document provides an introduction to Markov Chain Monte Carlo (MCMC) methods. It discusses using MCMC to sample from distributions when direct sampling is difficult. Specifically, it introduces Gibbs sampling and the Metropolis-Hastings algorithm. Gibbs sampling updates variables one at a time based on their conditional distributions. Metropolis-Hastings proposes candidate samples and accepts or rejects them to converge to the target distribution. The document provides examples and outlines the algorithms to construct Markov chains that sample distributions of interest.
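The Gibbs update described above ("one variable at a time from its conditional distribution") can be made concrete with a standard toy example, not taken from the deck itself: a bivariate normal with correlation rho, where each full conditional is N(rho * other, 1 - rho^2).

```python
import random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for a bivariate standard normal with correlation rho.
    Each coordinate is redrawn from its conditional N(rho * other, 1 - rho^2)."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)   # draw x | y
        y = rng.gauss(rho * x, sd)   # draw y | x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(0.8, 50_000)
xs = [s[0] for s in samples[1_000:]]  # discard burn-in
print(sum(xs) / len(xs))  # sample mean of x, near 0
```

Because the conditionals are exact, every draw is accepted; the cost is serial dependence between successive samples, which is why a burn-in portion is discarded.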
This document provides an introduction to Bayesian analysis and Metropolis-Hastings Markov chain Monte Carlo (MCMC). It explains the foundations of Bayesian analysis and how MCMC sampling methods like Metropolis-Hastings can be used to draw samples from posterior distributions that are intractable. The Metropolis-Hastings algorithm works by constructing a Markov chain with the target distribution as its stationary distribution. The document provides an example of using MCMC to perform linear regression in a Bayesian framework.
Why should you care about Markov Chain Monte Carlo methods?
→ They are in the list of "Top 10 Algorithms of 20th Century"
→ They allow you to make inference with Bayesian Networks
→ They are used everywhere in Machine Learning and Statistics
Markov Chain Monte Carlo methods are a class of algorithms used to sample from complicated distributions. Typically, this is the case of posterior distributions in Bayesian Networks (Belief Networks).
These slides cover the following topics.
→ Motivation and Practical Examples (Bayesian Networks)
→ Basic Principles of MCMC
→ Gibbs Sampling
→ Metropolis–Hastings
→ Hamiltonian Monte Carlo
→ Reversible-Jump Markov Chain Monte Carlo
Short course at CIRM, Bayesian Masterclass, October 2018 (Christian Robert)
Markov Chain Monte Carlo (MCMC) methods generate dependent samples from a target distribution using a Markov chain. The Metropolis-Hastings algorithm constructs a Markov chain with a desired stationary distribution by proposing moves to new states and accepting or rejecting them probabilistically. The algorithm is used to approximate integrals that are difficult to compute directly. It has been shown to converge to the target distribution as the number of iterations increases.
The document discusses computational methods for Bayesian statistics when direct simulation from the target distribution is not possible or efficient. It introduces Markov chain Monte Carlo (MCMC) methods, including the Metropolis-Hastings algorithm and Gibbs sampler, which generate dependent samples that approximate the target distribution. The Metropolis-Hastings algorithm uses a proposal distribution to randomly walk through the parameter space. Approximate Bayesian computation (ABC) is also introduced as a method that approximates the posterior distribution when the likelihood is intractable.
International Conference on Monte Carlo techniques
Closing conference of thematic cycle
Paris July 5-8th 2016
Campus les Cordeliers
Jere Koskela's slides
Markov chain Monte Carlo methods and some attempts at parallelizing them (Pierre Jacob)
Markov chain Monte Carlo (MCMC) methods are commonly used to approximate properties of target probability distributions. However, MCMC estimators are generally biased for any fixed number of samples. The document discusses various techniques for constructing unbiased estimators from MCMC output, including regeneration, sequential Monte Carlo samplers, and coupled Markov chains. Specifically, running two Markov chains in parallel and taking the difference in their values at meeting times can yield an unbiased estimator, though certain conditions must hold.
Differential analyses of structures in Hi-C data (tuxette)
When Hi-C matrices are collected from two different conditions, methods can compare the matrices to identify regions with significant structural differences between conditions. TADpole and TADcompare are two available methods. TADpole represents hierarchical TAD structures and detects differences by computing a difference index between normalized binarized matrices. TADcompare represents Hi-C matrices as networks and uses the eigenvectors of the graph Laplacian and gap scores to define boundaries and detect differential boundaries between conditions. Both methods were shown to recover known breakpoints and have boundaries enriched for biological marks.
Slides of Richard Everitt's presentation
This document summarizes a talk given by Heiko Strathmann on using partial posterior paths to estimate expectations from large datasets without full posterior simulation. The key ideas are:
1. Construct a path of "partial posteriors" by sequentially adding mini-batches of data and computing expectations over these posteriors.
2. "Debias" the path of expectations to obtain an unbiased estimator of the true posterior expectation using a technique from stochastic optimization literature.
3. This approach allows estimating posterior expectations with sub-linear computational cost in the number of data points, without requiring full posterior simulation or imposing restrictions on the likelihood.
Experiments on synthetic and real-world examples demonstrate competitive performance versus standard MCMC.
A short and naive introduction to using network in prediction models (tuxette)
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
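The Laplacian properties summarized above can be shown with a small sketch, not taken from the slides themselves: the unnormalized Laplacian L = D - A has zero row sums (so constant vectors lie in its null space), and the quadratic form x^T L x equals the sum of squared differences across edges, which is the smoothness penalty used when regularizing a linear model with the Laplacian.

```python
def graph_laplacian(n_nodes, edges):
    """Unnormalized graph Laplacian L = D - A for an undirected graph."""
    L = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        L[i][i] += 1   # degree contributions on the diagonal
        L[j][j] += 1
        L[i][j] -= 1   # minus adjacency off the diagonal
        L[j][i] -= 1
    return L

def smoothness(L, x):
    """x^T L x = sum over edges (i, j) of (x_i - x_j)^2."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2
L = graph_laplacian(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
print([sum(row) for row in L])            # [0, 0, 0, 0]: rows sum to zero
print(smoothness(L, [1.0, 1.0, 1.0, 1.0]))  # 0.0: constant signals unpenalized
```

A signal that varies little across edges gets a small penalty, which is how the regularizer encourages similar contributions from connected nodes.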
From RNN to neural networks for cyclic undirected graphs (tuxette)
This document discusses different neural network methods for processing graph-structured data. It begins by describing recurrent neural networks (RNNs) and their limitations for graphs, such as an inability to handle undirected or cyclic graphs. It then summarizes two alternative approaches: one that uses contraction maps to allow recurrent updates on arbitrary graphs, and one that employs a constructive architecture with frozen neurons to avoid issues with cycles. Both methods aim to make predictions at the node or graph level on relational data like molecules or web pages.
Sequential quasi-Monte Carlo (SQMC) is a quasi-Monte Carlo (QMC) version of sequential Monte Carlo (or particle filtering), a popular class of Monte Carlo techniques used to carry out inference in state space models. In this talk I will first review the SQMC methodology as well as some theoretical results. Although SQMC converges faster than the usual Monte Carlo error rate, its performance deteriorates quickly as the dimension of the hidden variable increases. However, I will show with an example that SQMC may perform well for some "high" dimensional problems. I will conclude this talk with some open problems and potential applications of SQMC in complicated settings.
The Metropolis-Hastings algorithm is an MCMC method for obtaining a sequence of samples from a probability distribution when direct sampling is difficult. It constructs a Markov chain that has the desired target distribution as its stationary distribution. At each step, a candidate sample is generated and either accepted, replacing the current state, or rejected, keeping the current state. The acceptance ratio is determined by the ratio of probabilities of the candidate and current states. The algorithm is a generalization of the Metropolis algorithm that allows for non-symmetric proposal distributions. When the chain satisfies ergodicity conditions, the sample distribution will converge to the target distribution as the number of samples increases.
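The accept/reject step described above can be sketched as a random-walk sampler. This is a minimal illustration with names of our choosing, not code from any of the decks; with a symmetric Gaussian proposal, the acceptance ratio reduces to the ratio of target densities (the Metropolis special case), so the target only needs to be known up to a normalizing constant.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.
    log_target is the log of the (unnormalized) target density."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_iter):
        candidate = x + rng.gauss(0.0, step)
        # Accept with probability min(1, pi(candidate) / pi(x))
        log_alpha = log_target(candidate) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = candidate        # accept the proposed move
        # otherwise the current state is repeated in the chain
        chain.append(x)
    return chain

# Target: standard normal, known only up to a constant
chain = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 50_000)
kept = chain[5_000:]  # discard burn-in
print(sum(kept) / len(kept))                   # near 0
print(sum(v * v for v in kept) / len(kept))    # near 1
```

Rejected proposals repeat the current state, which is what makes the stationary distribution correct; the chain is dependent, so its effective sample size is smaller than its length.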
The document summarizes a talk given by Mark Girolami on manifold Monte Carlo methods. It discusses using stochastic diffusions and geometric concepts to improve MCMC methods. Specifically, it proposes using discretized Langevin and Hamiltonian diffusions across a Riemann manifold as an adaptive proposal mechanism. This is founded on deterministic geodesic flows on the manifold. Examples presented include a warped bivariate Gaussian, Gaussian mixture model, and log-Gaussian Cox process.
On Meme Self-Adaptation in Spatially-Structured Multimemetic Algorithms (Rafael Nogueras)
This document summarizes a paper that examines meme self-adaptation in spatially-structured multimemetic algorithms. It introduces key concepts like memes, memetic algorithms, and multimemetic algorithms. It then describes the model used, which represents memes as rewriting rules of variable length and uses a spatial structure with neighborhoods. The document outlines the experimental setup, benchmark problems, and presents results showing that the spatially-structured approach finds better solutions and the optimum more often than a panmictic approach.
This document discusses macrocanonical models for texture synthesis. It begins by introducing the goal of texture synthesis and providing a brief history. It then describes the parametric question of combining randomness and structure in images. Specifically, it discusses maximizing entropy under geometric constraints. The document goes on to discuss links to statistical physics, defining microcanonical and macrocanonical models. It focuses on studying the macrocanonical model, describing how to find optimal parameters through gradient descent and how to sample from the model using Langevin dynamics. The document provides examples of texture synthesis and compares results to other methods.
Supervised Planetary Unmixing with Optimal Transport (Sina Nakhostin)
This document presents a supervised planetary unmixing method using optimal transport. It introduces using the Wasserstein distance as a metric for comparing spectral signatures, which is defined over probability distributions and can account for shifts in frequency. The method formulates unmixing as an optimization problem that matches spectra to a dictionary using the Wasserstein distance, while also incorporating an abundance prior. Preliminary experiments on Vesta asteroid data show the abundance maps produced with this optimal transport approach.
1. The document presents Plug-and-Play priors for Bayesian imaging using Langevin-based sampling methods.
2. It introduces the Bayesian framework for image restoration and discusses challenges in modeling the prior.
3. A Plug-and-Play approach is proposed that uses an implicit prior defined by a denoising network in conjunction with Langevin sampling, termed PnP-ULA. Experiments demonstrate its effectiveness on image deblurring and inpainting tasks.
1. Rao-Blackwellisation can be applied to any Hastings-Metropolis algorithm to produce a more efficient Markov chain Monte Carlo (MCMC) method.
2. It works by breaking the state space into two components and analytically integrating over one of the components to reduce variance.
3. This approach takes advantage of parallel computing capabilities like GPUs in a basic way.
This document discusses computational issues that arise in Bayesian statistics. It provides examples of latent variable models like mixture models that make computation difficult due to the large number of terms that must be calculated. It also discusses time series models like the AR(p) and MA(q) models, noting that they have complex parameter spaces due to stationarity constraints. The document outlines the Metropolis-Hastings algorithm, Gibbs sampler, and other methods like Population Monte Carlo and Approximate Bayesian Computation that can help address these computational challenges.
This document provides information about a computational stochastic processes course, including lecture details, prerequisites, syllabus, and examples. The key points are:
- Lectures will cover Monte Carlo simulation, stochastic differential equations, Markov chain Monte Carlo methods, and inference for stochastic processes.
- Prerequisites include probability, stochastic processes, and programming.
- Assessments will include a coursework and exam. The coursework will involve computational problems in Python, Julia, R, or similar languages.
- Motivating examples discussed include using Monte Carlo methods to evaluate high-dimensional integrals and simulating Langevin dynamics in statistical physics.
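The Langevin dynamics mentioned in the last bullet can be sketched with the unadjusted Langevin algorithm (ULA), an Euler discretization of the overdamped Langevin SDE dX = (1/2) grad log pi(X) dt + dW. This is a generic illustration under our own naming, not material from the course itself.

```python
import math
import random

def ula(grad_log_target, x0, n_iter, h=0.1, seed=0):
    """Unadjusted Langevin algorithm: Euler step of the overdamped
    Langevin diffusion whose stationary law is the target pi."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_iter):
        # drift toward high-density regions plus Gaussian noise
        x = x + 0.5 * h * grad_log_target(x) + math.sqrt(h) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain

# Target: standard normal, so grad log pi(x) = -x
chain = ula(lambda x: -x, 0.0, 50_000, h=0.1)
kept = chain[1_000:]
print(sum(kept) / len(kept))  # near 0
```

Because the discretization is never corrected by an accept/reject step, ULA carries an O(h) bias in its stationary distribution; adding a Metropolis-Hastings correction yields the Metropolis-adjusted Langevin algorithm (MALA).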
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Varad Meru
Max-product message passing algorithms are commonly used for MAP inference in MRFs. Recent work showed these algorithms can be viewed as performing block coordinate descent in a dual objective. However, existing algorithms are limited by the restricted ways they select blocks to update. The paper proposes a "Subproblem-Tree Calibration" framework that subsumes MPLP, MSD, and TRW-S as special cases and allows more flexible block selection. The algorithm represents the problem as a subproblem multi-graph and calibrates potentials on randomly selected subproblem trees via message passing, achieving dual optimality with respect to the tree's block of variables. Experimental results show the approach converges to different dual objectives than existing methods.
Markov Chain Monte Carlo (MCMC) methods use Markov chains to sample from probability distributions for use in Monte Carlo simulations. The Metropolis-Hastings algorithm proposes transitions to new states in the chain and either accepts or rejects those states based on a probability calculation, allowing it to sample from complex, high-dimensional distributions. The Gibbs sampler is a special case of MCMC where each variable is updated conditional on the current values of the other variables, ensuring all proposed moves are accepted. These MCMC methods allow approximating integrals that are difficult to compute directly.
An investigation of inference of the generalized extreme value distribution b...Alexander Decker
This document presents an investigation of parameter estimation for the generalized extreme value distribution based on record values. Maximum likelihood estimation is used to estimate the parameters β (scale parameter) and ξ (shape parameter). Likelihood equations are derived and solved numerically. Bootstrap and Markov chain Monte Carlo methods are proposed to construct confidence intervals for the parameters since intervals based on asymptotic normality may not perform well due to small sample sizes of records. Bayesian estimation of the parameters using MCMC is also investigated. An illustrative example involving simulated records is provided.
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
The document proposes a delayed acceptance method for accelerating Metropolis-Hastings algorithms. It begins with a motivating example of non-informative inference for mixture models where computing the prior density is costly. It then introduces the delayed acceptance approach which splits the acceptance probability into pieces that are evaluated sequentially, avoiding computing the full acceptance ratio each time. It validates that the delayed acceptance chain is reversible and provides bounds on its spectral gap and asymptotic variance compared to the original chain. Finally, it discusses optimizing the delayed acceptance approach by considering the expected square jump distance and cost per iteration to maximize efficiency.
Stability of adaptive random-walk Metropolis algorithmsBigMC
The document discusses adaptive MCMC algorithms and their stability. It introduces the stochastic approximation framework that is commonly used to construct adaptive MCMC algorithms. It then discusses issues with stability as the adaptive parameters are updated, and how enforced stability or adaptive reprojections can help address this. Finally, it provides examples of the adaptive Metropolis algorithm and adaptive scaling Metropolis algorithm, which aim to automatically tune the proposal distribution scale parameter.
This document outlines the key concepts and objectives for understanding vibration of continuous structures like strings and cables. It derives the wave equation for a string under tension as a second order PDE. It then shows how to solve for natural frequencies and mode shapes by separating variables. Examples are given of the first mode shape and calculating the natural frequency of a piano wire. The assignment asks students to solve cable problems with different boundary conditions.
Approximation in Stochastic Integer ProgrammingSSA KPI
This document discusses approximation algorithms for stochastic integer programming problems. It begins by introducing stochastic programming models, including recourse models and hierarchical planning models. It describes the mathematical properties of continuous and mixed-integer recourse models, noting that mixed-integer recourse problems are harder than continuous recourse and most combinatorial optimization problems. The document focuses on studying approximation algorithms for stochastic integer programming that are similar in nature to approximations for combinatorial optimization problems.
Multi Model Ensemble (MME) predictions are a popular ad-hoc technique for improving predictions of high-dimensional, multi-scale dynamical systems. The heuristic idea behind MME framework is simple: given a collection of models, one considers predictions obtained through the convex superposition of the individual probabilistic forecasts in the hope of mitigating model error. However, it is not obvious if this is a viable strategy and which models should be included in the MME forecast in order to achieve the best predictive performance. I will present an information-theoretic approach to this problem which allows for deriving a sufficient condition for improving dynamical predictions within the MME framework; moreover, this formulation gives rise to systematic and practical guidelines for optimising data assimilation techniques which are based on multi-model ensembles. Time permitting, the role and validity of “fluctuation-dissipation” arguments for improving imperfect predictions of externally perturbed non-autonomous systems - with possible applications to climate change considerations - will also be addressed.
1. Rao-Blackwellisation can be applied to any Hastings-Metropolis algorithm to produce a more efficient Markov chain Monte Carlo (MCMC) method.
2. It works by breaking the state space into two components and analytically integrating over one of the components to reduce variance.
3. This approach takes advantage of parallel computing capabilities like GPUs in a basic way.
Stratified sampling and resampling for approximate Bayesian computationUmberto Picchini
Stratified Monte Carlo is proposed as a method to accelerate ABC-MCMC by reducing its computational cost. It involves partitioning the summary statistic space into strata and estimating the ABC likelihood using a stratified Monte Carlo approach based on resampling. This reduces the variance compared to using a single resampled dataset, without introducing significant bias as resampling alone would. The method is tested on a simple Gaussian example where it provides a posterior approximation closer to the true posterior than standard ABC-MCMC.
This document provides an overview of linear models for classification. It discusses discriminant functions including linear discriminant analysis and the perceptron algorithm. It also covers probabilistic generative models that model class-conditional densities and priors to estimate posterior probabilities. Probabilistic discriminative models like logistic regression directly model posterior probabilities using maximum likelihood. Iterative reweighted least squares is used to optimize logistic regression since there is no closed-form solution.
This document summarizes a talk given by Mark Girolami on manifold Monte Carlo methods. The talk discusses using concepts from Riemannian geometry to improve Markov chain Monte Carlo (MCMC) methods. It presents manifold Langevin and Hamiltonian Monte Carlo as methods that use stochastic diffusions and deterministic geodesic flows on a manifold to propose moves in MCMC. Examples applying these methods to warped distributions, Gaussian mixtures, and log-Gaussian Cox processes are also discussed. The goal is to develop more efficient MCMC techniques by exploiting the geometric structure of target distributions.
This document discusses likelihood methods for continuous-time models in finance. It describes approximating the transition density function pX of a continuous-time process through a series of transformations to get closer to a normal distribution. This allows representing pX as a series expansion involving Hermite polynomials. Computing the expansion coefficients allows obtaining an explicit closed-form approximation to pX. Maximizing the approximate likelihood results in an estimator that converges to the true MLE as the number of terms increases.
ABC with data cloning for MLE in state space modelsUmberto Picchini
An application of the "data cloning" method for parameter estimation via MLE aided by Approximate Bayesian Computation. The relevant paper is http://arxiv.org/abs/1505.06318
Characterization of Subsurface Heterogeneity: Integration of Soft and Hard In...Amro Elfeki
Park, E., Elfeki, A. M. M., Dekking, F.M. (2003). Characterization of subsurface heterogeneity: Integration of soft and hard information using multi-dimensional Coupled Markov chain approach. Underground Injection Science and Technology Symposium, Lawrence Berkeley National Lab., October 22-25, 2003. p.49. Eds. Tsang, Chin.-Fu and Apps, John A.
http://www.lbl.gov/Conferences/UIST/index.html#topics
Research internship on optimal stochastic theory with financial application u...Asma Ben Slimene
This is a presntation of my second year intership on optimal stochastic theory and how we can apply it on some financial application then how we can solve such problems using finite differences methods!
Enjoy it !
Similar to Introduction to advanced Monte Carlo methods (20)
This document discusses differentially private distributed Bayesian linear regression with Markov chain Monte Carlo (MCMC) methods. It proposes adding noise to the summaries (S) and coefficients (z) of local linear regression models on different devices to provide differential privacy. Gibbs sampling is used to simulate the genuine posterior distribution over the linear model parameters (theta, sigma_y, Sigma_x, z1:J, S1:J) in a distributed manner while maintaining privacy. Alternative approaches like exploiting approximate posteriors from all devices or learning iteratively are also mentioned.
This document discusses mixture models and approximations to computing model evidence. It contains:
1) An overview of mixtures of distributions and common priors used for mixtures.
2) Approximations to computing marginal likelihoods or model evidence using Chib's representation and Rao-Blackwellization. Permutations are used to address label switching issues.
3) Methods for more efficient sampling for computing model evidence, including iterative bridge sampling and dual importance sampling with approximations to reduce the number of permutations considered.
Sequential Monte Carlo is also briefly mentioned as an alternative approach.
This document describes the adaptive restore algorithm, a non-reversible Markov chain Monte Carlo method. It begins with an overview of the restore process, which takes regenerations from an underlying diffusion or jump process to construct a reversible Markov chain with a target distribution. The adaptive restore process enriches this by allowing the regeneration distribution to adapt over time. It converges almost surely to the minimal regeneration distribution. Parameters like the initial regeneration distribution and rates are discussed. Examples are provided for the adaptive Brownian restore algorithm and calibrating the parameters.
This document summarizes techniques for approximating marginal likelihoods and Bayes factors, which are important quantities in Bayesian inference. It discusses Geyer's 1994 logistic regression approach, links to bridge sampling, and how mixtures can be used as importance sampling proposals. Specifically, it shows how optimizing the logistic pseudo-likelihood relates to the bridge sampling optimal estimator. It also discusses non-parametric maximum likelihood estimation based on simulations.
This document discusses Bayesian restricted likelihood methods for situations where the likelihood cannot be fully trusted. It presents several approaches including empirical likelihood, Bayesian empirical likelihood, using insufficient statistics, approximate Bayesian computation (ABC), and MCMC on manifolds. The key ideas are developing Bayesian tools that are robust to model misspecification by questioning the likelihood, prior, and other assumptions.
This document discusses various methods for approximating marginal likelihoods and Bayes factors, including:
1. Geyer's 1994 logistic regression approach for approximating marginal likelihoods using importance sampling.
2. Bridge sampling and its connection to Geyer's approach. Optimal bridge sampling requires knowledge of unknown normalizing constants.
3. Using mixtures of importance distributions and the target distribution as proposals to estimate marginal likelihoods through Rao-Blackwellization. This connects to bridge sampling estimates.
4. The document discusses various methods for approximating marginal likelihoods and comparing hypotheses using Bayes factors. It outlines the historical development and connections between different approximation techniques.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
The document summarizes Approximate Bayesian Computation (ABC). It discusses how ABC provides a way to approximate Bayesian inference when the likelihood function is intractable or too computationally expensive to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data according to a distance measure and tolerance level. Key points discussed include:
- ABC provides an approximation to the posterior distribution by sampling from simulations that fall within a tolerance of the observed data.
- Summary statistics are often used to reduce the dimension of the data and improve the signal-to-noise ratio when applying the tolerance criterion.
- Random forests can help select informative summary statistics and provide semi-automated ABC
This document describes a new method called component-wise approximate Bayesian computation (ABCG or ABC-Gibbs) that combines approximate Bayesian computation (ABC) with Gibbs sampling. ABCG aims to more efficiently explore parameter spaces when the number of parameters is large. It works by alternately sampling each parameter from its ABC-approximated conditional distribution given current values of other parameters. The document provides theoretical analysis showing ABCG converges to a stationary distribution under certain conditions. It also presents examples demonstrating ABCG can better separate estimates from the prior compared to simple ABC, especially for hierarchical models.
ABC stands for approximate Bayesian computation. It is a method for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC produces samples from an approximate posterior distribution by simulating parameter and summary statistic values that match the observed summary statistics within a tolerance level. The choice of summary statistics is important but difficult, as there is typically no sufficient statistic. Several strategies have been developed for selecting good summary statistics, including using random forests or the Lasso to evaluate and select from a large set of potential summaries.
The document describes a new method called component-wise approximate Bayesian computation (ABC) that combines ABC with Gibbs sampling. It aims to improve ABC's ability to efficiently explore parameter spaces when the number of parameters is large. The method works by alternating sampling from each parameter's ABC posterior conditional distribution given current values of other parameters and the observed data. The method is proven to converge to a stationary distribution under certain assumptions, especially for hierarchical models where conditional distributions are often simplified. Numerical experiments on toy examples demonstrate the method can provide a better approximation of the true posterior than vanilla ABC.
1) Likelihood-free Bayesian experimental design is discussed as an intractable likelihood optimization problem, where the goal is to find the optimal design d that minimizes expected loss without using the full posterior distribution.
2) Several Bayesian tools are proposed to make the design problem more Bayesian, including Bayesian non-parametrics, annealing algorithms, and placing a posterior on the design d.
3) Gaussian processes are a default modeling choice for complex unknown functions in these problems, but their accuracy is difficult to assess and they may incur a dimension curse.
1. An introduction to advanced (?) MCMC methods
An introduction to advanced (?) MCMC methods
Christian P. Robert
Université Paris-Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
Royal Statistical Society, October 13, 2010
2. An introduction to advanced (?) MCMC methods
Motivating example
Motivating example
1 Motivating example
2 The Metropolis-Hastings Algorithm
3. An introduction to advanced (?) MCMC methods
Motivating example
Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models
f(x|θ) = ∫ f⋆(x, x⋆|θ) dx⋆
If (x, x⋆) is observed, fine!
If only x is observed, trouble!
6. An introduction to advanced (?) MCMC methods
Motivating example
Example (Mixture models)
Models of mixtures of distributions:
X ∼ fj with probability pj,
for j = 1, 2, . . . , k, with overall density
X ∼ p1 f1(x) + · · · + pk fk(x).
For a sample of independent random variables (X1, . . . , Xn), the sample density is
∏_{i=1}^n {p1 f1(xi) + · · · + pk fk(xi)}.
Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
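The k^n blow-up only concerns the expanded product; the sample density itself is cheap to evaluate directly. A minimal numpy sketch (Gaussian components and the parameter values are assumed purely for illustration):

```python
import numpy as np

def mixture_log_density(x, weights, means, sds):
    # Evaluate the log sample density of a k-component Gaussian mixture in
    # O(n k) operations: for each observation sum the weighted component
    # densities, then sum the logs -- the k^n expansion is never formed.
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    pdf = np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return float(np.log((weights * pdf).sum(axis=1)).sum())

weights = np.array([0.3, 0.7])
means = np.array([0.0, 2.0])
sds = np.array([1.0, 1.0])
ll = mixture_log_density([-1.0, 0.5, 2.0], weights, means, sds)
```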
10. An introduction to advanced (?) MCMC methods
Motivating example
A typology of Bayes computational problems
(i) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(ii) use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
(iii) use of a huge dataset;
(iv) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
(v) use of a complex inferential procedure, as for instance Bayes factors
B₀₁^π(x) = [P^π(θ ∈ Θ0 | x) / P^π(θ ∈ Θ1 | x)] / [π(θ ∈ Θ0) / π(θ ∈ Θ1)].
15. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
1 Motivating example
2 The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
The Metropolis–Hastings algorithm
A collection of Metropolis-Hastings algorithms
Extensions
Convergence assessment
16. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains
Fact: It is not necessary to use a sample from the distribution f to approximate the integral
I = ∫ h(x) f(x) dx.
We can obtain X1, . . . , Xn ∼ f (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
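The plain Monte Carlo baseline that the Markov-chain construction generalises is simply an average of h over draws from f. A small sketch, with h and f chosen purely for illustration (the slides that follow replace this iid sample with an ergodic chain):

```python
import numpy as np

rng = np.random.default_rng(0)
# Estimate I = integral of h(x) f(x) dx for h(x) = x^2 and f the standard
# normal density, using an ordinary iid Monte Carlo average (true value: 1).
x = rng.standard_normal(100_000)
I_hat = float(np.mean(x ** 2))
```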
18. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f.
This ensures the convergence in distribution of (X(t)) to a random variable from f.
For a “large enough” T0, X(T0) can be considered as distributed from f.
This produces a dependent sample X(T0), X(T0+1), . . ., which is generated from f, sufficient for most approximation purposes.
20. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Problem:
How can one build a Markov chain with a given stationary distribution?
MH basics
An algorithm that converges to the objective (target) density f using an arbitrary transition kernel density q(x, y), called the instrumental (or proposal) distribution.
22. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
The MH algorithm
Algorithm (Metropolis–Hastings)
Given x(t),
1. Generate Yt ∼ q(x(t), y).
2. Take
X(t+1) = Yt with prob. ρ(x(t), Yt),
X(t+1) = x(t) with prob. 1 − ρ(x(t), Yt),
where
ρ(x, y) = min{ [f(y) q(y, x)] / [f(x) q(x, y)], 1 }.
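The two steps above translate directly into code. A minimal one-dimensional sketch (the target and proposal here are stand-ins chosen for illustration; the log scale avoids numerical underflow):

```python
import numpy as np

def metropolis_hastings(log_f, q_sample, log_q, x0, n_iter, rng):
    # Generic Metropolis-Hastings: propose y ~ q(x, .), accept with
    # probability rho(x, y) = min{ f(y) q(y, x) / [f(x) q(x, y)], 1 }.
    chain = np.empty(n_iter + 1)
    chain[0] = x = x0
    for t in range(1, n_iter + 1):
        y = q_sample(x, rng)
        log_rho = (log_f(y) + log_q(y, x)) - (log_f(x) + log_q(x, y))
        if np.log(rng.uniform()) < log_rho:
            x = y
        chain[t] = x
    return chain

rng = np.random.default_rng(1)
chain = metropolis_hastings(
    log_f=lambda x: -0.5 * x ** 2,                  # target: N(0,1), unnormalised
    q_sample=lambda x, rng: x + rng.standard_normal(),
    log_q=lambda x, y: -0.5 * (y - x) ** 2,         # log q(x, y), unnormalised
    x0=0.0, n_iter=20_000, rng=rng)
```

Note that only ratios of f and q enter the acceptance probability, which is why normalising constants can be dropped.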
23. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Features
Independent of normalizing constants for both f and q(x, ·) (i.e., those constants independent of x)
Never moves to values with f(y) = 0
The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure
The sequence (yt)t is usually not a Markov chain
Satisfies the detailed balance condition
f(x) K(x, y) = f(y) K(y, x)
[Green, 1995]
25. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density f.
2. As f is a probability measure, the chain is positive recurrent.
3. If
Pr[ f(Yt) q(Yt, X(t)) / { f(X(t)) q(X(t), Yt) } ≥ 1 ] < 1,   (1)
i.e., if the event {X(t+1) = X(t)} occurs with positive probability, then the chain is aperiodic.
28. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties (2)
4. If
q(x, y) > 0 for every (x, y),   (2)
the chain is irreducible.
5. For M-H, f-irreducibility implies Harris recurrence.
6. Thus, under conditions (1) and (2):
(i) For h with Ef|h(X)| < ∞,
lim_{T→∞} (1/T) ∑_{t=1}^T h(X(t)) = ∫ h(x) f(x) dx   a.e. f.
(ii) and
lim_{n→∞} ‖ ∫ K^n(x, ·) µ(dx) − f ‖_TV = 0
for every initial distribution µ, where K^n(x, ·) denotes the kernel for n transitions.
31. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
The Independent Case
The instrumental distribution q(x, ·) is independent of x and is denoted g.
Algorithm (Independent Metropolis-Hastings)
Given x(t),
1. Generate Yt ∼ g(y)
2. Take
X(t+1) = Yt with prob. min{ [f(Yt) g(x(t))] / [f(x(t)) g(Yt)], 1 },
X(t+1) = x(t) otherwise.
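A minimal sketch of the independent case, with target and proposal chosen purely for illustration. The proposal g is deliberately wider than the target, so f(x) ≤ M g(x) holds on the whole support:

```python
import numpy as np

rng = np.random.default_rng(2)
log_f = lambda x: -0.5 * x ** 2            # target: N(0,1), unnormalised
log_g = lambda x: -0.5 * (x / 2.0) ** 2    # proposal g: N(0, 4), unnormalised

x, chain = 0.0, []
for _ in range(20_000):
    y = 2.0 * rng.standard_normal()        # Yt ~ g, independent of x
    # accept with prob min{ f(y) g(x) / [f(x) g(y)], 1 }
    if np.log(rng.uniform()) < (log_f(y) + log_g(x)) - (log_f(x) + log_g(y)):
        x = y
    chain.append(x)
chain = np.asarray(chain)
```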
33. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Properties
The resulting sample is not iid, but there exist strong convergence properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a constant M such that
f(x) ≤ M g(x),   x ∈ supp f.
In this case,
‖K^n(x, ·) − f‖_TV ≤ (1 − 1/M)^n.
[Mengersen & Tweedie, 1996]
35. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
xt+1 = ϕ xt + ǫt+1,   ǫt ∼ N(0, τ²),
and observables
yt | xt ∼ N(xt², σ²).
The distribution of xt given xt−1, xt+1 and yt is proportional to
exp{ −(1/2τ²) [ (xt − ϕxt−1)² + (xt+1 − ϕxt)² + (τ²/σ²)(yt − xt²)² ] }.
37. An introduction to advanced (?) MCMC methods
The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms
Example (Noisy AR(1) too)
Use for proposal the N(µt, ωt²) distribution, with
µt = ϕ (xt−1 + xt+1) / (1 + ϕ²)   and   ωt² = τ² / (1 + ϕ²).
The ratio
π(x)/q_ind(x) ∝ exp{ −(yt − xt²)² / (2σ²) }
is bounded
(top) Last 500 realisations of the chain {X_k} out of 10,000
iterations; (bottom) histogram of the chain, compared with
the target distribution.
Random walk Metropolis–Hastings
Instead, use a local perturbation as proposal
Y_t = X^{(t)} + ε_t ,
where ε_t ∼ g, independent of X^{(t)}.
The instrumental density is now of the form g(y − x) and the
Markov chain is a random walk if g is symmetric:
g(x) = g(−x)
Algorithm (Random walk Metropolis)
Given x^{(t)}
1 Generate Y_t ∼ g(y − x^{(t)})
2 Take
X^{(t+1)} = Y_t      with prob. min{ 1, f(Y_t)/f(x^{(t)}) },
          = x^{(t)}  otherwise.
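A minimal Python sketch of the random walk Metropolis step; the one-dimensional target f(x) ∝ exp(−x⁴) (used later in these slides) and the Gaussian perturbation are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                       # unnormalized target, f(x) ∝ exp(-x^4)
    return np.exp(-x**4)

def rw_metropolis(n_iter, scale=1.0, x0=0.0):
    chain = np.empty(n_iter)
    accepted = 0
    x = x0
    for i in range(n_iter):
        y = x + scale * rng.standard_normal()   # Y_t = X^(t) + eps_t, eps_t ~ N(0, scale^2)
        if rng.uniform() < min(1.0, f(y) / f(x)):
            x, accepted = y, accepted + 1
        chain[i] = x
    return chain, accepted / n_iter

chain, acc_rate = rw_metropolis(20000, scale=1.0)
```

The `scale` parameter plays the role of the perturbation spread and drives the acceptance rate, as discussed below in the slides.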
Probit illustration
Likelihood and posterior given by
π(β|y, X) ∝ ℓ(β|y, X) ∝ ∏_{i=1}^n Φ(x_i^T β)^{y_i} (1 − Φ(x_i^T β))^{n_i − y_i}
under the flat prior
A random walk proposal works well for a small number of
predictors. Use the maximum likelihood estimate β̂ as starting
value and the asymptotic (Fisher) covariance matrix of the MLE, Σ̂,
as scale
MCMC algorithm
Probit random-walk Metropolis-Hastings
Initialization: Set β^{(0)} = β̂ and compute Σ̂
Iteration t:
1 Generate β̃ ∼ N_{k+1}(β^{(t−1)}, τ Σ̂)
2 Compute
ρ(β^{(t−1)}, β̃) = min{ 1, π(β̃|y) / π(β^{(t−1)}|y) }
3 With probability ρ(β^{(t−1)}, β̃) set β^{(t)} = β̃;
otherwise set β^{(t)} = β^{(t−1)}.
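A Python sketch of this scheme on synthetic probit data; the data, the two-predictor design and the numerical MLE are stand-ins (hypothetical, not the bank benchmark below):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# synthetic probit data (hypothetical stand-in for a real data set)
n, beta_true = 200, np.array([-1.0, 1.5])
X = rng.standard_normal((n, 2))
y = (rng.uniform(size=n) < norm.cdf(X @ beta_true)).astype(float)

def log_post(beta):             # flat prior => log posterior = log likelihood
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# MLE beta_hat and approximate Fisher covariance Sigma_hat (BFGS inverse Hessian)
res = minimize(lambda b: -log_post(b), np.zeros(2))
beta_hat, Sigma_hat = res.x, res.hess_inv

def probit_rw_mh(n_iter, tau=1.0):
    L = np.linalg.cholesky(tau * Sigma_hat)
    beta, lp = beta_hat.copy(), log_post(beta_hat)
    chain = np.empty((n_iter, 2))
    for i in range(n_iter):
        prop = beta + L @ rng.standard_normal(2)   # beta~ ~ N(beta^(t-1), tau * Sigma_hat)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # rho = min(1, pi(prop|y)/pi(beta|y))
            beta, lp = prop, lp_prop
        chain[i] = beta
    return chain

chain = probit_rw_mh(5000)
```

Working on the log scale avoids underflow in the posterior ratio.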
R bank benchmark
Probit modelling with no intercept over the four measurements.
Three different scales τ = 1, 0.1, 10: best mixing behavior is
associated with τ = 1.
Average of the parameters over 9,000 MCMC iterations gives plug-in
estimate
p̂_i = Φ(−1.2193 x_{i1} + 0.9540 x_{i2} + 0.9795 x_{i3} + 1.1481 x_{i4}) .
Example (Mixture models)
π(θ|x) ∝ ∏_{j=1}^n { Σ_{ℓ=1}^k p_ℓ f(x_j |µ_ℓ, σ_ℓ) } π(θ)
Metropolis-Hastings proposal:
θ^{(t+1)} = θ^{(t)} + ωε^{(t)}  if u^{(t)} < ρ^{(t)}
          = θ^{(t)}             otherwise
where
ρ^{(t)} = π(θ^{(t)} + ωε^{(t)} |x) / π(θ^{(t)} |x) ∧ 1
and ω scaled for good acceptance rate
Random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1) and scale 1:
(µ1, µ2) sample shown at iterations 1, 10, 100, 500 and 1000.
Random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1) and scale √.1:
(µ1, µ2) sample shown at iterations 10, 100, 500, 1000, 5000 and 10,000.
Convergence properties
Uniform ergodicity prohibited by random walk structure
At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f , log-concave in the tails, and a positive
and symmetric density g, the chain (X^{(t)}) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
no tail effect
Example (Comparison of tail effects)
Random-walk Metropolis–Hastings algorithms based on a N(0, 1)
instrumental for the generation of (a) a N(0, 1) distribution and
(b) a distribution with density ψ(x) ∝ (1 + |x|)^{−3}.
90% confidence envelopes of the means, derived from 500 parallel
independent chains.
Extensions
There are many other families of MH algorithms
Adaptive Rejection Metropolis Sampling
Reversible Jump
Langevin algorithms
to name just a few...
Langevin Algorithms
Proposal based on the Langevin diffusion L_t, defined by the
stochastic differential equation
dL_t = dB_t + (1/2) ∇ log f(L_t) dt ,
where B_t is the standard Brownian motion
Theorem
The Langevin diffusion is the only non-explosive diffusion which is
reversible with respect to f .
Discretization
Because continuous time cannot be simulated, consider the
discretised sequence
x^{(t+1)} = x^{(t)} + (σ²/2) ∇ log f(x^{(t)}) + σ ε_t ,   ε_t ∼ N_p(0, I_p)
where σ² corresponds to the discretisation step
Example of f(x) = exp(−x⁴): histograms of the discretised chain
against the target density, for σ² = .1, .01, .001 and .0001.
Unfortunately, the discretized chain may be transient, for instance
when
lim_{x→±∞} σ² ∇ log f(x) |x|^{−1} > 1
Example of f(x) = exp(−x⁴) when σ² = .2
MH correction
Accept the new value Y_t with probability
f(Y_t)/f(x^{(t)}) · exp{ −‖Y_t − x^{(t)} − (σ²/2) ∇ log f(x^{(t)})‖² / 2σ² }
                  / exp{ −‖x^{(t)} − Y_t − (σ²/2) ∇ log f(Y_t)‖² / 2σ² } ∧ 1 .
Choice of the scaling factor σ
Should lead to an acceptance rate of 0.574 to achieve optimal
convergence rates (when the components of x are uncorrelated)
[Roberts & Rosenthal, 1998]
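A Python sketch of the resulting Metropolis-adjusted Langevin step (MALA), again with the illustrative target f(x) ∝ exp(−x⁴), for which ∇ log f(x) = −4x³:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_f(x):                   # log target (unnormalized), f(x) ∝ exp(-x^4)
    return -x**4

def grad_log_f(x):
    return -4.0 * x**3

def mala(n_iter, sigma2=0.1, x0=0.0):
    sigma = np.sqrt(sigma2)
    x = x0
    chain = np.empty(n_iter)
    for i in range(n_iter):
        mean_x = x + 0.5 * sigma2 * grad_log_f(x)   # Langevin drift from x^(t)
        y = mean_x + sigma * rng.standard_normal()
        mean_y = y + 0.5 * sigma2 * grad_log_f(y)   # drift from Y_t, for the reverse move
        # log of the MH correction: log f(Y_t) - log f(x) + log q(x|Y_t) - log q(Y_t|x)
        log_ratio = (log_f(y) - log_f(x)
                     - (x - mean_y)**2 / (2 * sigma2)
                     + (y - mean_x)**2 / (2 * sigma2))
        if np.log(rng.uniform()) < log_ratio:
            x = y
        chain[i] = x
    return chain

chain = mala(20000, sigma2=0.1)
```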
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of
view
Most common alternatives:
1 a fully automated algorithm like ARMS;
2 an instrumental density g which approximates f, such that
f /g is bounded for uniform ergodicity to apply;
3 a random walk
In cases (2) and (3), the choice of g is critical.
Case of the random walk
Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly: it may simply mean that the random walk is moving
too slowly on the surface of f.
If x^{(t)} and y_t are close, i.e. f(x^{(t)}) ≃ f(y_t), then y_t is accepted
with probability
min{ f(y_t)/f(x^{(t)}), 1 } ≃ 1 .
For multimodal densities with well separated modes, the negative
effect of limited moves on the surface of f clearly shows.
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f (yt )
tend to be small compared with f (x(t) ), which means that the
random walk moves quickly on the surface of f since it often
reaches the “borders” of the support of f
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the
random walk never jumps to the other mode. But if the scale ω is
sufficiently large, the Markov chain explores both modes and gives a
satisfactory approximation of the target distribution.
Markov chain based on a random walk with scale ω = .1
Markov chain based on a random walk with scale ω = .5
Where do we stand?
MCMC in a nutshell:
Running a sequence X_{t+1} = Ψ(X_t, Y_t) provides an approximation
to the target density f when the detailed balance condition holds:
f(x)K(x, y) = f(y)K(y, x)
Easiest implementation of the principle is random walk
Metropolis-Hastings:
Y_t = X^{(t)} + ε_t
Practical convergence requires sufficient energy from the
proposal, which is calibrated by trial and error.
Convergence assessment
Convergence diagnostics
How many iterations?
Rule # 1 There is no absolute number of simulations, i.e.
1,000 is neither large, nor small.
Rule # 2 It takes [much] longer to check for convergence
than for the chain itself to converge.
Rule # 3 MCMC is a “what-you-get-is-what-you-see”
algorithm: it fails to tell about unexplored parts of the space.
Rule # 4 When in doubt, run MCMC chains in parallel and
check for consistency.
Many “quick-&-dirty” solutions in the literature, but not
necessarily 100% trustworthy.
Example (Bimodal target)
f(x) = exp(−x²/2) {4(x − .3)² + .01} / ( √(2π) {4(1 + (.3)²) + .01} )
and use of random walk Metropolis–Hastings algorithm with
variance .04
Evaluation of the missing mass by
Σ_{t=1}^{T−1} [θ^{(t+1)} − θ^{(t)}] f(θ^{(t)})
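A Python sketch of this Riemann-sum check; sorting the chain into its order statistics is how the Philippe & Robert estimator is usually computed (the demo sample is synthetic, not the bimodal example above):

```python
import numpy as np

def riemann_mass(chain, f):
    """Riemann sum over the ordered sample of [theta_(t+1) - theta_(t)] f(theta_(t)).

    Approaches the total mass 1 as the chain covers the support of f;
    the shortfall estimates the missing mass."""
    s = np.sort(np.asarray(chain, dtype=float))
    return np.sum(np.diff(s) * f(s[:-1]))

# demo on a synthetic sample from a standard normal density
rng = np.random.default_rng(6)
demo = rng.standard_normal(5000)
phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
mass = riemann_mass(demo, phi)
```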
Sequence [in blue] and mass evaluation [in brown]
[Philippe & Robert, 2001]
Effective sample size
How many iid simulations from π are equivalent to N simulations
from the MCMC algorithm?
Based on the estimated k-th order auto-correlation,
ρ_k = cov( x^{(t)}, x^{(t+k)} ) ,
effective sample size
N̂^{ess} = n ( 1 + 2 Σ_{k=1}^{T_0} ρ̂_k )^{−1/2} ,
Only partial indicator that fails to signal chains stuck in one
mode of the target
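A Python sketch of this estimator; note the slides' −1/2 exponent is kept as written (the more common definition uses −1), and truncating the autocorrelation sum at the first negative estimate is a hypothetical choice for T_0:

```python
import numpy as np

def ess(chain, max_lag=None):
    """Effective sample size, n * (1 + 2 * sum_k rho_hat_k)^(-1/2), as on the slide."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    if max_lag is None:
        max_lag = min(n // 10, 200)
    var = np.dot(x, x) / n
    rho = np.array([np.dot(x[:-k], x[k:]) / (n * var) for k in range(1, max_lag + 1)])
    neg = np.where(rho < 0)[0]          # truncate at first negative estimate (choice of T_0)
    if neg.size:
        rho = rho[:neg[0]]
    return n * (1.0 + 2.0 * rho.sum()) ** -0.5

# demo: iid draws versus a strongly autocorrelated AR(1) chain
rng = np.random.default_rng(4)
iid = rng.standard_normal(5000)
ar = np.empty(5000)
ar[0] = 0.0
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
```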
Tempering
Facilitate exploration of π by flattening the target: simulate from
π_α(x) ∝ π(x)^α for α > 0 small enough
Determine where the modal regions of π are (possibly with
parallel versions using different α’s)
Recycle simulations from π(x)^α into simulations from π by
importance sampling
Simple modification of the Metropolis–Hastings algorithm,
with new acceptance
{ π(θ′|x) / π(θ|x) }^α × q(θ|θ′) / q(θ′|θ) ∧ 1
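A Python sketch of random walk Metropolis on the flattened target π^α; the bimodal mixture target is a hypothetical example, and with a symmetric proposal the q-ratio cancels:

```python
import numpy as np

rng = np.random.default_rng(5)

def pi(x):                      # bimodal target: equal mixture of N(-2,1) and N(2,1)
    return 0.5 * np.exp(-0.5 * (x + 2)**2) + 0.5 * np.exp(-0.5 * (x - 2)**2)

def tempered_rw_mh(n_iter, alpha, scale=1.0, x0=-2.0):
    """Random walk MH on pi^alpha: acceptance (pi(y)/pi(x))^alpha ∧ 1."""
    x = x0
    chain = np.empty(n_iter)
    for i in range(n_iter):
        y = x + scale * rng.standard_normal()
        if rng.uniform() < min(1.0, (pi(y) / pi(x)) ** alpha):
            x = y
        chain[i] = x
    return chain

hot = tempered_rw_mh(20000, alpha=0.2)   # flattened target: easy mode hopping
cold = tempered_rw_mh(20000, alpha=1.0)  # original target
```

The flattened chain crosses the low-density region between the modes much more readily, which is exactly what makes tempering useful for locating modal regions.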
Tempering with the mean mixture
(µ1, µ2) samples from the tempered targets π^α for α = 1, 0.5
and 0.2.