Manifold Monte Carlo Methods

            Mark Girolami

      Department of Statistical Science
         University College London

      Joint work with Ben Calderhead


Research Section Ordinary Meeting

   The Royal Statistical Society
        October 13, 2010
       Riemann manifold Langevin and Hamiltonian Monte Carlo Methods
       Girolami, M. & Calderhead, B., J.R.Statist. Soc. B (2011), 73, 2, 1 - 37.

       Advanced Monte Carlo methodology founded on geometric principles
Talk Outline


        Motivation to improve MCMC capability for challenging problems

        Stochastic diffusion as adaptive proposal process

        Exploring geometric concepts in MCMC methodology

        Diffusions across Riemann manifold as proposal mechanism

        Deterministic geodesic flows on manifold form basis of MCMC methods
        Illustrative Examples:
            Warped Bivariate Gaussian
            Gaussian mixture model
            Log-Gaussian Cox process


        Conclusions
Motivation: Simulation Based Inference

       Monte Carlo method employs samples from p(θ) to obtain the estimate
           \[ \int \varphi(\theta)\, p(\theta)\, d\theta = \frac{1}{N}\sum_n \varphi(\theta_n) + O(N^{-1/2}) \]

       Draw θ_n from an ergodic Markov process with stationary distribution p(θ)
       Construct the process in two parts
           Propose a move θ → θ' with probability p_p(θ'|θ)
           Accept or reject the proposal with probability
               \[ p_a(\theta'|\theta) = \min\left[ 1,\; \frac{p(\theta')\, p_p(\theta|\theta')}{p(\theta)\, p_p(\theta'|\theta)} \right] \]

       Efficiency dependent on p_p(θ'|θ) defining the proposal mechanism

       Success of MCMC reliant upon appropriate proposal design
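
A minimal sketch of this two-stage construction in Python/NumPy is given below. The random-walk proposal, the step size and the example target are illustrative assumptions on my part, not part of the talk.

```python
import numpy as np

def metropolis_hastings(log_p, theta0, n_samples, step=0.5, rng=None):
    """Propose a move, then accept or reject it with the Metropolis-Hastings probability.

    log_p  : function returning log p(theta) up to an additive constant
    theta0 : starting point (1-D array)
    step   : proposal scale (an assumed tuning parameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_samples, theta.size))
    log_p_curr = log_p(theta)
    for n in range(n_samples):
        # Symmetric random-walk proposal, so the p_p terms cancel in the acceptance ratio.
        prop = theta + step * rng.standard_normal(theta.size)
        log_p_prop = log_p(prop)
        if np.log(rng.uniform()) < log_p_prop - log_p_curr:
            theta, log_p_curr = prop, log_p_prop
        samples[n] = theta
    return samples

# Example target (assumed purely for illustration): a standard bivariate Gaussian.
samples = metropolis_hastings(lambda th: -0.5 * th @ th, np.zeros(2), 5000)
print(samples.mean(axis=0))   # Monte Carlo estimate of E[theta], with O(N^{-1/2}) error
```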
Adaptive Proposal Distributions - Exploit Discretised Diffusion

        For θ ∈ R^D with density p(θ), L(θ) ≡ log p(θ), define Langevin diffusion
            \[ d\theta(t) = \frac{1}{2}\nabla_\theta L(\theta(t))\, dt + db(t) \]
        First order Euler-Maruyama discrete integration of diffusion
            \[ \theta(\tau + \epsilon) = \theta(\tau) + \frac{\epsilon^2}{2}\nabla_\theta L(\theta(\tau)) + \epsilon\, z(\tau) \]
        Proposal
            \[ p_p(\theta'|\theta) = \mathcal{N}\!\left(\theta' \mid \mu(\theta,\epsilon),\, \epsilon^2 I\right) \quad \text{with} \quad \mu(\theta,\epsilon) = \theta + \frac{\epsilon^2}{2}\nabla_\theta L(\theta) \]
        Acceptance probability to correct for bias
            \[ p_a(\theta'|\theta) = \min\left[ 1,\; \frac{p(\theta')\, p_p(\theta|\theta')}{p(\theta)\, p_p(\theta'|\theta)} \right] \]
        Isotropic diffusion inefficient, employ pre-conditioning
            \[ \theta' = \theta + \frac{\epsilon^2}{2}\, M\, \nabla_\theta L(\theta) + \sqrt{M}\,\epsilon\, z \]

        How to set M systematically? Tuning in transient & stationary phases
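
A sketch of one step of the preconditioned MALA proposal described above, assuming the user supplies log p, its gradient and a fixed mass matrix M (the quantity whose systematic choice the slide questions); the function names and structure are my own.

```python
import numpy as np

def preconditioned_mala_step(theta, log_p, grad_log_p, M, eps, rng):
    """One step of theta' = theta + (eps^2/2) M grad L(theta) + eps sqrt(M) z,
    with the Metropolis-Hastings correction for the discretisation bias."""
    chol_M = np.linalg.cholesky(M)                 # sqrt(M) via a Cholesky factor

    def mean(x):
        return x + 0.5 * eps**2 * M @ grad_log_p(x)

    def log_q(x_to, x_from):
        # log N(x_to | mean(x_from), eps^2 M), dropping the normalising constant,
        # which cancels in the ratio because the covariance eps^2 M is the same
        # in both directions.
        d = x_to - mean(x_from)
        return -0.5 * d @ np.linalg.solve(eps**2 * M, d)

    prop = mean(theta) + eps * chol_M @ rng.standard_normal(theta.size)
    log_accept = (log_p(prop) + log_q(theta, prop)
                  - log_p(theta) - log_q(prop, theta))
    if np.log(rng.uniform()) < log_accept:
        return prop, True
    return theta, False
```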
Geometric Concepts in MCMC

      Rao, 1945; Jeffreys, 1948: to first order
          \[ \int p(y; \theta + \delta\theta)\, \log\frac{p(y; \theta + \delta\theta)}{p(y; \theta)}\, dy \approx \delta\theta^T G(\theta)\, \delta\theta \]
      where
          \[ G(\theta) = \mathrm{E}_{y|\theta}\left\{ \frac{\nabla_\theta p(y; \theta)}{p(y; \theta)}\, \frac{\nabla_\theta p(y; \theta)^T}{p(y; \theta)} \right\} \]

      Fisher Information G(θ) is a p.d. metric defining a Riemann manifold
      Non-Euclidean geometry for probabilities - distances, metrics, invariants,
      curvature, geodesics
      Asymptotic statistical analysis: Amari, 1981, 85, 90; Murray & Rice, 1993;
      Critchley et al., 1993; Kass, 1989; Efron, 1975; Dawid, 1975; Lauritzen, 1989

      Statistical shape analysis: Kent et al., 1996; Dryden & Mardia, 1998

      Can geometric structure be employed in MCMC methodology?
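
A small numerical check of the definition of G(θ) as the expected outer product of the score. The Bernoulli model used here is my own choice for illustration: its score is y/θ − (1 − y)/(1 − θ) and its Fisher information is 1/(θ(1 − θ)).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
y = rng.binomial(1, theta, size=200_000)          # draws from p(y; theta)

# Score of a single observation: d/dtheta log p(y; theta)
score = y / theta - (1 - y) / (1 - theta)

G_monte_carlo = np.mean(score**2)                 # E_{y|theta}[score^2]
G_exact = 1.0 / (theta * (1 - theta))             # known Fisher information, ~4.76
print(G_monte_carlo, G_exact)
```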
Geometric Concepts in MCMC

      Tangent space - local metric defined by \( \delta\theta^T G(\theta)\, \delta\theta = \sum_{k,l} g_{kl}\, \delta\theta_k\, \delta\theta_l \)

      Christoffel symbols - characterise connection on curved manifold
          \[ \Gamma^i_{kl} = \frac{1}{2}\sum_m g^{im}\left( \frac{\partial g_{mk}}{\partial \theta^l} + \frac{\partial g_{ml}}{\partial \theta^k} - \frac{\partial g_{kl}}{\partial \theta^m} \right) \]

      Geodesics - shortest path between two points on manifold
          \[ \frac{d^2\theta^i}{dt^2} + \sum_{k,l} \Gamma^i_{kl}\, \frac{d\theta^k}{dt}\frac{d\theta^l}{dt} = 0 \]
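
To make the Christoffel symbols and the geodesic equation concrete, the sketch below integrates the geodesic ODE for a user-supplied metric, forming the metric derivatives by finite differences. The example metric (the Fisher metric of the univariate Normal, which reappears on a later slide), the step sizes and the initial velocity are my own illustrative choices.

```python
import numpy as np

def christoffel(metric, theta, h=1e-5):
    """Gamma^i_{kl} = 0.5 * sum_m g^{im} (dg_mk/dth^l + dg_ml/dth^k - dg_kl/dth^m),
    with the metric derivatives taken by central finite differences."""
    D = theta.size
    dG = np.zeros((D, D, D))                      # dG[m, k, l] = d g_{kl} / d theta^m
    for m in range(D):
        e = np.zeros(D); e[m] = h
        dG[m] = (metric(theta + e) - metric(theta - e)) / (2 * h)
    Ginv = np.linalg.inv(metric(theta))
    Gamma = np.zeros((D, D, D))
    for i in range(D):
        for k in range(D):
            for l in range(D):
                Gamma[i, k, l] = 0.5 * np.sum(
                    Ginv[i] * (dG[l, :, k] + dG[k, :, l] - dG[:, k, l]))
    return Gamma

def geodesic(metric, theta0, v0, dt=1e-3, n_steps=2000):
    """Euler integration of d^2 theta/dt^2 + Gamma (dtheta/dt)(dtheta/dt) = 0."""
    theta, v = theta0.astype(float), v0.astype(float)
    path = [theta.copy()]
    for _ in range(n_steps):
        Gamma = christoffel(metric, theta)
        accel = -np.einsum('ikl,k,l->i', Gamma, v, v)
        theta, v = theta + dt * v, v + dt * accel
        path.append(theta.copy())
    return np.array(path)

# Example metric (my choice, anticipating a later slide): Fisher metric of
# N(mu, sigma) with coordinates theta = (mu, sigma).
fisher = lambda th: np.diag([1.0 / th[1]**2, 2.0 / th[1]**2])
path = geodesic(fisher, np.array([0.0, 1.0]), np.array([1.0, 0.5]))
print(path[-1])                                   # end point of the geodesic segment
```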
Illustration of Geometric Concepts

       Consider Normal density p(x|µ, σ) = N_x(µ, σ)

       Local inner product on tangent space defined by metric tensor, i.e.
       δθ^T G(θ) δθ, where θ = (µ, σ)^T

       Metric is Fisher Information
           \[ G(\mu, \sigma) = \begin{bmatrix} \sigma^{-2} & 0 \\ 0 & 2\sigma^{-2} \end{bmatrix} \]

       Inner product σ^{-2}(δµ² + 2δσ²), so densities N(0, 1) & N(1, 1) further
       apart than the densities N(0, 2) & N(1, 2) - distance non-Euclidean

       Metric tensor for univariate Normal defines a Hyperbolic Space
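
A small numerical check of the non-Euclidean distance claim, using the metric just given: the same coordinate shift δµ = 1 has a larger local squared length at σ = 1 than at σ = 2. The code is mine, for illustration only.

```python
import numpy as np

def fisher_metric(mu, sigma):
    # Fisher metric of N(mu, sigma) in (mu, sigma) coordinates
    return np.array([[1.0 / sigma**2, 0.0],
                     [0.0, 2.0 / sigma**2]])

dtheta = np.array([1.0, 0.0])                     # shift the mean by 1, keep sigma fixed

# Local squared length dtheta^T G(theta) dtheta at the two base points
print(dtheta @ fisher_metric(0.0, 1.0) @ dtheta)  # N(0,1) -> N(1,1): 1.0
print(dtheta @ fisher_metric(0.0, 2.0) @ dtheta)  # N(0,2) -> N(1,2): 0.25
```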
M.C. Escher, Heaven and Hell, 1960
Langevin Diffusion on Riemannian manifold

       Discretised Langevin diffusion on the manifold defines the proposal mechanism
           \[ \theta'_d = \theta_d + \frac{\epsilon^2}{2}\left( G^{-1}(\theta)\,\nabla_\theta L(\theta) \right)_d - \epsilon^2 \sum_{i,j=1}^{D} \left( G^{-1}(\theta) \right)_{ij} \Gamma^d_{ij} + \left( \epsilon\sqrt{G^{-1}(\theta)}\, z \right)_d \]
       For a manifold with constant curvature the proposal mechanism reduces to
           \[ \theta' = \theta + \frac{\epsilon^2}{2}\, G^{-1}(\theta)\,\nabla_\theta L(\theta) + \epsilon\sqrt{G^{-1}(\theta)}\, z \]
       MALA proposal with preconditioning
           \[ \theta' = \theta + \frac{\epsilon^2}{2}\, M\,\nabla_\theta L(\theta) + \epsilon\sqrt{M}\, z \]

       Proposal and acceptance probability
           \[ p_p(\theta'|\theta) = \mathcal{N}\!\left( \theta' \mid \mu(\theta,\epsilon),\, \epsilon^2 G^{-1}(\theta) \right), \qquad
              p_a(\theta'|\theta) = \min\left[ 1,\; \frac{p(\theta')\, p_p(\theta|\theta')}{p(\theta)\, p_p(\theta'|\theta)} \right] \]

       Proposal mechanism diffuses approximately along the manifold
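
A sketch of the constant-curvature (simplified) variant of this proposal, with the position-dependent metric supplied by the user; the full proposal would add the Christoffel correction term shown above. Function names and structure are my own.

```python
import numpy as np

def simplified_mmala_step(theta, log_p, grad_log_p, metric, eps, rng):
    """Simplified manifold MALA: position-dependent preconditioning by G(theta)^{-1},
    omitting the curvature (Christoffel) terms of the full proposal."""
    def mean_and_cov(x):
        Ginv = np.linalg.inv(metric(x))
        return x + 0.5 * eps**2 * Ginv @ grad_log_p(x), eps**2 * Ginv

    def log_q(x_to, x_from):
        # log N(x_to | mu(x_from), eps^2 G^{-1}(x_from)); the covariance now depends on
        # the starting point, so the log-determinant must be kept in the ratio.
        mu, cov = mean_and_cov(x_from)
        d = x_to - mu
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (logdet + d @ np.linalg.solve(cov, d))

    mu, cov = mean_and_cov(theta)
    prop = rng.multivariate_normal(mu, cov)
    log_accept = (log_p(prop) + log_q(theta, prop)
                  - log_p(theta) - log_q(prop, theta))
    return (prop, True) if np.log(rng.uniform()) < log_accept else (theta, False)
```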
Langevin Diffusion on Riemannian manifold

       [Figure slides: two-panel plots with µ on the horizontal axis and σ on the vertical axis.]
Geodesic flow as proposal mechanism

      Desirable that proposals follow direct path on manifold - geodesics

      Define geodesic flow on manifold by solving
          \[ \frac{d^2\theta^i}{dt^2} + \sum_{k,l} \Gamma^i_{kl}\, \frac{d\theta^k}{dt}\frac{d\theta^l}{dt} = 0 \]

      How can this be exploited in the design of a transition operator?

      Need slight detour - first define log-density under model as L(θ)

      Introduce auxiliary variable p ∼ N(0, G(θ))

      Negative joint log density is
          \[ H(\theta, p) = -L(\theta) + \frac{1}{2}\log\!\left( (2\pi)^D |G(\theta)| \right) + \frac{1}{2}\, p^T G(\theta)^{-1} p \]
Riemannian Hamiltonian Monte Carlo

       Negative joint log-density ≡ Hamiltonian defined on Riemann manifold
           \[ H(\theta, p) = \underbrace{-L(\theta) + \frac{1}{2}\log\!\left( (2\pi)^D |G(\theta)| \right)}_{\text{Potential Energy}} + \underbrace{\frac{1}{2}\, p^T G(\theta)^{-1} p}_{\text{Kinetic Energy}} \]

       Marginal density follows as required
           \[ p(\theta) \propto \frac{\exp\{L(\theta)\}}{\sqrt{(2\pi)^D |G(\theta)|}} \int \exp\left\{ -\frac{1}{2}\, p^T G(\theta)^{-1} p \right\} dp = \exp\{L(\theta)\} \]

       Obtain samples from marginal p(θ) using Gibbs sampler for p(θ, p)
           \[ p^{n+1} \mid \theta^n \sim \mathcal{N}(0, G(\theta^n)), \qquad \theta^{n+1} \mid p^{n+1} \sim p(\theta^{n+1} \mid p^{n+1}) \]

       Employ Hamiltonian dynamics to propose samples for p(θ^{n+1} | p^{n+1}), Duane et al., 1987; Neal, 2010
           \[ \frac{d\theta}{dt} = \frac{\partial}{\partial p} H(\theta, p), \qquad \frac{dp}{dt} = -\frac{\partial}{\partial \theta} H(\theta, p) \]
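
A deliberately simplified sketch of the Gibbs structure above. With a position-dependent G(θ) the Hamiltonian is non-separable and the paper uses a generalised (implicit) leapfrog; the explicit leapfrog below is only valid under the simplifying assumption, made here to keep the example short, that G is held fixed over a trajectory.

```python
import numpy as np

def hmc_fixed_metric(log_p, grad_log_p, G, theta0, n_iters, eps=0.1, n_leapfrog=20, rng=None):
    """Gibbs sampler on (theta, p): refresh p ~ N(0, G), then simulate the dynamics.

    Simplifying assumption: G is a fixed matrix, so the Hamiltonian is separable and
    the standard leapfrog applies.  With G = G(theta) the generalised leapfrog of the
    paper is required instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    Ginv = np.linalg.inv(G)
    chol = np.linalg.cholesky(G)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iters, theta.size))
    for n in range(n_iters):
        p = chol @ rng.standard_normal(theta.size)     # p^{n+1} | theta^n ~ N(0, G)
        th, mom = theta.copy(), p.copy()
        mom = mom + 0.5 * eps * grad_log_p(th)         # leapfrog half step in momentum
        for _ in range(n_leapfrog):
            th = th + eps * Ginv @ mom                 # dtheta/dt =  dH/dp = G^{-1} p
            mom = mom + eps * grad_log_p(th)           # dp/dt     = -dH/dtheta
        mom = mom - 0.5 * eps * grad_log_p(th)         # undo the extra half step
        # Metropolis correction on H = -L(theta) + 0.5 p^T G^{-1} p (log|G| term is constant)
        H_old = -log_p(theta) + 0.5 * p @ Ginv @ p
        H_new = -log_p(th) + 0.5 * mom @ Ginv @ mom
        if np.log(rng.uniform()) < H_old - H_new:
            theta = th
        samples[n] = theta
    return samples
```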
Riemannian Manifold Hamiltonian Monte Carlo

       Consider the Hamiltonian \( \tilde{H}(\theta, p) = \frac{1}{2}\, p^T \tilde{G}(\theta)^{-1} p \)

       Hamiltonians with only a quadratic kinetic energy term exactly describe
       geodesic flow on the coordinate space θ with metric \( \tilde{G} \)

       However our Hamiltonian is \( H(\theta, p) = V(\theta) + \frac{1}{2}\, p^T G(\theta)^{-1} p \)

       If we define \( \tilde{G}(\theta) = G(\theta) \times (h - V(\theta)) \), where h is a constant value of H(θ, p)

       Then the Maupertuis principle tells us that the Hamiltonian flows for
       H(θ, p) and \( \tilde{H}(\theta, p) \) are equivalent along the energy level h

       The solution of
           \[ \frac{d\theta}{dt} = \frac{\partial}{\partial p} H(\theta, p), \qquad \frac{dp}{dt} = -\frac{\partial}{\partial \theta} H(\theta, p) \]
       is therefore equivalent to the solution of
           \[ \frac{d^2\theta^i}{dt^2} + \sum_{k,l} \tilde{\Gamma}^i_{kl}\, \frac{d\theta^k}{dt}\frac{d\theta^l}{dt} = 0 \]

       RMHMC proposals are along the manifold geodesics
Warped Bivariate Gaussian

           \[ p(w_1, w_2 \mid \mathbf{y}, \sigma_x, \sigma_y) \propto \prod_{n=1}^{N} \mathcal{N}(y_n \mid w_1 + w_2^2,\, \sigma_y^2)\; \mathcal{N}(w_1, w_2 \mid 0,\, \sigma_x^2 I) \]

       [Figure: HMC (left) and RMHMC (right) panels plotted in the (w1, w2) plane.]
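
For reference, a sketch of the log-posterior and its gradient for this example, as one would supply them to the samplers sketched earlier; the synthetic data, parameter values and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_x, sigma_y = 1.0, 0.5                      # assumed values, for illustration only
w_true = np.array([0.3, 0.5])
N = 100
y = w_true[0] + w_true[1]**2 + sigma_y * rng.standard_normal(N)   # simulated data (assumed)

def log_posterior(w):
    """log p(w1, w2 | y) up to a constant: Gaussian likelihood in w1 + w2^2 plus Gaussian prior."""
    resid = y - (w[0] + w[1]**2)
    return -0.5 * np.sum(resid**2) / sigma_y**2 - 0.5 * np.sum(w**2) / sigma_x**2

def grad_log_posterior(w):
    resid_sum = np.sum(y - (w[0] + w[1]**2)) / sigma_y**2
    return np.array([resid_sum - w[0] / sigma_x**2,
                     2.0 * w[1] * resid_sum - w[1] / sigma_x**2])
```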
Gaussian Mixture Model

       Univariate finite mixture model
           \[ p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \]

       FI based metric tensor non-analytic - employ empirical FI
           \[ G(\theta) = \frac{1}{N}\, S^T S - \frac{1}{N^2}\, \bar{s}\,\bar{s}^T \;\xrightarrow{\;N \to \infty\;}\; \mathrm{cov}\!\left( \nabla_\theta L(\theta) \right) = I(\theta) \]
           \[ \frac{\partial G(\theta)}{\partial \theta_d} = \frac{1}{N}\left( \frac{\partial S^T}{\partial \theta_d}\, S + S^T \frac{\partial S}{\partial \theta_d} \right) - \frac{1}{N^2}\left( \frac{\partial \bar{s}}{\partial \theta_d}\, \bar{s}^T + \bar{s}\, \frac{\partial \bar{s}^T}{\partial \theta_d} \right) \]
       with score matrix S with elements \( S_{i,d} = \frac{\partial \log p(x_i \mid \theta)}{\partial \theta_d} \) and \( \bar{s} = \sum_{i=1}^{N} S_{i,\cdot}^T \)
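
A sketch of the empirical-Fisher construction above, with the score matrix S obtained by central finite differences of the per-observation log density (an implementation shortcut, not the talk's derivation); the two-component mixture at the end mirrors the example on the next slide and its parameterisation θ = (µ, log σ) is my own choice.

```python
import numpy as np

def empirical_fisher(log_p_xi, theta, x, h=1e-6):
    """G(theta) = (1/N) S^T S - (1/N^2) sbar sbar^T, with S_{i,d} = d log p(x_i|theta) / d theta_d.

    The scores are obtained by central finite differences of the per-observation
    log density log_p_xi(x_i, theta).
    """
    N, D = len(x), len(theta)
    S = np.zeros((N, D))
    for d in range(D):
        e = np.zeros(D); e[d] = h
        up = np.array([log_p_xi(xi, theta + e) for xi in x])
        dn = np.array([log_p_xi(xi, theta - e) for xi in x])
        S[:, d] = (up - dn) / (2 * h)
    sbar = S.sum(axis=0)
    return S.T @ S / N - np.outer(sbar, sbar) / N**2

# Example (my choice): the mixture 0.7 N(0, sigma^2) + 0.3 N(mu, sigma^2) of the next
# slide, parameterised as theta = (mu, log sigma).
def log_p_xi(xi, theta):
    mu, sigma = theta[0], np.exp(theta[1])
    comp = lambda m: np.exp(-0.5 * ((xi - m) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return np.log(0.7 * comp(0.0) + 0.3 * comp(mu))
```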
Gaussian Mixture Model

       Univariate finite mixture model
           \[ p(x \mid \mu, \sigma^2) = 0.7 \times \mathcal{N}(x \mid 0, \sigma^2) + 0.3 \times \mathcal{N}(x \mid \mu, \sigma^2) \]

       Figure (over the (µ, ln σ) plane): Arrows correspond to the gradients and ellipses to the
       inverse metric tensor. Dashed lines are isocontours of the joint log density.
Gaussian Mixture Model

   Figure: Comparison of MALA (left), mMALA (middle) and simplified mMALA (right)
   convergence paths and autocorrelation plots. Autocorrelation plots are from the
   stationary chains, i.e. once the chains have converged to the stationary distribution.
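
   As a rough illustration of how such autocorrelation plots can be computed from the
   stationary portion of a chain (a minimal sketch, assuming the post-burn-in draws of a
   single parameter such as µ are stored in a NumPy array named `mu_samples`; the name is
   purely illustrative and not from the paper):

```python
import numpy as np

def sample_acf(chain, max_lag=20):
    """Sample autocorrelation of a 1-D MCMC chain at lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    c0 = np.dot(x, x)  # lag-0 autocovariance (unnormalised)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / c0 for k in range(max_lag + 1)])

# e.g. acf_mu = sample_acf(mu_samples, max_lag=20)   # mu_samples: post-burn-in draws of mu
```

   Faster decay of the autocorrelation towards zero indicates less correlated draws and
   hence a larger effective sample size for a fixed chain length.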
Gaussian Mixture Model

   [Figure panels: HMC, RMHMC and Gibbs convergence paths plotted as ln σ against µ (with joint log-density contours), and sample autocorrelation functions plotted against lags 0-20.]
   Figure: Comparison of HMC (left), RMHMC (middle) and GIBBS (right) convergence
   paths and autocorrelation plots. Autocorrelation plots are from the stationary chains,
   i.e. once the chains have converged to the stationary distribution.
Log-Gaussian Cox Point Process with Latent Field

       The joint density for Poisson counts and latent Gaussian field

       $$p(\mathbf{y}, \mathbf{x} \mid \mu, \sigma, \beta) \;\propto\; \prod_{i,j=1}^{64} \exp\{y_{i,j} x_{i,j} - m\exp(x_{i,j})\}\; \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \mu\mathbf{1})^{\mathsf T}\, \Sigma_\theta^{-1}\, (\mathbf{x} - \mu\mathbf{1})\right)$$

       Metric tensors

       $$G(\theta)_{i,j} = \frac{1}{2}\,\operatorname{trace}\!\left(\Sigma_\theta^{-1}\, \frac{\partial \Sigma_\theta}{\partial \theta_i}\, \Sigma_\theta^{-1}\, \frac{\partial \Sigma_\theta}{\partial \theta_j}\right), \qquad G(\mathbf{x}) = \Lambda + \Sigma_\theta^{-1}$$

       where $\Lambda$ is diagonal with elements $m \exp(\mu + (\Sigma_\theta)_{i,i})$

       Latent field metric tensor defining a flat manifold is 4096 × 4096 (O(N³)
       operations), obtained from the parameter conditional

       MALA requires a transformation of the latent field to sample, with separate
       tuning in the transient and stationary phases of the Markov chain

       Manifold methods require no pilot tuning or additional transformations
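
       A minimal sketch of how the latent-field gradient and metric might be assembled and
       used in a simplified-mMALA proposal, assuming the 64 × 64 grid is flattened to a
       length-4096 vector, `Sigma_inv` and `Sigma_diag` hold the precomputed inverse and
       diagonal of Σ_θ, and `m`, `mu` are scalars; the Metropolis-Hastings accept/reject
       step is omitted, and none of these names come from the paper's code:

```python
import numpy as np

def lgcp_latent_grad_and_metric(x, y, mu, m, Sigma_inv, Sigma_diag):
    """Gradient of log p(y, x | theta) w.r.t. the latent field x, and the
    metric tensor G(x) = Lambda + Sigma_theta^{-1} from the slide above."""
    grad = y - m * np.exp(x) - Sigma_inv @ (x - mu)   # likelihood term + Gaussian prior term
    Lam = np.diag(m * np.exp(mu + Sigma_diag))        # Lambda: diagonal, elements m*exp(mu + (Sigma_theta)_ii)
    return grad, Lam + Sigma_inv

def simplified_mmala_proposal(x, grad, G, eps, rng):
    """One simplified mMALA proposal: metric-preconditioned drift plus metric-scaled noise."""
    G_inv = np.linalg.inv(G)                          # O(N^3) for the 4096 x 4096 metric
    mean = x + 0.5 * eps**2 * (G_inv @ grad)
    return mean + eps * np.linalg.cholesky(G_inv) @ rng.standard_normal(x.size)
```

       The O(N³) work against the 4096 × 4096 metric is the dominant cost flagged above;
       MALA's isotropic proposal avoids this cost but, as the slide notes, requires a
       latent-field transformation and separate pilot tuning instead.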
RMHMC for Log-Gaussian Cox Point Processes




   Table: Sampling the latent variables of a Log-Gaussian Cox Process - Comparison of
   sampling methods

          Method               Time (s)    ESS (Min, Med, Max)     s/Min ESS    Rel. Speed
      MALA (Transient)          31,577          (3, 8, 50)           10,605        ×1
      MALA (Stationary)         31,118         (4, 16, 80)            7,836        ×1.35
         mMALA                     634        (26, 84, 174)            24.1        ×440
         RMHMC                   2,936     (1951, 4545, 5000)           1.5        ×7070
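
   For reference, the last two columns follow from the first two: seconds per effectively
   independent sample is the run time divided by the minimum ESS, and relative speed is
   measured against MALA in its transient phase. A quick check with the table's values
   hard-coded for illustration (small differences from the printed figures reflect rounding
   of the ESS estimates):

```python
# Recompute s/Min ESS and relative speed from (run time in seconds, minimum ESS).
methods = {
    "MALA (Transient)":  (31_577, 3),
    "MALA (Stationary)": (31_118, 4),
    "mMALA":             (634, 26),
    "RMHMC":             (2_936, 1951),
}
base = methods["MALA (Transient)"][0] / methods["MALA (Transient)"][1]
for name, (seconds, min_ess) in methods.items():
    s_per_ess = seconds / min_ess
    print(f"{name:18s} s/Min ESS = {s_per_ess:9.1f}   Rel. Speed = x{base / s_per_ess:.2f}")
```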
Conclusion and Discussion

       Geometry of statistical models harnessed in Monte Carlo methods
           Diffusions that respect structure and curvature of space - Manifold MALA
           Geodesic flows on model manifold - RMHMC - generalisation of HMC
           Assessed on correlated & high-dimensional latent variable models
           Promising capability of methodology

       Ongoing development
           Potential bottleneck at metric tensor and Christoffel symbols
           Theoretical analysis of convergence
           Investigate alternative manifold structures
           Design and effect of metric
           Optimality of Hamiltonian flows as local geodesics
           Alternative transition kernels


       No silver bullet or cure-all - a powerful new methodology for the MC toolkit
Funding Acknowledgment




      Girolami is funded by an EPSRC Advanced Research Fellowship and a BBSRC
      project grant

      Calderhead is supported by a Microsoft Research PhD Scholarship
