1. Computation of the marginal likelihood: brief summary and method of power posteriors
Jean-Louis Foulley
jean-louis.foulley@jouy.inra.fr
2. Outline
Objectives
Brief summary of current methods
Direct Monte Carlo
Harmonic mean
Generalized harmonic mean
Chib
Bridge sampling
Nested sampling
Power Posteriors
Relationship with fractional BF
Algorithm
Examples
Conclusion
3. Objectives
Marginal likelihood ("Prior Predictive", "Evidence")
$$m(y)=\int_{\Theta} f(y\mid\theta)\,\pi(\theta)\,d\theta$$
- Normalization constant of $\pi^*(\theta\mid y)$:
$$\pi(\theta\mid y)=\frac{\pi^*(\theta\mid y)}{m(y)}\quad\text{where}\quad \pi^*(\theta\mid y)=f(y\mid\theta)\,\pi(\theta)$$
- Component of the Bayes factor:
$$BF_{12}=\frac{\pi(M_1\mid y)\,/\,\pi(M_2\mid y)}{\pi(M_1)\,/\,\pi(M_2)}=\frac{m_1(y)}{m_2(y)}$$
$$\Delta D_{m,12}=-2\ln BF_{12}=D_{m,1}-D_{m,2},\qquad D_{m,j}=-2\ln m_j(y)\ \text{: marginal deviance}$$
Calibration: Jeffreys & Turing (deciban: $10\log_{10} BF$)
4. Methods/Monte Carlo, Harmonic Mean
1) Direct Monte Carlo:
$$\hat{m}_{MC}(y)=\frac{1}{G}\sum_{g=1}^{G} f\!\left(y\mid\theta^{(g)}\right),\qquad \theta^{(1)},\dots,\theta^{(G)}:\ \text{draws from}\ \pi(\theta)$$
Converges (a.s.) to $m(y)$ but very inefficient: many samples fall outside regions of high likelihood.
2) Harmonic mean (Newton & Raftery, 1994):
$$\hat{m}_{NR}(y)=\left[\frac{1}{G}\sum_{g=1}^{G}\frac{1}{f\!\left(y\mid\theta^{(g)}\right)}\right]^{-1},\qquad \theta^{(1)},\dots,\theta^{(G)}:\ \text{draws from}\ \pi(\theta\mid y)$$
A special case of weighted importance sampling (WIS):
$$\sum_{j=1}^{J} f\!\left(y\mid\theta^{(j)}\right)w\!\left(\theta^{(j)}\right)\Big/\sum_{j=1}^{J} w\!\left(\theta^{(j)}\right),\qquad w\!\left(\theta^{(j)}\right)\propto\pi(\theta)/g(\theta)\ \text{for}\ g(\theta)\propto f(y\mid\theta)\,\pi(\theta)$$
Converges (a.s.) but very unstable (infinite variance): to be absolutely avoided.
"Worst Monte Carlo Method Ever" Radford Neal (2010)
The harmonic mean is hardly affected by a change of prior, whereas the true marginal is highly sensitive to the prior.
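To make the contrast concrete, here is a minimal sketch (my illustration, not from the talk) comparing the two estimators on a conjugate normal model where the exact marginal likelihood is available in closed form; NumPy/SciPy are assumed. The direct MC estimate is unbiased but noisy, while the harmonic mean looks deceptively stable yet has infinite variance.

```python
# Minimal sketch (illustration only): direct Monte Carlo vs. harmonic mean
# on y_i | theta ~ N(theta, 1), theta ~ N(0, tau2), where log m(y) is exact.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, tau2, G = 20, 4.0, 100_000
y = rng.normal(1.0, 1.0, size=N)

def log_lik(theta):
    # log f(y | theta) for an array of theta values
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y[:, None] - theta) ** 2, axis=0)

# Exact log m(y): marginally y ~ N(0, I + tau2 * 11')
Sigma = np.eye(N) + tau2
_, logdet = np.linalg.slogdet(Sigma)
log_m_exact = -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(Sigma, y))

# 1) Direct MC: average the likelihood over draws from the prior
th_prior = rng.normal(0.0, np.sqrt(tau2), size=G)
log_m_mc = logsumexp(log_lik(th_prior)) - np.log(G)

# 2) Harmonic mean: average 1/likelihood over draws from the posterior
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * N * y.mean()
th_post = rng.normal(post_mean, np.sqrt(post_var), size=G)
log_m_hm = -(logsumexp(-log_lik(th_post)) - np.log(G))

print(log_m_exact, log_m_mc, log_m_hm)
```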
5. Methods/Gelfand & Dey & Chib
3) Generalized harmonic mean (Gelfand & Dey, 1994; Chen & Shao, 1997):
$$\hat{m}_{GD}(y)=\left[\frac{1}{G}\sum_{g=1}^{G}\frac{g\!\left(\theta^{(g)}\right)}{f\!\left(y\mid\theta^{(g)}\right)\pi\!\left(\theta^{(g)}\right)}\right]^{-1},\qquad \theta^{(1)},\dots,\theta^{(G)}:\ \text{draws from}\ \pi(\theta\mid y)$$
$g(\cdot)$: an approximation of the posterior; problems in large dimension.
4) Chib's method (1995):
$$\ln m(y)=\ln f(y\mid\theta)+\ln\pi(\theta)-\ln\pi(\theta\mid y),\quad\forall\theta$$
$$\ln\hat{m}_{SC}(y)=\ln f\!\left(y\mid\theta^*\right)+\ln\pi\!\left(\theta^*\right)-\ln\hat{\pi}\!\left(\theta^*\mid y\right)$$
$\hat{\pi}(\theta^*\mid y)$ to be estimated, with $\theta^*$ selected as the ML, MAP or $E(\theta\mid y)$ point.
Simple & often effective.
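A minimal sketch (again my own toy, same conjugate normal setup as above) showing the role of the tuning density $g(\cdot)$: here a normal density moment-matched to the posterior draws, which keeps the ratio $g/(f\pi)$ bounded and the estimator stable.

```python
# Minimal sketch (illustration only): Gelfand-Dey estimator with g(.) a
# normal density fitted to the posterior draws of the conjugate model.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)
N, tau2, G = 20, 4.0, 50_000
y = rng.normal(1.0, 1.0, size=N)

def log_lik(theta):
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y[:, None] - theta) ** 2, axis=0)

# posterior draws (exact by conjugacy here; from MCMC in general)
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * N * y.mean()
th = rng.normal(post_mean, np.sqrt(post_var), size=G)

# g(.): normal approximation to the posterior, fitted from the draws
log_g = norm.logpdf(th, th.mean(), th.std())
log_prior = norm.logpdf(th, 0.0, np.sqrt(tau2))

# m_GD = [ (1/G) sum g(th) / (f(y|th) pi(th)) ]^{-1}
log_m_gd = -(logsumexp(log_g - log_lik(th) - log_prior) - np.log(G))
print(log_m_gd)
```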
6. Chib (cont.)
4) Chib (1995):
$$\ln\hat{m}_{SC}(y)=\ln f\!\left(y\mid\theta^*\right)+\ln\pi\!\left(\theta^*\right)-\ln\hat{\pi}\!\left(\theta^*\mid y\right)$$
a) Gibbs & Rao-Blackwellization (Chib, 1995)
b) Metropolis-Hastings (Chib & Jeliazkov, 2001)
c) Kernel estimator (Chen, 1994)
7. Chib via Gibbs
If $\theta=(\theta_1,\theta_2)$:
$$\pi(\theta_1,\theta_2\mid y)=\underbrace{\pi(\theta_1\mid y,\theta_2)}_{\text{known}}\;\underbrace{\pi(\theta_2\mid y)}_{\text{estimated}}$$
$$\pi(\theta_2\mid y)=\int\underbrace{\pi(\theta_2\mid y,\theta_1)}_{\text{known}}\;\underbrace{\pi(\theta_1\mid y)}_{\text{MCMC draws}}\,d\theta_1$$
"Estimation by Rao-Blackwellization":
$$\hat{\pi}\!\left(\theta_2^*\mid y\right)=\frac{1}{G}\sum_{g=1}^{G}\pi\!\left(\theta_2^*\mid y,\theta_1^{(g)}\right),\qquad \theta_1^{(g)}:\ \text{draws from}\ \pi(\theta_1\mid y)$$
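A minimal two-block sketch (my own toy, not Chib's code): $y_i \sim N(\mu, s^2)$ with $\mu \sim N(0, v_0)$ and $s^2 \sim \text{InvGamma}(a_0, b_0)$. Both full conditionals are known, so $\pi(\mu^*\mid y)$ is Rao-Blackwellized over the Gibbs draws of $s^2$, while $\pi(s^2{}^*\mid y,\mu^*)$ is an exact full conditional; NumPy/SciPy assumed.

```python
# Minimal sketch of Chib (1995) via Gibbs + Rao-Blackwellization (toy model):
# y_i ~ N(mu, s2), mu ~ N(0, v0), s2 ~ InvGamma(a0, b0)
import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(2)
N, v0, a0, b0 = 30, 10.0, 2.0, 2.0
y = rng.normal(1.0, 1.5, size=N)

def mu_cond(s2):
    # full conditional mu | y, s2 (closed form): returns (mean, var)
    var = 1.0 / (N / s2 + 1.0 / v0)
    return y.sum() / s2 * var, var

def s2_cond(mu):
    # full conditional s2 | y, mu (closed form)
    return invgamma(a0 + N / 2, scale=b0 + 0.5 * np.sum((y - mu) ** 2))

G, mu, s2 = 20_000, 0.0, 1.0
mu_d, s2_d = np.empty(G), np.empty(G)
for g in range(G):                        # Gibbs sampler
    m, v = mu_cond(s2)
    mu = rng.normal(m, np.sqrt(v))
    s2 = s2_cond(mu).rvs(random_state=rng)
    mu_d[g], s2_d[g] = mu, s2

mu_s, s2_s = mu_d.mean(), s2_d.mean()     # theta* = posterior means

# pi(mu* | y): Rao-Blackwellized average of the known conditional density
ms, vs = np.vectorize(mu_cond)(s2_d)
log_pi_mu = np.log(np.mean(norm.pdf(mu_s, ms, np.sqrt(vs))))
# pi(s2* | y, mu*): exact full conditional for the second block
log_pi_s2 = s2_cond(mu_s).logpdf(s2_s)

log_lik = norm.logpdf(y, mu_s, np.sqrt(s2_s)).sum()
log_prior = norm.logpdf(mu_s, 0.0, np.sqrt(v0)) + invgamma(a0, scale=b0).logpdf(s2_s)
print(log_lik + log_prior - (log_pi_mu + log_pi_s2))   # ln m_hat(y)
```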
8. Bridge sampling
5) Bridge sampling (Meng & Wong, 1996)
Since $\pi(\theta\mid y)=f(y\mid\theta)\,\pi(\theta)/m(y)$, for any $\alpha(\theta)$ and any density $g(\theta)$:
$$\frac{\displaystyle\int \alpha(\theta)\,g(\theta)\,\frac{f(y\mid\theta)\,\pi(\theta)}{m(y)}\,d\theta}{\displaystyle\int \alpha(\theta)\,g(\theta)\,\pi(\theta\mid y)\,d\theta}=1$$
9. Bridge sampling/cont.
5) Bridge sampling (Meng & Wong, 1996):
$$m(y)=\frac{\displaystyle\int \alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\,g(\theta)\,d\theta}{\displaystyle\int \alpha(\theta)\,g(\theta)\,\pi(\theta\mid y)\,d\theta}=\frac{E_{g(\theta)}\!\left[\alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\right]}{E_{\pi(\theta\mid y)}\!\left[\alpha(\theta)\,g(\theta)\right]}$$
$\alpha(\theta)$: "bridge function"; $g(\theta)$: density to be calibrated.

For $\alpha(\theta)=1/g(\theta)$:
$$\hat{m}_{BS1}(y)=L^{-1}\sum_{l=1}^{L} f\!\left(y\mid\theta^{(l)}\right)\pi\!\left(\theta^{(l)}\right)\big/\,g\!\left(\theta^{(l)}\right)\qquad (IS)$$
For $\alpha(\theta)=1/f(y\mid\theta)\,\pi(\theta)$: $\hat{m}_{BS2}(y)$ = Gelfand-Dey (1994).
For $\alpha(\theta)=1\big/\!\left[f(y\mid\theta)\,\pi(\theta)\,g(\theta)\right]^{1/2}$: $\hat{m}_{BS3}(y)$ = Lopes-West (2004):
$$\hat{m}_{BS3}(y)=\frac{L^{-1}\sum_{l=1}^{L}\left[f\!\left(y\mid\theta^{(l)}\right)\pi\!\left(\theta^{(l)}\right)\big/g\!\left(\theta^{(l)}\right)\right]^{1/2}}{M^{-1}\sum_{m=1}^{M}\left[g\!\left(\theta^{(m)}\right)\big/f\!\left(y\mid\theta^{(m)}\right)\pi\!\left(\theta^{(m)}\right)\right]^{1/2}}$$
$\theta^{(l)}$: draws from $g(\theta)$; $\theta^{(m)}$: draws from $\pi(\theta\mid y)$.
10. Bridge sampling (cont.)
5) Bridge sampling (Meng & Wong, 1996):
$$m(y)=\frac{\displaystyle\int \alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\,g(\theta)\,d\theta}{\displaystyle\int \alpha(\theta)\,g(\theta)\,\pi(\theta\mid y)\,d\theta}=\frac{E_{g(\theta)}\!\left[\alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\right]}{E_{\pi(\theta\mid y)}\!\left[\alpha(\theta)\,g(\theta)\right]}$$
For $\alpha(\theta)=1\big/\!\left[f(y\mid\theta)\,\pi(\theta)\,g(\theta)\right]$ (Lopes & West, 2004; Ando, 2010):
$$\hat{m}_{BS4}(y)=\frac{L^{-1}\sum_{l=1}^{L} 1\big/g\!\left(\theta^{(l)}\right)}{M^{-1}\sum_{m=1}^{M} 1\big/f\!\left(y\mid\theta^{(m)}\right)\pi\!\left(\theta^{(m)}\right)}$$
$\theta^{(l)}$: draws from $g(\theta)$; $\theta^{(m)}$: draws from $\pi(\theta\mid y)$. Odd (cf. the numerator draws).

For $\alpha(\theta)\propto\left[s_M\,\pi(\theta\mid y)+s_L\,g(\theta)\right]^{-1}$: optimum estimator w.r.t. E(RMSE)
(Meng & Wong, 1996; Lopes & West, 2004; Frühwirth-Schnatter, 2004), computed iteratively:
$$\hat{m}_{BS5}^{(t+1)}(y)=\hat{m}_{BS5}^{(t)}\;\frac{L^{-1}\displaystyle\sum_{l=1}^{L}\frac{\hat{\pi}_t\!\left(\theta^{(l)}\mid y\right)}{s_M\,\hat{\pi}_t\!\left(\theta^{(l)}\mid y\right)+s_L\,g\!\left(\theta^{(l)}\right)}}{M^{-1}\displaystyle\sum_{m=1}^{M}\frac{g\!\left(\theta^{(m)}\right)}{s_M\,\hat{\pi}_t\!\left(\theta^{(m)}\mid y\right)+s_L\,g\!\left(\theta^{(m)}\right)}}$$
where $\hat{\pi}_t(\theta\mid y)=f(y\mid\theta)\,\pi(\theta)/\hat{m}_{BS5}^{(t)}$, $\hat{m}_{BS5}^{(0)}=\hat{m}_{BS1}$ or $\hat{m}_{BS2}$, and $s_M=1-s_L=M/(M+L)$.
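A minimal sketch (my own toy, not the authors' code) of the iterated optimal-bridge estimator $\hat{m}_{BS5}$ on the conjugate normal model, with $g(\cdot)$ calibrated to the posterior draws; the fixed-point loop implements the update displayed above.

```python
# Minimal sketch: iterative optimal bridge sampling (Meng & Wong, 1996)
# on y_i | theta ~ N(theta, 1), theta ~ N(0, tau2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, tau2 = 20, 4.0
y = rng.normal(1.0, 1.0, size=N)

def log_q1(theta):
    # unnormalized posterior f(y|theta) * pi(theta)
    ll = -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y[:, None] - theta) ** 2, axis=0)
    return ll + norm.logpdf(theta, 0.0, np.sqrt(tau2))

# posterior draws (exact here; MCMC in general) and calibrated g(.)
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * N * y.mean()
M = L = 20_000
th_post = rng.normal(post_mean, np.sqrt(post_var), size=M)
g_mean, g_sd = th_post.mean(), th_post.std()
th_g = rng.normal(g_mean, g_sd, size=L)

sM, sL = M / (M + L), L / (M + L)
q1_g, q1_p = np.exp(log_q1(th_g)), np.exp(log_q1(th_post))
g_g = norm.pdf(th_g, g_mean, g_sd)
g_p = norm.pdf(th_post, g_mean, g_sd)

m_hat = np.mean(q1_g / g_g)           # start from the IS estimate (BS1)
for _ in range(50):                   # fixed-point iteration for m_BS5
    num = np.mean((q1_g / m_hat) / (sM * q1_g / m_hat + sL * g_g))
    den = np.mean(g_p / (sM * q1_p / m_hat + sL * g_p))
    m_hat *= num / den
print(np.log(m_hat))
```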
11. Nested sampling
6) Nested sampling (Skilling, 2006; Murray et al., 2006; Chopin & Robert, 2010)
$$m(y)=\int f(y\mid\theta)\,\pi(\theta)\,d\theta=E_{\pi}\!\left[L(\theta)\right]\qquad\text{with}\ Z\equiv m(y),\ L(\theta)\equiv f(y\mid\theta)$$
Let $x=\varphi^{-1}(l)=\Pr\left[L(\theta)>l\right]$ be the survival function of the r.v. $L(\theta)$, where $l=\varphi(x)$ is the (upper-tail) quantile function of $L(\theta)$, so that $x\sim U(0,1)$.
Then $Z=\int_0^1\varphi(x)\,dx$ (area under the curve $l=\varphi(x)$) and
$$\hat{Z}=\sum_{i=1}^{m}\Delta x_i\,l_i,\qquad \Delta x_i=x_{i-1}-x_i\ \ \text{or}\ \ \Delta x_i=\tfrac{1}{2}\left(x_{i-1}-x_{i+1}\right)\ \text{if trapezoidal integration}$$
12. Nested sampling/Cont.
1) Draw $N$ points $\theta_{1,i}$ from the prior; set $\theta_1=\operatorname{Argmin}_{i=1,\dots,N}L(\theta_{1,i})$ and $l_1=L(\theta_1)$.
2) Obtain $N$ points $\theta_{2,i}$ by keeping the $\theta_{1,i}$, except that $\theta_1$ is replaced by a draw from the prior constrained by $L(\theta)>l_1$; record $\theta_2=\operatorname{Argmin}_{i=1,\dots,N}L(\theta_{2,i})$ and set $l_2=L(\theta_2)$.
3) Repeat steps 1 & 2 until a stopping rule is met (change in the max of $L\leq\varepsilon$).
Since $x_i=\varphi^{-1}(l_i)$ is unknown, set either
a) deterministic: $x_i=\exp(-i/N)$, so that $\ln x_i=E\!\left[\ln\varphi^{-1}(l_i)\right]$, or
b) random: $x_{i+1}=t_i x_i$ with $x_0=1$, $t_i\sim Be(N,1)$.
Main difficulty: sampling $\theta$ from the prior constrained by $L(\theta)>l$.
See Chopin & Robert (2010) for an extended importance sampling scheme,
$$\hat{Z}=\sum_{i=1}^{m}\Delta x_i\,\varphi_i\,w_i\quad\text{with weights}\ w(\theta)\ \text{correcting an instrumental prior}\ \tilde{\pi}\ \text{so that}\ \pi(\theta)L(\theta)=\tilde{\pi}(\theta)L(\theta)\,w(\theta)$$
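A minimal sketch (toy only, not Skilling's implementation) of the basic scheme with the deterministic schedule $x_i=\exp(-i/N)$, again on the conjugate normal model; constrained prior sampling is done by plain rejection, which only works in low dimension.

```python
# Minimal nested sampling sketch on y_j | theta ~ N(theta, 1), theta ~ N(0, tau2)
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)
Nd, tau2 = 20, 4.0
y = rng.normal(1.0, 1.0, size=Nd)

def log_L(theta):
    return -0.5 * Nd * np.log(2 * np.pi) - 0.5 * np.sum((y - theta) ** 2)

N = 100                                     # live points
live = rng.normal(0.0, np.sqrt(tau2), N)    # draws from the prior
logL = np.array([log_L(t) for t in live])

terms, x_prev = [], 1.0
for i in range(1, 801):                     # fixed budget; a stopping rule on
    k = int(np.argmin(logL))                # the remaining mass is better
    x_i = np.exp(-i / N)                    # deterministic shrinkage
    terms.append(logL[k] + np.log(x_prev - x_i))
    x_prev = x_i
    while True:                             # rejection: prior draw with L > l_k
        t = rng.normal(0.0, np.sqrt(tau2))
        if log_L(t) > logL[k]:
            break
    live[k], logL[k] = t, log_L(t)

# add the remaining live-point mass, then total evidence
terms += list(logL + np.log(x_prev) - np.log(N))
print(logsumexp(terms))                     # estimate of ln Z = ln m(y)
```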
13. Power Posteriors/basic principle
Method due to Friel & Pettitt (2008); Lartillot & Philippe (2006): "annealing-melting".
The power posterior is defined as
$$\pi(\theta\mid y,t)=\frac{f(y\mid\theta)^{t}\,\pi(\theta)}{z_t(y)}\quad\text{where}\quad z_t(y)=\int f(y\mid\theta)^{t}\,\pi(\theta)\,d\theta$$
and $t\in[0,1]$, with $t^{-1}$ playing the role of a "physical temperature".
$t=0$ to $1$: cooling down or "annealing"; $t=1$ to $0$: "melting".
Notice the path sampling scheme (Gelman & Meng, 1998):
$$\pi(\theta\mid y,0)=\pi(\theta)\ \text{with}\ z_0(y)=1;\qquad \pi(\theta\mid y,1)=\pi(\theta\mid y)\ \text{with}\ z_1(y)=m(y)$$
14. PP/key result
$$\log m(y)=\int_0^1 E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right]dt$$
where $\theta\mid y,t$ has density
$$\pi(\theta\mid y,t)=\frac{f(y\mid\theta)^{t}\,\pi(\theta)}{z_t(y)}$$
Thermodynamic integration (end of the 70's): Ripley (1988), Ogata (1989), Neal (1993).
"Path sampling" (Gelman & Meng, 1998).
15. PP formula/proof as a special case of path sampling
If $p(\theta\mid t)=q(\theta\mid t)/z(t)$ where $z(t)=\int q(\theta\mid t)\,d\theta$,
label $U(\theta,t)=\frac{d}{dt}\ln q(\theta\mid t)$ the potential.
One has
$$\ln\frac{z(1)}{z(0)}=\int_0^1 E_{\theta\mid t}\!\left[U(\theta,t)\right]dt$$
Here $p(\theta\mid t)=\pi(\theta\mid y,t)$ and $q(\theta\mid t)=f(y\mid\theta)^{t}\,\pi(\theta)$; then $U(\theta,t)=\ln f(y\mid\theta)$.
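For completeness, the one-line derivation behind this identity (standard path sampling algebra, as in Gelman & Meng, 1998):
$$\frac{d}{dt}\ln z(t)=\frac{1}{z(t)}\int\frac{\partial q(\theta\mid t)}{\partial t}\,d\theta=\int\frac{q(\theta\mid t)}{z(t)}\,\frac{\partial\ln q(\theta\mid t)}{\partial t}\,d\theta=E_{\theta\mid t}\!\left[U(\theta,t)\right]$$
Integrating over $t$ from 0 to 1 gives $\ln z(1)/z(0)=\int_0^1 E_{\theta\mid t}\!\left[U(\theta,t)\right]dt$; with $q(\theta\mid t)=f(y\mid\theta)^{t}\,\pi(\theta)$ one gets $U(\theta,t)=\ln f(y\mid\theta)$, $z(0)=1$ and $z(1)=m(y)$, i.e. the key result of the previous slide.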
16. PP/Example
$$y_i\mid\theta\sim \text{iid } N(\theta,1),\ i=1,\dots,N;\qquad \theta\sim N(\mu,\tau^2)$$
Then $\theta\mid y,t\sim N(\mu_t,\tau_t^2)$ with
$$\mu_t=\frac{Nt\bar{y}+\mu\tau^{-2}}{Nt+\tau^{-2}},\qquad \tau_t^2=\frac{1}{Nt+\tau^{-2}}$$
$$D_t\equiv -2\,E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right]=N\left(\log 2\pi+s^2\right)+\frac{N(\mu-\bar{y})^2}{(N\tau^2 t+1)^2}+\frac{N}{Nt+\tau^{-2}}$$
$$\bar{y}=N^{-1}\sum_{i=1}^{N}y_i;\qquad s^2=N^{-1}\sum_{i=1}^{N}(y_i-\bar{y})^2$$
$$D_0=N\left[\text{Cte}+(\mu-\bar{y})^2+\tau^2\right]$$
High sensitivity to $\tau^2$: as $\tau^2\to\infty$, $D_0\to\infty$.
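Because everything is Gaussian here, the PP identity can be checked end to end; a minimal sketch (my illustration) integrating the closed-form integrand over a temperature ladder and comparing with the exact marginal likelihood:

```python
# Minimal check of log m(y) = int_0^1 E_{theta|y,t}[log f(y|theta)] dt
# on the analytic example above.
import numpy as np

rng = np.random.default_rng(5)
N, mu, tau2 = 20, 0.0, 4.0
y = rng.normal(1.0, 1.0, size=N)
ybar, s2 = y.mean(), y.var()

def E_logf(t):
    # E_{theta|y,t} log f(y|theta) = -D_t / 2, with D_t as on the slide
    return -0.5 * (N * (np.log(2 * np.pi) + s2)
                   + N * (mu - ybar) ** 2 / (N * tau2 * t + 1) ** 2
                   + N / (N * t + 1 / tau2))

# trapezoidal quadrature on the ladder t_i = (i/n)^c, denser near t = 0
n, c = 200, 5
t = (np.arange(n + 1) / n) ** c
E = E_logf(t)
log_m_pp = np.sum(0.5 * np.diff(t) * (E[1:] + E[:-1]))

# exact log m(y): marginally y ~ N(mu 1, I + tau2 * 11')
Sigma = np.eye(N) + tau2
_, logdet = np.linalg.slogdet(Sigma)
r = y - mu
log_m_exact = -0.5 * (N * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))
print(log_m_pp, log_m_exact)
```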
18. KL distance Prior-Posterior
$$KL\!\left(\pi(\theta\mid y),\pi(\theta)\right)=\int \ln\frac{\pi(\theta\mid y)}{\pi(\theta)}\,\pi(\theta\mid y)\,d\theta$$
$$KL=\int \ln\frac{f(y\mid\theta)\,\pi(\theta)}{m(y)\,\pi(\theta)}\,\pi(\theta\mid y)\,d\theta=E_{\theta\mid y}\!\left[\ln f(y\mid\theta)\right]-\ln m(y)$$
$$-2\,KL=\bar{D}-D_m\ \text{(by-product of PP)}\ \Rightarrow\ D_m=\bar{D}+2\,KL$$
$$DIC=\bar{D}+p_D\quad\text{where}\quad p_D=\bar{D}-D(\bar{\theta})\ \text{(model complexity)}$$
19. PP/partial BF
1) If $\pi(\theta)$ is improper, the marginal $m(y)$ is also improper, resulting in problems for defining the BF.
2) High sensitivity of the BF to priors (which does not vanish with increasing sample size).
Idea behind the partial BF (Lempers, 1971): split $y=(y_P,y_T)$
- Learning or pilot sample $y_P$ to tune the prior
- Testing sample $y_T$ for data analysis
Intrinsic BF (Berger & Pericchi, 1996)
Fractional BF (O'Hagan, 1995)
20. Fractional BF
A fraction $b$ of the likelihood is used to tune the prior:
$$f(y_P\mid\theta)\approx f(y\mid\theta)^{b},\qquad b=m/N<1\ \ \text{(O'Hagan, 1995)}$$
resulting in:
$$\pi(\theta,b)\propto f(y\mid\theta)^{b}\,\pi(\theta)$$
21. PP & fractional BF
$$\pi(\theta,b)\propto f(y\mid\theta)^{b}\,\pi(\theta)$$
$$m^{F}(y,b)=\int f(y\mid\theta)^{1-b}\,\pi(\theta,b)\,d\theta=\frac{\int f(y\mid\theta)\,\pi(\theta)\,d\theta}{\int f(y\mid\theta)^{b}\,\pi(\theta)\,d\theta}=\frac{m(y,1)}{m(y,b)}$$
PP directly provides (see the sketch below):
- $\pi(\theta,b)$ via $\pi(\theta\mid y,t=b)$
- $\log m^{F}(y,b)=\int_{b}^{1}E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right]dt$
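A minimal sketch (my illustration): given PP output, i.e. a ladder `t[i]` and Monte Carlo estimates `E[i]` of $E_{\theta\mid y,t_i}\log f(y\mid\theta)$, $\log m(y)$ and $\log m^F(y,b)$ are the same trapezoidal quadrature over $[0,1]$ and $[b,1]$ respectively:

```python
# Minimal sketch: log m(y) and log m^F(y,b) from the same power posterior run
import numpy as np

def pp_integral(t, E, lo=0.0, hi=1.0, K=400):
    """Trapezoidal estimate of int_lo^hi E_{theta|y,t}[log f(y|theta)] dt,
    interpolating the ladder estimates E at the grid points."""
    grid = np.linspace(lo, hi, K)
    Eg = np.interp(grid, t, E)
    return float(np.sum(0.5 * np.diff(grid) * (Eg[1:] + Eg[:-1])))

# usage, with (t, E) produced by the algorithm of the next slide:
# log_m  = pp_integral(t, E, 0.0, 1.0)   # log m(y)
# log_mF = pp_integral(t, E, b,   1.0)   # log m^F(y,b)
```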
22. PP/algorithm
MCMC with a discretization of $t$ on $[0,1]$:
$$t_0=0<t_1<\dots<t_i<\dots<t_{n-1}<t_n=1,\qquad t_i=(i/n)^{c},\ i=1,\dots,n;\ n=20\ \text{to}\ 100;\ c=2\ \text{to}\ 5$$
1) Make MCMC draws $\theta^{(g_i)}$ from $\pi(\theta\mid y,t_i)$.
2) Compute
$$\hat{E}_{\theta\mid y,t=t_i}\!\left[\log p(y\mid\theta)\right]=\frac{1}{G}\sum_{g=1}^{G}\log p\!\left(y\mid\theta^{(g_i)}\right)$$
Often conditional independence holds, $\log p(y\mid\theta)=\sum_{i=1}^{N}\log p(y_i\mid\theta)$, e.g. if $\theta$ is the closest stochastic parent of $y=(y_i)$ (as for DIC).
3) Approximate the integral (e.g. trapezoidal rule):
$$\widehat{\log m}(y)=\tfrac{1}{2}\sum_{i=0}^{n-1}\left(t_{i+1}-t_i\right)\left(\hat{E}_{i+1}+\hat{E}_i\right)$$
Error due to this numerical approximation: Calderhead & Girolami (2009).
Formula for the MC sampling error: see Friel & Pettitt (2008).
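The three steps in a minimal runnable sketch (my own toy, not the author's code): the normal model of slide 16, with a random-walk Metropolis kernel standing in for whatever MCMC scheme the model requires; burn-in is omitted for brevity.

```python
# Minimal power posterior algorithm: ladder, MCMC per rung, trapezoid in t
import numpy as np

rng = np.random.default_rng(6)
N, mu0, tau2 = 20, 0.0, 4.0
y = rng.normal(1.0, 1.0, size=N)

def log_lik(theta):
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y - theta) ** 2)

def log_prior(theta):
    return -0.5 * np.log(2 * np.pi * tau2) - 0.5 * (theta - mu0) ** 2 / tau2

n, c, G = 30, 5, 2000
t_grid = (np.arange(n + 1) / n) ** c         # denser near t = 0
E_hat = np.empty(n + 1)
theta = mu0                                  # warm start across rungs
for i, t in enumerate(t_grid):
    ll_sum, ll = 0.0, log_lik(theta)
    for g in range(G):                       # Metropolis at temperature t
        prop = theta + rng.normal(0.0, 1.0)
        ll_prop = log_lik(prop)
        log_acc = (t * ll_prop + log_prior(prop)) - (t * ll + log_prior(theta))
        if np.log(rng.uniform()) < log_acc:
            theta, ll = prop, ll_prop
        ll_sum += ll
    E_hat[i] = ll_sum / G                    # estimate of E_{theta|y,t} log f

log_m = np.sum(0.5 * np.diff(t_grid) * (E_hat[1:] + E_hat[:-1]))
print(log_m)
```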
23. PP/Little toy example
0) $y_i\mid\lambda_i\sim \text{id } P(\lambda_i x_i)\ \Leftrightarrow\ f(y_i\mid\lambda_i)=\dfrac{(\lambda_i x_i)^{y_i}\exp(-\lambda_i x_i)}{y_i!}$
1) $\lambda_i\sim \text{id } G(\alpha,\beta)\ \Leftrightarrow\ \pi(\lambda_i)=\dfrac{\beta^{\alpha}\lambda_i^{\alpha-1}\exp(-\beta\lambda_i)}{\Gamma(\alpha)}$
0 + 1) $y_i\sim \text{id } BN(\alpha,p_i)$ (negative binomial) where $p_i=\beta/(\beta+x_i)$
Direct approach:
$$f(y_i)=\frac{\Gamma(y_i+\alpha)}{\Gamma(\alpha)\,y_i!}\,p_i^{\alpha}\,(1-p_i)^{y_i}$$
$$\ln f(y)=-n\ln\Gamma(\alpha)+\sum_{i=1}^{n}\ln\Gamma(y_i+\alpha)-\sum_{i=1}^{n}\ln(y_i!)+\alpha\sum_{i=1}^{n}\ln p_i+\sum_{i=1}^{n}y_i\ln(1-p_i)$$
Indirect approach: $f(y)=\prod_{i=1}^{n}\int f(y_i\mid\lambda_i)\,\pi(\lambda_i)\,d\lambda_i$
24. PP/Little toy example/cont.
Ex / Pump data: Ex #2 in WinBUGS; Carlin & Louis (p. 126)
$y$ = # failures of pumps over operation time $x$ (in $10^3$ hrs):
$$y=(5,1,5,14,3,19,1,1,4,22);\qquad n=10;\ \alpha=\beta=1$$
$$x=(94.3,15.7,62.9,126,5.24,31.4,1.05,1.05,2.1,10.5)$$
Exact: $D=-2\ln f(y)=66.03$; power posteriors: $\hat{D}_{FP}=66.28\pm 0.03$ (20-point ladder).
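The exact value is easy to reproduce from the direct (negative binomial) formula of the previous slide; a minimal check (my code, SciPy assumed):

```python
# Exact marginal deviance for the pump data, alpha = beta = 1
import numpy as np
from scipy.special import gammaln

y = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])
x = np.array([94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
alpha = beta = 1.0
p = beta / (beta + x)

log_f = (gammaln(y + alpha) - gammaln(alpha) - gammaln(y + 1)
         + alpha * np.log(p) + y * np.log(1 - p)).sum()
print(-2 * log_f)   # = 66.03, matching the slide
```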
29. Sampling both θ & t
Starting again from
$$\log m(y)=\int_0^1\!\!\int \log f(y\mid\theta)\,\pi(\theta\mid y,t)\,d\theta\,dt$$
write, for any density $p(t)$ on $[0,1]$:
$$\log m(y)=\int_0^1\!\!\int \frac{\log f(y\mid\theta)}{p(t)}\,\underbrace{\pi(\theta\mid y,t)\,p(t)}_{\pi(\theta,t\mid y)}\,d\theta\,dt=E_{\theta,t\mid y}\!\left[\frac{\log f(y\mid\theta)}{p(t)}\right]$$
with $\pi(\theta\mid y,t)\propto f(y\mid\theta)^{t}\,\pi(\theta)$.
If we assume $p(t)\propto z_t(y)$, then $\pi(t\mid\theta,y)\propto f(y\mid\theta)^{t}$.
Sampling $(\theta,t)$ jointly under these conditions gives poor estimation (too few draws of $t$ close to 0).
31. Model comparison on Potthoff's data
$i$: subscript for individual, $i=1,\dots,I=27$ (11 girls + 16 boys)
$j$: subscript for measurement at age $t_j$ (8, 10, 12, 14 yrs)
1) Purely fixed model:
$$y_{ij}=\underbrace{(\alpha_0+\alpha x_i)}_{\text{intercept}}+\underbrace{(\beta_0+\beta x_i)}_{\text{slope}}(t_j-8)+e_{ij}$$
2) Random intercept model:
$$y_{ij}=(\alpha_0+\alpha x_i+a_i)+(\beta_0+\beta x_i)(t_j-8)+e_{ij}$$
3) Random intercept & slope model assuming independent effects:
$$y_{ij}=(\alpha_0+\alpha x_i+a_i)+(\beta_0+\beta x_i+b_i)(t_j-8)+e_{ij}$$
or
$$y_{ij}=\phi_{i1}+\phi_{i2}(t_j-8)+e_{ij},\qquad y_{ij}\sim \text{id } N(\eta_{ij},\sigma_e^2)$$
with
$$\phi_i=\begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix}\sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\begin{pmatrix}\sigma_a^2&0\\ 0&\sigma_b^2\end{pmatrix}\right)$$
4) Random intercept & slope model assuming correlated effects:
$$\phi_i=\begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix}\sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\begin{pmatrix}\sigma_a^2&\sigma_{ab}\\ \sigma_{ab}&\sigma_b^2\end{pmatrix}\right)$$
32. Model presentation: Hierarchical Bayes
1st level: $y_{ij}\sim \text{id } N(\eta_{ij},\sigma_e^2)$ with $\eta_{ij}=\phi_{i1}+\phi_{i2}(t_j-8)$
2nd level:
2a) $\phi_i=\begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix}\sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\ \Sigma=\begin{pmatrix}\sigma_a^2&\sigma_{ab}\\ \sigma_{ab}&\sigma_b^2\end{pmatrix}\right)$
2b) $\sigma_e\sim U(0,\Delta_e)$ or $\sigma_e^2\sim \text{InvG}(1,\sigma_e^2)$
3rd level:
Fixed effects: $\alpha_0,\alpha,\beta_0,\beta\sim U(\text{inf},\text{sup})$
Var (covar) components:
- If $\sigma_{ab}=0$, then i) $\sigma_a\sim U(0,\Delta_a)$, same for $\sigma_b\sim U(0,\Delta_b)$; or ii) $\sigma_a^2\sim \text{InvG}(1,\sigma_a^2)$, same for $\sigma_b^2\sim \text{InvG}(1,\sigma_b^2)$
- If $\sigma_{ab}\neq 0$, then i) $\sigma_a\sim U(0,\Delta_a)$, $\sigma_b\sim U(0,\Delta_b)$, $\rho\sim U(-1,1)$; or ii) $\Omega\sim W\!\left((\nu\Sigma)^{-1},\nu\right)$* for $\Omega=\Sigma^{-1}$, with $\nu=\dim(\Omega)+1$ and $\Sigma$ a known location parameter
*Take care: WinBUGS uses another notation, i.e. $W\!\left(\nu\Sigma,\nu\right)$.
35. Example 2: Models of genetic differentiation
Two-level hierarchical model
$i$ = locus; $j$ = (sub)population
$a_{ij}$ = number of genes carrying a given allele at locus $i$ in pop. $j$
$p_{ij}$ = frequency of that allele at locus $i$ in pop. $j$
0) $y_{ij}\mid\alpha_{ij}\sim \text{id } B(n_{ij},\alpha_{ij})$
1) $\alpha_{ij}\mid\pi_i,c_j\sim \text{id Beta}\!\left(\tau_j\pi_i,\ \tau_j(1-\pi_i)\right)$ with $\tau_j=\dfrac{1-c_j}{c_j}$, $c_j$ the differentiation index
$\pi_i$ = frequency of that allele at locus $i$ in the gene pool
2) $\pi_i\sim \text{id Beta}(a_\pi,b_\pi)$, $c_j\sim \text{id Beta}(a_c,b_c)$
Migration-drift at equilibrium (Balding)
36. Ex2: Nicholson’s model
Nicholson et al. (2002): same as previously, but
1) $\alpha_{ij}\mid\pi_i,c_j\sim \text{id } N\!\left(\pi_i,\ c_j\pi_i(1-\pi_i)\right)$
Truncated normal with point masses at 0 and 1, so that $y_{ij}\mid\alpha_{ij}^*\sim \text{id } B(n_{ij},\alpha_{ij}^*)$ where $\alpha_{ij}^*=\max\!\left(0,\min(1,\alpha_{ij})\right)$
2) $\pi_i\sim \text{id Beta}(a_\pi,b_\pi)$, $c_j\sim \text{id Beta}(a_c,b_c)$
Pure drift model
38. Conclusion
Derived from thermodynamic integration
Link with "path sampling"
Easy to understand and quite general
Well suited to complex hierarchical models
"Thetas" can be defined as the closest stochastic parents of the data, making the latter conditionally independent
Draws only from posterior distributions
Gives the fractional BF as a by-product
Easy to implement (including in OpenBUGS) but time-consuming
Caution needed in the discretization of $t$ (close to 0)
39. Some references
Chen M, Shao Q, Ibrahim J (2000) Monte Carlo Methods in Bayesian Computation. Springer.
Chib S (1995) Marginal likelihood from the Gibbs output. JASA, 90, 1313-1321.
Chopin N, Robert CP (2010) Properties of nested sampling. Biometrika, 97, 741-755.
Friel N, Pettitt AN (2008) Marginal likelihood estimation via power posteriors. JRSS B, 70, 589-607.
Frühwirth-Schnatter S (2004) Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econometrics Journal, 7, 143-167.
Gelman A, Meng X-L (1998) Simulating normalizing constants: from importance sampling to bridge sampling and path sampling. Statistical Science, 13, 163-185.
Lartillot N, Philippe H (2006) Computing Bayes factors using thermodynamic integration. Systematic Biology, 55, 195-207.
Marin JM, Robert CP (2009) Importance sampling methods for Bayesian discrimination between embedded models. arXiv:0910.2325v1.
Meng X-L, Wong WH (1996) Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6, 831-860.
O'Hagan A (1995) Fractional Bayes factors for model comparison. JRSS B, 57, 99-138.
40. Acknowledgements
Nial Friel (University College Dublin) for his interest in these applications and his invaluable explanations & suggestions
Tony O'Hagan for further insight into the FBF
Gilles Celeux, Mathieu Gautier as co-advisors of the Master dissertation of Yoan Soussan (Paris VI)
Christian Robert for his blog and his relevant comments, standpoints and bibliographical references
The Applibugs & Babayes groups for stimulating discussions on DIC, BF, CPO & other information criteria (AIC, BIC)