1. Computation of the marginal likelihood: brief summary and method of power posteriors
Jean-Louis Foulley
jean-louis.foulley@jouy.inra.fr
2. Outline
Objectives
Brief summary of current methods
Direct Monte Carlo
Harmonic mean
Generalized harmonic mean
Chib
Bridge sampling
Nested sampling
Power Posteriors
Relationship with fractional BF
Algorithm
Examples
Conclusion
3. Objectives
Marginal likelihood ("Prior Predictive", "Evidence")
$$m(y)=\int_{\Theta} f(y\mid\theta)\,\pi(\theta)\,d\theta$$
- Normalization constant of $\pi^*(\theta\mid y)$:
$$\pi(\theta\mid y)=\frac{\pi^*(\theta\mid y)}{m(y)}\quad\text{where}\quad \pi^*(\theta\mid y)=f(y\mid\theta)\,\pi(\theta)$$
- Component of the Bayes factor:
$$BF_{12}=\frac{\pi(M_1\mid y)\,/\,\pi(M_2\mid y)}{\pi(M_1)\,/\,\pi(M_2)}=\frac{m_1(y)}{m_2(y)}$$
$$\Delta D_{m,12}=-2\ln BF_{12}=D_{m,1}-D_{m,2},\qquad D_{m,j}=-2\ln m_j(y)\ \text{: marginal deviance}$$
Calibration: Jeffreys & Turing (deciban: $10\log_{10} BF$)
4. Methods/Monte Carlo, Harmonic Mean
1) Direct Monte Carlo:
$$\hat{m}_{MC}(y)=\frac{1}{G}\sum_{g=1}^{G} f\!\left(y\mid\theta^{(g)}\right),\qquad \theta^{(1)},\dots,\theta^{(G)}:\ \text{draws from}\ \pi(\theta)$$
Converges (a.s.) to $m(y)$ but very inefficient: many samples fall outside regions of high likelihood.
2) Harmonic mean (Newton & Raftery, 1994):
$$\hat{m}_{NR}(y)=\left[\frac{1}{G}\sum_{g=1}^{G}\frac{1}{f\!\left(y\mid\theta^{(g)}\right)}\right]^{-1},\qquad \theta^{(1)},\dots,\theta^{(G)}:\ \text{draws from}\ \pi(\theta\mid y)$$
A special case of weighted importance sampling (WIS):
$$\sum_{j=1}^{J} f\!\left(y\mid\theta^{(j)}\right)w\!\left(\theta^{(j)}\right)\Big/\sum_{j=1}^{J} w\!\left(\theta^{(j)}\right),\qquad w\!\left(\theta^{(j)}\right)\propto\pi(\theta)/g(\theta)\ \text{for}\ g(\theta)\propto f(y\mid\theta)\,\pi(\theta)$$
Converges (a.s.) but very unstable (infinite variance): to be absolutely avoided.
"Worst Monte Carlo Method Ever" Radford Neal (2010)
The harmonic mean is hardly affected by a change of prior, whereas the true marginal is highly sensitive to the prior.
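To make the contrast concrete, here is a minimal sketch (my illustration, not from the talk) comparing the two estimators on a conjugate normal model where the exact marginal likelihood is available in closed form; NumPy/SciPy are assumed. The direct MC estimate is unbiased but noisy, while the harmonic mean looks deceptively stable yet has infinite variance.

```python
# Minimal sketch (illustration only): direct Monte Carlo vs. harmonic mean
# on y_i | theta ~ N(theta, 1), theta ~ N(0, tau2), where log m(y) is exact.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, tau2, G = 20, 4.0, 100_000
y = rng.normal(1.0, 1.0, size=N)

def log_lik(theta):
    # log f(y | theta) for an array of theta values
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y[:, None] - theta) ** 2, axis=0)

# Exact log m(y): marginally y ~ N(0, I + tau2 * 11')
Sigma = np.eye(N) + tau2
_, logdet = np.linalg.slogdet(Sigma)
log_m_exact = -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(Sigma, y))

# 1) Direct MC: average the likelihood over draws from the prior
th_prior = rng.normal(0.0, np.sqrt(tau2), size=G)
log_m_mc = logsumexp(log_lik(th_prior)) - np.log(G)

# 2) Harmonic mean: average 1/likelihood over draws from the posterior
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * N * y.mean()
th_post = rng.normal(post_mean, np.sqrt(post_var), size=G)
log_m_hm = -(logsumexp(-log_lik(th_post)) - np.log(G))

print(log_m_exact, log_m_mc, log_m_hm)
```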
5. Methods/Gelfand & Dey & Chib
3) Generalized harmonic mean (Gelfand & Dey, 1994; Chen & Shao, 1997):
$$\hat{m}_{GD}(y)=\left[\frac{1}{G}\sum_{g=1}^{G}\frac{g\!\left(\theta^{(g)}\right)}{f\!\left(y\mid\theta^{(g)}\right)\pi\!\left(\theta^{(g)}\right)}\right]^{-1},\qquad \theta^{(1)},\dots,\theta^{(G)}:\ \text{draws from}\ \pi(\theta\mid y)$$
$g(\cdot)$: an approximation of the posterior; problems in large dimension.
4) Chib's method (1995):
$$\ln m(y)=\ln f(y\mid\theta)+\ln\pi(\theta)-\ln\pi(\theta\mid y),\quad\forall\theta$$
$$\ln\hat{m}_{SC}(y)=\ln f\!\left(y\mid\theta^*\right)+\ln\pi\!\left(\theta^*\right)-\ln\hat{\pi}\!\left(\theta^*\mid y\right)$$
$\hat{\pi}(\theta^*\mid y)$ to be estimated, with $\theta^*$ selected as the ML, MAP or $E(\theta\mid y)$ point.
Simple & often effective.
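A minimal sketch (again my own toy, same conjugate normal setup as above) showing the role of the tuning density $g(\cdot)$: here a normal density moment-matched to the posterior draws, which keeps the ratio $g/(f\pi)$ bounded and the estimator stable.

```python
# Minimal sketch (illustration only): Gelfand-Dey estimator with g(.) a
# normal density fitted to the posterior draws of the conjugate model.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)
N, tau2, G = 20, 4.0, 50_000
y = rng.normal(1.0, 1.0, size=N)

def log_lik(theta):
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y[:, None] - theta) ** 2, axis=0)

# posterior draws (exact by conjugacy here; from MCMC in general)
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * N * y.mean()
th = rng.normal(post_mean, np.sqrt(post_var), size=G)

# g(.): normal approximation to the posterior, fitted from the draws
log_g = norm.logpdf(th, th.mean(), th.std())
log_prior = norm.logpdf(th, 0.0, np.sqrt(tau2))

# m_GD = [ (1/G) sum g(th) / (f(y|th) pi(th)) ]^{-1}
log_m_gd = -(logsumexp(log_g - log_lik(th) - log_prior) - np.log(G))
print(log_m_gd)
```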
6. Chib (cont.)
4) Chib (1995):
$$\ln\hat{m}_{SC}(y)=\ln f\!\left(y\mid\theta^*\right)+\ln\pi\!\left(\theta^*\right)-\ln\hat{\pi}\!\left(\theta^*\mid y\right)$$
a) Gibbs & Rao-Blackwellization (Chib, 1995)
b) Metropolis-Hastings (Chib & Jeliazkov, 2001)
c) Kernel estimator (Chen, 1994)
7. Chib via Gibbs
If $\theta=(\theta_1,\theta_2)$:
$$\pi(\theta_1,\theta_2\mid y)=\underbrace{\pi(\theta_1\mid y,\theta_2)}_{\text{known}}\;\underbrace{\pi(\theta_2\mid y)}_{\text{estimated}}$$
$$\pi(\theta_2\mid y)=\int\underbrace{\pi(\theta_2\mid y,\theta_1)}_{\text{known}}\;\underbrace{\pi(\theta_1\mid y)}_{\text{MCMC draws}}\,d\theta_1$$
"Estimation by Rao-Blackwellization":
$$\hat{\pi}\!\left(\theta_2^*\mid y\right)=\frac{1}{G}\sum_{g=1}^{G}\pi\!\left(\theta_2^*\mid y,\theta_1^{(g)}\right),\qquad \theta_1^{(g)}:\ \text{draws from}\ \pi(\theta_1\mid y)$$
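A minimal two-block sketch (my own toy, not Chib's code): $y_i \sim N(\mu, s^2)$ with $\mu \sim N(0, v_0)$ and $s^2 \sim \text{InvGamma}(a_0, b_0)$. Both full conditionals are known, so $\pi(\mu^*\mid y)$ is Rao-Blackwellized over the Gibbs draws of $s^2$, while $\pi(s^2{}^*\mid y,\mu^*)$ is an exact full conditional; NumPy/SciPy assumed.

```python
# Minimal sketch of Chib (1995) via Gibbs + Rao-Blackwellization (toy model):
# y_i ~ N(mu, s2), mu ~ N(0, v0), s2 ~ InvGamma(a0, b0)
import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(2)
N, v0, a0, b0 = 30, 10.0, 2.0, 2.0
y = rng.normal(1.0, 1.5, size=N)

def mu_cond(s2):
    # full conditional mu | y, s2 (closed form): returns (mean, var)
    var = 1.0 / (N / s2 + 1.0 / v0)
    return y.sum() / s2 * var, var

def s2_cond(mu):
    # full conditional s2 | y, mu (closed form)
    return invgamma(a0 + N / 2, scale=b0 + 0.5 * np.sum((y - mu) ** 2))

G, mu, s2 = 20_000, 0.0, 1.0
mu_d, s2_d = np.empty(G), np.empty(G)
for g in range(G):                        # Gibbs sampler
    m, v = mu_cond(s2)
    mu = rng.normal(m, np.sqrt(v))
    s2 = s2_cond(mu).rvs(random_state=rng)
    mu_d[g], s2_d[g] = mu, s2

mu_s, s2_s = mu_d.mean(), s2_d.mean()     # theta* = posterior means

# pi(mu* | y): Rao-Blackwellized average of the known conditional density
ms, vs = np.vectorize(mu_cond)(s2_d)
log_pi_mu = np.log(np.mean(norm.pdf(mu_s, ms, np.sqrt(vs))))
# pi(s2* | y, mu*): exact full conditional for the second block
log_pi_s2 = s2_cond(mu_s).logpdf(s2_s)

log_lik = norm.logpdf(y, mu_s, np.sqrt(s2_s)).sum()
log_prior = norm.logpdf(mu_s, 0.0, np.sqrt(v0)) + invgamma(a0, scale=b0).logpdf(s2_s)
print(log_lik + log_prior - (log_pi_mu + log_pi_s2))   # ln m_hat(y)
```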
8. Bridge sampling
5) Bridge sampling (Meng & Wong, 1996)
Since $\pi(\theta\mid y)=f(y\mid\theta)\,\pi(\theta)/m(y)$, for any $\alpha(\theta)$ and any density $g(\theta)$:
$$\frac{\displaystyle\int \alpha(\theta)\,g(\theta)\,\frac{f(y\mid\theta)\,\pi(\theta)}{m(y)}\,d\theta}{\displaystyle\int \alpha(\theta)\,g(\theta)\,\pi(\theta\mid y)\,d\theta}=1$$
9. Bridge sampling/cont.
5) Bridge sampling (Meng & Wong, 1996):
$$m(y)=\frac{\displaystyle\int \alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\,g(\theta)\,d\theta}{\displaystyle\int \alpha(\theta)\,g(\theta)\,\pi(\theta\mid y)\,d\theta}=\frac{E_{g(\theta)}\!\left[\alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\right]}{E_{\pi(\theta\mid y)}\!\left[\alpha(\theta)\,g(\theta)\right]}$$
$\alpha(\theta)$: "bridge function"; $g(\theta)$: density to be calibrated.

For $\alpha(\theta)=1/g(\theta)$:
$$\hat{m}_{BS1}(y)=L^{-1}\sum_{l=1}^{L} f\!\left(y\mid\theta^{(l)}\right)\pi\!\left(\theta^{(l)}\right)\big/\,g\!\left(\theta^{(l)}\right)\qquad (IS)$$
For $\alpha(\theta)=1/f(y\mid\theta)\,\pi(\theta)$: $\hat{m}_{BS2}(y)$ = Gelfand-Dey (1994).
For $\alpha(\theta)=1\big/\!\left[f(y\mid\theta)\,\pi(\theta)\,g(\theta)\right]^{1/2}$: $\hat{m}_{BS3}(y)$ = Lopes-West (2004):
$$\hat{m}_{BS3}(y)=\frac{L^{-1}\sum_{l=1}^{L}\left[f\!\left(y\mid\theta^{(l)}\right)\pi\!\left(\theta^{(l)}\right)\big/g\!\left(\theta^{(l)}\right)\right]^{1/2}}{M^{-1}\sum_{m=1}^{M}\left[g\!\left(\theta^{(m)}\right)\big/f\!\left(y\mid\theta^{(m)}\right)\pi\!\left(\theta^{(m)}\right)\right]^{1/2}}$$
$\theta^{(l)}$: draws from $g(\theta)$; $\theta^{(m)}$: draws from $\pi(\theta\mid y)$.
10. Bridge sampling (cont.)
5) Bridge sampling (Meng & Wong, 1996):
$$m(y)=\frac{\displaystyle\int \alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\,g(\theta)\,d\theta}{\displaystyle\int \alpha(\theta)\,g(\theta)\,\pi(\theta\mid y)\,d\theta}=\frac{E_{g(\theta)}\!\left[\alpha(\theta)\,f(y\mid\theta)\,\pi(\theta)\right]}{E_{\pi(\theta\mid y)}\!\left[\alpha(\theta)\,g(\theta)\right]}$$
For $\alpha(\theta)=1\big/\!\left[f(y\mid\theta)\,\pi(\theta)\,g(\theta)\right]$ (Lopes & West, 2004; Ando, 2010):
$$\hat{m}_{BS4}(y)=\frac{L^{-1}\sum_{l=1}^{L} 1\big/g\!\left(\theta^{(l)}\right)}{M^{-1}\sum_{m=1}^{M} 1\big/f\!\left(y\mid\theta^{(m)}\right)\pi\!\left(\theta^{(m)}\right)}$$
$\theta^{(l)}$: draws from $g(\theta)$; $\theta^{(m)}$: draws from $\pi(\theta\mid y)$. Odd (cf. the numerator draws).

For $\alpha(\theta)\propto\left[s_M\,\pi(\theta\mid y)+s_L\,g(\theta)\right]^{-1}$: optimum estimator w.r.t. E(RMSE)
(Meng & Wong, 1996; Lopes & West, 2004; Frühwirth-Schnatter, 2004), computed iteratively:
$$\hat{m}_{BS5}^{(t+1)}(y)=\hat{m}_{BS5}^{(t)}\;\frac{L^{-1}\displaystyle\sum_{l=1}^{L}\frac{\hat{\pi}_t\!\left(\theta^{(l)}\mid y\right)}{s_M\,\hat{\pi}_t\!\left(\theta^{(l)}\mid y\right)+s_L\,g\!\left(\theta^{(l)}\right)}}{M^{-1}\displaystyle\sum_{m=1}^{M}\frac{g\!\left(\theta^{(m)}\right)}{s_M\,\hat{\pi}_t\!\left(\theta^{(m)}\mid y\right)+s_L\,g\!\left(\theta^{(m)}\right)}}$$
where $\hat{\pi}_t(\theta\mid y)=f(y\mid\theta)\,\pi(\theta)/\hat{m}_{BS5}^{(t)}$, $\hat{m}_{BS5}^{(0)}=\hat{m}_{BS1}$ or $\hat{m}_{BS2}$, and $s_M=1-s_L=M/(M+L)$.
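A minimal sketch (my own toy, not the authors' code) of the iterated optimal-bridge estimator $\hat{m}_{BS5}$ on the conjugate normal model, with $g(\cdot)$ calibrated to the posterior draws; the fixed-point loop implements the update displayed above.

```python
# Minimal sketch: iterative optimal bridge sampling (Meng & Wong, 1996)
# on y_i | theta ~ N(theta, 1), theta ~ N(0, tau2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, tau2 = 20, 4.0
y = rng.normal(1.0, 1.0, size=N)

def log_q1(theta):
    # unnormalized posterior f(y|theta) * pi(theta)
    ll = -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y[:, None] - theta) ** 2, axis=0)
    return ll + norm.logpdf(theta, 0.0, np.sqrt(tau2))

# posterior draws (exact here; MCMC in general) and calibrated g(.)
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * N * y.mean()
M = L = 20_000
th_post = rng.normal(post_mean, np.sqrt(post_var), size=M)
g_mean, g_sd = th_post.mean(), th_post.std()
th_g = rng.normal(g_mean, g_sd, size=L)

sM, sL = M / (M + L), L / (M + L)
q1_g, q1_p = np.exp(log_q1(th_g)), np.exp(log_q1(th_post))
g_g = norm.pdf(th_g, g_mean, g_sd)
g_p = norm.pdf(th_post, g_mean, g_sd)

m_hat = np.mean(q1_g / g_g)           # start from the IS estimate (BS1)
for _ in range(50):                   # fixed-point iteration for m_BS5
    num = np.mean((q1_g / m_hat) / (sM * q1_g / m_hat + sL * g_g))
    den = np.mean(g_p / (sM * q1_p / m_hat + sL * g_p))
    m_hat *= num / den
print(np.log(m_hat))
```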
11. Nested sampling
6) Nested sampling (Skilling, 2006; Murray et al., 2006; Chopin & Robert, 2010)
$$m(y)=\int f(y\mid\theta)\,\pi(\theta)\,d\theta=E_{\pi}\!\left[L(\theta)\right]\qquad\text{with}\ Z\equiv m(y),\ L(\theta)\equiv f(y\mid\theta)$$
Let $x=\varphi^{-1}(l)=\Pr\left[L(\theta)>l\right]$ be the survival function of the r.v. $L(\theta)$, where $l=\varphi(x)$ is the (upper-tail) quantile function of $L(\theta)$, so that $x\sim U(0,1)$.
Then $Z=\int_0^1\varphi(x)\,dx$ (area under the curve $l=\varphi(x)$) and
$$\hat{Z}=\sum_{i=1}^{m}\Delta x_i\,l_i,\qquad \Delta x_i=x_{i-1}-x_i\ \ \text{or}\ \ \Delta x_i=\tfrac{1}{2}\left(x_{i-1}-x_{i+1}\right)\ \text{if trapezoidal integration}$$
12. Nested sampling/Cont.
1) Draw $N$ points $\theta_{1,i}$ from the prior; set $\theta_1=\operatorname{Argmin}_{i=1,\dots,N}L(\theta_{1,i})$ and $l_1=L(\theta_1)$.
2) Obtain $N$ points $\theta_{2,i}$ by keeping the $\theta_{1,i}$, except that $\theta_1$ is replaced by a draw from the prior constrained by $L(\theta)>l_1$; record $\theta_2=\operatorname{Argmin}_{i=1,\dots,N}L(\theta_{2,i})$ and set $l_2=L(\theta_2)$.
3) Repeat steps 1 & 2 until a stopping rule is met (change in the max of $L\leq\varepsilon$).
Since $x_i=\varphi^{-1}(l_i)$ is unknown, set either
a) deterministic: $x_i=\exp(-i/N)$, so that $\ln x_i=E\!\left[\ln\varphi^{-1}(l_i)\right]$, or
b) random: $x_{i+1}=t_i x_i$ with $x_0=1$, $t_i\sim Be(N,1)$.
Main difficulty: sampling $\theta$ from the prior constrained by $L(\theta)>l$.
See Chopin & Robert (2010) for an extended importance sampling scheme,
$$\hat{Z}=\sum_{i=1}^{m}\Delta x_i\,\varphi_i\,w_i\quad\text{with weights}\ w(\theta)\ \text{correcting an instrumental prior}\ \tilde{\pi}\ \text{so that}\ \pi(\theta)L(\theta)=\tilde{\pi}(\theta)L(\theta)\,w(\theta)$$
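A minimal sketch (toy only, not Skilling's implementation) of the basic scheme with the deterministic schedule $x_i=\exp(-i/N)$, again on the conjugate normal model; constrained prior sampling is done by plain rejection, which only works in low dimension.

```python
# Minimal nested sampling sketch on y_j | theta ~ N(theta, 1), theta ~ N(0, tau2)
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)
Nd, tau2 = 20, 4.0
y = rng.normal(1.0, 1.0, size=Nd)

def log_L(theta):
    return -0.5 * Nd * np.log(2 * np.pi) - 0.5 * np.sum((y - theta) ** 2)

N = 100                                     # live points
live = rng.normal(0.0, np.sqrt(tau2), N)    # draws from the prior
logL = np.array([log_L(t) for t in live])

terms, x_prev = [], 1.0
for i in range(1, 801):                     # fixed budget; a stopping rule on
    k = int(np.argmin(logL))                # the remaining mass is better
    x_i = np.exp(-i / N)                    # deterministic shrinkage
    terms.append(logL[k] + np.log(x_prev - x_i))
    x_prev = x_i
    while True:                             # rejection: prior draw with L > l_k
        t = rng.normal(0.0, np.sqrt(tau2))
        if log_L(t) > logL[k]:
            break
    live[k], logL[k] = t, log_L(t)

# add the remaining live-point mass, then total evidence
terms += list(logL + np.log(x_prev) - np.log(N))
print(logsumexp(terms))                     # estimate of ln Z = ln m(y)
```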
13. Power Posteriors/basic principle
Method due to Friel & Pettitt (2008); Lartillot & Philippe (2006): "annealing-melting".
The power posterior is defined as
$$\pi(\theta\mid y,t)=\frac{f(y\mid\theta)^{t}\,\pi(\theta)}{z_t(y)}\quad\text{where}\quad z_t(y)=\int f(y\mid\theta)^{t}\,\pi(\theta)\,d\theta$$
and $t\in[0,1]$, with $t^{-1}$ playing the role of a "physical temperature".
$t=0$ to $1$: cooling down or "annealing"; $t=1$ to $0$: "melting".
Notice the path sampling scheme (Gelman & Meng, 1998):
$$\pi(\theta\mid y,0)=\pi(\theta)\ \text{with}\ z_0(y)=1;\qquad \pi(\theta\mid y,1)=\pi(\theta\mid y)\ \text{with}\ z_1(y)=m(y)$$
14. PP/key result
$$\log m(y)=\int_0^1 E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right]dt$$
where $\theta\mid y,t$ has density
$$\pi(\theta\mid y,t)=\frac{f(y\mid\theta)^{t}\,\pi(\theta)}{z_t(y)}$$
Thermodynamic integration (end of the 70's): Ripley (1988), Ogata (1989), Neal (1993).
"Path sampling" (Gelman & Meng, 1998).
15. PP formula/proof as a special case of path sampling
If $p(\theta\mid t)=q(\theta\mid t)/z(t)$ where $z(t)=\int q(\theta\mid t)\,d\theta$,
label $U(\theta,t)=\frac{d}{dt}\ln q(\theta\mid t)$ the potential.
One has
$$\ln\frac{z(1)}{z(0)}=\int_0^1 E_{\theta\mid t}\!\left[U(\theta,t)\right]dt$$
Here $p(\theta\mid t)=\pi(\theta\mid y,t)$ and $q(\theta\mid t)=f(y\mid\theta)^{t}\,\pi(\theta)$; then $U(\theta,t)=\ln f(y\mid\theta)$.
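For completeness, the one-line derivation behind this identity (standard path sampling algebra, as in Gelman & Meng, 1998):
$$\frac{d}{dt}\ln z(t)=\frac{1}{z(t)}\int\frac{\partial q(\theta\mid t)}{\partial t}\,d\theta=\int\frac{q(\theta\mid t)}{z(t)}\,\frac{\partial\ln q(\theta\mid t)}{\partial t}\,d\theta=E_{\theta\mid t}\!\left[U(\theta,t)\right]$$
Integrating over $t$ from 0 to 1 gives $\ln z(1)/z(0)=\int_0^1 E_{\theta\mid t}\!\left[U(\theta,t)\right]dt$; with $q(\theta\mid t)=f(y\mid\theta)^{t}\,\pi(\theta)$ one gets $U(\theta,t)=\ln f(y\mid\theta)$, $z(0)=1$ and $z(1)=m(y)$, i.e. the key result of the previous slide.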
16. PP/Example
$$y_i\mid\theta\sim \text{iid } N(\theta,1),\ i=1,\dots,N;\qquad \theta\sim N(\mu,\tau^2)$$
Then $\theta\mid y,t\sim N(\mu_t,\tau_t^2)$ with
$$\mu_t=\frac{Nt\bar{y}+\mu\tau^{-2}}{Nt+\tau^{-2}},\qquad \tau_t^2=\frac{1}{Nt+\tau^{-2}}$$
$$D_t\equiv -2\,E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right]=N\left(\log 2\pi+s^2\right)+\frac{N(\mu-\bar{y})^2}{(N\tau^2 t+1)^2}+\frac{N}{Nt+\tau^{-2}}$$
$$\bar{y}=N^{-1}\sum_{i=1}^{N}y_i;\qquad s^2=N^{-1}\sum_{i=1}^{N}(y_i-\bar{y})^2$$
$$D_0=N\left[\text{Cte}+(\mu-\bar{y})^2+\tau^2\right]$$
High sensitivity to $\tau^2$: as $\tau^2\to\infty$, $D_0\to\infty$.
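Because everything is Gaussian here, the PP identity can be checked end to end; a minimal sketch (my illustration) integrating the closed-form integrand over a temperature ladder and comparing with the exact marginal likelihood:

```python
# Minimal check of log m(y) = int_0^1 E_{theta|y,t}[log f(y|theta)] dt
# on the analytic example above.
import numpy as np

rng = np.random.default_rng(5)
N, mu, tau2 = 20, 0.0, 4.0
y = rng.normal(1.0, 1.0, size=N)
ybar, s2 = y.mean(), y.var()

def E_logf(t):
    # E_{theta|y,t} log f(y|theta) = -D_t / 2, with D_t as on the slide
    return -0.5 * (N * (np.log(2 * np.pi) + s2)
                   + N * (mu - ybar) ** 2 / (N * tau2 * t + 1) ** 2
                   + N / (N * t + 1 / tau2))

# trapezoidal quadrature on the ladder t_i = (i/n)^c, denser near t = 0
n, c = 200, 5
t = (np.arange(n + 1) / n) ** c
E = E_logf(t)
log_m_pp = np.sum(0.5 * np.diff(t) * (E[1:] + E[:-1]))

# exact log m(y): marginally y ~ N(mu 1, I + tau2 * 11')
Sigma = np.eye(N) + tau2
_, logdet = np.linalg.slogdet(Sigma)
r = y - mu
log_m_exact = -0.5 * (N * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))
print(log_m_pp, log_m_exact)
```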
18. KL distance Prior-Posterior
$$KL\!\left(\pi(\theta\mid y),\pi(\theta)\right)=\int \ln\frac{\pi(\theta\mid y)}{\pi(\theta)}\,\pi(\theta\mid y)\,d\theta$$
$$KL=\int \ln\frac{f(y\mid\theta)\,\pi(\theta)}{m(y)\,\pi(\theta)}\,\pi(\theta\mid y)\,d\theta=E_{\theta\mid y}\!\left[\ln f(y\mid\theta)\right]-\ln m(y)$$
$$-2\,KL=\bar{D}-D_m\ \text{(by-product of PP)}\ \Rightarrow\ D_m=\bar{D}+2\,KL$$
$$DIC=\bar{D}+p_D\quad\text{where}\quad p_D=\bar{D}-D(\bar{\theta})\ \text{(model complexity)}$$
19. PP/partial BF
1) If $\pi(\theta)$ is improper, the marginal $m(y)$ is also improper, resulting in problems for defining the BF.
2) High sensitivity of the BF to priors (which does not vanish with increasing sample size).
Idea behind the partial BF (Lempers, 1971): split $y=(y_P,y_T)$
- Learning or pilot sample $y_P$ to tune the prior
- Testing sample $y_T$ for data analysis
Intrinsic BF (Berger & Pericchi, 1996)
Fractional BF (O'Hagan, 1995)
20. Fractional BF
A fraction $b$ of the likelihood is used to tune the prior:
$$f(y_P\mid\theta)\approx f(y\mid\theta)^{b},\qquad b=m/N<1\ \ \text{(O'Hagan, 1995)}$$
resulting in:
$$\pi(\theta,b)\propto f(y\mid\theta)^{b}\,\pi(\theta)$$
21. PP & fractional BF
$$\pi(\theta,b)\propto f(y\mid\theta)^{b}\,\pi(\theta)$$
$$m^{F}(y,b)=\int f(y\mid\theta)^{1-b}\,\pi(\theta,b)\,d\theta=\frac{\int f(y\mid\theta)\,\pi(\theta)\,d\theta}{\int f(y\mid\theta)^{b}\,\pi(\theta)\,d\theta}=\frac{m(y,1)}{m(y,b)}$$
PP directly provides (see the sketch below):
- $\pi(\theta,b)$ via $\pi(\theta\mid y,t=b)$
- $\log m^{F}(y,b)=\int_{b}^{1}E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right]dt$
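A minimal sketch (my illustration): given PP output, i.e. a ladder `t[i]` and Monte Carlo estimates `E[i]` of $E_{\theta\mid y,t_i}\log f(y\mid\theta)$, $\log m(y)$ and $\log m^F(y,b)$ are the same trapezoidal quadrature over $[0,1]$ and $[b,1]$ respectively:

```python
# Minimal sketch: log m(y) and log m^F(y,b) from the same power posterior run
import numpy as np

def pp_integral(t, E, lo=0.0, hi=1.0, K=400):
    """Trapezoidal estimate of int_lo^hi E_{theta|y,t}[log f(y|theta)] dt,
    interpolating the ladder estimates E at the grid points."""
    grid = np.linspace(lo, hi, K)
    Eg = np.interp(grid, t, E)
    return float(np.sum(0.5 * np.diff(grid) * (Eg[1:] + Eg[:-1])))

# usage, with (t, E) produced by the algorithm of the next slide:
# log_m  = pp_integral(t, E, 0.0, 1.0)   # log m(y)
# log_mF = pp_integral(t, E, b,   1.0)   # log m^F(y,b)
```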
22. PP/algorithm
MCMC with a discretization of $t$ on $[0,1]$:
$$t_0=0<t_1<\dots<t_i<\dots<t_{n-1}<t_n=1,\qquad t_i=(i/n)^{c},\ i=1,\dots,n;\ n=20\ \text{to}\ 100;\ c=2\ \text{to}\ 5$$
1) Make MCMC draws $\theta^{(g_i)}$ from $\pi(\theta\mid y,t_i)$.
2) Compute
$$\hat{E}_{\theta\mid y,t=t_i}\!\left[\log p(y\mid\theta)\right]=\frac{1}{G}\sum_{g=1}^{G}\log p\!\left(y\mid\theta^{(g_i)}\right)$$
Often conditional independence holds, $\log p(y\mid\theta)=\sum_{i=1}^{N}\log p(y_i\mid\theta)$, e.g. if $\theta$ is the closest stochastic parent of $y=(y_i)$ (as for DIC).
3) Approximate the integral (e.g. trapezoidal rule):
$$\widehat{\log m}(y)=\tfrac{1}{2}\sum_{i=0}^{n-1}\left(t_{i+1}-t_i\right)\left(\hat{E}_{i+1}+\hat{E}_i\right)$$
Error due to this numerical approximation: Calderhead & Girolami (2009).
Formula for the MC sampling error: see Friel & Pettitt (2008).
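The three steps in a minimal runnable sketch (my own toy, not the author's code): the normal model of slide 16, with a random-walk Metropolis kernel standing in for whatever MCMC scheme the model requires; burn-in is omitted for brevity.

```python
# Minimal power posterior algorithm: ladder, MCMC per rung, trapezoid in t
import numpy as np

rng = np.random.default_rng(6)
N, mu0, tau2 = 20, 0.0, 4.0
y = rng.normal(1.0, 1.0, size=N)

def log_lik(theta):
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y - theta) ** 2)

def log_prior(theta):
    return -0.5 * np.log(2 * np.pi * tau2) - 0.5 * (theta - mu0) ** 2 / tau2

n, c, G = 30, 5, 2000
t_grid = (np.arange(n + 1) / n) ** c         # denser near t = 0
E_hat = np.empty(n + 1)
theta = mu0                                  # warm start across rungs
for i, t in enumerate(t_grid):
    ll_sum, ll = 0.0, log_lik(theta)
    for g in range(G):                       # Metropolis at temperature t
        prop = theta + rng.normal(0.0, 1.0)
        ll_prop = log_lik(prop)
        log_acc = (t * ll_prop + log_prior(prop)) - (t * ll + log_prior(theta))
        if np.log(rng.uniform()) < log_acc:
            theta, ll = prop, ll_prop
        ll_sum += ll
    E_hat[i] = ll_sum / G                    # estimate of E_{theta|y,t} log f

log_m = np.sum(0.5 * np.diff(t_grid) * (E_hat[1:] + E_hat[:-1]))
print(log_m)
```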
23. PP/Little toy example
0) $y_i\mid\lambda_i\sim \text{id } P(\lambda_i x_i)\ \Leftrightarrow\ f(y_i\mid\lambda_i)=\dfrac{(\lambda_i x_i)^{y_i}\exp(-\lambda_i x_i)}{y_i!}$
1) $\lambda_i\sim \text{id } G(\alpha,\beta)\ \Leftrightarrow\ \pi(\lambda_i)=\dfrac{\beta^{\alpha}\lambda_i^{\alpha-1}\exp(-\beta\lambda_i)}{\Gamma(\alpha)}$
0 + 1) $y_i\sim \text{id } BN(\alpha,p_i)$ (negative binomial) where $p_i=\beta/(\beta+x_i)$
Direct approach:
$$f(y_i)=\frac{\Gamma(y_i+\alpha)}{\Gamma(\alpha)\,y_i!}\,p_i^{\alpha}\,(1-p_i)^{y_i}$$
$$\ln f(y)=-n\ln\Gamma(\alpha)+\sum_{i=1}^{n}\ln\Gamma(y_i+\alpha)-\sum_{i=1}^{n}\ln(y_i!)+\alpha\sum_{i=1}^{n}\ln p_i+\sum_{i=1}^{n}y_i\ln(1-p_i)$$
Indirect approach: $f(y)=\prod_{i=1}^{n}\int f(y_i\mid\lambda_i)\,\pi(\lambda_i)\,d\lambda_i$
24. PP/Little toy example/cont.
Ex / Pump data: Ex #2 in WinBUGS; Carlin & Louis (p. 126)
$y$ = # failures of pumps over operation time $x$ (in $10^3$ hrs):
$$y=(5,1,5,14,3,19,1,1,4,22);\qquad n=10;\ \alpha=\beta=1$$
$$x=(94.3,15.7,62.9,126,5.24,31.4,1.05,1.05,2.1,10.5)$$
Exact: $D=-2\ln f(y)=66.03$; power posteriors: $\hat{D}_{FP}=66.28\pm 0.03$ (20-point ladder).
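The exact value is easy to reproduce from the direct (negative binomial) formula of the previous slide; a minimal check (my code, SciPy assumed):

```python
# Exact marginal deviance for the pump data, alpha = beta = 1
import numpy as np
from scipy.special import gammaln

y = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])
x = np.array([94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
alpha = beta = 1.0
p = beta / (beta + x)

log_f = (gammaln(y + alpha) - gammaln(alpha) - gammaln(y + 1)
         + alpha * np.log(p) + y * np.log(1 - p)).sum()
print(-2 * log_f)   # = 66.03, matching the slide
```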
29. Sampling both θ & t
Starting again from
$$\log m(y)=\int_0^1\!\!\int \log f(y\mid\theta)\,\pi(\theta\mid y,t)\,d\theta\,dt$$
write, for any density $p(t)$ on $[0,1]$:
$$\log m(y)=\int_0^1\!\!\int \frac{\log f(y\mid\theta)}{p(t)}\,\underbrace{\pi(\theta\mid y,t)\,p(t)}_{\pi(\theta,t\mid y)}\,d\theta\,dt=E_{\theta,t\mid y}\!\left[\frac{\log f(y\mid\theta)}{p(t)}\right]$$
with $\pi(\theta\mid y,t)\propto f(y\mid\theta)^{t}\,\pi(\theta)$.
If we assume $p(t)\propto z_t(y)$, then $\pi(t\mid\theta,y)\propto f(y\mid\theta)^{t}$.
Sampling $(\theta,t)$ jointly under these conditions gives poor estimation (too few draws of $t$ close to 0).
31. Model comparison on Potthoff's data
$i$: subscript for individual, $i=1,\dots,I=27$ (11 girls + 16 boys)
$j$: subscript for measurement at age $t_j$ (8, 10, 12, 14 yrs)
1) Purely fixed model:
$$y_{ij}=\underbrace{(\alpha_0+\alpha x_i)}_{\text{intercept}}+\underbrace{(\beta_0+\beta x_i)}_{\text{slope}}(t_j-8)+e_{ij}$$
2) Random intercept model:
$$y_{ij}=(\alpha_0+\alpha x_i+a_i)+(\beta_0+\beta x_i)(t_j-8)+e_{ij}$$
3) Random intercept & slope model assuming independent effects:
$$y_{ij}=(\alpha_0+\alpha x_i+a_i)+(\beta_0+\beta x_i+b_i)(t_j-8)+e_{ij}$$
or
$$y_{ij}=\phi_{i1}+\phi_{i2}(t_j-8)+e_{ij},\qquad y_{ij}\sim \text{id } N(\eta_{ij},\sigma_e^2)$$
with
$$\phi_i=\begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix}\sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\begin{pmatrix}\sigma_a^2&0\\ 0&\sigma_b^2\end{pmatrix}\right)$$
4) Random intercept & slope model assuming correlated effects:
$$\phi_i=\begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix}\sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\begin{pmatrix}\sigma_a^2&\sigma_{ab}\\ \sigma_{ab}&\sigma_b^2\end{pmatrix}\right)$$
32. Model presentation: Hierarchical Bayes
1st level: $y_{ij}\sim \text{id } N(\eta_{ij},\sigma_e^2)$ with $\eta_{ij}=\phi_{i1}+\phi_{i2}(t_j-8)$
2nd level:
2a) $\phi_i=\begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix}\sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\ \Sigma=\begin{pmatrix}\sigma_a^2&\sigma_{ab}\\ \sigma_{ab}&\sigma_b^2\end{pmatrix}\right)$
2b) $\sigma_e\sim U(0,\Delta_e)$ or $\sigma_e^2\sim \text{InvG}(1,\sigma_e^2)$
3rd level:
Fixed effects: $\alpha_0,\alpha,\beta_0,\beta\sim U(\text{inf},\text{sup})$
Var (covar) components:
- If $\sigma_{ab}=0$, then i) $\sigma_a\sim U(0,\Delta_a)$, same for $\sigma_b\sim U(0,\Delta_b)$; or ii) $\sigma_a^2\sim \text{InvG}(1,\sigma_a^2)$, same for $\sigma_b^2\sim \text{InvG}(1,\sigma_b^2)$
- If $\sigma_{ab}\neq 0$, then i) $\sigma_a\sim U(0,\Delta_a)$, $\sigma_b\sim U(0,\Delta_b)$, $\rho\sim U(-1,1)$; or ii) $\Omega\sim W\!\left((\nu\Sigma)^{-1},\nu\right)$* for $\Omega=\Sigma^{-1}$, with $\nu=\dim(\Omega)+1$ and $\Sigma$ a known location parameter
*Take care: WinBUGS uses another notation, i.e. $W\!\left(\nu\Sigma,\nu\right)$.
35. Example 2: Models of genetic differentiation
Two-level hierarchical model
$i$ = locus; $j$ = (sub)population
$a_{ij}$ = number of genes carrying a given allele at locus $i$ in pop. $j$
$p_{ij}$ = frequency of that allele at locus $i$ in pop. $j$
0) $y_{ij}\mid\alpha_{ij}\sim \text{id } B(n_{ij},\alpha_{ij})$
1) $\alpha_{ij}\mid\pi_i,c_j\sim \text{id Beta}\!\left(\tau_j\pi_i,\ \tau_j(1-\pi_i)\right)$ with $\tau_j=\dfrac{1-c_j}{c_j}$, $c_j$ the differentiation index
$\pi_i$ = frequency of that allele at locus $i$ in the gene pool
2) $\pi_i\sim \text{id Beta}(a_\pi,b_\pi)$, $c_j\sim \text{id Beta}(a_c,b_c)$
Migration-drift at equilibrium (Balding)
36. Ex2: Nicholson’s model
Nicholson et al. (2002): same as previously, but
1) $\alpha_{ij}\mid\pi_i,c_j\sim \text{id } N\!\left(\pi_i,\ c_j\pi_i(1-\pi_i)\right)$
Truncated normal with point masses at 0 and 1, so that $y_{ij}\mid\alpha_{ij}^*\sim \text{id } B(n_{ij},\alpha_{ij}^*)$ where $\alpha_{ij}^*=\max\!\left(0,\min(1,\alpha_{ij})\right)$
2) $\pi_i\sim \text{id Beta}(a_\pi,b_\pi)$, $c_j\sim \text{id Beta}(a_c,b_c)$
Pure drift model
38. Conclusion
Derived from thermodynamic integration
Link with "path sampling"
Easy to understand and quite general
Well suited to complex hierarchical models
"Thetas" can be defined as the closest stochastic parents of the data, making the latter conditionally independent
Draws only from posterior distributions
Gives the fractional BF as a by-product
Easy to implement (including in OpenBUGS) but time-consuming
Caution needed in the discretization of $t$ (close to 0)
39. Some references
Chen M, Shao Q, Ibrahim J (2000) Monte Carlo Methods in Bayesian Computation. Springer.
Chib S (1995) Marginal likelihood from the Gibbs output. JASA, 90, 1313-1321.
Chopin N, Robert CP (2010) Properties of nested sampling. Biometrika, 97, 741-755.
Friel N, Pettitt AN (2008) Marginal likelihood estimation via power posteriors. JRSS B, 70, 589-607.
Frühwirth-Schnatter S (2004) Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econometrics Journal, 7, 143-167.
Gelman A, Meng X-L (1998) Simulating normalizing constants: from importance sampling to bridge sampling and path sampling. Statistical Science, 13, 163-185.
Lartillot N, Philippe H (2006) Computing Bayes factors using thermodynamic integration. Systematic Biology, 55, 195-207.
Marin JM, Robert CP (2009) Importance sampling methods for Bayesian discrimination between embedded models. arXiv:0910.2325v1.
Meng X-L, Wong WH (1996) Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6, 831-860.
O'Hagan A (1995) Fractional Bayes factors for model comparison. JRSS B, 57, 99-138.
40. Acknowledgements
Nial Friel (University College Dublin) for his interest in these applications and his invaluable explanations & suggestions
Tony O'Hagan for further insight into the FBF
Gilles Celeux, Mathieu Gautier as co-advisors of the Master dissertation of Yoan Soussan (Paris VI)
Christian Robert for his blog and his relevant comments, standpoints and bibliographical references
The Applibugs & Babayes groups for stimulating discussions on DIC, BF, CPO & other information criteria (AIC, BIC)