Computation of the marginal likelihood:
brief summary and method of power posteriors

Jean-Louis Foulley
jean-louis.foulley@jouy.inra.fr
Outline
             Objectives
             Brief summary of current methods
                 Direct Monte Carlo
                Harmonic mean
                Generalized harmonic mean
                Chib
                Bridge sampling
                Nested sampling
             Power Posteriors
             Relationship with fractional BF
             Algorithm
             Examples
             Conclusion



Objectives
    Marginal likelihood ("Prior Predictive", "Evidence")
m(y) = ∫_Θ f(y|θ) π(θ) dθ

- Normalization constant of π*(θ|y):
π(θ|y) = π*(θ|y) / m(y), where π*(θ|y) = f(y|θ) π(θ)

- Component of the Bayes factor:
BF_12 = [π(M_1|y)/π(M_2|y)] / [π(M_1)/π(M_2)] = m_1(y)/m_2(y)
ΔD_{m,12} = −2 ln BF_12 = D_{m,1} − D_{m,2}
D_{m,j} = −2 ln m_j(y): marginal deviance

Calibration: Jeffreys & Turing (deciban: 10 log_10 BF)
Methods/Monte Carlo, Harmonic Mean
1) Direct Monte Carlo
m̂_MC(y) = (1/G) Σ_{g=1}^G f(y|θ^(g)),  θ^(1), ..., θ^(G): draws from π(θ)
Converges (a.s.) to m(y) but very inefficient: many samples fall outside the regions of high likelihood.

2) Harmonic mean (Newton & Raftery, 1994)
m̂_NR(y) = [ (1/G) Σ_{g=1}^G 1/f(y|θ^(g)) ]^{-1},  θ^(1), ..., θ^(G): draws from π(θ|y)
A special case of weighted importance sampling (WIS):
Σ_{j=1}^J f(y|θ^(j)) w(θ^(j)) / Σ_{j=1}^J w(θ^(j)),
where w(θ^(j)) ∝ π(θ)/g(θ) for g(θ) ∝ f(y|θ) π(θ).
Converges (a.s.) but very unstable (infinite variance): to be absolutely avoided;
the "Worst Monte Carlo Method Ever" (Radford Neal, 2010).
The harmonic mean is hardly affected by a change in prior, whereas the true marginal is highly sensitive to the prior.
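To make the contrast concrete, here is a minimal Python sketch (not from the talk) comparing the two estimators on an assumed toy conjugate model y_i | θ ~ N(θ, 1), θ ~ N(0, 2), for which log m(y) is available in closed form; the data, sample sizes and seed are purely illustrative.

```python
# Minimal sketch (assumed toy model): direct Monte Carlo vs. harmonic mean
# estimators of m(y) when the exact value is known in closed form.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, mu0, tau2 = 20, 0.0, 2.0
y = rng.normal(1.0, 1.0, size=N)          # illustrative data

def loglik(theta):
    # log f(y | theta) evaluated at a vector of theta values
    return stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)

# Exact log m(y): marginally y ~ N(mu0 * 1, I + tau2 * 11')
cov = np.eye(N) + tau2 * np.ones((N, N))
log_m_exact = stats.multivariate_normal.logpdf(y, mean=np.full(N, mu0), cov=cov)

G = 100_000
# 1) Direct Monte Carlo: average the likelihood over prior draws
theta_prior = rng.normal(mu0, np.sqrt(tau2), size=G)
log_m_mc = np.logaddexp.reduce(loglik(theta_prior)) - np.log(G)

# 2) Harmonic mean: average 1/likelihood over (here conjugate) posterior draws
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * (y.sum() + mu0 / tau2)
theta_post = rng.normal(post_mean, np.sqrt(post_var), size=G)
log_m_hm = np.log(G) - np.logaddexp.reduce(-loglik(theta_post))

print(f"exact    {log_m_exact:8.3f}")
print(f"direct   {log_m_mc:8.3f}")
print(f"harmonic {log_m_hm:8.3f}   # typically off and very unstable across seeds")
```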
Methods/Gelfand&Dey & Chib
3) Generalized harmonic mean (Gelfand & Dey, 1994; Chen & Shao, 1997)
m̂_GD(y) = [ (1/G) Σ_{g=1}^G g(θ^(g)) / ( f(y|θ^(g)) π(θ^(g)) ) ]^{-1},  θ^(1), ..., θ^(G): draws from π(θ|y)
g(.) acts as an approximation of the posterior: problems in large dimension.

4) Chib's method (1995)
ln m(y) = ln f(y|θ) + ln π(θ) − ln π(θ|y), ∀θ
ln m̂_SC(y) = ln f(y|θ*) + ln π(θ*) − ln π̂(θ*|y)
π̂(θ|y) to be estimated & θ* = ML, MAP or E(θ|y) selected.
Simple & often effective.
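Continuing the same assumed toy model, here is a minimal sketch of the Gelfand-Dey estimator; taking g(.) as a moment-matched normal approximation of the posterior is an illustrative choice, not something prescribed by the talk.

```python
# Minimal sketch (assumed toy model, continued): Gelfand-Dey estimator.
# g(.) should have thinner tails than the posterior for the variance to be
# finite; a moment-matched normal is adequate for this nearly normal posterior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, mu0, tau2 = 20, 0.0, 2.0
y = rng.normal(1.0, 1.0, size=N)
loglik = lambda th: stats.norm.logpdf(y[:, None], loc=th, scale=1.0).sum(axis=0)
logprior = lambda th: stats.norm.logpdf(th, loc=mu0, scale=np.sqrt(tau2))

# Posterior draws (conjugate here; MCMC draws in general)
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * (y.sum() + mu0 / tau2)
theta = rng.normal(post_mean, np.sqrt(post_var), size=50_000)

g = stats.norm(theta.mean(), theta.std(ddof=1))                 # g fitted to the draws
log_ratio = g.logpdf(theta) - loglik(theta) - logprior(theta)   # g / (f * pi)
log_m_gd = np.log(len(theta)) - np.logaddexp.reduce(log_ratio)  # [mean of ratio]^{-1}
print(f"Gelfand-Dey log m(y) = {log_m_gd:.3f}")
```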
Chib (cont.)

4) Chib (1995)
ln m̂_SC(y) = ln f(y|θ*) + ln π(θ*) − ln π̂(θ*|y)
a) Gibbs & Rao-Blackwellization (Chib, 1995)
b) Metropolis-Hastings (Chib & Jeliazkov, 2001)
c) Kernel estimator (Chen, 1994)
Chib via Gibbs

If θ = (θ_1, θ_2):
π(θ_1, θ_2 | y) = π(θ_1 | y, θ_2) × π(θ_2 | y)
                   [known]          [estimated]

π(θ_2 | y) = ∫ π(θ_2 | y, θ_1) π(θ_1 | y) dθ_1
               [known]          [MCMC draws]

"Estimation by Rao-Blackwellization":
π̂(θ_2* | y) = (1/G) Σ_{g=1}^G π(θ_2* | y, θ_1^(g)),  θ_1^(g): draws from π(θ_1 | y)
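A minimal sketch of Chib's Gibbs/Rao-Blackwellization route on an assumed two-block toy model (unknown mean and variance of a normal sample); the priors, hyperparameters and data below are illustrative, not taken from the talk.

```python
# Minimal sketch (assumed toy model): Chib's estimator via Gibbs sampling,
#   y_i ~ N(mu, s2),  mu ~ N(m0, v0),  s2 ~ InvGamma(a0, b0),
# with theta1 = s2 (full conditional known at theta*) and theta2 = mu
# (marginal posterior ordinate estimated by Rao-Blackwellization).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(1.0, 1.5, size=30)
n, m0, v0, a0, b0 = len(y), 0.0, 10.0, 2.0, 2.0

def mu_given(s2):                 # mu | y, s2 ~ Normal(mean, sd)
    v = 1.0 / (n / s2 + 1.0 / v0)
    return v * (y.sum() / s2 + m0 / v0), np.sqrt(v)

def s2_given(mu):                 # s2 | y, mu ~ InvGamma(shape, scale)
    return a0 + n / 2.0, b0 + 0.5 * np.sum((y - mu) ** 2)

G, mu, s2 = 5000, y.mean(), y.var()
mu_draws, s2_draws = np.empty(G), np.empty(G)
for g in range(G):                # Gibbs run (burn-in omitted for brevity)
    m, s = mu_given(s2)
    mu = rng.normal(m, s)
    a, b = s2_given(mu)
    s2 = stats.invgamma.rvs(a, scale=b, random_state=rng)
    mu_draws[g], s2_draws[g] = mu, s2

mu_s, s2_s = mu_draws.mean(), s2_draws.mean()        # evaluation point theta*
loglik = stats.norm.logpdf(y, mu_s, np.sqrt(s2_s)).sum()
logprior = (stats.norm.logpdf(mu_s, m0, np.sqrt(v0))
            + stats.invgamma.logpdf(s2_s, a0, scale=b0))
# pi(mu* | y) by Rao-Blackwellization over the s2 draws
m_all, s_all = mu_given(s2_draws)
log_post_mu = np.logaddexp.reduce(stats.norm.logpdf(mu_s, m_all, s_all)) - np.log(G)
# pi(s2* | y, mu*) known in closed form
a_s, b_s = s2_given(mu_s)
log_post_s2 = stats.invgamma.logpdf(s2_s, a_s, scale=b_s)
print("Chib log m(y) =", loglik + logprior - log_post_mu - log_post_s2)
```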
Bridge sampling




5) Bridge sampling (Meng & Wong, 1996)
Starting identity, valid for any bridge function α(θ) and calibration density g(θ):

∫ α(θ) g(θ) [ f(y|θ) π(θ) / m(y) ] dθ / ∫ α(θ) g(θ) π(θ|y) dθ = 1

since f(y|θ) π(θ) / m(y) = π(θ|y).
Bridge sampling/cont.
5) Bridge sampling (Meng & Wong, 1996)

m(y) = ∫ α(θ) f(y|θ) π(θ) g(θ) dθ / ∫ α(θ) g(θ) π(θ|y) dθ
     = E_{g(θ)}[ α(θ) f(y|θ) π(θ) ] / E_{π(θ|y)}[ α(θ) g(θ) ]

α(θ): "bridge function"; g(θ): density to be calibrated.

For α(θ) = 1 / g(θ):
m̂_BS1(y) = L^{-1} Σ_{l=1}^L f(y|θ^(l)) π(θ^(l)) / g(θ^(l))   (importance sampling)

For α(θ) = 1 / [ f(y|θ) π(θ) ]: m̂_BS2(y) = Gelfand-Dey (1994) estimator.

For α(θ) = 1 / [ f(y|θ) π(θ) g(θ) ]^{1/2}: m̂_BS3(y) = Lopes-West (2004) estimator:
m̂_BS3(y) = { L^{-1} Σ_{l=1}^L [ f(y|θ^(l)) π(θ^(l)) / g(θ^(l)) ]^{1/2} }
          / { M^{-1} Σ_{m=1}^M [ g(θ^(m)) / ( f(y|θ^(m)) π(θ^(m)) ) ]^{1/2} }

θ^(l): draws from g(θ); θ^(m): draws from π(θ|y)
Bridge sampling (cont.)
5) Bridge sampling (Meng & Wong, 1996)

m(y) = ∫ α(θ) f(y|θ) π(θ) g(θ) dθ / ∫ α(θ) g(θ) π(θ|y) dθ
     = E_{g(θ)}[ α(θ) f(y|θ) π(θ) ] / E_{π(θ|y)}[ α(θ) g(θ) ]

For α(θ) = 1 / [ f(y|θ) π(θ) g(θ) ]:
m̂_BS4(y) = { L^{-1} Σ_{l=1}^L 1 / g(θ^(l)) } / { M^{-1} Σ_{m=1}^M 1 / [ f(y|θ^(m)) π(θ^(m)) ] }
(Lopes & West, 2004; Ando, 2010)
θ^(l): draws from g(θ); θ^(m): draws from π(θ|y). Odd (cf. the numerator, which averages 1/g over the g draws).

For α(θ) ∝ [ s_M π(θ|y) + s_L g(θ) ]^{-1}: optimal estimator w.r.t. expected relative mean squared error
(Meng & Wong, 1996; Lopes & West, 2004; Frühwirth-Schnatter, 2004), computed iteratively:

m̂_BS5^(t+1)(y) = m̂_BS5^(t)(y) × { L^{-1} Σ_{l=1}^L π̂_t(θ^(l)|y) / [ s_M π̂_t(θ^(l)|y) + s_L g(θ^(l)) ] }
                              / { M^{-1} Σ_{m=1}^M g(θ^(m)) / [ s_M π̂_t(θ^(m)|y) + s_L g(θ^(m)) ] }

where π̂_t(θ|y) = f(y|θ) π(θ) / m̂_BS5^(t)(y), m̂_BS5^(0) = m̂_BS1 or m̂_BS2, and s_M = 1 − s_L = M/(M+L);
θ^(l): draws from g(θ); θ^(m): draws from π(θ|y).
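A minimal sketch (again on the assumed toy Gaussian model used earlier) of the iterative optimal-bridge estimator, started from the importance-sampling estimate m̂_BS1; the choice of g and the numbers of draws are illustrative assumptions.

```python
# Minimal sketch (assumed toy model): Meng & Wong's iterated optimal bridge
# estimator, with g(.) a (deliberately wider) normal approximation of the
# posterior and the importance-sampling estimate m_BS1 as starting value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, mu0, tau2 = 20, 0.0, 2.0
y = rng.normal(1.0, 1.0, size=N)
logq = lambda th: (stats.norm.logpdf(y[:, None], th, 1.0).sum(axis=0)
                   + stats.norm.logpdf(th, mu0, np.sqrt(tau2)))   # log f(y|th) + log pi(th)

# M posterior draws (conjugate here, MCMC in general) and L draws from g
post_var = 1.0 / (N + 1.0 / tau2)
post_mean = post_var * (y.sum() + mu0 / tau2)
M = L = 20_000
th_m = rng.normal(post_mean, np.sqrt(post_var), size=M)
g = stats.norm(post_mean, 1.5 * np.sqrt(post_var))
th_l = g.rvs(size=L, random_state=rng)

logq_l, logq_m = logq(th_l), logq(th_m)
logg_l, logg_m = g.logpdf(th_l), g.logpdf(th_m)
sM, sL = M / (M + L), L / (M + L)

log_m = np.logaddexp.reduce(logq_l - logg_l) - np.log(L)   # m_BS1 (importance sampling)
for _ in range(100):
    p_l = np.exp(logq_l - log_m)          # current pi_t(theta|y) at the g draws
    p_m = np.exp(logq_m - log_m)          # current pi_t(theta|y) at the posterior draws
    num = np.mean(p_l / (sM * p_l + sL * np.exp(logg_l)))
    den = np.mean(np.exp(logg_m) / (sM * p_m + sL * np.exp(logg_m)))
    log_m = log_m + np.log(num) - np.log(den)

cov = np.eye(N) + tau2 * np.ones((N, N))
print("bridge-sampling log m(y) =", log_m)
print("exact           log m(y) =",
      stats.multivariate_normal.logpdf(y, np.full(N, mu0), cov))
```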
Nested sampling

6) Nested sampling (Skilling, 2006; Murray et al., 2006; Chopin & Robert, 2010)

m(y) = ∫ f(y|θ) π(θ) dθ = E_π[ L(θ) ],  with Z ≡ m(y) and L(θ) ≡ f(y|θ)

Let x = φ^{-1}(l) = Pr[ L(θ) > l ] be the survival function of the r.v. L(θ),
where l = φ(x) is the (upper-tail) quantile function of L(θ), so that x ~ U(0, 1).

Then Z = ∫_0^1 φ(x) dx (area under the curve l = φ(x)) and Ẑ = Σ_{i=1}^m Δx_i l_i,
with Δx_i = x_{i−1} − x_i, or Δx_i = ½ (x_{i−1} − x_{i+1}) for trapezoidal integration.
Nested sampling/Cont.
1) Draw N points θ_{1,i} from the prior, set θ_1 = Argmin_{i=1,..,N} L(θ_{1,i}) and l_1 = L(θ_1).
2) Obtain N points θ_{2,i} by keeping the θ_{1,i}, except that θ_1 is replaced by a draw
from the prior constrained by L(θ) > l_1; record θ_2 = Argmin_{i=1,..,N} L(θ_{2,i}) and set l_2 = L(θ_2).
3) Repeat step 2 until a stopping rule is met (change in the maximum of L ≤ ε).
Since x_i = φ^{-1}(l_i) is unknown, set either
a) deterministic x_i = exp(−i/N), so that ln x_i = E[ ln φ^{-1}(l_i) ], or
b) random x_{i+1} = t_i x_i with x_0 = 1, t_i ~ Be(N, 1).
Main difficulty: sampling θ from the prior constrained by L(θ) > l.
See Chopin & Robert (2010) for an extended importance sampling scheme:
Ẑ = Σ_{i=1}^m Δx_i φ̃_i w_i with π(θ) L(θ) = π̃(θ) L̃(θ) w(θ).
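A minimal nested-sampling sketch on the same assumed toy model; the constrained-prior move below (a short Metropolis walk within the prior, seeded at a surviving live point, with step size tied to the spread of the live points) is a common practical device and an illustrative assumption, not part of the slides.

```python
# Minimal sketch (assumed toy model): nested sampling with N live points.
# Naive rejection from the prior becomes far too slow as the likelihood
# constraint tightens, hence the short constrained Metropolis walk.
import numpy as np

rng = np.random.default_rng(1)
Nd, mu0, tau2 = 20, 0.0, 2.0
y = rng.normal(1.0, 1.0, size=Nd)

def loglik(th):
    return -0.5 * Nd * np.log(2 * np.pi) - 0.5 * np.sum((y - th) ** 2)

def logprior(th):
    return -0.5 * np.log(2 * np.pi * tau2) - 0.5 * (th - mu0) ** 2 / tau2

def constrained_prior_draw(start, lmin, step_sd, steps=25):
    """Metropolis walk targeting the prior, restricted to L(theta) > lmin."""
    th = start
    for _ in range(steps):
        prop = th + step_sd * rng.normal()
        if loglik(prop) > lmin and np.log(rng.random()) < logprior(prop) - logprior(th):
            th = prop
    return th, loglik(th)

Nlive = 200
live = rng.normal(mu0, np.sqrt(tau2), size=Nlive)
live_ll = np.array([loglik(t) for t in live])

logZ, x_prev = -np.inf, 1.0
for i in range(1, 20 * Nlive):
    k = np.argmin(live_ll)                        # worst live point, l_i = L(theta_k)
    x_i = np.exp(-i / Nlive)                      # deterministic shrinkage of prior mass
    logZ = np.logaddexp(logZ, live_ll[k] + np.log(x_prev - x_i))
    if live_ll.max() + np.log(x_i) < logZ - 8.0:  # crude stopping rule
        break
    seed = live[rng.choice(np.flatnonzero(live_ll > live_ll[k]))]
    live[k], live_ll[k] = constrained_prior_draw(seed, live_ll[k],
                                                 step_sd=2.0 * live.std() + 1e-12)
    x_prev = x_i
# contribution of the remaining live points
logZ = np.logaddexp(logZ, np.logaddexp.reduce(live_ll) - np.log(Nlive) + np.log(x_i))
print("nested-sampling log m(y) =", logZ)   # close to log_m_exact from the first sketch
```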
Power Posteriors/basic principle

Method due to Friel & Pettitt (2008); Lartillot & Philippe (2006) "annealing-melting".

The power posterior is defined as π(θ | y, t) = f(y|θ)^t π(θ) / z_t(y),
where z_t(y) = ∫ f(y|θ)^t π(θ) dθ

and t ∈ [0, 1], with t^{-1} equivalent to a "physical temperature":
t = 0 to 1: cooling down or "annealing"; t = 1 to 0: "melting".
Notice the path sampling scheme (Gelman & Meng, 1998):
π(θ | y, 0) = π(θ) with z_0(y) = 1
π(θ | y, 1) = π(θ | y) with z_1(y) = m(y)
PP/key result
log m(y) = ∫_0^1 E_{θ|y,t}[ log f(y|θ) ] dt

where θ | y, t has density π(θ | y, t) = f(y|θ)^t π(θ) / z_t(y).

Thermodynamic integration (end of the 70's): Ripley (1988), Ogata (1989), Neal (1993).
"Path sampling" (Gelman & Meng, 1998).
PP formula/proof as a special case of path sampling


If p(θ|t) = q(θ|t) / z(t), where z(t) = ∫ q(θ|t) dθ,
label U(θ, t) = (d/dt) ln q(θ|t) as the potential.

One has ln[ z(1)/z(0) ] = ∫_0^1 E_{θ|t}[ U(θ, t) ] dt,
since (d/dt) ln z(t) = z(t)^{-1} ∫ (dq/dt) dθ = ∫ [ (d/dt) ln q(θ|t) ] p(θ|t) dθ = E_{θ|t}[ U(θ, t) ].

Here p(θ|t) = π(θ | y, t) and q(θ|t) = f(y|θ)^t π(θ); then U(θ, t) = ln f(y|θ).
PP/Example
y_i | θ ~ iid N(θ, 1), i = 1, .., N;  θ ~ N(µ, τ²)
Then θ | y, t ~ N(µ_t, τ_t²) with
µ_t = (N t ȳ + µ τ^{-2}) / (N t + τ^{-2});  τ_t² = 1 / (N t + τ^{-2})

D_t(θ) ≡ −2 E_{θ|y,t}[ log f(y|θ) ]
       = N [ log 2π + s² + (µ − ȳ)² / (N τ² t + 1)² + τ² / (N τ² t + 1) ]

ȳ = N^{-1} Σ_{i=1}^N y_i;  s² = N^{-1} Σ_{i=1}^N (y_i − ȳ)²

D_0(θ) = N [ Cte + (µ − ȳ)² ] + N τ²
High sensitivity to τ²: as τ² → ∞, D_0(θ) → ∞.
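A small numerical check (not in the slides) of the identity on this conjugate example: the closed-form expectation above is integrated over t and compared with the exact log m(y); the data and seed are the same illustrative ones used in the earlier sketches.

```python
# Minimal check (assumed toy data): power-posterior integral vs exact log m(y)
# for y_i | theta ~ N(theta, 1), theta ~ N(mu, tau2).
import numpy as np

rng = np.random.default_rng(1)
N, mu, tau2 = 20, 0.0, 2.0
y = rng.normal(1.0, 1.0, size=N)
ybar, s2 = y.mean(), y.var()

def E_logf(t):            # closed-form E_{theta|y,t}[log f(y|theta)] = -D_t/2
    c = N * tau2 * t + 1.0
    return -0.5 * N * (np.log(2 * np.pi) + s2 + (mu - ybar) ** 2 / c ** 2 + tau2 / c)

t = np.linspace(0.0, 1.0, 2001)
vals = E_logf(t)
log_m_pp = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(t))     # trapezoidal rule
log_m_exact = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N * tau2 + 1)
               - 0.5 * N * s2 - 0.5 * N * (ybar - mu) ** 2 / (N * tau2 + 1))
print(log_m_pp, log_m_exact)   # agree up to the quadrature error
```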
PP/Example/cont.
(figure-only slide in the original deck; no text content to transcribe)
KL distance Prior-Posterior
KL(π(θ|y), π(θ)) = ∫ ln[ π(θ|y) / π(θ) ] π(θ|y) dθ
KL = ∫ ln[ f(y|θ) π(θ) / ( m(y) π(θ) ) ] π(θ|y) dθ
KL = E_{θ|y}[ ln f(y|θ) ] − ln m(y)
−2 KL = D̄ − D_m (a by-product of PP) ⇒ D_m = D̄ + 2 KL
DIC = D̄ + p_D, where p_D = D̄ − D(θ̄) measures model complexity
PP/partial BF

1) If π(θ) is improper, the marginal m(y) is also improper, resulting in problems for defining the BF.
2) High sensitivity of the BF to priors (it does not vanish with increasing sample size).

Idea behind the partial BF (Lempers, 1971): split y = (y_P, y_T)
- Learning or pilot sample y_P to tune the prior
- Testing sample y_T for the data analysis
Intrinsic BF (Berger & Pericchi, 1996)
Fractional BF (O'Hagan, 1995)
Fractional BF


A fraction b of the likelihood is used to tune the prior:
f(y_P | θ) ≈ f(y|θ)^b,  b = m/N < 1 (O'Hagan, 1995)
resulting in:
π(θ, b) ∝ f(y|θ)^b π(θ)
PP & fractional BF
π(θ, b) ∝ f(y|θ)^b π(θ)
m^F(y, b) = ∫ f(y|θ)^{1−b} π(θ, b) dθ
m^F(y, b) = ∫ f(y|θ) π(θ) dθ / ∫ f(y|θ)^b π(θ) dθ = m(y, 1) / m(y, b)

PP directly provides:
- π(θ, b), via π(θ | y, t = b)
- log m^F(y, b) = ∫_b^1 E_{θ|y,t}[ log f(y|θ) ] dt
PP/algorithm
MCMC with a discretization of t on [0, 1]:
t_0 = 0 < t_1 < ... < t_i < ... < t_{n−1} < t_n = 1
t_i = (i/n)^c with i = 1, .., n; n = 20-100; c = 2-5

1) Make MCMC draws θ^(g,i) from π(θ | y, t_i).
2) Compute Ê_{θ|y,t=t_i}[ log p(y|θ) ] = (1/G) Σ_{g=1}^G log p(y | θ^(g,i)).
   Often conditional independence holds, log p(y|θ) = Σ_{i=1}^N log p(y_i | θ),
   e.g. if θ is the closest stochastic parent of y = (y_i) (as for DIC).
3) Approximate the integral, e.g. by the trapezoidal rule:
   log m̂(y) = ½ Σ_{i=0}^{n−1} (t_{i+1} − t_i)(E_{i+1} + E_i)

Error due to this numerical approximation: Calderhead & Girolami (2009).
Formula for the MC sampling error: see Friel & Pettitt (2008).
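A minimal end-to-end sketch of this algorithm (not the author's OpenBUGS code): a ladder t_i = (i/n)^c, a random-walk Metropolis sampler for each power posterior, and the trapezoidal rule, illustrated on the same assumed toy Gaussian model; the step size, ladder settings and chain lengths are illustrative.

```python
# Minimal sketch (assumed toy model) of the power-posterior algorithm.
import numpy as np

rng = np.random.default_rng(1)
N, mu0, tau2 = 20, 0.0, 2.0
y = rng.normal(1.0, 1.0, size=N)

def loglik(th):
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y - th) ** 2)

def logprior(th):
    return -0.5 * np.log(2 * np.pi * tau2) - 0.5 * (th - mu0) ** 2 / tau2

def mcmc_mean_loglik(t, G=4000, burn=500, step=0.8):
    """Random-walk Metropolis for pi(theta | y, t); returns the mean of log f(y|theta).
    (A fixed step size across temperatures is crude but enough for this 1-D example.)"""
    th = mu0
    lp = t * loglik(th) + logprior(th)
    acc = []
    for g in range(G + burn):
        prop = th + step * rng.normal()
        lp_prop = t * loglik(prop) + logprior(prop)
        if np.log(rng.random()) < lp_prop - lp:
            th, lp = prop, lp_prop
        if g >= burn:
            acc.append(loglik(th))
    return np.mean(acc)

n, c = 30, 4                                   # ladder t_i = (i/n)^c, within the 20-100 / 2-5 ranges
ts = (np.arange(n + 1) / n) ** c
Es = np.array([mcmc_mean_loglik(t) for t in ts])
log_m_hat = np.sum(0.5 * (Es[1:] + Es[:-1]) * np.diff(ts))   # trapezoidal rule
log_m_exact = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N * tau2 + 1)
               - 0.5 * N * y.var() - 0.5 * N * (y.mean() - mu0) ** 2 / (N * tau2 + 1))
print("power-posterior log m(y) =", log_m_hat)
print("exact           log m(y) =", log_m_exact)
# by-product: log m^F(y, b) is obtained by restricting the same sum to t >= b
```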
PP/Little toy example
0) y_i | λ_i ~ id P(λ_i x_i) ⇔ f(y_i | λ_i) = (λ_i x_i)^{y_i} exp(−λ_i x_i) / y_i!
1) λ_i ~ id G(α, β) ⇔ π(λ_i) = β^α λ_i^{α−1} exp(−β λ_i) / Γ(α)
0+1) y_i ~ id NB(α, p_i), where p_i = β / (β + x_i)

Direct approach: f(y_i) = [ Γ(y_i + α) / ( Γ(α) y_i! ) ] p_i^α (1 − p_i)^{y_i}
ln f(y) = −n ln Γ(α) + Σ_{i=1}^n ln Γ(y_i + α) − Σ_{i=1}^n ln(y_i!) + α Σ_{i=1}^n ln p_i + Σ_{i=1}^n y_i ln(1 − p_i)

Indirect approach: f(y) = Π_{i=1}^n ∫ f(y_i | λ_i) π(λ_i) dλ_i
PP/Little toy example/cont.


Ex / Pump data: Ex #2 in WinBUGS, Carlin & Louis (p. 126)
y = number of failures of pumps over x (10³ hrs)
y = (5, 1, 5, 14, 3, 19, 1, 1, 4, 22); n = 10; α = β = 1
x = (94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5)
D = −2 ln f(y) = 66.03 (exact);  D̂_FP = 66.28 ± 0.03 (power posteriors, 20 points)
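The direct (closed-form) value is easy to reproduce; this small sketch simply evaluates the negative-binomial expression of the previous slide for the pump data with α = β = 1.

```python
# Direct computation of the marginal deviance D = -2 ln f(y) for the pump data
# under the Poisson-Gamma model, using the closed form of the previous slide.
import numpy as np
from scipy.special import gammaln

y = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22], dtype=float)
x = np.array([94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
alpha = beta = 1.0

p = beta / (beta + x)
log_f = np.sum(gammaln(y + alpha) - gammaln(alpha) - gammaln(y + 1)
               + alpha * np.log(p) + y * np.log(1 - p))
print("D = -2 ln f(y) =", -2 * log_f)   # should print about 66.03, as on the slide
```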
PP/Toy example in Openbugs (cont.)
(four figure-only slides in the original deck, showing the OpenBUGS implementation and output for the pump example; no text content to transcribe)
Sampling both θ & t
log m(y) = ∫_0^1 E_{θ|y,t}[ log f(y|θ) ] dt
log m(y) = ∫_0^1 ∫ [ log f(y|θ) / p(t) ] π(θ | y, t) p(t) dθ dt,  with π(θ | y, t) p(t) = π(θ, t | y)

log m(y) = E_{θ,t|y}[ log f(y|θ) / p(t) ]

π(θ | y, t) ∝ f(y|θ)^t π(θ)
If we assume p(t) ∝ z_t(y), then π(t | θ, y) ∝ f(y|θ)^t.
Sampling (θ, t) jointly under such conditions gives poor estimation
(too few draws of t close to 0).
Example 1/ Potthoff & Roy's data
Growth measurements in 11 girls and 16 boys: Potthoff & Roy (1964); Little & Rubin (1987)

Girls, by age (years)           Boys, by age (years)
Girl   8    10    12    14      Boy    8    10    12    14
  1   210   200   215   230      1    260   250   290   310
  2   210   215   240   255      2    215    .    230   265
  3   205    .    245   260      3    230   225   240   275
  4   235   245   250   265      4    255   275   265   270
  5   215   230   225   235      5    200    .    225   260
  6   200    .    210   225      6    245   255   270   285
  7   215   225   230   250      7    220   220   245   265
  8   230   230   235   240      8    240   215   245   255
  9   200    .    220   215      9    230   205   310   260
 10   165    .    190   195     10    275   280   310   315
 11   245   250   280   280     11    230   230   235   250
                                 12    215    .    240   280
                                 13    170    .    260   295
                                 14    225   255   255   260
                                 15    230   245   260   300
                                 16    220    .    235   250

Response: distance from the centre of the pituitary to the pterygomaxillary fissure (unit: 10⁻⁴ m); "." denotes a missing age-10 measurement (Little & Rubin, 1987).
Model comparison on Potthoff's data
i: subscript for individual, i = 1, .., I = 27 (11 girls + 16 boys); x_i = sex indicator
j: subscript for measurement at age t_j (8, 10, 12, 14 yrs)

1) Purely fixed model
y_ij = (α_0 + α x_i) + (β_0 + β x_i)(t_j − 8) + e_ij   [intercept + slope]

2) Random intercept model
y_ij = (α_0 + α x_i + a_i) + (β_0 + β x_i)(t_j − 8) + e_ij

3) Random intercept & slope model assuming independent effects
y_ij = (α_0 + α x_i + a_i) + (β_0 + β x_i + b_i)(t_j − 8) + e_ij
or y_ij = φ_i1 + φ_i2 (t_j − 8) + e_ij,  y_ij ~ id N(η_ij, σ_e²),
with φ_i = (φ_i1, φ_i2)' ~ N( (α_0 + α x_i, β_0 + β x_i)', diag(σ_a², σ_b²) )

4) Random intercept & slope model assuming correlated effects
φ_i = (φ_i1, φ_i2)' ~ N( (α_0 + α x_i, β_0 + β x_i)', [σ_a², σ_ab; σ_ab, σ_b²] )
Model presentation:Hierarchical Bayes

1st level: y_ij ~ id N(η_ij, σ_e²) with η_ij = φ_i1 + φ_i2 (t_j − 8)

2nd level:
2a) φ_i = (φ_i1, φ_i2)' ~ N( (α_0 + α x_i, β_0 + β x_i)', Σ ),  Σ = [σ_a², σ_ab; σ_ab, σ_b²]
2b) σ_e ~ U(0, Δ_e), or σ_e² ~ InvG(1, σ_e0²) for a fixed prior scale σ_e0²

3rd level:
Fixed effects: α_0, α, β_0, β ~ U(inf, sup)
Variance (covariance) components:
- If σ_ab = 0: i) σ_a ~ U(0, Δ_a), and likewise σ_b ~ U(0, Δ_b);
  or ii) σ_a² ~ InvG(1, σ_a0²), and likewise σ_b² ~ InvG(1, σ_b0²)
- If σ_ab ≠ 0: i) σ_a ~ U(0, Δ_a), σ_b ~ U(0, Δ_b), ρ ~ U(−1, 1);
  or ii) Ω ~ W*( (νΣ)^{-1}, ν ) for Ω = Σ^{-1},
  with ν = dim(Ω) + 1 and Σ a known location parameter
* Take care, as WinBUGS uses another notation, i.e. W( (νΣ), ν ).
Results
(figure/table slide in the original deck; numerical results not in the transcript)
Results/fractional priors (b = 0 vs 0.125)
(figure/table slide in the original deck; numerical results not in the transcript)
Example 2: Models of genetic differentiation

Two-level hierarchical model
i = locus; j = (sub)population
y_ij = number of genes carrying a given allele at locus i in population j (among the n_ij sampled)
α_ij = frequency of that allele at locus i in population j

0) y_ij | α_ij ~ id B(n_ij, α_ij)
1) α_ij | π_i, c_j ~ id Beta( τ_j π_i, τ_j (1 − π_i) ),  τ_j = (1 − c_j)/c_j, where c_j is a differentiation index
π_i = frequency of that allele at locus i in the gene pool
2) π_i ~ id Beta(a_π, b_π);  c_j ~ id Beta(a_c, b_c)
Migration-drift equilibrium model (Balding)
Ex2: Nicholson’s model

Nicholson et al. (2002): same as previously, but
1) α_ij | π_i, c_j ~ id N( π_i, c_j π_i (1 − π_i) )
Truncated normal with point masses at 0 and 1,
so that y_ij | α_ij ~ id B(n_ij, α*_ij), where α*_ij = max(0, min(1, α_ij))
2) π_i ~ id Beta(a_π, b_π);  c_j ~ id Beta(a_c, b_c)
Pure drift model
Results
(figure/table slide in the original deck; numerical results not in the transcript)
Conclusion
Derived from thermodynamic integration
Link with "path sampling"
Easy to understand and quite general
Well suited to complex hierarchical models
"Thetas" can be defined as the closest stochastic parents
of the data, making the latter conditionally independent
Draws only from posterior distributions
Gives the fractional BF as a by-product
Easy to implement (including in Openbugs) but time consuming
Caution needed in the discretization of t (close to 0)
Some references
         Chen M, Shao Q, Ibrahim J (2000) Monte Carlo methods in Bayesian
         computation. Springer
         Chib S (1995) Marginal likelihood from the Gibbs output. JASA 90,1313-1321
         Chopin N, Robert CP (2010) Properties of nested sampling. Biometrika, 97, 741-
         755
         Friel N, Pettitt AN (2008) Marginal likelihood estimation via power posteriors,
         JRSS, B, 70, 589-607
         Frühwirth-Schnatter S (2004) Estimating marginal likelihoods from mixtures &
         Markov switching models using bridge sampling techniques. Econometrics
         Journal, 7, 143-167
         Gelman A, Meng X-L (1998) Simulating normalizing constants: from
         importance sampling to bridge sampling and path sampling, Statistical Science,
         13, 163-185
         Lartillot N, Philippe H (2006) Computing Bayes factors using thermodynamic
         integration. Systematic Biology, 55, 195-207
         Marin JM, Robert CP (2009) Importance sampling methods for Bayesian
         discrimination between embedded models. arXiv:0910.2325v1
         Meng X-L, Wong WH (1996) Simulating ratios of normalizing constants via a
         simple identity: a theoretical exploration. Statistica Sinica,6,831-860
         O'Hagan A (1995) Fractional Bayes factors for model comparison. JRSS, B, 57,
         99-138
Acknowledgements
Nial Friel (University College Dublin) for his interest in these
applications and his invaluable explanations & suggestions
Tony O'Hagan for further insight into the FBF
Gilles Celeux and Mathieu Gautier as co-advisors of the
Master dissertation of Yoan Soussan (Paris VI)
Christian Robert for his blog and his relevant
comments, standpoints and bibliographical references
The Applibugs & Babayes groups for stimulating
discussions on DIC, BF, CPO & other information
criteria (AIC, BIC)

More Related Content

What's hot

Common and private ownership of exhaustible resources: theoretical implicat...
Common and private ownership  of exhaustible resources:  theoretical implicat...Common and private ownership  of exhaustible resources:  theoretical implicat...
Common and private ownership of exhaustible resources: theoretical implicat...alexandersurkov
 
Phase de-trending of modulated signals
Phase de-trending of modulated signalsPhase de-trending of modulated signals
Phase de-trending of modulated signalsNMDG NV
 
Notes on exact and approximate bayesian implementation
Notes on exact and approximate bayesian implementationNotes on exact and approximate bayesian implementation
Notes on exact and approximate bayesian implementationLeo Vivas
 
Fourier Transform
Fourier TransformFourier Transform
Fourier TransformAamir Saeed
 
fourier representation of signal and systems
fourier representation of signal and systemsfourier representation of signal and systems
fourier representation of signal and systemsSugeng Widodo
 
Tele4653 l1
Tele4653 l1Tele4653 l1
Tele4653 l1Vin Voro
 
P805 bourgeois
P805 bourgeoisP805 bourgeois
P805 bourgeoiskklub
 
Eece 301 note set 14 fourier transform
Eece 301 note set 14 fourier transformEece 301 note set 14 fourier transform
Eece 301 note set 14 fourier transformSandilya Sridhara
 
Tele3113 wk2wed
Tele3113 wk2wedTele3113 wk2wed
Tele3113 wk2wedVin Voro
 
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...Alessandro Palmeri
 
Adiabatic Theorem for Discrete Time Evolution
Adiabatic Theorem for Discrete Time EvolutionAdiabatic Theorem for Discrete Time Evolution
Adiabatic Theorem for Discrete Time Evolutiontanaka-atushi
 
Balanced homodyne detection
Balanced homodyne detectionBalanced homodyne detection
Balanced homodyne detectionwtyru1989
 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsGabriel Peyré
 
Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures?Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures? Alessandro Palmeri
 
TR tabling presentation_2010_09
TR tabling presentation_2010_09TR tabling presentation_2010_09
TR tabling presentation_2010_09Paul Fodor
 
Numerical Technique, Initial Conditions, Eos,
Numerical Technique, Initial Conditions, Eos,Numerical Technique, Initial Conditions, Eos,
Numerical Technique, Initial Conditions, Eos,Udo Ornik
 

What's hot (20)

Common and private ownership of exhaustible resources: theoretical implicat...
Common and private ownership  of exhaustible resources:  theoretical implicat...Common and private ownership  of exhaustible resources:  theoretical implicat...
Common and private ownership of exhaustible resources: theoretical implicat...
 
Properties of Fourier transform
Properties of Fourier transformProperties of Fourier transform
Properties of Fourier transform
 
Phase de-trending of modulated signals
Phase de-trending of modulated signalsPhase de-trending of modulated signals
Phase de-trending of modulated signals
 
Notes on exact and approximate bayesian implementation
Notes on exact and approximate bayesian implementationNotes on exact and approximate bayesian implementation
Notes on exact and approximate bayesian implementation
 
Fourier Transform
Fourier TransformFourier Transform
Fourier Transform
 
fourier representation of signal and systems
fourier representation of signal and systemsfourier representation of signal and systems
fourier representation of signal and systems
 
Tele4653 l1
Tele4653 l1Tele4653 l1
Tele4653 l1
 
P805 bourgeois
P805 bourgeoisP805 bourgeois
P805 bourgeois
 
Eece 301 note set 14 fourier transform
Eece 301 note set 14 fourier transformEece 301 note set 14 fourier transform
Eece 301 note set 14 fourier transform
 
Tele3113 wk2wed
Tele3113 wk2wedTele3113 wk2wed
Tele3113 wk2wed
 
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
 
Adiabatic Theorem for Discrete Time Evolution
Adiabatic Theorem for Discrete Time EvolutionAdiabatic Theorem for Discrete Time Evolution
Adiabatic Theorem for Discrete Time Evolution
 
Balanced homodyne detection
Balanced homodyne detectionBalanced homodyne detection
Balanced homodyne detection
 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse Problems
 
Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures?Using blurred images to assess damage in bridge structures?
Using blurred images to assess damage in bridge structures?
 
TR tabling presentation_2010_09
TR tabling presentation_2010_09TR tabling presentation_2010_09
TR tabling presentation_2010_09
 
Numerical Technique, Initial Conditions, Eos,
Numerical Technique, Initial Conditions, Eos,Numerical Technique, Initial Conditions, Eos,
Numerical Technique, Initial Conditions, Eos,
 
Fourier transform
Fourier transformFourier transform
Fourier transform
 
Cheat Sheet
Cheat SheetCheat Sheet
Cheat Sheet
 
Ps02 cmth03 unit 1
Ps02 cmth03 unit 1Ps02 cmth03 unit 1
Ps02 cmth03 unit 1
 

Similar to Computation of the marginal likelihood

Fractional Calculus
Fractional CalculusFractional Calculus
Fractional CalculusVRRITC
 
Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011BigMC
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
Doering Savov
Doering SavovDoering Savov
Doering Savovgh
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Case Study (All)
Case Study (All)Case Study (All)
Case Study (All)gudeyi
 
Presentation cm2011
Presentation cm2011Presentation cm2011
Presentation cm2011antigonon
 
Presentation cm2011
Presentation cm2011Presentation cm2011
Presentation cm2011antigonon
 
Cosmin Crucean: Perturbative QED on de Sitter Universe.
Cosmin Crucean: Perturbative QED on de Sitter Universe.Cosmin Crucean: Perturbative QED on de Sitter Universe.
Cosmin Crucean: Perturbative QED on de Sitter Universe.SEENET-MTP
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
2003 Ames.Models
2003 Ames.Models2003 Ames.Models
2003 Ames.Modelspinchung
 
Slides2 130201091056-phpapp01
Slides2 130201091056-phpapp01Slides2 130201091056-phpapp01
Slides2 130201091056-phpapp01Deb Roy
 
Bayesian case studies, practical 2
Bayesian case studies, practical 2Bayesian case studies, practical 2
Bayesian case studies, practical 2Robin Ryder
 

Similar to Computation of the marginal likelihood (20)

rinko2010
rinko2010rinko2010
rinko2010
 
Fractional Calculus
Fractional CalculusFractional Calculus
Fractional Calculus
 
Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011Olivier Cappé's talk at BigMC March 2011
Olivier Cappé's talk at BigMC March 2011
 
PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...
PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...
PMED Transition Workshop - A Bayesian Model for Joint Longitudinal and Surviv...
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
Doering Savov
Doering SavovDoering Savov
Doering Savov
 
Lecture_9.pdf
Lecture_9.pdfLecture_9.pdf
Lecture_9.pdf
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Holographic Cotton Tensor
Holographic Cotton TensorHolographic Cotton Tensor
Holographic Cotton Tensor
 
Case Study (All)
Case Study (All)Case Study (All)
Case Study (All)
 
Presentation cm2011
Presentation cm2011Presentation cm2011
Presentation cm2011
 
Presentation cm2011
Presentation cm2011Presentation cm2011
Presentation cm2011
 
Cosmin Crucean: Perturbative QED on de Sitter Universe.
Cosmin Crucean: Perturbative QED on de Sitter Universe.Cosmin Crucean: Perturbative QED on de Sitter Universe.
Cosmin Crucean: Perturbative QED on de Sitter Universe.
 
Signal Processing Homework Help
Signal Processing Homework HelpSignal Processing Homework Help
Signal Processing Homework Help
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
2003 Ames.Models
2003 Ames.Models2003 Ames.Models
2003 Ames.Models
 
Slides2 130201091056-phpapp01
Slides2 130201091056-phpapp01Slides2 130201091056-phpapp01
Slides2 130201091056-phpapp01
 
Bayesian case studies, practical 2
Bayesian case studies, practical 2Bayesian case studies, practical 2
Bayesian case studies, practical 2
 

More from BigMC

Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...BigMC
 
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...BigMC
 
Stability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithmsStability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithmsBigMC
 
"Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go""Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go"BigMC
 
Hedibert Lopes' talk at BigMC
Hedibert Lopes' talk at  BigMCHedibert Lopes' talk at  BigMC
Hedibert Lopes' talk at BigMCBigMC
 
Andreas Eberle
Andreas EberleAndreas Eberle
Andreas EberleBigMC
 
Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011BigMC
 
Estimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienneEstimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienneBigMC
 
Comparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering modelsComparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering modelsBigMC
 
Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)BigMC
 
Omiros' talk on the Bernoulli factory problem
Omiros' talk on the  Bernoulli factory problemOmiros' talk on the  Bernoulli factory problem
Omiros' talk on the Bernoulli factory problemBigMC
 

More from BigMC (11)

Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
 
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
 
Stability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithmsStability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithms
 
"Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go""Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go"
 
Hedibert Lopes' talk at BigMC
Hedibert Lopes' talk at  BigMCHedibert Lopes' talk at  BigMC
Hedibert Lopes' talk at BigMC
 
Andreas Eberle
Andreas EberleAndreas Eberle
Andreas Eberle
 
Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011
 
Estimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienneEstimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienne
 
Comparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering modelsComparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering models
 
Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)
 
Omiros' talk on the Bernoulli factory problem
Omiros' talk on the  Bernoulli factory problemOmiros' talk on the  Bernoulli factory problem
Omiros' talk on the Bernoulli factory problem
 

Computation of the marginal likelihood

  • 1. Computation of the marginal likelihood: brief summary and method of power posteriors Jean-Louis Foulley jean-louis.foulley@jouy.inra.fr 06/01/2011 JLF/BigMC 1
  • 2. Outline Objectives Brief summary of current methods Monte Carlo direct Harmonic mean Generalized harmonic mean Chib Bridge sampling Nested sampling Power Posteriors Relationship with fractional BF Algorithm Examples Conclusion 06/01/2011 JLF/BigMC 2
  • 3. Objectives Marginal likelihood ("Prior Predictive", "Evidence") m ( y ) = ∫ f ( y | θ )π ( θ ) dθ Θ -Normalization constant of π * ( θ | y ) π * (θ) π (θ | y ) = where π * ( θ | y ) = f ( y | θ ) π ( θ ) m(y) -Component of the Bayes factor π ( M1 | y ) / π ( M 2 | y ) m1 ( y ) BF12 = = π ( M1 ) / π ( M 2 ) m2 ( y ) ∆Dm,12 = −2 ln BF12 = Dm,1 − Dm,2 Dm , j = −2 ln m j ( y ) : Marginal deviance Calibration: Jeffreys & Turing (Deciban: 10log10 BF) 06/01/2011 JLF/BigMC 3
  • 4. Methods/Monte Carlo, Harmonic Mean 1 G 1) Direct Monte Carlo mMC ( y ) = ˆ ∑ G g =1 f y | θ( ) g ( ) θ( ) ,..., θ( ) : draws from π ( θ ) 1 g Converges (a. s) to m ( y ) but very inefficient Many samples outside regions ofhigh likelihood 2)Harmonic mean (Newton & Raftery, 1994) −1   1 G 1 mNR ( y ) =  ∑ g =1  θ( ) ,..., θ( ) : draws from π ( θ | y ) 1 g ˆ G  ( f y | θ( ) g )   A special case of WIS: ∑ j =1 f y | θ( J ( j) )w (θ( ) ) / ∑ j J j =1 ( )) w θ( j where w θ(( ) ) ∝ π ( θ ) / g ( θ ) for g ( θ ) ∝ f ( y | θ ) π ( θ ) j Converges (a.s) but very instable (infinite variance): to be absolutely avoided "Worst Monte Carlo Method Ever" Radford Neal (2010) Harmonic mean not really affected by change in prior while true marginal highly sensitive to prior 06/01/2011 JLF/BigMC 4
  • 5. Methods/Gelfand&Dey & Chib 3) Generalized harmonic mean (Gelfand & Dey, 1994; Chen & Shao, 1997) −1  1 G mGD ( y ) =  ∑ g =1 ˆ ( ) g θ( ) g   G  ( ) ( ) f y | θ( ) π θ( ) g g   θ( ) ,..., θ( ) : draws from π ( θ | y ) 1 g g (.) as an approx of the posterior: pbs in large dimension 4)Chib's methods (1995) ln m ( y ) = ln f ( y | θ ) + ln π ( θ ) − ln π ( θ | y ) , ∀θ ln mSC ( y ) = ln f ( y | θ* ) + ln π ( θ* ) − ln π ( θ* | y ) ˆ ˆ π ( θ | y ) to be estimated & θ* = ML, MAP, E ( θ | y ) selected ˆ Simple & often effective 06/01/2011 JLF/BigMC 5
  • 6. Chib(Cont.) 4)Chib (1995) ln mSC ( y ) = ln f ( y | θ* ) + ln π ( θ* ) − ln π ( θ* | y ) ˆ ˆ a) Gibbs & RaoBlackwellization (Chib,1995) b) Metropolis-Hastings (Chib & Jeliazkov, 2001) c) Kernel estimator (Chen, 1994) 06/01/2011 JLF/BigMC 6
  • 7. Chib via Gibbs If θ = ( θ1 , θ2 ) π ( θ1 , θ 2 | y ) = π ( θ1 | y, θ 2 ) π ( θ 2 | y ) known estimated π ( θ 2 | y ) = ∫ π ( θ 2 | y , θ1 ) π ( θ1 | y ) dθ1 known MCMC draws "Estimation by Rao-Blackwellization" 1 G ˆ * 2 G * ( π ( θ | y ) = ∑ g =1 π θ 2 | y , θ1 (g) ) (g) θ1 : draws from π ( θ1 | y ) 06/01/2011 JLF/BigMC 7
  • 8. Bridge sampling 5)Bridge sampling (Meng & Wong, 1996) f ( y | θ)π (θ) ∫ α ( θ ) g ( θ ) m ( y ) dθ =1 ∫ α ( θ ) g ( θ ) π ( θ | y ) dθ 06/01/2011 JLF/BigMC 8
  • 9. Bridge sampling/cont. 5)Bridge sampling (Meng & Wong, 1996) ∫ α ( θ ) f ( y | θ ) π ( θ ) g ( θ ) dθ g (θ) m(y) = = E (α ( θ ) f ( y | θ ) π ( θ ) ) ∫ α ( θ ) g ( θ ) π ( θ | y ) dθ E ( ) (α ( θ ) g ( θ ) ) π θ|y α ( θ ) "bridge function" g ( θ ) = density to be calibated For α ( θ ) = 1/ g ( θ ) −1 ˆ −1  ( ) ( ) ( ) mBS 1 ( y ) = L ∑ l =1  f y | θ( ) π θ( ) / g θ( )  ( IS ) L l l l  For α ( θ ) = 1/ f ( y | θ ) π ( θ ) mBS 2 ( y ) = Gelfand-Dey (1994) ˆ 1/ 2 For α ( θ ) = 1/  f ( y | θ ) π ( θ ) g ( θ )    mBS 3 ( y ) = Lopes-West (2004) ˆ 1/ 2 mBS 3 ( y ) = ˆ −1  L l ( ) ( ) ( ) L ∑ l =1  f y | θ( ) π θ( ) / g θ( )  l l  1/ 2 M ∑ m =1 −1 M  ( ) ( ) ( )  g θ( m ) / f y | θ ( m ) π θ( m )   θ( ) : draws from g ( θ ) ; θ( ) : draws from π ( θ | y ) l m 06/01/2011 JLF/BigMC 9
  • 10. Bridge sampling (cont.) 5)Bridge sampling (Meng & Wong, 1996) ∫ α ( θ ) f ( y | θ ) π ( θ ) g ( θ ) dθ = E ( ) (α ( θ ) f ( y | θ ) π ( θ ) ) g θ m (y) = ∫ α ( θ ) g ( θ ) π ( θ | y ) dθ E ( ) (α ( θ ) g ( θ ) ) π θ|y For α ( θ ) = 1/ f ( y | θ ) π ( θ ) g ( θ ) mBS 4 ( y ) = ˆ L−1 ∑ l =1 1/ g θ( )  L  l  ( ) (Lopes & West, 2004; Ando, 2010) 1/ 2 M ∑ m =1 1/ f y | θ π θ −1 M  ( m) ( ( m)   ) ( ) θ( ) : draws from g ( θ ) ; θ( ) : draws from π ( θ | y ) Odd (cf numerator) l m draws -1 For α ( θ ) ∝  sM π ( θ | y ) +sL g ( θ )  , optimum estim. wrt E(RMSE)   (Meng & Wong, 1996; Lopes & West, 2004; Fruhwirth-Schnatter,2004) L−1 ∑ l =1 L ˆ ( π t θ( l ) | y ) mBS 5) ( y ) = mBS) 5 ˆ ( t +1 ˆ (t ˆ ( ) sM π t θ ( ) | y + s L g θ ( ) l ( ) l ( )) g θ( m ∑ −1 M M m =1 sM π ( θ( ) | y ) + s g ( θ( ) ) ˆt m L m where π t ( θ | y ) = f ( y | θ ) π ( θ ) / mBS) 5 and mBS)5 = mBS 1 ou mBS 2 ˆ ˆ (t ˆ (0 ˆ ˆ sM = 1 − sL = M /( M + L) 06/01/2011 JLF/BigMC 10
  • 11. Nested sampling 6)Nested sampling (Skilling, 2006; Murray et al, 2006; Chopin & Robert, 2010) m ( y ) = ∫ f ( y | θ) π ( θ) dθ = Eπ  L ( θ)    Z L( θ) Let x = ϕ −1 ( l ) = Pr  L ( θ) > l  be the survival function of rv L ( θ)   where l = ϕ( x) (upper tail) quantile function of L ( θ) so that x ~ U (0,1) 1 ˆ = ∑m ∆ l Then Z = ∫ ϕ ( x)dx area under curve l =ϕ ( x )  and Z 0   i =1 xi i with ∆xi = xi−1 − xi or ∆xi = ½ ( xi−1 − xi+1 ) if trapezoidal integration 06/01/2011 JLF/BigMC 11
• 12. Nested sampling (cont.)
1) Draw N points $\theta_{1,i}$ from the prior; let $\theta_1 = \arg\min_{i=1,\dots,N} L(\theta_{1,i})$ and set $l_1 = L(\theta_1)$.
2) Obtain N points $\theta_{2,i}$ by keeping the $\theta_{1,i}$ except that $\theta_1$ is replaced by a draw from the prior constrained by $L(\theta) > l_1$; record $\theta_2 = \arg\min_{i=1,\dots,N} L(\theta_{2,i})$ and set $l_2 = L(\theta_2)$.
3) Repeat 1 & 2 until a stopping rule is met (e.g. change in the maximum of L $\le \varepsilon$).
Since $x_i = \varphi^{-1}(l_i)$ is unknown, set either
a) deterministically $x_i = \exp(-i/N)$, so that $\ln x_i = E\!\left[\ln\varphi^{-1}(l_i)\right]$, or
b) randomly $x_{i+1} = t_i\,x_i$ with $x_0 = 1$, $t_i \sim \mathrm{Be}(N,1)$.
Main difficulty: sampling $\theta$ from the prior constrained by $L(\theta) > l$.
See Chopin & Robert (2010) for an extended importance-sampling scheme: $\hat Z = \sum_{i=1}^{m}\Delta x_i\,\varphi_i\,w_i$ with $\pi(\theta)L(\theta) = \tilde\pi(\theta)\tilde L(\theta)\,w(\theta)$, where $\tilde\pi$ and $\tilde L$ denote the instrumental prior and likelihood actually used in the run.
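A minimal one-dimensional sketch of nested sampling (not from the slides) on the same conjugate normal toy model, using the deterministic shrinkage x_i = exp(-i/N) and plain rejection to sample the constrained prior; the number of live points, the number of steps and the crude termination term are illustrative assumptions, and rejection sampling of the constrained prior is only viable in such low-dimensional toys.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(4)
# Toy model: y_i ~ N(theta, 1), theta ~ N(0, tau2); exact evidence for checking
Ny, tau2 = 20, 4.0
y = rng.normal(1.0, 1.0, size=Ny)
log_Z_exact = multivariate_normal.logpdf(
    y, mean=np.zeros(Ny), cov=np.eye(Ny) + tau2 * np.ones((Ny, Ny)))

def log_L(th):                                 # log-likelihood of one theta value
    return norm.logpdf(y, th, 1.0).sum()

Npts, steps = 100, 600                         # live points, elimination steps
live = rng.normal(0.0, np.sqrt(tau2), size=Npts)           # draws from the prior
live_logL = np.array([log_L(t) for t in live])
log_Z = -np.inf
for i in range(1, steps + 1):
    worst = np.argmin(live_logL)
    # deterministic shrinkage x_i = exp(-i/Npts), so Delta x_i = x_{i-1} - x_i
    log_dx = -(i - 1) / Npts + np.log1p(-np.exp(-1.0 / Npts))
    log_Z = np.logaddexp(log_Z, live_logL[worst] + log_dx)
    lo = live_logL[worst]
    while True:                                # prior draw constrained to L(theta) > l_i
        cand = rng.normal(0.0, np.sqrt(tau2))
        cand_logL = log_L(cand)
        if cand_logL > lo:
            break
    live[worst], live_logL[worst] = cand, cand_logL
# crude termination: the remaining live points contribute about x_steps * mean(L)
log_Z = np.logaddexp(
    log_Z, np.logaddexp.reduce(live_logL) - np.log(Npts) - steps / Npts)
print("nested sampling:", log_Z, " exact:", log_Z_exact)
```

The result typically agrees with the exact value up to the usual nested-sampling error, of order sqrt(H/N) where H is the prior-to-posterior information; more live points tighten it.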
• 13. Power posteriors: basic principle
Method due to Friel & Pettitt (2008); see also Lartillot & Philippe (2006) ("annealing-melting").
The power posterior is defined as
$$\pi(\theta\mid y, t) = \frac{f(y\mid\theta)^{t}\,\pi(\theta)}{z_t(y)},\qquad z_t(y) = \int f(y\mid\theta)^{t}\,\pi(\theta)\,d\theta,\qquad t \in [0,1]$$
with $t^{-1}$ playing the role of a "physical temperature".
t from 0 to 1: cooling down or "annealing"; t from 1 to 0: "melting".
Notice the path-sampling scheme (Gelman & Meng, 1998):
$\pi(\theta\mid y, 0) = \pi(\theta)$ with $z_0(y) = 1$; $\pi(\theta\mid y, 1) = \pi(\theta\mid y)$ with $z_1(y) = m(y)$.
• 14. PP: key result
$$\log m(y) = \int_0^1 E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right] dt$$
where $\theta\mid y,t$ has density $\pi(\theta\mid y,t) = f(y\mid\theta)^{t}\,\pi(\theta)/z_t(y)$.
Thermodynamic integration (end of the 70's): Ripley (1988), Ogata (1989), Neal (1993).
"Path sampling" (Gelman & Meng, 1998).
• 15. PP formula: proof as a special case of path sampling
If $p(\theta\mid t) = q(\theta\mid t)/z(t)$ where $z(t) = \int q(\theta\mid t)\,d\theta$, label as the "potential"
$$U(\theta,t) = \frac{d}{dt}\ln q(\theta\mid t)$$
One has
$$\ln\frac{z(1)}{z(0)} = \int_0^1 E_{\theta\mid t}\!\left[U(\theta,t)\right] dt$$
Here $p(\theta\mid t) = \pi(\theta\mid y,t)$ and $q(\theta\mid t) = f(y\mid\theta)^{t}\,\pi(\theta)$, so that $U(\theta,t) = \ln f(y\mid\theta)$.
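For completeness (this step is added here, not on the slide), the identity follows in one line by differentiating $\ln z(t)$ under the integral sign:
$$\frac{d}{dt}\ln z(t)
= \frac{1}{z(t)}\int \frac{\partial q(\theta\mid t)}{\partial t}\,d\theta
= \int \frac{\partial \ln q(\theta\mid t)}{\partial t}\;\frac{q(\theta\mid t)}{z(t)}\,d\theta
= E_{\theta\mid t}\!\left[U(\theta,t)\right],$$
and integrating both sides over $t$ from 0 to 1 gives $\ln z(1) - \ln z(0) = \int_0^1 E_{\theta\mid t}[U(\theta,t)]\,dt$; with $q(\theta\mid t) = f(y\mid\theta)^{t}\pi(\theta)$ one has $z(0) = 1$, $z(1) = m(y)$ and $U(\theta,t) = \ln f(y\mid\theta)$, which is exactly the key result of the previous slide.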
• 16. PP: example
$y_i\mid\theta \sim \text{iid } N(\theta,1)$, $i=1,\dots,N$; $\theta \sim N(\mu,\tau^2)$.
Then $\theta\mid y, t \sim N(\mu_t, \tau_t^2)$ with
$$\mu_t = \frac{Nt\,\bar y + \mu\,\tau^{-2}}{Nt + \tau^{-2}},\qquad \tau_t^2 = \frac{1}{Nt + \tau^{-2}}$$
and therefore
$$D_t \equiv -2\,E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right] = N\left[\log 2\pi + s^2 + (\bar y - \mu_t)^2 + \tau_t^2\right]$$
where $\bar y = N^{-1}\sum_{i=1}^{N} y_i$ and $s^2 = N^{-1}\sum_{i=1}^{N}(y_i-\bar y)^2$.
At $t=0$: $D_0 = N\left[\text{Cte} + (\mu-\bar y)^2\right] + N\tau^2$, with Cte $= \log 2\pi + s^2$.
High sensitivity to $\tau^2$: as $\tau^2 \to \infty$, $D_0 \to \infty$.
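Since everything is conjugate in this example, the thermodynamic integral can be checked numerically without any MCMC. The sketch below (not from the slides; the simulated data, the t-grid and all names are illustrative) integrates the closed-form $E_{\theta\mid y,t}[\log f(y\mid\theta)]$ over t and compares it with the exact log m(y).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
# y_i ~ N(theta, 1), theta ~ N(mu, tau2): the power posterior is N(mu_t, tau2_t)
N, mu, tau2 = 20, 0.0, 4.0
y = rng.normal(1.0, 1.0, size=N)
ybar, s2 = y.mean(), y.var()

def E_logf(t):
    """Closed-form E_{theta | y, t}[ log f(y | theta) ] at temperature t."""
    tau2_t = 1.0 / (N * t + 1.0 / tau2)
    mu_t = tau2_t * (N * t * ybar + mu / tau2)
    return -0.5 * N * (np.log(2 * np.pi) + s2 + (ybar - mu_t) ** 2 + tau2_t)

t = np.linspace(0.0, 1.0, 20001)
E = E_logf(t)
log_m_pp = 0.5 * np.sum((t[1:] - t[:-1]) * (E[1:] + E[:-1]))    # trapezoidal rule
log_m_exact = multivariate_normal.logpdf(
    y, mean=mu * np.ones(N), cov=np.eye(N) + tau2 * np.ones((N, N)))
print(log_m_pp, log_m_exact)   # agree up to the error of the uniform t-grid
```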
• 18. KL distance prior-posterior
$$KL\!\left(\pi(\theta\mid y),\pi(\theta)\right) = \int \ln\frac{\pi(\theta\mid y)}{\pi(\theta)}\,\pi(\theta\mid y)\,d\theta
= \int \ln\frac{f(y\mid\theta)}{m(y)}\,\pi(\theta\mid y)\,d\theta
= E_{\theta\mid y}\!\left[\ln f(y\mid\theta)\right] - \ln m(y)$$
Hence $-2\,KL = \bar D - D_m$ (a by-product of PP), i.e. $D_m = \bar D + 2\,KL$.
Compare with $DIC = \bar D + p_D$, where $p_D = \bar D - D(\bar\theta)$ measures model complexity.
• 19. PP / partial BF
1) If $\pi(\theta)$ is improper, the marginal $m(y)$ is also improper, making the BF ill-defined.
2) High sensitivity of the BF to the priors (it does not vanish with increasing sample size).
Idea behind the partial BF (Lempers, 1971): split $y = (y_P, y_T)$ into
- a learning (or pilot) sample $y_P$ used to tune the prior,
- a testing sample $y_T$ used for the actual data analysis.
Intrinsic BF (Berger & Pericchi, 1996); fractional BF (O'Hagan, 1995).
• 20. Fractional BF
A fraction b of the likelihood is used to tune the prior:
$$f(y_P\mid\theta) \approx f(y\mid\theta)^{b},\qquad b = m/N < 1 \quad \text{(O'Hagan, 1995)}$$
resulting in the fractional prior
$$\pi(\theta, b) \propto f(y\mid\theta)^{b}\,\pi(\theta)$$
• 21. PP & fractional BF
$$\pi(\theta, b) \propto f(y\mid\theta)^{b}\,\pi(\theta)$$
$$m_F(y,b) = \int f(y\mid\theta)^{1-b}\,\pi(\theta,b)\,d\theta
= \frac{\int f(y\mid\theta)\,\pi(\theta)\,d\theta}{\int f(y\mid\theta)^{b}\,\pi(\theta)\,d\theta}
= \frac{m(y,1)}{m(y,b)}\qquad \text{(with } m(y,t) \equiv z_t(y)\text{)}$$
PP directly provides $\pi(\theta, b)$ via $\pi(\theta\mid y, t = b)$, and
$$\log m_F(y,b) = \int_b^1 E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right] dt$$
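To make the link explicit (this step is added here, not on the slide), the last display is just the PP key result restricted to the sub-interval [b, 1]:
$$\log m_F(y,b) = \log\frac{z_1(y)}{z_b(y)} = \int_b^1 \frac{d}{dt}\log z_t(y)\,dt = \int_b^1 E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right] dt,$$
so a single power-posterior run over the whole temperature ladder yields the fractional marginal likelihood, and hence the fractional BF, for any b lying on the grid at no extra cost.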
• 22. PP: algorithm
MCMC with a discretization of t on [0,1]:
$$t_0 = 0 < t_1 < \dots < t_i < \dots < t_{n-1} < t_n = 1,\qquad t_i = (i/n)^{c},\ i=1,\dots,n;\ n = 20\text{--}100;\ c = 2\text{--}5$$
1) Make MCMC draws $\theta^{(g,i)}$, $g = 1,\dots,G$, from $\pi(\theta\mid y, t_i)$.
2) Compute $\hat E_i = \hat E_{\theta\mid y,t=t_i}\!\left[\log p(y\mid\theta)\right] = \frac{1}{G}\sum_{g=1}^{G}\log p\!\left(y\mid\theta^{(g,i)}\right)$.
Often conditional independence holds, $\log p(y\mid\theta) = \sum_{j=1}^{N}\log p(y_j\mid\theta)$, e.g. if $\theta$ is the closest stochastic parent of $y = (y_j)$ (as for the DIC).
3) Approximate the integral, e.g. by the trapezoidal rule (see the sketch after this slide):
$$\log \hat m(y) = \tfrac12\sum_{i=0}^{n-1}(t_{i+1}-t_i)\left(\hat E_{i+1} + \hat E_i\right)$$
Error due to this numerical approximation: Calderhead & Girolami (2009). Formula for the MC sampling error: see Friel & Pettitt (2008).
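A minimal sketch of steps 1-3 (this is not the slides' OpenBUGS implementation): a random-walk Metropolis chain is run at each rung of the ladder for the conjugate normal toy model, so that the answer can be checked against the exact log m(y); the ladder (n = 30, c = 4), chain lengths, step size and all names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(6)
# Toy model: y_i ~ N(theta, 1), theta ~ N(0, tau2); exact log m(y) for checking
N, tau2 = 20, 4.0
y = rng.normal(1.0, 1.0, size=N)
log_m_exact = multivariate_normal.logpdf(
    y, mean=np.zeros(N), cov=np.eye(N) + tau2 * np.ones((N, N)))

def log_lik(th):   return norm.logpdf(y, th, 1.0).sum()
def log_prior(th): return norm.logpdf(th, 0.0, np.sqrt(tau2))

# Temperature ladder t_i = (i/n)^c, concentrated near 0
n, c = 30, 4
ts = (np.arange(n + 1) / n) ** c

# Steps 1-2: at each t_i, random-walk Metropolis targeting f(y|theta)^t pi(theta),
# then average log f(y|theta) over the retained draws
G, burn, step = 4_000, 1_000, 0.5
E_hat = np.empty(n + 1)
theta = 0.0
for i, t in enumerate(ts):
    lp = t * log_lik(theta) + log_prior(theta)
    kept = []
    for g in range(G + burn):
        prop = theta + step * rng.normal()
        lp_prop = t * log_lik(prop) + log_prior(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        if g >= burn:
            kept.append(log_lik(theta))
    E_hat[i] = np.mean(kept)

# Step 3: trapezoidal rule over the ladder
log_m_pp = 0.5 * np.sum((ts[1:] - ts[:-1]) * (E_hat[1:] + E_hat[:-1]))
print("power posteriors:", log_m_pp, " exact:", log_m_exact)
```

The estimate should land within Monte Carlo and discretization error of the exact value; a conjugate toy is used only so that the check is possible, the loop itself is model-agnostic.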
• 23. PP: little toy example
0) $y_i\mid\lambda_i \sim \text{id } \mathcal P(\lambda_i x_i)$, i.e. $f(y_i\mid\lambda_i) = \dfrac{(\lambda_i x_i)^{y_i}}{y_i!}\exp(-\lambda_i x_i)$
1) $\lambda_i \sim \text{id } G(\alpha,\beta)$, i.e. $\pi(\lambda_i) = \dfrac{\beta^{\alpha}\lambda_i^{\alpha-1}\exp(-\beta\lambda_i)}{\Gamma(\alpha)}$
0 + 1) $y_i \sim \text{id } NB(\alpha, p_i)$ with $p_i = \beta/(\beta + x_i)$
Direct approach: $f(y_i) = \dfrac{\Gamma(y_i+\alpha)}{\Gamma(\alpha)\,y_i!}\,p_i^{\alpha}(1-p_i)^{y_i}$, so that
$$\ln f(y) = -n\ln\Gamma(\alpha) + \sum_{i=1}^{n}\ln\Gamma(y_i+\alpha) - \sum_{i=1}^{n}\ln(y_i!) + \alpha\sum_{i=1}^{n}\ln p_i + \sum_{i=1}^{n} y_i\ln(1-p_i)$$
Indirect approach: $f(y) = \prod_{i=1}^{n}\int f(y_i\mid\lambda_i)\,\pi(\lambda_i)\,d\lambda_i$
• 24. PP: little toy example (cont.)
Example: pump data (Ex #2 in WinBUGS; Carlin & Louis, p. 126).
y = number of failures of pumps in x thousand hours of operation:
y = (5, 1, 5, 14, 3, 19, 1, 1, 4, 22); x = (94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5); n = 10; α = β = 1
Direct approach: $D_m = -2\ln f(y) = 66.03$
Power posteriors (20 points): $\hat D_{FP} = 66.28 \pm 0.03$
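For this particular toy model the power posterior of each λ_i at temperature t is still a gamma distribution, so the thermodynamic integrand is available in closed form and the whole calculation can be verified in a few lines. This shortcut (no MCMC, fine uniform t-grid) is an illustration added here and differs from the slides' OpenBUGS runs with 20 grid points.

```python
import numpy as np
from scipy.special import gammaln, digamma

# Pump data from the slide: y failures in x thousand hours; alpha = beta = 1
y = np.array([5., 1., 5., 14., 3., 19., 1., 1., 4., 22.])
x = np.array([94.3, 15.7, 62.9, 126., 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
alpha = beta = 1.0

# Direct approach: y_i ~ NegBin(alpha, p_i) with p_i = beta / (beta + x_i)
p = beta / (beta + x)
log_f = np.sum(gammaln(y + alpha) - gammaln(alpha) - gammaln(y + 1)
               + alpha * np.log(p) + y * np.log1p(-p))
print("direct:           D_m = -2 ln f(y) =", -2 * log_f)

# Power posteriors: lambda_i | y, t ~ Gamma(alpha + t*y_i, rate = beta + t*x_i),
# so E_{lambda|y,t}[log f(y|lambda)] has a closed form (digamma for E[log lambda])
def E_logf(t):
    a_t, b_t = alpha + t * y, beta + t * x
    E_lam, E_loglam = a_t / b_t, digamma(a_t) - np.log(b_t)
    return np.sum(y * (E_loglam + np.log(x)) - E_lam * x - gammaln(y + 1))

t = np.linspace(0.0, 1.0, 2001)
E = np.array([E_logf(ti) for ti in t])
log_m = 0.5 * np.sum((t[1:] - t[:-1]) * (E[1:] + E[:-1]))   # trapezoidal rule
print("power posteriors: D_m = -2 ln m(y) =", -2 * log_m)
```

Both lines should reproduce, up to the grid error, the direct value 66.03 quoted above; the small gap to the 66.28 ± 0.03 reported with only 20 points of t is consistent with the discretization error discussed in the algorithm slide.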
• 25-28. PP: toy example in OpenBUGS
[Four slides of OpenBUGS screenshots (model code, doodle and output for the pump example); no text content survives the extraction.]
• 29. Sampling both θ and t
$$\log m(y) = \int_0^1 E_{\theta\mid y,t}\!\left[\log f(y\mid\theta)\right] dt
= \int_0^1\!\!\int \frac{\log f(y\mid\theta)}{p(t)}\;\underbrace{\pi(\theta\mid y,t)\,p(t)}_{\pi(\theta,t\mid y)}\;d\theta\,dt
= E_{\theta,t\mid y}\!\left[\frac{\log f(y\mid\theta)}{p(t)}\right]$$
with $\pi(\theta\mid y,t) \propto f(y\mid\theta)^{t}\,\pi(\theta)$.
If we assume $p(t) \propto z_t(y)$, then $\pi(t\mid\theta,y) \propto f(y\mid\theta)^{t}$.
Sampling (θ, t) jointly under these conditions gives poor estimates (too few draws of t close to 0).
• 30. Example 1: Potthoff & Roy's data
Growth measurements in 11 girls and 16 boys (Potthoff and Roy, 1964; Little and Rubin, 1987): distance from the centre of the pituitary to the pterygomaxillary fissure (unit 10^-4 m). "-" denotes a missing measurement (all at age 10).

Girls (age in years)            Boys (age in years)
Girl    8    10    12    14     Boy     8    10    12    14
  1   210   200   215   230      1   260   250   290   310
  2   210   215   240   255      2   215    -    230   265
  3   205    -    245   260      3   230   225   240   275
  4   235   245   250   265      4   255   275   265   270
  5   215   230   225   235      5   200    -    225   260
  6   200    -    210   225      6   245   255   270   285
  7   215   225   230   250      7   220   220   245   265
  8   230   230   235   240      8   240   215   245   255
  9   200    -    220   215      9   230   205   310   260
 10   165    -    190   195     10   275   280   310   315
 11   245   250   280   280     11   230   230   235   250
                                12   215    -    240   280
                                13   170    -    260   295
                                14   225   255   255   260
                                15   230   245   260   300
                                16   220    -    235   250
• 31. Model comparison on Potthoff's data
i: subscript for individual, $i = 1,\dots,I = 27$ (11 girls + 16 boys); j: subscript for the measurement at age $t_j \in \{8, 10, 12, 14\}$ years; $x_i$: sex indicator.
1) Purely fixed model:
$$y_{ij} = \underbrace{(\alpha_0 + \alpha x_i)}_{\text{intercept}} + \underbrace{(\beta_0 + \beta x_i)}_{\text{slope}}(t_j - 8) + e_{ij}$$
2) Random intercept model:
$$y_{ij} = (\alpha_0 + \alpha x_i + a_i) + (\beta_0 + \beta x_i)(t_j - 8) + e_{ij}$$
3) Random intercept & slope model assuming independent effects:
$$y_{ij} = (\alpha_0 + \alpha x_i + a_i) + (\beta_0 + \beta x_i + b_i)(t_j - 8) + e_{ij}$$
or equivalently $y_{ij} = \phi_{i1} + \phi_{i2}(t_j - 8) + e_{ij}$, $y_{ij} \sim \text{id } N(\eta_{ij}, \sigma_e^2)$, with
$$\phi_i = \begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix} \sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\ \begin{pmatrix}\sigma_a^2 & 0\\ 0 & \sigma_b^2\end{pmatrix}\right)$$
4) Random intercept & slope model assuming correlated effects:
$$\phi_i \sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\ \begin{pmatrix}\sigma_a^2 & \sigma_{ab}\\ \sigma_{ab} & \sigma_b^2\end{pmatrix}\right)$$
• 32. Model presentation: hierarchical Bayes
1st level: $y_{ij} \sim \text{id } N(\eta_{ij}, \sigma_e^2)$ with $\eta_{ij} = \phi_{i1} + \phi_{i2}(t_j - 8)$.
2nd level:
2a) $\phi_i = \begin{pmatrix}\phi_{i1}\\ \phi_{i2}\end{pmatrix} \sim N\!\left(\begin{pmatrix}\alpha_0+\alpha x_i\\ \beta_0+\beta x_i\end{pmatrix},\ \Sigma\right)$ with $\Sigma = \begin{pmatrix}\sigma_a^2 & \sigma_{ab}\\ \sigma_{ab} & \sigma_b^2\end{pmatrix}$
2b) $\sigma_e \sim U(0, \Delta_e)$ or $\sigma_e^2 \sim \text{InvG}(1, \tilde\sigma_e^2)$
3rd level:
Fixed effects: $\alpha_0, \alpha, \beta_0, \beta \sim U(\text{inf}, \text{sup})$
Variance (covariance) components:
- If $\sigma_{ab} = 0$: i) $\sigma_a \sim U(0,\Delta_a)$, same for $\sigma_b \sim U(0,\Delta_b)$; or ii) $\sigma_a^2 \sim \text{InvG}(1,\tilde\sigma_a^2)$, same for $\sigma_b^2 \sim \text{InvG}(1,\tilde\sigma_b^2)$
- If $\sigma_{ab} \neq 0$: i) $\sigma_a \sim U(0,\Delta_a)$, $\sigma_b \sim U(0,\Delta_b)$, $\rho \sim U(-1,1)$; or ii) $\Omega \sim W\!\left((\nu\Sigma)^{-1}, \nu\right)$ for $\Omega = \Sigma^{-1}$, with $\nu = \dim(\Omega) + 1$ and $\Sigma$ a known location parameter.
*Take care: WinBUGS uses another parameterization, i.e. $W(\nu\Sigma, \nu)$.
• 33. Results
[Results table shown as an image on the slide; no text survives the extraction.]
• 34. Results: fractional priors (b = 0 vs 0.125)
[Results table shown as an image on the slide; no text survives the extraction.]
• 35. Example 2: models of genetic differentiation
Two-level hierarchical model: i = locus, j = (sub)population.
$y_{ij}$ = number of genes carrying a given allele at locus i in population j (out of $n_{ij}$ sampled); $\alpha_{ij}$ = frequency of that allele at locus i in population j.
0) $y_{ij}\mid\alpha_{ij} \sim \text{id } B(n_{ij}, \alpha_{ij})$
1) $\alpha_{ij}\mid\pi_i, c_j \sim \text{id Beta}\!\left(\tau_j\pi_i,\ \tau_j(1-\pi_i)\right)$ with $\tau_j = \dfrac{1-c_j}{c_j}$, where $c_j$ is a differentiation index and $\pi_i$ the frequency of the allele at locus i in the gene pool
2) $\pi_i \sim \text{id Beta}(a_\pi, b_\pi)$, $c_j \sim \text{id Beta}(a_c, b_c)$
Migration-drift equilibrium model (Balding).
• 36. Example 2: Nicholson's model
Nicholson et al. (2002): same as previously, except that
1) $\alpha_{ij}\mid\pi_i, c_j \sim \text{id } N\!\left(\pi_i,\ c_j\pi_i(1-\pi_i)\right)$, a truncated normal with point masses at 0 and 1 obtained by setting $\alpha^*_{ij} = \max(0, \min(1, \alpha_{ij}))$, so that $y_{ij}\mid\alpha^*_{ij} \sim \text{id } B(n_{ij}, \alpha^*_{ij})$
2) $\pi_i \sim \text{id Beta}(a_\pi, b_\pi)$, $c_j \sim \text{id Beta}(a_c, b_c)$
Pure drift model.
• 37. Results
[Results table shown as an image on the slide; no text survives the extraction.]
• 38. Conclusion
- Derived from thermodynamic integration; linked to "path sampling".
- Easy to understand and quite general.
- Well suited to complex hierarchical models: the "thetas" can be defined as the closest stochastic parents of the data, making the latter conditionally independent.
- Draws only from posterior (power-posterior) distributions.
- Gives the fractional BF as a by-product.
- Easy to implement (including in OpenBUGS) but time consuming.
- Caution needed in the discretization of t (close to 0).
• 39. Some references
Chen M, Shao Q, Ibrahim J (2000) Monte Carlo methods in Bayesian computation. Springer.
Chib S (1995) Marginal likelihood from the Gibbs output. JASA, 90, 1313-1321.
Chopin N, Robert CP (2010) Properties of nested sampling. Biometrika, 97, 741-755.
Friel N, Pettitt AN (2008) Marginal likelihood estimation via power posteriors. JRSS B, 70, 589-607.
Frühwirth-Schnatter S (2004) Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques. Econometrics Journal, 7, 143-167.
Gelman A, Meng X-L (1998) Simulating normalizing constants: from importance sampling to bridge sampling and path sampling. Statistical Science, 13, 163-185.
Lartillot N, Philippe H (2006) Computing Bayes factors using thermodynamic integration. Systematic Biology, 55, 195-207.
Marin JM, Robert CP (2009) Importance sampling methods for Bayesian discrimination between embedded models. arXiv:0910.2325v1.
Meng X-L, Wong WH (1996) Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6, 831-860.
O'Hagan A (1995) Fractional Bayes factors for model comparison. JRSS B, 57, 99-138.
• 40. Acknowledgements
- Nial Friel (University College Dublin) for his interest in these applications and his invaluable explanations and suggestions.
- Tony O'Hagan for further insight into the FBF.
- Gilles Celeux and Mathieu Gautier, co-advisors of the Master dissertation of Yoan Soussan (Paris VI).
- Christian Robert for his blog and his relevant comments, standpoints and bibliographical references.
- The Applibugs & Babayes groups for stimulating discussions on DIC, BF, CPO and other information criteria (AIC, BIC).