Upcoming SlideShare
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Standard text messaging rates apply

# Computational methods for Bayesian model choice

1,423

Published on

Cours OFPR given in CREST on March 2, 5, 9 and 12

Cours OFPR given in CREST on March 2, 5, 9 and 12

Published in: Education, Technology
1 Like
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total Views
1,423
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
64
0
Likes
1
Embeds 0
No embeds

No notes for slide

### Transcript

• 1. On some computational methods for Bayesian model choice On some computational methods for Bayesian model choice Christian P. Robert CREST-INSEE and Universit´ Paris Dauphine e http://www.ceremade.dauphine.fr/~xian c Cours OFPR, CREST, Malakoﬀ 2-12 mars 2009
• 2. On some computational methods for Bayesian model choice Outline Introduction 1 Importance sampling solutions 2 Cross-model solutions 3 Nested sampling 4 ABC model choice 5
• 3. On some computational methods for Bayesian model choice Introduction Bayes tests Construction of Bayes tests Deﬁnition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Example (Normal mean) For x ∼ N (θ, 1), decide whether or not θ ≤ 0.
• 4. On some computational methods for Bayesian model choice Introduction Bayes tests Construction of Bayes tests Deﬁnition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Example (Normal mean) For x ∼ N (θ, 1), decide whether or not θ ≤ 0.
• 5. On some computational methods for Bayesian model choice Introduction Bayes tests The 0 − 1 loss Neyman–Pearson loss for testing hypotheses Test of H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0 . Then D = {0, 1} The 0 − 1 loss 1 − d if θ ∈ Θ0 L(θ, d) = d otherwise,
• 6. On some computational methods for Bayesian model choice Introduction Bayes tests The 0 − 1 loss Neyman–Pearson loss for testing hypotheses Test of H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0 . Then D = {0, 1} The 0 − 1 loss 1 − d if θ ∈ Θ0 L(θ, d) = d otherwise,
• 7. On some computational methods for Bayesian model choice Introduction Bayes tests Type–one and type–two errors Associated with the risk R(θ, δ) = Eθ [L(θ, δ(x))] if θ ∈ Θ0 , Pθ (δ(x) = 0) = Pθ (δ(x) = 1) otherwise, Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is if π(θ ∈ Θ0 |x) > π(θ ∈ Θ0 |x), 1 δ π (x) = 0 otherwise,
• 8. On some computational methods for Bayesian model choice Introduction Bayes tests Type–one and type–two errors Associated with the risk R(θ, δ) = Eθ [L(θ, δ(x))] if θ ∈ Θ0 , Pθ (δ(x) = 0) = Pθ (δ(x) = 1) otherwise, Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is if π(θ ∈ Θ0 |x) > π(θ ∈ Θ0 |x), 1 δ π (x) = 0 otherwise,
• 9. On some computational methods for Bayesian model choice Introduction Bayes factor Bayes factor Deﬁnition (Bayes factors) For testing hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0 , under prior π(Θ0 )π0 (θ) + π(Θc )π1 (θ) , 0 central quantity f (x|θ)π0 (θ)dθ π(Θ0 |x) π(Θ0 ) Θ0 B01 = = π(Θc |x) π(Θc ) 0 0 f (x|θ)π1 (θ)dθ Θc 0 [Jeﬀreys, 1939]
• 10. On some computational methods for Bayesian model choice Introduction Bayes factor Self-contained concept Outside decision-theoretic environment: eliminates impact of π(Θ0 ) but depends on the choice of (π0 , π1 ) Bayesian/marginal equivalent to the likelihood ratio Jeﬀreys’ scale of evidence: π if log10 (B10 ) between 0 and 0.5, evidence against H0 weak, π if log10 (B10 ) 0.5 and 1, evidence substantial, π if log10 (B10 ) 1 and 2, evidence strong and π if log10 (B10 ) above 2, evidence decisive Requires the computation of the marginal/evidence under both hypotheses/models
• 11. On some computational methods for Bayesian model choice Introduction Bayes factor Hot hand Example (Binomial homogeneity) Consider H0 : yi ∼ B(ni , p) (i = 1, . . . , G) vs. H1 : yi ∼ B(ni , pi ). Conjugate priors pi ∼ Be(α = ξ/ω, β = (1 − ξ)/ω), with a uniform prior on E[pi |ξ, ω] = ξ and on p (ω is ﬁxed) 1G 1 pyi (1 − pi )ni −yi pα−1 (1 − pi )β−1 d pi B10 = i i 0 i=1 0 ×Γ(1/ω)/[Γ(ξ/ω)Γ((1 − ξ)/ω)]dξ 1 P P i (ni −yi ) i yi (1 − p) 0p dp For instance, log10 (B10 ) = −0.79 for ω = 0.005 and G = 138 slightly favours H0 . [Kass & Raftery, 1995]
• 12. On some computational methods for Bayesian model choice Introduction Bayes factor Hot hand Example (Binomial homogeneity) Consider H0 : yi ∼ B(ni , p) (i = 1, . . . , G) vs. H1 : yi ∼ B(ni , pi ). Conjugate priors pi ∼ Be(α = ξ/ω, β = (1 − ξ)/ω), with a uniform prior on E[pi |ξ, ω] = ξ and on p (ω is ﬁxed) 1G 1 pyi (1 − pi )ni −yi pα−1 (1 − pi )β−1 d pi B10 = i i 0 i=1 0 ×Γ(1/ω)/[Γ(ξ/ω)Γ((1 − ξ)/ω)]dξ 1 P P i (ni −yi ) i yi (1 − p) 0p dp For instance, log10 (B10 ) = −0.79 for ω = 0.005 and G = 138 slightly favours H0 . [Kass & Raftery, 1995]
• 13. On some computational methods for Bayesian model choice Introduction Bayes factor Hot hand Example (Binomial homogeneity) Consider H0 : yi ∼ B(ni , p) (i = 1, . . . , G) vs. H1 : yi ∼ B(ni , pi ). Conjugate priors pi ∼ Be(α = ξ/ω, β = (1 − ξ)/ω), with a uniform prior on E[pi |ξ, ω] = ξ and on p (ω is ﬁxed) 1G 1 pyi (1 − pi )ni −yi pα−1 (1 − pi )β−1 d pi B10 = i i 0 i=1 0 ×Γ(1/ω)/[Γ(ξ/ω)Γ((1 − ξ)/ω)]dξ 1 P P i (ni −yi ) i yi (1 − p) 0p dp For instance, log10 (B10 ) = −0.79 for ω = 0.005 and G = 138 slightly favours H0 . [Kass & Raftery, 1995]
• 14. On some computational methods for Bayesian model choice Introduction Model choice Model choice and model comparison Choice between models Several models available for the same observation Mi : x ∼ fi (x|θi ), i∈I where I can be ﬁnite or inﬁnite Replace hypotheses with models but keep marginal likelihoods and Bayes factors
• 15. On some computational methods for Bayesian model choice Introduction Model choice Model choice and model comparison Choice between models Several models available for the same observation Mi : x ∼ fi (x|θi ), i∈I where I can be ﬁnite or inﬁnite Replace hypotheses with models but keep marginal likelihoods and Bayes factors
• 16. On some computational methods for Bayesian model choice Introduction Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj Θj j take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj Θj j
• 17. On some computational methods for Bayesian model choice Introduction Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj Θj j take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj Θj j
• 18. On some computational methods for Bayesian model choice Introduction Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj Θj j take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj Θj j
• 19. On some computational methods for Bayesian model choice Introduction Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj Θj j take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj Θj j
• 20. On some computational methods for Bayesian model choice Introduction Evidence Evidence All these problems end up with a similar quantity, the evidence Zk = πk (θk )Lk (θk ) dθk , Θk aka the marginal likelihood.
• 21. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Importance sampling Paradox Simulation from f (the true density) is not necessarily optimal Alternative to direct sampling from f is importance sampling, based on the alternative representation f (x) Ef [h(X)] = h(x) g(x) dx . g(x) X which allows us to use other distributions than f
• 22. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Importance sampling Paradox Simulation from f (the true density) is not necessarily optimal Alternative to direct sampling from f is importance sampling, based on the alternative representation f (x) Ef [h(X)] = h(x) g(x) dx . g(x) X which allows us to use other distributions than f
• 23. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Importance sampling algorithm Evaluation of Ef [h(X)] = h(x) f (x) dx X by Generate a sample X1 , . . . , Xn from a distribution g 1 Use the approximation 2 m f (Xj ) 1 h(Xj ) m g(Xj ) j=1
• 24. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bayes factor approximation When approximating the Bayes factor f0 (x|θ0 )π0 (θ0 )dθ0 Θ0 B01 = f1 (x|θ1 )π1 (θ1 )dθ1 Θ1 use of importance functions and and 0 1 n−1 n0 i i i i=1 f0 (x|θ0 )π0 (θ0 )/ 0 (θ0 ) 0 B01 = n−1 n1 i i i i=1 f1 (x|θ1 )π1 (θ1 )/ 1 (θ1 ) 1
• 25. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling Special case: If π1 (θ1 |x) ∝ π1 (θ1 |x) ˜ π2 (θ2 |x) ∝ π2 (θ2 |x) ˜ live on the same space (Θ1 = Θ2 ), then n π1 (θi |x) 1 ˜ B12 ≈ θi ∼ π2 (θ|x) π2 (θi |x) n ˜ i=1 [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
• 26. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling variance The bridge sampling estimator does poorly if 2 π1 (θ) − π2 (θ) var(B12 ) 1 =E 2 n π2 (θ) B12 is large, i.e. if π1 and π2 have little overlap...
• 27. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling variance The bridge sampling estimator does poorly if 2 π1 (θ) − π2 (θ) var(B12 ) 1 =E 2 n π2 (θ) B12 is large, i.e. if π1 and π2 have little overlap...
• 28. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance (Further) bridge sampling In addition π2 (θ|x)α(θ)π1 (θ|x)dθ ˜ ∀ α(·) B12 = π1 (θ|x)α(θ)π2 (θ|x)dθ ˜ n1 1 π2 (θ1i |x)α(θ1i ) ˜ n1 i=1 ≈ θji ∼ πj (θ|x) n2 1 π1 (θ2i |x)α(θ2i ) ˜ n2 i=1
• 29. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance An infamous example When 1 α(θ) = π1 (θ)˜2 (θ) ˜ π harmonic mean approximation to B12 n1 1 1/˜ (θ1i |x) π n1 i=1 1 θji ∼ πj (θ|x) B12 = n2 1 1/˜2 (θ2i |x) π n2 i=1 [Newton & Raftery, 1994] Infamous: Most often leads to an inﬁnite variance!!! [Radford Neal’s blog, 2008]
• 30. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance An infamous example When 1 α(θ) = π1 (θ)˜2 (θ) ˜ π harmonic mean approximation to B12 n1 1 1/˜ (θ1i |x) π n1 i=1 1 θji ∼ πj (θ|x) B12 = n2 1 1/˜2 (θ2i |x) π n2 i=1 [Newton & Raftery, 1994] Infamous: Most often leads to an inﬁnite variance!!! [Radford Neal’s blog, 2008]
• 31. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
• 32. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
• 33. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling The optimal choice of auxiliary function is n1 + n2 α= n1 π1 (θ|x) + n2 π2 (θ|x) leading to n1 π2 (θ1i |x) 1 ˜ n1 π1 (θ1i |x) + n2 π2 (θ1i |x) n1 i=1 B12 ≈ n2 π1 (θ2i |x) 1 ˜ n1 π1 (θ2i |x) + n2 π2 (θ2i |x) n2 i=1 Back later!
• 34. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling (2) Reason: π1 (θ)π2 (θ)[n1 π1 (θ) + n2 π2 (θ)]α(θ)2 dθ Var(B12 ) 1 ≈ −1 2 2 n1 n2 B12 π1 (θ)π2 (θ)α(θ) dθ (by the δ method) Dependence on the unknown normalising constants solved iteratively
• 35. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling (2) Reason: π1 (θ)π2 (θ)[n1 π1 (θ) + n2 π2 (θ)]α(θ)2 dθ Var(B12 ) 1 ≈ −1 2 2 n1 n2 B12 π1 (θ)π2 (θ)α(θ) dθ (by the δ method) Dependence on the unknown normalising constants solved iteratively
• 36. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling Another identity: Eϕ [˜1 (θ)/ϕ(θ)] π B12 = Eϕ [˜2 (θ)/ϕ(θ)] π for any density ϕ with suﬃciently large support [Torrie & Valleau, 1977] Use of a single sample θ1 , . . . , θn from ϕ i=1 π1 (θi )/ϕ(θi ) ˜ B12 = i=1 π2 (θi )/ϕ(θi ) ˜
• 37. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling Another identity: Eϕ [˜1 (θ)/ϕ(θ)] π B12 = Eϕ [˜2 (θ)/ϕ(θ)] π for any density ϕ with suﬃciently large support [Torrie & Valleau, 1977] Use of a single sample θ1 , . . . , θn from ϕ i=1 π1 (θi )/ϕ(θi ) ˜ B12 = i=1 π2 (θi )/ϕ(θi ) ˜
• 38. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling (2) Approximate variance: 2 (π1 (θ) − π2 (θ))2 var(B12 ) 1 = Eϕ 2 ϕ(θ)2 n B12 Optimal choice: | π1 (θ) − π2 (θ) | ϕ∗ (θ) = | π1 (η) − π2 (η) | dη [Chen, Shao & Ibrahim, 2000]
• 39. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling (2) Approximate variance: 2 (π1 (θ) − π2 (θ))2 var(B12 ) 1 = Eϕ 2 ϕ(θ)2 n B12 Optimal choice: | π1 (θ) − π2 (θ) | ϕ∗ (θ) = | π1 (η) − π2 (η) | dη [Chen, Shao & Ibrahim, 2000]
• 40. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Improving upon bridge sampler Theorem 5.5.3: The asymptotic variance of the optimal ratio importance sampling estimator is smaller than the asymptotic variance of the optimal bridge sampling estimator [Chen, Shao, & Ibrahim, 2000] Does not require the normalising constant | π1 (η) − π2 (η) | dη but a simulation from ϕ∗ (θ) ∝| π1 (θ) − π2 (θ) | .
• 41. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Improving upon bridge sampler Theorem 5.5.3: The asymptotic variance of the optimal ratio importance sampling estimator is smaller than the asymptotic variance of the optimal bridge sampling estimator [Chen, Shao, & Ibrahim, 2000] Does not require the normalising constant | π1 (η) − π2 (η) | dη but a simulation from ϕ∗ (θ) ∝| π1 (θ) − π2 (θ) | .
• 42. On some computational methods for Bayesian model choice Importance sampling solutions Varying dimensions Generalisation to point null situations When π1 (θ1 )dθ1 ˜ Θ1 B12 = π2 (θ2 )dθ2 ˜ Θ2 and Θ2 = Θ1 × Ψ, we get θ2 = (θ1 , ψ) and π1 (θ1 )ω(ψ|θ1 ) ˜ B12 = Eπ2 π2 (θ1 , ψ) ˜ holds for any conditional density ω(ψ|θ1 ).
• 43. On some computational methods for Bayesian model choice Importance sampling solutions Varying dimensions X-dimen’al bridge sampling Generalisation of the previous identity: For any α, Eπ [˜1 (θ1 )ω(ψ|θ1 )α(θ1 , ψ)] π B12 = 2 Eπ1 ×ω [˜2 (θ1 , ψ)α(θ1 , ψ)] π and, for any density ϕ, Eϕ [˜1 (θ1 )ω(ψ|θ1 )/ϕ(θ1 , ψ)] π B12 = Eϕ [˜2 (θ1 , ψ)/ϕ(θ1 , ψ)] π [Chen, Shao, & Ibrahim, 2000] Optimal choice: ω(ψ|θ1 ) = π2 (ψ|θ1 ) [Theorem 5.8.2]
• 44. On some computational methods for Bayesian model choice Importance sampling solutions Varying dimensions X-dimen’al bridge sampling Generalisation of the previous identity: For any α, Eπ [˜1 (θ1 )ω(ψ|θ1 )α(θ1 , ψ)] π B12 = 2 Eπ1 ×ω [˜2 (θ1 , ψ)α(θ1 , ψ)] π and, for any density ϕ, Eϕ [˜1 (θ1 )ω(ψ|θ1 )/ϕ(θ1 , ψ)] π B12 = Eϕ [˜2 (θ1 , ψ)/ϕ(θ1 , ψ)] π [Chen, Shao, & Ibrahim, 2000] Optimal choice: ω(ψ|θ1 ) = π2 (ψ|θ1 ) [Theorem 5.8.2]
• 45. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk from a posterior sample Use of the [harmonic mean] identity ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1 Eπk x= dθk = πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ
• 46. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk from a posterior sample Use of the [harmonic mean] identity ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1 Eπk x= dθk = πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ
• 47. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk (θk )Lk (θk ) for the approximation T (t) ϕ(θk ) 1 Z1k = 1 (t) (t) T πk (θk )Lk (θk ) t=1 to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ
• 48. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk (θk )Lk (θk ) for the approximation T (t) ϕ(θk ) 1 Z1k = 1 (t) (t) T πk (θk )Lk (θk ) t=1 to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ
• 49. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling (cont’d) Compare Z1k with a standard importance sampling approximation T (t) (t) πk (θk )Lk (θk ) 1 = Z2k (t) T ϕ(θk ) t=1 (t) where the θk ’s are generated from the density ϕ(·) (with fatter tails like t’s)
• 50. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk using a mixture representation Bridge sampling redux Design a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight
• 51. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk using a mixture representation Bridge sampling redux Design a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight
• 52. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t Take δ (t) = 1 with probability 1 (t−1) (t−1) (t−1) (t−1) (t−1) ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) and δ (t) = 2 otherwise; (t) (t−1) If δ (t) = 1, generate θk ∼ MCMC(θk , θk ) where 2 MCMC(θk , θk ) denotes an arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk ); (t) If δ (t) = 2, generate θk ∼ ϕ(θk ) independently 3
• 53. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t Take δ (t) = 1 with probability 1 (t−1) (t−1) (t−1) (t−1) (t−1) ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) and δ (t) = 2 otherwise; (t) (t−1) If δ (t) = 1, generate θk ∼ MCMC(θk , θk ) where 2 MCMC(θk , θk ) denotes an arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk ); (t) If δ (t) = 2, generate θk ∼ ϕ(θk ) independently 3
• 54. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t Take δ (t) = 1 with probability 1 (t−1) (t−1) (t−1) (t−1) (t−1) ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) and δ (t) = 2 otherwise; (t) (t−1) If δ (t) = 1, generate θk ∼ MCMC(θk , θk ) where 2 MCMC(θk , θk ) denotes an arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk ); (t) If δ (t) = 2, generate θk ∼ ϕ(θk ) independently 3
• 55. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Evidence approximation by mixtures Rao-Blackwellised estimate T ˆ1 (t) (t) (t) (t) (t) ξ= ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) , T t=1 converges to ω1 Zk /{ω1 Zk + 1} ˆ Deduce Zˆ from ω1 Z3k /{ω1 Z3k + 1} = ξ ie ˆ ˆ 3k (t) (t) (t) (t) (t) T t=1 ω1 πk (θk )Lk (θk ) ω1 π(θk )Lk (θk ) + ϕ(θk ) ˆ Z3k = (t) (t) (t) (t) T t=1 ϕ(θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) [Bridge sampler]
• 56. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Evidence approximation by mixtures Rao-Blackwellised estimate T ˆ1 (t) (t) (t) (t) (t) ξ= ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) , T t=1 converges to ω1 Zk /{ω1 Zk + 1} ˆ Deduce Zˆ from ω1 Z3k /{ω1 Z3k + 1} = ξ ie ˆ ˆ 3k (t) (t) (t) (t) (t) T t=1 ω1 πk (θk )Lk (θk ) ω1 π(θk )Lk (θk ) + ϕ(θk ) ˆ Z3k = (t) (t) (t) (t) T t=1 ϕ(θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) [Bridge sampler]
• 57. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and θk ∼ πk (θk ), fk (x|θk ) πk (θk ) Zk = mk (x) = πk (θk |x) Use of an approximation to the posterior ∗ ∗ fk (x|θk ) πk (θk ) Zk = mk (x) = . ˆ∗ πk (θk |x)
• 58. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and θk ∼ πk (θk ), fk (x|θk ) πk (θk ) Zk = mk (x) = πk (θk |x) Use of an approximation to the posterior ∗ ∗ fk (x|θk ) πk (θk ) Zk = mk (x) = . ˆ∗ πk (θk |x)
• 59. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Case of latent variables For missing variable z as in mixture models, natural Rao-Blackwell estimate T 1 (t) ∗ ∗ πk (θk |x) = πk (θk |x, zk ) , T t=1 (t) where the zk ’s are Gibbs sampled latent variables
• 60. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N (0, 1) + 0.7N (2.3, 1) and 0.7N (2.3, 1) + 0.3N (0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable
• 61. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N (0, 1) + 0.7N (2.3, 1) and 0.7N (2.3, 1) + 0.3N (0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable
• 62. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Connected diﬃculties Number of modes of the likelihood of order O(k!): 1 c Maximization and even [MCMC] exploration of the posterior surface harder Under exchangeable priors on (θ, p) [prior invariant under 2 permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2
• 63. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Connected diﬃculties Number of modes of the likelihood of order O(k!): 1 c Maximization and even [MCMC] exploration of the posterior surface harder Under exchangeable priors on (θ, p) [prior invariant under 2 permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2
• 64. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution License Since Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks energy to switch simultaneously enough component allocations at once 0.2 0.3 0.4 0.5 −1 0 1 2 3 µi pi −1 0 1 2 3 0 100 200 300 400 500 µ n i 0.4 0.6 0.8 1.0 0.2 0.3 0.4 0.5 σi pi 0 100 200 300 400 500 0.2 0.3 0.4 0.5 n pi 0.4 0.6 0.8 1.0 −1 0 1 2 3 σi µi 0 100 200 300 400 500 0.4 0.6 0.8 1.0 σ n i
• 65. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
• 66. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
• 67. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
• 68. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Compensation for label switching (t) For mixture models, zk usually fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory 1 πk (θk |x) = πk (σ(θk )|x) = πk (σ(θk )|x) k! σ∈S for all σ’s in Sk , set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using T 1 (t) ∗ ∗ πk (θk |x) = πk (σ(θk )|x, zk ) . T k! σ∈Sk t=1 [Berkhof, Mechelen, & Gelman, 2003]
• 69. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Compensation for label switching (t) For mixture models, zk usually fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory 1 πk (θk |x) = πk (σ(θk )|x) = πk (σ(θk )|x) k! σ∈S for all σ’s in Sk , set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using T 1 (t) ∗ ∗ πk (θk |x) = πk (σ(θk )|x, zk ) . T k! σ∈Sk t=1 [Berkhof, Mechelen, & Gelman, 2003]
• 70. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset n = 82 galaxies as a mixture of k normal distributions with both mean and variance unknown. [Roeder, 1992] Average density 0.8 0.6 Relative Frequency 0.4 0.2 0.0 −2 −1 0 1 2 3 data
• 71. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset (k) ∗ Using only the original estimate, with θk as the MAP estimator, log(mk (x)) = −105.1396 ˆ for k = 3 (based on 103 simulations), while introducing the permutations leads to log(mk (x)) = −103.3479 ˆ Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk ). [Lee, Marin, Mengersen & Robert, 2008]
• 72. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset (k) ∗ Using only the original estimate, with θk as the MAP estimator, log(mk (x)) = −105.1396 ˆ for k = 3 (based on 103 simulations), while introducing the permutations leads to log(mk (x)) = −103.3479 ˆ Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk ). [Lee, Marin, Mengersen & Robert, 2008]
• 73. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset (k) ∗ Using only the original estimate, with θk as the MAP estimator, log(mk (x)) = −105.1396 ˆ for k = 3 (based on 103 simulations), while introducing the permutations leads to log(mk (x)) = −103.3479 ˆ Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk ). [Lee, Marin, Mengersen & Robert, 2008]
• 74. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Bayesian variable selection Regression setting: one dependent random variable y and a set {x1 , . . . , xk } of k explanatory variables. Question: Are all xi ’s involved in the regression? Assumption: every subset {i1 , . . . , iq } of q (0 ≤ q ≤ k) explanatory variables, {1n , xi1 , . . . , xiq }, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model] Computational issue 2k models in competition...
• 75. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Bayesian variable selection Regression setting: one dependent random variable y and a set {x1 , . . . , xk } of k explanatory variables. Question: Are all xi ’s involved in the regression? Assumption: every subset {i1 , . . . , iq } of q (0 ≤ q ≤ k) explanatory variables, {1n , xi1 , . . . , xiq }, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model] Computational issue 2k models in competition...
• 76. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Bayesian variable selection Regression setting: one dependent random variable y and a set {x1 , . . . , xk } of k explanatory variables. Question: Are all xi ’s involved in the regression? Assumption: every subset {i1 , . . . , iq } of q (0 ≤ q ≤ k) explanatory variables, {1n , xi1 , . . . , xiq }, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model] Computational issue 2k models in competition...
• 77. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Bayesian variable selection Regression setting: one dependent random variable y and a set {x1 , . . . , xk } of k explanatory variables. Question: Are all xi ’s involved in the regression? Assumption: every subset {i1 , . . . , iq } of q (0 ≤ q ≤ k) explanatory variables, {1n , xi1 , . . . , xiq }, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model] Computational issue 2k models in competition...
• 78. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Model notations 1 X = 1n x 1 · · · xk is the matrix containing 1n and all the k potential predictor variables Each model Mγ associated with binary indicator vector 2 γ ∈ Γ = {0, 1}k where γi = 1 means that the variable xi is included in the model Mγ qγ = 1T γ number of variables included in the model Mγ 3 n t1 (γ) and t0 (γ) indices of variables included in the model and 4 indices of variables not included in the model
• 79. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Model indicators For β ∈ Rk+1 and X, we deﬁne βγ as the subvector βγ = β0 , (βi )i∈t1 (γ) and Xγ as the submatrix of X where only the column 1n and the columns in t1 (γ) have been left.
• 80. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Models in competition The model Mγ is thus deﬁned as y|γ, βγ , σ 2 , X ∼ Nn Xγ βγ , σ 2 In where βγ ∈ Rqγ +1 and σ 2 ∈ R∗ are the unknown parameters. + Warning σ 2 is common to all models and thus uses the same prior for all models
• 81. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Models in competition The model Mγ is thus deﬁned as y|γ, βγ , σ 2 , X ∼ Nn Xγ βγ , σ 2 In where βγ ∈ Rqγ +1 and σ 2 ∈ R∗ are the unknown parameters. + Warning σ 2 is common to all models and thus uses the same prior for all models
• 82. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Informative G-prior Many (2k ) models in competition: we cannot expect a practitioner to specify a prior on every Mγ in a completely subjective and autonomous manner. Shortcut: We derive all priors from a single global prior associated with the so-called full model that corresponds to γ = (1, . . . , 1).
• 83. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Prior deﬁnitions For the full model, Zellner’s G-prior: (i) β|σ 2 , X ∼ Nk+1 (β, cσ 2 (X T X)−1 ) and σ 2 ∼ π(σ 2 |X) = σ −2 ˜ For each model Mγ , the prior distribution of βγ conditional (ii) on σ 2 is ﬁxed as −1 ˜ βγ |γ, σ 2 ∼ Nqγ +1 βγ , cσ 2 Xγ Xγ T , −1 ˜ T˜ T Xγ β and same prior on σ 2 . where βγ = Xγ Xγ
• 84. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Prior completion The joint prior for model Mγ is the improper prior 1 T −(qγ +1)/2−1 ˜ π(βγ , σ 2 |γ) ∝ σ2 exp − βγ − βγ 2(cσ 2 ) ˜ T (Xγ Xγ ) βγ − βγ .
• 85. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Prior competition (2) Inﬁnitely many ways of deﬁning a prior on the model index γ: choice of uniform prior π(γ|X) = 2−k . Posterior distribution of γ central to variable selection since it is proportional to marginal density of y on Mγ (or evidence of Mγ ) π(γ|y, X) ∝ f (y|γ, X)π(γ|X) ∝ f (y|γ, X) f (y|γ, β, σ 2 , X)π(β|γ, σ 2 , X) dβ π(σ 2 |X) dσ 2 . =
• 86. On some computational methods for Bayesian model choice Cross-model solutions Variable selection f (y|γ, σ 2 , X) = f (y|γ, β, σ 2 )π(β|γ, σ 2 ) dβ 1T −n/2 = (c + 1)−(qγ +1)/2 (2π)−n/2 σ 2 exp − yy 2σ 2 1 −1 ˜T T ˜ cy T Xγ Xγ Xγ T T Xγ y − βγ Xγ Xγ βγ + , 2σ 2 (c + 1) this posterior density satisﬁes −1 cT π(γ|y, X) ∝ (c + 1)−(qγ +1)/2 y T y − T T y Xγ Xγ Xγ Xγ y c+1 −n/2 1 ˜T T ˜ − β X Xγ βγ . c+1 γ γ
• 87. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars t1 (γ) π(γ|y, X) 0,1,2,4,5 0.2316 0,1,2,4,5,9 0.0374 0,1,9 0.0344 0,1,2,4,5,10 0.0328 0,1,4,5 0.0306 0,1,2,9 0.0250 0,1,2,4,5,7 0.0241 0,1,2,4,5,8 0.0238 0,1,2,4,5,6 0.0237 0,1,2,3,4,5 0.0232 0,1,6,9 0.0146 0,1,2,3,9 0.0145 0,9 0.0143 0,1,2,6,9 0.0135 0,1,4,5,9 0.0128 0,1,3,9 0.0117 0,1,2,8 0.0115
• 88. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars (cont’d) Interpretation Model Mγ with the highest posterior probability is t1 (γ) = (1, 2, 4, 5), which corresponds to the variables - altitude, - slope, - height of the tree sampled in the center of the area, and - diameter of the tree sampled in the center of the area. Corresponds to the ﬁve variables identiﬁed in the R regression output
• 89. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars (cont’d) Interpretation Model Mγ with the highest posterior probability is t1 (γ) = (1, 2, 4, 5), which corresponds to the variables - altitude, - slope, - height of the tree sampled in the center of the area, and - diameter of the tree sampled in the center of the area. Corresponds to the ﬁve variables identiﬁed in the R regression output
• 90. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Noninformative extension For Zellner noninformative prior with π(c) = 1/c, we have ∞ c−1 (c + 1)−(qγ +1)/2 y T y− π(γ|y, X) ∝ c=1 −n/2 −1 cT T T y Xγ Xγ Xγ Xγ y . c+1
• 91. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars t1 (γ) π(γ|y, X) 0,1,2,4,5 0.0929 0,1,2,4,5,9 0.0325 0,1,2,4,5,10 0.0295 0,1,2,4,5,7 0.0231 0,1,2,4,5,8 0.0228 0,1,2,4,5,6 0.0228 0,1,2,3,4,5 0.0224 0,1,2,3,4,5,9 0.0167 0,1,2,4,5,6,9 0.0167 0,1,2,4,5,8,9 0.0137 0,1,4,5 0.0110 0,1,2,4,5,9,10 0.0100 0,1,2,3,9 0.0097 0,1,2,9 0.0093 0,1,2,4,5,7,9 0.0092 0,1,2,6,9 0.0092
• 92. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Stochastic search for the most likely model When k gets large, impossible to compute the posterior probabilities of the 2k models. Need of a tailored algorithm that samples from π(γ|y, X) and selects the most likely models. Can be done by Gibbs sampling, given the availability of the full conditional posterior probabilities of the γi ’s. If γ−i = (γ1 , . . . , γi−1 , γi+1 , . . . , γk ) (1 ≤ i ≤ k) π(γi |y, γ−i , X) ∝ π(γ|y, X) (to be evaluated in both γi = 0 and γi = 1)
• 93. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Stochastic search for the most likely model When k gets large, impossible to compute the posterior probabilities of the 2k models. Need of a tailored algorithm that samples from π(γ|y, X) and selects the most likely models. Can be done by Gibbs sampling, given the availability of the full conditional posterior probabilities of the γi ’s. If γ−i = (γ1 , . . . , γi−1 , γi+1 , . . . , γk ) (1 ≤ i ≤ k) π(γi |y, γ−i , X) ∝ π(γ|y, X) (to be evaluated in both γi = 0 and γi = 1)
• 94. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Gibbs sampling for variable selection Initialization: Draw γ 0 from the uniform distribution on Γ (t−1) (t−1) Iteration t: Given (γ1 , . . . , γk ), generate (t) (t−1) (t−1) 1. γ1 according to π(γ1 |y, γ2 , . . . , γk , X) (t) 2. γ2 according to (t) (t−1) (t−1) π(γ2 |y, γ1 , γ3 , . . . , γk , X) . . . (t) (t) (t) p. γk according to π(γk |y, γ1 , . . . , γk−1 , X)
• 95. On some computational methods for Bayesian model choice Cross-model solutions Variable selection MCMC interpretation After T 1 MCMC iterations, output used to approximate the posterior probabilities π(γ|y, X) by empirical averages T 1 π(γ|y, X) = Iγ (t) =γ . T − T0 + 1 t=T0 where the T0 ﬁrst values are eliminated as burnin. And approximation of the probability to include i-th variable, T 1 P π (γi = 1|y, X) = Iγ (t) =1 . T − T0 + 1 i t=T0
• 96. On some computational methods for Bayesian model choice Cross-model solutions Variable selection MCMC interpretation After T 1 MCMC iterations, output used to approximate the posterior probabilities π(γ|y, X) by empirical averages T 1 π(γ|y, X) = Iγ (t) =γ . T − T0 + 1 t=T0 where the T0 ﬁrst values are eliminated as burnin. And approximation of the probability to include i-th variable, T 1 P π (γi = 1|y, X) = Iγ (t) =1 . T − T0 + 1 i t=T0
• 97. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars P π (γi = 1|y, X) P π (γi = 1|y, X) γi γ1 0.8624 0.8844 γ2 0.7060 0.7716 γ3 0.1482 0.2978 γ4 0.6671 0.7261 γ5 0.6515 0.7006 γ6 0.1678 0.3115 γ7 0.1371 0.2880 γ8 0.1555 0.2876 γ9 0.4039 0.5168 γ10 0.1151 0.2609 ˜ Probabilities of inclusion with both informative (β = 011 , c = 100) and noninformative Zellner’s priors
• 98. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Reversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that K(x, dy)π(x)dx = K(y, dx)π(y)dy A B B A for the invariant density π [x is of the form (k, θ(k) )]
• 99. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Reversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that K(x, dy)π(x)dx = K(y, dx)π(y)dy A B B A for the invariant density π [x is of the form (k, θ(k) )]
• 100. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves For a move between two models, M1 and M2 , the Markov chain being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ) the corresponding kernels, under the detailed balance condition π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) , and take, wlog, dim(M2 ) > dim(M1 ). Proposal expressed as θ2 = Ψ1→2 (θ1 , v1→2 ) where v1→2 is a random variable of dimension dim(M2 ) − dim(M1 ), generated as v1→2 ∼ ϕ1→2 (v1→2 ) .
• 101. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves For a move between two models, M1 and M2 , the Markov chain being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ) the corresponding kernels, under the detailed balance condition π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) , and take, wlog, dim(M2 ) > dim(M1 ). Proposal expressed as θ2 = Ψ1→2 (θ1 , v1→2 ) where v1→2 is a random variable of dimension dim(M2 ) − dim(M1 ), generated as v1→2 ∼ ϕ1→2 (v1→2 ) .
• 102. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves (2) In this case, q1→2 (θ1 , dθ2 ) has density −1 ∂Ψ1→2 (θ1 , v1→2 ) ϕ1→2 (v1→2 ) , ∂(θ1 , v1→2 ) by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1 , acceptance probability reduces to π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1∧ . π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) c Diﬃcult calibration
• 103. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves (2) In this case, q1→2 (θ1 , dθ2 ) has density −1 ∂Ψ1→2 (θ1 , v1→2 ) ϕ1→2 (v1→2 ) , ∂(θ1 , v1→2 ) by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1 , acceptance probability reduces to π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1∧ . π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) c Diﬃcult calibration
• 104. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves (2) In this case, q1→2 (θ1 , dθ2 ) has density −1 ∂Ψ1→2 (θ1 , v1→2 ) ϕ1→2 (v1→2 ) , ∂(θ1 , v1→2 ) by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1 , acceptance probability reduces to π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1∧ . π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) c Diﬃcult calibration
• 105. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Interpretation The representation puts us back in a ﬁxed dimension setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1 , v1→2 ) = Ψ−1 (θ2 ) 1→2 appears like a regular Metropolis–Hastings move from the couple (θ1 , v1→2 ) to θ2 when stationary distributions are π(M1 , θ1 ) × ϕ1→2 (v1→2 ) and π(M2 , θ2 ), and when proposal distribution is deterministic (??)
• 106. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Interpretation The representation puts us back in a ﬁxed dimension setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1 , v1→2 ) = Ψ−1 (θ2 ) 1→2 appears like a regular Metropolis–Hastings move from the couple (θ1 , v1→2 ) to θ2 when stationary distributions are π(M1 , θ1 ) × ϕ1→2 (v1→2 ) and π(M2 , θ2 ), and when proposal distribution is deterministic (??)
• 107. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Pseudo-deterministic reasoning Consider the proposals θ2 ∼ N (Ψ1→2 (θ1 , v1→2 ), ε) Ψ1→2 (θ1 , v1→2 ) ∼ N (θ2 , ε) and Reciprocal proposal has density exp −(θ2 − Ψ1→2 (θ1 , v1→2 ))2 /2ε ∂Ψ1→2 (θ1 , v1→2 ) √ × ∂(θ1 , v1→2 ) 2πε by the Jacobian rule. Thus Metropolis–Hastings acceptance probability is π(M2 , θ2 ) ∂Ψ1→2 (θ1 , v1→2 ) 1∧ π(M1 , θ1 ) ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) Does not depend on ε: Let ε go to 0
• 108. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Pseudo-deterministic reasoning Consider the proposals θ2 ∼ N (Ψ1→2 (θ1 , v1→2 ), ε) Ψ1→2 (θ1 , v1→2 ) ∼ N (θ2 , ε) and Reciprocal proposal has density exp −(θ2 − Ψ1→2 (θ1 , v1→2 ))2 /2ε ∂Ψ1→2 (θ1 , v1→2 ) √ × ∂(θ1 , v1→2 ) 2πε by the Jacobian rule. Thus Metropolis–Hastings acceptance probability is π(M2 , θ2 ) ∂Ψ1→2 (θ1 , v1→2 ) 1∧ π(M1 , θ1 ) ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) Does not depend on ε: Let ε go to 0
• 109. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Generic reversible jump acceptance probability If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1 , as in ∞ XZ K(x, B) = ρm (x, y)qm (x, dy) + ω(x)IB (x) m=1 acceptance probability of θ2 = Ψ1→2 (θ1 , v1→2 ) is π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) while acceptance probability of θ1 with (θ1 , v1→2 ) = Ψ−1 (θ2 ) is 1→2 −1 π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M2 , θ2 ) 2→1 ∂(θ1 , v1→2 )
• 110. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Generic reversible jump acceptance probability If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1 , as in ∞ XZ K(x, B) = ρm (x, y)qm (x, dy) + ω(x)IB (x) m=1 acceptance probability of θ2 = Ψ1→2 (θ1 , v1→2 ) is π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) while acceptance probability of θ1 with (θ1 , v1→2 ) = Ψ−1 (θ2 ) is 1→2 −1 π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M2 , θ2 ) 2→1 ∂(θ1 , v1→2 )
• 111. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Generic reversible jump acceptance probability If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1 , as in ∞ XZ K(x, B) = ρm (x, y)qm (x, dy) + ω(x)IB (x) m=1 acceptance probability of θ2 = Ψ1→2 (θ1 , v1→2 ) is π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) while acceptance probability of θ1 with (θ1 , v1→2 ) = Ψ−1 (θ2 ) is 1→2 −1 π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M2 , θ2 ) 2→1 ∂(θ1 , v1→2 )
• 112. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Green’s sampler Algorithm Iteration t (t ≥ 1): if x(t) = (m, θ(m) ), Select model Mn with probability πmn 1 Generate umn ∼ ϕmn (u) and set 2 (θ(n) , vnm ) = Ψm→n (θ(m) , umn ) Take x(t+1) = (n, θ(n) ) with probability 3 π(n, θ(n) ) πnm ϕnm (vnm ) ∂Ψm→n (θ(m) , umn ) min ,1 π(m, θ(m) ) πmn ϕmn (umn ) ∂(θ(m) , umn ) and take x(t+1) = x(t) otherwise.
• 113. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture of normal distributions   k   2 pjk N (µjk , σjk ) Mk = (pjk , µjk , σjk );   j=1 Restrict moves from Mk to adjacent models, like Mk+1 and Mk−1 , with probabilities πk(k+1) and πk(k−1) .
• 114. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture of normal distributions   k   2 pjk N (µjk , σjk ) Mk = (pjk , µjk , σjk );   j=1 Restrict moves from Mk to adjacent models, like Mk+1 and Mk−1 , with probabilities πk(k+1) and πk(k−1) .
• 115. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture birth Take Ψk→k+1 as a birth step: i.e. add a new normal component in the mixture, by generating the parameters of the new component from the prior distribution (µk+1 , σk+1 ) ∼ π(µ, σ) and pk+1 ∼ Be(a1 , a2 + . . . + ak ) if (p1 , . . . , pk ) ∼ Mk (a1 , . . . , ak ) Jacobian is (1 − pk+1 )k−1 Death step then derived from the reversibility constraint by removing one of the k components at random.
• 116. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture birth Take Ψk→k+1 as a birth step: i.e. add a new normal component in the mixture, by generating the parameters of the new component from the prior distribution (µk+1 , σk+1 ) ∼ π(µ, σ) and pk+1 ∼ Be(a1 , a2 + . . . + ak ) if (p1 , . . . , pk ) ∼ Mk (a1 , . . . , ak ) Jacobian is (1 − pk+1 )k−1 Death step then derived from the reversibility constraint by removing one of the k components at random.
• 117. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture acceptance probability Birth acceptance probability π(k+1)k (k + 1)! π(k + 1, θk+1 ) min ,1 πk(k+1) (k + 1)k! π(k, θk ) (k + 1)ϕk(k+1) (uk(k+1) ) − pk+1 )k−1 π(k+1)k (k + 1) k+1 (θk+1 ) (1 = min ,1 , πk(k+1) (k) k (θk ) where k likelihood of the k component mixture model Mk and (k) prior probability of model Mk . Combinatorial terms: there are (k + 1)! ways of deﬁning a (k + 1) component mixture by adding one component, while, given a (k + 1) component mixture, there are (k + 1) choices for a component to die and then k! associated mixtures for the remaining components.
• 118. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture acceptance probability Birth acceptance probability π(k+1)k (k + 1)! π(k + 1, θk+1 ) min ,1 πk(k+1) (k + 1)k! π(k, θk ) (k + 1)ϕk(k+1) (uk(k+1) ) − pk+1 )k−1 π(k+1)k (k + 1) k+1 (θk+1 ) (1 = min ,1 , πk(k+1) (k) k (θk ) where k likelihood of the k component mixture model Mk and (k) prior probability of model Mk . Combinatorial terms: there are (k + 1)! ways of deﬁning a (k + 1) component mixture by adding one component, while, given a (k + 1) component mixture, there are (k + 1) choices for a component to die and then k! associated mixtures for the remaining components.
• 119. On some computational methods for Bayesian model choice Cross-model solutions Saturation schemes Alternative Saturation of the parameter space H = k {k} × Θk by creating θ = (θ1 , . . . , θD ) a model index M pseudo-priors πj (θj |M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P (M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk !] D π(θ|x) = P(θ, M = k|x) k=1 D pk Zk πk (θk |x) πj (θj |M = k) . = k=1 j=k
• 120. On some computational methods for Bayesian model choice Cross-model solutions Saturation schemes Alternative Saturation of the parameter space H = k {k} × Θk by creating θ = (θ1 , . . . , θD ) a model index M pseudo-priors πj (θj |M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P (M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk !] D π(θ|x) = P(θ, M = k|x) k=1 D pk Zk πk (θk |x) πj (θj |M = k) . = k=1 j=k
• 121. On some computational methods for Bayesian model choice Cross-model solutions Saturation schemes MCMC implementation (t) (t) Run a Markov chain (M (t) , θ1 , . . . , θD ) with stationary distribution π(θ, M |x) by Pick M (t) = k with probability π(θ(t−1) , k|x) 1 (t−1) from the posterior πk (θk |x) [or MCMC step] Generate θk 2 (t−1) (j = k) from the pseudo-prior πj (θj |M = k) Generate θj 3 Approximate P(M = k|x) = Zk by T (t) (t) (t) pk (x) ∝ pk πj (θj |M = k) ˇ fk (x|θk ) πk (θk ) t=1 j=k D (t) (t) (t) πj (θj |M = ) p f (x|θ ) π (θ ) =1 j=
• 122. On some computational methods for Bayesian model choice Cross-model solutions Saturation schemes MCMC implementation (t) (t) Run a Markov chain (M (t) , θ1 , . . . , θD ) with stationary distribution π(θ, M |x) by Pick M (t) = k with probability π(θ(t−1) , k|x) 1 (t−1) from the posterior πk (θk |x) [or MCMC step] Generate θk 2 (t−1) (j = k) from the pseudo-prior πj (θj |M = k) Generate θj 3 Approximate P(M = k|x) = Zk by T (t) (t) (t) pk (x) ∝ pk πj (θj |M = k) ˇ fk (x|θk ) πk (θk ) t=1 j=k D (t) (t) (t) πj (θj |M = ) p f (x|θ ) π (θ ) =1 j=
• 123. On some computational methods for Bayesian model choice Cross-model solutions Saturation schemes MCMC implementation (t) (t) Run a Markov chain (M (t) , θ1 , . . . , θD ) with stationary distribution π(θ, M |x) by Pick M (t) = k with probability π(θ(t−1) , k|x) 1 (t−1) from the posterior πk (θk |x) [or MCMC step] Generate θk 2 (t−1) (j = k) from the pseudo-prior πj (θj |M = k) Generate θj 3 Approximate P(M = k|x) = Zk by T (t) (t) (t) pk (x) ∝ pk πj (θj |M = k) ˇ fk (x|θk ) πk (θk ) t=1 j=k D (t) (t) (t) πj (θj |M = ) p f (x|θ ) π (θ ) =1 j=
• 124. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Scott’s (2002) proposal Suggest estimating P(M = k|x) by   T D  (t) (t) Zk ∝ pk f (x|θk ) pj fj (x|θj ) , k  t=1 j=1 based on D simultaneous and independent MCMC chains (t) 1 ≤ k ≤ D, (θk )t , with stationary distributions πk (θk |x) [instead of above joint!!]
• 125. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Scott’s (2002) proposal Suggest estimating P(M = k|x) by   T D  (t) (t) Zk ∝ pk f (x|θk ) pj fj (x|θj ) , k  t=1 j=1 based on D simultaneous and independent MCMC chains (t) 1 ≤ k ≤ D, (θk )t , with stationary distributions πk (θk |x) [instead of above joint!!]
• 126. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Congdon’s (2006) extension Selecting ﬂat [prohibited!] pseudo-priors, uses instead   T D  (t) (t) (t) (t) Zk ∝ pk fk (x|θk )πk (θk ) pj fj (x|θj )πj (θj ) ,   t=1 j=1 (t) where again the θk ’s are MCMC chains with stationary distributions πk (θk |x)
• 127. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Examples Example (Model choice) Model M1 : x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) is versus model M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on both models: 1 = 2 = 0.5. 1.0 0.8 Approximations of P(M = 1|x): 0.6 Scott’s (2002) (blue), and 0.4 Congdon’s (2006) (red) [N = 106 simulations]. 0.2 0.0 1 2 3 4 5 y
• 128. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Examples Example (Model choice) Model M1 : x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) is versus model M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on both models: 1 = 2 = 0.5. 1.0 0.8 Approximations of P(M = 1|x): 0.6 Scott’s (2002) (blue), and 0.4 Congdon’s (2006) (red) [N = 106 simulations]. 0.2 0.0 1 2 3 4 5 y
• 129. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Examples (2) Example (Model choice (2)) Normal model M1 : x ∼ N (θ, 1) with θ ∼ N (0, 1) vs. normal model M2 : x ∼ N (θ, 1) with θ ∼ N (5, 1) Comparison of both 1.0 approximations with 0.8 P(M = 1|x): Scott’s (2002) (green and mixed dashes) and 0.6 Congdon’s (2006) (brown and 0.4 long dashes) [N = 104 simulations]. 0.2 0.0 −1 0 1 2 3 4 5 6 y
• 130. On some computational methods for Bayesian model choice Cross-model solutions Implementation error Examples (3) Example (Model choice (3)) Model M1 : x ∼ N (0, 1/ω) with ω ∼ Exp(a) vs. M2 : exp(x) ∼ Exp(λ) with λ ∼ Exp(b). 1.0 1.0 0.8 0.8 Comparison of Congdon’s (2006) 0.6 0.6 (brown and dashed lines) with 0.4 0.4 0.2 0.2 P(M = 1|x) when (a, b) is equal 0.0 0.0 to (.24, 8.9), (.56, .7), (4.1, .46) −10 −5 0 5 10 −10 −5 0 5 10 y y and (.98, .081), resp. [N = 104 1.0 1.0 0.8 0.8 simulations]. 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 −10 −5 0 5 10 −10 −5 0 5 10 y y
• 131. On some computational methods for Bayesian model choice Nested sampling Purpose Nested sampling: Goal Skilling’s (2007) technique using the one-dimensional representation: 1 Z = Eπ [L(θ)] = ϕ(x) dx 0 with ϕ−1 (l) = P π (L(θ) > l). Note; ϕ(·) is intractable in most cases.
• 132. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: First approximation Approximate Z by a Riemann sum: j (xi−1 − xi )ϕ(xi ) Z= i=1 where the xi ’s are either: xi = e−i/N deterministic: or random: ti ∼ Be(N, 1) x0 = 0, xi+1 = ti xi , so that E[log xi ] = −i/N .
• 133. On some computational methods for Bayesian model choice Nested sampling Implementation Extraneous white noise Take 1 −(1−δ)θ −δθ 1 −(1−δ)θ e−θ dθ = Z= e e = Eδ e δ δ N 1 δ −1 e−(1−δ)θi (xi−1 − xi ) , ˆ θi ∼ E(δ) I(θi ≤ θi−1 ) Z= N i=1 N deterministic random 50 4.64 10.5 4.65 10.5 100 2.47 4.9 Comparison of variances and MSEs 2.48 5.02 500 .549 1.01 .550 1.14
• 134. On some computational methods for Bayesian model choice Nested sampling Implementation Extraneous white noise Take 1 −(1−δ)θ −δθ 1 −(1−δ)θ e−θ dθ = Z= e e = Eδ e δ δ N 1 δ −1 e−(1−δ)θi (xi−1 − xi ) , ˆ θi ∼ E(δ) I(θi ≤ θi−1 ) Z= N i=1 N deterministic random 50 4.64 10.5 4.65 10.5 100 2.47 4.9 Comparison of variances and MSEs 2.48 5.02 500 .549 1.01 .550 1.14
• 135. On some computational methods for Bayesian model choice Nested sampling Implementation Extraneous white noise Take 1 −(1−δ)θ −δθ 1 −(1−δ)θ e−θ dθ = Z= e e = Eδ e δ δ N 1 δ −1 e−(1−δ)θi (xi−1 − xi ) , ˆ θi ∼ E(δ) I(θi ≤ θi−1 ) Z= N i=1 N deterministic random 50 4.64 10.5 4.65 10.5 100 2.47 4.9 Comparison of variances and MSEs 2.48 5.02 500 .549 1.01 .550 1.14
• 136. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: Second approximation Replace (intractable) ϕ(xi ) by ϕi , obtained by Nested sampling Start with N values θ1 , . . . , θN sampled from π At iteration i, Take ϕi = L(θk ), where θk is the point with smallest 1 likelihood in the pool of θi ’s Replace θk with a sample from the prior constrained to 2 L(θ) > ϕi : the current N points are sampled from prior constrained to L(θ) > ϕi .
• 137. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: Second approximation Replace (intractable) ϕ(xi ) by ϕi , obtained by Nested sampling Start with N values θ1 , . . . , θN sampled from π At iteration i, Take ϕi = L(θk ), where θk is the point with smallest 1 likelihood in the pool of θi ’s Replace θk with a sample from the prior constrained to 2 L(θ) > ϕi : the current N points are sampled from prior constrained to L(θ) > ϕi .
• 138. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: Second approximation Replace (intractable) ϕ(xi ) by ϕi , obtained by Nested sampling Start with N values θ1 , . . . , θN sampled from π At iteration i, Take ϕi = L(θk ), where θk is the point with smallest 1 likelihood in the pool of θi ’s Replace θk with a sample from the prior constrained to 2 L(θ) > ϕi : the current N points are sampled from prior constrained to L(θ) > ϕi .
• 139. On some computational methods for Bayesian model choice Nested sampling Implementation Nested sampling: Third approximation Iterate the above steps until a given stopping iteration j is reached: e.g., observe very small changes in the approximation Z; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; truncate the integral Z at level , i.e. replace 1 1 ϕ(x) dx with ϕ(x) dx 0
• 140. On some computational methods for Bayesian model choice Nested sampling Error rates Approximation error Error = Z − Z j 1 (xi−1 − xi )ϕi − ϕ(x) dx = − = ϕ(x) dx 0 0 i=1 j 1 (xi−1 − xi )ϕ(xi ) − + ϕ(x) dx (Quadrature Error) i=1 j (xi−1 − xi ) {ϕi − ϕ(xi )} + (Stochastic Error) i=1 [Dominated by Monte Carlo!]
• 141. On some computational methods for Bayesian model choice Nested sampling Error rates A CLT for the Stochastic Error The (dominating) stochastic error is OP (N −1/2 ): D N 1/2 {Stochastic Error} → N (0, V ) with V =− sϕ (s)tϕ (t) log(s ∨ t) ds dt. s,t∈[ ,1] [Proof based on Donsker’s theorem] The number of simulated points equals the number of iterations j, and is a multiple of N : if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .
• 142. On some computational methods for Bayesian model choice Nested sampling Error rates A CLT for the Stochastic Error The (dominating) stochastic error is OP (N −1/2 ): D N 1/2 {Stochastic Error} → N (0, V ) with V =− sϕ (s)tϕ (t) log(s ∨ t) ds dt. s,t∈[ ,1] [Proof based on Donsker’s theorem] The number of simulated points equals the number of iterations j, and is a multiple of N : if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .
• 143. On some computational methods for Bayesian model choice Nested sampling Impact of dimension Curse of dimension For a simple Gaussian-Gaussian model of dimension dim(θ) = d, the following 3 quantities are O(d): asymptotic variance of the NS estimator; 1 number of iterations (necessary to reach a given truncation 2 error); cost of one simulated sample. 3 Therefore, CPU time necessary for achieving error level e is O(d3 /e2 )
• 144. On some computational methods for Bayesian model choice Nested sampling Impact of dimension Curse of dimension For a simple Gaussian-Gaussian model of dimension dim(θ) = d, the following 3 quantities are O(d): asymptotic variance of the NS estimator; 1 number of iterations (necessary to reach a given truncation 2 error); cost of one simulated sample. 3 Therefore, CPU time necessary for achieving error level e is O(d3 /e2 )
• 145. On some computational methods for Bayesian model choice Nested sampling Impact of dimension Curse of dimension For a simple Gaussian-Gaussian model of dimension dim(θ) = d, the following 3 quantities are O(d): asymptotic variance of the NS estimator; 1 number of iterations (necessary to reach a given truncation 2 error); cost of one simulated sample. 3 Therefore, CPU time necessary for achieving error level e is O(d3 /e2 )
• 146. On some computational methods for Bayesian model choice Nested sampling Impact of dimension Curse of dimension For a simple Gaussian-Gaussian model of dimension dim(θ) = d, the following 3 quantities are O(d): asymptotic variance of the NS estimator; 1 number of iterations (necessary to reach a given truncation 2 error); cost of one simulated sample. 3 Therefore, CPU time necessary for achieving error level e is O(d3 /e2 )
• 147. On some computational methods for Bayesian model choice Nested sampling Constraints Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases! Skilling (2007) proposes to use MCMC, but: this introduces a bias (stopping rule). if MCMC stationary distribution is unconst’d prior, more and more diﬃcult to sample points such that L(θ) > l as l increases. If implementable, then slice sampler can be devised at the same cost!
• 148. On some computational methods for Bayesian model choice Nested sampling Constraints Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases! Skilling (2007) proposes to use MCMC, but: this introduces a bias (stopping rule). if MCMC stationary distribution is unconst’d prior, more and more diﬃcult to sample points such that L(θ) > l as l increases. If implementable, then slice sampler can be devised at the same cost!
• 149. On some computational methods for Bayesian model choice Nested sampling Constraints Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases! Skilling (2007) proposes to use MCMC, but: this introduces a bias (stopping rule). if MCMC stationary distribution is unconst’d prior, more and more diﬃcult to sample points such that L(θ) > l as l increases. If implementable, then slice sampler can be devised at the same cost!
• 150. On some computational methods for Bayesian model choice Nested sampling Constraints Illustration of MCMC bias N=100, M=1 N=100, M=3 N=100, M=5 10 4 4 0 2 2 -10 -20 0 0 -30 -2 -2 -40 -4 -4 -50 10 20 30 40 50 60 70 80 90100 10 20 30 40 50 60 70 80 90100 10 20 30 40 50 60 70 80 90100 N=500, M=1 N=100, M=5 10 80000 70000 5 60000 50000 0 40000 30000 -5 20000 10000 -10 00 10 20 30 40 50 60 70 80 90100 20 40 60 80 100 Log-relative error against d (left), avg. number of iterations (right) vs dimension d, for a Gaussian-Gaussian model with d parameters, when using T = 10 iterations of the Gibbs sampler.
• 151. On some computational methods for Bayesian model choice Nested sampling Importance variant A IS variant of nested sampling ˜ Consider instrumental prior π and likelihood L, weight function π(θ)L(θ) w(θ) = π(θ)L(θ) and weighted NS estimator j (xi−1 − xi )ϕi w(θi ). Z= i=1 Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N (c, Id ) constrained to c − θ < r.
• 152. On some computational methods for Bayesian model choice Nested sampling Importance variant A IS variant of nested sampling ˜ Consider instrumental prior π and likelihood L, weight function π(θ)L(θ) w(θ) = π(θ)L(θ) and weighted NS estimator j (xi−1 − xi )ϕi w(θi ). Z= i=1 Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N (c, Id ) constrained to c − θ < r.
• 153. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Benchmark: Target distribution Posterior distribution on (µ, σ) associated with the mixture pN (0, 1) + (1 − p)N (µ, σ) , when p is known
• 154. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment n observations with µ = 2 and σ = 3/2, Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ 2 . occurrences of posterior bursts for µ = xi computation of the various estimates of Z
• 155. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment (cont’d) Nested sampling sequence MCMC sample for n = 16 with M = 1000 starting points. observations from the mixture.
• 156. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment (cont’d) Nested sampling sequence MCMC sample for n = 50 with M = 1000 starting points. observations from the mixture.
• 157. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison Monte Carlo and MCMC (=Gibbs) outputs based on T = 104 simulations and numerical integration based on a 850 × 950 grid in the (µ, σ) parameter space. Nested sampling approximation based on a starting sample of M = 1000 points followed by at least 103 further simulations from the constr’d prior and a stopping rule at 95% of the observed maximum likelihood. Constr’d prior simulation based on 50 values simulated by random walk accepting only steps leading to a lik’hood higher than the bound
• 158. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont’d) q q q q q q 1.15 q q q q q q q q q q q 1.10 q q q q q q q q q q q q q q q q q q q 1.05 q 1.00 0.95 q q q q q q q q q q q q q q 0.90 q 0.85 q q q q V1 V2 V3 V4 q q Graph based on a sample of 10 observations for µ = 2 and σ = 3/2 (150 replicas).
• 159. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont’d) 1.10 1.05 q q q 1.00 q 0.95 q q q q 0.90 V1 V2 V3 V4 Graph based on a sample of 50 observations for µ = 2 and σ = 3/2 (150 replicas).
• 160. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont’d) 1.15 1.10 1.05 q q q q 1.00 q q q q 0.95 q 0.90 0.85 V1 V2 V3 V4 Graph based on a sample of 100 observations for µ = 2 and σ = 3/2 (150 replicas).
• 161. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont’d) Nested sampling gets less reliable as sample size increases Most reliable approach is mixture Z3 although harmonic solution Z1 close to Chib’s solution [taken as golden standard] Monte Carlo method Z2 also producing poor approximations to Z (Kernel φ used in Z2 is a t non-parametric kernel estimate with standard bandwidth estimation.)
• 162. On some computational methods for Bayesian model choice ABC model choice ABC method Approximate Bayesian Computation Bayesian setting: target is π(θ)f (x|θ) When likelihood f (x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f (x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
• 163. On some computational methods for Bayesian model choice ABC model choice ABC method Approximate Bayesian Computation Bayesian setting: target is π(θ)f (x|θ) When likelihood f (x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f (x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
• 164. On some computational methods for Bayesian model choice ABC model choice ABC method Approximate Bayesian Computation Bayesian setting: target is π(θ)f (x|θ) When likelihood f (x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f (x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
• 165. On some computational methods for Bayesian model choice ABC model choice ABC method Population genetics example Tree of ancestors in a sample of genes
• 166. On some computational methods for Bayesian model choice ABC model choice ABC method A as approximative When y is a continuous random variable, equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ { (x, y) < } ∝ π(θ| (x, y) < )
• 167. On some computational methods for Bayesian model choice ABC model choice ABC method A as approximative When y is a continuous random variable, equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ { (x, y) < } ∝ π(θ| (x, y) < )
• 168. On some computational methods for Bayesian model choice ABC model choice ABC method ABC improvements Simulating from the prior is often poor in eﬃciency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002]
• 169. On some computational methods for Bayesian model choice ABC model choice ABC method ABC improvements Simulating from the prior is often poor in eﬃciency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002]
• 170. On some computational methods for Bayesian model choice ABC model choice ABC method ABC improvements Simulating from the prior is often poor in eﬃciency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002]
• 171. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-MCMC Markov chain (θ(t) ) created via the transition function  θ ∼ K(θ |θ(t) ) if x ∼ f (x|θ ) is such that x = y   π(θ )K(θ(t) |θ ) (t+1) and u ∼ U(0, 1) ≤ π(θ(t) )K(θ |θ(t) ) , θ =   (t) θ otherwise, has the posterior π(θ|y) as stationary distribution [Marjoram et al, 2003]
• 172. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-MCMC Markov chain (θ(t) ) created via the transition function  θ ∼ K(θ |θ(t) ) if x ∼ f (x|θ ) is such that x = y   π(θ )K(θ(t) |θ ) (t+1) and u ∼ U(0, 1) ≤ π(θ(t) )K(θ |θ(t) ) , θ =   (t) θ otherwise, has the posterior π(θ|y) as stationary distribution [Marjoram et al, 2003]
• 173. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC Another sequential version producing a sequence of Markov (t) (t) transition kernels Kt and of samples (θ1 , . . . , θN ) (1 ≤ t ≤ T ) ABC-PRC Algorithm (t−1) Pick a θ is selected at random among the previous θi ’s 1 (t−1) (1 ≤ i ≤ N ). with probabilities ωi Generate 2 (t) (t) θi ∼ Kt (θ|θ ) , x ∼ f (x|θi ) , Check that (x, y) < , otherwise start again. 3 [Sisson et al., 2007]
• 174. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC Another sequential version producing a sequence of Markov (t) (t) transition kernels Kt and of samples (θ1 , . . . , θN ) (1 ≤ t ≤ T ) ABC-PRC Algorithm (t−1) Pick a θ is selected at random among the previous θi ’s 1 (t−1) (1 ≤ i ≤ N ). with probabilities ωi Generate 2 (t) (t) θi ∼ Kt (θ|θ ) , x ∼ f (x|θi ) , Check that (x, y) < , otherwise start again. 3 [Sisson et al., 2007]
• 175. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC weight (t) Probability ωi computed as (t) (t) (t) (t) ωi ∝ π(θi )Lt−1 (θ |θi ){π(θ )Kt (θi |θ )}−1 , where Lt−1 is an arbitrary transition kernel. In case Lt−1 (θ |θ) = Kt (θ|θ ) , all weights are equal under a uniform prior. Inspired from Del Moral et al. (2006), who use backward kernels Lt−1 in SMC to achieve unbiasedness
• 176. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC weight (t) Probability ωi computed as (t) (t) (t) (t) ωi ∝ π(θi )Lt−1 (θ |θi ){π(θ )Kt (θi |θ )}−1 , where Lt−1 is an arbitrary transition kernel. In case Lt−1 (θ |θ) = Kt (θ|θ ) , all weights are equal under a uniform prior. Inspired from Del Moral et al. (2006), who use backward kernels Lt−1 in SMC to achieve unbiasedness
• 177. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC weight (t) Probability ωi computed as (t) (t) (t) (t) ωi ∝ π(θi )Lt−1 (θ |θi ){π(θ )Kt (θi |θ )}−1 , where Lt−1 is an arbitrary transition kernel. In case Lt−1 (θ |θ) = Kt (θ|θ ) , all weights are equal under a uniform prior. Inspired from Del Moral et al. (2006), who use backward kernels Lt−1 in SMC to achieve unbiasedness
• 178. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC bias Lack of unbiasedness of the method Joint density of the accepted pair (θ(t−1) , θ(t) ) proportional to π(θ(t−1) |y)Kt (θ(t) |θ(t−1) )f (y|θ(t) ) , For an arbitrary function h(θ), E[ωt h(θ(t) )] proportional to π(θ (t) )Lt−1 (θ (t−1) |θ (t) ) ZZ (t) (t−1) (t) (t−1) (t) (t−1) (t) |y)Kt (θ |θ h(θ ) π(θ )f (y|θ )dθ dθ π(θ (t−1) )Kt (θ (t) |θ (t−1) ) π(θ (t) )Lt−1 (θ (t−1) |θ (t) ) ZZ (t) (t−1) (t−1) ∝ h(θ ) π(θ )f (y|θ ) π(θ (t−1) )Kt (θ (t) |θ (t−1) ) (t) (t−1) (t) (t−1) (t) × Kt (θ |θ )f (y|θ )dθ dθ Z ﬀ Z (t) (t) (t−1) (t) (t−1) (t−1) (t) ∝ |y) |θ )f (y|θ h(θ )π(θ Lt−1 (θ )dθ dθ .
• 179. On some computational methods for Bayesian model choice ABC model choice ABC method ABC-PRC bias Lack of unbiasedness of the method Joint density of the accepted pair (θ(t−1) , θ(t) ) proportional to π(θ(t−1) |y)Kt (θ(t) |θ(t−1) )f (y|θ(t) ) , For an arbitrary function h(θ), E[ωt h(θ(t) )] proportional to π(θ (t) )Lt−1 (θ (t−1) |θ (t) ) ZZ (t) (t−1) (t) (t−1) (t) (t−1) (t) |y)Kt (θ |θ h(θ ) π(θ )f (y|θ )dθ dθ π(θ (t−1) )Kt (θ (t) |θ (t−1) ) π(θ (t) )Lt−1 (θ (t−1) |θ (t) ) ZZ (t) (t−1) (t−1) ∝ h(θ ) π(θ )f (y|θ ) π(θ (t−1) )Kt (θ (t) |θ (t−1) ) (t) (t−1) (t) (t−1) (t) × Kt (θ |θ )f (y|θ )dθ dθ Z ﬀ Z (t) (t) (t−1) (t) (t−1) (t−1) (t) ∝ |y) |θ )f (y|θ h(θ )π(θ Lt−1 (θ )dθ dθ .
• 180. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 −3 −3 −1 −1 θ θ 1 1 3 3 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 −3 −3 −1 −1 θ θ 1 1 3 3 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 −3 −3 −1 −1 θ θ 1 1 3 3 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 −3 −3 −1 −1 θ θ 1 1 3 3 A mixture example ABC method ABC model choice On some computational methods for Bayesian model choice
• 181. On some computational methods for Bayesian model choice ABC model choice ABC-PMC A PMC version Use of the same kernel idea as ABC-PRC but with IS correction Generate a sample at iteration t by N (t−1) (t−1) πt (θ(t) ) ∝ Kt (θ(t) |θj ˆ ωj ) j=1 modulo acceptance of the associated xt , and use an importance (t) weight associated with an accepted simulation θi (t) (t) (t) ωi ∝ π(θi ) πt (θi ) . ˆ c Still likelihood free [Beaumont et al., 2008, arXiv:0805.2256]
• 182. On some computational methods for Bayesian model choice ABC model choice ABC-PMC The ABC-PMC algorithm ≥ ... ≥ Given a decreasing sequence of approximation levels T, 1 1. At iteration t = 1, For i = 1, ..., N (1) (1) Simulate θi ∼ π(θ) and x ∼ f (x|θi ) until (x, y) < 1 (1) Set ωi = 1/N (1) Take τ 2 as twice the empirical variance of the θi ’s 2. At iteration 2 ≤ t ≤ T , For i = 1, ..., N , repeat (t−1) (t−1) Pick θi from the θj ’s with probabilities ωj (t) (t) 2 generate θi |θi ∼ N (θi , σt ) and x ∼ f (x|θi ) until (x, y) < t (t) (t) (t−1) (t) (t−1) N −1 ∝ π(θi )/ θi − θj Set ωi ωj ϕ σt ) j=1 (t) 2 Take τt+1 as twice the weighted empirical variance of the θi ’s
• 183. On some computational methods for Bayesian model choice ABC model choice ABC-PMC A mixture example (0) Toy model of Sisson et al. (2007): if θ ∼ U(−10, 10) , x|θ ∼ 0.5 N (θ, 1) + 0.5 N (θ, 1/100) , then the posterior distribution associated with y = 0 is the normal mixture θ|y = 0 ∼ 0.5 N (0, 1) + 0.5 N (0, 1/100) restricted to [−10, 10]. Furthermore, true target available as π(θ||x| < ) ∝ Φ( −θ)−Φ(− −θ)+Φ(10( −θ))−Φ(−10( +θ)) .
• 184. On some computational methods for Bayesian model choice ABC model choice ABC-PMC A mixture example (2) Recovery of the target, whether using a ﬁxed standard deviation of τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt ’s.
• 185. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Gibbs random ﬁelds Gibbs distribution The rv y = (y1 , . . . , yn ) is a Gibbs random ﬁeld associated with the graph G if 1 exp − f (y) = Vc (yc ) , Z c∈C where Z is the normalising constant, C is the set of cliques of G and Vc is any function also called potential U (y) = c∈C Vc (yc ) is the energy function c Z is usually unavailable in closed form
• 186. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Gibbs random ﬁelds Gibbs distribution The rv y = (y1 , . . . , yn ) is a Gibbs random ﬁeld associated with the graph G if 1 exp − f (y) = Vc (yc ) , Z c∈C where Z is the normalising constant, C is the set of cliques of G and Vc is any function also called potential U (y) = c∈C Vc (yc ) is the energy function c Z is usually unavailable in closed form
• 187. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Potts model Potts model Vc (y) is of the form Vc (y) = θS(y) = θ δyl =yi l∼i where l∼i denotes a neighbourhood structure In most realistic settings, summation exp{θT S(x)} Zθ = x∈X involves too many terms to be manageable and numerical approximations cannot always be trusted [Cucala, Marin, CPR & Titterington, 2009]
• 188. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Potts model Potts model Vc (y) is of the form Vc (y) = θS(y) = θ δyl =yi l∼i where l∼i denotes a neighbourhood structure In most realistic settings, summation exp{θT S(x)} Zθ = x∈X involves too many terms to be manageable and numerical approximations cannot always be trusted [Cucala, Marin, CPR & Titterington, 2009]
• 189. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Bayesian Model Choice Comparing a model with potential S0 taking values in Rp0 versus a model with potential S1 taking values in Rp1 can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space T exp{θ0 S0 (x)}/Zθ0 ,0 π0 (dθ0 ) Bm0 /m1 (x) = T exp{θ1 S1 (x)}/Zθ1 ,1 π1 (dθ1 ) Use of Jeﬀreys’ scale to select most appropriate model
• 190. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Bayesian Model Choice Comparing a model with potential S0 taking values in Rp0 versus a model with potential S1 taking values in Rp1 can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space T exp{θ0 S0 (x)}/Zθ0 ,0 π0 (dθ0 ) Bm0 /m1 (x) = T exp{θ1 S1 (x)}/Zθ1 ,1 π1 (dθ1 ) Use of Jeﬀreys’ scale to select most appropriate model
• 191. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Neighbourhood relations Choice to be made between M neighbourhood relations m i∼i (0 ≤ m ≤ M − 1) with Sm (x) = I{xi =xi } m i∼i driven by the posterior probabilities of the models.
• 192. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Model index Formalisation via a model index M that appears as a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm (θm ) Computational target: P(M = m|x) ∝ fm (x|θm )πm (θm ) dθm π(M = m) , Θm
• 193. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Model index Formalisation via a model index M that appears as a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm (θm ) Computational target: P(M = m|x) ∝ fm (x|θm )πm (θm ) dθm π(M = m) , Θm
• 194. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Suﬃcient statistics By deﬁnition, if S(x) suﬃcient statistic for the joint parameters (M, θ0 , . . . , θM −1 ), P(M = m|x) = P(M = m|S(x)) . For each model m, own suﬃcient statistic Sm (·) and S(·) = (S0 (·), . . . , SM −1 (·)) also suﬃcient. For Gibbs random ﬁelds, 1 2 x|M = m ∼ fm (x|θm ) = fm (x|S(x))fm (S(x)|θm ) 1 f 2 (S(x)|θm ) = n(S(x)) m where n(S(x)) = {˜ ∈ X : S(˜ ) = S(x)} x x c S(x) is therefore also suﬃcient for the joint parameters [Speciﬁc to Gibbs random ﬁelds!]
• 195. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Suﬃcient statistics By deﬁnition, if S(x) suﬃcient statistic for the joint parameters (M, θ0 , . . . , θM −1 ), P(M = m|x) = P(M = m|S(x)) . For each model m, own suﬃcient statistic Sm (·) and S(·) = (S0 (·), . . . , SM −1 (·)) also suﬃcient. For Gibbs random ﬁelds, 1 2 x|M = m ∼ fm (x|θm ) = fm (x|S(x))fm (S(x)|θm ) 1 f 2 (S(x)|θm ) = n(S(x)) m where n(S(x)) = {˜ ∈ X : S(˜ ) = S(x)} x x c S(x) is therefore also suﬃcient for the joint parameters [Speciﬁc to Gibbs random ﬁelds!]
• 196. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs Suﬃcient statistics By deﬁnition, if S(x) suﬃcient statistic for the joint parameters (M, θ0 , . . . , θM −1 ), P(M = m|x) = P(M = m|S(x)) . For each model m, own suﬃcient statistic Sm (·) and S(·) = (S0 (·), . . . , SM −1 (·)) also suﬃcient. For Gibbs random ﬁelds, 1 2 x|M = m ∼ fm (x|θm ) = fm (x|S(x))fm (S(x)|θm ) 1 f 2 (S(x)|θm ) = n(S(x)) m where n(S(x)) = {˜ ∈ X : S(˜ ) = S(x)} x x c S(x) is therefore also suﬃcient for the joint parameters [Speciﬁc to Gibbs random ﬁelds!]
• 197. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs ABC model choice Algorithm ABC-MC Generate m∗ from the prior π(M = m). ∗ Generate θm∗ from the prior πm∗ (·). Generate x∗ from the model fm∗ (·|θm∗ ). ∗ Compute the distance ρ(S(x0 ), S(x∗ )). Accept (θm∗ , m∗ ) if ρ(S(x0 ), S(x∗ )) < . ∗ [Cornuet, Grelaud, Marin & Robert, 2008] Note When = 0 the algorithm is exact
• 198. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs ABC approximation to the Bayes factor Frequency ratio: ˆ P(M = m0 |x0 ) π(M = m1 ) BF m0 /m1 (x0 ) = × ˆ P(M = m1 |x0 ) π(M = m0 ) {mi∗ = m0 } π(M = m1 ) × = , {mi∗ = m1 } π(M = m0 ) replaced with 1 + {mi∗ = m0 } π(M = m1 ) BF m0 /m1 (x0 ) = × 1 + {mi∗ = m1 } π(M = m0 ) to avoid indeterminacy (also Bayes estimate).
• 199. On some computational methods for Bayesian model choice ABC model choice ABC for model choice in GRFs ABC approximation to the Bayes factor Frequency ratio: ˆ P(M = m0 |x0 ) π(M = m1 ) BF m0 /m1 (x0 ) = × ˆ P(M = m1 |x0 ) π(M = m0 ) {mi∗ = m0 } π(M = m1 ) × = , {mi∗ = m1 } π(M = m0 ) replaced with 1 + {mi∗ = m0 } π(M = m1 ) BF m0 /m1 (x0 ) = × 1 + {mi∗ = m1 } π(M = m0 ) to avoid indeterminacy (also Bayes estimate).
• 200. On some computational methods for Bayesian model choice ABC model choice Illustrations Toy example iid Bernoulli model versus two-state ﬁrst-order Markov chain, i.e. n {1 + exp(θ0 )}n , f0 (x|θ0 ) = exp θ0 I{xi =1} i=1 versus n 1 {1 + exp(θ1 )}n−1 , f1 (x|θ1 ) = exp θ1 I{xi =xi−1 } 2 i=2 with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
• 201. On some computational methods for Bayesian model choice ABC model choice Illustrations Toy example (2) 10 5 5 BF01 BF01 0 ^ ^ 0 −5 −5 −10 −40 −20 0 10 −40 −20 0 10 BF01 BF01 (left) Comparison of the true BF m0 /m1 (x0 ) with BF m0 /m1 (x0 ) (in logs) over 2, 000 simulations and 4.106 proposals from the prior. (right) Same when using tolerance corresponding to the 1% quantile on the distances.
• 202. On some computational methods for Bayesian model choice ABC model choice Illustrations Protein folding Superposition of the native structure (grey) with the ST1 structure (red.), the ST2 structure (orange), the ST3 structure (green), and the DT structure (blue).
• 203. On some computational methods for Bayesian model choice ABC model choice Illustrations Protein folding (2) % seq . Id. TM-score FROST score 1i5nA (ST1) 32 0.86 75.3 1ls1A1 (ST2) 5 0.42 8.9 1jr8A (ST3) 4 0.24 8.9 1s7oA (DT) 10 0.08 7.8 Characteristics of dataset. % seq. Id.: percentage of identity with the query sequence. TM-score.: similarity between predicted and native structure (uncertainty between 0.17 and 0.4) FROST score: quality of alignment of the query onto the candidate structure (uncertainty between 7 and 9).
• 204. On some computational methods for Bayesian model choice ABC model choice Illustrations Protein folding (3) NS/ST1 NS/ST2 NS/ST3 NS/DT BF 1.34 1.22 2.42 2.76 P(M = NS|x0 ) 0.573 0.551 0.708 0.734 Estimates of the Bayes factors between model NS and models ST1, ST2, ST3, and DT, and corresponding posterior probabilities of model NS based on an ABC-MC algorithm using 1.2 106 simulations and a tolerance equal to the 1% quantile of the distances.