MCMC and likelihood-free methods

Evolutionary slides for my course in Wharton, UPenn, October 25 / November 1, 2010

    MCMC and Likelihood-free Methods: Presentation Transcript

    • MCMC and Likelihood-free Methods
      Christian P. Robert, Université Paris-Dauphine & CREST
      http://www.ceremade.dauphine.fr/~xian
      November 2, 2010
    • MCMC and Likelihood-free Methods: Outline
      Computational issues in Bayesian statistics
      The Metropolis-Hastings Algorithm
      The Gibbs Sampler
      Population Monte Carlo
      Approximate Bayesian computation
      ABC for model choice
    • Computational issues in Bayesian statistics / Latent variables: Latent structures make life harder!
      Even simple models may lead to computational complications, as in latent variable models
        $f(x\mid\theta) = \int f(x, x^\star\mid\theta)\, dx^\star$
      If $(x, x^\star)$ observed, fine! If only $x$ observed, trouble!
    • Computational issues in Bayesian statistics / Latent variables: Example (Mixture models)
      Models of mixtures of distributions: $X \sim f_j$ with probability $p_j$, for $j = 1, 2, \ldots, k$, with overall density
        $X \sim p_1 f_1(x) + \cdots + p_k f_k(x)$.
      For a sample of independent random variables $(X_1, \ldots, X_n)$, sample density
        $\prod_{i=1}^n \{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\}$.
      Expanding this product of sums into a sum of products involves $k^n$ elementary terms: too prohibitive to compute in large samples.
    • Computational issues in Bayesian statistics / Latent variables: Simple mixture (1)
      [Figure: log-likelihood surface in $(\mu_1, \mu_2)$; case of the $0.3\,\mathcal{N}(\mu_1, 1) + 0.7\,\mathcal{N}(\mu_2, 1)$ likelihood]
    • Computational issues in Bayesian statistics / Latent variables: Simple mixture (2)
      For the mixture of two normal distributions,
        $0.3\,\mathcal{N}(\mu_1, 1) + 0.7\,\mathcal{N}(\mu_2, 1)$,
      the likelihood is proportional to
        $\prod_{i=1}^n \left[0.3\,\varphi(x_i - \mu_1) + 0.7\,\varphi(x_i - \mu_2)\right]$
      containing $2^n$ terms.
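The $2^n$-term blow-up concerns the expansion of the likelihood over component allocations, not the evaluation of the product itself, which is $O(n)$. A small Python sketch (all numerical values illustrative) checks that the two forms agree on a tiny sample:

```python
import itertools
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def mixture_lik(x, mu1, mu2):
    """Direct evaluation of the 0.3 N(mu1,1) + 0.7 N(mu2,1) likelihood: O(n)."""
    lik = 1.0
    for xi in x:
        lik *= 0.3 * phi(xi - mu1) + 0.7 * phi(xi - mu2)
    return lik

def mixture_lik_expanded(x, mu1, mu2):
    """Same likelihood expanded over all 2^n component allocations: O(2^n)."""
    total = 0.0
    for alloc in itertools.product([0, 1], repeat=len(x)):
        term = 1.0
        for xi, a in zip(x, alloc):
            term *= 0.3 * phi(xi - mu1) if a == 0 else 0.7 * phi(xi - mu2)
        total += term
    return total
```

The expanded version already enumerates 16 terms for $n = 4$ and becomes hopeless long before the direct product does.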
    • Computational issues in Bayesian statistics / Latent variables: Complex maximisation
      Standard maximization techniques often fail to find the global maximum because of multimodality or undesirable behavior (usually at the frontier of the domain) of the likelihood function.
      Example: in the special case
        $f(x\mid\mu, \sigma) = (1 - \epsilon)\exp\{(-1/2)x^2\} + \dfrac{\epsilon}{\sigma}\exp\{(-1/2\sigma^2)(x - \mu)^2\}$
      with $\epsilon > 0$ known, whatever $n$, the likelihood is unbounded:
        $\lim_{\sigma \to 0} L(x_1, \ldots, x_n \mid \mu = x_1, \sigma) = \infty$
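This unboundedness is easy to reproduce numerically. The sketch below (with an illustrative $\epsilon = 0.1$ and made-up data, not taken from the slides) evaluates the log-likelihood at $\mu = x_1$ for shrinking $\sigma$ and watches it diverge:

```python
import math

def loglik(x, mu, sigma, eps=0.1):
    """Log-likelihood of the contaminated-normal model
    (1-eps) N(0,1) + eps N(mu, sigma^2), up to the constant -n/2 log(2 pi);
    eps fixed for illustration."""
    ll = 0.0
    for xi in x:
        ll += math.log(
            (1 - eps) * math.exp(-0.5 * xi**2)
            + (eps / sigma) * math.exp(-0.5 * (xi - mu) ** 2 / sigma**2)
        )
    return ll

x = [1.3, -0.4, 0.2, 0.6]          # illustrative sample
# Setting mu = x[0] and letting sigma -> 0 sends the likelihood to infinity
vals = [loglik(x, mu=x[0], sigma=s) for s in (1.0, 0.1, 0.01, 0.001)]
```

Each division of $\sigma$ by 10 adds roughly $\log 10$ to the log-likelihood through the spike at $x_1$.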
    • Computational issues in Bayesian statistics / Latent variables: Unbounded likelihood
      [Figure: likelihood surfaces in $(\mu, \sigma)$ for $n = 3, 6, 12, 24, 48, 96$; case of the $0.3\,\mathcal{N}(0, 1) + 0.7\,\mathcal{N}(\mu, \sigma)$ likelihood]
    • Computational issues in Bayesian statistics / Latent variables: Example (Mixture once again)
      Observations
        $x_1, \ldots, x_n \sim f(x\mid\theta) = p\,\varphi(x; \mu_1, \sigma_1) + (1 - p)\,\varphi(x; \mu_2, \sigma_2)$
      Prior
        $\mu_i \mid \sigma_i \sim \mathcal{N}(\xi_i, \sigma_i^2/n_i)$, $\sigma_i^2 \sim \mathcal{IG}(\nu_i/2, s_i^2/2)$, $p \sim \mathcal{B}e(\alpha, \beta)$
      Posterior
        $\pi(\theta\mid x_1, \ldots, x_n) \propto \prod_{j=1}^n \{p\,\varphi(x_j; \mu_1, \sigma_1) + (1 - p)\,\varphi(x_j; \mu_2, \sigma_2)\}\,\pi(\theta) = \sum_{\ell=0}^n \sum_{(k_t)} \omega(k_t)\,\pi(\theta\mid(k_t))$
    • Computational issues in Bayesian statistics / Latent variables: Example (Mixture once again, cont'd)
      For a given permutation $(k_t)$, conditional posterior distribution
        $\pi(\theta\mid(k_t)) = \mathcal{N}\!\left(\xi_1(k_t), \frac{\sigma_1^2}{n_1 + \ell}\right) \times \mathcal{IG}\!\left((\nu_1 + \ell)/2, s_1(k_t)/2\right) \times \mathcal{N}\!\left(\xi_2(k_t), \frac{\sigma_2^2}{n_2 + n - \ell}\right) \times \mathcal{IG}\!\left((\nu_2 + n - \ell)/2, s_2(k_t)/2\right) \times \mathcal{B}e(\alpha + \ell, \beta + n - \ell)$
    • Computational issues in Bayesian statistics / Latent variables: Example (Mixture once again, cont'd)
      where
        $\bar{x}_1(k_t) = \frac{1}{\ell}\sum_{t=1}^{\ell} x_{k_t}$, $\hat{s}_1(k_t) = \sum_{t=1}^{\ell} (x_{k_t} - \bar{x}_1(k_t))^2$,
        $\bar{x}_2(k_t) = \frac{1}{n - \ell}\sum_{t=\ell+1}^{n} x_{k_t}$, $\hat{s}_2(k_t) = \sum_{t=\ell+1}^{n} (x_{k_t} - \bar{x}_2(k_t))^2$
      and
        $\xi_1(k_t) = \dfrac{n_1\xi_1 + \ell\,\bar{x}_1(k_t)}{n_1 + \ell}$, $\xi_2(k_t) = \dfrac{n_2\xi_2 + (n - \ell)\,\bar{x}_2(k_t)}{n_2 + n - \ell}$,
        $s_1(k_t) = s_1^2 + \hat{s}_1(k_t) + \dfrac{n_1\ell}{n_1 + \ell}(\xi_1 - \bar{x}_1(k_t))^2$,
        $s_2(k_t) = s_2^2 + \hat{s}_2(k_t) + \dfrac{n_2(n - \ell)}{n_2 + n - \ell}(\xi_2 - \bar{x}_2(k_t))^2$,
      posterior updates of the hyperparameters
    • Computational issues in Bayesian statistics / Latent variables: Example (Mixture once again)
      Bayes estimator of $\theta$:
        $\delta^\pi(x_1, \ldots, x_n) = \sum_{\ell=0}^n \sum_{(k_t)} \omega(k_t)\, \mathbb{E}^\pi[\theta \mid x, (k_t)]$
      Too costly: $2^n$ terms
    • Computational issues in Bayesian statistics / The AR(p) model: AR(p) model
      Auto-regressive representation of a time series,
        $x_t \mid x_{t-1}, \ldots \sim \mathcal{N}\!\left(\mu + \sum_{i=1}^p \varrho_i(x_{t-i} - \mu),\ \sigma^2\right)$
      Generalisation of AR(1)
      Among the most commonly used models in dynamic settings
      More challenging than the static models (stationarity constraints)
      Different models depending on the processing of the starting value $x_0$
    • Computational issues in Bayesian statistics / The AR(p) model: Unknown stationarity constraints
      Practical difficulty: for complex models, stationarity constraints get quite involved, to the point of being unknown in some cases.
      Example (AR(1)): case of linear Markovian dependence on the last value,
        $x_t = \mu + \varrho(x_{t-1} - \mu) + \epsilon_t$, $\epsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$
      If $|\varrho| < 1$, $(x_t)_{t\in\mathbb{Z}}$ can be written as
        $x_t = \mu + \sum_{j=0}^{\infty} \varrho^j \epsilon_{t-j}$
      and this is a stationary representation.
    • Computational issues in Bayesian statistics / The AR(p) model: Stationary but...
      If $|\varrho| > 1$, alternative stationary representation
        $x_t = \mu - \sum_{j=1}^{\infty} \varrho^{-j} \epsilon_{t+j}$.
      This stationary solution is criticized as artificial because $x_t$ is correlated with future white noises $(\epsilon_s)_{s > t}$, unlike the case when $|\varrho| < 1$. Non-causal representation...
    • Computational issues in Bayesian statistics / The AR(p) model: Stationarity + causality
      Stationarity constraints enter the prior as a restriction on the values of $\theta$.
      Theorem: the AR(p) model is second-order stationary and causal iff the roots of the polynomial
        $\mathcal{P}(x) = 1 - \sum_{i=1}^p \varrho_i x^i$
      are all outside the unit circle.
    • Computational issues in Bayesian statistics / The AR(p) model: Stationarity constraints
      Under stationarity constraints, complex parameter space: each value of $\varrho$ needs to be checked for roots of the corresponding polynomial with modulus less than 1.
      E.g., for an AR(2) process with autoregressive polynomial
        $\mathcal{P}(u) = 1 - \varrho_1 u - \varrho_2 u^2$,
      the constraint is
        $\varrho_1 + \varrho_2 < 1$, $\varrho_2 - \varrho_1 < 1$ and $|\varrho_2| < 1$
      [Figure: the corresponding triangular stationarity region in the $(\theta_1, \theta_2)$ plane]
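The root condition of the theorem translates directly into a numerical check; a minimal sketch with `numpy.roots` (the helper name is an illustration, not from the slides):

```python
import numpy as np

def ar_stationary(rho):
    """Check second-order stationarity/causality of an AR(p) model by testing
    whether all roots of P(u) = 1 - rho_1 u - ... - rho_p u^p lie outside
    the unit circle."""
    # numpy.roots expects coefficients ordered from the highest degree down
    coefs = np.concatenate(([1.0], -np.asarray(rho, dtype=float)))[::-1]
    roots = np.roots(coefs)
    return bool(np.all(np.abs(roots) > 1.0))
```

For AR(2), this agrees with the triangle constraint above: e.g. $(\varrho_1, \varrho_2) = (0.5, 0.3)$ is inside it, $(1.2, 0.5)$ is not.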
    • Computational issues in Bayesian statistics / The MA(q) model: The MA(q) model
      Alternative type of time series,
        $x_t = \mu + \epsilon_t - \sum_{j=1}^q \vartheta_j \epsilon_{t-j}$, $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$
      Stationary but, for identifiability considerations, the polynomial
        $\mathcal{Q}(x) = 1 - \sum_{j=1}^q \vartheta_j x^j$
      must have all its roots outside the unit circle.
    • Computational issues in Bayesian statistics / The MA(q) model: Identifiability
      Example: the MA(1) model
        $x_t = \mu + \epsilon_t - \vartheta_1 \epsilon_{t-1}$, $\text{var}(x_t) = (1 + \vartheta_1^2)\sigma^2$,
      can also be written
        $x_t = \mu + \frac{1}{\vartheta_1}\tilde\epsilon_{t-1} - \tilde\epsilon_t$, $\tilde\epsilon_t \sim \mathcal{N}(0, \vartheta_1^2\sigma^2)$.
      Both pairs $(\vartheta_1, \sigma)$ and $(1/\vartheta_1, \vartheta_1\sigma)$ lead to alternative representations of the same model.
    • Computational issues in Bayesian statistics / The MA(q) model: Properties of MA models
      Non-Markovian model (but a special case of hidden Markov model)
      Autocovariance $\gamma_x(s)$ is null for $|s| > q$
    • Computational issues in Bayesian statistics / The MA(q) model: Representations
      $x_{1:T}$ is a normal random vector with constant mean $\mu$ and banded Toeplitz covariance matrix $\Sigma$ with entries $\Sigma_{ij} = \gamma_{|i-j|}$, where $\gamma_s = 0$ for $|s| > q$ and, for $|s| \le q$,
        $\gamma_s = \sigma^2 \sum_{i=0}^{q-|s|} \vartheta_i \vartheta_{i+|s|}$
      (with the convention $\vartheta_0 = -1$).
      Not manageable in practice [large $T$'s]
    • Computational issues in Bayesian statistics / The MA(q) model: Representations (cont'd)
      Conditional on past $(\epsilon_0, \ldots, \epsilon_{-q+1})$,
        $L(\mu, \vartheta_1, \ldots, \vartheta_q, \sigma \mid x_{1:T}, \epsilon_0, \ldots, \epsilon_{-q+1}) \propto \sigma^{-T} \exp\left\{-\sum_{t=1}^T \left(x_t - \mu + \sum_{j=1}^q \vartheta_j \hat\epsilon_{t-j}\right)^2 \Big/ 2\sigma^2\right\}$,
      where ($t > 0$)
        $\hat\epsilon_t = x_t - \mu + \sum_{j=1}^q \vartheta_j \hat\epsilon_{t-j}$, $\hat\epsilon_0 = \epsilon_0, \ldots, \hat\epsilon_{1-q} = \epsilon_{1-q}$
      Recursive definition of the likelihood, still costly $O(T \times q)$
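The recursion above can be sketched in a few lines of Python; the helper name, the parameterisation and the worked check at the end are illustrative assumptions, not from the slides:

```python
import numpy as np

def ma_conditional_loglik(x, mu, theta, sigma, eps_init):
    """Conditional log-likelihood of an MA(q) model via the recursion
    eps_hat[t] = x[t] - mu + sum_j theta[j] eps_hat[t-j], given the
    starting innovations eps_init = (eps_0, ..., eps_{1-q})."""
    q, T = len(theta), len(x)
    eps_hat = np.concatenate((np.asarray(eps_init)[::-1], np.zeros(T)))
    for t in range(T):
        eps_hat[q + t] = x[t] - mu + sum(
            theta[j] * eps_hat[q + t - 1 - j] for j in range(q)
        )
    resid = eps_hat[q:]            # fitted innovations for t = 1..T
    return -T * np.log(sigma) - np.sum(resid**2) / (2 * sigma**2)

# Worked check with a simulated MA(1) path and its true initial innovation
rng = np.random.default_rng(0)
theta = [0.4]
eps = rng.normal(size=51)          # eps_0, ..., eps_50
x = np.array([0.5 + eps[t] - theta[0] * eps[t - 1] for t in range(1, 51)])
ll = ma_conditional_loglik(x, 0.5, theta, 1.0, np.array([eps[0]]))
```

Fed the true $\epsilon_0$, the recursion recovers the true innovations exactly, so the log-likelihood reduces to $-\sum_t \epsilon_t^2/2$ here.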
    • Computational issues in Bayesian statistics / The MA(q) model: Representations (cont'd)
      Encompassing approach for general time series models: the state-space representation
        $x_t = G y_t + \varepsilon_t$,   (1)
        $y_{t+1} = F y_t + \xi_t$,   (2)
      where (1) is the observation equation and (2) is the state equation.
      Note: this is a special case of hidden Markov model.
    • Computational issues in Bayesian statistics / The MA(q) model: MA(q) state-space representation
      For the MA(q) model, take $y_t = (\epsilon_{t-q}, \ldots, \epsilon_{t-1}, \epsilon_t)'$ and then
        $y_{t+1} = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ & & & \ddots & \\ 0 & 0 & 0 & \cdots & 1 \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix} y_t + \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}\epsilon_{t+1}$,
        $x_t = \mu - \begin{pmatrix} \vartheta_q & \vartheta_{q-1} & \cdots & \vartheta_1 & -1 \end{pmatrix} y_t$.
    • Computational issues in Bayesian statistics / The MA(q) model: MA(q) state-space representation (cont'd)
      Example: for the MA(1) model, observation equation $x_t = (1\ \ 0)\,y_t$ with $y_t = (y_{1t}\ \ y_{2t})'$, directed by the state equation
        $y_{t+1} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} y_t + \begin{pmatrix} 1 \\ \vartheta_1 \end{pmatrix}\epsilon_{t+1}$.
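As a sanity check on the state-space form, one can simulate the same MA(q) path twice, once through the state equation with $y_t = (\epsilon_{t-q}, \ldots, \epsilon_t)'$ and once through the direct recursion, and compare. Everything below (names, parameter values) is an illustrative sketch:

```python
import numpy as np

def simulate_ma_state_space(mu, theta, eps):
    """Simulate an MA(q) series through the state-space form
    y_t = (eps_{t-q}, ..., eps_t)', x_t = mu - (theta_q, ..., theta_1, -1) y_t,
    starting from zero past innovations."""
    q = len(theta)
    F = np.eye(q + 1, k=1)                        # shift-up transition matrix
    h = -np.concatenate((theta[::-1], [-1.0]))    # observation row, negated
    y = np.zeros(q + 1)
    xs = []
    for e in eps:
        y = F @ y                                 # shift the stored innovations
        y[-1] = e                                 # inject the new innovation
        xs.append(mu + h @ y)
    return np.array(xs)

def simulate_ma_direct(mu, theta, eps):
    """Direct recursion x_t = mu + eps_t - sum_j theta_j eps_{t-j} (zero past)."""
    q = len(theta)
    padded = np.concatenate((np.zeros(q), eps))
    return np.array([
        mu + padded[q + t] - sum(theta[j] * padded[q + t - 1 - j] for j in range(q))
        for t in range(len(eps))
    ])

rng = np.random.default_rng(1)
eps = rng.normal(size=30)
x_ss = simulate_ma_state_space(0.5, [0.6, -0.3], eps)
x_direct = simulate_ma_direct(0.5, [0.6, -0.3], eps)
```

The two trajectories coincide, confirming that the state-space matrices encode the same model.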
    • Computational issues in Bayesian statistics / The MA(q) model: A typology of Bayes computational problems
      (i) latent variable models in general;
      (ii) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
      (iii) use of a complex sampling model with an intractable likelihood, as for instance in some graphical models;
      (iv) use of a huge dataset;
      (v) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
      (vi) use of a particular inferential procedure, as for instance Bayes factors
        $B^\pi_{01}(x) = \dfrac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \Big/ \dfrac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)}$.
    • The Metropolis-Hastings Algorithm: Outline
      Computational issues in Bayesian statistics
      The Metropolis-Hastings Algorithm: Monte Carlo basics; Importance Sampling; Monte Carlo Methods based on Markov Chains; The Metropolis–Hastings algorithm; Random-walk Metropolis-Hastings algorithms; Extensions
      The Gibbs Sampler
      Population Monte Carlo
    • The Metropolis-Hastings Algorithm / Monte Carlo basics: General purpose
      Given a density $\pi$ known up to a normalizing constant, $\pi \propto \tilde\pi$, and an integrable function $h$, compute
        $\Pi(h) = \int h(x)\pi(x)\mu(dx) = \dfrac{\int h(x)\tilde\pi(x)\mu(dx)}{\int \tilde\pi(x)\mu(dx)}$
      when $\int h(x)\tilde\pi(x)\mu(dx)$ is intractable.
    • The Metropolis-Hastings Algorithm / Monte Carlo basics: Monte Carlo 101
      Generate an iid sample $x_1, \ldots, x_N$ from $\pi$ and estimate $\Pi(h)$ by
        $\hat\Pi^{MC}_N(h) = N^{-1} \sum_{i=1}^N h(x_i)$.
      LLN: $\hat\Pi^{MC}_N(h) \xrightarrow{\text{a.s.}} \Pi(h)$
      If $\Pi(h^2) = \int h^2(x)\pi(x)\mu(dx) < \infty$,
      CLT: $\sqrt{N}\left(\hat\Pi^{MC}_N(h) - \Pi(h)\right) \rightsquigarrow \mathcal{N}\!\left(0, \Pi\{[h - \Pi(h)]^2\}\right)$.
      Caveat announcing MCMC: often impossible or inefficient to simulate directly from $\Pi$
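A minimal Python sketch of the basic estimator, for the illustrative choice $\pi = \mathcal{N}(0, 1)$ and $h(x) = x^2$ (so $\Pi(h) = 1$); sample size and seed are arbitrary:

```python
import numpy as np

# Plain Monte Carlo estimate of Pi(h) = E[h(X)] with X ~ N(0,1), h(x) = x^2
rng = np.random.default_rng(42)
N = 100_000
x = rng.normal(size=N)
h = x**2
mc_estimate = h.mean()               # \hat{Pi}_N^{MC}(h)
clt_se = h.std(ddof=1) / np.sqrt(N)  # CLT-based standard error
```

The CLT gives the $O(N^{-1/2})$ error bar that makes the estimate usable in practice.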
    • The Metropolis-Hastings Algorithm / Importance Sampling: Importance Sampling
      For $Q$ a proposal distribution such that $Q(dx) = q(x)\mu(dx)$, alternative representation
        $\Pi(h) = \int h(x)\{\pi/q\}(x)\, q(x)\mu(dx)$.
      Principle of importance: generate an iid sample $x_1, \ldots, x_N \sim Q$ and estimate $\Pi(h)$ by
        $\hat\Pi^{IS}_{Q,N}(h) = N^{-1} \sum_{i=1}^N h(x_i)\{\pi/q\}(x_i)$.
    • The Metropolis-Hastings Algorithm / Importance Sampling: Properties of importance
      LLN: $\hat\Pi^{IS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \Pi(h)$
      and if $Q((h\pi/q)^2) < \infty$,
      CLT: $\sqrt{N}\left(\hat\Pi^{IS}_{Q,N}(h) - \Pi(h)\right) \rightsquigarrow \mathcal{N}\!\left(0, Q\{(h\pi/q - \Pi(h))^2\}\right)$.
      Caveat: if the normalizing constant of $\pi$ is unknown, impossible to use $\hat\Pi^{IS}_{Q,N}$. Generic problem in Bayesian statistics: $\pi(\theta\mid x) \propto f(x\mid\theta)\pi(\theta)$.
    • The Metropolis-Hastings Algorithm / Importance Sampling: Self-Normalised Importance Sampling
      Self-normalized version
        $\hat\Pi^{SNIS}_{Q,N}(h) = \left(\sum_{i=1}^N \{\pi/q\}(x_i)\right)^{-1} \sum_{i=1}^N h(x_i)\{\pi/q\}(x_i)$.
      LLN: $\hat\Pi^{SNIS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \Pi(h)$
      and if $\Pi((1 + h^2)(\pi/q)) < \infty$,
      CLT: $\sqrt{N}\left(\hat\Pi^{SNIS}_{Q,N}(h) - \Pi(h)\right) \rightsquigarrow \mathcal{N}\!\left(0, \pi\{(\pi/q)(h - \Pi(h))^2\}\right)$.
      The quality of the SNIS approximation depends on the choice of $Q$.
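A sketch of SNIS in exactly the situation the caveat describes, where the target is known only up to a constant. The unnormalised $\mathcal{N}(2, 1)$ target, the $\mathcal{N}(0, 2^2)$ proposal and the sample size are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def pi_tilde(x):
    """Unnormalised target density: N(2, 1) without its constant."""
    return np.exp(-0.5 * (x - 2.0) ** 2)

def q(x):
    """Proposal density: N(0, 2^2), fully normalised."""
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=N)
w = pi_tilde(x) / q(x)                  # unnormalised weights {pi/q}(x_i)
snis_mean = np.sum(w * x) / np.sum(w)   # \hat{Pi}^{SNIS}(h) with h(x) = x
```

The unknown normalizing constant cancels in the ratio, which is the whole point of the self-normalised version.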
    • The Metropolis-Hastings Algorithm / Monte Carlo Methods based on Markov Chains: Running Monte Carlo via Markov Chains (MCMC)
      It is not necessary to use a sample from the distribution $f$ to approximate the integral
        $\mathfrak{I} = \int h(x)f(x)\,dx$.
      We can obtain $X_1, \ldots, X_n \sim f$ (approximately) without directly simulating from $f$, using an ergodic Markov chain with stationary distribution $f$.
    • The Metropolis-Hastings Algorithm / Monte Carlo Methods based on Markov Chains: Running Monte Carlo via Markov Chains (2)
      Idea: for an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is generated using a transition kernel with stationary distribution $f$.
      This insures the convergence in distribution of $(X^{(t)})$ to a random variable from $f$; for a "large enough" $T_0$, $X^{(T_0)}$ can be considered as distributed from $f$.
      This produces a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$, which is generated from $f$, sufficient for most approximation purposes.
      Problem: how can one build a Markov chain with a given stationary distribution?
    • The Metropolis-Hastings Algorithm / The Metropolis–Hastings algorithm: Basics
      The algorithm uses the objective (target) density $f$ and a conditional density $q(y\mid x)$, called the instrumental (or proposal) distribution.
    • The Metropolis-Hastings Algorithm / The Metropolis–Hastings algorithm: The MH algorithm
      Algorithm (Metropolis–Hastings): given $x^{(t)}$,
      1. Generate $Y_t \sim q(y\mid x^{(t)})$.
      2. Take $X^{(t+1)} = Y_t$ with prob. $\rho(x^{(t)}, Y_t)$, and $X^{(t+1)} = x^{(t)}$ with prob. $1 - \rho(x^{(t)}, Y_t)$, where
        $\rho(x, y) = \min\left\{\dfrac{f(y)}{f(x)}\,\dfrac{q(x\mid y)}{q(y\mid x)},\ 1\right\}$.
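The two steps above can be sketched as a generic sampler; the function and argument names are assumptions, and the illustration at the bottom uses an independence proposal $\mathcal{N}(0, 2^2)$ on an unnormalised $\mathcal{N}(0, 1)$ target:

```python
import numpy as np

def metropolis_hastings(f, q_sample, q_density, x0, n_iter, rng):
    """Generic Metropolis-Hastings sketch: f is the (possibly unnormalised)
    target density, q_sample(x, rng) draws Y ~ q(.|x), and q_density(y, x)
    evaluates q(y|x)."""
    chain = np.empty(n_iter)
    x = x0
    for t in range(n_iter):
        y = q_sample(x, rng)
        ratio = (f(y) / f(x)) * (q_density(x, y) / q_density(y, x))
        if rng.uniform() < min(ratio, 1.0):   # accept with prob. rho(x, y)
            x = y
        chain[t] = x
    return chain

# Illustration: unnormalised N(0,1) target, independence proposal N(0, 2^2)
rng = np.random.default_rng(3)
f = lambda x: np.exp(-0.5 * x * x)
q_sample = lambda x, rng: rng.normal(0.0, 2.0)
q_density = lambda y, x: np.exp(-0.5 * (y / 2.0) ** 2)   # independent of x
chain = metropolis_hastings(f, q_sample, q_density, 0.0, 50_000, rng)
```

Note that only ratios of $f$ and of $q$ appear, which is why normalizing constants are irrelevant.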
    • The Metropolis-Hastings Algorithm / The Metropolis–Hastings algorithm: Features
      Independent of normalizing constants for both $f$ and $q(\cdot\mid x)$ (i.e., of those constants independent of $x$)
      Never moves to values with $f(y) = 0$
      The chain $(x^{(t)})_t$ may take the same value several times in a row, even though $f$ is a density w.r.t. Lebesgue measure
      The sequence $(y_t)_t$ is usually not a Markov chain
    • The Metropolis-Hastings Algorithm / The Metropolis–Hastings algorithm: Convergence properties
      1. The M-H Markov chain is reversible, with invariant/stationary density $f$, since it satisfies the detailed balance condition
        $f(y)K(y, x) = f(x)K(x, y)$
      2. As $f$ is a probability measure, the chain is positive recurrent
      3. If
        $\Pr\left[\dfrac{f(Y_t)\,q(X^{(t)}\mid Y_t)}{f(X^{(t)})\,q(Y_t\mid X^{(t)})} \ge 1\right] < 1$,   (1)
      that is, the event $\{X^{(t+1)} = X^{(t)}\}$ is possible, then the chain is aperiodic
    • The Metropolis-Hastings Algorithm / The Metropolis–Hastings algorithm: Convergence properties (2)
      4. If
        $q(y\mid x) > 0$ for every $(x, y)$,   (2)
      the chain is irreducible
      5. For M-H, $f$-irreducibility implies Harris recurrence
      6. Thus, for M-H satisfying (1) and (2):
      (i) for $h$ with $\mathbb{E}_f|h(X)| < \infty$,
        $\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^T h(X^{(t)}) = \int h(x)\,f(dx)$ a.e. $f$;
      (ii) and
        $\lim_{n\to\infty} \left\|\int K^n(x, \cdot)\mu(dx) - f\right\|_{TV} = 0$
      for every initial distribution $\mu$, where $K^n(x, \cdot)$ denotes the kernel for $n$ transitions.
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Random walk Metropolis–Hastings
      Use of a local perturbation as proposal,
        $Y_t = X^{(t)} + \varepsilon_t$,
      where $\varepsilon_t \sim g$, independent of $X^{(t)}$. The instrumental density is of the form $g(y - x)$, and the Markov chain is a random walk if we take $g$ to be symmetric: $g(x) = g(-x)$.
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Algorithm (Random walk Metropolis)
      Given $x^{(t)}$,
      1. Generate $Y_t \sim g(y - x^{(t)})$
      2. Take $X^{(t+1)} = Y_t$ with prob. $\min\left\{1, \dfrac{f(Y_t)}{f(x^{(t)})}\right\}$, and $X^{(t+1)} = x^{(t)}$ otherwise.
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Example (Random walk and normal target)
      Generate $\mathcal{N}(0, 1)$ based on the uniform proposal $[-\delta, \delta]$. [Hastings (1970)]
      The probability of acceptance is then
        $\rho(x^{(t)}, y_t) = \exp\{((x^{(t)})^2 - y_t^2)/2\} \wedge 1$.
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Example (Random walk & normal (2))
      Sample statistics:
        δ          0.1      0.5      1.0
        mean       0.399   -0.111    0.10
        variance   0.698    1.11     1.06
      As $\delta$ increases, we get better histograms and a faster exploration of the support of $f$.
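The Hastings example can be sketched directly; the chain length and seed are illustrative, and larger $\delta$ reproduces the better behaviour reported in the table:

```python
import numpy as np

def rw_metropolis_uniform(delta, n_iter, rng):
    """Random walk Metropolis for a N(0,1) target with uniform proposal
    on [-delta, delta], as in the Hastings example."""
    x = 0.0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + rng.uniform(-delta, delta)
        # acceptance probability exp{(x^2 - y^2)/2} ^ 1
        if rng.uniform() < min(1.0, np.exp(0.5 * (x * x - y * y))):
            x = y
        chain[t] = x
    return chain

rng = np.random.default_rng(7)
chain = rw_metropolis_uniform(delta=1.0, n_iter=100_000, rng=rng)
```

With $\delta = 1.0$ the empirical mean and variance settle near the $\mathcal{N}(0, 1)$ values, while tiny $\delta$ leaves the chain crawling.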
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms:
      [Figure: three samples based on $\mathcal{U}[-\delta, \delta]$ with (a) $\delta = 0.1$, (b) $\delta = 0.5$ and (c) $\delta = 1.0$, superimposed with the convergence of the means (15,000 simulations)]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Example (Mixture models)
        $\pi(\theta\mid x) \propto \prod_{j=1}^n \sum_{\ell=1}^k p_\ell\, f(x_j\mid\mu_\ell, \sigma_\ell)\, \pi(\theta)$
      Metropolis-Hastings proposal:
        $\theta^{(t+1)} = \theta^{(t)} + \omega\varepsilon^{(t)}$ if $u^{(t)} < \rho^{(t)}$, $\theta^{(t+1)} = \theta^{(t)}$ otherwise,
      where
        $\rho^{(t)} = \dfrac{\pi(\theta^{(t)} + \omega\varepsilon^{(t)}\mid x)}{\pi(\theta^{(t)}\mid x)} \wedge 1$
      and $\omega$ is scaled for a good acceptance rate.
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms:
      [Figure: random walk sampling (50,000 iterations), marginal traces and histograms for the general case of a 3-component normal mixture. Celeux & al., 2000]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms:
      [Figure: random walk MCMC output over the $(\mu_1, \mu_2)$ likelihood surface for $0.7\,\mathcal{N}(\mu_1, 1) + 0.3\,\mathcal{N}(\mu_2, 1)$]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Convergence properties
      Uniform ergodicity is prohibited by the random walk structure.
      At best, geometric ergodicity:
      Theorem (Sufficient ergodicity): for a symmetric density $f$, log-concave in the tails, and a positive and symmetric density $g$, the chain $(X^{(t)})$ is geometrically ergodic. [Mengersen & Tweedie, 1996]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Example (Comparison of tail effects)
      Random-walk Metropolis–Hastings algorithms based on a $\mathcal{N}(0, 1)$ instrumental, for the generation of (a) a $\mathcal{N}(0, 1)$ distribution and (b) a distribution with density $\psi(x) \propto (1 + |x|)^{-3}$, with acceptance ratio of the form
        $\dfrac{1 + \xi^2}{1 + (\xi')^2} \wedge 1$.
      [Figure: 90% confidence envelopes of the means, derived from 500 parallel independent chains]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Further convergence properties
      Under assumptions:
      (A1) $f$ is super-exponential, i.e. it is positive with positive continuous first derivative such that $\lim_{|x|\to\infty} n(x) \cdot \nabla\log f(x) = -\infty$, where $n(x) := x/|x|$. In words: exponential decay of $f$ in every direction, with rate tending to $\infty$.
      (A2) $\limsup_{|x|\to\infty} n(x) \cdot m(x) < 0$, where $m(x) = \nabla f(x)/|\nabla f(x)|$. In words: non-degeneracy of the contour manifold $C_f(y) = \{y : f(y) = f(x)\}$.
      Then the chain is geometrically ergodic, and $V(x) \propto f(x)^{-1/2}$ verifies the drift condition. [Jarner & Hansen, 2000]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Further [further] convergence properties
      If $P$ is $\psi$-irreducible and aperiodic, for $r = (r(n))_{n\in\mathbb{N}}$ a real-valued non-decreasing sequence such that, for all $n, m \in \mathbb{N}$, $r(n + m) \le r(n)r(m)$ and $r(0) = 1$, for $C$ a small set, $\tau_C = \inf\{n \ge 1 : X_n \in C\}$, and $h \ge 1$, assume
        $\sup_{x\in C} \mathbb{E}_x\left[\sum_{k=0}^{\tau_C - 1} r(k)h(X_k)\right] < \infty$;
      then
        $S(f, C, r) := \left\{x \in X : \mathbb{E}_x\left[\sum_{k=0}^{\tau_C - 1} r(k)h(X_k)\right] < \infty\right\}$
      is full and absorbing, and for $x \in S(f, C, r)$,
        $\lim_{n\to\infty} r(n)\left\|P^n(x, \cdot) - f\right\|_h = 0$.
      [Tuominen & Tweedie, 1994]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Comments
      [CLT, Rosenthal's inequality...] $h$-ergodicity implies a CLT for additive (possibly unbounded) functionals of the chain, Rosenthal's inequality, and so on.
      [Control of the moments of the return time] The condition implies (because $h \ge 1$) that
        $\sup_{x\in C} \mathbb{E}_x[r_0(\tau_C)] \le \sup_{x\in C} \mathbb{E}_x\left[\sum_{k=0}^{\tau_C - 1} r(k)h(X_k)\right] < \infty$,
      where $r_0(n) = \sum_{l=0}^n r(l)$. This can be used to derive bounds for the coupling time, an essential step to determine computable bounds, using coupling inequalities. [Roberts & Tweedie, 1998; Fort & Moulines, 2000]
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms: Alternative conditions
      The condition is not really easy to work with... Possible alternative conditions:
      (a) [Tuominen & Tweedie, 1994] There exists a sequence $(V_n)_{n\in\mathbb{N}}$, $V_n \ge r(n)h$, such that (i) $\sup_C V_0 < \infty$, (ii) $\{V_0 = \infty\} \subset \{V_1 = \infty\}$ and (iii)
        $PV_{n+1} \le V_n - r(n)h + b\,r(n)\mathbb{I}_C$.
    • The Metropolis-Hastings Algorithm / Random-walk Metropolis-Hastings algorithms:
      (b) [Fort, 2000] There exist $V \ge f \ge 1$ and $b < \infty$ such that $\sup_C V < \infty$ and
        $PV(x) + \mathbb{E}_x\left[\sum_{k=0}^{\sigma_C} \Delta r(k)f(X_k)\right] \le V(x) + b\,\mathbb{I}_C(x)$,
      where $\sigma_C$ is the hitting time on $C$ and $\Delta r(k) = r(k) - r(k-1)$ for $k \ge 1$, $\Delta r(0) = r(0)$.
      Result: (a) $\Leftrightarrow$ (b) $\Leftrightarrow$ $\sup_{x\in C} \mathbb{E}_x\left[\sum_{k=0}^{\tau_C - 1} r(k)f(X_k)\right] < \infty$.
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Langevin Algorithms Proposal based on the Langevin diffusion L_t, defined by the stochastic differential equation dL_t = dB_t + ½ ∇ log f(L_t) dt, where B_t is the standard Brownian motion Theorem The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Discretization Instead, consider the sequence x^{(t+1)} = x^{(t)} + (σ²/2) ∇ log f(x^{(t)}) + σ ε_t, ε_t ∼ N_p(0, I_p), where σ² corresponds to the discretization step
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Discretization Instead, consider the sequence x^{(t+1)} = x^{(t)} + (σ²/2) ∇ log f(x^{(t)}) + σ ε_t, ε_t ∼ N_p(0, I_p), where σ² corresponds to the discretization step Unfortunately, the discretized chain may be transient, for instance when lim_{x→±∞} σ² |∇ log f(x)| |x|^{−1} > 1
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions MH correction Accept the new value Y_t with probability f(Y_t)/f(x^{(t)}) · exp{ −‖x^{(t)} − Y_t − (σ²/2) ∇ log f(Y_t)‖²/2σ² } / exp{ −‖Y_t − x^{(t)} − (σ²/2) ∇ log f(x^{(t)})‖²/2σ² } ∧ 1. Choice of the scaling factor σ Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated) [Roberts & Rosenthal, 1998; Girolami & Calderhead, 2011]
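    The discretised-Langevin proposal with its MH correction can be sketched as follows (a minimal illustration; the function names and the standard-normal test target in the usage note are ours, not from the slides):

```python
import numpy as np

def mala_step(x, logpdf, grad_logpdf, sigma, rng):
    """One Metropolis-adjusted Langevin step targeting exp(logpdf).

    Proposal: y ~ N(x + (sigma^2/2) grad log f(x), sigma^2 I),
    accepted with the MH ratio combining target and proposal densities."""
    mean_fwd = x + 0.5 * sigma**2 * grad_logpdf(x)
    y = mean_fwd + sigma * rng.standard_normal(x.size)
    mean_bwd = y + 0.5 * sigma**2 * grad_logpdf(y)
    # log acceptance ratio: target ratio times ratio of Gaussian proposals
    log_alpha = (logpdf(y) - logpdf(x)
                 - np.sum((x - mean_bwd)**2) / (2 * sigma**2)
                 + np.sum((y - mean_fwd)**2) / (2 * sigma**2))
    return (y, True) if np.log(rng.uniform()) < log_alpha else (x, False)
```

For a standard normal target (logpdf(x) = −½‖x‖², gradient −x), iterating this step reproduces the N(0, 1) moments.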
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Optimizing the Acceptance Rate Problem of choice of the transition kernel from a practical point of view Most common alternatives: (a) a fully automated algorithm like ARMS; (b) an instrumental density g which approximates f , such that f /g is bounded for uniform ergodicity to apply; (c) a random walk In both cases (b) and (c), the choice of g is critical,
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Case of the random walk Different approach to acceptance rates A high acceptance rate does not indicate that the algorithm is moving correctly since it indicates that the random walk is moving too slowly on the surface of f .
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Case of the random walk Different approach to acceptance rates A high acceptance rate does not indicate that the algorithm is moving correctly since it indicates that the random walk is moving too slowly on the surface of f . If x^{(t)} and y_t are close, i.e. f(x^{(t)}) ≈ f(y_t), y_t is accepted with probability min{ f(y_t)/f(x^{(t)}), 1 } ≈ 1. For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Case of the random walk (2) If the average acceptance rate is low, the successive values of f (yt ) tend to be small compared with f (x(t) ), which means that the random walk moves quickly on the surface of f since it often reaches the “borders” of the support of f
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Rule of thumb In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%. [Gelman, Gilks and Roberts, 1995]
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Rule of thumb In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%. [Gelman, Gilks and Roberts, 1995] This rule is to be taken with a pinch of salt!
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Example (Noisy AR(1)) Hidden Markov chain from a regular AR(1) model, x_{t+1} = ϕ x_t + ε_{t+1}, ε_t ∼ N(0, τ²), and observables y_t | x_t ∼ N(x_t², σ²)
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Example (Noisy AR(1)) Hidden Markov chain from a regular AR(1) model, x_{t+1} = ϕ x_t + ε_{t+1}, ε_t ∼ N(0, τ²), and observables y_t | x_t ∼ N(x_t², σ²) The distribution of x_t given x_{t−1}, x_{t+1} and y_t is proportional to exp{ −[(x_t − ϕx_{t−1})² + (x_{t+1} − ϕx_t)²]/2τ² − (y_t − x_t²)²/2σ² }.
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Example (Noisy AR(1) continued) For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Markov chain based on a random walk with scale ω = .1.
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Markov chain based on a random walk with scale ω = .5.
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions MA(2) Since the constraints on (ϑ1, ϑ2) are well-defined, use a flat prior over the triangle. Simple representation of the likelihood
library(mnormt)
ma2like = function(theta){
  n = length(y)
  sigma = toeplitz(c(1 + theta[1]^2 + theta[2]^2,
                     theta[1] + theta[1]*theta[2], theta[2], rep(0, n-3)))
  dmnorm(y, rep(0, n), sigma, log=TRUE)
}
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Basic RWHM for MA(2) Algorithm 1 RW-HM-MA(2) sampler
set ω and ϑ^{(1)}
for i = 2 to T do
  generate ϑ̃_j ∼ U(ϑ_j^{(i−1)} − ω, ϑ_j^{(i−1)} + ω) (j = 1, 2)
  set p = 0 and ϑ^{(i)} = ϑ^{(i−1)}
  if ϑ̃ within the triangle then
    p = exp(ma2like(ϑ̃) − ma2like(ϑ^{(i−1)}))
  end if
  if U < p then
    ϑ^{(i)} = ϑ̃
  end if
end for
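    Algorithm 1 can also be sketched in Python (a hedged translation: the likelihood mirrors the R ma2like, and "within the triangle" is spelled out here as the invertibility conditions ϑ1 + ϑ2 > −1, ϑ1 − ϑ2 < 1, |ϑ2| < 1, which is our reading):

```python
import numpy as np

def ma2_loglike(theta, y):
    """Gaussian MA(2) log-likelihood with Toeplitz covariance, as in ma2like."""
    n = len(y)
    acov = np.zeros(n)
    acov[0] = 1 + theta[0]**2 + theta[1]**2   # lag-0 autocovariance
    acov[1] = theta[0] + theta[0] * theta[1]  # lag-1
    acov[2] = theta[1]                        # lag-2, zero beyond
    sigma = acov[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(sigma, y))

def in_triangle(theta):
    """Our spelling of the MA(2) invertibility triangle."""
    return theta[0] + theta[1] > -1 and theta[0] - theta[1] < 1 and abs(theta[1]) < 1

def rw_mh_ma2(y, omega, n_iter, rng):
    """Random-walk MH on the MA(2) triangle with U(-omega, omega) increments."""
    theta = np.zeros(2)
    ll = ma2_loglike(theta, y)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        prop = theta + rng.uniform(-omega, omega, size=2)
        if in_triangle(prop):          # flat prior: reject outside the triangle
            ll_prop = ma2_loglike(prop, y)
            if np.log(rng.uniform()) < ll_prop - ll:
                theta, ll = prop, ll_prop
        chain[t] = theta
    return chain
```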
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Outcome Result with a simulated sample of 100 points and ϑ1 = 0.6, ϑ2 = 0.2 and scale ω = 0.2 [scatterplot of the simulated chain in the (θ1, θ2) plane]
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Outcome Result with a simulated sample of 100 points and ϑ1 = 0.6, ϑ2 = 0.2 and scale ω = 0.5 [scatterplot of the simulated chain in the (θ1, θ2) plane]
    • MCMC and Likelihood-free Methods The Metropolis-Hastings Algorithm Extensions Outcome Result with a simulated sample of 100 points and ϑ1 = 0.6, ϑ2 = 0.2 and scale ω = 2.0 [scatterplot of the simulated chain in the (θ1, θ2) plane]
    • MCMC and Likelihood-free Methods The Gibbs Sampler The Gibbs Sampler skip to population Monte Carlo The Gibbs Sampler General Principles Completion Convergence The Hammersley-Clifford theorem
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles General Principles A very specific simulation algorithm based on the target distribution f : 1. Uses the conditional densities f1 , . . . , fp from f
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles General Principles A very specific simulation algorithm based on the target distribution f : 1. Uses the conditional densities f1 , . . . , fp from f 2. Start with the random variable X = (X1 , . . . , Xp )
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles General Principles A very specific simulation algorithm based on the target distribution f : 1. Uses the conditional densities f1 , . . . , fp from f 2. Start with the random variable X = (X1 , . . . , Xp ) 3. Simulate from the conditional densities, Xi |x1 , x2 , . . . , xi−1 , xi+1 , . . . , xp ∼ fi (xi |x1 , x2 , . . . , xi−1 , xi+1 , . . . , xp ) for i = 1, 2, . . . , p.
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Algorithm (Gibbs sampler) Given x^{(t)} = (x_1^{(t)}, . . . , x_p^{(t)}), generate 1. X_1^{(t+1)} ∼ f_1(x_1 | x_2^{(t)}, . . . , x_p^{(t)}); 2. X_2^{(t+1)} ∼ f_2(x_2 | x_1^{(t+1)}, x_3^{(t)}, . . . , x_p^{(t)}); . . . p. X_p^{(t+1)} ∼ f_p(x_p | x_1^{(t+1)}, . . . , x_{p−1}^{(t+1)}) X^{(t+1)} → X ∼ f
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Properties The full conditionals densities f1 , . . . , fp are the only densities used for simulation. Thus, even in a high dimensional problem, all of the simulations may be univariate
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Properties The full conditionals densities f1 , . . . , fp are the only densities used for simulation. Thus, even in a high dimensional problem, all of the simulations may be univariate The Gibbs sampler is not reversible with respect to f . However, each of its p components is. Besides, it can be turned into a reversible sampler, either using the Random Scan Gibbs sampler see section or running instead the (double) sequence f1 · · · fp−1 fp fp−1 · · · f1
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example (Bivariate Gibbs sampler) (X, Y ) ∼ f (x, y) Generate a sequence of observations by Set X0 = x0 For t = 1, 2, . . . , generate Yt ∼ fY |X (·|xt−1 ) Xt ∼ fX|Y (·|yt ) where fY |X and fX|Y are the conditional distributions
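    For a concrete instance (our own toy example, not from the slides): a standard bivariate normal with correlation ρ has both full conditionals available as univariate normals, so the two-stage sampler above is immediate:

```python
import numpy as np

def bivariate_gibbs(rho, n_iter, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Both full conditionals are univariate normals:
        X | Y = y ~ N(rho*y, 1 - rho^2),   Y | X = x ~ N(rho*x, 1 - rho^2)."""
    x = 0.0
    out = np.empty((n_iter, 2))
    s = np.sqrt(1 - rho**2)
    for t in range(n_iter):
        y = rho * x + s * rng.standard_normal()   # draw Y | X = x
        x = rho * y + s * rng.standard_normal()   # draw X | Y = y
        out[t] = (x, y)
    return out
```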
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles A Very Simple Example: Independent N (µ, σ 2 ) Observations When Y1, . . . , Yn ∼iid N(y | µ, σ²) with both µ and σ unknown, the posterior on (µ, σ²) does not belong to a standard family
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles A Very Simple Example: Independent N (µ, σ 2 ) Observations When Y1, . . . , Yn ∼iid N(y | µ, σ²) with both µ and σ unknown, the posterior on (µ, σ²) does not belong to a standard family But... µ | Y_{1:n}, σ² ∼ N( (1/n) Σ_{i=1}^n Y_i , σ²/n ) and σ² | Y_{1:n}, µ ∼ IG( n/2 − 1 , ½ Σ_{i=1}^n (Y_i − µ)² ), assuming constant (improper) priors on both µ and σ². Hence we may use the Gibbs sampler for simulating from the posterior of (µ, σ²)
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles R Gibbs Sampler for Gaussian posterior
n = length(Y); S = sum(Y); mu = S/n
for (i in 1:500){
  S2 = sum((Y-mu)^2)
  sigma2 = 1/rgamma(1, shape=n/2-1, rate=S2/2)
  mu = S/n + sqrt(sigma2/n)*rnorm(1)
}
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3, 4
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3, 4, 5
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3, 4, 5, 10
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3, 4, 5, 10, 25
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Example of results with n = 10 observations from the N (0, 1) distribution Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100, 500
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Limitations of the Gibbs sampler Formally, a special case of a sequence of 1-D M-H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler 1. limits the choice of instrumental distributions
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Limitations of the Gibbs sampler Formally, a special case of a sequence of 1-D M-H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler 1. limits the choice of instrumental distributions 2. requires some knowledge of f
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Limitations of the Gibbs sampler Formally, a special case of a sequence of 1-D M-H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler 1. limits the choice of instrumental distributions 2. requires some knowledge of f 3. is, by construction, multidimensional
    • MCMC and Likelihood-free Methods The Gibbs Sampler General Principles Limitations of the Gibbs sampler Formally, a special case of a sequence of 1-D M-H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler 1. limits the choice of instrumental distributions 2. requires some knowledge of f 3. is, by construction, multidimensional 4. does not apply to problems where the number of parameters varies as the resulting chain is not irreducible.
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Latent variables are back The Gibbs sampler can be generalized to a much broader setting A density g is a completion of f if ∫_Z g(x, z) dz = f(x)
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Latent variables are back The Gibbs sampler can be generalized to a much broader setting A density g is a completion of f if ∫_Z g(x, z) dz = f(x) Note The variable z may be meaningless for the problem
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Purpose g should have full conditionals that are easy to simulate for a Gibbs sampler to be implemented with g rather than f For p > 1, write y = (x, z) and denote the conditional densities of g(y) = g(y1 , . . . , yp ) by Y1 |y2 , . . . , yp ∼ g1 (y1 |y2 , . . . , yp ), Y2 |y1 , y3 , . . . , yp ∼ g2 (y2 |y1 , y3 , . . . , yp ), ..., Yp |y1 , . . . , yp−1 ∼ gp (yp |y1 , . . . , yp−1 ).
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion The move from Y (t) to Y (t+1) is defined as follows: Algorithm (Completion Gibbs sampler) Given (y_1^{(t)}, . . . , y_p^{(t)}), simulate 1. Y_1^{(t+1)} ∼ g_1(y_1 | y_2^{(t)}, . . . , y_p^{(t)}), 2. Y_2^{(t+1)} ∼ g_2(y_2 | y_1^{(t+1)}, y_3^{(t)}, . . . , y_p^{(t)}), . . . p. Y_p^{(t+1)} ∼ g_p(y_p | y_1^{(t+1)}, . . . , y_{p−1}^{(t+1)}).
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example (Mixtures all over again) Hierarchical missing data structure: If k X1 , . . . , Xn ∼ pi f (x|θi ), i=1 then X|Z ∼ f (x|θZ ), Z ∼ p1 I(z = 1) + . . . + pk I(z = k), Z is the component indicator associated with observation x
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example (Mixtures (2)) Conditionally on (Z1 , . . . , Zn ) = (z1 , . . . , zn ) : π(p_1, . . . , p_k, θ_1, . . . , θ_k | x_1, . . . , x_n, z_1, . . . , z_n) ∝ p_1^{α_1+n_1−1} · · · p_k^{α_k+n_k−1} × π(θ_1 | y_1 + n_1 x̄_1, λ_1 + n_1) · · · π(θ_k | y_k + n_k x̄_k, λ_k + n_k), with n_i = Σ_j I(z_j = i) and x̄_i = Σ_{j; z_j=i} x_j / n_i.
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Algorithm (Mixture Gibbs sampler) 1. Simulate θ_i ∼ π(θ_i | y_i + n_i x̄_i, λ_i + n_i) (i = 1, . . . , k) and (p_1, . . . , p_k) ∼ D(α_1 + n_1, . . . , α_k + n_k) 2. Simulate (j = 1, . . . , n) Z_j | x_j, p_1, . . . , p_k, θ_1, . . . , θ_k ∼ Σ_{i=1}^k p_{ij} I(z_j = i) with p_{ij} ∝ p_i f(x_j | θ_i) (i = 1, . . . , k), and update n_i and x̄_i (i = 1, . . . , k).
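    A stripped-down sketch of this sampler for a two-component Gaussian mixture with known unit variances (the hyperparameters, conjugate priors and function names here are our illustrative choices, not those of the slides):

```python
import numpy as np

def mixture_gibbs(x, n_iter, rng, alpha=1.0, prior_mean=0.0, prior_prec=0.1):
    """Gibbs sampler for x_j ~ p N(mu1, 1) + (1-p) N(mu2, 1).

    Alternates: allocation z_j | rest, weights p | z, means mu | z, x."""
    n = len(x)
    mu = np.array([-1.0, 1.0])
    p = 0.5
    keep = np.empty((n_iter, 3))
    for t in range(n_iter):
        # 1. allocate each observation to a component (latent z_j)
        w1 = p * np.exp(-0.5 * (x - mu[0])**2)
        w2 = (1 - p) * np.exp(-0.5 * (x - mu[1])**2)
        z = (rng.uniform(size=n) < w2 / (w1 + w2)).astype(int)
        n2 = z.sum(); n1 = n - n2
        # 2. weight of component 1 from a Beta (Dirichlet with k = 2)
        p = rng.beta(alpha + n1, alpha + n2)
        # 3. component means from their Gaussian full conditionals
        for i, ni in enumerate((n1, n2)):
            prec = prior_prec + ni
            post_mean = (prior_prec * prior_mean + x[z == i].sum()) / prec
            mu[i] = rng.normal(post_mean, 1 / np.sqrt(prec))
        keep[t] = (p, mu[0], mu[1])
    return keep
```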
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion A wee problem [scatterplot of the Gibbs sample in the (µ1, µ2) plane] Gibbs started at random
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion A wee problem [two scatterplots of the Gibbs sample in the (µ1, µ2) plane] Gibbs started at random; Gibbs stuck at the wrong mode
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Slice sampler as generic Gibbs If f(θ) can be written as a product f(θ) = Π_{i=1}^k f_i(θ),
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Slice sampler as generic Gibbs If f(θ) can be written as a product f(θ) = Π_{i=1}^k f_i(θ), it can be completed as Π_{i=1}^k I_{0 ≤ ω_i ≤ f_i(θ)}, leading to the following Gibbs algorithm:
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Algorithm (Slice sampler) Simulate 1. ω_1^{(t+1)} ∼ U_{[0, f_1(θ^{(t)})]}; . . . k. ω_k^{(t+1)} ∼ U_{[0, f_k(θ^{(t)})]}; k+1. θ^{(t+1)} ∼ U_{A^{(t+1)}}, with A^{(t+1)} = {y ; f_i(y) ≥ ω_i^{(t+1)}, i = 1, . . . , k}.
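    In the simplest case k = 1, the sampler alternates a uniform vertical draw with a uniform draw on the slice; a minimal sketch (the rejection search for the slice on a bounding interval [lo, hi] is our simplification, not part of the algorithm above):

```python
import numpy as np

def slice_sample(f, x0, n_iter, rng, lo, hi):
    """Simple 1D slice sampler for an unnormalised density f on [lo, hi].

    The slice {y : f(y) >= u} is sampled by rejection from U(lo, hi),
    which is exact (if potentially slow) for this sketch."""
    x = x0
    out = np.empty(n_iter)
    for t in range(n_iter):
        u = rng.uniform(0, f(x))      # vertical step: omega ~ U[0, f(x)]
        while True:                   # horizontal step: y ~ U{f >= u}
            y = rng.uniform(lo, hi)
            if f(y) >= u:
                break
        x = y
        out[t] = x
    return out
```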
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example of results with a truncated N (−3, 1) distribution [histogram of the sampler output] Number of Iterations 2
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example of results with a truncated N (−3, 1) distribution [histogram of the sampler output] Number of Iterations 2, 3
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example of results with a truncated N (−3, 1) distribution [histogram of the sampler output] Number of Iterations 2, 3, 4
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example of results with a truncated N (−3, 1) distribution [histogram of the sampler output] Number of Iterations 2, 3, 4, 5
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example of results with a truncated N (−3, 1) distribution [histogram of the sampler output] Number of Iterations 2, 3, 4, 5, 10
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example of results with a truncated N (−3, 1) distribution [histogram of the sampler output] Number of Iterations 2, 3, 4, 5, 10, 50
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example of results with a truncated N (−3, 1) distribution [histogram of the sampler output] Number of Iterations 2, 3, 4, 5, 10, 50, 100
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Good slices The slice sampler usually enjoys good theoretical properties (like geometric ergodicity and even uniform ergodicity under bounded f and bounded X ). As k increases, the determination of the set A(t+1) may get increasingly complex.
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example (Stochastic volatility core distribution) Difficult part of the stochastic volatility model π(x) ∝ exp{ −[σ²(x − µ)² + β² exp(−x) y² + x]/2 }, simplified into exp{ −[x² + α exp(−x)]/2 }
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion Example (Stochastic volatility core distribution) Difficult part of the stochastic volatility model π(x) ∝ exp{ −[σ²(x − µ)² + β² exp(−x) y² + x]/2 }, simplified into exp{ −[x² + α exp(−x)]/2 } Slice sampling means simulation from a uniform distribution on A = { x ; exp{−[x² + α exp(−x)]/2} ≥ u } = { x ; x² + α exp(−x) ≤ ω } if we set ω = −2 log u. Note Inversion of x² + α exp(−x) = ω needs to be done by trial-and-error.
    • MCMC and Likelihood-free Methods The Gibbs Sampler Completion [Histogram of a Markov chain produced by a slice sampler and target distribution in overlay, with the autocorrelation function of the chain up to lag 100]
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence Properties of the Gibbs sampler Theorem (Convergence) For (Y1 , Y2 , · · · , Yp ) ∼ g(y1 , . . . , yp ), if either [Positivity condition] (i) g (i) (yi )> 0 for every i = 1, · · · , p, implies that g(y1 , . . . , yp ) > 0, where g (i) denotes the marginal distribution of Yi , or (ii) the transition kernel is absolutely continuous with respect to g, then the chain is irreducible and positive Harris recurrent.
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence Properties of the Gibbs sampler (2) Consequences (i) If ∫ h(y)g(y) dy < ∞, then lim_{T→∞} (1/T) Σ_{t=1}^T h(Y^{(t)}) = ∫ h(y)g(y) dy a.e. g. (ii) If, in addition, (Y^{(t)}) is aperiodic, then lim_{n→∞} ‖ ∫ K^n(y, ·) µ(dy) − f ‖_TV = 0 for every initial distribution µ.
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence Slice sampler fast on that slice For convergence, the properties of Xt and of f (Xt ) are identical Theorem (Uniform ergodicity) If f is bounded and suppf is bounded, the simple slice sampler is uniformly ergodic. [Mira & Tierney, 1997]
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence A small set for a slice sampler no slice detail For ε′ > ε, C = {x ∈ X ; ε < f(x) < ε′} is a small set: P(x, ·) ≥ (ε/ε′) µ(·) where µ(A) = (1/ε) ∫_0^ε [λ(A ∩ L(ω)) / λ(L(ω))] dω if L(ω) = {x ∈ X ; f(x) > ω} [Roberts & Rosenthal, 1998]
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence Slice sampler: drift Under differentiability and monotonicity conditions, the slice sampler also verifies a drift condition with V (x) = f (x)−β , is geometrically ergodic, and there even exist explicit bounds on the total variation distance [Roberts & Rosenthal, 1998]
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence Slice sampler: drift Under differentiability and monotonicity conditions, the slice sampler also verifies a drift condition with V(x) = f(x)^{−β}, is geometrically ergodic, and there even exist explicit bounds on the total variation distance [Roberts & Rosenthal, 1998] Example (Exponential Exp(1)) For n > 23, ‖K^n(x, ·) − f(·)‖_TV ≤ 0.054865 (0.985015)^n (n − 15.7043)
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence Slice sampler: convergence no more slice detail Theorem For any density such that ∂/∂ε λ({x ∈ X ; f(x) > ε}) is non-increasing, then ‖K^523(x, ·) − f(·)‖_TV ≤ 0.0095 [Roberts & Rosenthal, 1998]
    • MCMC and Likelihood-free Methods The Gibbs Sampler Convergence A poor slice sampler Example Consider f(x) = exp{−‖x‖}, x ∈ R^d. Slice sampler equivalent to a one-dimensional slice sampler on π(z) = z^{d−1} e^{−z}, z > 0, or on π(u) = e^{−u^{1/d}}, u > 0. Poor performances when d large (heavy tails) [Sample runs of log(u) and ACFs for log(u) for d = 1, 10, 20, 100 (Roberts)]
    • MCMC and Likelihood-free Methods The Gibbs Sampler The Hammersley-Clifford theorem Hammersley-Clifford theorem An illustration that conditionals determine the joint distribution Theorem If the joint density g(y1, y2) has conditional distributions g1(y1 | y2) and g2(y2 | y1), then g(y1, y2) = g2(y2 | y1) / ∫ [g2(v | y1)/g1(y1 | v)] dv. [Hammersley & Clifford, circa 1970]
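    A quick numerical sanity check of this identity (our own illustration, on a bivariate normal discretised over a grid; the function name is ours):

```python
import numpy as np

def hc_reconstruct(g, dv):
    """Rebuild a joint density g[i, j] (y1 on axis 0, y2 on axis 1) from its
    two full conditionals via Hammersley-Clifford:
        g(y1, y2) = g2(y2|y1) / integral of g2(v|y1)/g1(y1|v) dv."""
    g1 = g / (g.sum(axis=0, keepdims=True) * dv)   # g1(y1 | y2)
    g2 = g / (g.sum(axis=1, keepdims=True) * dv)   # g2(y2 | y1)
    denom = (g2 / g1).sum(axis=1, keepdims=True) * dv
    return g2 / denom
```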
    • MCMC and Likelihood-free Methods The Gibbs Sampler The Hammersley-Clifford theorem General HC decomposition Under the positivity condition, the joint distribution g satisfies g(y1, . . . , yp) ∝ Π_{j=1}^p [ g_{ℓ_j}(y_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) / g_{ℓ_j}(y′_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) ] for every permutation ℓ on {1, 2, . . . , p} and every y′ ∈ Y.
    • MCMC and Likelihood-free Methods Population Monte Carlo Sequential importance sampling Computational issues in Bayesian statistics The Metropolis-Hastings Algorithm The Gibbs Sampler Population Monte Carlo Approximate Bayesian computation ABC for model choice
    • MCMC and Likelihood-free Methods Population Monte Carlo Importance sampling (revisited) basic importance Approximation of integrals I = ∫ h(x)π(x) dx by unbiased estimators Î = (1/n) Σ_{i=1}^n ω_i h(x_i) when x_1, . . . , x_n ∼iid q(x) and ω_i := π(x_i)/q(x_i)
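    A minimal sketch of this estimator (function names are ours; π is assumed normalised so the estimate is unbiased):

```python
import numpy as np

def importance_estimate(h, log_pi, log_q, sample_q, n, rng):
    """Unbiased importance-sampling estimate of E_pi[h(X)] from q-draws."""
    x = sample_q(n, rng)
    w = np.exp(log_pi(x) - log_q(x))   # importance weights pi(x_i)/q(x_i)
    return np.mean(w * h(x))
```

For instance, with π = N(0, 1) and the wider proposal q = N(0, 4), the estimator recovers E[X²] = 1.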
    • MCMC and Likelihood-free Methods Population Monte Carlo Iterated importance sampling As in Markov Chain Monte Carlo (MCMC) algorithms, introduction of a temporal dimension: x_i^{(t)} ∼ q_t(x | x_i^{(t−1)}), i = 1, . . . , n, t = 1, . . . and Î_t = (1/n) Σ_{i=1}^n ω_i^{(t)} h(x_i^{(t)}) is still unbiased for ω_i^{(t)} = π_t(x_i^{(t)}) / q_t(x_i^{(t)} | x_i^{(t−1)}), i = 1, . . . , n
    • MCMC and Likelihood-free Methods Population Monte Carlo Fundamental importance equality Preservation of unbiasedness E[ h(X^{(t)}) π(X^{(t)}) / q_t(X^{(t)} | X^{(t−1)}) ] = ∫∫ h(x) [π(x)/q_t(x|y)] q_t(x|y) g(y) dx dy = ∫ h(x) π(x) dx for any distribution g on X^{(t−1)}
    • MCMC and Likelihood-free Methods Population Monte Carlo Sequential variance decomposition Furthermore, var(Î_t) = (1/n²) Σ_{i=1}^n var( ω_i^{(t)} h(x_i^{(t)}) ), if var(ω_i^{(t)}) exists, because the x_i^{(t)}’s are conditionally uncorrelated Note This decomposition is still valid for correlated [in i] x_i^{(t)}’s when incorporating weights ω_i^{(t)}
    • MCMC and Likelihood-free Methods Population Monte Carlo Simulation of a population The importance distribution of the sample (a.k.a. particles) x^{(t)}, q_t(x^{(t)} | x^{(t−1)}), can depend on the previous sample x^{(t−1)} in any possible way as long as the marginal distributions q_{it}(x) = ∫ q_t(x^{(t)}) dx_{−i}^{(t)} can be expressed to build the importance weights ω_{it} = π(x_i^{(t)}) / q_{it}(x_i^{(t)})
    • MCMC and Likelihood-free Methods Population Monte Carlo Special case of the product proposal If q_t(x^{(t)} | x^{(t−1)}) = Π_{i=1}^n q_{it}(x_i^{(t)} | x^{(t−1)}) [Independent proposals] then var(Î_t) = (1/n²) Σ_{i=1}^n var( ω_i^{(t)} h(x_i^{(t)}) )
    • MCMC and Likelihood-free Methods Population Monte Carlo Validation skip validation E[ ω_i^{(t)} h(X_i^{(t)}) ω_j^{(t)} h(X_j^{(t)}) ] = ∫ h(x_i) [π(x_i)/q_{it}(x_i|x^{(t−1)})] h(x_j) [π(x_j)/q_{jt}(x_j|x^{(t−1)})] q_{it}(x_i|x^{(t−1)}) q_{jt}(x_j|x^{(t−1)}) dx_i dx_j g(x^{(t−1)}) dx^{(t−1)} = E_π[h(X)]² whatever the distribution g on x^{(t−1)}
    • MCMC and Likelihood-free Methods Population Monte Carlo Self-normalised version In general, π is unscaled and the weight ω_i^{(t)} ∝ π(x_i^{(t)}) / q_{it}(x_i^{(t)}), i = 1, . . . , n, is scaled so that Σ_i ω̄_i^{(t)} = 1
    • MCMC and Likelihood-free Methods Population Monte Carlo Self-normalised version properties Loss of the unbiasedness property and the variance decomposition Normalising constant can be estimated by ϖ_t = (1/tn) Σ_{τ=1}^t Σ_{i=1}^n π(x_i^{(τ)}) / q_{iτ}(x_i^{(τ)}) Variance decomposition (approximately) recovered if ϖ_{t−1} is used instead
    • MCMC and Likelihood-free Methods Population Monte Carlo Sampling importance resampling Importance sampling from g can also produce samples from the target π [Rubin, 1987]
    • MCMC and Likelihood-free Methods Population Monte Carlo Sampling importance resampling Importance sampling from g can also produce samples from the target π [Rubin, 1987] Theorem (Bootstraped importance sampling) If a sample (xi )1≤i≤m is derived from the weighted sample (xi , i )1≤i≤n by multinomial sampling with weights i , then xi ∼ π(x)
    • MCMC and Likelihood-free Methods Population Monte Carlo Sampling importance resampling Importance sampling from g can also produce samples from the target π [Rubin, 1987] Theorem (Bootstraped importance sampling) If a sample (xi )1≤i≤m is derived from the weighted sample (xi , i )1≤i≤n by multinomial sampling with weights i , then xi ∼ π(x) Note Obviously, the xi ’s are not iid
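    The multinomial resampling step of the theorem is essentially one line (a sketch; the function name is ours):

```python
import numpy as np

def sir_resample(x, w, m, rng):
    """Sampling importance resampling: multinomial draw of m points from
    the weighted sample (x_i, w_i); the output is approximately
    distributed from the target pi (and, as noted, not iid)."""
    w = np.asarray(w, dtype=float)
    idx = rng.choice(len(x), size=m, p=w / w.sum())
    return np.asarray(x)[idx]
```

For instance, uniform draws on (−5, 5) reweighted by exp(−x²/2) and resampled give an approximate N(0, 1) sample.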
    • MCMC and Likelihood-free Methods Population Monte Carlo Iterated sampling importance resampling This principle can be extended to iterated importance sampling: After each iteration, resampling produces a sample from π [Again, not iid!] Incentive Use previous sample(s) to learn about π and q
    • MCMC and Likelihood-free Methods Population Monte Carlo Generic Population Monte Carlo Algorithm (Population Monte Carlo Algorithm) For $t = 1, \dots, T$: For $i = 1, \dots, n$: 1. Select the generating distribution $q_{it}(\cdot)$ 2. Generate $\tilde x_i^{(t)} \sim q_{it}(x)$ 3. Compute $\varrho_i^{(t)} = \pi(\tilde x_i^{(t)}) \big/ q_{it}(\tilde x_i^{(t)})$ Normalise the $\varrho_i^{(t)}$'s into $\bar\varrho_i^{(t)}$'s Generate $J_{i,t} \sim \mathcal{M}\big((\bar\varrho_i^{(t)})_{1\le i\le N}\big)$ and set $x_{i,t} = \tilde x_{J_{i,t},t}$
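The generic PMC loop can be sketched as follows, with an assumed bimodal target and Gaussian random-walk proposals built on each previous particle (both are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def pi_u(x):  # unnormalised target: 0.5 N(-2,1) + 0.5 N(2,1)
    return 0.5 * np.exp(-0.5 * (x + 2)**2) + 0.5 * np.exp(-0.5 * (x - 2)**2)

def q_pdf(x, mu, sig):  # proposal density q_it, here a random walk around mu
    return np.exp(-0.5 * ((x - mu) / sig)**2) / (sig * np.sqrt(2 * np.pi))

n, T, sig = 5_000, 10, 1.0
x = rng.normal(0.0, 5.0, size=n)                 # initial particles

for t in range(T):
    x_tilde = rng.normal(x, sig)                 # step 2: x~_i ~ q_it(.|x_i^(t-1))
    rho = pi_u(x_tilde) / q_pdf(x_tilde, x, sig) # step 3: importance weights
    rho_bar = rho / rho.sum()                    # normalised weights
    J = rng.choice(n, size=n, replace=True, p=rho_bar)
    x = x_tilde[J]                               # multinomial resampling step

pmc_mean = np.sum(rho_bar * x_tilde)             # weighted estimate of E_pi[X] = 0
```

The weight uses the proposal density conditional on the previous particle, which is valid by the validation argument above (unbiasedness holds whatever the distribution of $x^{(t-1)}$).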
    • MCMC and Likelihood-free Methods Population Monte Carlo D-kernels in competition A general adaptive construction: Construct $q_{i,t}$ as a mixture of $D$ different transition kernels depending on $x_i^{(t-1)}$, $q_{i,t} = \sum_{\ell=1}^D p_{t,\ell}\, K_\ell(x_i^{(t-1)}, x)$, $\sum_{\ell=1}^D p_{t,\ell} = 1$, and adapt the weights $p_{t,\ell}$. Darwinian example Take $p_{t,\ell}$ proportional to the survival rate of the points (a.k.a. particles) $x_i^{(t)}$ generated from $K_\ell$
    • MCMC and Likelihood-free Methods Population Monte Carlo Implementation Algorithm (D-kernel PMC) For $t = 1, \dots, T$: generate $(K_{i,t})_{1\le i\le N} \sim \mathcal{M}\big((p_{t,k})_{1\le k\le D}\big)$; for $1 \le i \le N$, generate $\tilde x_{i,t} \sim K_{K_{i,t}}(x)$; compute and renormalise the importance weights $\omega_{i,t}$; generate $(J_{i,t})_{1\le i\le N} \sim \mathcal{M}\big((\bar\omega_{i,t})_{1\le i\le N}\big)$; take $x_{i,t} = \tilde x_{J_{i,t},t}$ and $p_{t+1,d} = \sum_{i=1}^N \bar\omega_{i,t}\, \mathbb{I}_d(K_{i,t})$
    • MCMC and Likelihood-free Methods Population Monte Carlo Links with particle filters Sequential setting where π = πt changes with t: Population Monte Carlo also adapts to this case Can be traced back all the way to Hammersley and Morton (1954) and the self-avoiding random walk problem Gilks and Berzuini (2001) produce iterated samples with (SIR) resampling steps, and add an MCMC step: this step must use a πt invariant kernel Chopin (2001) uses iterated importance sampling to handle large datasets: this is a special case of PMC where the qit ’s are the posterior distributions associated with a portion kt of the observed dataset
    • MCMC and Likelihood-free Methods Population Monte Carlo Links with particle filters (2) Rubinstein and Kroese’s (2004) cross-entropy method is parameterised importance sampling targeted at rare events Stavropoulos and Titterington’s (1999) smooth bootstrap and Warnes’ (2001) kernel coupler use nonparametric kernels on the previous importance sample to build an improved proposal: this is a special case of PMC West (1992) mixture approximation is a precursor of smooth bootstrap Mengersen and Robert (2002) “pinball sampler” is an MCMC attempt at population sampling Del Moral, Doucet and Jasra (2006, JRSS B) sequential Monte Carlo samplers also relates to PMC, with a Markovian dependence on the past sample x(t) but (limited) stationarity constraints
    • MCMC and Likelihood-free Methods Population Monte Carlo Things can go wrong Unexpected behaviour of the mixture weights when the number of particles increases: $\sum_{i=1}^N \bar\omega_{i,t}\, \mathbb{I}_{K_{i,t}=d} \stackrel{P}{\longrightarrow} \frac{1}{D}$ Conclusion At each iteration, every weight converges to $1/D$: the algorithm fails to learn from experience!!
    • MCMC and Likelihood-free Methods Population Monte Carlo Saved by Rao-Blackwell!! Modification: Rao-Blackwellisation (= conditioning) Use the whole mixture in the importance weight: $\omega_{i,t} = \pi(\tilde x_{i,t}) \Big/ \sum_{d=1}^D p_{t,d}\, K_d(x_{i,t-1}, \tilde x_{i,t})$ instead of $\omega_{i,t} = \pi(\tilde x_{i,t}) \big/ K_{K_{i,t}}(x_{i,t-1}, \tilde x_{i,t})$
    • MCMC and Likelihood-free Methods Population Monte Carlo Adapted algorithm Algorithm (Rao-Blackwellised D-kernel PMC) At time $t$ ($t = 1, \dots, T$): Generate iid $(K_{i,t})_{1\le i\le N} \sim \mathcal{M}\big((p_{t,d})_{1\le d\le D}\big)$; Generate independently $(\tilde x_{i,t})_{1\le i\le N} \sim K_{K_{i,t}}(x_{i,t-1}, x)$ and set $\omega_{i,t} = \pi(\tilde x_{i,t}) \big/ \sum_{d=1}^D p_{t,d}\, K_d(x_{i,t-1}, \tilde x_{i,t})$; Generate iid $(J_{i,t})_{1\le i\le N} \sim \mathcal{M}\big((\bar\omega_{i,t})_{1\le i\le N}\big)$ and set $x_{i,t} = \tilde x_{J_{i,t},t}$ and $p_{t+1,d} = \sum_{i=1}^N \bar\omega_{i,t}\, \frac{p_{t,d}\, K_d(x_{i,t-1}, \tilde x_{i,t})}{\sum_{j=1}^D p_{t,j}\, K_j(x_{i,t-1}, \tilde x_{i,t})}$.
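A sketch of the Rao-Blackwellised D-kernel scheme on the toy target used later in these slides, with the three Gaussian components themselves as (independence) proposal kernels; reading N(μ, v) with v a variance is an assumption about the slides' convention, and the Rao-Blackwellised update uses the posterior kernel probabilities, consistent with the limiting recursion stated below:

```python
import numpy as np

rng = np.random.default_rng(3)

def npdf(x, mu, v):  # normal density with mean mu and variance v
    return np.exp(-0.5 * (x - mu)**2 / v) / np.sqrt(2 * np.pi * v)

# Target 1/4 N(-1,0.3) + 1/4 N(0,1) + 1/2 N(3,2), same three kernels as proposals
mus, vs = np.array([-1.0, 0.0, 3.0]), np.array([0.3, 1.0, 2.0])
tw = np.array([0.25, 0.25, 0.5])

def pi(x):
    return sum(p * npdf(x, m, v) for p, m, v in zip(tw, mus, vs))

D, N, T = 3, 10_000, 30
p = np.array([0.05, 0.05, 0.9])                  # initial mixture weights

for t in range(T):
    k = rng.choice(D, size=N, p=p)               # kernel indicators K_{i,t}
    x = rng.normal(mus[k], np.sqrt(vs[k]))       # draws from the selected kernels
    mix = sum(p[d] * npdf(x, mus[d], vs[d]) for d in range(D))
    w_bar = pi(x) / mix                          # Rao-Blackwellised weights
    w_bar /= w_bar.sum()
    # posterior probability of each kernel given the draw, averaged with weights
    post = np.stack([p[d] * npdf(x, mus[d], vs[d]) / mix for d in range(D)])
    p = post @ w_bar
    p /= p.sum()                                 # guard against float drift
```

Over iterations, the weights `p` drift toward the component weights (1/4, 1/4, 1/2), mirroring the weight-evolution table below.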
    • MCMC and Likelihood-free Methods Population Monte Carlo Convergence properties Theorem (LLN) Under regularity assumptions, for $h \in L^1$ and for every $t \ge 1$, $\frac{1}{N} \sum_{i=1}^N \bar\omega_{i,t}\, h(x_{i,t}) \stackrel{P}{\longrightarrow} \Pi(h)$ and $p_{t,d} \stackrel{P}{\longrightarrow} \alpha_d^t$ as $N \to \infty$. The limiting coefficients $(\alpha_d^t)_{1\le d\le D}$ are defined recursively as $\alpha_d^t = \int \frac{\alpha_d^{t-1}\, K_d(x, x')}{\sum_{j=1}^D \alpha_j^{t-1}\, K_j(x, x')}\, \Pi \otimes \Pi(dx, dx')$.
    • MCMC and Likelihood-free Methods Population Monte Carlo Recursion on the weights Set $F$ as $F(\alpha) = \left( \int \frac{\alpha_d\, K_d(x, x')}{\sum_{j=1}^D \alpha_j\, K_j(x, x')}\, \Pi \otimes \Pi(dx, dx') \right)_{1\le d\le D}$ on the simplex $\mathcal{S} = \big\{ \alpha = (\alpha_1, \dots, \alpha_D);\ \forall d \in \{1, \dots, D\},\ \alpha_d \ge 0 \text{ and } \sum_{d=1}^D \alpha_d = 1 \big\}$, and define the sequence $\alpha^{t+1} = F(\alpha^t)$
    • MCMC and Likelihood-free Methods Population Monte Carlo Kullback divergence Definition (Kullback divergence) For $\alpha \in \mathcal{S}$, $KL(\alpha) = \int \log\left( \frac{\pi(x)\,\pi(x')}{\pi(x) \sum_{d=1}^D \alpha_d\, K_d(x, x')} \right) \Pi \otimes \Pi(dx, dx')$: Kullback divergence between $\Pi$ and the mixture. Goal: obtain the mixture closest to $\Pi$, i.e., the one that minimises $KL(\alpha)$
    • MCMC and Likelihood-free Methods Population Monte Carlo Connection with RBDPMCA ?? Theorem Under the assumption $\forall d \in \{1, \dots, D\}$, $-\infty < \int \log(K_d(x, x'))\, \Pi \otimes \Pi(dx, dx') < \infty$, for every $\alpha \in \mathcal{S}_D$, $KL(F(\alpha)) \le KL(\alpha)$. Conclusion The Kullback divergence decreases at every iteration of RBDPMCA
    • MCMC and Likelihood-free Methods Population Monte Carlo An integrated EM interpretation skip interpretation We have $\alpha^{\min} = \arg\min_{\alpha \in \mathcal{S}} KL(\alpha) = \arg\max_{\alpha \in \mathcal{S}} \int \log p_\alpha(\bar x)\, \Pi \otimes \Pi(d\bar x) = \arg\max_{\alpha \in \mathcal{S}} \int \log \int p_\alpha(\bar x, K)\, dK\ \Pi \otimes \Pi(d\bar x)$ for $\bar x = (x, x')$ and $K \sim \mathcal{M}((\alpha_d)_{1\le d\le D})$. Then $\alpha^{t+1} = F(\alpha^t)$ means $\alpha^{t+1} = \arg\max_\alpha \int E_{\alpha^t}\big(\log p_\alpha(\bar X, K) \,\big|\, \bar X = \bar x\big)\, \Pi \otimes \Pi(d\bar x)$ and $\lim_{t\to\infty} \alpha^t = \alpha^{\min}$
    • MCMC and Likelihood-free Methods Population Monte Carlo Illustration Example (A toy example) Take the target $\tfrac14 \mathcal{N}(-1, 0.3)(x) + \tfrac14 \mathcal{N}(0, 1)(x) + \tfrac12 \mathcal{N}(3, 2)(x)$ and use 3 proposals: $\mathcal{N}(-1, 0.3)$, $\mathcal{N}(0, 1)$ and $\mathcal{N}(3, 2)$ [Surprise!!!] Then the weight evolution is: $t=1$: (0.0500000, 0.05000000, 0.9000000); $t=2$: (0.2605712, 0.09970292, 0.6397259); $t=6$: (0.2740816, 0.19160178, 0.5343166); $t=10$: (0.2989651, 0.19200904, 0.5090259); $t=16$: (0.2651511, 0.24129039, 0.4935585)
    • MCMC and Likelihood-free Methods Population Monte Carlo Target and mixture evolution
    • MCMC and Likelihood-free Methods Population Monte Carlo Learning scheme The efficiency of the SNIS approximation depends on the choice of $Q$, ranging from optimal, $q(x) \propto |h(x) - \Pi(h)|\,\pi(x)$, to useless, $\mathrm{var}\,\hat\Pi^{SNIS}_{Q,N}(h) = +\infty$ Example (PMC = adaptive importance sampling) Population Monte Carlo produces a sequence of proposals $Q_t$ aiming at improving efficiency: $\mathrm{Kull}(\pi, q_t) \le \mathrm{Kull}(\pi, q_{t-1})$ or $\mathrm{var}\,\hat\Pi^{SNIS}_{Q_t,\infty}(h) \le \mathrm{var}\,\hat\Pi^{SNIS}_{Q_{t-1},\infty}(h)$ [Cappé, Douc, Guillin, Marin, Robert, 04, 07a, 07b, 08]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Approximate Bayesian computation Computational issues in Bayesian statistics The Metropolis-Hastings Algorithm The Gibbs Sampler Population Monte Carlo Approximate Bayesian computation ABC basics Alphabet soup Calibration of ABC
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Intractable likelihoods Cases when the likelihood function $f(y|\theta)$ is unavailable and when the completion step $f(y|\theta) = \int_{\mathcal{Z}} f(y, z|\theta)\, dz$ is impossible or too costly because of the dimension of $z$: MCMC cannot be implemented!
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Stochastic volatility model: for $t = 1, \dots, T$, $y_t = \exp(z_t)\,\epsilon_t$, $z_t = a + b z_{t-1} + \sigma \eta_t$; $T$ very large makes it difficult to include $z$ within the simulated parameters [figure: highest weight trajectories]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Potts model: if $y$ takes values on a grid $\mathcal{Y}$ of size $k^n$ and $f(y|\theta) \propto \exp\big( \theta \sum_{l \sim i} \mathbb{I}_{y_l = y_i} \big)$, where $l \sim i$ denotes a neighbourhood relation, $n$ moderately large prohibits the computation of the normalising constant
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Inference on CMB: in cosmology, study of the Cosmic Microwave Background via likelihoods immensely slow to compute (e.g. WMAP, Planck), because of numerically costly spectral transforms [Data is a Fortran program] [Kilbinger et al., 2010, MNRAS]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Illustrations Example Phylogenetic tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out [100 processor days with 4 parameters] [Cornuet et al., 2009, Bioinformatics]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics The ABC method Bayesian setting: target is $\pi(\theta) f(x|\theta)$ When the likelihood $f(x|\theta)$ is not in closed form, likelihood-free rejection technique: ABC algorithm For an observation $y \sim f(y|\theta)$, under the prior $\pi(\theta)$, keep jointly simulating $\theta' \sim \pi(\theta)$, $z \sim f(z|\theta')$, until the auxiliary variable $z$ is equal to the observed value, $z = y$. [Tavaré et al., 1997]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Why does it work?! The proof is trivial: $f(\theta_i) \propto \sum_{z \in \mathcal{D}} \pi(\theta_i)\, f(z|\theta_i)\, \mathbb{I}_y(z) \propto \pi(\theta_i)\, f(y|\theta_i) \propto \pi(\theta_i|y)$. [Accept–Reject 101]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Earlier occurrence ‘Bayesian statistics and Monte Carlo methods are ideally suited to the task of passing many models over one dataset’ [Don Rubin, Annals of Statistics, 1984] Note Rubin (1984) does not promote this algorithm for likelihood-free simulation but frequentist intuition on posterior distributions: parameters from posteriors are more likely to be those that could have generated the data.
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics A as approximative When $y$ is a continuous random variable, the equality $z = y$ is replaced with a tolerance condition, $\rho(y, z) \le \epsilon$, where $\rho$ is a distance Output distributed from $\pi(\theta)\, P_\theta\{\rho(y, z) < \epsilon\} \propto \pi(\theta \,|\, \rho(y, z) < \epsilon)$
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics ABC algorithm Algorithm 2 Likelihood-free rejection sampler for i = 1 to N do repeat generate $\theta'$ from the prior distribution $\pi(\cdot)$ generate $z$ from the likelihood $f(\cdot|\theta')$ until $\rho\{\eta(z), \eta(y)\} \le \epsilon$ set $\theta_i = \theta'$ end for where $\eta(y)$ defines a (not necessarily sufficient) statistic
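Algorithm 2 in a minimal Python sketch, for an assumed conjugate toy model (normal mean with known variance) where the exact posterior is available for comparison; the sample mean is sufficient there, so $\eta$ is a legitimate summary, and $\eta(z)|\theta$ is simulated directly instead of the full pseudo-data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy model: y_1..y_20 ~ N(theta, 1), prior theta ~ N(0, 10)
n_obs = 20
y = rng.normal(1.5, 1.0, size=n_obs)
eta_y = y.mean()                                   # summary statistic eta(y)

N, eps = 200_000, 0.05
theta = rng.normal(0.0, np.sqrt(10.0), size=N)     # theta' ~ pi(.)
eta_z = rng.normal(theta, 1.0 / np.sqrt(n_obs))    # eta(z) | theta' in one draw
abc = theta[np.abs(eta_z - eta_y) <= eps]          # keep if rho{eta(z),eta(y)} <= eps

# Exact posterior N(mu_post, s2_post) for reference, since eta is sufficient here
s2_post = 1.0 / (1.0 / 10.0 + n_obs)
mu_post = s2_post * n_obs * eta_y
```

For a small tolerance the accepted draws match the exact posterior closely, illustrating the approximation $\pi_\epsilon(\theta|y) \approx \pi(\theta|y)$.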
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Output The likelihood-free algorithm samples from the marginal in $z$ of: $\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\, f(z|\theta)\, \mathbb{I}_{A_{\epsilon,y}}(z)}{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta)\, f(z|\theta)\, dz\, d\theta}$, where $A_{\epsilon,y} = \{z \in \mathcal{D} \,|\, \rho(\eta(z), \eta(y)) < \epsilon\}$. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: $\pi_\epsilon(\theta|y) = \int \pi_\epsilon(\theta, z|y)\, dz \approx \pi(\theta|y)$.
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics MA example Back to the MA($q$) model $x_t = \epsilon_t + \sum_{i=1}^q \vartheta_i \epsilon_{t-i}$ Simple prior: uniform over the inverse [real and complex] roots in $Q(u) = 1 - \sum_{i=1}^q \vartheta_i u^i$ under the identifiability conditions
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics MA example Back to the MA($q$) model $x_t = \epsilon_t + \sum_{i=1}^q \vartheta_i \epsilon_{t-i}$ Simple prior: uniform prior over the identifiability zone, e.g. triangle for MA(2)
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics MA example (2) ABC algorithm thus made of 1. picking a new value $(\vartheta_1, \vartheta_2)$ in the triangle 2. generating an iid sequence $(\epsilon_t)_{-q < t \le T}$ 3. producing a simulated series $(x'_t)_{1\le t\le T}$ Distance: basic distance between the series $\rho\big((x'_t)_{1\le t\le T}, (x_t)_{1\le t\le T}\big) = \sum_{t=1}^T (x_t - x'_t)^2$ or distance between summary statistics like the $q$ autocorrelations $\tau_j = \sum_{t=j+1}^T x_t x_{t-j}$
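A sketch of this ABC scheme for MA(2), using the autocovariance-type summaries $\tau_1, \tau_2$ above; the series length, the way the prior triangle is sampled, and the 1% acceptance quantile are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

T = 200

def ma2(th1, th2, e):                 # x_t = e_t + th1 e_{t-1} + th2 e_{t-2}
    return e[2:] + th1 * e[1:-1] + th2 * e[:-2]

def summ(x):                          # tau_j = sum_t x_t x_{t-j}, j = 1, 2
    return np.array([np.sum(x[1:] * x[:-1]), np.sum(x[2:] * x[:-2])])

x_obs = ma2(0.6, 0.2, rng.normal(size=T + 2))     # observed series
s_obs = summ(x_obs)

# Uniform prior on the MA(2) identifiability triangle
N = 20_000
t1 = rng.uniform(-2, 2, N)
t2 = rng.uniform(-1, 1, N)
ok = (t1 + t2 > -1) & (t1 - t2 < 1)
t1, t2 = t1[ok], t2[ok]

dist = np.array([np.sum((summ(ma2(a, b, rng.normal(size=T + 2))) - s_obs)**2)
                 for a, b in zip(t1, t2)])

keep = dist <= np.quantile(dist, 0.01)            # keep the 1% closest simulations
post1, post2 = t1[keep], t2[keep]
```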
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Comparison of distance impact Evaluation of the tolerance on the ABC sample against both distances ( = 100%, 10%, 1%, 0.1%) for an MA(2) model
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics Homonymy The ABC algorithm is not to be confused with the ABC algorithm The Artificial Bee Colony algorithm is a swarm based meta-heuristic algorithm that was introduced by Karaboga in 2005 for optimizing numerical problems. It was inspired by the intelligent foraging behavior of honey bees. The algorithm is specifically based on the model proposed by Tereshko and Loengarov (2005) for the foraging behaviour of honey bee colonies. The model consists of three essential components: employed and unemployed foraging bees, and food sources. The first two components, employed and unemployed foraging bees, search for rich food sources (...) close to their hive. The model also defines two leading modes of behaviour (...): recruitment of foragers to rich food sources resulting in positive feedback and abandonment of poor sources by foragers causing negative feedback. [Karaboga, Scholarpedia]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation ABC basics ABC advances Simulating from the prior is often poor in efficiency Either modify the proposal distribution on $\theta$ to increase the density of $x$'s within the vicinity of $y$... [Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007] ...or view the problem as conditional density estimation and develop techniques to allow for a larger $\epsilon$ [Beaumont et al., 2002] ...or even include $\epsilon$ in the inferential framework [ABC$\mu$] [Ratmann et al., 2009]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-NP Better usage of [prior] simulations by adjustment: instead of throwing away $\theta'$ such that $\rho(\eta(z), \eta(y)) > \epsilon$, replace the $\theta$'s with locally regressed $\theta^* = \theta - \{\eta(z) - \eta(y)\}^{\mathrm{T}} \hat\beta$ [Csilléry et al., TEE, 2010], where $\hat\beta$ is obtained by [NP] weighted least square regression on $(\eta(z) - \eta(y))$ with weights $K_\delta\{\rho(\eta(z), \eta(y))\}$ [Beaumont et al., 2002, Genetics]
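A sketch of this local-linear adjustment on the same assumed conjugate normal toy model, with an Epanechnikov kernel for $K_\delta$ (a common choice in this literature):

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed toy model: theta ~ N(0,10), eta(z)|theta ~ N(theta, 1/20)
n_obs, eta_y = 20, 1.4
N = 50_000
theta = rng.normal(0.0, np.sqrt(10.0), size=N)
eta_z = rng.normal(theta, 1.0 / np.sqrt(n_obs))

d = eta_z - eta_y                                  # regressor eta(z) - eta(y)
delta = 0.5                                        # kernel bandwidth
w = np.maximum(1.0 - (d / delta)**2, 0.0)          # Epanechnikov weights K_delta
keep = w > 0

# Weighted least squares theta ~ a + b*d, then adjust theta* = theta - d * b_hat
X = np.column_stack([np.ones(keep.sum()), d[keep]])
W = w[keep]
a_hat, b_hat = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * theta[keep]))
theta_star = theta[keep] - d[keep] * b_hat

# Exact posterior N(mu, s2) for reference
s2 = 1.0 / (1.0 / 10.0 + n_obs)
mu = s2 * n_obs * eta_y
adj_mean = np.average(theta_star, weights=W)
adj_sd = np.sqrt(np.average((theta_star - adj_mean)**2, weights=W))
```

Even with the deliberately large bandwidth, the adjusted draws match the exact posterior well, which is the point of the regression correction.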
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MCMC Markov chain $(\theta^{(t)})$ created via the transition function $\theta^{(t+1)} = \theta'$ if $\theta' \sim K_\omega(\theta'|\theta^{(t)})$, $x \sim f(x|\theta')$ is such that $x = y$, and $u \sim \mathcal{U}(0,1) \le \frac{\pi(\theta')\, K_\omega(\theta^{(t)}|\theta')}{\pi(\theta^{(t)})\, K_\omega(\theta'|\theta^{(t)})}$; $\theta^{(t+1)} = \theta^{(t)}$ otherwise; it has the posterior $\pi(\theta|y)$ as stationary distribution [Marjoram et al, 2003]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-MCMC (2) Algorithm 3 Likelihood-free MCMC sampler Use Algorithm 2 to get $(\theta^{(0)}, z^{(0)})$ for t = 1 to N do Generate $\theta'$ from $K_\omega(\cdot|\theta^{(t-1)})$, Generate $z'$ from the likelihood $f(\cdot|\theta')$, Generate $u$ from $\mathcal{U}_{[0,1]}$, if $u \le \frac{\pi(\theta')\, K_\omega(\theta^{(t-1)}|\theta')}{\pi(\theta^{(t-1)})\, K_\omega(\theta'|\theta^{(t-1)})}\, \mathbb{I}_{A_{\epsilon,y}}(z')$ then set $(\theta^{(t)}, z^{(t)}) = (\theta', z')$ else $(\theta^{(t)}, z^{(t)}) = (\theta^{(t-1)}, z^{(t-1)})$ end if end for
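Algorithm 3 sketched on the same assumed normal-mean toy model, with a symmetric random-walk kernel $K_\omega$ so the $K_\omega$ ratio cancels from the acceptance probability:

```python
import numpy as np

rng = np.random.default_rng(7)

n_obs, eta_y, eps = 20, 1.4, 0.1
log_prior = lambda th: -0.5 * th**2 / 10.0       # theta ~ N(0,10), unnormalised

def eta_sim(th):                                  # eta(z) for z ~ f(.|th)
    return rng.normal(th, 1.0, size=n_obs).mean()

T_mc, omega = 50_000, 0.5
chain = np.empty(T_mc)
th = eta_y                                        # plausible starting point
for t in range(T_mc):
    prop = th + omega * rng.normal()              # theta' ~ K_omega(.|theta)
    # accept iff z' lands in the tolerance ball AND the prior ratio test passes
    if (abs(eta_sim(prop) - eta_y) <= eps
            and np.log(rng.uniform()) <= log_prior(prop) - log_prior(th)):
        th = prop
    chain[t] = th

s2 = 1.0 / (1.0 / 10.0 + n_obs)
mu = s2 * n_obs * eta_y                           # exact posterior mean, for reference
```

Note that the likelihood never appears: the simulation of the pseudo-data plays its role, as shown by the acceptance-ratio computation on the next slide.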
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Why does it work? Acceptance probability that does not involve the calculation of the likelihood: $\frac{\pi_\epsilon(\theta', z'|y)}{\pi_\epsilon(\theta^{(t-1)}, z^{(t-1)}|y)} \times \frac{K_\omega(\theta^{(t-1)}|\theta')\, f(z^{(t-1)}|\theta^{(t-1)})}{K_\omega(\theta'|\theta^{(t-1)})\, f(z'|\theta')} = \frac{\pi(\theta')\, f(z'|\theta')\, \mathbb{I}_{A_{\epsilon,y}}(z')}{\pi(\theta^{(t-1)})\, f(z^{(t-1)}|\theta^{(t-1)})\, \mathbb{I}_{A_{\epsilon,y}}(z^{(t-1)})} \times \frac{K_\omega(\theta^{(t-1)}|\theta')\, f(z^{(t-1)}|\theta^{(t-1)})}{K_\omega(\theta'|\theta^{(t-1)})\, f(z'|\theta')} = \frac{\pi(\theta')\, K_\omega(\theta^{(t-1)}|\theta')}{\pi(\theta^{(t-1)})\, K_\omega(\theta'|\theta^{(t-1)})}\, \mathbb{I}_{A_{\epsilon,y}}(z')$.
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS] Use of a joint density $f(\theta, \epsilon|y) \propto \xi(\epsilon|y, \theta) \times \pi_\theta(\theta) \times \pi_\epsilon(\epsilon)$, where $y$ is the data and $\xi(\epsilon|y, \theta)$ is the prior predictive density of $\rho(\eta(z), \eta(y))$ given $\theta$ and $x$ when $z \sim f(z|\theta)$ Warning! Replacement of $\xi(\epsilon|y, \theta)$ with a non-parametric kernel approximation.
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ details Multidimensional distances $\rho_k$ ($k = 1, \dots, K$) and errors $\epsilon_k = \rho_k(\eta_k(z), \eta_k(y))$, with $\epsilon_k \sim \xi_k(\epsilon|y, \theta) \approx \hat\xi_k(\epsilon|y, \theta) = \frac{1}{B h_k} \sum_b K\big[\{\epsilon_k - \rho_k(\eta_k(z_b), \eta_k(y))\}/h_k\big]$, then used in replacing $\xi(\epsilon|y, \theta)$ with $\min_k \hat\xi_k(\epsilon|y, \theta)$ ABCµ involves acceptance probability $\frac{\pi(\theta', \epsilon')\, q(\theta', \theta)\, q(\epsilon', \epsilon)}{\pi(\theta, \epsilon)\, q(\theta, \theta')\, q(\epsilon, \epsilon')}\, \frac{\min_k \hat\xi_k(\epsilon'|y, \theta')}{\min_k \hat\xi_k(\epsilon|y, \theta)}$
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ multiple errors [© Ratmann et al., PNAS, 2009]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABCµ for model choice [© Ratmann et al., PNAS, 2009]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Questions about ABCµ For each model under comparison, the marginal posterior on $\epsilon$ is used to assess the fit of the model (HPD includes 0 or not). Is the data informative about $\epsilon$? [Identifiability] How is the prior $\pi(\epsilon)$ impacting the comparison? How is using both $\xi(\epsilon|x_0, \theta)$ and $\pi_\epsilon(\epsilon)$ compatible with a standard probability model? [remindful of Wilkinson] Where is the penalisation for complexity in the model comparison? [X, Mengersen & Chen, 2010, PNAS]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC Another sequential version producing a sequence of Markov transition kernels $K_t$ and of samples $(\theta_1^{(t)}, \dots, \theta_N^{(t)})$ ($1 \le t \le T$) ABC-PRC Algorithm 1. Pick a $\theta'$ at random among the previous $\theta_i^{(t-1)}$'s, with probabilities $\omega_i^{(t-1)}$ ($1 \le i \le N$). 2. Generate $\theta_i^{(t)} \sim K_t(\theta|\theta')$, $x \sim f(x|\theta_i^{(t)})$. 3. Check that $\rho(x, y) < \epsilon$, otherwise start again. [Sisson et al., 2007]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Why PRC? Partial rejection control: resample from a population of weighted particles by pruning away particles with weights below a threshold $C$, replacing them by new particles obtained by propagating an existing particle by an SMC step and modifying the weights accordingly. [Liu, 2001] PRC justification in ABC-PRC: Suppose we then implement the PRC algorithm for some $c > 0$ such that only identically zero weights are smaller than $c$ Trouble is, there is no such $c$...
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC weight Probability $\omega_i^{(t)}$ computed as $\omega_i^{(t)} \propto \pi(\theta_i^{(t)})\, L_{t-1}(\theta'|\theta_i^{(t)})\, \big\{\pi(\theta')\, K_t(\theta_i^{(t)}|\theta')\big\}^{-1}$, where $L_{t-1}$ is an arbitrary transition kernel. In case $L_{t-1}(\theta'|\theta) = K_t(\theta|\theta')$, all weights are equal under a uniform prior. Inspired from Del Moral et al. (2006), who use backward kernels $L_{t-1}$ in SMC to achieve unbiasedness
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-PRC bias Lack of unbiasedness of the method Joint density of the accepted pair $(\theta^{(t-1)}, \theta^{(t)})$ proportional to $\pi(\theta^{(t-1)}|y)\, K_t(\theta^{(t)}|\theta^{(t-1)})\, f(y|\theta^{(t)})$. For an arbitrary function $h(\theta)$, $E[\omega_t h(\theta^{(t)})]$ is proportional to $\iint h(\theta^{(t)})\, \frac{\pi(\theta^{(t)})\, L_{t-1}(\theta^{(t-1)}|\theta^{(t)})}{\pi(\theta^{(t-1)})\, K_t(\theta^{(t)}|\theta^{(t-1)})}\, \pi(\theta^{(t-1)}|y)\, K_t(\theta^{(t)}|\theta^{(t-1)})\, f(y|\theta^{(t)})\, d\theta^{(t-1)} d\theta^{(t)} \propto \iint h(\theta^{(t)})\, \frac{\pi(\theta^{(t)})\, L_{t-1}(\theta^{(t-1)}|\theta^{(t)})}{\pi(\theta^{(t-1)})\, K_t(\theta^{(t)}|\theta^{(t-1)})}\, \pi(\theta^{(t-1)})\, f(y|\theta^{(t-1)})\, K_t(\theta^{(t)}|\theta^{(t-1)})\, f(y|\theta^{(t)})\, d\theta^{(t-1)} d\theta^{(t)} \propto \int h(\theta^{(t)})\, \pi(\theta^{(t)}|y)\, \left\{ \int L_{t-1}(\theta^{(t-1)}|\theta^{(t)})\, f(y|\theta^{(t-1)})\, d\theta^{(t-1)} \right\} d\theta^{(t)}$.
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup A mixture example (1) Toy model of Sisson et al. (2007): if $\theta \sim \mathcal{U}(-10, 10)$, $x|\theta \sim 0.5\, \mathcal{N}(\theta, 1) + 0.5\, \mathcal{N}(\theta, 1/100)$, then the posterior distribution associated with $y = 0$ is the normal mixture $\theta|y = 0 \sim 0.5\, \mathcal{N}(0, 1) + 0.5\, \mathcal{N}(0, 1/100)$ restricted to $[-10, 10]$. Furthermore, the true target is available as $\pi(\theta \,|\, |x| < \epsilon) \propto \Phi(\epsilon - \theta) - \Phi(-\epsilon - \theta) + \Phi(10(\epsilon - \theta)) - \Phi(-10(\epsilon + \theta))$.
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup “Ugly, squalid graph...” [figure: comparison of $\tau = 0.15$ and $\tau = 1/0.15$ in $K_t$]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup A PMC version Use of the same kernel idea as ABC-PRC, but with IS correction Generate a sample at iteration $t$ by $\hat\pi_t(\theta^{(t)}) \propto \sum_{j=1}^N \omega_j^{(t-1)}\, K_t(\theta^{(t)}|\theta_j^{(t-1)})$ modulo acceptance of the associated $x_t$, and use an importance weight associated with an accepted simulation $\theta_i^{(t)}$: $\omega_i^{(t)} \propto \pi(\theta_i^{(t)}) \big/ \hat\pi_t(\theta_i^{(t)})$. Still likelihood-free [Beaumont et al., 2008, arXiv:0805.2256]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup The ABC-PMC algorithm Given a decreasing sequence of approximation levels $\epsilon_1 \ge \dots \ge \epsilon_T$: 1. At iteration $t = 1$: For $i = 1, \dots, N$, simulate $\theta_i^{(1)} \sim \pi(\theta)$ and $x \sim f(x|\theta_i^{(1)})$ until $\rho(x, y) < \epsilon_1$; set $\omega_i^{(1)} = 1/N$. Take $\tau^2$ as twice the empirical variance of the $\theta_i^{(1)}$'s. 2. At iteration $2 \le t \le T$: For $i = 1, \dots, N$, repeat: pick $\theta_i^\star$ from the $\theta_j^{(t-1)}$'s with probabilities $\omega_j^{(t-1)}$; generate $\theta_i^{(t)}|\theta_i^\star \sim \mathcal{N}(\theta_i^\star, \sigma_t^2)$ and $x \sim f(x|\theta_i^{(t)})$; until $\rho(x, y) < \epsilon_t$. Set $\omega_i^{(t)} \propto \pi(\theta_i^{(t)}) \big/ \sum_{j=1}^N \omega_j^{(t-1)}\, \varphi\big(\sigma_t^{-1}\{\theta_i^{(t)} - \theta_j^{(t-1)}\}\big)$. Take $\tau_{t+1}^2$ as twice the weighted empirical variance of the $\theta_i^{(t)}$'s
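The ABC-PMC scheme above in a compact Python sketch, on the assumed normal-mean toy model used earlier; the tolerance schedule and population size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)

n_obs, eta_y = 20, 1.4
prior_sd = np.sqrt(10.0)                          # theta ~ N(0, 10)
eps_seq = [1.0, 0.5, 0.2, 0.1]
N = 1_000

def eta_sim(th):
    return rng.normal(th, 1.0, size=n_obs).mean()

# t = 1: plain rejection from the prior at tolerance eps_1
theta = np.empty(N)
for i in range(N):
    while True:
        th = rng.normal(0.0, prior_sd)
        if abs(eta_sim(th) - eta_y) <= eps_seq[0]:
            theta[i] = th
            break
w = np.full(N, 1.0 / N)
tau2 = 2.0 * np.var(theta)                        # twice the empirical variance

for eps in eps_seq[1:]:
    new = np.empty(N)
    for i in range(N):
        while True:
            th0 = theta[rng.choice(N, p=w)]       # pick from previous population
            th = rng.normal(th0, np.sqrt(tau2))   # move with N(th0, tau2) kernel
            if abs(eta_sim(th) - eta_y) <= eps:
                new[i] = th
                break
    # importance weight: prior over the kernel mixture built on the old population
    kern = np.exp(-0.5 * (new[:, None] - theta[None, :])**2 / tau2)
    nw = np.exp(-0.5 * new**2 / 10.0) / (kern @ w)
    theta, w = new, nw / nw.sum()
    m = np.average(theta, weights=w)
    tau2 = 2.0 * np.average((theta - m)**2, weights=w)

abc_mean = np.average(theta, weights=w)
```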
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Sequential Monte Carlo SMC is a simulation technique to approximate a sequence of related probability distributions $\pi_n$, with $\pi_0$ “easy” and $\pi_T$ as target. Iterated IS as in PMC: particles moved from time $n-1$ to time $n$ via kernel $K_n$ and use of a sequence of extended targets $\tilde\pi_n$, $\tilde\pi_n(z_{0:n}) = \pi_n(z_n) \prod_{j=0}^{n-1} L_j(z_{j+1}, z_j)$, where the $L_j$'s are backward Markov kernels [check that $\pi_n(z_n)$ is a marginal] [Del Moral, Doucet & Jasra, Series B, 2006]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Sequential Monte Carlo (2) Algorithm 4 SMC sampler sample $z_i^{(0)} \sim \gamma_0(x)$ ($i = 1, \dots, N$) compute weights $w_i^{(0)} = \pi_0(z_i^{(0)}) / \gamma_0(z_i^{(0)})$ for t = 1 to N do if $\mathrm{ESS}(w^{(t-1)}) < N_T$ then resample $N$ particles $z^{(t-1)}$ and set weights to 1 end if generate $z_i^{(t)} \sim K_t(z_i^{(t-1)}, \cdot)$ and set weights to $w_i^{(t)} = w_i^{(t-1)}\, \frac{\pi_t(z_i^{(t)})\, L_{t-1}(z_i^{(t)}, z_i^{(t-1)})}{\pi_{t-1}(z_i^{(t-1)})\, K_t(z_i^{(t-1)}, z_i^{(t)})}$ end for
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-SMC [Del Moral, Doucet & Jasra, 2009] True derivation of an SMC-ABC algorithm Use of a kernel $K_n$ associated with target $\pi_{\epsilon_n}$ and derivation of the backward kernel $L_{n-1}(z, z') = \frac{\pi_{\epsilon_n}(z')\, K_n(z', z)}{\pi_{\epsilon_n}(z)}$ Update of the weights $w_{in} \propto w_{i(n-1)}\, \frac{\sum_{m=1}^M \mathbb{I}_{A_{\epsilon_n}}(x_{in}^m)}{\sum_{m=1}^M \mathbb{I}_{A_{\epsilon_{n-1}}}(x_{i(n-1)}^m)}$ when $x_{in}^m \sim K(x_{i(n-1)}, \cdot)$
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup ABC-SMCM Modification: makes $M$ repeated simulations of the pseudo-data $z$ given the parameter, rather than using a single [$M = 1$] simulation, leading to a weight proportional to the number of accepted $z_i$'s, $\omega(\theta) = \frac{1}{M} \sum_{i=1}^M \mathbb{I}_{\rho(\eta(y), \eta(z_i)) < \epsilon}$ [limit in $M$ means exact simulation from the (tempered) target]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Properties of ABC-SMC The ABC-SMC method properly uses a backward kernel $L(z, z')$ to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. Update of importance weights is reduced to the ratio of the proportions of surviving particles. Major assumption: the forward kernel $K$ is supposed to be invariant against the true target [tempered version of the true posterior] Adaptivity in the ABC-SMC algorithm is only found in the on-line construction of the thresholds $\epsilon_t$, decreasing slowly enough to keep a large number of accepted transitions
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup A mixture example (2) Recovery of the target, whether using a fixed standard deviation of $\tau = 0.15$ or $\tau = 1/0.15$, or a sequence of adaptive $\tau_t$'s. [figure]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Wilkinson’s exact BC Wilkinson (2008) replaces the ABC approximation error (i.e. non-zero tolerance) with an exact simulation from a controlled approximation to the target, a convolution of the true posterior with an arbitrary kernel function: $\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)}{\int \pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)\, dz\, d\theta}$, where $K_\epsilon$ is a kernel parameterised by a bandwidth $\epsilon$. Requires $K_\epsilon$ to be bounded True approximation error never assessed Requires a modification of the standard ABC algorithms
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Semi-automatic ABC Fearnhead and Prangle (2010) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal. ABC is then considered from a purely inferential viewpoint and calibrated for estimation purposes. Use of a randomised (or ‘noisy’) version of the summary statistics, η̃(y) = η(y) + τε. Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic. [Calibration constraint: ABC approximation with the same posterior mean as the true randomised posterior.]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Alphabet soup Summary statistics Optimality of the posterior expectations of the parameters of interest as summary statistics! Use of the standard quadratic loss function (θ − θ0)T A(θ − θ0).
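In practice the posterior expectation E[θ|y] is unknown, so it is estimated by regressing θ on features of pilot simulations, and the fitted value is used as the summary statistic. A toy sketch (model, features, and settings all chosen for illustration): prior N(0, 1), n = 10 observations from N(θ, 1), where E[θ|x] = n x̄/(n + 1) is known, so the learned summary can be checked against it.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sim, n_obs = 20_000, 10

# pilot run: theta from the N(0,1) prior, datasets of n_obs points from N(theta, 1)
theta = rng.normal(0.0, 1.0, n_sim)
x = rng.normal(theta[:, None], 1.0, (n_sim, n_obs))

# regress theta on candidate features of the data; the fitted value estimates
# E[theta | x] and becomes the one-dimensional summary statistic
feats = np.column_stack([np.ones(n_sim), x.mean(1), np.median(x, 1), x.var(1)])
beta, *_ = np.linalg.lstsq(feats, theta, rcond=None)

def summary(data):
    f = np.array([1.0, data.mean(), np.median(data), data.var()])
    return float(f @ beta)

# for this model E[theta | x] = n xbar / (n + 1), so for constant data at 1.1
# the summary should be close to 10 * 1.1 / 11 = 1.0
s = summary(np.full(n_obs, 1.1))
```

Since the sample mean is sufficient here, the regression should put essentially all its weight on the mean/median pair and none on the variance feature.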
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Calibration of ABC Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]. Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test. Does not take into account the sequential nature of the tests. Depends on the parameterisation. Order of inclusion matters.
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Calibration of ABC A connected Monte Carlo study of the number of pseudo-datasets per simulated parameter: repeating simulations does not improve the approximation; the tolerance level does not seem to be highly influential; the choice of distance / summary statistics / calibration factors is paramount to a successful approximation; ABC-SMC outperforms ABC-MCMC. [Mckinley, Cook, Deardon, 2009]
    • MCMC and Likelihood-free Methods Approximate Bayesian computation Calibration of ABC A Brave New World?!
    • MCMC and Likelihood-free Methods ABC for model choice ABC for model choice Computational issues in Bayesian statistics The Metropolis-Hastings Algorithm The Gibbs Sampler Population Monte Carlo Approximate Bayesian computation ABC for model choice Model choice Gibbs random fields
    • MCMC and Likelihood-free Methods ABC for model choice Model choice Bayesian model choice Several models M1, M2, . . . are considered simultaneously for a dataset y and the model index M is part of the inference. Use of a prior distribution π(M = m), plus a prior distribution on the parameter conditional on the value m of the model index, πm(θm). Goal is to derive the posterior distribution of M, a challenging computational target when models are complex.
    • MCMC and Likelihood-free Methods ABC for model choice Model choice Generic ABC for model choice
    Algorithm 5 Likelihood-free model choice sampler (ABC-MC)
    for t = 1 to T do
      repeat
        Generate m from the prior π(M = m)
        Generate θm from the prior πm(θm)
        Generate z from the model fm(z|θm)
      until ρ{η(z), η(y)} < ε
      Set m(t) = m and θ(t) = θm
    end for
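A minimal runnable version of this sampler, vectorised and with two made-up models for illustration (a common location parameter under two different scales, so that the log standard deviation discriminates between them — none of this comes from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, 30)                    # observed data, true model m = 0
s_obs = np.array([y.mean(), np.log(y.std())])   # summary statistics eta(y)

T = 50_000
m = rng.integers(0, 2, T)                       # model index from a uniform prior
theta = rng.normal(0.0, 2.0, T)                 # same N(0, 2^2) prior under both models
scale = np.where(m == 0, 1.0, 3.0)              # model 0: sd 1, model 1: sd 3
z = rng.normal(theta[:, None], scale[:, None], (T, 30))
s_z = np.column_stack([z.mean(1), np.log(z.std(1))])
dist = np.abs(s_z - s_obs).sum(1)               # L1 distance rho on summaries
eps = np.quantile(dist, 0.01)                   # tolerance = 1% distance quantile
kept = m[dist <= eps]
p0 = (kept == 0).mean()                         # ABC estimate of P(M = 0 | y)
```

Since the data are simulated from the unit-scale model, the frequency estimate `p0` should be close to one.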
    • MCMC and Likelihood-free Methods ABC for model choice Model choice ABC estimates Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m, (1/T) Σ_{t=1}^{T} I(m(t) = m). Issues with implementation: should tolerances ε be the same for all models? should summary statistics vary across models (incl. their dimension)? should the distance measure ρ vary as well? Extension to a weighted polychotomous logistic regression estimate of π(M = m|y), with non-parametric kernel weights. [Cornuet et al., DIYABC, 2009]
    • MCMC and Likelihood-free Methods ABC for model choice Model choice The Great ABC controversy On-going controversy in phylogeographic genetics about the validity of using ABC for testing. Against: Templeton (2008, 2009, 2010a, 2010b, 2010c) argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) Replies: Fagundes et al. (2008), Beaumont et al. (2010), Berger et al. (2010), Csilléry et al. (2010) point out that the criticisms are addressed at [Bayesian] model-based inference and have nothing to do with ABC...
    • MCMC and Likelihood-free Methods ABC for model choice Gibbs random fields Gibbs random fields Gibbs distribution The rv y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if f(y) = (1/Z) exp{− Σ_{c∈C} Vc(yc)}, where Z is the normalising constant, C is the set of cliques of G, and Vc is any function, also called a potential. U(y) = Σ_{c∈C} Vc(yc) is the energy function. [Warning: Z is usually unavailable in closed form]
    • MCMC and Likelihood-free Methods ABC for model choice Gibbs random fields Potts model Potts model Vc(y) is of the form Vc(y) = θS(y) = θ Σ_{l∼i} δ_{yl = yi}, where l∼i denotes a neighbourhood structure. In most realistic settings, the summation Zθ = Σ_{x∈X} exp{θT S(x)} involves too many terms to be manageable, and numerical approximations cannot always be trusted. [Cucala, Marin, CPR & Titterington, 2009]
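The combinatorial explosion of Zθ is easy to exhibit: on a tiny grid it can still be computed by brute force. A sketch for a two-colour Potts (Ising-type) model on a 3×3 grid with 4-neighbour structure (grid size and colour count are illustrative):

```python
import numpy as np
from itertools import product

# 3x3 grid, cells indexed 0..8 row-major; 4-neighbour (horizontal + vertical) edges
edges = [(3*i + j, 3*i + j + 1) for i in range(3) for j in range(2)] + \
        [(3*i + j, 3*i + j + 3) for i in range(2) for j in range(3)]

def S(x):
    # sufficient statistic: number of neighbour pairs sharing the same colour
    return sum(x[a] == x[b] for a, b in edges)

def Z(theta):
    # exhaustive sum over all 2^9 = 512 configurations; a 16x16 binary image
    # would already require 2^256 terms, hence the intractability of Z_theta
    return sum(np.exp(theta * S(x)) for x in product([0, 1], repeat=9))

print(Z(0.0))   # theta = 0: every configuration has weight 1, so Z = 512
```

Even this toy grid has 512 terms; the count doubles with every extra cell, which is why likelihood-free methods are attractive for these models.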
    • MCMC and Likelihood-free Methods ABC for model choice Model choice via ABC Bayesian Model Choice Comparing a model with potential S0 taking values in Rp0 versus a model with potential S1 taking values in Rp1 can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space, Bm0/m1(x) = ∫ exp{θ0T S0(x)}/Z_{θ0,0} π0(dθ0) / ∫ exp{θ1T S1(x)}/Z_{θ1,1} π1(dθ1). Use of Jeffreys’ scale to select the most appropriate model.
    • MCMC and Likelihood-free Methods ABC for model choice Model choice via ABC Neighbourhood relations Choice to be made between M neighbourhood relations i ∼m i′ (0 ≤ m ≤ M − 1), with Sm(x) = Σ_{i ∼m i′} I{xi = xi′}, driven by the posterior probabilities of the models.
    • MCMC and Likelihood-free Methods ABC for model choice Model choice via ABC Model index Formalisation via a model index M that appears as a new parameter, with prior distribution π(M = m) and π(θ|M = m) = πm(θm). Computational target: P(M = m|x) ∝ ∫_{Θm} fm(x|θm) πm(θm) dθm π(M = m).
    • MCMC and Likelihood-free Methods ABC for model choice Model choice via ABC Sufficient statistics By definition, if S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), then P(M = m|x) = P(M = m|S(x)). For each model m, with its own sufficient statistic Sm(·), S(·) = (S0(·), . . . , SM−1(·)) is also sufficient. For Gibbs random fields, x|M = m ∼ fm(x|θm) = fm1(x|S(x)) fm2(S(x)|θm) = fm2(S(x)|θm) / n(S(x)), where n(S(x)) = #{x̃ ∈ X : S(x̃) = S(x)}. S(x) is therefore also sufficient for the joint parameters. [Specific to Gibbs random fields!]
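The decomposition fm(x|θm) = fm2(S(x)|θm)/n(S(x)) can be checked numerically on a tiny field: conditional on S(x), the distribution is uniform over the n(S(x)) matching configurations. A sketch on an illustrative two-colour 3×3 field with 4-neighbour structure (θ = 0.7 is arbitrary):

```python
import numpy as np
from itertools import product
from collections import Counter

# 3x3 grid, cells 0..8 row-major; 4-neighbour edges
edges = [(3*i + j, 3*i + j + 1) for i in range(3) for j in range(2)] + \
        [(3*i + j, 3*i + j + 3) for i in range(2) for j in range(3)]

def S(x):                        # sufficient statistic: matching neighbour pairs
    return sum(x[a] == x[b] for a, b in edges)

theta = 0.7
configs = list(product([0, 1], repeat=9))
w = {x: float(np.exp(theta * S(x))) for x in configs}
Z = sum(w.values())

n = Counter(S(x) for x in configs)   # n(s) = #{x in X : S(x) = s}
f2 = Counter()                       # f2(s | theta) = P(S(x) = s), by summation
for x in configs:
    f2[S(x)] += w[x] / Z

# every configuration with the same S value has the same probability,
# equal to f2(S(x) | theta) / n(S(x))
uniform = all(abs(w[x] / Z - f2[S(x)] / n[S(x)]) < 1e-12 for x in configs)
print(uniform, sum(n.values()))      # True 512
```

The check holds because the density depends on x only through S(x), which is precisely the Gibbs-random-field feature the slide points out.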
    • MCMC and Likelihood-free Methods ABC for model choice Model choice via ABC ABC model choice
    Algorithm ABC-MC
    Generate m∗ from the prior π(M = m).
    Generate θ∗m∗ from the prior πm∗(·).
    Generate x∗ from the model fm∗(·|θ∗m∗).
    Compute the distance ρ(S(x0), S(x∗)).
    Accept (θ∗m∗, m∗) if ρ(S(x0), S(x∗)) < ε.
    Note: when ε = 0 the algorithm is exact.
    • MCMC and Likelihood-free Methods ABC for model choice Model choice via ABC ABC approximation to the Bayes factor Frequency ratio: B̂F_{m0/m1}(x0) = [P̂(M = m0|x0) / P̂(M = m1|x0)] × [π(M = m1) / π(M = m0)] = [#{m(i)∗ = m0} / #{m(i)∗ = m1}] × [π(M = m1) / π(M = m0)], replaced with B̂F_{m0/m1}(x0) = [(1 + #{m(i)∗ = m0}) / (1 + #{m(i)∗ = m1})] × [π(M = m1) / π(M = m0)] to avoid indeterminacy (which is also a Bayes estimate).
    • MCMC and Likelihood-free Methods ABC for model choice Illustrations Toy example iid Bernoulli model versus two-state first-order Markov chain, i.e. f0(x|θ0) = exp(θ0 Σ_{i=1}^{n} I{xi = 1}) / {1 + exp(θ0)}^n versus f1(x|θ1) = (1/2) exp(θ1 Σ_{i=2}^{n} I{xi = xi−1}) / {1 + exp(θ1)}^{n−1}, with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
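A minimal simulation of the ABC-MC sampler on this toy example, using the joint sufficient statistic (S0, S1) = (Σ I{xi = 1}, Σ I{xi = xi−1}). The observed sequence, sample size, number of proposals, and L1 tolerance below are all chosen for illustration (the slides use n larger and 4 · 10^6 proposals):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 50, 25_000                                     # T proposals per model
y = np.r_[np.zeros(25), np.ones(25)]                  # illustrative observed sequence
s_obs = np.array([y.sum(), (y[1:] == y[:-1]).sum()])  # (S0, S1) = (25, 48)

# model 0: iid Bernoulli with logit theta0 ~ U(-5, 5)
t0 = rng.uniform(-5, 5, T)
x0 = rng.uniform(size=(T, n)) < (1 / (1 + np.exp(-t0)))[:, None]
S0 = np.column_stack([x0.sum(1), (x0[:, 1:] == x0[:, :-1]).sum(1)])

# model 1: two-state Markov chain, theta1 ~ U(0, 6),
# persistence P(x_i = x_{i-1}) = exp(theta1)/(1 + exp(theta1))
t1 = rng.uniform(0, 6, T)
flips = rng.uniform(size=(T, n - 1)) >= (1 / (1 + np.exp(-t1)))[:, None]
x1 = (rng.integers(0, 2, (T, 1)) + np.c_[np.zeros((T, 1)), np.cumsum(flips, 1)]) % 2
S1 = np.column_stack([x1.sum(1), (x1[:, 1:] == x1[:, :-1]).sum(1)])

eps = 6                                               # arbitrary L1 tolerance
acc0 = np.abs(S0 - s_obs).sum(1) <= eps
acc1 = np.abs(S1 - s_obs).sum(1) <= eps
post_m1 = acc1.sum() / max(acc0.sum() + acc1.sum(), 1)
bf10 = (1 + acc1.sum()) / (1 + acc0.sum())            # regularised BF estimate
```

The chosen sequence mixes a balanced frequency of ones (S0 = 25) with strong persistence (S1 = 48), a combination the iid Bernoulli model cannot reproduce, so the estimated posterior probability of the Markov model should be close to one.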
    • MCMC and Likelihood-free Methods ABC for model choice Illustrations Toy example (2) [Two scatter plots of the estimated versus true log Bayes factors.] (left) Comparison of the true BF_{m0/m1}(x0) with B̂F_{m0/m1}(x0) (in logs) over 2,000 simulations and 4 · 10^6 proposals from the prior. (right) Same when using a tolerance ε corresponding to the 1% quantile on the distances.
    • MCMC and Likelihood-free Methods ABC for model choice Illustrations Protein folding Superposition of the native structure (grey) with the ST1 structure (red), the ST2 structure (orange), the ST3 structure (green), and the DT structure (blue).
    • MCMC and Likelihood-free Methods ABC for model choice Illustrations Protein folding (2) Characteristics of the dataset:

      Structure      % seq. Id.   TM-score   FROST score
      1i5nA (ST1)    32           0.86       75.3
      1ls1A1 (ST2)   5            0.42       8.9
      1jr8A (ST3)    4            0.24       8.9
      1s7oA (DT)     10           0.08       7.8

    % seq. Id.: percentage of identity with the query sequence. TM-score: similarity between predicted and native structure (uncertainty between 0.17 and 0.4). FROST score: quality of alignment of the query onto the candidate structure (uncertainty between 7 and 9).
    • MCMC and Likelihood-free Methods ABC for model choice Illustrations Protein folding (3)

                      NS/ST1   NS/ST2   NS/ST3   NS/DT
      BF              1.34     1.22     2.42     2.76
      P(M = NS|x0)    0.573    0.551    0.708    0.734

    Estimates of the Bayes factors between model NS and models ST1, ST2, ST3, and DT, and corresponding posterior probabilities of model NS, based on an ABC-MC algorithm using 1.2 · 10^6 simulations and a tolerance equal to the 1% quantile of the distances.