Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

These are the slides for my conference talk at the 2013 WSC, in the "Jacob Bernoulli's 'Ars Conjectandi' and the emergence of probability" session organised by Adam Jakubowski.

  1. An [under]view of Monte Carlo methods, from importance sampling to MCMC, to ABC (& kudos to Bernoulli). Christian P. Robert, Université Paris-Dauphine, University of Warwick, & CREST, Paris. 2013 WSC, Hong Kong. bayesianstatistics@gmail.com
  2. Outline: Bernoulli, Jakob (1654–1705); MCMC connected steps; Metropolis-Hastings revisited; Approximate Bayesian computation (ABC).
  3. Bernoulli as founding father of Monte Carlo methods. The weak law of large numbers (or Bernoulli's [Golden] theorem) provides the justification for Monte Carlo approximations: if $x_1, \ldots, x_n$ are i.i.d. rv's with density $f$, $$\lim_{n\to\infty} \frac{h(x_1) + \cdots + h(x_n)}{n} = \int_{\mathcal{X}} h(x)\,f(x)\,\mathrm{d}x.$$ Stigler's Law of Eponymy: Cardano (1501–1576) first stated the result.
  4–5. Bernoulli as founding father of Monte Carlo methods. ...and indeed $\{h(x_1) + \cdots + h(x_n)\}/n$ converges to $I = \int_{\mathcal{X}} h(x)\,f(x)\,\mathrm{d}x$, meaning that, provided we can simulate $x_i \sim f(\cdot)$ long and fast "enough", the empirical mean will be a good "enough" approximation to $I$.
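As a concrete illustration (my addition, not part of the slides), here is a minimal Monte Carlo sketch; the choice of $h(x) = x^2$ and a standard normal $f$, with exact value $I = 1$, is purely illustrative.

```python
# Minimal sketch of Bernoulli's LLN at work: the empirical mean of h(x_i)
# over i.i.d. draws x_i ~ f approximates I = E_f[h(X)].
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(h, sampler, n):
    """(h(x_1) + ... + h(x_n)) / n over i.i.d. draws from f."""
    x = sampler(n)
    return h(x).mean()

# h(x) = x^2 with f = N(0, 1), so the exact integral is 1.
for n in (10**2, 10**4, 10**6):
    print(n, mc_estimate(lambda x: x**2, rng.standard_normal, n))
```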
  6. Early implementations of the LLN. While Jakob Bernoulli himself apparently did not engage in simulation, Buffon (1707–1788) resorted to a (not-yet-Monte-Carlo) experiment in 1735 to estimate the value of the Saint Petersburg game (even though he did not perform a similar experiment for estimating π). [Stigler, STS, 1991; Stigler, JRSS A, 2010]
  7. Early implementations of the LLN. De Forest (1834–1888) found the median of a log-Cauchy distribution, using normal simulations approximated to the second digit (in 1876). [Stigler, STS, 1991; Stigler, JRSS A, 2010]
  8. Early implementations of the LLN. These pioneers were followed closely by the ubiquitous Galton, using "normal" dice in 1890, after developing the Quincunx, used both for checking the CLT and for simulating from a posterior distribution as early as 1877. [Stigler, STS, 1991; Stigler, JRSS A, 2010]
  9–11. Importance sampling. When focussing on integral approximation, a very loose principle, in that any proposal distribution with pdf $q(\cdot)$ leads to the alternative representation $$I = \int_{\mathcal{X}} h(x)\,\{f/q\}(x)\,q(x)\,\mathrm{d}x.$$ Principle of importance: generate an iid sample $x_1, \ldots, x_n \sim q(\cdot)$ and estimate $I$ by $$\hat{I}_{IS} = n^{-1} \sum_{i=1}^{n} h(x_i)\,\{f/q\}(x_i),$$ ...provided $q$ is positive on the right set.
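A sketch of this estimator (again my addition): the standard normal target $f$, Student-$t_3$ proposal $q$, and $h(x) = x^2$ are illustrative choices, with the heavier $t$ tails guaranteeing a finite variance.

```python
# Importance sampling: draw from a proposal q and reweight by f/q.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def is_estimate(h, f_pdf, q_pdf, q_sample, n):
    x = q_sample(n)
    w = f_pdf(x) / q_pdf(x)        # importance weights {f/q}(x_i)
    return np.mean(h(x) * w)       # n^{-1} sum_i h(x_i) {f/q}(x_i)

# Target f = N(0,1), proposal q = t_3, h(x) = x^2: exact value is 1.
print(is_estimate(lambda x: x**2,
                  stats.norm.pdf,
                  lambda x: stats.t.pdf(x, df=3),
                  lambda n: stats.t.rvs(df=3, size=n, random_state=rng),
                  10**5))
```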
  12. Things aren't all rosy... The LLN is not sufficient to justify Monte Carlo methods: if $n^{-1} \sum_{i=1}^{n} h(x_i)\,\{f/q\}(x_i)$ has an infinite variance, the estimator $\hat{I}_{IS}$ is useless. [Figure: importance sampling estimation of $P(2 \le Z \le 6)$, where $Z$ is Cauchy and the importance proposal is normal, compared with the exact value, 0.095.]
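The slide's cautionary experiment is easy to replicate; in this sketch (my reconstruction, with arbitrary seed and sample size) the Cauchy-over-normal weights are unbounded, so the running average keeps jumping even though the LLN formally applies.

```python
# Estimating P(2 <= Z <= 6) for Z ~ Cauchy with a N(0,1) importance
# proposal: the weights f/q are unbounded, hence infinite variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 10**5
x = rng.standard_normal(n)                       # proposal draws
w = stats.cauchy.pdf(x) / stats.norm.pdf(x)      # unbounded weights
h = (x >= 2) & (x <= 6)
running = np.cumsum(h * w) / np.arange(1, n + 1)
print(running[[99, 9999, n - 1]])                # erratic running estimate
print(stats.cauchy.cdf(6) - stats.cauchy.cdf(2)) # exact value, ~0.095
```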
  13–14. The harmonic mean estimator. Bayesian posterior distribution defined as $\pi(\theta|x) = \pi(\theta)\,L(\theta|x)/m(x)$. When $\theta_t \sim \pi(\theta|x)$, $$\frac{1}{T} \sum_{t=1}^{T} \frac{1}{L(\theta_t|x)}$$ is an unbiased estimator of $1/m(x)$. [Gelfand & Dey, 1994; Newton & Raftery, 1994] Highly hazardous material: most often leads to an infinite variance!!!
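To see both the identity and the hazard, here is a toy sketch (my addition) in a conjugate normal model where the exact marginal likelihood is available: even here the reciprocal likelihood has infinite posterior variance, so the estimate converges painfully slowly.

```python
# Harmonic mean estimator of the evidence m(x):
# x | theta ~ N(theta, 1), theta ~ N(0, 1), one observation x, so the
# posterior is N(x/2, 1/2) and the exact marginal is N(0, 2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = 1.5
theta = rng.normal(x / 2, np.sqrt(0.5), size=10**5)  # posterior draws
L = stats.norm.pdf(x, loc=theta, scale=1.0)          # likelihood values
print(1.0 / np.mean(1.0 / L))                        # harmonic mean estimate
print(stats.norm.pdf(x, scale=np.sqrt(2)))           # exact m(x)
```

Rerunning with different seeds shows the estimate being driven by the smallest likelihood values in the sample, the fragility the next slide's quote warns about.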
  15. "The Worst Monte Carlo Method Ever". "The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it's easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood." [Radford Neal's blog, Aug. 23, 2008]
  16–17. Comparison with regular importance sampling. Harmonic mean: a constraint opposed to the usual importance sampling constraints: the proposal $\varphi(\cdot)$ must have lighter (rather than fatter) tails than $\pi(\cdot)L(\cdot)$ for the approximation $$1 \bigg/ \frac{1}{T} \sum_{t=1}^{T} \frac{\varphi(\theta_t)}{\pi(\theta_t)\,L(\theta_t|x)}, \qquad \theta_t \sim \pi(\theta|x),$$ to have a finite variance. E.g., use finite-support kernels (like Epanechnikov's kernel) for $\varphi$.
  18. HPD indicator as $\varphi$. Use the convex hull of the MCMC simulations $(\theta_t)_{t=1,\ldots,T}$ corresponding to the 10% HPD region (easily derived!) and $\varphi$ as indicator: $$\varphi(\theta) = \frac{10}{T} \sum_{t \in \mathrm{HPD}} \mathbb{I}_{d(\theta,\theta_t) \le \epsilon}$$ [X & Wraith, 2009]
  19. Bayesian computing (R)evolution: Bernoulli, Jakob (1654–1705); MCMC connected steps; Metropolis-Hastings revisited; Approximate Bayesian computation (ABC).
  20. Computational jam. In the 1970's and early 1980's, the theoretical foundations of Bayesian statistics were sound, but methodology was lagging for lack of computing tools: restriction to conjugate priors; limited complexity of models; small sample sizes. The field was desperately in need of a new computing paradigm! [X & Casella, STS, 2012]
  21. MCMC as in Markov chain Monte Carlo. Notion that i.i.d. simulation is definitely not necessary: all that matters is the ergodic theorem. The realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986). Reasons: lack of computing machinery; lack of background on Markov chains; lack of trust in the practicality of the method.
  22–23. Pre-Gibbs/pre-Hastings era. Early 1970's: Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. [Hammersley and Clifford, 1971] "What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?" [Besag, 1972]
  24. Hammersley-Clifford[-Besag] theorem. Theorem (Hammersley-Clifford): the joint distribution of a vector associated with a dependence graph must be represented as a product of functions over the cliques of the graph, i.e., of functions depending only on the components indexed by the labels in the clique. [Cressie, 1993; Lauritzen, 1996]
  25. Hammersley-Clifford[-Besag] theorem. Theorem (Hammersley-Clifford): a probability distribution $P$ with positive and continuous density $f$ satisfies the pairwise Markov property with respect to an undirected graph $G$ if and only if it factorizes according to $G$, i.e., (F) ≡ (G). [Cressie, 1993; Lauritzen, 1996]
  26. Hammersley-Clifford[-Besag] theorem. Theorem (Hammersley-Clifford): under the positivity condition, the joint distribution $g$ satisfies $$g(y_1, \ldots, y_p) \propto \prod_{j=1}^{p} \frac{g_{\ell_j}(y_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p})}{g_{\ell_j}(y'_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p})}$$ for every permutation $\ell$ on $\{1, 2, \ldots, p\}$ and every $y' \in \mathcal{Y}$. [Cressie, 1993; Lauritzen, 1996]
  27. Clicking in. After Peskun (1973), MCMC lay mostly dormant in the mainstream statistical world for about 10 years; then several papers/books highlighted its usefulness in specific settings: Geman and Geman (1984); Besag (1986); Strauss (1986); Ripley (Stochastic Simulation, 1987); Tanner and Wong (1987); Younes (1988).
  28–29. [Re-]Enters the Gibbs sampler. Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field, without completion. Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein, and ergodicity is proven on the collection of global maxima.
  30. Removing the jam. In the early 1990s, researchers found that Gibbs and then Metropolis–Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC: linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991; Wang & al., 1993, 1994); generalized linear mixed models (Albert & Chib, 1993); mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994; Escobar & West, 1993); changepoint analysis (Carlin & al., 1992); point processes (Grenander & Møller, 1994); &tc
  31. Removing the jam (continued): genomics (Stephens & Smith, 1993; Lawrence & al., 1993; Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly, 2000); ecology (George & X, 1992); variable selection in regression (George & McCulloch, 1993; Green, 1995; Chen & al., 2000); spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993); longitudinal studies (Lange & al., 1992); &tc
  32. MCMC and beyond: reversible jump MCMC, which impacted Bayesian model choice considerably (Green, 1995); adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal, 2009); exact approximations to targets (Tanner & Wong, 1987; Beaumont, 2003; Andrieu & Roberts, 2009); comp'al stats catching up with comp'al physics: free energy sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead, 2011); sequential Monte Carlo (SMC) for non-sequential problems (Chopin, 2002; Neal, 2001; Del Moral et al., 2006); retrospective sampling; intractability: EP – GIMH – PMCMC – SMC² – INLA; QMC[MC] (Owen, 2011).
  33–34. Particles. Iterating/sequential importance sampling is about as old as Monte Carlo methods themselves! [Hammersley and Morton, 1954; Rosenbluth and Rosenbluth, 1955] Found in the molecular simulation literature of the 50's with self-avoiding random walks, and in signal processing. [Marshall, 1965; Handschin and Mayne, 1969] Use of the term "particle" dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term "particle filter".
  35. pMC & pMCMC. Recycling of past simulations is legitimate to build better importance sampling functions, as in population Monte Carlo [Iba, 2000; Cappé et al., 2004; Del Moral et al., 2007]; synthesis by Andrieu, Doucet, and Holenstein (2010), using particles to build an evolving MCMC kernel $\hat{p}_\theta(y_{1:T})$ in state space models $p(x_{1:T})\,p(y_{1:T}|x_{1:T})$; importance sampling on discretely observed diffusions [Beskos et al., 2006; Fearnhead et al., 2008, 2010].
  36. Metropolis-Hastings revisited: Bernoulli, Jakob (1654–1705); MCMC connected steps; Metropolis-Hastings revisited (reinterpretation and Rao-Blackwellisation; Russian roulette); Approximate Bayesian computation (ABC).
  37. Metropolis-Hastings algorithm. 1. We wish to approximate $$I = \frac{\int h(x)\,\pi(x)\,\mathrm{d}x}{\int \pi(x)\,\mathrm{d}x} = \int h(x)\,\bar{\pi}(x)\,\mathrm{d}x.$$ 2. $\pi(x)$ is known, but not $\int \pi(x)\,\mathrm{d}x$. 3. Approximate $I$ with $$\delta = \frac{1}{n} \sum_{t=1}^{n} h(x^{(t)}),$$ where $(x^{(t)})$ is a Markov chain with limiting distribution $\bar{\pi}$. 4. Convergence obtained from the Law of Large Numbers or the CLT for Markov chains.
  38–41. Metropolis-Hastings algorithm. Suppose that $x^{(t)}$ is drawn. 1. Simulate $y_t \sim q(\cdot|x^{(t)})$. 2. Set $x^{(t+1)} = y_t$ with probability $$\alpha(x^{(t)}, y_t) = \min\left\{1, \frac{\pi(y_t)}{\pi(x^{(t)})}\,\frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right\};$$ otherwise, set $x^{(t+1)} = x^{(t)}$. 3. $\alpha$ is such that the detailed balance equation $$\pi(x)\,q(y|x)\,\alpha(x, y) = \pi(y)\,q(x|y)\,\alpha(y, x)$$ is satisfied: $\bar{\pi}$ is the stationary distribution of $(x^{(t)})$. The accepted candidates are simulated with the rejection algorithm.
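A minimal random-walk Metropolis-Hastings sketch (my addition; the Gaussian target and unit step size are illustrative choices), where the symmetric proposal makes the $q$ ratio cancel in $\alpha$:

```python
# Random-walk Metropolis-Hastings for a target pi known up to a constant.
import numpy as np

rng = np.random.default_rng(4)

def metropolis_hastings(log_pi, x0, n, step=1.0):
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        y = x[t] + step * rng.standard_normal()   # symmetric proposal
        # accept with probability min(1, pi(y)/pi(x)); q ratio cancels
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x[t]):
            x[t + 1] = y
        else:
            x[t + 1] = x[t]                       # rejection: stay put
    return x

chain = metropolis_hastings(lambda x: -0.5 * x**2, 0.0, 10**4)  # pi = N(0,1)
print(chain.mean(), chain.var())   # roughly 0 and 1
```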
  42. Some properties of the HM algorithm. Alternative representation of the estimator $\delta$: $$\delta = \frac{1}{n} \sum_{t=1}^{n} h(x^{(t)}) = \frac{1}{n} \sum_{i=1}^{M_n} n_i\,h(z_i),$$ where the $z_i$'s are the accepted $y_j$'s, $M_n$ is the number of accepted $y_j$'s till time $n$, and $n_i$ is the number of times $z_i$ appears in the sequence $(x^{(t)})_t$.
  43–44. The "accepted candidates". $$\tilde{q}(\cdot|z_i) = \frac{\alpha(z_i, \cdot)\,q(\cdot|z_i)}{p(z_i)}, \quad \text{where } p(z_i) = \int \alpha(z_i, y)\,q(y|z_i)\,\mathrm{d}y.$$ To simulate from $\tilde{q}(\cdot|z_i)$: 1. Propose a candidate $y \sim q(\cdot|z_i)$. 2. Accept with probability $$\frac{\tilde{q}(y|z_i)}{q(y|z_i)/p(z_i)} = \alpha(z_i, y);$$ otherwise, reject it and start again. This is the transition of the HM algorithm. The transition kernel $\tilde{q}$ enjoys $\tilde{\pi}$ as a stationary distribution: $\tilde{\pi}(x)\,\tilde{q}(y|x) = \tilde{\pi}(y)\,\tilde{q}(x|y)$.
  45–48. "Accepted" Markov chain. Lemma (Douc & X., AoS, 2011): the sequence $(z_i, n_i)$ satisfies 1. $(z_i, n_i)_i$ is a Markov chain; 2. $z_{i+1}$ and $n_i$ are independent given $z_i$; 3. $n_i$ is distributed as a geometric random variable with probability parameter $$p(z_i) := \int \alpha(z_i, y)\,q(y|z_i)\,\mathrm{d}y;$$ 4. $(z_i)_i$ is a Markov chain with transition kernel $\tilde{Q}(z, \mathrm{d}y) = \tilde{q}(y|z)\,\mathrm{d}y$ and stationary distribution $\tilde{\pi}$ such that $\tilde{q}(\cdot|z) \propto \alpha(z, \cdot)\,q(\cdot|z)$ and $\tilde{\pi}(\cdot) \propto \pi(\cdot)\,p(\cdot)$.
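The decomposition in the lemma is easy to check numerically; a sketch (my addition, reusing the toy $N(0,1)$ target from the earlier MH example) that collapses an MH path into accepted values $z_i$ and multiplicities $n_i$:

```python
# Collapse an MH path into accepted states z_i and their multiplicities n_i,
# so that sum_t h(x_t) = sum_i n_i h(z_i) for any h (here h = identity).
import numpy as np

rng = np.random.default_rng(5)
x = np.zeros(10**4)
for t in range(len(x) - 1):                  # quick MH run on N(0,1)
    y = x[t] + rng.standard_normal()
    accept = np.log(rng.uniform()) < 0.5 * (x[t]**2 - y**2)
    x[t + 1] = y if accept else x[t]

# with a continuous proposal, consecutive equal values mark rejections
starts = np.concatenate(([0], np.flatnonzero(np.diff(x) != 0) + 1))
z = x[starts]                                    # accepted values z_i
n = np.diff(np.concatenate((starts, [len(x)]))) # multiplicities n_i
print(np.allclose(x.mean(), (n * z).sum() / n.sum()))  # True
```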
  49–52. Importance sampling perspective. 1. A natural idea: $$\delta^* = \frac{1}{n} \sum_{i=1}^{M_n} \frac{h(z_i)}{p(z_i)},$$ or its self-normalised version $$\delta^* = \sum_{i=1}^{M_n} \frac{h(z_i)}{p(z_i)} \bigg/ \sum_{i=1}^{M_n} \frac{1}{p(z_i)} = \sum_{i=1}^{M_n} \frac{\pi(z_i)}{\tilde{\pi}(z_i)}\,h(z_i) \bigg/ \sum_{i=1}^{M_n} \frac{\pi(z_i)}{\tilde{\pi}(z_i)}.$$ 2. But $p$ is not available in closed form. 3. The geometric $n_i$ is the replacement, an obvious solution that is used in the original Metropolis–Hastings estimate, since $E[n_i] = 1/p(z_i)$.
  53. The Bernoulli factory. The crude estimate of $1/p(z_i)$, $$n_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell=1}^{j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\},$$ can be improved. Lemma (Douc & X., AoS, 2011): if $(y_j)_j$ is an iid sequence with distribution $q(y|z_i)$, the quantity $$\hat{\xi}_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell=1}^{j} \{1 - \alpha(z_i, y_\ell)\}$$ is an unbiased estimator of $1/p(z_i)$ whose variance, conditional on $z_i$, is lower than the conditional variance of $n_i$, $\{1 - p(z_i)\}/p^2(z_i)$.
  54–55. Rao-Blackwellised, for sure? $$\hat{\xi}_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell=1}^{j} \{1 - \alpha(z_i, y_\ell)\}$$ 1. Infinite sum, but finite with at least positive probability: with $$\alpha(x^{(t)}, y_t) = \min\left\{1, \frac{\pi(y_t)}{\pi(x^{(t)})}\,\frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right\},$$ take, for example, a symmetric random walk as a proposal. 2. What if we wish to be sure that the sum is finite? Finite horizon $k$ version: $$\hat{\xi}_i^k = 1 + \sum_{j=1}^{\infty} \prod_{\ell=1}^{k \wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{\ell=k+1}^{j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\}$$
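A sketch comparing the two unbiased estimators of $1/p(z_i)$ (my addition; the target, the random-walk proposal, and the truncation of the infinite sum once the running product is negligible are illustrative shortcuts, not part of the theory):

```python
# Compare the geometric waiting time n_i with the Rao-Blackwellised
# xi_i = 1 + sum_j prod_{l<=j} (1 - alpha(z, y_l)) as estimators of 1/p(z);
# target N(0,1), random-walk proposal q(.|z) = N(z, 1).
import numpy as np

rng = np.random.default_rng(6)

def alpha(z, y):
    return min(1.0, np.exp(0.5 * (z**2 - y**2)))

def geometric_n(z):
    n = 1
    while rng.uniform() >= alpha(z, z + rng.standard_normal()):
        n += 1                                   # one more rejection
    return n

def xi_hat(z, tol=1e-10):
    total, prod = 1.0, 1.0
    while prod > tol:                            # truncate negligible tail
        prod *= 1.0 - alpha(z, z + rng.standard_normal())
        total += prod
    return total

z = 1.0
ns = np.array([geometric_n(z) for _ in range(5000)])
xs = np.array([xi_hat(z) for _ in range(5000)])
print(ns.mean(), ns.var())   # same mean, 1/p(z)...
print(xs.mean(), xs.var())   # ...but smaller variance for xi_hat
```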
  56–57. Which Bernoulli factory?! Not the spice warehouse of Leon Bernoulli! Query: given an algorithm delivering iid $\mathcal{B}(p)$ rv's, is it possible to derive an algorithm delivering iid $\mathcal{B}(f(p))$ rv's when $f$ is known and $p$ unknown? [von Neumann, 1951; Keane & O'Brien, 1994] Existence (e.g., impossible for $f(p) = \min(2p, 1)$); condition: for some $n$, $\min\{f(p), 1 - f(p)\} \ge \min\{p, 1 - p\}^n$; implementation (polynomial vs. exponential time); use of sandwiching polynomials/power series.
  58–60. Variance improvement. Theorem (Douc & X., AoS, 2011): if $(y_j)_j$ is an iid sequence with distribution $q(y|z_i)$ and $(u_j)_j$ is an iid uniform sequence, for any $k \ge 0$ the quantity $$\hat{\xi}_i^k = 1 + \sum_{j=1}^{\infty} \prod_{\ell=1}^{k \wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{\ell=k+1}^{j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\}$$ is an unbiased estimator of $1/p(z_i)$ with an almost surely finite number of terms. Moreover, for $k \ge 1$, $$\mathbb{V}\big[\hat{\xi}_i^k \mid z_i\big] = \frac{1 - p(z_i)}{p^2(z_i)} - \frac{1 - (1 - 2p(z_i) + r(z_i))^k}{2p(z_i) - r(z_i)}\,\frac{2 - p(z_i)}{p^2(z_i)}\,\big(p(z_i) - r(z_i)\big),$$ where $p(z_i) := \int \alpha(z_i, y)\,q(y|z_i)\,\mathrm{d}y$ and $r(z_i) := \int \alpha^2(z_i, y)\,q(y|z_i)\,\mathrm{d}y$. Therefore, $$\mathbb{V}\big[\hat{\xi}_i \mid z_i\big] \le \mathbb{V}\big[\hat{\xi}_i^k \mid z_i\big] \le \mathbb{V}\big[\hat{\xi}_i^0 \mid z_i\big] = \mathbb{V}[n_i \mid z_i].$$
  61–62. Motivation for Russian roulette. Prior $\pi(\theta)$, data density $p(y|\theta) = f(y; \theta)/Z(\theta)$ with $Z(\theta) = \int f(x; \theta)\,\mathrm{d}x$ intractable (e.g., Ising spin model, MRF, diffusion processes, networks, &tc). The doubly-intractable posterior follows as $$\pi(\theta|y) = p(y|\theta) \times \pi(\theta) \times \frac{1}{Z(y)} = \frac{f(y; \theta)}{Z(\theta)} \times \pi(\theta) \times \frac{1}{Z(y)},$$ where $Z(y) = \int p(y|\theta)\,\pi(\theta)\,\mathrm{d}\theta$. Both $Z(\theta)$ and $Z(y)$ are intractable, with massively different consequences. [thanks to Mark Girolami for his Russian slides!]
  63–64. Motivation for Russian roulette (continued). If $Z(\theta)$ is intractable, the Metropolis–Hastings acceptance probability $$\alpha(\theta', \theta) = \min\left\{1, \frac{f(y; \theta')\,\pi(\theta')}{f(y; \theta)\,\pi(\theta)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)} \times \frac{Z(\theta)}{Z(\theta')}\right\}$$ is not available. Use instead biased approximations, e.g. pseudo-likelihoods or plug-in $\hat{Z}(\theta')$ estimates, without sacrificing the exactness of MCMC.
  65. Existing solution. Unbiased plug-in estimate $$\frac{Z(\theta)}{Z(\theta')} \approx \frac{f(x; \theta)}{f(x; \theta')}, \quad \text{where } x \sim \frac{f(x; \theta')}{Z(\theta')}.$$ [Møller et al., Bka, 2006; Murray et al., 2006] The auxiliary variable method removes $Z(\theta)/Z(\theta')$ from the picture, but requires simulations from the model (e.g., via perfect sampling).
  66–67. Exact approximate methods. Pseudo-marginal construction that allows for the use of unbiased, positive estimates of the target in the acceptance probability $$\alpha(\theta', \theta) = \min\left\{1, \frac{\hat{\pi}(\theta'|y)}{\hat{\pi}(\theta|y)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)}\right\}.$$ [Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al., 2012] The transition kernel has an invariant distribution with the exact target density $\pi(\theta|y)$.
  68–70. Infinite series estimator. For each $(\theta, y)$, construct rv's $\{V_\theta^{(j)}, j \ge 0\}$ such that $$\hat{\pi}(\theta, \{V_\theta^{(j)}\}|y) := \sum_{j=0}^{\infty} V_\theta^{(j)}$$ is a.s. finite with finite expectation $E[\hat{\pi}(\theta, \{V_\theta^{(j)}\}|y)] = \pi(\theta|y)$. Introduce a random stopping time $\tau_\theta$ such that, with $\xi := (\tau_\theta, \{V_\theta^{(j)}, 0 \le j \le \tau_\theta\})$, the estimate $$\hat{\pi}(\theta, \xi|y) := \sum_{j=0}^{\tau_\theta} V_\theta^{(j)}$$ satisfies $E\big[\hat{\pi}(\theta, \xi|y) \mid \{V_\theta^{(j)}, j \ge 0\}\big] = \hat{\pi}(\theta, \{V_\theta^{(j)}\}|y)$. Warning: the unbiased estimate $\hat{\pi}(\theta, \xi|y)$ obtained from this series construction has no general guarantee of positivity.
  71–74. Russian roulette. Method that requires unbiased truncation of a series $S(\theta) = \sum_{i=0}^{\infty} \phi_i(\theta)$. Russian roulette is employed extensively in the simulation of neutron scattering and in computer graphics. Assign probabilities $\{q_j, j \ge 1\}$, $q_j \in (0, 1]$, and generate $\mathcal{U}(0, 1)$ i.i.d. rv's $\{U_j, j \ge 1\}$. Find the first time $k \ge 1$ such that $U_k > q_k$; the Russian roulette estimate of $S(\theta)$ is $$\hat{S}(\theta) = \sum_{j=0}^{k} \frac{\phi_j(\theta)}{\prod_{i=1}^{j-1} q_i}.$$ If $\lim_{n\to\infty} \prod_{j=1}^{n} q_j = 0$, Russian roulette terminates with probability one; $E[\hat{S}(\theta)] = S(\theta)$, with finite variance under certain known conditions. [Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
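A sketch of the roulette truncation (my addition; the geometric series $\phi_j = a^j$ and the constant survival probability $q_j = q$ are illustrative choices, and I read the stopping rule as "continue past term $j$ with probability $q_j$", matching the $\prod_{i=1}^{j-1} q_i$ weights above):

```python
# Russian roulette: randomly truncate an infinite series, reweighting the
# surviving terms so that the truncated sum stays unbiased for S.
import numpy as np

rng = np.random.default_rng(7)

def roulette(phi, q=0.5, max_terms=10**6):
    """One roulette estimate of S = sum_{j>=0} phi(j), constant q_j = q."""
    total = phi(0)
    weight = 1.0                       # prod_{i=1}^{j-1} q_i for term j
    for j in range(1, max_terms):
        total += phi(j) / weight       # reweighted term j
        if rng.uniform() > q:          # first failure: truncate here
            break
        weight *= q
    return total

a = 0.5                                # phi_j = a^j, exact sum 1/(1-a) = 2
ests = [roulette(lambda j: a**j) for _ in range(10**4)]
print(np.mean(ests))                   # close to 2: unbiased truncation
```

Averaging many replicates recovers the exact sum, which is the unbiasedness property the slide states.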
  75. Towards ever more complexity: Bernoulli, Jakob (1654–1705); MCMC connected steps; Metropolis-Hastings revisited; Approximate Bayesian computation (ABC).
  76. New challenges. Novel statistical issues that force a different Bayesian answer: very large datasets; complex or unknown dependence structures, with maybe $p \gg n$; multiple and involved random effects; missing data structures containing most of the information; sequential structures involving most of the above.
  77. New paradigm? "Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization." [Lange et al., ISR, 2013]
  78. New paradigm? The sad reality constraint that size does matter: focus on much smaller dimensions and on sparse summaries; many (fast if non-Bayesian) ways of producing those summaries; Bayesian inference can kick in almost automatically at this stage.
  79–82. Approximate Bayesian computation (ABC). Case of a well-defined statistical model where the likelihood function $\ell(\theta|y) = f(y_1, \ldots, y_n|\theta)$ is out of reach! Empirical approximations to the original Bayesian inference problem: degrading the data precision down to a tolerance $\epsilon$; replacing the likelihood with a non-parametric approximation; summarising/replacing the data with insufficient statistics.
  83–85. ABC methodology. Bayesian setting: target is $\pi(\theta)\,f(x|\theta)$. When the likelihood $f(x|\theta)$ is not in closed form, a likelihood-free rejection technique applies. Foundation: for an observation $y \sim f(y|\theta)$, under the prior $\pi(\theta)$, if one keeps jointly simulating $$\theta' \sim \pi(\theta), \quad z \sim f(z|\theta'),$$ until the auxiliary variable $z$ is equal to the observed value, $z = y$, then the selected $\theta' \sim \pi(\theta|y)$. [Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
  86. ABC algorithm. In most implementations, a degree of approximation: Algorithm 1 (likelihood-free rejection sampler): for $i = 1$ to $N$ do: repeat: generate $\theta'$ from the prior distribution $\pi(\cdot)$; generate $z$ from the likelihood $f(\cdot|\theta')$; until $\rho\{\eta(z), \eta(y)\} \le \epsilon$; set $\theta_i = \theta'$; end for — where $\eta(y)$ defines a (not necessarily sufficient) statistic.
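Algorithm 1 in a few lines (my addition): a toy normal model where the exact posterior is known, with $\eta$ the identity and $\rho$ the absolute difference, both illustrative choices.

```python
# Likelihood-free rejection sampler on y ~ N(theta, 1), theta ~ N(0, 1):
# keep theta' whenever the pseudo-data z falls within eps of y_obs.
import numpy as np

rng = np.random.default_rng(8)
y_obs, eps, N = 1.2, 0.05, 2000

samples = []
while len(samples) < N:
    theta = rng.standard_normal()      # theta' ~ prior pi
    z = rng.normal(theta, 1.0)         # z ~ f(.|theta')
    if abs(z - y_obs) <= eps:          # rho{eta(z), eta(y)} <= eps
        samples.append(theta)

samples = np.asarray(samples)
# exact posterior here is N(y/2, 1/2) = N(0.6, 0.5)
print(samples.mean(), samples.var())
```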
  87. Comments. The role of the distance is paramount (because $\epsilon \ne 0$); scaling of the components of $\eta(y)$ is also capital; $\epsilon$ matters little if "small enough"; representative of the "curse of dimensionality"; small is beautiful!, i.e. the data as a whole may be weakly informative for ABC; non-parametric method at its core.
  88–91. ABC simulation advances. Simulating from the prior is often poor in efficiency. Either modify the proposal distribution on $\theta$ to increase the density of $x$'s within the vicinity of $y$... [Marjoram et al., 2003; Beaumont et al., 2009; Del Moral et al., 2012] ...or view the problem as conditional density estimation and develop techniques to allow for larger $\epsilon$... [Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013] ...or even include $\epsilon$ in the inferential framework [ABCµ]. [Ratmann et al., 2009]
  92–93. ABC as an inference machine. Starting point is the summary statistic $\eta(y)$, either chosen for computational realism or imposed by external constraints: ABC can produce a distribution on the parameter of interest conditional on this summary statistic $\eta(y)$; inference based on ABC may be consistent or not, so it needs to be validated on its own; the choice of the tolerance level $\epsilon$ is dictated by both computational and convergence constraints.
  94. How Bayesian aBc is..? At best, ABC approximates $\pi(\theta|\eta(y))$: the approximation error is unknown (without massive simulation); pragmatic or empirical Bayes (there is no other solution!); many calibration issues (tolerance, distance, statistics); the NP side should be incorporated into the whole Bayesian picture; the approximation error should also be part of the Bayesian inference.
  95–96. Noisy ABC. The ABC approximation error (under non-zero tolerance $\epsilon$) is replaced with exact simulation from a controlled approximation to the target, the convolution of the true posterior with a kernel function, $$\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\,f(z|\theta)\,K_\epsilon(y - z)}{\int \pi(\theta)\,f(z|\theta)\,K_\epsilon(y - z)\,\mathrm{d}z\,\mathrm{d}\theta},$$ with $K_\epsilon$ a kernel parameterised by the bandwidth $\epsilon$. [Wilkinson, 2013] Theorem: the ABC algorithm based on a randomised observation $y = \tilde{y} + \xi$, $\xi \sim K_\epsilon$, and an acceptance probability of $K_\epsilon(y - z)/M$ gives draws from the posterior distribution $\pi(\theta|y)$.
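A sketch of the randomised-observation idea (my addition, on the same toy normal model as above; with $K_\epsilon$ uniform on $[-\epsilon, \epsilon]$, the kernel acceptance $K_\epsilon(y - z)/M$ reduces to the usual indicator):

```python
# Noisy ABC: jitter the observation with the same kernel used for
# acceptance; with K_eps uniform on [-eps, eps], K_eps(y - z)/M is just
# the indicator |y - z| <= eps.
import numpy as np

rng = np.random.default_rng(9)
y_tilde, eps, N = 1.2, 0.05, 2000
y = y_tilde + rng.uniform(-eps, eps)   # randomised observation y~ + xi

samples = []
while len(samples) < N:
    theta = rng.standard_normal()      # theta' ~ prior
    z = rng.normal(theta, 1.0)         # pseudo-data
    if abs(z - y) <= eps:              # kernel acceptance K_eps(y - z)/M
        samples.append(theta)

print(np.mean(samples))                # posterior mean under noisy ABC
```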
  97–99. Which summary? The fundamental difficulty is the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]. On the one hand: loss of statistical information balanced against gain in data roughening; approximation error and information loss remain unknown; the choice of statistics induces a choice of distance function; towards standardisation, borrowing tools from data analysis (LDA) and machine learning. [Estoup et al., ME, 2012] On the other hand, the summary may be imposed for external/practical reasons; it may gather several non-Bayesian point estimates; we can learn about their efficient combination; and the distance can be provided by estimation techniques.
  100–101. Which summary for model choice? 'This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.' [S. Sisson, Jan. 31, 2011, xianblog] Depending on the choice of $\eta(\cdot)$, the Bayes factor based on this insufficient statistic, $$B_{12}^\eta(y) = \frac{\int \pi_1(\theta_1)\,f_1^\eta(\eta(y)|\theta_1)\,\mathrm{d}\theta_1}{\int \pi_2(\theta_2)\,f_2^\eta(\eta(y)|\theta_2)\,\mathrm{d}\theta_2},$$ is either consistent or not. [X et al., PNAS, 2012] [Figure: boxplots of summary-based posterior quantities under Gauss vs. Laplace models, n = 100.]
  102–103. Selecting proper summaries. Consistency only depends on the range of $\mu_i(\theta) = E_i[\eta(y)]$ under both models against the asymptotic mean $\mu_0$ of $\eta(y)$. Theorem: if $P^n$ belongs to one of the two models and if $\mu_0$ cannot be attained by the other one, i.e. $$0 = \min\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big) < \max\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big),$$ then the Bayes factor $B_{12}^\eta$ is consistent. [Marin et al., JRSS B, 2013] [Figure: boxplots of ABC approximations under models M1 and M2 across a range of summary statistics.]
