
Monte Carlo in Montréal 2017

Slides for an invited talk at this conference, presenting recent results on ABC.


  1. ABC: from convergence guarantees to automated implementation. Christian P. Robert, Université Paris-Dauphine PSL, Paris & University of Warwick, Coventry. Joint work with A. Estoup, J.M. Marin, P. Pudlo, L. Raynal, & M. Ribatet
  2. Outline: motivating example; Approximate Bayesian computation; ABC for model choice; ABC model choice via random forests; ABC estimation via random forests; [some] asymptotics of ABC
  3. A motivating if pedestrian example: paired and orphan socks. A drawer contains an unknown number of socks, some of which can be paired and some of which are orphans (single). One takes at random 11 socks without replacement from this drawer: no pair can be found among those. What can we infer about the total number of socks in the drawer? Sounds like an impossible task: one observation x = 11 and two unknowns, nsocks and npairs; writing the likelihood is a challenge [exercise]
  6. Feller's shoes. A closet contains n pairs of shoes. If 2r shoes are chosen at random (with 2r < n), what is the probability that there will be (a) no complete pair, (b) exactly one complete pair, (c) exactly two complete pairs among them? [Feller, 1970, Chapter II, Exercise 26] Resolution: the probability of obtaining j pairs among those 2r shoes is $p_j = \binom{n}{j} 2^{2r-2j} \binom{n-j}{2r-2j} \Big/ \binom{2n}{2r}$, or, for an odd number t of shoes, $p_j = 2^{t-2j} \binom{n}{j} \binom{n-j}{t-2j} \Big/ \binom{2n}{t}$
  7. Feller's shoes. If one draws 11 socks out of m socks made of f orphans and g pairs, with f + 2g = m, the number k of socks from the orphan group is hypergeometric H(11, m, f) and the probability to observe 11 distinct socks overall is $\sum_{k=0}^{11} \frac{\binom{f}{k}\binom{2g}{11-k}}{\binom{m}{11}} \times \frac{2^{11-k}\binom{g}{11-k}}{\binom{2g}{11-k}}$
  8. A prioris on socks. Given parameters nsocks and npairs, the set of socks S = {s1, s1, . . . , s_npairs, s_npairs, s_npairs+1, . . . , s_nsocks}, and 11 socks picked at random from S give X unique socks. Rasmus' reasoning: "If you are a family of 3-4 persons then a guesstimate would be that you have something like 15 pairs of socks in store. It is also possible that you have much more than 30 socks. So as a prior for nsocks I'm going to use a negative binomial with mean 30 and standard deviation 15. On the paired proportion 2npairs/nsocks I'm going to put a Beta prior distribution that puts most of the probability over the range 0.75 to 1.0." [Rasmus Bååth's Research Blog, Oct 20th, 2014]
  10. Simulating the experiment. Given a prior distribution on nsocks and npairs, nsocks ∼ Neg(30, 15), npairs | nsocks ∼ ⌊nsocks/2⌋ · Be(15, 2), it is possible to 1. generate new values of nsocks and npairs, 2. generate a new observation of X, the number of unique socks out of 11, 3. accept the pair (nsocks, npairs) if the realisation of X is equal to 11
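A minimal R sketch of this rejection scheme, following Rasmus Bååth's blog post; the prior settings mirror the slide, while function and variable names are illustrative choices:

```r
sim_socks <- function() {
  prior_mu <- 30; prior_sd <- 15
  # negative binomial prior parametrised by mean and sd via mu/size
  n_socks <- rnbinom(1, mu = prior_mu,
                     size = prior_mu^2 / (prior_sd^2 - prior_mu))
  prop_pairs <- rbeta(1, 15, 2)                   # prior share of paired socks
  n_pairs <- round(floor(n_socks / 2) * prop_pairs)
  n_odd   <- n_socks - 2 * n_pairs
  # label socks so pairs share a label, then draw 11 without replacement
  socks <- rep(seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)))
  n_unique <- if (n_socks < 11) 0 else length(unique(sample(socks, 11)))
  c(n_socks = n_socks, n_pairs = n_pairs, x = n_unique)
}
sims <- replicate(1e5, sim_socks())
accepted <- sims[, sims["x", ] == 11]   # rejection step: keep X = 11 only
```

The columns of `accepted` are draws from the posterior discussed in the next slides.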
  12. Meaning. [histogram of the simulated nsocks values (density)] The outcome of this simulation method returns a distribution on the pair (nsocks, npairs) that is the conditional distribution of the pair given the observation X = 11. Proof: generations from π(nsocks, npairs) are accepted with probability P{X = 11 | (nsocks, npairs)}
  13. Meaning. Proof (cont'd): hence the accepted values are distributed from π(nsocks, npairs) × P{X = 11 | (nsocks, npairs)} ∝ π(nsocks, npairs | X = 11)
  14. Approximate Bayesian computation. Outline: motivating example; Approximate Bayesian computation (ABC basics, automated summary selection); ABC for model choice; ABC model choice via random forests; ABC estimation via random forests; [some] asymptotics of ABC
  15. Intractable likelihoods. Cases when the likelihood function f(y|θ) is unavailable and when the completion step $f(y|\theta) = \int_{\mathcal{Z}} f(y, z|\theta)\, dz$ is impossible or too costly because of the dimension of z: MCMC cannot be implemented
  16. The ABC method. Bayesian setting: target is π(θ)f(x|θ). When the likelihood f(x|θ) is not available in closed form, a likelihood-free rejection technique: ABC algorithm: for an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ), z ∼ f(z|θ′), until the auxiliary variable z is equal to the observed value, z = y. [Tavaré et al., 1997]
  18. A as A...pproximative. When y is a continuous random variable, the equality z = y is replaced with a tolerance condition ρ(y, z) < ε, where ρ is a distance. Output distributed from π(θ) Pθ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε) [Pritchard et al., 1999]
  20. ABC algorithm. Algorithm 1: likelihood-free rejection sampler:
      for i = 1 to N do
        repeat
          generate θ′ from the prior distribution π(·)
          generate z from the likelihood f(·|θ′)
        until ρ{η(z), η(y)} ≤ ε
        set θi = θ′
      end for
      where η(y) defines a (not necessarily sufficient) statistic
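A hedged R sketch of Algorithm 1 on a toy model, y ∼ N(θ, 1) with summary η(y) = ȳ; the prior N(0, 10), the tolerance and the sample size are illustrative assumptions, not settings from the talk:

```r
abc_reject <- function(y, N = 1e4, eps = 0.1) {
  eta_y <- mean(y)                          # summary of the observed data
  theta <- numeric(N)
  for (i in seq_len(N)) {
    repeat {
      th <- rnorm(1, 0, sqrt(10))           # theta' ~ prior
      z  <- rnorm(length(y), th, 1)         # z ~ f(.|theta')
      if (abs(mean(z) - eta_y) < eps) break # rho{eta(z), eta(y)} < eps
    }
    theta[i] <- th
  }
  theta
}
y <- rnorm(50, 1.5, 1)
post_sample <- abc_reject(y)   # draws from the ABC posterior of slide 21
```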
  21. Output. The likelihood-free algorithm samples from the marginal in z of $\pi_\varepsilon(\theta, z \mid y) = \frac{\pi(\theta) f(z|\theta)\, \mathbb{I}_{A_{\varepsilon,y}}(z)}{\int_{A_{\varepsilon,y} \times \Theta} \pi(\theta) f(z|\theta)\, dz\, d\theta}$, where $A_{\varepsilon,y} = \{z \in \mathcal{D} : \rho(\eta(z), \eta(y)) < \varepsilon\}$. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: $\pi_\varepsilon(\theta \mid y) = \int \pi_\varepsilon(\theta, z \mid y)\, dz \approx \pi(\theta \mid \eta(y))$.
  23. Dogger Bank re-enactment. Battle of Dogger Bank, 24 January 1915, between the British and German fleets: how likely was the British victory? [MacKay, Price, and Wood, 2016]
  25. Dogger Bank re-enactment. Battle of Dogger Bank, 24 January 1915, between the British and German fleets: ABC simulation of the posterior distribution [MacKay, Price, and Wood, 2016]
  26. ABC advances. Simulating from the prior is often poor in efficiency. Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y [Marjoram et al., 2003; Bortot et al., 2007; Beaumont et al., 2009], or view the problem as conditional density estimation and develop techniques that allow for a larger ε [Beaumont et al., 2002; Blum & François, 2009], or even include ε in the inferential framework [ABCµ] [Ratmann et al., 2009]
  30. ABC consistency. Recent studies on large sample properties of ABC posterior distributions and ABC posterior means [Li & Fearnhead, 2016; Frazier et al., 2016]. Under regularity conditions on the summary statistics, incl. convergence at speed dT: characterisation of the rate of posterior concentration as a function of the tolerance; convergence requires a less stringent condition on the tolerance decrease than asymptotic normality of the posterior; asymptotic normality of the posterior mean does not require asymptotic normality of the posterior itself. Cases for limiting ABC distributions: 1. εT dT → +∞; 2. εT dT → c; 3. εT dT → 0; and the limiting ABC mean is convergent for ε²T = o(1/dT) [Frazier et al., 2016]
  32. Noisily exact ABC. Idea: modify the data from the start, ỹ = y0 + εζ1, with ζ1 on the same scale as the ABC tolerance [see Fearnhead-Prangle], and run ABC on ỹ. Then ABC produces an exact simulation from πε(θ | ỹ) = π(θ | ỹ) [Dean et al., 2011; Fearnhead and Prangle, 2012]
  34. Consistent noisy ABC. Degrading the data improves the estimation performances: noisy ABC-MLE is asymptotically (in n) consistent; under further assumptions, the noisy ABC-MLE is asymptotically normal; increase in variance of order ε⁻²; likely degradation in precision or computing time due to the lack of summary statistic [curse of dimensionality]
  35. Semi-automatic ABC. Fearnhead and Prangle (2012) study ABC and the selection of the summary statistic. ABC is then considered from a purely inferential viewpoint and calibrated for estimation purposes. Use of a randomised (or 'noisy') version of the summary statistics, η̃(y) = η(y) + τε. Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic [calibration constraint: ABC approximation with the same posterior mean as the true randomised posterior]. Optimality of the posterior expectation E[θ|y] of the parameter of interest as summary statistic η(y)!
  37. Fully automatic ABC. Implementation of ABC still requires the input of a collection of summaries. Towards automation: statistical projection techniques (LDA, PCA, NP-GLS, &tc.); variable selection; machine learning approaches; bypassing summaries altogether
  38. ABC with Wasserstein distance. Use as distance between simulated and observed samples the Wasserstein distance $W_p(y_{1:n}, z_{1:n})^p = \inf_{\sigma \in S_n} \frac{1}{n} \sum_{i=1}^{n} \rho(y_i, z_{\sigma(i)})^p$, (1) which covers well- and mis-specified cases; only depends on the data space distance ρ(·, ·); covers iid and dynamic models (curve matching); is computationally feasible (linear in the dimension, cubic in the sample size); Hilbert curve approximation [Bernton et al., 2017]
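In one dimension the infimum over permutations in (1) is attained by sorting, so the distance reduces to matching order statistics; a minimal R sketch under that assumption (equal sample sizes, ρ the absolute difference):

```r
wasserstein_1d <- function(y, z, p = 1) {
  stopifnot(length(y) == length(z))
  # (1/n * sum_i |y_(i) - z_(i)|^p)^(1/p), sorting realises the optimal matching
  mean(abs(sort(y) - sort(z))^p)^(1/p)
}
wasserstein_1d(rnorm(100), rnorm(100, mean = 1))
```

General ρ(·,·) or multivariate samples would need an optimal-assignment solver (or the Hilbert curve approximation mentioned above) instead of a sort.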
  39. Consistent inference with Wasserstein distance. As ε → 0 [and n fixed]: if either 1. $f^{(n)}_\theta$ is n-exchangeable and D(y1:n, z1:n) = 0 if and only if z1:n = yσ(1:n) for some σ ∈ Sn, or 2. D(y1:n, z1:n) = 0 if and only if z1:n = y1:n, then, at y1:n fixed, the ABC posterior converges strongly to the posterior as ε → 0. [Bernton et al., 2017]
  40. Consistent inference with Wasserstein distance. As n → ∞ [at ε fixed]: the WABC distribution with a fixed ε does not converge in n to a Dirac mass [Bernton et al., 2017]
  41. Consistent inference with Wasserstein distance. As εn → 0 and n → ∞: under a range of assumptions, if fn(εn) → 0 and P(W(µ̂n, µ⋆) ≤ εn) → 1, then the WABC posterior with threshold εn + ε⋆ satisfies $\pi_{\varepsilon_n+\varepsilon^\star}\big(\{\theta \in \mathcal{H} : W(\mu^\star, \mu_\theta) > \varepsilon^\star + 4\varepsilon_n/3 + f_n^{-1}(\varepsilon_n^L/R)\} \mid y_{1:n}\big) \leq \delta$ in P-probability [Bernton et al., 2017]
  42. A bivariate Gaussian illustration. 100 observations from a bivariate Normal with variance 1 and covariance 0.55. Compare WABC with ABC based on the raw Euclidean distance and on the Euclidean distance between sample means, on 10⁶ model simulations.
  43. ABC for model choice. Outline: motivating example; Approximate Bayesian computation; ABC for model choice; ABC model choice via random forests; ABC estimation via random forests; [some] asymptotics of ABC
  44. Bayesian model choice. Several models M1, M2, . . . are considered simultaneously for a dataset y and the model index M is part of the inference. Use of a prior distribution π(M = m), plus a prior distribution on the parameter conditional on the value m of the model index, πm(θm). The goal is to derive the posterior distribution of M, a challenging computational target when models are complex.
  45. Generic ABC for model choice. Algorithm 2: likelihood-free model choice sampler (ABC-MC):
      for t = 1 to T do
        repeat
          generate m from the prior π(M = m)
          generate θm from the prior πm(θm)
          generate z from the model fm(z|θm)
        until ρ{η(z), η(y)} < ε
        set m(t) = m and θ(t) = θm
      end for
      [Cornuet et al., DIYABC, 2009]
  46. ABC estimates. The posterior probability π(M = m|y) is approximated by the frequency of acceptances from model m, $\frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{m^{(t)}=m}$.
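A hedged R sketch of Algorithm 2 together with this frequency estimate, on two toy models echoing the Normal versus Laplace benchmark further below; the prior on θ, the summaries and the tolerance are illustrative assumptions:

```r
abc_mc <- function(y, T = 1e4, eps = 0.2) {
  eta <- function(x) c(mean(x), median(x))   # illustrative summary choice
  eta_y <- eta(y)
  m_acc <- integer(T)
  for (t in seq_len(T)) {
    repeat {
      m  <- sample(1:2, 1)                   # m ~ prior on the model index
      th <- rnorm(1, 0, 2)                   # theta_m ~ prior
      z  <- if (m == 1) rnorm(length(y), th, 1)              # M1: Normal
            else th + sample(c(-1, 1), length(y), TRUE) *
                 rexp(length(y), rate = sqrt(2))             # M2: Laplace(th, 1/sqrt(2))
      if (sqrt(sum((eta(z) - eta_y)^2)) < eps) break
    }
    m_acc[t] <- m
  }
  table(m_acc) / T    # frequency approximation of pi(M = m | y)
}
abc_mc(rnorm(50))
```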
  47. Limiting behaviour of B12 (under sufficiency). If η(y) is a sufficient statistic for both models, $f_i(y|\theta_i) = g_i(y)\, f_i^\eta(\eta(y)|\theta_i)$. Thus $B_{12}(y) = \frac{\int_{\Theta_1} \pi(\theta_1) g_1(y) f_1^\eta(\eta(y)|\theta_1)\, d\theta_1}{\int_{\Theta_2} \pi(\theta_2) g_2(y) f_2^\eta(\eta(y)|\theta_2)\, d\theta_2} = \frac{g_1(y)}{g_2(y)}\, B_{12}^\eta(y)$. [Didelot, Everitt, Johansen & Lawson, 2011] No discrepancy only under cross-model sufficiency; inability to evaluate the loss brought by the summary statistics
  49. A stylised problem. Central question for the validation of ABC for model choice: when is a Bayes factor based on an insufficient statistic T(y) consistent? Note/warning: inference drawn on T(y) through B^T_12(y) necessarily differs from inference drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, JRSS B, 2013]
  51. A benchmark if toy example. Comparison suggested by a referee of the PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1), opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). Four possible statistics: 1. sample mean ȳ (sufficient for M1 if not M2); 2. sample median med(y) (insufficient); 3. sample variance var(y) (ancillary); 4. median absolute deviation mad(y) = med(|y − med(y)|)
  52. A benchmark if toy example. Same Normal versus Laplace comparison: [boxplots of the ABC approximations to the posterior probabilities under Gaussian and Laplace data, n = 100]
  53. Framework. Starting from the observed sample y = (y1, . . . , yn), not necessarily iid, with true distribution y ∼ Pn. Summary statistics T(y) = Tn = (T1(y), T2(y), · · · , Td(y)) ∈ Rd, with true distribution Tn ∼ Gn.
  54. Framework. Comparison of: under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ Rp1; under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2; turned into: under M1, T(y) ∼ G1,n(·|θ1) and θ1|T(y) ∼ π1(·|Tn); under M2, T(y) ∼ G2,n(·|θ2) and θ2|T(y) ∼ π2(·|Tn)
  55. Assumptions. A collection of asymptotic "standard" assumptions: [A1] is a standard central limit theorem under the true model, with asymptotic mean µ0; [A2] controls the large deviations of the estimator Tn from the model mean µ(θ); [A3] is the standard prior mass condition found in Bayesian asymptotics (di the effective dimension of the parameter); [A4] restricts the behaviour of the model density against the true density. [Think CLT!]
  56. Asymptotic marginals. Asymptotically, under [A1]-[A4], $m_i(t) = \int_{\Theta_i} g_i(t|\theta_i)\, \pi_i(\theta_i)\, d\theta_i$ is such that (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, then $C_l\, v_n^{d-d_i} \leq m_i(T^n) \leq C_u\, v_n^{d-d_i}$, and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0, then $m_i(T^n) = o_{P^n}\big[v_n^{d-\tau_i} + v_n^{d-\alpha_i}\big]$.
  58. Between-model consistency. A consequence of the above is that the asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value µ(θ) of Tn under both models, and only by this mean value! Indeed, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, then $C_l\, v_n^{-(d_1-d_2)} \leq m_1(T^n)/m_2(T^n) \leq C_u\, v_n^{-(d_1-d_2)}$, where Cl, Cu = OPn(1), irrespective of the true model. It only depends on the difference d1 − d2: no consistency
  59. Between-model consistency. Else, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, then $m_1(T^n)/m_2(T^n) \geq C_u \min\big(v_n^{-(d_1-\alpha_2)}, v_n^{-(d_1-\tau_2)}\big)$
  60. Checking for adequate statistics. Run a practical check of the relevance (or non-relevance) of Tn: test the null hypothesis that both models are compatible with the statistic Tn, H0: inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} = 0, against H1: inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} > 0; the testing procedure provides estimates of the mean of Tn under each model and checks for equality
  61. Checking in practice. Under each model Mi, generate an ABC sample θi,l, l = 1, · · · , L. For each θi,l, generate yi,l ∼ Fi,n(·|θi,l), derive Tn(yi,l) and compute $\hat\mu_i = \frac{1}{L} \sum_{l=1}^{L} T^n(y_{i,l}), \quad i = 1, 2$. Conditionally on Tn(y), $\sqrt{L}\, \{\hat\mu_i - E^\pi[\mu_i(\theta_i) \mid T^n(y)]\} \rightsquigarrow N(0, V_i)$. Test for a common mean, H0: µ̂1 ∼ N(µ0, V1), µ̂2 ∼ N(µ0, V2), against the alternative of different means, H1: µ̂i ∼ N(µi, Vi) with µ1 ≠ µ2.
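A hedged R sketch of this check for a scalar summary, as a Wald-type test for the common mean; the function and argument names are hypothetical, and `theta1`, `theta2` stand for ABC posterior samples assumed available from earlier runs:

```r
check_summary <- function(theta1, theta2, sim1, sim2, Tn) {
  # mean and (squared standard error of the) mean of Tn over pseudo-samples
  stats <- function(theta, sim) {
    ts <- sapply(theta, function(th) Tn(sim(th)))
    c(mean = mean(ts), se2 = var(ts) / length(ts))
  }
  s1 <- stats(theta1, sim1); s2 <- stats(theta2, sim2)
  z <- (s1["mean"] - s2["mean"]) / sqrt(s1["se2"] + s2["se2"])  # Wald statistic
  2 * pnorm(-abs(z))   # approximate p-value for H0: common mean
}
```

A small p-value flags the summary Tn as incompatible with (at least) one of the two models.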
  62. Toy example: Laplace versus Gauss. [boxplots of the normalised χ² statistics, without and with mad]
  63. ABC model choice via random forests. Outline: motivating example; Approximate Bayesian computation; ABC for model choice; ABC model choice via random forests (random forests, ABC with random forests); ABC estimation via random forests; [some] asymptotics of ABC
  64. Leaning towards machine learning. Main notions: ABC-MC seen as learning about which model is most appropriate from a huge (reference) table; exploiting a large number of summary statistics is not an issue for machine learning methods intended to estimate efficient combinations; abandoning (temporarily?) the idea of estimating posterior probabilities of the models, poorly approximated by machine learning methods, and replacing those with a posterior predictive expected loss [Cornuet et al., 2016]
  65. Random forests. Technique that stemmed from Leo Breiman's bagging (or bootstrap aggregating) machine learning algorithm for both classification and regression [Breiman, 1996]. Improved classification performances by averaging over classification schemes of randomly generated training sets, creating a "forest" of (CART) decision trees, inspired by Amit and Geman's (1997) ensemble learning [Breiman, 2001]
  66. Random forests as non-parametric regression. CART means Classification and Regression Trees. For regression purposes, i.e., to predict y as f(x), similar binary trees in random forests: 1. at each tree node, split the data into two daughter nodes; 2. the split variable and bound are chosen to minimise a heterogeneity criterion; 3. stop splitting when there is enough homogeneity in the current branch; 4. predicted values at the terminal nodes (or leaves) are the average response y over all observations in the final leaf
  67. Growing the forest. Breiman's solution for inducing random features in the trees of the forest: bootstrap resampling of the dataset, and random sub-setting [of size √t] of the covariates driving the classification at every node of each tree. The covariate xτ that drives the node separation xτ ≤ cτ and the separation bound cτ are chosen by minimising an entropy or Gini index
  69. ABC with random forests. Idea: starting with a possibly large collection of summary statistics (s1i, . . . , spi) (from scientific theory input, to available statistical software, to machine-learning alternatives), build an ABC reference table involving the model index, parameter values and summary statistics for the associated simulated pseudo-data, and run R's randomForest to infer M from (s1i, . . . , spi): at each step O(√p) indices are sampled at random and the most discriminating statistic is selected, by minimising an entropy or Gini loss
  70. ABC with random forests. The average of the trees is the resulting summary statistic, a highly non-linear predictor of the model index
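A hedged R sketch with the randomForest package; `ref` (a reference table with a factor column M for the model index and summary columns) and `obs_summaries` (the observed summaries) are hypothetical objects standing for the ABC reference table described above:

```r
library(randomForest)
# ref: data.frame(M = factor(model index), s1, ..., sp) from ABC simulations
rf <- randomForest(M ~ ., data = ref, ntree = 500)
predict(rf, newdata = obs_summaries)                  # (MAP) model index
predict(rf, newdata = obs_summaries, type = "vote")   # share of trees per model
```

As the next slide stresses, the vote shares are not to be read as posterior probabilities of the models.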
  71. Outcome of ABC-RF. The random forest predicts a (MAP) model index from the observed dataset: the predictor provided by the forest is "sufficient" to select the most likely model, but not to derive the associated posterior probability; one can exploit the entire forest by computing how many trees lead to picking each of the models under comparison, but the variability is too high to be trusted; the frequency of trees associated with the majority model is no proper substitute for the true posterior probability; and the usual ABC-MC approximation is equally highly variable and hard to assess
  73. Posterior predictive expected losses. We suggest replacing the unstable approximation of P(M = m|x°), with x° the observed sample and m the model index, by the average of the selection errors across all models given the data x°, P(M̂(X) ≠ M|x°), where the pair (M, X) is generated from the predictive $\int f(x|\theta)\, \pi(\theta, M|x^o)\, d\theta$ and M̂(x) denotes the random forest model (MAP) predictor
  74. Posterior predictive expected losses. Arguments: Bayesian estimate of the posterior error; integrates the error over the most likely part of the parameter space; gives an averaged error rather than the posterior probability of the null hypothesis; easily computed: given an ABC subsample of parameters from the reference table, simulate pseudo-samples associated with those and derive the error frequency
  75. Comments. Real-data implementation for population genetics with high performances; unlimited aggregation of arbitrary summary statistics; recovery of discriminant statistics when available; automated implementation with reduced calibration; self-evaluation by the posterior predictive error; soon to be included within DIYABC [Pudlo et al., 2016]
  76. ABC estimation via random forests. Outline: motivating example; Approximate Bayesian computation; ABC for model choice; ABC model choice via random forests; ABC estimation via random forests (random forests, the ODOF principle); [some] asymptotics of ABC
  77. Two basic issues with ABC. ABC compares numerous simulated datasets to the observed one. Two major difficulties: to decrease the approximation error (or tolerance ε) and hence ensure the reliability of ABC, the total number of simulations must be very large; the calibration of ABC (tolerance, distance, summary statistics, post-processing, &tc) is critical and hard to automatise
  78. Classification of summaries by random forests. Given a large collection of summary statistics, rather than selecting a subset and excluding the others, estimate each parameter of interest by a machine learning tool like random forests: RF can handle thousands of predictors; ignores useless components; fast estimation method with good local properties; automatised method with few calibration steps; substitute to Fearnhead and Prangle's (2012) preliminary estimation of θ̂(y_obs); includes a natural (classification) distance measure that avoids the choice of both distance and tolerance [Marin et al., 2016]
  79. ABC parameter estimation (ODOF). One dimension = one forest (ODOF) methodology: parametric statistical model {f(y; θ): y ∈ Y, θ ∈ Θ}, Y ⊆ Rn, Θ ⊆ Rp, with intractable density f(·; θ), plus prior distribution π(θ). Inference on a quantity of interest ψ(θ) ∈ R (posterior means, variances, quantiles or covariances)
  81. Common reference table. Given η: Y → Rk a collection of summary statistics, produce a reference table (RT) used as the learning dataset for multiple random forests, meaning, for 1 ≤ t ≤ N: 1. simulate θ(t) ∼ π(θ); 2. simulate ỹt = (ỹ1,t, . . . , ỹn,t) ∼ f(y; θ(t)); 3. compute η(ỹt) = {η1(ỹt), . . . , ηk(ỹt)}
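A minimal R sketch of steps 1-3 for a toy model y ∼ N(θ, 1); the prior, the summaries (mean, median, mad) and all names are illustrative assumptions:

```r
N <- 1e5; n <- 50
eta <- function(y) c(mean(y), median(y), mad(y))   # k = 3 summaries
theta <- rnorm(N, 0, 2)                            # 1. theta^(t) ~ prior
ref <- t(sapply(theta, function(th) eta(rnorm(n, th, 1))))  # 2.-3.
colnames(ref) <- c("mean", "median", "mad")
ref <- data.frame(theta = theta, ref)   # the common reference table
```

The same table `ref` is reused by every forest in the ODOF constructions below.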
  82. ABC posterior expectations. Recall that θ = (θ1, . . . , θd) ∈ Rd. ODOF principle: for each θj, construct a separate RF regression with predictor variables equal to the summary statistics η(y) = {η1(y), . . . , ηk(y)}. If Lb(η(y*)) denotes the leaf of the b-th tree associated with η(y*) (the leaf reached through the path of binary choices in tree b), with |Lb| response variables, $\hat E(\theta_j \mid \eta(y^*)) = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{|L_b(\eta(y^*))|} \sum_{t:\, \eta(y_t) \in L_b(\eta(y^*))} \theta_j^{(t)}$ is our ABC estimate
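A hedged R sketch of the ODOF expectation with the randomForest package: regression prediction averages the leaf means over trees, exactly as in the display above; `ref` is the reference table sketched earlier and `obs` a hypothetical one-row data.frame with the observed summaries:

```r
library(randomForest)
# one forest per parameter, predictors = the summary statistics
rf_theta <- randomForest(theta ~ mean + median + mad, data = ref, ntree = 500)
predict(rf_theta, newdata = obs)   # ABC estimate of E(theta | eta(y*))
```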
  84. ABC posterior quantile estimate. Random forests are also available for quantile regression [Meinshausen, 2006, JMLR]. Since $\hat E(\theta_j \mid \eta(y^*)) = \sum_{t=1}^{N} w_t(\eta(y^*))\, \theta_j^{(t)}$ with $w_t(\eta(y^*)) = \frac{1}{B} \sum_{b=1}^{B} \frac{\mathbb{I}_{L_b(\eta(y^*))}(\eta(y_t))}{|L_b(\eta(y^*))|}$, a natural estimate of the cdf of θj is $\hat F(u \mid \eta(y^*)) = \sum_{t=1}^{N} w_t(\eta(y^*))\, \mathbb{I}\{\theta_j^{(t)} \leq u\}$. ABC posterior quantiles and credible intervals are then given by F̂⁻¹
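A hedged sketch with Meinshausen's quantregForest package (argument names may differ across package versions); `ref` and `obs` as before:

```r
library(quantregForest)
qrf <- quantregForest(x = ref[, c("mean", "median", "mad")], y = ref$theta)
# median and a 95% credible interval from the weighted cdf estimate F-hat
predict(qrf, newdata = obs, what = c(0.025, 0.5, 0.975))
```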
  86. ABC variances. While an approximation of Var(θj | η(y*)) is available based on F̂, one may choose a more involved alternative: in a given tree b of a random forest, there exist out-of-bag (oob) entries, i.e., entries not sampled in the associated bootstrap subsample. Use the oob simulations to produce an estimate θ̃j(t) of E{θj | η(yt)} and apply the weights ωt(η(y*)) to the oob residuals: $\widehat{\mathrm{Var}}(\theta_j \mid \eta(y^*)) = \sum_{t=1}^{N} \omega_t(\eta(y^*)) \big(\theta_j^{(t)} - \tilde\theta_j^{(t)}\big)^2$
  88. ABC covariances. For estimating Cov(θj, θℓ | η(y*)), construct a specific random forest for the product of the oob errors for θj and θℓ, $\big(\theta_j^{(t)} - \tilde\theta_j^{(t)}\big)\big(\theta_\ell^{(t)} - \tilde\theta_\ell^{(t)}\big)$, with again the summary statistics η(y) = {η1(y), . . . , ηk(y)} as predictor variables
  89. Human populations example. 50,000 SNP markers genotyped in four Human populations: Yoruba (Africa), Han (East Asia), British (Europe [??]) and American individuals of African ancestry; 30 individuals per population. Comparison of 6 scenarios of evolution which differ from each other by one ancient plus one recent historical event: A) a single out-of-Africa colonisation event giving an ancestral out-of-Africa population, versus two independent out-of-Africa colonisation events; B) the possibility of a recent genetic admixture of Americans of African origin with their African ancestors and individuals of European or East Asian origin.
  90. Human populations example. [figure slides illustrating the six competing scenarios]
  93. Summaries. Use of the 112 summary statistics provided by DIYABC for SNP markers, complemented by the five LDA axes as additional statistics. Prior error rates (%) by classification method, trained on M = 10,000 / 20,000 / 50,000:
      Linear Discriminant Analysis: 9.91 / 9.97 / 10.03
      Rejection ABC, DIYABC summaries: 23.18 / 20.55 / 17.76
      Rejection ABC, LDA summaries: 6.29 / 5.76 / 5.70
      Local logistic reg. on LDA: 6.85 / 6.42 / 6.07
      RF, DIYABC summaries: 8.84 / 7.32 / 6.34
      RF, DIYABC and LDA summaries: 5.01 / 4.66 / 4.18
  94. Outcome. ABC-RF picks scenario 2 as the forecasted scenario on the Human dataset, which is not obvious from the LDA projections (where scenario 2 corresponds to the blue colour)
  95. Comments. Considering previous population genetics studies in the field, it is unsurprising that a single out-of-Africa colonisation event gave an ancestral out-of-Africa population that secondarily split into one European and one East Asian population lineage, with a recent genetic admixture of Americans of African origin with their African ancestors and Europeans; the estimate of the posterior probability of scenario 2 equals 0.998, corresponding to a high level of confidence [?] in choosing scenario 2
  96. Further comments. For scenario 2, the parameters of interest are ra, the admixture rate between Europeans and Africans, t3, the out-of-Africa time, and NA, the effective size of the ancestral population. Reference table containing 2×10⁵ points, from which 300 simulations were excluded to evaluate the accuracy of the different methodologies
  97. Estimates. Coverages and quantile ranges for parameter NA:
      method: RF / rejection / local linear reg. / ridge reg. / neural nets
      coverage 95%: 96.6 / 97.6 / 92.3 / 93.3 / 85
      95% quantile range: 4276.12 / 7241.66 / 3594.01 / 3813.93 / 2675.63
      coverage 90%: 92.6 / 94 / 85.3 / 86.3 / 76.3
      90% quantile range: 3644.28 / 6422.49 / 2897.32 / 3101.17 / 2146.01
  98. [Not so famous] last words. ABC-RF methods are mostly insensitive both to strong correlations between the summary statistics and to the presence of noisy variables, and involve fewer simulations and no calibration. Next steps: adaptive schemes, deep learning, inclusion in DIYABC
  100. [Some] asymptotics of ABC. Outline: motivating example; Approximate Bayesian computation; ABC for model choice; ABC model choice via random forests; ABC estimation via random forests; [some] asymptotics of ABC (asymptotic setup, consistency of ABC posteriors, asymptotic posterior shape, asymptotic behaviour of E^ABC[θ])
  101. Asymptotic setup. Asymptotic: y = y(n) ∼ Pnθ and ε = εn, n → +∞; parametric: θ ∈ Rk, k fixed; concentration of the summary statistics η(zn): there exists b: θ ↦ b(θ) such that ‖η(zn) − b(θ)‖ = oPθ(1) for all θ. Objects of interest: posterior concentration and asymptotic shape of πε(·|η(y(n))) (normality?); convergence of the posterior mean θ̂ = E^ABC[θ|η(y(n))]; asymptotic acceptance rate [Frazier et al., 2016]
  103. Consistency of ABC posteriors. The ABC algorithm is Bayesian consistent at θ0 if for any δ > 0, Π(‖θ − θ0‖ > δ | ‖η(y) − η(z)‖ ≤ ε) → 0 as n → +∞, ε → 0. Bayesian consistency implies that sets containing θ0 have posterior probability tending to one as n → +∞, with the implication being the existence of a specific rate of concentration
  104. Consistency of ABC posteriors. Concentration around the true value and Bayesian consistency impose less stringent conditions on the convergence speed of the tolerance εn to zero, when compared with asymptotic normality of the ABC posterior; asymptotic normality of the ABC posterior mean does not require asymptotic normality of the ABC posterior
  105. Consistency of ABC posteriors. Concentration of the summary η(z): there exists b(θ) such that ‖η(z) − b(θ)‖ = oPθ(1). Consistency: Πεn(‖θ − θ0‖ ≤ δ | η(y)) = 1 + op(1). Convergence rate: there exists δn = o(1) such that Πεn(‖θ − θ0‖ ≤ δn | η(y)) = 1 + op(1)
  106. Consistency of ABC posteriors. Point estimator consistency: for θ̂ = E^ABC[θ|η(y(n))], E^ABC[θ|η(y(n))] − θ0 = op(1) and vn(E^ABC[θ|η(y(n))] − θ0) ⇒ N(0, v)
  107. Rate of convergence. Πε(·| ‖η(y) − η(z)‖ ≤ ε) concentrates at rate λn → 0 if $\limsup_{\varepsilon \to 0}\, \limsup_{n \to +\infty}\, \Pi_\varepsilon(\|\theta - \theta_0\| > \lambda_n M \mid \|\eta(y) - \eta(z)\| \leq \varepsilon) \to 0$ in P0-probability when M goes to infinity. The posterior rate of concentration is related to the rate at which information accumulates about the true parameter vector
  109. Related results. Existing studies on the large sample properties of ABC, in which the asymptotic properties of point estimators derived from ABC have been the primary focus [Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2015]
  110. Convergence when εn ≳ σn. Under the (main) assumptions (A1) there exists σn → 0 such that $P_\theta\big(\sigma_n^{-1}\|\eta(z) - b(\theta)\| > u\big) \leq c(\theta)\, h(u), \quad \lim_{u \to +\infty} h(u) = 0$, and (A2) $\Pi(\|b(\theta) - b(\theta_0)\| \leq u) \asymp u^D$, u ≈ 0: posterior consistency; a posterior concentration rate λn that depends on the deviation control of d2{η(z), b(θ)}; a posterior concentration rate for b(θ) bounded from below by O(εn)
  111. Convergence when εn ≳ σn. Then $\Pi_{\varepsilon_n}\big(\|b(\theta) - b(\theta_0)\| \lesssim \varepsilon_n + \sigma_n h^{-1}(\varepsilon_n^D) \mid \eta(y)\big) = 1 + o_{p_0}(1)$. If also ‖θ − θ0‖ ≤ L‖b(θ) − b(θ0)‖^α locally and θ ↦ b(θ) is one-to-one, $\Pi_{\varepsilon_n}\big(\|\theta - \theta_0\| \lesssim \underbrace{\varepsilon_n^\alpha + \sigma_n^\alpha\, (h^{-1}(\varepsilon_n^D))^\alpha}_{\delta_n} \mid \eta(y)\big) = 1 + o_{p_0}(1)$
  112. Comments. (A1): if $P_\theta\big(\sigma_n^{-1}\|\eta(z) - b(\theta)\| > u\big) \leq c(\theta) h(u)$, two cases: 1. polynomial tail, $h(u) \lesssim u^{-\kappa}$: then $\delta_n = \varepsilon_n + \sigma_n\, \varepsilon_n^{-D/\kappa}$; 2. exponential tail, $h(u) \lesssim e^{-cu}$: then $\delta_n = \varepsilon_n + \sigma_n \log(1/\varepsilon_n)$. E.g., $\eta(y) = n^{-1}\sum_i g(y_i)$ with moment conditions on g (case 1) or a Laplace transform (case 2)
  114. Comments. (A2): $\Pi(\|b(\theta) - b(\theta_0)\| \leq u) \asymp u^D$: if Π is regular enough then D = dim(θ); no need to approximate the density f(η(y)|θ). The same results hold when εn = o(σn) if (A1) is replaced with $\inf_{|x| \leq M} P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) - x\| \leq u\big) \gtrsim u^D, \quad u \approx 0$
  115. Proof. Simple enough proof: assume σn ≤ δεn and ‖η(y) − b(θ0)‖ ≲ σn, ‖η(y) − η(z)‖ ≤ εn. Hence ‖b(θ) − b(θ0)‖ > δn ⇒ ‖η(z) − b(θ)‖ > δn − εn − σn =: tn. Also, if ‖b(θ) − b(θ0)‖ ≤ εn/3, then ‖η(y) − η(z)‖ ≤ ‖η(z) − b(θ)‖ + σn + εn/3, and $\Pi_{\varepsilon_n}(\|b(\theta) - b(\theta_0)\| > \delta_n \mid y) \leq \frac{\int_{\|b(\theta)-b(\theta_0)\| > \delta_n} P_\theta(\|\eta(z) - b(\theta)\| > t_n)\, d\Pi(\theta)}{\int_{\|b(\theta)-b(\theta_0)\| \leq \varepsilon_n/3} P_\theta(\|\eta(z) - b(\theta)\| \leq \varepsilon_n/3)\, d\Pi(\theta)} \lesssim \varepsilon_n^{-D}\, h(t_n \sigma_n^{-1}) \int_\Theta c(\theta)\, d\Pi(\theta)$
  117. Assumptions. Applicable to a broad range of data structures: [A1] ensures that η(z) concentrates on b(θ), unescapable; [A2] controls the degree of prior mass in a neighbourhood of θ0, standard in Bayesian asymptotics; [A2] if Π is absolutely continuous with a prior density p bounded above and below near θ0, then D = dim(θ) = kθ; [A3] is an identification condition critical for getting posterior concentration around θ0, b being injective depending on the true structural model and the particular choice of η.
  118. Summary statistic and (in)consistency. Consider the moving average MA(2) model $y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2}$, et ∼iid N(0, 1), with −2 ≤ θ1 ≤ 2, θ1 + θ2 ≥ −1, θ1 − θ2 ≤ 1, and summary statistics equal to the sample autocovariances $\eta_j(y) = T^{-1} \sum_{t=1+j}^{T} y_t\, y_{t-j}, \quad j = 0, 1$, with $\eta_0(y) \xrightarrow{P} E[y_t^2] = 1 + \theta_{01}^2 + \theta_{02}^2$ and $\eta_1(y) \xrightarrow{P} E[y_t y_{t-1}] = \theta_{01}(1 + \theta_{02})$. For the ABC target pε(θ|η(y)) to be degenerate at θ0, $0 = b(\theta_0) - b(\theta) = \begin{pmatrix} 1 + \theta_{01}^2 + \theta_{02}^2 \\ \theta_{01}(1 + \theta_{02}) \end{pmatrix} - \begin{pmatrix} 1 + \theta_1^2 + \theta_2^2 \\ \theta_1(1 + \theta_2) \end{pmatrix}$ must have the unique solution θ = θ0. Take θ01 = .6, θ02 = .2: the equation has two solutions, θ1 = .6, θ2 = .2 and θ1 ≈ .5453, θ2 ≈ .3204
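A quick numerical check of the two roots quoted above: the binding function b(θ) = (1 + θ1² + θ2², θ1(1 + θ2)) takes the same value at both parameter pairs, so these two autocovariance summaries cannot discriminate them.

```r
b <- function(th) c(1 + th[1]^2 + th[2]^2, th[1] * (1 + th[2]))
b(c(0.6, 0.2))        # 1.40 0.72
b(c(0.5453, 0.3204))  # ~1.40 ~0.72: same binding-function value
```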
  120. Concentration for the MA(2) model. True value θ0 = (0.6, 0.2); summaries: the first three autocorrelations; tolerance proportional to εT = 1/T^0.4. Rejection of normality of these posteriors
  121. Asymptotic shape of the posterior distribution. Shape of Πε(·| ‖η(y) − η(z)‖ ≤ εn) for several connections between εn and the rate σn at which η(yn) satisfies a CLT. Three different regimes: 1. σn = o(εn) → uniform limit; 2. σn ≍ εn → perturbed Gaussian limit; 3. εn = o(σn) → Gaussian limit
  123. Scaling matrices. Introduction of a sequence of (k, k) p.d. matrices Σn(θ) such that for all θ near θ0, $c_1 D_n \preceq \Sigma_n(\theta) \preceq c_2 D_n, \quad D_n = \mathrm{diag}(d_n(1), \cdots, d_n(k))$, with 0 < c1, c2 < +∞ and dn(j) → +∞ for all j: possibly different convergence rates for the components of η(z). Reorder the components so that dn(1) ≤ · · · ≤ dn(k), with the assumption that lim inf_n dn(j)εn = lim sup_n dn(j)εn
  124. New assumptions. (B1) Concentration of the summary η: Σn(θ) ∈ Rk1×k1 with ‖Σn(θ)‖ = o(1), $\Sigma_n(\theta)^{-1}\{\eta(z) - b(\theta)\} \Rightarrow N_{k_1}(0, \mathrm{Id})$, and Σn(θ)Σn(θ0)⁻¹ → C°; (B2) b(θ) is C¹ and ‖θ − θ0‖ ≲ ‖b(θ) − b(θ0)‖; (B3) dominated convergence and $\lim_n \frac{P_\theta\big(\Sigma_n(\theta)^{-1}\{\eta(z) - b(\theta)\} \in u + B(0, u_n)\big)}{\prod_j u_n(j)} = \varphi(u)$
  125. Main result. Set Σn(θ) = σn D(θ) for θ ≈ θ0 and Z° = Σn(θ0)⁻¹(η(y) − b(θ0)); then, under (B1) and (B2), when εn σn⁻¹ → +∞, $\Pi_{\varepsilon_n}[\varepsilon_n^{-1}(\theta - \theta_0) \in A \mid y] \Rightarrow U_{B_0}(A), \quad B_0 = \{x \in \mathbb{R}^k : \|b'(\theta_0)^T x\| \leq 1\}$
  126. Main result. Under (B1) and (B2), when εn σn⁻¹ → c, $\Pi_{\varepsilon_n}[\Sigma_n(\theta_0)^{-1}(\theta - \theta_0) - Z^o \in A \mid y] \Rightarrow Q_c(A)$, with Qc a perturbation of the Gaussian limit
  127. Main result. When εn σn⁻¹ → 0 and (B3) holds, set $V_n = [b'(\theta_0)]^T \Sigma_n(\theta_0)\, b'(\theta_0)$; then $\Pi_{\varepsilon_n}[V_n^{-1}(\theta - \theta_0) - \tilde Z^o \in A \mid y] \Rightarrow \Phi(A)$
  128. Intuition (?!). Set x(θ) = σn⁻¹(θ − θ0) − Z° (k = 1); then $\pi_n := \Pi_{\varepsilon_n}[\varepsilon_n^{-1}(\theta - \theta_0) \in A \mid y] = \frac{\int_{|\theta-\theta_0| \leq u_n} \mathbb{I}_{x(\theta) \in A}\, P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)\| \leq \sigma_n^{-1}\varepsilon_n\big)\, p(\theta)\, d\theta}{\int_{|\theta-\theta_0| \leq u_n} P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)\| \leq \sigma_n^{-1}\varepsilon_n\big)\, p(\theta)\, d\theta} + o_p(1)$. If εn/σn ≫ 1: $P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)\| \leq \sigma_n^{-1}\varepsilon_n\big) = 1 + o(1)$ iff ‖x‖ ≤ σn⁻¹εn + o(1); if εn/σn = o(1): $P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x\| \leq \sigma_n^{-1}\varepsilon_n\big) = \varphi(x)\, \sigma_n (1 + o(1))$
  129. More comments. Surprising: a U(−εn, εn) limit when εn ≫ σn; but not that surprising, since εn = o(1) means concentration around θ0 and σn = o(εn) implies that b(θ) − b(θ0) ≈ η(z) − η(y); again, no need to control the approximation of f(η(y)|θ) by a Gaussian density: merely a control of the distribution; generalisation to the case where the eigenvalues dn,1, . . . , dn,k of Σn differ; behaviour of E^ABC(θ|y) consistent with Li & Fearnhead (2016)
  131. Even more comments. If (also) p(θ) is Hölder β, $E^{ABC}(\theta \mid y) - \theta_0 = \underbrace{\sigma_n Z^o / b'(\theta_0)}_{\text{score for } f(\eta(y)|\theta)} + \underbrace{\textstyle\sum_{j=1}^{\lfloor \beta/2 \rfloor} \varepsilon_n^{2j} H_j(\theta_0, p, b)}_{\text{bias from threshold approx.}} + o(\sigma_n) + O(\varepsilon_n^{\beta+1})$, with, if ε²n = o(σn), efficiency: $E^{ABC}(\theta \mid y) - \theta_0 = \sigma_n Z^o / b'(\theta_0) + o(\sigma_n)$. The Hj(θ0, p, b)'s are deterministic, and we gain nothing by getting a first crude θ̂(y) = E^ABC(θ|y) for some η(y) and then rerunning ABC with θ̂(y)
  132. Illustration in the MA(2) setting. Sample sizes of T = 500, 1000. Asymptotic normality rejected for εT = 1/T^0.4, and for θ1 with T = 500 and εT = 1/T^0.55
  133. Asymptotic behaviour of E^ABC[θ]. When p = dim(η(y)) = d = dim(θ) and εn = o(n^{−3/10}), $E^{ABC}[d_T(\theta - \theta_0) \mid y^o] \Rightarrow N\big(0, ((\nabla b^o)^T \Sigma^{-1} \nabla b^o)^{-1}\big)$ [Li & Fearnhead, 2016]. In fact, if $\varepsilon_n^{\beta+1}\sqrt{n} = o(1)$, with β the Hölder-smoothness of π, $E^{ABC}[(\theta - \theta_0) \mid y^o] = \frac{(\nabla b^o)^{-1} Z^o}{\sqrt{n}} + \sum_{j=1}^{k} h_j(\theta_0)\, \varepsilon_n^{2j} + o_p(1), \quad 2k = \lfloor \beta \rfloor$. Iterating for fixed p is mildly interesting: if η̃(y) = E^ABC[θ|y°], then $E^{ABC}[\theta \mid \tilde\eta(y)] = \theta_0 + \frac{(\nabla b^o)^{-1} Z^o}{\sqrt{n}} + \frac{\pi'(\theta_0)}{\pi(\theta_0)}\, \varepsilon_n^2 + o(\cdot)$ [Fearnhead & Prangle, 2012]
  135. More asymptotic behaviour of E^ABC[θ]. Li and Fearnhead (2016, 2017) consider that E^ABC[dT(θ − θ0)|y°] is not optimal when p > d. If $\sqrt{n}\, \varepsilon_n^2 = o(1)$ and $(\varepsilon_n \sqrt{n})^{-1} = o(1)$, $\sqrt{n}\, [E^{ABC}(\theta) - \theta_0] = P_{\nabla b^o} Z^o + o_p(1)$, with $Z^o = \sqrt{n}\, (\eta(y) - b^o)$, $P_{\nabla b^o} Z^o = ((\nabla b^o)^T \nabla b^o)^{-1} (\nabla b^o)^T Z^o$ and $V_{as}(P_{\nabla b^o} Z^o) \geq \big((\nabla b^o)^T V_{as}(Z^o)^{-1} \nabla b^o\big)^{-1}$. If $\varepsilon_n \sqrt{n} = o(1)$, $\sqrt{n}\, [E^{ABC}(\theta) - \theta_0] = \big((\nabla b^o)^T \Sigma^{-1} \nabla b^o\big)^{-1} (\nabla b^o)^T \Sigma^{-1} Z^o + o_p(1)$
  136. Impact of the dimension of η. The dimension of η(·) does not impact the above result, but impacts the acceptance probability: if εn = o(σn), with k1 = dim(η(y)), k = dim(θ) and k1 ≥ k, $\alpha_n := \Pr(\|y - z\| \leq \varepsilon_n) \asymp \varepsilon_n^{k_1}\, \sigma_n^{-k_1 + k}$; if εn ≳ σn, $\alpha_n \asymp \varepsilon_n^{k}$. If we choose αn: αn = o(σn^k) leads to $\varepsilon_n = \sigma_n (\alpha_n \sigma_n^{-k})^{1/k_1} = o(\sigma_n)$; αn ≳ σn^k leads to $\varepsilon_n \asymp \alpha_n^{1/k}$.
  137. Illustration in the MA(2) setting. Sample sizes of T = 500, 1000. Asymptotic normality accepted for all graphs
  138. Practical implications. In practice, the tolerance is determined by a quantile (nearest neighbours): select all θi associated with the α = δ/N smallest distances d2{η(zi), η(y)}, for some δ. Then (i) if $\varepsilon_T \asymp v_T^{-1}$ or $\varepsilon_T = o(v_T^{-1})$, the acceptance rate associated with the threshold εT is $\alpha_T = \mathrm{pr}(\|\eta(z) - \eta(y)\| \leq \varepsilon_T) \asymp (v_T \varepsilon_T)^{k_\eta} \times v_T^{-k_\theta} \lesssim v_T^{-k_\theta}$; (ii) if $\varepsilon_T \gtrsim v_T^{-1}$, $\alpha_T \asymp \varepsilon_T^{k_\theta} \gtrsim v_T^{-k_\theta}$
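A minimal R sketch of this nearest-neighbour calibration, reusing the `ref` table and an observed summary vector `eta_y` (both hypothetical names): the tolerance is set as the α = δ/N quantile of the simulated distances rather than fixed in advance.

```r
d2 <- sqrt(rowSums(sweep(as.matrix(ref[, -1]), 2, eta_y)^2))  # distances to eta(y)
delta <- 100; N <- nrow(ref)
eps_T <- quantile(d2, delta / N)   # data-driven tolerance: ~delta nearest neighbours
accepted <- ref[d2 <= eps_T, ]     # retained parameter draws
mean(d2 <= eps_T)                  # empirical acceptance rate alpha_hat
```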
  140. Monte Carlo error. Link the choice of εT to the Monte Carlo error associated with NT draws in the Algorithm. Conditions (on εT) under which $\hat\alpha_T = \alpha_T\{1 + o_p(1)\}$, where $\hat\alpha_T = \sum_{i=1}^{N_T} \mathbb{1}[d\{\eta(y), \eta(z)\} \leq \varepsilon_T] / N_T$ is the proportion of accepted draws from NT simulated draws of θ: either (i) $\varepsilon_T = o(v_T^{-1})$ and $(v_T \varepsilon_T)^{-k_\eta}\, \varepsilon_T^{-k_\theta} \leq M N_T$, or (ii) $\varepsilon_T \gtrsim v_T^{-1}$ and $\varepsilon_T^{-k_\theta} \leq M N_T$, for M large enough
  141. Conclusion on ABC consistency. Asymptotic description of ABC: different regimes depending on εn and σn; no point in choosing εn arbitrarily small: εn = o(σn) suffices; no asymptotic gain in iterated ABC; results obtained under weak conditions, without studying g(η(z)|θ)
