# CISEA 2019: ABC consistency and convergence


Talk at the CISEA 2019 convergence, Abidjan, 17-19 June 2019

Published in: Science

1. 1. ABC: convergence and misspecification. Christian P. Robert, Université Paris-Dauphine PSL, Paris & University of Warwick, Coventry. Joint works with D. Frazier, J.-M. Marin, G. Martin, and J. Rousseau
2. 2. Disclaimer: I am not this Christian Robert!
3. 3. Outline: Motivating examples; Approximate Bayesian computation; ABC for model choice; Asymptotics of ABC; ABC under misspecification
4. 4. A motivating if pedestrian example: paired and orphan socks. A drawer contains an unknown number of socks, some of which can be paired and some of which are orphans (single). One takes at random 11 socks without replacement from this drawer: no pair can be found among those. What can we infer about the total number of socks in the drawer?
    - sounds like an impossible task
    - one observation $x = 11$ and two unknowns, $n_{\text{socks}}$ and $n_{\text{pairs}}$
    - writing the likelihood is a challenge [exercise]
6. 6. Feller's shoes: A closet contains $n$ pairs of shoes. If $2r$ shoes are chosen at random (with $2r < n$), what is the probability that there will be (a) no complete pair, (b) exactly one complete pair, (c) exactly two complete pairs among them? [Feller, 1970, Chapter II, Exercise 26]
7. 7. Feller's shoes (resolution): The probability of obtaining $j$ pairs among those $2r$ shoes is
    $$p_j = \binom{n}{j}\, 2^{2r-2j} \binom{n-j}{2r-2j} \bigg/ \binom{2n}{2r},$$
    or, for an odd number $t$ of shoes,
    $$p_j = 2^{t-2j} \binom{n}{j} \binom{n-j}{t-2j} \bigg/ \binom{2n}{t}.$$
8. 8. Feller's shoes (application to socks): If one draws 11 socks out of $m$ socks made of $f$ orphans and $g$ pairs, with $f + 2g = m$, the number $k$ of socks from the orphan group is hypergeometric $\mathcal H(11, m, f)$ and the probability to observe 11 orphan socks in total is
    $$\sum_{k=0}^{11} \frac{\binom{f}{k}\binom{2g}{11-k}}{\binom{m}{11}} \times \frac{2^{11-k}\binom{g}{11-k}}{\binom{2g}{11-k}}.$$
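The two closed-form probabilities above can be checked numerically. The following is a minimal sketch, not part of the talk: the function names `prob_complete_pairs` and `prob_all_distinct` are hypothetical, and the sock formula is written after cancelling the common factor $\binom{2g}{11-k}$ that appears in both the numerator and the denominator of the sum.

```python
from math import comb


def prob_complete_pairs(n_pairs, n_drawn, j):
    """P(exactly j complete pairs) when n_drawn shoes are drawn from n_pairs pairs."""
    singles = n_drawn - 2 * j  # drawn shoes whose mate is not drawn
    if j < 0 or singles < 0 or j > n_pairs or n_drawn > 2 * n_pairs:
        return 0.0
    # choose the j complete pairs, then `singles` other pairs, then one shoe of each
    favourable = comb(n_pairs, j) * comb(n_pairs - j, singles) * 2 ** singles
    return favourable / comb(2 * n_pairs, n_drawn)


def prob_all_distinct(n_orphans, n_pairs, n_drawn=11):
    """P(no complete pair) when n_drawn socks are drawn from
    n_orphans single socks plus n_pairs pairs (hypothetical helper)."""
    m = n_orphans + 2 * n_pairs
    if n_drawn > m:
        return 0.0
    total = 0
    for k in range(n_drawn + 1):  # k = number of drawn socks that are orphans
        total += comb(n_orphans, k) * comb(n_pairs, n_drawn - k) * 2 ** (n_drawn - k)
    return total / comb(m, n_drawn)
```

With no orphans, `prob_all_distinct(0, n, t)` reduces to Feller's $p_0$, which gives a quick consistency check between the two formulas.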
9. 9. A prioris on socks: Given parameters $n_{\text{socks}}$ and $n_{\text{pairs}}$, the set of socks is
    $$S = \{s_1, s_1, \ldots, s_{n_{\text{pairs}}}, s_{n_{\text{pairs}}}, s_{n_{\text{pairs}}+1}, \ldots, s_{n_{\text{socks}}}\}$$
    and 11 socks picked at random from $S$ give $X$ unique socks.
    Rasmus' reasoning: If you are a family of 3-4 persons then a guesstimate would be that you have something like 15 pairs of socks in store. It is also possible that you have much more than 30 socks. So as a prior for $n_{\text{socks}}$ I'm going to use a negative binomial with mean 30 and standard deviation 15. On $2n_{\text{pairs}}/n_{\text{socks}}$ I'm going to put a Beta prior distribution that puts most of the probability over the range 0.75 to 1.0. [Rasmus Bååth's Research Blog, Oct 20th, 2014]
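11. 11. Simulating the experiment: Given a prior distribution on $n_{\text{socks}}$ and $n_{\text{pairs}}$,
    $$n_{\text{socks}} \sim \mathcal{N}eg(30, 15), \qquad n_{\text{pairs}} \mid n_{\text{socks}} \sim \frac{n_{\text{socks}}}{2}\, \mathcal{B}e(15, 2),$$
    it is possible to
    1. generate new values of $n_{\text{socks}}$ and $n_{\text{pairs}}$,
    2. generate a new observation of $X$, the number of unique socks out of 11,
    3. accept the pair $(n_{\text{socks}}, n_{\text{pairs}})$ if the realisation of $X$ is equal to 11.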
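The three steps above can be sketched as a short simulation. This is an illustrative sketch under stated assumptions rather than the code used for the talk: the function names are hypothetical, $\mathcal{N}eg(30, 15)$ is read as a negative binomial with mean 30 and standard deviation 15 (converted to NumPy's size/probability parameterisation), and the number of pairs is obtained by rounding the Beta proportion times $\lfloor n_{\text{socks}}/2 \rfloor$.

```python
import numpy as np


def draw_prior(rng):
    """Draw (n_socks, n_pairs) from the prior described on the slide."""
    mean, sd = 30.0, 15.0
    size = mean ** 2 / (sd ** 2 - mean)   # NumPy's `n` parameter
    prob = mean / sd ** 2                 # NumPy's `p` parameter
    n_socks = int(rng.negative_binomial(size, prob))
    prop_paired = rng.beta(15, 2)
    n_pairs = int(round((n_socks // 2) * prop_paired))  # rounding is an assumption
    return n_socks, n_pairs


def count_distinct(rng, n_socks, n_pairs, n_picked=11):
    """Number of distinct sock types among n_picked socks drawn without replacement."""
    if n_socks < n_picked:
        return 0  # cannot pick 11 socks, so X = 11 is impossible
    n_odd = n_socks - 2 * n_pairs
    labels = np.concatenate([np.repeat(np.arange(n_pairs), 2),
                             np.arange(n_pairs, n_pairs + n_odd)])
    picked = rng.choice(labels, size=n_picked, replace=False)
    return np.unique(picked).size


def abc_socks(n_sim, seed=0, n_picked=11):
    """Keep the prior draws for which all picked socks are distinct (X = 11)."""
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_sim):
        n_socks, n_pairs = draw_prior(rng)
        if count_distinct(rng, n_socks, n_pairs, n_picked) == n_picked:
            accepted.append((n_socks, n_pairs))
    return np.array(accepted, dtype=int).reshape(-1, 2)
```

The rows returned by `abc_socks` are draws from the conditional distribution of $(n_{\text{socks}}, n_{\text{pairs}})$ given $X = 11$, since the observation is discrete and matched exactly.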
13. 13. Meaning: [Figure: density of $n_{\text{socks}}$ (axis "ns" from 0 to 60, density from 0 to 0.06)] The outcome of this simulation method returns a distribution on the pair $(n_{\text{socks}}, n_{\text{pairs}})$ that is the conditional distribution of the pair given the observation $X = 11$. Proof: Generations from $\pi(n_{\text{socks}}, n_{\text{pairs}})$ are accepted with probability
    $$P\{X = 11 \mid (n_{\text{socks}}, n_{\text{pairs}})\}.$$
14. 14. Meaning (proof, continued): Hence accepted values are distributed from
    $$\pi(n_{\text{socks}}, n_{\text{pairs}}) \times P\{X = 11 \mid (n_{\text{socks}}, n_{\text{pairs}})\} \propto \pi(n_{\text{socks}}, n_{\text{pairs}} \mid X = 11).$$
15. 15. Additional example: Take a Normal sample $x_1, \ldots, x_n \sim \mathcal N(\mu, \sigma^2)$ summarised into the (insufficient) statistics
    $$\hat\mu_n = \text{med}(x_1, \ldots, x_n), \qquad \hat\sigma_n = \text{mad}(x_1, \ldots, x_n) = \text{med}\,|x_i - \text{med}(x_1, \ldots, x_n)|.$$
    Under a conjugate prior $\pi(\mu, \sigma^2)$, the posterior is close to intractable, but simulation of $(\hat\mu_n, \hat\sigma_n)$ is straightforward.
18. 18. Approximate Bayesian computation (outline): Motivating examples; Approximate Bayesian computation [ABC basics; Automated summary selection]; ABC for model choice; Asymptotics of ABC; ABC under misspecification
19. 19. Intractable likelihoods: Cases when the likelihood function $f(y|\theta)$ is unavailable and when the completion step
    $$f(y|\theta) = \int_{\mathcal Z} f(y, z|\theta)\, \mathrm dz$$
    is impossible or too costly because of the dimension of $z$. © MCMC cannot be implemented
20. 20. The ABC method: Bayesian setting: the target is $\pi(\theta) f(x|\theta)$. When the likelihood $f(x|\theta)$ is not in closed form, use a likelihood-free rejection technique.
    ABC algorithm: For an observation $y \sim f(y|\theta)$, under the prior $\pi(\theta)$, keep jointly simulating
    $$\theta' \sim \pi(\theta), \qquad z \sim f(z|\theta'),$$
    until the auxiliary variable $z$ is equal to the observed value, $z = y$. [Tavaré et al., 1997] © Exact simulation from $\pi(\theta|y)$
22. 22. A as A...pproximative: When $y$ is a continuous random variable, the equality $z = y$ is replaced with a tolerance condition,
    $$\rho(y, z) \le \epsilon,$$
    where $\rho$ is a distance. The output is distributed from
    $$\pi(\theta)\, P_\theta\{\rho(y, z) < \epsilon\} \propto \pi(\theta \mid \rho(y, z) < \epsilon).$$
    [Pritchard et al., 1999]
24. 24. ABC algorithm: Algorithm 1, Likelihood-free rejection sampler 2
    - for $i = 1$ to $N$ do
        - repeat
            - generate $\theta'$ from the prior distribution $\pi(\cdot)$
            - generate $z$ from the likelihood $f(\cdot|\theta')$
        - until $\rho\{\eta(z), \eta(y)\} \le \epsilon$
        - set $\theta_i = \theta'$
    - end for

    where $\eta(y)$ defines a (not necessarily sufficient) statistic
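The rejection sampler above can be sketched on the Normal example with median and MAD summaries. This is a hedged illustration, not the talk's implementation: the prior ($\mu \sim \mathcal N(0, 5^2)$, $\sigma \sim \mathcal{E}xp(1)$), the Euclidean distance and all function names are assumptions, and proposals are simulated in batches instead of one at a time, which leaves the distribution of the accepted draws unchanged.

```python
import numpy as np


def summaries(x):
    """Median and MAD along the last axis; works for one sample or a batch."""
    x = np.asarray(x, dtype=float)
    med = np.median(x, axis=-1, keepdims=True)
    mad = np.median(np.abs(x - med), axis=-1)
    return np.stack([med[..., 0], mad], axis=-1)


def abc_rejection(y, n_keep, eps, rng, batch=2000):
    """Likelihood-free rejection sampler returning n_keep accepted (mu, sigma) pairs."""
    s_obs = summaries(y)
    n = len(y)
    kept, n_kept = [], 0
    while n_kept < n_keep:
        mu = rng.normal(0.0, 5.0, size=batch)       # assumed prior on mu
        sigma = rng.exponential(1.0, size=batch)    # assumed prior on sigma
        z = rng.normal(mu[:, None], sigma[:, None], size=(batch, n))
        dist = np.linalg.norm(summaries(z) - s_obs, axis=1)
        ok = dist <= eps                            # tolerance condition
        kept.append(np.column_stack([mu[ok], sigma[ok]]))
        n_kept += int(ok.sum())
    return np.concatenate(kept)[:n_keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(1.0, 1.0, size=50)
    sample = abc_rejection(y, n_keep=200, eps=0.2, rng=rng)
    print(sample.mean(axis=0))
```

Shrinking `eps` brings the output closer to the posterior given the summaries, at the cost of a lower acceptance rate.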
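25. 25. Output: The likelihood-free algorithm samples from the marginal in $z$ of
    $$\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\, f(z|\theta)\, \mathbb I_{A_{\epsilon,y}}(z)}{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta)\, f(z|\theta)\, \mathrm dz\, \mathrm d\theta},$$
    where $A_{\epsilon,y} = \{z \in \mathcal D \mid \rho(\eta(z), \eta(y)) < \epsilon\}$. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of a posterior distribution, namely the posterior given the summary $\eta(y)$:
    $$\pi_\epsilon(\theta|y) = \int \pi_\epsilon(\theta, z|y)\, \mathrm dz \approx \pi(\theta \mid \eta(y)).$$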
27. 27. Dogger Bank re-enactment: Battle of Dogger Bank on Jan 24, 1915, between British and German fleets: how likely was the British victory? [MacKay, Price, and Wood, 2016]
29. 29. Dogger Bank re-enactment: Battle of Dogger Bank on Jan 24, 1915, between British and German fleets: ABC simulation of the posterior distribution [MacKay, Price, and Wood, 2016]
30. 30. Twenty years of ABC advances: Simulating from the prior is often poor in efficiency.
    - Either modify the proposal distribution on $\theta$ to increase the density of $x$'s within the vicinity of $y$... [Beaumont et al., 2009; Mengersen et al., 2013; Clarté et al., 2019]
    - ...or view the problem as a conditional density estimation and develop techniques to allow for larger $\epsilon$ [Beaumont et al., 2002; Blum & François, 2009]
    - ...or even include $\epsilon$ in the inferential framework [ABCµ] [Ratmann et al., 2009]
34. 34. Noisily exact ABC: Idea: modify the data from the start,
    $$\tilde y = y_0 + \epsilon \zeta_1,$$
    with the same scale $\epsilon$ as ABC, and run ABC on $\tilde y$. © ABC produces an exact simulation from $\pi_\epsilon(\theta|\tilde y) = \pi(\theta|\tilde y)$ [Dean et al., 2011; Fearnhead and Prangle, 2012]
36. 36. Consistent noisy ABC: Degrading the data improves the estimation performances (!):
    - noisy ABC-MLE is asymptotically (in $n$) consistent
    - under further assumptions, noisy ABC-MLE is asymptotically normal
    - increase in variance of order $\epsilon^{-2}$
    - likely degradation in precision or computing time due to the lack of summary statistic [curse of dimensionality]
37. 37. Semi-automatic ABC: Fearnhead and Prangle (2012) study noisy ABC and the selection of the summary statistic from a purely inferential viewpoint, calibrated for estimation purposes.
    - Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic [calibration constraint: ABC approximation with the same posterior mean as the true randomised posterior]
    - Optimality of the posterior expectation $\mathbb E[\theta|y]$ of the parameter of interest as summary statistic $\eta(y)$!
39. 39. Fully automatic ABC: Implementation of ABC still requires the input of a collection of summaries. Towards automation:
    - statistical projection techniques (LDA, PCA, NP-GLS, etc.)
    - variable selection machine learning approaches [Raynal & al., 2017]
    - bypassing summaries altogether
40. 40. ABC with Wasserstein distance: Use as distance between simulated and observed samples the Wasserstein distance:
    $$W_p(y_{1:n}, z_{1:n})^p = \inf_{\sigma \in \mathcal S_n} \frac1n \sum_{i=1}^n \rho(y_i, z_{\sigma(i)})^p$$
    - generalises Kolmogorov–Smirnov
    - covers well- and mis-specified cases
    - only depends on the data space distance $\rho(\cdot, \cdot)$
    - covers iid and dynamic models (curve matching)
    - computationally feasible (linear in dimension, cubic in sample size)
    - Hilbert curve approximation in higher dimensions

    [Bernton et al., 2019]
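For univariate samples of equal size with $\rho(a, b) = |a - b|$ and $p \ge 1$, the infimum over permutations above is attained by matching order statistics, so the distance reduces to a sort. The sketch below relies on that special case only; the function name is hypothetical, and multivariate data would require an assignment solver or the Hilbert-curve approximation mentioned on the slide.

```python
import numpy as np


def wasserstein_1d(y, z, p=1):
    """p-Wasserstein distance between two equal-size univariate samples.

    Assumes rho(a, b) = |a - b| and p >= 1, in which case the optimal
    permutation pairs the sorted values of y with the sorted values of z.
    """
    y = np.sort(np.asarray(y, dtype=float))
    z = np.sort(np.asarray(z, dtype=float))
    if y.shape != z.shape:
        raise ValueError("samples must have the same size")
    return float(np.mean(np.abs(y - z) ** p) ** (1.0 / p))
```

The distance is zero exactly when `z` is a permutation of `y`, which is the identification condition used for exchangeable models in the consistency result that follows.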
41. 41. Consistent inference with Wasserstein distance: As $\varepsilon \to 0$ [and $n$ fixed]. If either
    1. $f^{(n)}_\theta$ is $n$-exchangeable and $D(y_{1:n}, z_{1:n}) = 0$ if and only if $z_{1:n} = y_{\sigma(1:n)}$ for some $\sigma \in \mathcal S_n$, or
    2. $D(y_{1:n}, z_{1:n}) = 0$ if and only if $z_{1:n} = y_{1:n}$,

    then, at $y_{1:n}$ fixed, the ABC posterior converges strongly to the posterior as $\varepsilon \to 0$. [Bernton et al., 2019]
42. 42. Consistent inference with Wasserstein distance: As $n \to \infty$ [at $\varepsilon$ fixed], the WABC distribution with a fixed $\varepsilon$ does not converge in $n$ to a Dirac mass [Bernton et al., 2019]
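43. 43. Consistent inference with Wasserstein distance: As $\varepsilon_n \to 0$ and $n \to \infty$. Under a range of assumptions, if $f_n(\varepsilon_n) \to 0$ and $P(W(\hat\mu_n, \mu_\star) \le \varepsilon_n) \to 1$, then the WABC posterior with threshold $\varepsilon_n + \varepsilon_\star$ satisfies
    $$\pi_{\varepsilon_n + \varepsilon_\star}\left\{\theta \in \mathcal H : W(\mu_\star, \mu_\theta) > \varepsilon_\star + 4\varepsilon_n/3 + f_n^{-1}(\varepsilon_n^L/R) \,\middle|\, y_{1:n}\right\} \overset{P}{\le} \delta$$
    [Bernton et al., 2019]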
44. 44. A bivariate Gaussian illustration: 100 observations from a bivariate Normal with variance 1 and covariance 0.55. Compare WABC with ABC based on (g) raw Euclidean distance and (o) Euclidean distance between sample means, on $10^6$ model simulations.
45. 45. ABC for model choice (outline): Motivating examples; Approximate Bayesian computation; ABC for model choice; Asymptotics of ABC; ABC under misspecification
46. 46. Bayesian model choice: Several models $M_1, M_2, \ldots$ are considered simultaneously for a dataset $y$ and the model index $M$ is part of the inference. Use of a prior distribution $\pi(M = m)$, plus a prior distribution on the parameter conditional on the value $m$ of the model index, $\pi_m(\theta_m)$. The goal is to derive the posterior distribution of $M$, a challenging computational target when models are complex.
47. 47. Generic ABC for model choice: Algorithm 2, Likelihood-free model choice sampler (ABC-MC)
    - for $t = 1$ to $T$ do
        - repeat
            - Generate $m$ from the prior $\pi(M = m)$
            - Generate $\theta_m$ from the prior $\pi_m(\theta_m)$
            - Generate $z$ from the model $f_m(z|\theta_m)$
        - until $\rho\{\eta(z), \eta(y)\} < \epsilon$
        - Set $m^{(t)} = m$ and $\theta^{(t)} = \theta_m$
    - end for

    [Cornuet et al., DIYABC, 2009]
48. 48. ABC estimates: The posterior probability $\pi(M = m|y)$ is approximated by the frequency of acceptances from model $m$,
    $$\frac1T \sum_{t=1}^T \mathbb I_{m^{(t)} = m}.$$
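The model choice sampler and its frequency estimate can be sketched on a Gauss versus Laplace comparison (Normal with unit variance against Laplace with scale $1/\sqrt2$). This is a minimal sketch under stated assumptions: a uniform prior on the model index, a $\mathcal N(0, 2^2)$ prior on the location under both models, median and MAD as summaries, and a fixed simulation budget with acceptance below the tolerance instead of the repeat-until loop. All names are hypothetical.

```python
import numpy as np


def summaries(x):
    """Median and MAD along the last axis."""
    x = np.asarray(x, dtype=float)
    med = np.median(x, axis=-1, keepdims=True)
    mad = np.median(np.abs(x - med), axis=-1)
    return np.stack([med[..., 0], mad], axis=-1)


def abc_model_choice(y, n_sim, eps, rng, batch=2000):
    """Return estimated (P(M=1|y), P(M=2|y)); M=1 is Gauss, M=2 is Laplace."""
    s_obs = summaries(y)
    n = len(y)
    counts = np.zeros(2)
    done = 0
    while done < n_sim:
        b = min(batch, n_sim - done)
        m = rng.integers(1, 3, size=b)              # uniform prior on {1, 2}
        theta = rng.normal(0.0, 2.0, size=b)        # assumed prior on the location
        gauss = rng.normal(theta[:, None], 1.0, size=(b, n))
        laplace = rng.laplace(theta[:, None], 1.0 / np.sqrt(2.0), size=(b, n))
        z = np.where((m == 1)[:, None], gauss, laplace)
        dist = np.linalg.norm(summaries(z) - s_obs, axis=1)
        accepted = m[dist < eps]
        counts += np.array([(accepted == 1).sum(), (accepted == 2).sum()])
        done += b
    # frequency of acceptances from each model (guard against zero acceptances)
    return counts / max(counts.sum(), 1.0)
```

Because the MAD has different limits under the two models, including it among the summaries is what lets the acceptance frequencies separate them.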
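49. 49. Limiting behaviour of $B_{12}$ (under sufficiency): If $\eta(y)$ is a sufficient statistic for both models,
    $$f_i(y|\theta_i) = g_i(y)\, f_i^\eta(\eta(y)|\theta_i).$$
    Thus
    $$B_{12}(y) = \frac{\int_{\Theta_1} \pi_1(\theta_1)\, g_1(y)\, f_1^\eta(\eta(y)|\theta_1)\, \mathrm d\theta_1}{\int_{\Theta_2} \pi_2(\theta_2)\, g_2(y)\, f_2^\eta(\eta(y)|\theta_2)\, \mathrm d\theta_2} = \frac{g_1(y) \int \pi_1(\theta_1)\, f_1^\eta(\eta(y)|\theta_1)\, \mathrm d\theta_1}{g_2(y) \int \pi_2(\theta_2)\, f_2^\eta(\eta(y)|\theta_2)\, \mathrm d\theta_2} = \frac{g_1(y)}{g_2(y)}\, B_{12}^\eta(y).$$
    [Didelot, Everitt, Johansen & Lawson, 2011]
    - © No discrepancy only when cross-model sufficiency
    - © Inability to evaluate the loss brought by summary statistics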
51. 51. A stylised problem: Central question to the validation of ABC for model choice: When is a Bayes factor based on an insufficient statistic $T(y)$ consistent? Note/warning: the © drawn on $T(y)$ through $B^T_{12}(y)$ necessarily differs from the © drawn on $y$ through $B_{12}(y)$ [Marin, Pillai, X, & Rousseau, JRSS B, 2013]
53. 53. A benchmark if toy example: Comparison suggested by a referee of the PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model $M_1$: $y \sim \mathcal N(\theta_1, 1)$ opposed to model $M_2$: $y \sim \mathcal L(\theta_2, 1/\sqrt2)$, Laplace distribution with mean $\theta_2$ and scale parameter $1/\sqrt2$ (variance one). Four possible statistics:
    1. sample mean $\bar y$ (sufficient for $M_1$ if not $M_2$);
    2. sample median $\text{med}(y)$ (insufficient);
    3. sample variance $\text{var}(y)$ (ancillary);
    4. median absolute deviation $\text{mad}(y) = \text{med}(|y - \text{med}(y)|)$.
54. 54. A benchmark if toy example (continued): [Figure: two panels of boxplots comparing Gauss and Laplace for n = 100, with vertical axes from 0 to 0.7 and from 0 to 1.0]
55. 55. Framework: Starting from the observed sample $y = (y_1, \ldots, y_n)$, not necessarily iid, with true distribution $y \sim P^n$. Summary statistics
    $$T(y) = T^n = (T_1(y), T_2(y), \cdots, T_d(y)) \in \mathbb R^d$$
    with true distribution $T^n \sim G_n$.
56. 56. Framework: © Comparison of
    - under $M_1$, $y \sim F_{1,n}(\cdot|\theta_1)$ where $\theta_1 \in \Theta_1 \subset \mathbb R^{p_1}$
    - under $M_2$, $y \sim F_{2,n}(\cdot|\theta_2)$ where $\theta_2 \in \Theta_2 \subset \mathbb R^{p_2}$

    turned into
    - under $M_1$, $T(y) \sim G_{1,n}(\cdot|\theta_1)$, and $\theta_1|T(y) \sim \pi_1(\cdot|T^n)$
    - under $M_2$, $T(y) \sim G_{2,n}(\cdot|\theta_2)$, and $\theta_2|T(y) \sim \pi_2(\cdot|T^n)$
57. 57. Assumptions: A collection of asymptotic "standard" assumptions:
    - (A1) is a standard central limit theorem under the true model with asymptotic mean $\mu_0$
    - (A2) controls the large deviations of the estimator $T^n$ from the model mean $\mu(\theta)$
    - (A3) is the standard prior mass condition found in Bayesian asymptotics ($d_i$ effective dimension of the parameter)
    - (A4) restricts the behaviour of the model density against the true density

    [Think CLT/BvM!]
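58. 58. Asymptotic marginals: Asymptotically, under (A1)–(A4),
    $$m_i(t) = \int_{\Theta_i} g_i(t|\theta_i)\, \pi_i(\theta_i)\, \mathrm d\theta_i$$
    is such that
    (i) if $\inf\{|\mu_i(\theta_i) - \mu_0|;\ \theta_i \in \Theta_i\} = 0$,
    $$C_l\, v_n^{d - d_i} \le m_i(T^n) \le C_u\, v_n^{d - d_i}$$
    and (ii) if $\inf\{|\mu_i(\theta_i) - \mu_0|;\ \theta_i \in \Theta_i\} > 0$,
    $$m_i(T^n) = o_{P^n}\left[v_n^{d - \tau_i} + v_n^{d - \alpha_i}\right].$$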
59. 59. Between-model consistency: A consequence of the above is that the asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value $\mu(\theta)$ of $T^n$ under both models. And only by this mean value!
60. 60. Between-model consistency (continued): Indeed, if
    $$\inf\{|\mu_0 - \mu_2(\theta_2)|;\ \theta_2 \in \Theta_2\} = \inf\{|\mu_0 - \mu_1(\theta_1)|;\ \theta_1 \in \Theta_1\} = 0$$
    then
    $$C_l\, v_n^{-(d_1 - d_2)} \le \frac{m_1(T^n)}{m_2(T^n)} \le C_u\, v_n^{-(d_1 - d_2)},$$
    where $C_l, C_u = O_{P^n}(1)$, irrespective of the true model. © Only depends on the difference $d_1 - d_2$: no consistency
61. 61. Between-model consistency (continued): Else, if
    $$\inf\{|\mu_0 - \mu_2(\theta_2)|;\ \theta_2 \in \Theta_2\} > \inf\{|\mu_0 - \mu_1(\theta_1)|;\ \theta_1 \in \Theta_1\} = 0$$
    then
    $$\frac{m_1(T^n)}{m_2(T^n)} \ge C_u \min\left(v_n^{-(d_1 - \alpha_2)},\ v_n^{-(d_1 - \tau_2)}\right).$$
62. 62. Checking for adequate statistics: Run a practical check of the relevance (or non-relevance) of $T^n$:
    - null hypothesis that both models are compatible with the statistic $T^n$,
    $$H_0: \inf\{|\mu_2(\theta_2) - \mu_0|;\ \theta_2 \in \Theta_2\} = 0$$
    against
    $$H_1: \inf\{|\mu_2(\theta_2) - \mu_0|;\ \theta_2 \in \Theta_2\} > 0$$
    - the testing procedure provides estimates of the mean of $T^n$ under each model and checks for equality
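63. 63. Checking in practice:
    - Under each model $M_i$, generate an ABC sample $\theta_{i,l}$, $l = 1, \cdots, L$
    - For each $\theta_{i,l}$, generate $y_{i,l} \sim F_{i,n}(\cdot|\theta_{i,l})$, derive $T^n(y_{i,l})$ and compute
    $$\hat\mu_i = \frac1L \sum_{l=1}^L T^n(y_{i,l}), \qquad i = 1, 2.$$
    - Conditionally on $T^n(y)$,
    $$\sqrt L\, \{\hat\mu_i - \mathbb E^\pi[\mu_i(\theta_i)|T^n(y)]\} \rightsquigarrow \mathcal N(0, V_i).$$
    - Test for a common mean, $H_0: \hat\mu_1 \sim \mathcal N(\mu_0, V_1),\ \hat\mu_2 \sim \mathcal N(\mu_0, V_2)$, against the alternative of different means, $H_1: \hat\mu_i \sim \mathcal N(\mu_i, V_i)$ with $\mu_1 \ne \mu_2$.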
64. 64. Toy example: Laplace versus Gauss. [Figure: boxplots of the normalised $\chi^2$ statistic without and with mad, for Gauss and Laplace samples, vertical axis from 0 to 40]
65. 65. [some] asymptotics of ABC (outline): Motivating examples; Approximate Bayesian computation; ABC for model choice; Asymptotics of ABC [consistency of ABC posteriors; asymptotic posterior shape; asymptotic behaviour of $\mathbb E_{ABC}[\theta]$]; ABC under misspecification
66. 66. Asymptotic setup:
    - asymptotic: $y = y^{(n)} \sim P^n_\theta$ and $\epsilon = \epsilon_n$, $n \to +\infty$
    - parametric: $\theta \in \mathbb R^k$, $k$ fixed
    - concentration of summary statistics $\eta(z^n)$: $\exists\, b : \theta \to b(\theta)$ such that $\eta(z^n) - b(\theta) = o_{P_\theta}(1)$, $\forall \theta$

    Objects of interest:
    - posterior concentration and asymptotic shape of $\pi_\epsilon(\cdot|\eta(y^{(n)}))$ (normality?)
    - convergence of the posterior mean $\hat\theta = \mathbb E_{ABC}[\theta|\eta(y^{(n)})]$
    - asymptotic acceptance rate

    [Frazier et al., 2018]
68. 68. Consistency of ABC posteriors: The ABC algorithm is Bayesian consistent at $\theta_0$ if for any $\delta > 0$,
    $$\Pi\left(\|\theta - \theta_0\| > \delta \,\middle|\, \|\eta(y) - \eta(z)\| \le \varepsilon\right) \to 0$$
    as $n \to +\infty$, $\varepsilon \to 0$. Bayesian consistency implies that sets containing $\theta_0$ have posterior probability tending to one as $n \to +\infty$, with the implication being the existence of a specific rate of concentration.
69. 69. Consistency of ABC posteriors (continued), with Bayesian consistency at $\theta_0$ defined as on the previous slide:
    - Concentration around the true value and Bayesian consistency impose less stringent conditions on the convergence speed of the tolerance $\epsilon_n$ to zero, when compared with asymptotic normality of the ABC posterior
    - asymptotic normality of the ABC posterior mean does not require asymptotic normality of the ABC posterior
70. 70. Consistency of ABC posteriors:
    - Concentration of the summary $\eta(z)$: there exists $b(\theta)$ such that $\eta(z) - b(\theta) = o_{P_\theta}(1)$
    - Consistency: $\Pi_{\epsilon_n}(\|\theta - \theta_0\| \le \delta \mid \eta(y)) = 1 + o_p(1)$
    - Convergence rate: there exists $\delta_n = o(1)$ such that $\Pi_{\epsilon_n}(\|\theta - \theta_0\| \le \delta_n \mid \eta(y)) = 1 + o_p(1)$
71. 71. Consistency of ABC posteriors (continued): Point estimator consistency, with $\hat\theta = \mathbb E_{ABC}[\theta|\eta(y^{(n)})]$:
    $$\mathbb E_{ABC}[\theta|\eta(y^{(n)})] - \theta_0 = o_p(1), \qquad v_n\left(\mathbb E_{ABC}[\theta|\eta(y^{(n)})] - \theta_0\right) \Rightarrow \mathcal N(0, v).$$
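72. 72. Rate of convergence: $\Pi(\cdot \mid \|\eta(y) - \eta(z)\| \le \varepsilon)$ concentrates at rate $\lambda_n \to 0$ if
    $$\limsup_{\varepsilon \to 0}\ \limsup_{n \to +\infty}\ \Pi\left(\|\theta - \theta_0\| > \lambda_n M \,\middle|\, \|\eta(y) - \eta(z)\| \le \varepsilon\right) \to 0$$
    in $P_0$-probability when $M$ goes to infinity. The posterior rate of concentration is related to the rate at which information accumulates about the true parameter vector.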
74. 74. Related results: Studies on the large sample properties of ABC, with focus on the asymptotic properties of ABC point estimators [Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2018a,b]
    - assumption of a CLT for the summary statistic plus regularity assumptions on the convergence of the density of the summary statistic to its normal limit (incl. existence of an Edgeworth expansion with exponential tail control)
    - convergence rate of the posterior mean if $\varepsilon_T = o(1/v_T^{3/5})$
    - acceptance probability chosen as an arbitrary density of $\|\eta(y) - \eta(z)\|$
75. 75. Related results (continued), in comparison with the same references:
    - stronger conditions on $\eta$ than [our own] weak convergence of $v_T\{\eta(z) - b(\theta)\}$, with non-uniform polynomial deviations
    - excludes non-compact parameter spaces
    - when $\varepsilon_T \gg v_T^{-1}$, only deviation bounds and weak convergence are required (weaker than convergence of densities)
    - when $\varepsilon_T = o(1/v_T)$, no requirement on the convergence rate
76. 76. Related results (continued):
    - characterisation of the asymptotic behaviour of the ABC posterior for all $\varepsilon_T = o(1)$ with posterior concentration
    - asymptotic normality and unbiasedness of the posterior mean remain achievable even when $\lim_T v_T \varepsilon_T = \infty$, provided $\varepsilon_T = o(1/v_T^{1/2})$
77. 77. Related results (continued):
    - Li and Fearnhead (2018a) point out that if $k_\eta > k_\theta \ge 1$ and if $\varepsilon_T = o(1/v_T^{3/5})$, the posterior mean is asymptotically normal and unbiased, but not asymptotically efficient
    - if $v_T \varepsilon_T \to \infty$ and $v_T \varepsilon_T^2 = o(1)$, for $k_\eta > k_\theta \ge 1$,
    $$\mathbb E_{\Pi_\varepsilon}\{v_T(\theta - \theta_0)\} = \{\nabla_\theta b(\theta_0)^\top \nabla_\theta b(\theta_0)\}^{-1}\, \nabla_\theta b(\theta_0)^\top\, v_T\{\eta(y) - b(\theta_0)\} + o_p(1).$$
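78. 78. Convergence when $\epsilon_n \gtrsim \sigma_n$ [$\sigma_n = 1/v_T$]: Under (main) assumptions
    - (A1) $\exists\, \sigma_n \to 0$ such that $P_\theta\left(\sigma_n^{-1}\|\eta(z) - b(\theta)\| > u\right) \le c(\theta)\, h(u)$, with $\lim_{u \to +\infty} h(u) = 0$
    - (A2) $\Pi(\|b(\theta) - b(\theta_0)\| \le u) \asymp u^D$, $u \approx 0$

    one gets
    - posterior consistency
    - posterior concentration rate $\lambda_n$ that depends on the deviation control of $d_2\{\eta(z), b(\theta)\}$
    - posterior concentration rate for $b(\theta)$ bounded from below by $O(\epsilon_n)$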
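79. 79. Summary statistic and (in)consistency: Consider the moving average MA(2) model
    $$y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2}, \qquad e_t \sim_{\text{i.i.d.}} \mathcal N(0, 1),$$
    with $-2 \le \theta_1 \le 2$, $\theta_1 + \theta_2 \ge -1$, $\theta_1 - \theta_2 \le 1$. Summary statistics equal to the sample autocovariances
    $$\eta_j(y) = T^{-1} \sum_{t=1+j}^T y_t\, y_{t-j}, \qquad j = 0, 1,$$
    with
    $$\eta_0(y) \overset{P}{\to} \mathbb E[y_t^2] = 1 + (\theta_{01})^2 + (\theta_{02})^2 \quad\text{and}\quad \eta_1(y) \overset{P}{\to} \mathbb E[y_t y_{t-1}] = \theta_{01}(1 + \theta_{02}).$$
    For the ABC target $p_\varepsilon(\theta|\eta(y))$ to degenerate at $\theta_0$,
    $$0 = b(\theta_0) - b(\theta) = \begin{pmatrix} 1 + (\theta_{01})^2 + (\theta_{02})^2 \\ \theta_{01}(1 + \theta_{02}) \end{pmatrix} - \begin{pmatrix} 1 + (\theta_1)^2 + (\theta_2)^2 \\ \theta_1(1 + \theta_2) \end{pmatrix}$$
    must have the unique solution $\theta = \theta_0$. Take $\theta_{01} = .6$, $\theta_{02} = .2$: the equation has two solutions, $\theta_1 = .6, \theta_2 = .2$ and $\theta_1 \approx .5453, \theta_2 \approx .3204$.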
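The lack of identification above can be verified numerically. A minimal sketch, assuming the substitution $\theta_1 = g_1/(1 + \theta_2)$ into the first equation, which gives the quartic $(\theta_2^2 + 1 - g_0)(1 + \theta_2)^2 + g_1^2 = 0$ in $\theta_2$, where $(g_0, g_1) = b(\theta_0)$; the function names are hypothetical.

```python
import numpy as np


def binding(theta):
    """Limit b(theta) of the first two sample autocovariances of an MA(2)."""
    t1, t2 = theta
    return np.array([1.0 + t1 ** 2 + t2 ** 2, t1 * (1.0 + t2)])


def solve_binding(theta0):
    """All theta in the MA(2) constraint region with b(theta) = b(theta0)."""
    g0, g1 = binding(theta0)
    # (theta2^2 + 1 - g0) * (1 + theta2)^2 + g1^2 = 0, as a polynomial in theta2
    poly = np.polymul([1.0, 0.0, 1.0 - g0], [1.0, 2.0, 1.0])
    poly[-1] += g1 ** 2
    solutions = []
    for root in np.roots(poly):
        if abs(root.imag) > 1e-9 or abs(1.0 + root.real) < 1e-12:
            continue  # keep real roots only, and avoid dividing by zero
        t2 = float(root.real)
        t1 = float(g1 / (1.0 + t2))
        if -2 <= t1 <= 2 and t1 + t2 >= -1 and t1 - t2 <= 1:
            solutions.append((t1, t2))
    return sorted(solutions)


if __name__ == "__main__":
    for sol in solve_binding((0.6, 0.2)):
        print(sol, binding(sol))
```

Both returned points map to the same pair of limiting autocovariances, so an ABC posterior built on $(\eta_0, \eta_1)$ alone cannot concentrate on $\theta_0$.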
81. 81. Concentration for the MA(2) model:
    - True value $\theta_0 = (0.6, 0.2)$
    - Summaries: first three autocorrelations
    - Tolerance proportional to $\varepsilon_T = 1/T^{0.4}$
    - Rejection of normality of these posteriors
82. 82. Asymptotic shape of posterior distribution: The shape of $\Pi(\cdot \mid \|\eta(y) - \eta(z)\| \le \epsilon_n)$ depends on the relation between $\epsilon_n$ and the rate $\sigma_n$ at which $\eta(y^n)$ satisfies a CLT. Three different regimes:
    1. $\sigma_n = o(\epsilon_n)$ → Uniform limit
    2. $\sigma_n \asymp \epsilon_n$ → perturbed Gaussian limit
    3. $\sigma_n \gg \epsilon_n$ → Gaussian limit
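83. 83. New assumptions:
    - (B1) Concentration of the summary $\eta$: $\Sigma_n(\theta) \in \mathbb R^{k_1 \times k_1}$ is $o(1)$,
    $$\Sigma_n(\theta)^{-1}\{\eta(z) - b(\theta)\} \Rightarrow \mathcal N_{k_1}(0, \text{Id}), \qquad (\Sigma_n(\theta)\Sigma_n(\theta_0)^{-1})_n = C_o$$
    - (B2) $b(\theta)$ is $C^1$ and $\|\theta - \theta_0\| \lesssim \|b(\theta) - b(\theta_0)\|$
    - (B3) Dominated convergence and
    $$\lim_n \frac{P_\theta\left(\Sigma_n(\theta)^{-1}\{\eta(z) - b(\theta)\} \in u + B(0, u_n)\right)}{\prod_j u_n(j)} = \varphi(u)$$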
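84. 84. Main result: Set $\Sigma_n(\theta) = \sigma_n D(\theta)$ for $\theta \approx \theta_0$ and $Z^o = \Sigma_n(\theta_0)^{-1}(\eta(y) - b(\theta_0))$. Then, under (B1) and (B2), when $\epsilon_n \sigma_n^{-1} \to +\infty$,
    $$\Pi_{\epsilon_n}\left[\epsilon_n^{-1}(\theta - \theta_0) \in A \mid y\right] \Rightarrow U_{B_0}(A), \qquad B_0 = \{x \in \mathbb R^k;\ \|b'(\theta_0)^T x\| \le 1\}.$$
85. 85. Main result (continued): with the same notation, under (B1) and (B2), when $\epsilon_n \sigma_n^{-1} \to c$,
    $$\Pi_{\epsilon_n}\left[\Sigma_n(\theta_0)^{-1}(\theta - \theta_0) - Z^o \in A \mid y\right] \Rightarrow Q_c(A), \qquad Q_c \ne \mathcal N.$$
86. 86. Main result (continued): with the same notation, under (B1) and (B2), when $\epsilon_n \sigma_n^{-1} \to 0$ and (B3) holds, set $V_n = [b'(\theta_0)]^T \Sigma_n(\theta_0)\, b'(\theta_0)$; then
    $$\Pi_{\epsilon_n}\left[V_n^{-1}(\theta - \theta_0) - \tilde Z^o \in A \mid y\right] \Rightarrow \Phi(A).$$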
87. 87. Illustration in the MA(2) setting: Sample sizes of $T = 500, 1000$. Asymptotic normality is rejected for $\varepsilon_T = 1/T^{0.4}$ and, for $\theta_1$, $T = 500$ and $\varepsilon_T = 1/T^{0.55}$.
88. 88. Asymptotic behaviour of $\mathbb E_{ABC}[\theta]$: When $p = \dim(\eta(y)) = d = \dim(\theta)$ and $\epsilon_n = o(n^{-3/10})$,
    $$\mathbb E_{ABC}[d_T(\theta - \theta_0)|y^o] \Rightarrow \mathcal N\left(0, \left((\nabla b^o)^T \Sigma^{-1} \nabla b^o\right)^{-1}\right)$$
    [Li & Fearnhead (2018a)] In fact, if $\epsilon_n^{\beta+1}\sqrt n = o(1)$, with $\beta$ the Hölder-smoothness of $\pi$,
    $$\mathbb E_{ABC}[(\theta - \theta_0)|y^o] = \frac{(\nabla b^o)^{-1} Z^o}{\sqrt n} + \sum_{j=1}^k h_j(\theta_0)\, \epsilon_n^{2j} + o_p(1), \qquad 2k = \lfloor\beta\rfloor$$
    [Fearnhead & Prangle, 2012]
89. 89. Asymptotic behaviour of $\mathbb E_{ABC}[\theta]$ (continued): Iterating for fixed $p$ is mildly interesting: if $\tilde\eta(y) = \mathbb E_{ABC}[\theta|y^o]$ then
    $$\mathbb E_{ABC}[\theta|\tilde\eta(y)] = \theta_0 + \frac{(\nabla b^o)^{-1} Z^o}{\sqrt n} + \frac{\pi'(\theta_0)}{\pi(\theta_0)}\, \epsilon_n^2 + o()$$
    [Fearnhead & Prangle, 2012]
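90. 90. Practical implications: In practice, the tolerance is determined by a quantile (nearest neighbours): select all $\theta_i$ associated with the $\alpha = \delta/N$ smallest distances $d_2\{\eta(z_i), \eta(y)\}$ for some $\delta$. Then
    (i) if $\varepsilon_T \asymp v_T^{-1}$ or $\varepsilon_T = o(v_T^{-1})$, the acceptance rate associated with the threshold $\varepsilon_T$ is
    $$\alpha_T = \text{pr}\left(\|\eta(z) - \eta(y)\| \le \varepsilon_T\right) \asymp (v_T \varepsilon_T)^{k_\eta} \times v_T^{-k_\theta} \lesssim v_T^{-k_\theta}$$
    (ii) if $\varepsilon_T \gg v_T^{-1}$,
    $$\alpha_T = \text{pr}\left(\|\eta(z) - \eta(y)\| \le \varepsilon_T\right) \asymp \varepsilon_T^{k_\theta} \gg v_T^{-k_\theta}$$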
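The quantile (nearest-neighbour) choice of the tolerance can be sketched as follows. This is an illustrative sketch under assumptions: `alpha` is taken to be the fraction of simulations to keep, the distance is Euclidean, and the function name and argument layout are hypothetical.

```python
import numpy as np


def abc_nearest_neighbours(thetas, sim_summaries, obs_summary, alpha):
    """Keep the fraction alpha of simulations closest to the observed summary.

    thetas:        array of shape (N, k_theta) or (N,), prior draws
    sim_summaries: array of shape (N, k_eta), summaries of simulated data
    obs_summary:   array of shape (k_eta,), observed summary
    Returns the accepted thetas and the implied tolerance (largest accepted distance).
    """
    dist = np.linalg.norm(sim_summaries - obs_summary, axis=1)
    n_keep = max(1, int(round(alpha * len(dist))))
    keep = np.argsort(dist)[:n_keep]   # indices of the n_keep smallest distances
    eps = float(dist[keep[-1]])        # tolerance implied by the quantile
    return thetas[keep], eps
```

The implied tolerance `eps` is then a random quantity, and the empirical acceptance rate is fixed at `alpha` by construction.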
92. 92. Curse of dimensionality:
    - For reasonable statistical behaviour, the rate of decline of $\alpha_T$ is faster the larger the dimension of $\theta$, $k_\theta$, but is unaffected by the dimension of $\eta$, $k_\eta$
    - Theoretical justification for dimension reduction methods that process parameter components individually and independently of other components [Fearnhead & Prangle, 2012; Martin & al., 2016]
    - the importance sampling approach of Li & Fearnhead (2018a) yields acceptance rates $\alpha_T = O(1)$ when $\varepsilon_T = O(1/v_T)$
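94. 94. Monte Carlo error Link the choice of $\varepsilon_T$ to the Monte Carlo error associated with $N_T$ draws in the ABC algorithm. Conditions (on $\varepsilon_T$) under which $\hat\alpha_T = \alpha_T\{1 + o_p(1)\}$, where $\hat\alpha_T = \sum_{i=1}^{N_T} \mathbb{1}[d\{\eta(y), \eta(z_i)\} \le \varepsilon_T] / N_T$ is the proportion of accepted draws from $N_T$ simulated draws of $\theta$: either (i) $\varepsilon_T = o(v_T^{-1})$ and $(v_T \varepsilon_T)^{-k_\eta} \varepsilon_T^{-k_\theta} \le M N_T$, or (ii) $\varepsilon_T \gtrsim v_T^{-1}$ and $\varepsilon_T^{-k_\theta} \le M N_T$, for $M$ large enough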
95. 95. conclusion on ABC consistency asymptotic description of ABC: different regimes depending on $\varepsilon_n$ & $\sigma_n$; no point in choosing $\varepsilon_n$ arbitrarily small: just $\varepsilon_n = o(\sigma_n)$; no asymptotic gain in iterative ABC; results under weak(er) conditions by not studying $g(\eta(z) \mid \theta)$
96. 96. Mis-ABC-fied Motivating examples; Approximate Bayesian computation; ABC for model choice; Asymptotics of ABC; ABC under misspecification: Misspecification, Consequences
97. 97. Illustration Assumed data generating process (DGP) is $z \sim N(\theta, 1)$ but actual DGP is $y \sim N(\theta, \tilde\sigma^2)$. Use of summaries: sample mean $\eta_1(y) = \frac{1}{n}\sum_{i=1}^n y_i$; centered summary $\eta_2(y) = \frac{1}{n-1}\sum_{i=1}^n (y_i - \eta_1(y))^2 - 1$. Three ABC versions: ABC-AR: accept/reject approach with $K_\varepsilon(d\{\eta(z), \eta(y)\}) = \mathbb{1}[d\{\eta(z), \eta(y)\} \le \varepsilon]$ and $d\{x, y\} = \|x - y\|$; ABC-K: smooth rejection approach, with $K_\varepsilon(d\{\eta(z), \eta(y)\})$ a univariate Gaussian kernel; ABC-Reg: post-processing ABC approach with weighted linear regression adjustment
98. 98. Illustration Posterior means for ABC-AR, ABC-K and ABC-Reg as misspecification $\tilde\sigma^2$ increases ($N = 50{,}000$ simulated data sets); $\alpha_n = n^{-5/9}$ quantile for ABC-AR; ABC-K and ABC-Reg bandwidth of $n^{-5/9}$
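A rough Python sketch of the ABC-AR part of this experiment follows. Several ingredients are assumptions made for illustration only: the N(0, 5²) prior, the true mean of 1, the sample size n = 100, the grid of variances, and a much smaller number of simulations than the 50,000 used in the talk. The helper names `summaries` and `abc_ar` are hypothetical. The two summaries and the $n^{-5/9}$ quantile are taken from the slides.

```python
import numpy as np

def summaries(x):
    """Row-wise (eta_1, eta_2): sample mean and sample variance minus one."""
    x = np.atleast_2d(x)
    return np.column_stack([x.mean(axis=1), x.var(axis=1, ddof=1) - 1.0])

def abc_ar(y, n_sim, rng, prior_sd=5.0):
    """Accept/reject ABC under the assumed N(theta, 1) model.

    Returns the ABC posterior mean of theta and the tolerance implied by
    the alpha_n = n^(-5/9) quantile of the simulated distances.
    """
    n = y.size
    eta_y = summaries(y)[0]
    thetas = rng.normal(0.0, prior_sd, size=n_sim)            # assumed prior
    z = rng.normal(thetas[:, None], 1.0, size=(n_sim, n))     # pseudo-data from assumed DGP
    dist = np.linalg.norm(summaries(z) - eta_y, axis=1)
    eps = np.quantile(dist, n ** (-5 / 9))
    return thetas[dist <= eps].mean(), eps

rng = np.random.default_rng(1)
n, theta_true = 100, 1.0                                      # illustrative values
results = {}
for sigma2 in (1.0, 2.0, 4.0):
    y = rng.normal(theta_true, np.sqrt(sigma2), size=n)       # actual DGP
    results[sigma2] = abc_ar(y, n_sim=5_000, rng=rng)
    print(sigma2, results[sigma2])
```

Because the assumed model cannot reproduce a variance different from one, the second summary keeps simulated and observed summaries apart, so the quantile-based tolerance grows with $\tilde\sigma^2$.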
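99. 99. Framework Data $y$ with true distribution $P_0$; model $\mathcal{P} := \{\theta \in \Theta \subset \mathbb{R}^{k_\theta} : P_\theta\}$; summary statistic $\eta(y) = (\eta_1(y), \ldots, \eta_{k_\eta}(y))$. Misspecification: $\inf_{\theta \in \Theta} D(P_0 \| P_\theta) = \inf_{\theta \in \Theta} -\int \log\left\{\frac{dP_\theta(y)}{dP_0(y)}\right\} dP_0(y) > 0$, with $\theta^* = \arg\inf_{\theta \in \Theta} D(P_0 \| P_\theta)$ [Müller, 2013]. ABC misspecification: for $b_0$ (resp. $b(\theta)$) the limit of $\eta(y)$ (resp. $\eta(z)$), $\inf_{\theta \in \Theta} d\{b_0, b(\theta)\} > 0$. ABC pseudo-true value: $\theta^* = \arg\inf_{\theta \in \Theta} d\{b_0, b(\theta)\}$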
100. 100. Compulsory tolerance Under standard identification conditions on $b(\cdot) \in \mathbb{R}^{k_\eta}$, there exists $\varepsilon^*$ such that $\varepsilon^* = \inf_{\theta \in \Theta} d\{b_0, b(\theta)\} > 0$. For tolerances $\varepsilon_n = o(1)$, once $\varepsilon_n < \varepsilon^*$ no draw of $\theta$ is selected and the posterior $\Pi_\varepsilon[A \mid \eta(y)]$ is ill-conditioned. But an appropriately chosen tolerance sequence $(\varepsilon_n)_n$ allows the ABC-based posterior to concentrate on the distance-dependent pseudo-true value $\theta^*$
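For the Gaussian illustration of slide 97, the compulsory tolerance can be worked out by hand. The derivation below is a sketch assuming the Euclidean distance, a parameter space containing the true mean, and writing $\theta_0$ for the mean of the actual DGP (a symbol introduced here for convenience):

```latex
\begin{align*}
b(\theta) &= \lim_{n} \eta(z) = (\theta,\ 0), \qquad
b_0 = \lim_{n} \eta(y) = (\theta_0,\ \tilde\sigma^2 - 1), \\
d\{b_0, b(\theta)\} &= \sqrt{(\theta - \theta_0)^2 + (\tilde\sigma^2 - 1)^2}, \\
\varepsilon^* &= \inf_{\theta \in \Theta} d\{b_0, b(\theta)\} = |\tilde\sigma^2 - 1|, \qquad
\theta^* = \theta_0 .
\end{align*}
```

So $\varepsilon^* > 0$ as soon as $\tilde\sigma^2 \neq 1$, and in this example the pseudo-true value still coincides with the true mean.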
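102. 102. ABC posterior concentration under misspecification Assumptions (A0) There exist continuous $b : \Theta \to \mathcal{B}$ and decreasing $\rho_n(\cdot)$ such that $\rho_n(u) \to 0$ at infinity and $P_\theta[d\{\eta(z), b(\theta)\} > u] \le c(\theta)\rho_n(u)$, $\int_\Theta c(\theta)\,d\Pi(\theta) < +\infty$, with existence of either (i) Polynomial deviations: positive sequence $v_n \to +\infty$ and $u_0, \kappa > 0$ such that $\rho_n(u) = v_n^{-\kappa} u^{-\kappa}$, for $u \le u_0$; or (ii) Exponential deviations: $h_\theta(\cdot) > 0$ such that $P_\theta[d\{\eta(z), b(\theta)\} > u] \le c(\theta)e^{-h_\theta(u v_n)}$ and there exist $c, C > 0$ such that $\int_\Theta c(\theta)e^{-h_\theta(u v_n)}\,d\Pi(\theta) \le C e^{-c(u v_n)^\tau}$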
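103. 103. ABC posterior concentration under misspecification Assumptions (A1) There exist $D > 0$ and $M_0, \delta_0 > 0$ such that, for all $\delta_0 \ge \delta > 0$ and $M \ge M_0$, there exists $S_\delta \subset \{\theta \in \Theta : d\{b(\theta), b_0\} - \varepsilon^* \le \delta\}$ for which: in case (i), $D < \kappa$ and $\int_{S_\delta} \{1 - c(\theta)/M\}\,d\Pi(\theta) \gtrsim \delta^D$; in case (ii), $\int_{S_\delta} \{1 - c(\theta)e^{-h_\theta(M)}\}\,d\Pi(\theta) \gtrsim \delta^D$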
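104. 104. ABC posterior concentration under misspecification Under (A0) and (A1), for $\varepsilon_n \downarrow \varepsilon^*$ with $\varepsilon_n \ge \varepsilon^* + M v_n^{-1} + v_{0,n}^{-1}$, if $M_n$ is a sequence going to infinity and $\delta_n \ge M_n\{(\varepsilon_n - \varepsilon^*) + \rho_n(\varepsilon_n - \varepsilon^*)\}$, then $\Pi_\varepsilon[d\{b(\theta), b_0\} \ge \varepsilon^* + \delta_n \mid \eta(y)] = o_{P_0}(1)$, provided $\rho_n(\varepsilon_n - \varepsilon^*) \lesssim (\varepsilon_n - \varepsilon^*)^{-D/\kappa}$ in case (i), $\rho_n(\varepsilon_n - \varepsilon^*) \lesssim |\log(\varepsilon_n - \varepsilon^*)|^{1/\tau}$ in case (ii). Further $\Pi_\varepsilon[d\{\theta, \theta^*\} > \delta \mid \eta(y)] = o_{P_0}(1)$ [Bernton & al., 2019; Frazier & al., 2018]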
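106. 106. ABC asymptotic posterior Under further assumptions: existence of a positive definite matrix $\Sigma_n(\theta_0)$ such that for all $\|\theta - \theta_0\| \le \delta$ and all $0 < u \le \delta v_n$, $P_\theta[\|\Sigma_n(\theta_0)\{\eta(z) - b(\theta)\}\| > u] \le c_0 u^{-\kappa}$; parametric CLT: existence of a sequence of positive definite matrices $\Sigma_n(\theta)$ such that for all $\theta$ in a neighbourhood of $\theta_0$, $\|\Sigma_n(\theta)\|_* \asymp v_n$, with $v_n \to +\infty$ and $\Sigma_n(\theta)\{\eta(z) - b(\theta)\} \Rightarrow N(0, I_{k_\eta})$; classical concentration assumptions. If $\lim_n v_n(\varepsilon_n - \varepsilon^*) = 0$, then for $Z_n^0 = \Sigma_n(\theta^*)\{\eta(y) - b_0\}$ and $\Phi_{k_\eta}(\cdot)$ standard Normal, $\lim_{n \to +\infty} \Pi_\varepsilon[\Sigma_n(\theta^*)\{b(\theta) - b(\theta^*)\} - Z_n^0 \in B \mid \eta(y)] = \Phi_{k_\eta}(B)$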
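107. 107. Accept/Reject ABC revisited Given the link between $\varepsilon_n$ and $\alpha_n$ in Frazier & al. (2019), posterior concentration is achieved through a sequence of $\alpha_n$ converging to zero: an ABC algorithm based on an appropriate $\alpha_n$ always yields (under appropriate regularity) an approach that concentrates asymptotically, regardless of misspecification. (i) if $(\varepsilon_n - \varepsilon^*) \asymp v_n^{-1}$ or $(\varepsilon_n - \varepsilon^*) = o(v_n^{-1})$, then $\alpha_n = \Pr(\|\eta(z) - \eta(y)\| \le \varepsilon_n) \asymp (v_n\{\varepsilon_n - \varepsilon^*\})^{k_\eta} \times v_n^{-k_\theta} \lesssim v_n^{-k_\theta}$; (ii) if $(\varepsilon_n - \varepsilon^*) \gg v_n^{-1}$, then $\alpha_n = \Pr(\|\eta(z) - \eta(y)\| \le \varepsilon_n) \asymp \{\varepsilon_n - \varepsilon^*\}^{k_\theta} \gg v_n^{-k_\theta}$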
108. 108. Regression adjustment In case of scalar $\theta$, ABC-Reg: runs ABC-AR, with tolerance $\varepsilon_n$, and obtains selected draws and summaries $\{\theta_i, \eta(z_i)\}$; uses linear regression to predict accepted values of $\theta$ from $\eta(z)$ through $\theta_i = \alpha + \beta^\top\{\eta(y) - \eta(z_i)\} + \nu_i$; defines the adjusted parameter draw as $\tilde\theta_i = \theta_i + \hat\beta^\top\{\eta(y) - \eta(z_i)\}$
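The adjustment step can be sketched as below. This is a minimal, hedged version: it uses unweighted least squares (the talk mentions a weighted regression), the function name `regression_adjust` and the synthetic accepted draws are hypothetical, and the regressor is taken as $\eta(z_i) - \eta(y)$, a sign convention assumed here so that the fitted slope yields the adjusted draw $\tilde\theta_i = \theta_i + \hat\beta^\top\{\eta(y) - \eta(z_i)\}$ stated on the slide.

```python
import numpy as np

def regression_adjust(thetas, eta_sim, eta_obs):
    """Linear regression adjustment of accepted ABC draws (scalar theta).

    Fits theta_i = a + b'(eta(z_i) - eta(y)) + noise by ordinary least
    squares, then returns theta_i + b_hat'(eta(y) - eta(z_i)) and b_hat.
    """
    thetas = np.asarray(thetas, dtype=float)
    eta_sim = np.asarray(eta_sim, dtype=float).reshape(len(thetas), -1)
    diff = eta_sim - np.asarray(eta_obs, dtype=float)       # eta(z_i) - eta(y)
    design = np.column_stack([np.ones(len(thetas)), diff])  # intercept + summaries
    coef, *_ = np.linalg.lstsq(design, thetas, rcond=None)
    beta_hat = coef[1:]
    return thetas - diff @ beta_hat, beta_hat

# Synthetic "accepted" draws with a known linear dependence (illustration only).
rng = np.random.default_rng(2)
eta_obs = np.array([0.3, -0.1])
eta_sim = eta_obs + rng.normal(0.0, 0.2, size=(500, 2))
beta_true = np.array([1.5, -0.5])
thetas = 2.0 + (eta_sim - eta_obs) @ beta_true + rng.normal(0.0, 0.05, size=500)
adjusted, beta_hat = regression_adjust(thetas, eta_sim, eta_obs)
print(beta_hat, thetas.std(), adjusted.std())
```

In this synthetic setting the adjusted draws are much less dispersed than the raw ones, since the part of the spread explained by the distance between simulated and observed summaries has been removed.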
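109. 109. ABC-Reg pseudo-consistency If $\varepsilon_n \ge \varepsilon^* + M v_n^{-1} + v_{0,n}^{-1}$, with (i) $\varepsilon^* = \inf_{\theta \in \Theta} d\{b(\theta), b_0\} > 0$; (ii) $\theta^* = \arg\inf_{\theta \in \Theta} d\{b(\theta), b_0\}$ exists; (iii) for some $\beta_0$ with $\|\beta_0\| > 0$, $\|\hat\beta - \beta_0\| = o_{P_\theta}(1)$; then $\Pi_\varepsilon[|\tilde\theta - \tilde\theta^*| > \delta \mid \eta(y)] = o_{P_0}(1)$ [Frazier & al., 2019] The End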