1. (Approximate) Bayesian Computation as a new
empirical Bayes method
Christian P. Robert
Université Paris-Dauphine & CREST, Paris
Joint work with K.L. Mengersen and P. Pudlo
ABC in Roma, 30 May 2013
2. meeting of potential interest?!
MCMSki IV to be held in Chamonix Mont Blanc, France, from Monday, Jan. 6 to Wednesday, Jan. 8, 2014
All aspects of MCMC++ theory, methodology, applications, and more!
Website: http://www.pages.drexel.edu/~mwl25/mcmski/
4. Intractable likelihood
Case of a well-defined statistical model where the likelihood function
ℓ(θ|y) = f(y1, . . . , yn|θ)
is (really!) not available in closed form
cannot (easily!) be either completed or demarginalised
cannot be (at all!) estimated by an unbiased estimator
which prohibits direct implementation of a generic MCMC algorithm like Metropolis–Hastings
6. The abc alternative
Case of a well-defined statistical model where the likelihood function
ℓ(θ|y) = f(y1, . . . , yn|θ)
is out of reach
Empirical approximations to the original B problem
Degrading the data precision down to a tolerance ε
Replacing the likelihood with a non-parametric approximation
Summarising/replacing the data with insufficient statistics
9. Different [levels of] worries about abc
Impact on B inference
a mere computational issue (that will eventually be solved by more powerful computers, etc., even if too costly in the short term, as for regular Monte Carlo methods) Not!
an inferential issue (opening opportunities for a new inference machine on complex/Big Data models, with a legitimacy differing from the classical B approach)
a Bayesian conundrum (how closely related to the/a B approach? more than recycling B tools? true B with raw data? a new type of empirical Bayes inference?)
12. summary: reply to Lange et al. [ISR, 2013]
“Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization.”
sad reality constraint that size does matter
focus on much smaller dimensions and on sparse summaries
many (fast) ways of producing those summaries
Bayesian inference can kick in almost automatically at this stage
14. Econom’ections
Similar exploration of simulation-based and approximation techniques in Econometrics
Simulated method of moments
Method of simulated moments
Simulated pseudo-maximum-likelihood
Indirect inference
[Gouriéroux & Monfort, 1996]
even though the motivation is partially defined models rather than complex likelihoods
16. Indirect inference
Minimise [in θ] a distance between estimators β̂ based on a pseudo-model for genuine observations and for observations simulated under the true model and the parameter θ.
[Gouriéroux, Monfort, & Renault, 1993; Smith, 1993; Gallant & Tauchen, 1996]
17. Indirect inference (PML vs. PSE)
Example of the pseudo-maximum-likelihood (PML)
β̂(y) = arg max_β Σ_t log f(yt|β, y1:(t−1))
leading to
arg min_θ ||β̂(yᵒ) − β̂(y1(θ), . . . , yS(θ))||²
when
ys(θ) ∼ f(y|θ), s = 1, . . . , S
18. Indirect inference (PML vs. PSE)
Example of the pseudo-score-estimator (PSE)
β̂(y) = arg min_β || Σ_t ∂ log f/∂β (yt|β, y1:(t−1)) ||²
leading to
arg min_θ ||β̂(yᵒ) − β̂(y1(θ), . . . , yS(θ))||²
when
ys(θ) ∼ f(y|θ), s = 1, . . . , S
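A toy sketch of this scheme may help fix ideas (my own illustration, not from the talk, and every model choice here is a stand-in): the "true" model is an MA(1) that we only simulate from, and the auxiliary pseudo-model is an AR(1) fitted by least squares, whose coefficient plays the role of β̂ above.

```python
# Minimal indirect-inference sketch: match the auxiliary AR(1) least-squares
# coefficient between observed data and data simulated under candidate thetas.
import numpy as np

rng = np.random.default_rng(0)

def simulate_ma1(theta, n):
    e = rng.normal(size=n + 1)
    return e[1:] + theta * e[:-1]          # y_t = e_t + theta * e_{t-1}

def beta_hat(y):
    # least-squares AR(1) coefficient: the auxiliary (pseudo-model) estimator
    return (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

y_obs = simulate_ma1(0.5, 1000)            # stand-in for the genuine observations
b_obs = beta_hat(y_obs)

# arg min_theta || beta_hat(y_obs) - beta_hat(y_1(theta), ..., y_S(theta)) ||^2,
# approximated here by a crude grid search with S replicates per theta
S, grid = 20, np.linspace(0.0, 0.9, 91)
crit = [(b_obs - np.mean([beta_hat(simulate_ma1(t, 1000)) for _ in range(S)])) ** 2
        for t in grid]
print("indirect-inference estimate of theta:", grid[int(np.argmin(crit))])
```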
19. Consistent indirect inference
“...in order to get a unique solution the dimension of the auxiliary parameter β must be larger than or equal to the dimension of the initial parameter θ. If the problem is just identified the different methods become easier...”
Consistency depends on the criterion and on the asymptotic identifiability of θ
[Gouriéroux & Monfort, 1996, p. 66]
Which connection [if any] with the B perspective?
22. Approximate Bayesian computation
Intractable likelihoods
ABC methods
Genesis of ABC
ABC basics
Advances and interpretations
ABC as knn
ABC as an inference machine
[A]BCel
Conclusion and perspectives
23. Genetic background of ABC
ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ)
This technique stemmed from population genetics models about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC.
[Griffith & al., 1997; Tavaré & al., 1999]
24. Demo-genetic inference
Each model is characterized by a set of parameters θ that cover historical (divergence times, admixture times, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rates, ...) factors
The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time
Problem: most of the time, we cannot calculate the likelihood of the polymorphism data f(y|θ)...
29. Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium
Sample of 8 genes; observations: the leaves of the tree; θ̂ = ?
Kingman’s genealogy: when the time axis is normalized, T(k) ∼ Exp(k(k − 1)/2)
Mutations according to the Simple stepwise Mutation Model (SMM)
• dates of the mutations ∼ Poisson process with intensity θ/2 over the branches
• MRCA = 100
• independent mutations: ±1 with probability 1/2
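A minimal sketch of this generative model (my own illustration, not code from the talk): simulate a Kingman coalescent for n genes, drop SMM mutations on each branch segment as a Poisson process of intensity θ/2, and read off the allelic states at the leaves. Since ±1 steps are symmetric, accumulating steps from tip to root gives the same distribution as from root to tip.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_locus(n=8, theta=2.0, mrca=100):
    lineages = [{i} for i in range(n)]        # tip sets carried by each active lineage
    shift = np.zeros(n, dtype=int)            # accumulated mutation steps per tip
    while len(lineages) > 1:
        k = len(lineages)
        dt = rng.exponential(2.0 / (k * (k - 1)))      # T(k) ~ Exp(k(k-1)/2)
        for tips in lineages:
            m = rng.poisson(theta / 2 * dt)            # mutations on this branch segment
            step = rng.choice([-1, 1], size=m).sum()   # each mutation is +/-1 w.p. 1/2
            for i in tips:
                shift[i] += step
        a, b = rng.choice(k, size=2, replace=False)    # the coalescing pair
        merged = lineages[a] | lineages[b]
        lineages = [l for j, l in enumerate(lineages) if j not in (a, b)]
        lineages.append(merged)
    return mrca + shift                                # allelic states at the leaves

print(simulate_locus())   # e.g. 8 microsatellite allele sizes around 100
```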
30. Much more interesting models. . .
several independent loci
Independent gene genealogies and mutations
different populations
linked by an evolutionary scenario made of divergences, admixtures, migrations between populations, etc.
larger sample size
usually between 50 and 100 genes
A typical evolutionary scenario: [tree diagram with MRCA at the root, populations POP 0, POP 1, POP 2 at the leaves, and divergence times τ1, τ2]
31. Intractable likelihood
Missing (too much missing!) data structure:
f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG
cannot be computed in a manageable way...
[Stephens & Donnelly, 2000]
The genealogies are considered as nuisance parameters.
This modelling clearly differs from the phylogenetic perspective, where the tree is the parameter of interest.
33. A?B?C?
A stands for approximate [wrong likelihood / pic?]
B stands for Bayesian
C stands for computation [producing a parameter sample]
36. How Bayesian is aBc?
Could we turn the resolution into a Bayesian answer?
ideally so (not meaningful: requires an ∞-ly powerful computer)
approximation error unknown (w/o costly simulation)
true Bayes for the wrong model (formal and artificial)
true Bayes for a noisy model (much more convincing)
true Bayes for an estimated likelihood (back to econometrics?)
illuminating the tension between information and precision
37. ABC methodology
Bayesian setting: target is π(θ)f(x|θ)
When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:
Foundation
For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating
θ′ ∼ π(θ), z ∼ f(z|θ′),
until the auxiliary variable z is equal to the observed value, z = y, then the selected
θ′ ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
40. A as A...pproximative
When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone
ρ(y, z) ≤ ε
where ρ is a distance
Output distributed from
π(θ) Pθ{ρ(y, z) < ε}, proportional (by definition) to π(θ | ρ(y, z) < ε)
[Pritchard et al., 1999]
42. ABC algorithm
In most implementations, a further degree of A...pproximation:
Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ′
  end for
where η(y) defines a (not necessarily sufficient) statistic
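A minimal sketch of Algorithm 1 (illustration only: the model, prior, summary statistic and tolerance below are all stand-ins), here for the mean of a Normal sample with the sample mean as η.

```python
import numpy as np

rng = np.random.default_rng(2)

def prior():                      # pi(.)
    return rng.normal(0.0, 10.0)

def simulate(theta, n):           # f(.|theta)
    return rng.normal(theta, 1.0, size=n)

def eta(x):                       # summary statistic (sufficient here, in general not)
    return x.mean()

def abc_rejection(y, N=1000, eps=0.1):
    s_obs, n = eta(y), len(y)
    sample = []
    for _ in range(N):
        while True:               # repeat ... until rho{eta(z), eta(y)} <= eps
            theta = prior()
            if abs(eta(simulate(theta, n)) - s_obs) <= eps:
                break
        sample.append(theta)
    return np.array(sample)

y = rng.normal(1.5, 1.0, size=50)
post = abc_rejection(y)
print(post.mean(), post.std())    # crude approximation of the posterior mean/sd
```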
43. Output
The likelihood-free algorithm samples from the marginal in z of:
πε(θ, z|y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ,
where Aε,y = {z ∈ D : ρ(η(z), η(y)) < ε}.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:
πε(θ|y) = ∫ πε(θ, z|y) dz ≈ π(θ|y).
...does it?!
46. Output
The likelihood-free algorithm samples from the marginal in z of:
πε(θ, z|y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ,
where Aε,y = {z ∈ D : ρ(η(z), η(y)) < ε}.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the restricted posterior distribution:
πε(θ|y) = ∫ πε(θ, z|y) dz ≈ π(θ|η(y)).
Not so good..!
47. Convergence of ABC
What happens when ε → 0?
For B ⊂ Θ, we have
∫_B [ ∫_{Aε,y} f(z|θ) dz / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ ] π(θ) dθ
  = ∫_{Aε,y} [ ∫_B f(z|θ) π(θ) dθ / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ ] dz
  = ∫_{Aε,y} [ ∫_B f(z|θ) π(θ) dθ / m(z) ] [ m(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ ] dz
  = ∫_{Aε,y} π(B|z) [ m(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ ] dz
which indicates convergence for a continuous π(B|z).
49. Convergence (do not attempt!)
...and the above does not apply to insufficient statistics:
If η(y) is not a sufficient statistic, the best one can hope for is
π(θ|η(y)), not π(θ|y)
If η(y) is an ancillary statistic, all the information contained in y is lost! The “best” one can “hope” for is
π(θ|η(y)) = π(θ)
52. Comments
Role of the distance is paramount (because ε ≠ 0)
Scaling of the components of η(y) is also determinant
ε matters little if “small enough”
representative of the “curse of dimensionality”
small is beautiful!
the data as a whole may be paradoxically weakly informative for ABC
53. ABC (simul’) advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y...
[Marjoram et al., 2003; Bortot et al., 2007; Sisson et al., 2007]
...or view the problem as conditional density estimation and develop techniques to allow for a larger ε
[Beaumont et al., 2002]
...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
57. ABC-NP
Better usage of [prior] simulations by adjustment: instead of throwing away θ′ such that ρ(η(z), η(y)) > ε, replace the θ’s with locally regressed transforms
θ* = θ − {η(z) − η(y)}ᵀ β̂
[Csilléry et al., TEE, 2010]
where β̂ is obtained by [NP] weighted least squares regression on (η(z) − η(y)) with weights
Kδ{ρ(η(z), η(y))}
[Beaumont et al., 2002, Genetics]
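A minimal sketch of this local-linear adjustment (my illustration, not the authors' code; an Epanechnikov kernel is assumed for Kδ): regress θ on η(z) − η(y) by weighted least squares, then shift the accepted draws.

```python
import numpy as np

def abc_regression_adjust(thetas, etas, eta_obs, delta):
    """thetas: (N,) draws; etas: (N, d) simulated summaries; eta_obs: (d,) observed."""
    diff = etas - eta_obs                                    # eta(z) - eta(y)
    rho = np.linalg.norm(diff, axis=1)
    w = np.where(rho < delta, 1 - (rho / delta) ** 2, 0.0)   # K_delta weights
    keep = w > 0
    X = np.column_stack([np.ones(keep.sum()), diff[keep]])   # intercept + regressors
    W = np.diag(w[keep])
    coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ thetas[keep])  # weighted LS
    theta_star = thetas[keep] - diff[keep] @ coef[1:]        # theta* = theta - diff @ beta
    return theta_star, w[keep]
```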
58. ABC-NP (regression)
Also found in the subsequent literature, e.g. in Fearnhead & Prangle (2012):
weight the simulations directly by
Kδ{ρ(η(z(θ)), η(y))}
or
(1/S) Σ_{s=1}^S Kδ{ρ(η(zs(θ)), η(y))}
[consistent estimate of f(η|θ)]
Curse of dimensionality: poor estimate when d = dim(η) is large...
60. ABC-NP (density estimation)
Use of the kernel weights
Kδ{ρ(η(z(θ)), η(y))}
leads to the NP estimate of the posterior expectation
Σi θi Kδ{ρ(η(z(θi)), η(y))} / Σi Kδ{ρ(η(z(θi)), η(y))}
[Blum, JASA, 2010]
61. ABC-NP (density estimation)
Use of the kernel weights
Kδ{ρ(η(z(θ)), η(y))}
leads to the NP estimate of the posterior conditional density
Σi K̃b(θi − θ) Kδ{ρ(η(z(θi)), η(y))} / Σi Kδ{ρ(η(z(θi)), η(y))}
[Blum, JASA, 2010]
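A small sketch of the two kernel estimates above (my illustration; an Epanechnikov Kδ and a Gaussian K̃b are assumed): the Nadaraya-Watson posterior expectation and the weighted kernel density estimate.

```python
import numpy as np

def abc_kernel_estimates(thetas, etas, eta_obs, delta, b, grid):
    rho = np.linalg.norm(etas - eta_obs, axis=1)
    w = np.where(rho < delta, 1 - (rho / delta) ** 2, 0.0)   # K_delta weights
    post_mean = np.sum(w * thetas) / np.sum(w)                # sum_i theta_i K / sum_i K
    # weighted Gaussian KDE for the conditional density, bandwidth b (K-tilde_b)
    dens = np.array([np.sum(w * np.exp(-0.5 * ((thetas - t) / b) ** 2))
                     for t in grid]) / (np.sum(w) * b * np.sqrt(2 * np.pi))
    return post_mean, dens
```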
62. ABC-NP (density estimations)
Other versions incorporating regression adjustments
Σi K̃b(θ*i − θ) Kδ{ρ(η(z(θi)), η(y))} / Σi Kδ{ρ(η(z(θi)), η(y))}
In all cases, error
E[ĝ(θ|y)] − g(θ|y) = cb² + cδ² + OP(b² + δ²) + OP(1/nδ^d)
var(ĝ(θ|y)) = (c/nbδ^d)(1 + oP(1))
[Blum, JASA, 2010; standard NP calculations]
65. ABC as knn
see and listen to Gérard’s talk
[Biau et al., 2013, Annales de l’IHP]
Practice of ABC: determine the tolerance as a quantile on the observed distances, say the 10% or 1% quantile:
ε = εN = qα(d1, . . . , dN)
Interpretation of ε as a nonparametric bandwidth is only an approximation of the actual practice
[Blum & François, 2010]
ABC is a k-nearest neighbour (knn) method with kN = NεN
[Loftsgaarden & Quesenberry, 1965]
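A short sketch of this practice (illustration only): run N simulations, set ε to the α-quantile of the distances di = ρ(η(zi), η(y)), and keep the kN ≈ αN nearest draws.

```python
import numpy as np

def abc_knn(thetas, etas, eta_obs, alpha=0.01):
    d = np.linalg.norm(etas - eta_obs, axis=1)   # d_1, ..., d_N
    eps = np.quantile(d, alpha)                  # eps_N = q_alpha(d_1, ..., d_N)
    return thetas[d <= eps], eps                 # roughly alpha * N accepted draws
```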
68. ABC consistency
Provided
kN / log log N → ∞ and kN / N → 0
as N → ∞, for almost all s0 (with respect to the distribution of S), with probability 1,
(1/kN) Σ_{j=1}^{kN} ϕ(θj) → E[ϕ(θj)|S = s0]
[Devroye, 1982]
Biau et al. (2012) also recall pointwise and integrated mean square error consistency results on the corresponding kernel estimate of the conditional posterior distribution, under the constraints
kN → ∞, kN/N → 0, hN → 0 and h^p_N kN → ∞
70. Rates of convergence
Further assumptions (on target and kernel) allow for precise (integrated mean square) convergence rates (as a power of the sample size N), derived from classical k-nearest neighbour regression, like
when m = 1, 2, 3, kN ≈ N^{(p+4)/(p+8)} and rate N^{−4/(p+8)}
when m = 4, kN ≈ N^{(p+4)/(p+8)} and rate N^{−4/(p+8)} log N
when m > 4, kN ≈ N^{(p+4)/(m+p+4)} and rate N^{−4/(m+p+4)}
[Biau et al., 2012, arXiv:1207.6461]
Drag: only applies to sufficient summary statistics
72. ABC inference machine
Intractable likelihoods
ABC methods
ABC as an inference machine
Error inc.
Exact BC and approximate
targets
summary statistic
[A]BCel
Conclusion and perspectives
73. How Bayesian is aBc..?
may be a convergent method of inference (meaningful? sufficient? foreign?)
approximation error unknown (w/o massive simulation)
pragmatic/empirical B (there is no other solution!)
many calibration issues (tolerance, distance, statistics)
the NP side should be incorporated into the whole B picture
the approximation error should also be part of the B inference
74. ABCµ
Idea: infer about the error ε as well as about the parameter θ:
Use of a joint density
f(θ, ε|y) ∝ ξ(ε|y, θ) × πθ(θ) × πε(ε)
where y is the data, and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ)
Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation.
[Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
77. ABCµ details
Multidimensional distances ρk (k = 1, . . . , K) and errors εk = ρk(ηk(z), ηk(y)), with
εk ∼ ξk(ε|y, θ) ≈ ξ̂k(ε|y, θ) = (1/Bhk) Σb K[{εk − ρk(ηk(zb), ηk(y))}/hk]
then used in replacing ξ(ε|y, θ) with mink ξ̂k(ε|y, θ)
ABCµ involves the acceptance probability
[π(θ′, ε′)/π(θ, ε)] × [q(θ′, θ)q(ε′, ε)/q(θ, θ′)q(ε, ε′)] × [mink ξ̂k(ε′|y, θ′) / mink ξ̂k(ε|y, θ)]
79. Wilkinson’s exact BC (not exactly!)
ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, the convolution of the true posterior with a kernel function
πε(θ, z|y) = π(θ) f(z|θ) Kε(y − z) / ∫ π(θ) f(z|θ) Kε(y − z) dz dθ,
with Kε a kernel parameterised by the bandwidth ε.
[Wilkinson, 2013]
Theorem
The ABC algorithm based on the assumption of a randomised observation y = ỹ + ξ, ξ ∼ Kε, and an acceptance probability of
Kε(y − z)/M
gives draws from the posterior distribution π(θ|y).
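A minimal sketch of this accept/reject step (illustration under stated assumptions: a Gaussian kernel Kε and scalar data, so M = sup Kε is available in closed form).

```python
import numpy as np

rng = np.random.default_rng(3)

def exact_bc(y, prior, simulate, eps, N=1000):
    M = 1.0 / (eps * np.sqrt(2 * np.pi))          # sup of the N(0, eps^2) density
    out = []
    while len(out) < N:
        theta = prior()
        z = simulate(theta)
        k = np.exp(-0.5 * ((y - z) / eps) ** 2) / (eps * np.sqrt(2 * np.pi))
        if rng.uniform() < k / M:                 # accept w.p. K_eps(y - z)/M
            out.append(theta)
    return np.array(out)

# usage sketch: posterior for a N(theta, 1) mean from one observation y = 1.2
post = exact_bc(1.2, prior=lambda: rng.normal(0, 5),
                simulate=lambda th: rng.normal(th, 1), eps=0.2)
print(post.mean())
```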
81. How exact a BC?
Pros
Pseudo-data from the true model vs. observed data from a noisy model
Outcome is completely controlled
Link with ABCµ when assuming y is observed with measurement error of density Kε
Relates to the theory of model approximation
[Kennedy & O’Hagan, 2001]
leads to “noisy ABC”: just do it!, i.e. perturb the data down to precision ε and proceed
[Fearnhead & Prangle, 2012]
Cons
Requires Kε to be bounded by M
True approximation error never assessed
83. Noisy ABC for HMMs
Specific case of hidden Markov model summaries
Xt+1 ∼ Qθ(Xt, ·)
Yt+1 ∼ gθ(·|xt)
where only y⁰_{1:n} is observed.
[Dean, Singh, Jasra, & Peters, 2011]
Use of specific constraints, adapted to the Markov structure:
y1 ∈ B(y⁰1, ε) × · · · × yn ∈ B(y⁰n, ε)
85. Noisy ABC-MLE
Idea: modify instead the data from the start
(y⁰1 + ζ1, . . . , y⁰n + ζn)
[see Fearnhead & Prangle]
noisy ABC-MLE estimate
arg max_θ Pθ(Y1 ∈ B(y⁰1 + ζ1, ε), . . . , Yn ∈ B(y⁰n + ζn, ε))
[Dean, Singh, Jasra, & Peters, 2011]
86. Consistent noisy ABC-MLE
Degrading the data improves the estimation performances:
Noisy ABC-MLE is asymptotically (in n) consistent
under further assumptions, the noisy ABC-MLE is asymptotically normal
increase in variance of order ε⁻²
likely degradation in precision or computing time due to the lack of summary statistic [curse of dimensionality]
87. Which summary?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic
Loss of statistical information balanced against gain in data roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation
may be imposed for external/practical reasons
may gather several non-B point estimates
can learn about efficient combination
90. Which summary for model choice?
Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
B12^η(y) = ∫ π1(θ1) f1^η(η(y)|θ1) dθ1 / ∫ π2(θ2) f2^η(η(y)|θ2) dθ2,
is either consistent or inconsistent
[X, Cornuet, Marin, & Pillai, 2012]
Consistency only depends on the range of
Ei[η(y)]
under both models
[Marin, Pillai, X, & Rousseau, 2012]
92. Semi-automatic ABC
Fearnhead and Prangle (2012) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal
ABC considered as an inferential method and calibrated as such
randomised (or ‘noisy’) version of the summary statistics
η̃(y) = η(y) + τε
derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic
93. Summary [of F&P/statistics]
optimality of the posterior expectation
E[θ|y]
of the parameter of interest as summary statistic η(y)! [requires iterative process]
use of the standard quadratic loss function
(θ − θ0)ᵀ A (θ − θ0)
recent extension to model choice, optimality of the Bayes factor
B12(y)
[F&P, ISBA 2012, Kyoto]
95. Summary [about summaries]
Choice of summary statistics is paramount for ABC validation/performances
At best, ABC approximates π(· | η(y)) [unavoidable]
Model selection feasible with ABC [with caution!]
For estimation, consistency if {θ; µ(θ) = µ0} = {θ0} when Eθ[η(y)] = µ(θ)
For testing, consistency if {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅
[Marin et al., 2011]
100. Empirical likelihood (EL)
Intractable likelihoods
ABC methods
ABC as an inference machine
[A]BCel
ABC and EL
Composite likelihood
Illustrations
Conclusion and perspectives
101. Empirical likelihood (EL)
Dataset x made of n independent replicates x = (x1, . . . , xn) of a r.v. X ∼ F
Generalized moment condition pseudo-model
EF[h(X, φ)] = 0,
where h is a known function and φ an unknown parameter
Induced empirical likelihood
Lel(φ|x) = max_p Π_{i=1}^n pi
over all p such that 0 ≤ pi ≤ 1, Σi pi = 1, Σi pi h(xi, φ) = 0.
[Owen, 1988, Bio’ka, & Empirical Likelihood, 2001]
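A minimal sketch (illustration only, scalar moment condition) of evaluating Lel(φ|x) via the standard Lagrange-multiplier profile: pi = 1/(n(1 + λhi)), with λ solving Σi hi/(1 + λhi) = 0 (see e.g. Owen, 2001); scipy's root bracketing is assumed for solving in λ.

```python
import numpy as np
from scipy.optimize import brentq

def log_el(x, phi, h=lambda x, phi: x - phi):
    hi = h(x, phi)
    if hi.min() >= 0 or hi.max() <= 0:
        return -np.inf                # 0 outside the convex hull of the h(x_i, phi)
    # lambda must keep all 1 + lambda*h_i > 0, which brackets the root
    lam = brentq(lambda l: np.sum(hi / (1 + l * hi)),
                 -1 / hi.max() + 1e-10, -1 / hi.min() - 1e-10)
    p = 1.0 / (len(hi) * (1 + lam * hi))
    return np.sum(np.log(p))

x = np.random.default_rng(4).normal(0.3, 1.0, size=50)
print(log_el(x, 0.3) - log_el(x, 0.0))   # log EL ratio between two values of phi
```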
103. EL motivation
Estimate of the data distribution under the moment conditions
Σi pi h(xi, φ) = 0
Example
Normal N(θ, σ²) sample x1, . . . , x25
Constraints
Eθ,σ[(x − θ, (x − θ)² − σ²)] = 0
with θ̂ = −.45, σ̂ = 1.12
[figure: empirical-likelihood contours in (θ, σ)]
104. Convergence of EL [3.4]
Theorem 3.4 Let X, Y1, . . . , Yn be independent rv’s with common distribution F0. For θ ∈ Θ and the function h(X, θ) ∈ R^s, let θ0 ∈ Θ be such that
Var(h(Yi, θ0))
is finite and has rank q > 0. If θ0 satisfies
E(h(X, θ0)) = 0,
then
−2 log( Lel(θ0|Y1, . . . , Yn) / n^{−n} ) → χ²(q)
in distribution when n → ∞.
[Owen, 2001]
105. Convergence of EL [3.4]
“...The interesting thing about Theorem 3.4 is what is not there. It includes no conditions to make θ̂ a good estimate of θ0, nor even conditions to ensure a unique value for θ0, nor even that any solution θ0 exists. Theorem 3.4 applies in the just determined, over-determined, and under-determined cases. When we can prove that our estimating equations uniquely define θ0, and provide a consistent estimator θ̂ of it, then confidence regions and tests follow almost automatically through Theorem 3.4.”
[Owen, 2001]
106. Raw [A]BCel sampler
Naïve implementation: act as if EL were an exact likelihood
[Lazar, 2003]
  for i = 1 → N do
    generate φi from the prior distribution π(·)
    set the weight ωi = Lel(φi|xobs)
  end for
  return (φi, ωi), i = 1, . . . , N
Output: weighted sample of size N
Performance evaluated through the effective sample size
ESS = 1 / Σ_{i=1}^N ( ωi / Σ_{j=1}^N ωj )²
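Putting the two pieces together, a sketch of the raw sampler and its ESS diagnostic (illustration only, toy scalar-mean problem with a diffuse Normal prior as a stand-in; the EL profile computation is the same as in the earlier sketch).

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)

def log_el(x, phi):
    h = x - phi                                   # scalar moment condition E[X - phi] = 0
    if h.min() >= 0 or h.max() <= 0:
        return -np.inf                            # 0 outside convex hull: EL is zero
    lam = brentq(lambda l: np.sum(h / (1 + l * h)),
                 -1 / h.max() + 1e-10, -1 / h.min() - 1e-10)
    return -np.sum(np.log(len(x) * (1 + lam * h)))

x_obs = rng.normal(0.3, 1.0, size=50)
phis = rng.normal(0.0, 5.0, size=2000)            # phi_i ~ pi(.)
logw = np.array([log_el(x_obs, p) for p in phis]) # omega_i = L_el(phi_i | x_obs)
w = np.exp(logw - logw.max()); w /= w.sum()       # normalise on the log scale
ess = 1.0 / np.sum(w ** 2)                        # ESS = 1 / sum_i (w_i / sum_j w_j)^2
print("weighted posterior mean ~", np.sum(w * phis), " ESS =", round(ess, 1))
```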
108. AMIS implementation
More advanced algorithms can be adapted to EL: e.g., adaptive multiple importance sampling (AMIS) to speed up computations
[Cornuet et al., 2012]
  for i = 1 to M do
    generate θ1,i ∼ q1(·)
    set ω1,i = Lel(θ1,i|y)
  end for
  for t = 2 to TM do
    compute weighted mean mt and variance Σt
    for i = 1 to M do
      generate θt,i ∼ qt(·) = t3(·|mt, Σt)
      set ωt,i = π(θt,i) Lel(θt,i|y) / Σ_{s=1}^{t−1} qs(θt,i)
    end for
    for r = 1 to t − 1 and i = 1 to M do
      update weights as ωr,i = π(θr,i) Lel(θr,i|y) / Σ_{s=1}^{t−1} qs(θr,i)
    end for
  end for
109. Moment condition in population genetics?
EL does not require a fully defined and often complex (hence debatable) parametric model
Main difficulty
Derive a constraint
EF[h(X, φ)] = 0,
on the parameters of interest φ when X is made of the genotypes of the sample of individuals at a given locus
E.g., in phylogeography, φ is composed of
dates of divergence between populations,
ratios of population sizes,
mutation rates, etc.
None of them are moments of the distribution of the allelic states of the sample
110. Moment condition in population genetics?
Solution: take an h made of pairwise composite scores (whose zero is the pairwise maximum-likelihood estimator)
111. Pairwise composite likelihood
The intra-locus pairwise likelihood
ℓ2(xk|φ) = Π_{i<j} ℓ2(xk^i, xk^j|φ)
with xk^1, . . . , xk^n the allelic states of the gene sample at the k-th locus
The pairwise score function
∇φ log ℓ2(xk|φ) = Σ_{i<j} ∇φ log ℓ2(xk^i, xk^j|φ)
Composite likelihoods are often much narrower than the original likelihood of the model
Safe(r) with EL because we only use the position of its mode
112. Pairwise likelihood: a simple case
Assumptions
sample ⊂ closed, panmictic population at equilibrium
marker: microsatellite
mutation rate: θ/2
If xk^i and xk^j are two genes of the sample, ℓ2(xk^i, xk^j|θ) depends only on δ = xk^i − xk^j:
ℓ2(δ|θ) = (1/√(1 + 2θ)) ρ(θ)^{|δ|}
with
ρ(θ) = θ / (1 + θ + √(1 + 2θ))
Pairwise score function
∂θ log ℓ2(δ|θ) = −1/(1 + 2θ) + |δ| / (θ √(1 + 2θ))
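A quick sketch of this closed form (my illustration), with a finite-difference check that the score formula matches the derivative of the log pairwise likelihood.

```python
import numpy as np

def rho(theta):
    return theta / (1 + theta + np.sqrt(1 + 2 * theta))

def log_l2(delta, theta):
    # log l2(delta|theta) = -0.5*log(1 + 2 theta) + |delta| * log rho(theta)
    return -0.5 * np.log(1 + 2 * theta) + abs(delta) * np.log(rho(theta))

def score(delta, theta):
    return -1 / (1 + 2 * theta) + abs(delta) / (theta * np.sqrt(1 + 2 * theta))

theta, delta, h = 1.7, 3, 1e-6
print(score(delta, theta),
      (log_l2(delta, theta + h) - log_l2(delta, theta - h)) / (2 * h))  # should agree
```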
115. Pairwise likelihood: 2 diverging populations
[tree diagram: MRCA; POP a, POP b; divergence at τ]
Assumptions
τ: divergence date of pop. a and b
θ/2: mutation rate
Let xk^i and xk^j be two genes coming resp. from pop. a and b, and set δ = xk^i − xk^j.
Then
ℓ2(δ|θ, τ) = (e^{−τθ}/√(1 + 2θ)) Σ_{k=−∞}^{+∞} ρ(θ)^{|k|} I_{δ−k}(τθ),
where In(z) is the nth-order modified Bessel function of the first kind
116. Pairwise likelihood: 2 diverging populations
Same assumptions, with δ = xk^i − xk^j
A 2-dim score function
∂τ log ℓ2(δ|θ, τ) = −θ + (θ/2) [ℓ2(δ − 1|θ, τ) + ℓ2(δ + 1|θ, τ)] / ℓ2(δ|θ, τ)
∂θ log ℓ2(δ|θ, τ) = −τ − 1/(1 + 2θ) + q(δ|θ, τ)/ℓ2(δ|θ, τ) + (τ/2) [ℓ2(δ − 1|θ, τ) + ℓ2(δ + 1|θ, τ)] / ℓ2(δ|θ, τ)
where
q(δ|θ, τ) := (e^{−τθ}/√(1 + 2θ)) (ρ′(θ)/ρ(θ)) Σ_{k=−∞}^{∞} |k| ρ(θ)^{|k|} I_{δ−k}(τθ)
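A numerical sketch of ℓ2(δ|θ, τ) above (illustration only), truncating the infinite sum over k and using scipy's modified Bessel function of the first kind; since ℓ2(·|θ, τ) is a probability mass function over δ ∈ ℤ, summing over δ gives a sanity check.

```python
import numpy as np
from scipy.special import iv   # modified Bessel function of the first kind I_n(z)

def rho(theta):
    return theta / (1 + theta + np.sqrt(1 + 2 * theta))

def l2_diverging(delta, theta, tau, kmax=200):
    k = np.arange(-kmax, kmax + 1)                       # truncated sum over k
    s = np.sum(rho(theta) ** np.abs(k) * iv(delta - k, tau * theta))
    return np.exp(-tau * theta) / np.sqrt(1 + 2 * theta) * s

# sanity check: probabilities over delta should sum to ~1
print(sum(l2_diverging(d, 1.0, 0.5) for d in range(-50, 51)))
```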
117. Example: normal posterior
[A]BCel with two constraints
[figure: 15 histograms of θ with ESS values ranging from about 76 to 156]
Sample sizes are 25 (column 1), 50 (column 2) and 75 (column 3) observations
118. Example: normal posterior
[A]BCel with three constraints
[figure: 15 histograms of θ with ESS values ranging from about 135 to 332]
Sample sizes are 25 (column 1), 50 (column 2) and 75 (column 3) observations
119. Example: quantile g-and-k distributions
Distributions defined by a closed-form quantile function F⁻¹(r; θ) equal to
A + B [1 + c (1 − exp(−g z(r))) / (1 + exp(−g z(r)))] (1 + z(r)²)^k z(r)
where z(r) is the rth standard normal quantile
[figure: true and fitted cdf with pointwise 95% credible (shaded) region centered on the fitted cdf, for a dataset of n = 100 observations from the g-and-k distribution, based on M = 10³ simulations of [A]BCel]
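A minimal sketch of this quantile function and of simulation by inversion (illustration only; a g-and-k draw is F⁻¹(U; θ) with U uniform, and c is commonly fixed at 0.8, an assumption here).

```python
import numpy as np
from scipy.stats import norm

def gk_quantile(r, A, B, g, k, c=0.8):
    z = norm.ppf(r)                                    # z(r): rth standard normal quantile
    return A + B * (1 + c * (1 - np.exp(-g * z)) / (1 + np.exp(-g * z))) \
               * (1 + z ** 2) ** k * z

rng = np.random.default_rng(6)
y = gk_quantile(rng.uniform(size=100), A=3.0, B=1.0, g=2.0, k=0.5)   # n = 100 sample
print(np.quantile(y, [0.2, 0.5, 0.8]))   # e.g. quantile-based moment conditions
```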
120. Example: quantile g-and-k distributions
Distributions defined by a closed-form quantile function F⁻¹(r; θ) equal to
A + B [1 + c (1 − exp(−g z(r))) / (1 + exp(−g z(r)))] (1 + z(r)²)^k z(r)
where z(r) is the rth standard normal quantile
[figure: boxplots of the posterior means of A, B, g and k] Variations of posterior means for n = 100 (settings 1 to 5) and n = 500 observations (settings 6 to 10). Moment conditions in the [A]BCel algorithm are quantiles of order (0.2, 0.5, 0.8) (1 and 6), (0.2, 0.4, 0.6, 0.8) (2 and 7), (0.1, 0.4, 0.6, 0.9) (3 and 8), (0.1, 0.25, 0.5, 0.75, 0.9) (4 and 9), and (0.1, 0.2, . . . , 0.9) (5 and 10).
121. Example: ARCH and GARCH models
Simple dynamic model, namely the ARCH(1) model:
yt = σt εt, εt ∼ N(0, 1), σt² = α0 + α1 y²_{t−1},
with uniform prior over α0, α1 ≥ 0, α0 + α1 ≤ 1.
Natural empirical-likelihood representation based on the reconstituted εt’s, defined as yt/σt when the σt’s are derived recursively
[figure: boxplots of] ABC evaluations of posterior expectations (top, with true values in dashed lines) and variances (bottom) of (α0, α1) with 100 observations. 1-2: two choices of summaries (least squares estimates, and mean of the log yt’s plus autocorrelations of order 2 and 3, respectively). 3-4: two sets of constraints (first 3 moments, and second moment plus autocorrelation of order 1 plus correlation with the previous observation, for the reconstituted εt’s).
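A sketch of the "reconstituted εt" idea above (illustration only; the stationary variance α0/(1 − α1) is assumed as the initial σ²): given (α0, α1), the σt’s follow from the recursion and εt = yt/σt, on which moment conditions (mean 0, variance 1, no autocorrelation) can be imposed for EL.

```python
import numpy as np

def reconstituted_eps(y, alpha0, alpha1):
    sigma2 = np.empty_like(y)
    sigma2[0] = alpha0 / (1 - alpha1)            # stationary variance as start-up value
    for t in range(1, len(y)):
        sigma2[t] = alpha0 + alpha1 * y[t - 1] ** 2   # ARCH(1) recursion
    return y / np.sqrt(sigma2)

# usage: simulate an ARCH(1) path, then check the residuals at the true values
rng = np.random.default_rng(7)
y, s2 = np.empty(100), 0.5
for t in range(100):
    y[t] = np.sqrt(s2) * rng.normal()
    s2 = 0.2 + 0.6 * y[t] ** 2
print(reconstituted_eps(y, 0.2, 0.6).var())      # should be near 1 at the true values
```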
122. Example: ARCH and GARCH models
GARCH(1, 1) model formalized as the observation of yt = σt εt ∼ N(0, σt²) when
σt² = α0 + α1 y²_{t−1} + β1 σ²_{t−1}
under the constraints α0, α1, β1 > 0 and α1 + β1 < 1.
Given the constraints, a natural prior is an exponential distribution on α0 and a Dirichlet D3(1, 1, 1) on (α1, β1, 1 − α1 − β1)
For ABC, use of the maximum likelihood estimator as summary statistic, provided by the R function garch
Chan and Ling (2006) derived natural score constraints for the empirical likelihood, used in the [A]BCel algorithm
123. Example: ARCH and GARCH models
[figure: boxplots of] evaluations of posterior expectations (with true values in dashed lines) of (α0, α1, β1) with 250 observations. First row: optimal ABC algorithm (using the MLE as summary statistic and with the tolerance ε chosen as the 5% quantile of the distances); second row: [A]BCel algorithm based on the constraints derived by Chan and Ling (2006); third row: MLE derived by R garch initialized at the true value.
124. Example: Superposition of gamma processes
Example of the superposition of N renewal processes with waiting times τij (i = 1, . . . , N, j = 1, . . .) ∼ G(α, β), when N is unknown.
Renewal processes
ζi1 = τi1, ζi2 = ζi1 + τi2, . . .
with observations made of the first n values of the ζij’s,
z1 = min{ζij}, z2 = min{ζij; ζij > z1}, . . .
ending with
zn = min{ζij; ζij > zn−1}.
[Cox & Kartsonaki, B’ka, 2012]
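A sketch of this observation process (illustration only; β is taken as a rate parameter here, which is an assumption): superpose N gamma renewal processes and keep the n earliest event times overall. Simulating n events per process is enough, since no process can contribute more than n of the first n merged events.

```python
import numpy as np

rng = np.random.default_rng(8)

def simulate_superposition(N, alpha, beta, n):
    # cumulative sums of G(alpha, beta) waiting times give each renewal process
    events = [np.cumsum(rng.gamma(alpha, 1 / beta, size=n)) for _ in range(N)]
    return np.sort(np.concatenate(events))[:n]   # z_1 < z_2 < ... < z_n

print(simulate_superposition(N=5, alpha=2.0, beta=1.0, n=20))
```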
125. Example: Superposition of gamma processes (ABC)
Interesting testing ground for [A]BCel since the data (zt) are neither iid nor Markov
Recovery of an iid structure by
1. simulating a pseudo-dataset (z⋆1, . . . , z⋆n), as in regular ABC,
2. deriving the sequence of indicators (ν1, . . . , νn), as
z⋆1 = ζν1 1, z⋆2 = ζν2 j2, . . .
3. exploiting the fact that those indicators are distributed from the prior distribution on the νt’s, leading to an iid sample of G(α, β) variables
Comparison of ABC and [A]BCel posteriors
[figure: posterior densities of α, β and N; top: [A]BCel, bottom: regular ABC]
127. Pop’gen’: A first experiment
Evolutionary scenario: [tree diagram: MRCA; POP 0, POP 1; divergence at τ]
Dataset: 50 genes per population, 100 microsat. loci
Assumptions:
Ne identical over all populations
φ = (log10 θ, log10 τ)
uniform prior over (−1., 1.5) × (−1., 1.)
Comparison of the original ABC with [A]BCel
[figure: posterior densities of log(theta) and log(tau1), ESS = 7034; histogram = [A]BCel, curve = original ABC, vertical line = “true” parameter]
129. ABC vs. [A]BCel on 100 replicates of the 1st experiment
Accuracy:
          log10 θ           log10 τ
        ABC     BCel      ABC     BCel
(1)    0.097    0.094    0.315    0.117
(2)    0.071    0.059    0.272    0.077
(3)    0.68     0.81     1.0      0.80
(1) Root Mean Square Error of the posterior mean
(2) Median Absolute Deviation of the posterior median
(3) Coverage of the credibility interval of probability 0.8
Computation time: on a recent 6-core computer (C++/OpenMP)
ABC ≈ 4 hours; [A]BCel ≈ 2 minutes
130. Pop’gen’: Second experiment
Evolutionary scenario: [tree diagram: MRCA; POP 0, POP 1, POP 2; divergences at τ1, τ2]
Dataset: 50 genes per population, 100 microsat. loci
Assumptions:
Ne identical over all populations
φ = (log10 θ, log10 τ1, log10 τ2)
non-informative uniform prior
Comparison of the original ABC with [A]BCel
[figure: histogram = [A]BCel, curve = original ABC, vertical line = “true” parameter]
134. ABC vs. [A]BCel on 100 replicates of the 2nd experiment
Accuracy:
          log10 θ          log10 τ1         log10 τ2
        ABC     BCel     ABC     BCel     ABC     BCel
(1)    .0059   .0794    .472    .483     29.3    4.76
(2)    .048    .053     .32     .28      4.13    3.36
(3)    .79     .76      .88     .76      .89     .79
(1) Root Mean Square Error of the posterior mean
(2) Median Absolute Deviation of the posterior median
(3) Coverage of the credibility interval of probability 0.8
Computation time: on a recent 6-core computer (C++/OpenMP)
ABC ≈ 6 hours; [A]BCel ≈ 8 minutes
135. Comparison
On large datasets, [A]BCel gives more accurate results than ABC
ABC simplifies the dataset through summary statistics
Due to the large dimension of x, the original ABC algorithm estimates
π(θ | η(xobs)),
where η(xobs) is some (non-linear) projection of the observed dataset onto a space of smaller dimension
→ Some information is lost
[A]BCel simplifies the model through a generalized moment condition model
→ Here, the moment condition model is based on the pairwise composite likelihood
136. Conclusion/perspectives
abc is part of a wider picture to handle complex/Big Data models
many formats of empirical [likelihood] Bayes methods available
lack of comparative tools and of an assessment of information loss