ABC short course: introduction chapters

Chapters 1 and 2 of my short course on ABC, Les Diablerets, CH

Published in: Science

  1. 1. ABC methodology and applications. Christian P. Robert, Université Paris-Dauphine, University of Warwick, & IUF. École d’Hiver, Les Diablerets, CH, Feb. 4-8 2016
  2. 2. Outline: 1 simulation-based methods in Econometrics; 2 Genetics of ABC; 3 Approximate Bayesian computation; 4 ABC for model choice; 5 ABC model choice via random forests; 6 ABC estimation via random forests; 7 [some] asymptotics of ABC
  4. 4. A motivating if pedestrian example: paired and orphan socks. A drawer contains an unknown number of socks, some of which can be paired and some of which are orphans (single). One takes at random 11 socks without replacement from this drawer: no pair can be found among those. What can we infer about the total number of socks in the drawer? • sounds like an impossible task • one observation $x = 11$ and two unknowns, $n_{\text{socks}}$ and $n_{\text{pairs}}$ • writing the likelihood is a challenge [exercise]
  6. 6. Feller’s shoes. A closet contains $n$ pairs of shoes. If $2r$ shoes are chosen at random (with $2r < n$), what is the probability that there will be (a) no complete pair, (b) exactly one complete pair, (c) exactly two complete pairs among them? [Feller, 1970, Chapter II, Exercise 26] Resolution as $p_j = \binom{n}{j}\, 2^{2r-2j} \binom{n-j}{2r-2j} \Big/ \binom{2n}{2r}$ being the probability of obtaining $j$ pairs among those $2r$ shoes, or for an odd number $t$ of shoes $p_j = 2^{t-2j} \binom{n}{j} \binom{n-j}{t-2j} \Big/ \binom{2n}{t}$
  7. 7. Feller’s shoes [Feller, 1970, Chapter II, Exercise 26], applied to the socks. If one draws 11 socks out of $m$ socks made of $f$ orphans and $g$ pairs, with $f + 2g = m$, the number $k$ of socks drawn from the orphan group is hypergeometric $\mathcal{H}(11, m, f)$ and the probability of observing 11 unmatched socks in total is $\sum_{k=0}^{11} \frac{\binom{f}{k}\binom{2g}{11-k}}{\binom{m}{11}} \times \frac{2^{11-k}\binom{g}{11-k}}{\binom{2g}{11-k}}$
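The two closed forms above can be checked numerically. The following is a minimal sketch, not the author's code: the function names `feller_pairs_prob` and `prob_no_pair` are hypothetical, and the inner term is simplified by cancelling the common factor $\binom{2g}{11-k}$.

```python
from math import comb

def feller_pairs_prob(n, t, j):
    """P(exactly j complete pairs among t shoes drawn from n pairs)."""
    if t - 2 * j < 0 or t > 2 * n:
        return 0.0
    return 2 ** (t - 2 * j) * comb(n, j) * comb(n - j, t - 2 * j) / comb(2 * n, t)

def prob_no_pair(f, g, draws=11):
    """P(no complete pair among `draws` socks) with f orphans and g pairs."""
    m = f + 2 * g
    if draws > m:
        return 0.0
    total = 0
    for k in range(draws + 1):
        rest = draws - k
        # k orphans, plus `rest` socks taken from `rest` distinct pairs
        total += comb(f, k) * 2 ** rest * comb(g, rest)
    return total / comb(m, draws)
```

With no orphans (`f = 0`) the sock formula should reduce to Feller's case (a), which gives a quick consistency check between the two expressions.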
  9. 9. A prioris on socks. Given parameters $n_{\text{socks}}$ and $n_{\text{pairs}}$, the set of socks is $S = \{s_1, s_1, \ldots, s_{n_{\text{pairs}}}, s_{n_{\text{pairs}}}, s_{n_{\text{pairs}}+1}, \ldots, s_{n_{\text{socks}}}\}$ and 11 socks picked at random from $S$ give $X$ unique socks. Rasmus’ reasoning: If you are a family of 3-4 persons then a guesstimate would be that you have something like 15 pairs of socks in store. It is also possible that you have much more than 30 socks. So as a prior for $n_{\text{socks}}$ I’m going to use a negative binomial with mean 30 and standard deviation 15. On $2n_{\text{pairs}}/n_{\text{socks}}$ I’m going to put a Beta prior distribution that puts most of the probability over the range 0.75 to 1.0, [Rasmus Bååth’s Research Blog, Oct 20th, 2014]
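  11. 11. Simulating the experiment. Given a prior distribution on $n_{\text{socks}}$ and $n_{\text{pairs}}$, $n_{\text{socks}} \sim \mathcal{N}eg(30, 15)$, $n_{\text{pairs}} \mid n_{\text{socks}} \sim (n_{\text{socks}}/2)\,\mathcal{B}e(15, 2)$, it is possible to (1) generate new values of $n_{\text{socks}}$ and $n_{\text{pairs}}$, (2) generate a new observation of $X$, the number of unique socks out of 11, (3) accept the pair $(n_{\text{socks}}, n_{\text{pairs}})$ if the realisation of $X$ is equal to 11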
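The three steps above can be sketched as a rejection sampler. This is an illustration under stated assumptions rather than the code behind the slides: the negative binomial is parameterised by matching mean 30 and standard deviation 15, $n_{\text{pairs}}$ is taken as the rounded product of $\lfloor n_{\text{socks}}/2 \rfloor$ and a Beta(15, 2) draw, and the helper names are hypothetical.

```python
import numpy as np

def negbin_params(mean, sd):
    """Map (mean, sd) to NumPy's (n, p) negative binomial parameters."""
    p = mean / sd ** 2
    return mean * p / (1 - p), p

def count_unique(n_socks, n_pairs, n_picked, rng):
    """Number of distinct sock labels among n_picked socks drawn without replacement."""
    if n_picked > n_socks:
        return 0  # not enough socks in the drawer to draw n_picked
    labels = np.concatenate([np.repeat(np.arange(n_pairs), 2),           # paired socks
                             np.arange(n_pairs, n_socks - n_pairs)])     # orphans
    return len(np.unique(rng.choice(labels, size=n_picked, replace=False)))

def abc_socks(n_sims, observed=11, n_picked=11, seed=0):
    """Keep the prior draws whose simulated X equals the observed value."""
    rng = np.random.default_rng(seed)
    size, p = negbin_params(30, 15)
    accepted = []
    for _ in range(n_sims):
        n_socks = int(rng.negative_binomial(size, p))           # step 1
        n_pairs = int(round((n_socks // 2) * rng.beta(15, 2)))  # step 1
        x = count_unique(n_socks, n_pairs, n_picked, rng)       # step 2
        if x == observed:                                       # step 3
            accepted.append((n_socks, n_pairs))
    return np.array(accepted)

posterior_draws = abc_socks(3000)
```

The first column of `posterior_draws` holds accepted values of $n_{\text{socks}}$; a histogram of it approximates the posterior discussed on the next slide.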
  12. 12. Meaning. [Figure: density of the accepted values of $n_{\text{socks}}$ (horizontal axis ns from 0 to 60, vertical axis Density)] The outcome of this simulation method returns a distribution on the pair $(n_{\text{socks}}, n_{\text{pairs}})$ that is the conditional distribution of the pair given the observation $X = 11$. Proof: Generations from $\pi(n_{\text{socks}}, n_{\text{pairs}})$ are accepted with probability $P\{X = 11 \mid (n_{\text{socks}}, n_{\text{pairs}})\}$
  13. 13. Meaning. [Figure: density of the accepted values of $n_{\text{socks}}$ (horizontal axis ns from 0 to 60, vertical axis Density)] The outcome of this simulation method returns a distribution on the pair $(n_{\text{socks}}, n_{\text{pairs}})$ that is the conditional distribution of the pair given the observation $X = 11$. Proof: Hence accepted values are distributed from $\pi(n_{\text{socks}}, n_{\text{pairs}}) \times P\{X = 11 \mid (n_{\text{socks}}, n_{\text{pairs}})\} \propto \pi(n_{\text{socks}}, n_{\text{pairs}} \mid X = 11)$
  14. 14. Econ’ections [section 1: simulation-based methods in Econometrics]
  15. 15. Usages of simulation in Econometrics. Similar exploration of simulation-based techniques in Econometrics • Simulated method of moments • Method of simulated moments • Simulated pseudo-maximum-likelihood • Indirect inference [Gouriéroux & Monfort, 1996]
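  16. 16. Simulated method of moments. Given observations $y^o_{1:n}$ from a model $y_t = r(y_{1:(t-1)}, \epsilon_t, \theta)$, $\epsilon_t \sim g(\cdot)$, simulate $\epsilon^\star_{1:n}$, derive $y_t(\theta) = r(y_{1:(t-1)}, \epsilon^\star_t, \theta)$ and estimate $\theta$ by $\arg\min_\theta \sum_{t=1}^{n} (y^o_t - y_t(\theta))^2$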
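  17. 17. Simulated method of moments. Given observations $y^o_{1:n}$ from a model $y_t = r(y_{1:(t-1)}, \epsilon_t, \theta)$, $\epsilon_t \sim g(\cdot)$, simulate $\epsilon^\star_{1:n}$, derive $y_t(\theta) = r(y_{1:(t-1)}, \epsilon^\star_t, \theta)$ and estimate $\theta$ by $\arg\min_\theta \left\{ \sum_{t=1}^{n} y^o_t - \sum_{t=1}^{n} y_t(\theta) \right\}^2$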
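A minimal sketch of the second criterion (matching the sums of the observed and simulated series) is given below. Everything specific in it is an assumption made for illustration: the toy model $y_t = \theta + 0.5\,y_{t-1} + \epsilon_t$, the grid search, the recursion on the simulated past, and the names `simulate_path` and `smm_estimate`. The simulated noise $\epsilon^\star$ is held fixed across candidate values of $\theta$.

```python
import numpy as np

def simulate_path(theta, eps, phi=0.5):
    """Toy model (an assumption): y_t = theta + phi * y_{t-1} + eps_t, y_0 = 0."""
    y = np.empty(len(eps))
    prev = 0.0
    for t, e in enumerate(eps):
        prev = theta + phi * prev + e
        y[t] = prev
    return y

def smm_estimate(y_obs, eps_star, grid):
    """Grid minimiser of (sum_t y_obs[t] - sum_t y_t(theta))**2 for fixed eps_star."""
    target = y_obs.sum()
    crit = [(target - simulate_path(th, eps_star).sum()) ** 2 for th in grid]
    return grid[int(np.argmin(crit))]

rng = np.random.default_rng(0)
n = 500
y_obs = simulate_path(1.0, rng.normal(size=n))  # pseudo-observations, true theta = 1
eps_star = rng.normal(size=n)                   # simulated noise, reused for every theta
grid = np.linspace(0.0, 2.0, 201)
theta_hat = smm_estimate(y_obs, eps_star, grid)
```

Reusing the same `eps_star` for every candidate makes the criterion a smooth function of $\theta$, which is what allows a simple minimisation.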
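  18. 18. Method of simulated moments. Given a statistic vector $K(y)$ with $E_\theta[K(Y_t) \mid y_{1:(t-1)}] = k(y_{1:(t-1)}; \theta)$, find an unbiased estimator of $k(y_{1:(t-1)}; \theta)$, $\tilde{k}(\epsilon_t, y_{1:(t-1)}; \theta)$. Estimate $\theta$ by $\arg\min_\theta \left\| \sum_{t=1}^{n} \left[ K(y_t) - \sum_{s=1}^{S} \tilde{k}(\epsilon^s_t, y_{1:(t-1)}; \theta)/S \right] \right\|$ [Pakes & Pollard, 1989]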
  19. 19. Indirect inference. Minimise (in $\theta$) the distance between estimators $\hat\beta$ based on pseudo-models for genuine observations and for observations simulated under the true model and the parameter $\theta$. [Gouriéroux, Monfort, & Renault, 1993; Smith, 1993; Gallant & Tauchen, 1996]
  20. 20. Indirect inference (PML vs. PSE). Example of the pseudo-maximum-likelihood (PML) $\hat\beta(y) = \arg\max_\beta \sum_t \log f(y_t \mid \beta, y_{1:(t-1)})$ leading to $\arg\min_\theta \|\hat\beta(y^o) - \hat\beta(y_1(\theta), \ldots, y_S(\theta))\|^2$ when $y_s(\theta) \sim f(y \mid \theta)$, $s = 1, \ldots, S$
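  21. 21. Indirect inference (PML vs. PSE). Example of the pseudo-score-estimator (PSE) $\hat\beta(y) = \arg\min_\beta \left\{ \sum_t \frac{\partial \log f}{\partial \beta}(y_t \mid \beta, y_{1:(t-1)}) \right\}^2$ leading to $\arg\min_\theta \|\hat\beta(y^o) - \hat\beta(y_1(\theta), \ldots, y_S(\theta))\|^2$ when $y_s(\theta) \sim f(y \mid \theta)$, $s = 1, \ldots, S$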
  23. 23. Consistent indirect inference. "...in order to get a unique solution the dimension of the auxiliary parameter $\beta$ must be larger than or equal to the dimension of the initial parameter $\theta$. If the problem is just identified the different methods become easier..." Consistency depends on the criterion and on the asymptotic identifiability of $\theta$ [Gouriéroux & Monfort, 1996, p. 66]
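  24. 24. AR(2) vs. MA(1) example. True (MA) model $y_t = \epsilon_t - \theta \epsilon_{t-1}$ and [wrong!] auxiliary (AR) model $y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + u_t$. R code:
      # observed series: MA(1) with theta = 0.5
      eps <- rnorm(250)
      x <- eps
      x[2:250] <- eps[2:250] - 0.5 * eps[1:249]
      # auxiliary AR(2) estimate on the observations
      bethat <- as.vector(arima(x, order = c(2, 0, 0), include.mean = FALSE)$coef)
      # simulated noise, reused for every candidate theta
      simeps <- rnorm(250)
      propeta <- seq(-0.99, 0.99, length.out = 199)
      dist <- rep(0, 199)
      for (t in 1:199) {
        sim <- c(simeps[1], simeps[2:250] - propeta[t] * simeps[1:249])
        betsim <- as.vector(arima(sim, order = c(2, 0, 0), include.mean = FALSE)$coef)
        dist[t] <- sum((betsim - bethat)^2)  # squared distance between AR(2) estimates
      }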
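For readers without R, the same experiment can be sketched in Python. This is an illustration under assumptions, not a port of the slide's code: the auxiliary AR(2) coefficients are fitted by ordinary least squares instead of the maximum-likelihood fit of R's `arima`, the series length is increased to 1000, and the function names are hypothetical.

```python
import numpy as np

def ma1(eps, theta):
    """MA(1) series y_t = eps_t - theta * eps_{t-1} (y_1 = eps_1)."""
    y = eps.copy()
    y[1:] = eps[1:] - theta * eps[:-1]
    return y

def ar2_ols(y):
    """Least-squares estimate of (beta1, beta2) in y_t = beta1*y_{t-1} + beta2*y_{t-2} + u_t."""
    lags = np.column_stack([y[1:-1], y[:-2]])
    beta, *_ = np.linalg.lstsq(lags, y[2:], rcond=None)
    return beta

def indirect_inference(y_obs, sim_eps, grid):
    """Return the grid value minimising the distance between AR(2) estimates."""
    beta_obs = ar2_ols(y_obs)
    dist = np.array([np.sum((ar2_ols(ma1(sim_eps, th)) - beta_obs) ** 2)
                     for th in grid])
    return grid[int(np.argmin(dist))], dist

rng = np.random.default_rng(1)
n = 1000
y_obs = ma1(rng.normal(size=n), 0.5)       # pseudo-observations, true theta = 0.5
grid = np.linspace(-0.99, 0.99, 199)
theta_hat, dist = indirect_inference(y_obs, rng.normal(size=n), grid)
```

Plotting `dist` against `grid` gives the kind of distance curve shown for one sample on the next slide.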
  25. 25. AR(2) vs. MA(1) example. One sample: [Figure: distance as a function of $\theta$ over $(-1, 1)$]
  26. 26. AR(2) vs. MA(1) example. Many samples: [Figure]
  27. 27. Choice of pseudo-model. Pick a model such that (1) $\hat\beta(\theta)$ is not flat (i.e. sensitive to changes in $\theta$), (2) $\hat\beta(\theta)$ is not dispersed (i.e. robust against changes in $y_s(\theta)$) [Frigessi & Heggland, 2004]
  28. 28. ABC using indirect inference (1). "We present a novel approach for developing summary statistics for use in approximate Bayesian computation (ABC) algorithms by using indirect inference(...) In the indirect inference approach to ABC the parameters of an auxiliary model fitted to the data become the summary statistics. Although applicable to any ABC technique, we embed this approach within a sequential Monte Carlo algorithm that is completely adaptive and requires very little tuning(...)" [Drovandi, Pettitt & Faddy, 2011] Indirect inference provides summary statistics for ABC...
  29. 29. ABC using indirect inference (2). "...the above result shows that, in the limit as $h \to 0$, ABC will be more accurate than an indirect inference method whose auxiliary statistics are the same as the summary statistic that is used for ABC(...) Initial analysis showed that which method is more accurate depends on the true value of $\theta$." [Fearnhead and Prangle, 2012] Indirect inference provides estimates rather than global inference...
  30. 30. Genetics of ABC [section 2 of the outline]
  31. 31. Genetic background of ABC. ABC is a recent computational technique that only requires a generative model, i.e., being able to sample from the density $f(\cdot \mid \theta)$. This technique stemmed from population genetics models, about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC. [Griffith & al., 1997; Tavaré & al., 1999]
  32. 32. Population genetics [Part derived from the teaching material of Raphael Leblois, ENS Lyon, November 2010] • Describe the genotypes, estimate the allele frequencies, determine their distribution among individuals, within populations and between populations; • Predict and understand the evolution of gene frequencies in populations as a result of various factors. Analyses the effect of various evolutionary forces (mutation, drift, migration, selection) on the evolution of gene frequencies in time and space.
  33. 33. Wright-Fisher model • A population of constant size, in which individuals reproduce at the same time. • Each gene in a generation is a copy of a gene of the previous generation. • In the absence of mutation and selection, allele frequencies drift (increase and decrease) inevitably until the fixation of an allele. • Drift therefore leads to the loss of genetic variation within populations.
  34. 34. Coalescent theory [Kingman, 1982; Tajima, Tavaré, etc.] Coalescent theory is interested in the genealogy of a sample of genes, traced back in time to the common ancestor of the sample.
  35. 35. Common ancestor. [Figure: genealogy of a sample, with the time of coalescence (T) on the vertical axis] Modelling of the genetic drift process by "going back in time" to the common ancestor of a sample of genes. The different lineages merge (coalesce) as we go back in the past.
  36. 36. Neutral mutations • Under the assumption of neutrality of the genetic markers studied, the mutations are independent of the genealogy, i.e. the genealogy depends only on the demographic processes. • We therefore construct the genealogy according to the demographic parameters (e.g. N), then we add a posteriori the mutations on the different branches, from the MRCA to the leaves of the tree. • This produces polymorphism data under the demographic and mutational models considered.
  39. 39. Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium. Observations: leaves of the tree, $\hat\theta = ?$ Kingman’s genealogy: when the time axis is normalized, $T(k) \sim \mathcal{E}xp(k(k-1)/2)$. Mutations according to the Simple stepwise Mutation Model (SMM) • date of the mutations $\sim$ Poisson process with intensity $\theta/2$ over the branches • MRCA = 100 • independent mutations: $\pm 1$ with pr. 1/2
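The generative model on this slide can be sketched as follows. This is an illustrative implementation with hypothetical names, and it takes one shortcut that should be read as an assumption: because stepwise mutations are additive, the $\pm 1$ steps are accumulated on the leaves below each lineage while moving back in time, instead of being propagated from the MRCA down the finished tree.

```python
import numpy as np

def simulate_smm_sample(k, theta, mrca=100, rng=None):
    """Allele sizes of k genes under Kingman's coalescent with SMM mutations."""
    rng = np.random.default_rng() if rng is None else rng
    alleles = np.full(k, mrca, dtype=int)
    lineages = [[i] for i in range(k)]       # leaves carried by each ancestral lineage
    while len(lineages) > 1:
        j = len(lineages)
        # time spent with j lineages: T(j) ~ Exp(rate j(j-1)/2)
        t_j = rng.exponential(scale=2.0 / (j * (j - 1)))
        for leaves in lineages:
            # Poisson(theta/2 * branch length) mutations, each +1 or -1 with pr. 1/2
            n_mut = rng.poisson(theta / 2 * t_j)
            if n_mut:
                alleles[leaves] += rng.choice([-1, 1], size=n_mut).sum()
        # two lineages chosen uniformly at random coalesce
        a, b = rng.choice(j, size=2, replace=False)
        merged = lineages[a] + lineages[b]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
    return alleles

sample = simulate_smm_sample(k=50, theta=2.0, rng=np.random.default_rng(3))
```

Being able to produce such samples for any $\theta$ is all that ABC requires of the model.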
  40. 40. Much more interesting models... • several independent loci: independent gene genealogies and mutations • different populations linked by an evolutionary scenario made of divergences, admixtures, migrations between populations, selection pressure, etc. • larger sample size, usually between 50 and 100 genes
  41. 41. Available population scenarios. Between populations: three types of events, backward in time • the divergence is the fusion between two populations, • the admixture is the split of a population into two parts, • the migration allows the move of some lineages of a population to another. [Figure 2.2: Example of a genealogy of five individuals from a single closed population at equilibrium. The sampled individuals are the leaves of the dendrogram, the inter-coalescence times $T_2, \ldots, T_5$ are independent, and $T_k$ is exponential with parameter $k(k-1)/2$.] [Figure 2.3: Graphical representations of the three types of inter-population events of a demographic scenario: (a) divergence, (b) admixture, (c) migration.]
  42. 42. A complex scenario. The goal is to discriminate between different population scenarios from a dataset of polymorphism (DNA sample) $y$ observed at the present time. [Figure 2.1: Example of a complex evolutionary scenario composed of inter-population events (divergence, admixture, migration).]
  43. 43. Demo-genetic inference. Each model is characterized by a set of parameters $\theta$ that cover historical (divergence times, admixture times, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors. The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) $y$ observed at the present time. Problem: most of the time, we cannot calculate the likelihood of the polymorphism data $f(y \mid \theta)$.
  44. 44. Intractable likelihood. Missing (too missing!) data structure: $f(y \mid \theta) = \int_{\mathcal{G}} f(y \mid G, \theta) f(G \mid \theta)\,\mathrm{d}G$. The genealogies are considered as nuisance parameters. This problem thus differs from the phylogenetic approach where the tree is the parameter of interest.
  45. 45. A genuine example of application. Pygmy populations: do they have a common origin? Are there many exchanges between Pygmy and non-Pygmy populations?
  46. 46. Scenarios under competition. Several possible scenarios, choice of scenario by ABC [Verdu et al., 2009]
  47. 47. Simulation results. Several possible scenarios, choice of scenario by ABC. Scenario 1a is largely supported relative to the others, which argues for a common origin of the Pygmy populations of West Africa [Verdu et al.]. Scenario 1A is chosen.
  48. 48. Most likely scenario. Evolutionary scenario: a story is "told" from these inferences [Verdu et al., 2009]
  49. 49. Instance of ecological questions [message in a beetle] • How did the Asian ladybird beetle arrive in Europe? • Why do they swarm right now? • What are the routes of invasion? • How to get rid of them? • Why did the chicken cross the road? [Lombaert & al., 2010, PLoS ONE] beetles in forests
  50. 50. Worldwide invasion routes of Harmonia axyridis. For each outbreak, the arrow indicates the most likely invasion pathway and the associated posterior probability, with 95% credible intervals in brackets [Estoup et al., 2012, Molecular Ecology Res.]
  52. 52. A population genetic illustration of ABC model choice. Two populations (1 and 2) having diverged at a fixed known time in the past and a third population (3) which diverged from one of those two populations (models 1 and 2, respectively). Observation of 50 diploid individuals per population, genotyped at 5, 50 or 100 independent microsatellite loci. [Figure: Model 2]
  53. 53. A population genetic illustration of ABC model choice. Two populations (1 and 2) having diverged at a fixed known time in the past and a third population (3) which diverged from one of those two populations (models 1 and 2, respectively). Observation of 50 diploid individuals per population, genotyped at 5, 50 or 100 independent microsatellite loci. Stepwise mutation model: the number of repeats of the mutated gene increases or decreases by one. Mutation rate $\mu$ common to all loci, set to 0.005 (single parameter), with uniform prior distribution $\mu \sim \mathcal{U}[0.0001, 0.01]$
  54. 54. A population genetic illustration of ABC model choice. Summary statistics associated with the $(\delta\mu)^2$ distance: $x_{l,i,j}$ is the number of repeats of the allele at locus $l = 1, \ldots, L$ for individual $i = 1, \ldots, 100$ within population $j = 1, 2, 3$. Then $(\delta\mu)^2_{j_1,j_2} = \frac{1}{L} \sum_{l=1}^{L} \left( \frac{1}{100} \sum_{i_1=1}^{100} x_{l,i_1,j_1} - \frac{1}{100} \sum_{i_2=1}^{100} x_{l,i_2,j_2} \right)^2$.
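The statistic above is a mean over loci of squared differences between population-average allele sizes. A minimal sketch follows, assuming the data are stored in an array `x` of shape (loci, gene copies, populations); the array layout and the function name `delta_mu_sq` are choices made for this illustration, and populations are indexed from 0.

```python
import numpy as np

def delta_mu_sq(x, j1, j2):
    """(delta mu)^2 distance between populations j1 and j2.

    x[l, i, j] is the number of repeats at locus l for gene copy i in population j.
    """
    mean_sizes = x.mean(axis=1)                # average allele size per locus and population
    diff = mean_sizes[:, j1] - mean_sizes[:, j2]
    return float(np.mean(diff ** 2))           # average of squared differences over loci

# toy data: 2 loci, 100 gene copies, 3 populations
x = np.zeros((2, 100, 3))
x[:, :, 1] = 3.0
x[0, :, 2] = 2.0
x[1, :, 2] = 4.0
d12 = delta_mu_sq(x, 0, 1)
d13 = delta_mu_sq(x, 0, 2)
```

In the toy data the first pair differs by 3 repeats at both loci, and the second pair by 2 and 4 repeats.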
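  55. 55. A population genetic illustration of ABC model choice. For two copies of locus $l$ with allele sizes $x_{l,i,j_1}$ and $x_{l,i',j_2}$, the most recent common ancestor is at coalescence time $\tau_{j_1,j_2}$, for a gene genealogy distance of $2\tau_{j_1,j_2}$, hence the number of mutations is Poisson with parameter $2\mu\tau_{j_1,j_2}$. Therefore, $E\left[ (x_{l,i,j_1} - x_{l,i',j_2})^2 \mid \tau_{j_1,j_2} \right] = 2\mu\tau_{j_1,j_2}$ and, where $t$ is the divergence time of populations 1 and 2 and $t'$ the divergence time of population 3,
      Expectation | Model 1 | Model 2
      $E[(\delta\mu)^2_{1,2}]$ | $2\mu_1 t$ | $2\mu_2 t$
      $E[(\delta\mu)^2_{1,3}]$ | $2\mu_1 t'$ | $2\mu_2 t$
      $E[(\delta\mu)^2_{2,3}]$ | $2\mu_1 t$ | $2\mu_2 t'$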
  56. 56. A population genetic illustration of ABC model choice. Thus, • Bayes factor based only on distance $(\delta\mu)^2_{1,2}$ not convergent: if $\mu_1 = \mu_2$, same expectation • Bayes factor based only on distance $(\delta\mu)^2_{1,3}$ or $(\delta\mu)^2_{2,3}$ not convergent: if $\mu_1 = 2\mu_2$ or $2\mu_1 = \mu_2$, same expectation • if two of the three distances are used, Bayes factor converges: there is no $(\mu_1, \mu_2)$ for which all expectations are equal
  57. 57. A population genetic illustration of ABC model choice. [Figure: three panels, DM2(12), DM2(13), and DM2(13) & DM2(23), each plotted against the number of loci (5, 50, 100)] Posterior probabilities that the data is from model 1 for 5, 50 and 100 loci