Reliable Approximate Bayesian computation (ABC) model choice via random forests
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
December 13, 2014, 3rd NIPS Workshop on Probabilistic Programming, NIPS 2014
bayesianstatistics@gmail.com
Joint with J.-M. Cornuet, A. Estoup, J.-M. Marin & P. Pudlo
The next O'Bayes meeting:
- Objective Bayes section of ISBA major meeting:
- O-Bayes 2015 in Valencia, Spain, June 1-4(+1), 2015
- in memory of our friend Susie Bayarri
- objective Bayes, limited information, partly defined and approximate models, etc.
- Spain in June, what else...?!
Outline 
Intractable likelihoods 
ABC methods 
ABC for model choice 
ABC model choice via random forests
Intractable likelihood
Case of a well-defined statistical model where the likelihood function
ℓ(θ|y) = f(y1, . . . , yn|θ)
- is (really!) not available in closed form
- cannot (easily!) be either completed or demarginalised
- cannot be (at all!) estimated by an unbiased estimator
- examples of latent variable models of high dimension, including combinatorial structures (trees, graphs), missing constant f(x|θ) = g(y, θ)/Z(θ) (e.g. Markov random fields, exponential graphs, . . . )
⇒ Prohibits direct implementation of a generic MCMC algorithm like Metropolis–Hastings, which gets stuck exploring missing structures
Getting approximative
Case of a well-defined statistical model where the likelihood function
ℓ(θ|y) = f(y1, . . . , yn|θ)
is out of reach
Empirical A to the original B problem
- Degrading the data precision down to tolerance level ε
- Replacing the likelihood with a non-parametric approximation
- Summarising/replacing the data with insufficient statistics
Approximate Bayesian computation 
Intractable likelihoods 
ABC methods 
Genesis of ABC 
abc of ABC 
Summary statistic 
ABC for model choice 
ABC model choice via random forests
Genetic background of ABC
ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ)
This technique stemmed from population genetics models, about 15 years ago, and population geneticists still significantly contribute to methodological developments of ABC.
[Griffiths & al., 1997; Tavaré & al., 1999]
Demo-genetic inference
Each model is characterized by a set of parameters θ that cover historical (time divergence, admixture time, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors
The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time
Problem:
most of the time, we cannot calculate the likelihood of the polymorphism data f(y|θ)...
Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium
Kingman's genealogy
When time axis is normalized, T(k) ∼ Exp(k(k-1)/2)
Mutations according to the Simple stepwise Mutation Model (SMM)
- date of the mutations ∼ Poisson process with intensity θ/2 over the branches
- MRCA = 100
- independent mutations: ±1 with pr. 1/2
Neutral model at a given microsatellite locus, in a closed panmictic population at equilibrium
Observations: leaves of the tree; θ̂ = ?
Kingman's genealogy
When time axis is normalized, T(k) ∼ Exp(k(k-1)/2)
Mutations according to the Simple stepwise Mutation Model (SMM)
- date of the mutations ∼ Poisson process with intensity θ/2 over the branches
- MRCA = 100
- independent mutations: ±1 with pr. 1/2
(a simulation sketch follows)
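A sketch of this generative model in R, for illustration only: the function simulate_locus and its toy call are ours, not part of any original software (real analyses rely on dedicated simulators such as DIYABC).

  simulate_locus <- function(n, theta, mrca = 100) {
    # nodes 1..n are leaves, n+1..(2n-1) internal; parent pointers and branch lengths
    parent <- integer(2 * n - 1); blen <- numeric(2 * n - 1); node_time <- numeric(2 * n - 1)
    active <- 1:n; t <- 0; next_node <- n + 1
    for (k in n:2) {
      t <- t + rexp(1, rate = k * (k - 1) / 2)   # T(k) ~ Exp(k(k-1)/2) on the normalized time axis
      pair <- sample(active, 2)                  # two lineages coalesce
      parent[pair] <- next_node
      blen[pair]   <- t - node_time[pair]
      node_time[next_node] <- t
      active <- c(setdiff(active, pair), next_node)
      next_node <- next_node + 1
    }
    root <- 2 * n - 1
    allele <- numeric(2 * n - 1); allele[root] <- mrca           # MRCA = 100
    for (v in (root - 1):1) {                                    # walk down from the MRCA
      steps <- rpois(1, theta / 2 * blen[v])                     # Poisson(theta/2) mutations per unit branch length
      allele[v] <- allele[parent[v]] + sum(sample(c(-1, 1), steps, replace = TRUE))  # SMM: +/-1 steps, pr. 1/2
    }
    allele[1:n]                                                  # microsatellite lengths observed at the leaves
  }

  simulate_locus(n = 10, theta = 2)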
Instance of ecological questions [message in a beetle]
- How did the Asian Ladybird beetle arrive in Europe?
- Why do they swarm right now?
- What are the routes of invasion?
- How to get rid of them?
[Lombaert & al., 2010, PLoS ONE]
Worldwide invasion routes of Harmonia axyridis
[Estoup et al., 2012, Molecular Ecology Res.]
Intractable likelihood
Missing (too much missing!) data structure:
f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG
cannot be computed in a manageable way...
[Stephens & Donnelly, 2000]
The genealogies are considered as nuisance parameters
This modelling clearly differs from the phylogenetic perspective where the tree is the parameter of interest.
A?B?C?
- A stands for approximate [wrong likelihood / picture]
- B stands for Bayesian
- C stands for computation [producing a parameter sample]
ABC methodology
Bayesian setting: target is π(θ) f(x|θ)
When likelihood f(x|θ) not in closed form, likelihood-free rejection technique:
Foundation
For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating
θ' ∼ π(θ), z ∼ f(z|θ'),
until the auxiliary variable z is equal to the observed value, z = y, then the selected
θ' ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
A as A...pproximative
When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone
ρ(y, z) ≤ ε
where ρ is a distance
Output distributed from
π(θ) Pθ{ρ(y, z) < ε} ∝ π(θ | ρ(y, z) < ε)   [by definition]
[Pritchard et al., 1999]
ABC recap
Likelihood-free rejection sampling
Tavaré et al. (1997) Genetics
1) Set i = 1,
2) Generate θ' from the prior distribution π(·),
3) Generate z' from the likelihood f(·|θ'),
4) If ρ(η(z'), η(y)) ≤ ε, set (θi, zi) = (θ', z') and i = i + 1,
5) If i ≤ N, return to 2).
We keep the values of θ such that the distance between the corresponding simulated dataset and the observed dataset is small enough.
Tuning parameters
- ε > 0: tolerance level,
- η(z): function that summarizes datasets,
- ρ(η, η'): distance between vectors of summary statistics,
- N: size of the output
(see the R sketch below)
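A minimal R sketch of this rejection sampler on a toy Gaussian model; the model, prior and summaries are illustrative assumptions, not part of the original example.

  # Likelihood-free rejection sampler (toy sketch)
  # Assumed toy model: y_i ~ N(theta, 1), prior theta ~ N(0, 10)
  summarise <- function(x) c(mean(x), var(x))          # eta(z): summary statistics
  rho <- function(s1, s2) sqrt(sum((s1 - s2)^2))        # distance between summaries

  abc_reject <- function(y, N = 500, eps = 0.2, n = length(y)) {
    s_obs <- summarise(y)
    theta_kept <- numeric(0)
    while (length(theta_kept) < N) {                    # crude loop; acceptance can be slow for small eps
      theta <- rnorm(1, 0, sqrt(10))                    # theta' ~ prior
      z <- rnorm(n, theta, 1)                           # z ~ f(.|theta')
      if (rho(summarise(z), s_obs) <= eps)              # keep theta' if the summaries are close enough
        theta_kept <- c(theta_kept, theta)
    }
    theta_kept
  }

  y <- rnorm(50, 2, 1)                                  # pretend observed data
  post_sample <- abc_reject(y)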
Output
The likelihood-free algorithm samples from the marginal in z of:
πε(θ, z|y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y} π(θ) f(z|θ) dz dθ ,
where Aε,y = {z ∈ D : ρ(η(z), η(y)) ≤ ε}.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:
πε(θ|y) = ∫ πε(θ, z|y) dz ≈ π(θ|y) .
Output
The likelihood-free algorithm samples from the marginal in z of:
πε(θ, z|y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y} π(θ) f(z|θ) dz dθ ,
where Aε,y = {z ∈ D : ρ(η(z), η(y)) ≤ ε}.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the restricted posterior distribution:
πε(θ|y) = ∫ πε(θ, z|y) dz ≈ π(θ|η(y)) .
Not so good..!
Which summary?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]
- Loss of statistical information balanced against gain in data roughening
- Approximation error and information loss remain unknown
- Choice of statistics induces choice of distance function towards standardisation
- may be imposed for external/practical reasons (e.g., DIYABC)
- may gather several non-B point estimates [the more the merrier]
- can [machine-]learn about efficient combination
attempts at summaries
How to choose the set of summary statistics?
- Joyce and Marjoram (2008, SAGMB)
- Nunes and Balding (2010, SAGMB)
- Fearnhead and Prangle (2012, JRSS B)
- Ratmann et al. (2012, PLoS Comput. Biol.)
- Blum et al. (2013, Statistical Science)
- EP-ABC of Barthelmé & Chopin (2013, JASA)
- LDA selection of Estoup & al. (2012, Mol. Ecol. Res.)
LDA advantages
- much faster computation of scenario probabilities via polychotomous regression
- a (much) lower number of explanatory variables improves the accuracy of the ABC approximation, reduces the tolerance ε and avoids extra costs in constructing the reference table
- allows for a large collection of initial summaries
- ability to evaluate Type I and Type II errors on more complex models
- LDA reduces correlation among explanatory variables
When available, using both simulated and real data sets, posterior probabilities of scenarios computed from LDA-transformed and raw summaries are strongly correlated
(a small sketch of the LDA projection follows)
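For illustration, the LDA projection of a reference table of raw summaries can be obtained with MASS::lda; the table ref and its columns below are made-up placeholders, not the DIYABC implementation.

  library(MASS)

  # Hypothetical reference table: model index m and 20 raw summary statistics
  ref <- data.frame(m = factor(sample(1:3, 5000, replace = TRUE)),
                    matrix(rnorm(5000 * 20), ncol = 20))

  fit <- lda(m ~ ., data = ref)      # linear discriminant analysis on the summaries
  ld  <- predict(fit)$x              # LDA-transformed summaries (2 axes for 3 scenarios)
  # ld then replaces the 20 raw summaries as (much) fewer explanatory variables for ABC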
ABC for model choice 
Intractable likelihoods 
ABC methods 
ABC for model choice 
ABC model choice via random forests
ABC model choice
A) Generate large set of (m, θ, z) from the Bayesian predictive, π(m) πm(θ) fm(z|θ)
B) Keep particles (m, θ, z) such that ρ(η(y), η(z)) ≤ ε
C) For each m, return p̂m = proportion of m among remaining particles
If ε tuned towards k resulting particles, then p̂m is a k-nearest neighbor estimate of P(M = m | η(y))
Approximating posterior prob's of models = regression problem where
- response is 1{M = m},
- covariates are summary statistics η(z),
- loss is, e.g., L2
Method of choice in DIYABC is local polytomous logistic regression
(see the R sketch below)
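A minimal R sketch of steps A)-C) with the tolerance tuned to keep the k nearest particles; simulate_summaries is a hypothetical stand-in for a per-model simulator returning η(z).

  # ABC model choice by k-nearest neighbours in summary space (illustrative sketch)
  abc_model_choice <- function(s_obs, n_ref = 1e4, k = 500, n_models = 3,
                               simulate_summaries) {
    m <- sample.int(n_models, n_ref, replace = TRUE)           # A) m ~ pi(m), then theta, z
    S <- t(sapply(m, simulate_summaries))                      #    summaries eta(z) of the pseudo-data
    d <- sqrt(rowSums(sweep(S, 2, s_obs)^2))                   #    Euclidean distances to eta(y)
    keep <- order(d)[1:k]                                      # B) epsilon tuned to keep k particles
    prop.table(table(factor(m[keep], levels = 1:n_models)))    # C) p_hat_m = model proportions among kept
  }

  # toy simulator: model m changes the spread of the data (hypothetical placeholder)
  simulate_summaries <- function(m) { z <- rnorm(50, 0, m); c(mean(z), var(z)) }
  abc_model_choice(s_obs = c(0, 4), simulate_summaries = simulate_summaries)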
Machine learning perspective [paradigm shift]
ABC model choice
A) Generate a large set of (m, θ, z)'s from the Bayesian predictive, π(m) πm(θ) fm(z|θ)
B) Use machine learning techniques to infer on π(m | η(y))
In this perspective:
- (iid) data set = reference table simulated during stage A)
- observed y becomes a new data point
Note that:
- predicting m is a classification problem ⇔ select the best model based on a maximal a posteriori rule
- computing π(m | η(y)) is a regression problem ⇔ confidence in each model
classification is much simpler than regression (e.g., dim. of objects we try to learn)
Warning
the loss of information induced by using non-sufficient summary statistics is a genuine problem
Fundamental discrepancy between the genuine Bayes factors/posterior probabilities and the Bayes factors based on summary statistics. See, e.g.,
- Didelot et al. (2011, Bayesian Analysis)
- X et al. (2011, PNAS)
- Marin et al. (2014, JRSS B)
- . . .
Calls instead for a machine learning approach able to handle a large number of correlated summary statistics:
random forests are well suited for that task
ABC model choice via random forests 
Intractable likelihoods 
ABC methods 
ABC for model choice 
ABC model choice via random forests 
Random forests 
ABC with random forests 
Illustrations
Leaning towards machine learning
Main notions:
- ABC-MC seen as learning about which model is most appropriate from a huge (reference) table
- exploiting a large number of summary statistics not an issue for machine learning methods intended to estimate efficient combinations
- abandoning (temporarily?) the idea of estimating posterior probabilities of the models, poorly approximated by machine learning methods, and replacing those by posterior predictive expected loss
Random forests
Technique that stemmed from Leo Breiman's bagging (or bootstrap aggregating) machine learning algorithm for both classification and regression
[Breiman, 1996]
Improved classification performances by averaging over classification schemes of randomly generated training sets, creating a forest of (CART) decision trees, inspired by Amit and Geman (1997) ensemble learning
[Breiman, 2001]
CART construction
Basic classification tree:
Algorithm 1 CART
  start the tree with a single root
  repeat
    pick a non-homogeneous tip v such that Q(v) ≠ 0
    attach to v two daughter nodes v1 and v2
    for all covariates Xj do
      find the threshold tj in the rule Xj ≤ tj that minimizes N(v1)Q(v1) + N(v2)Q(v2)
    end for
    find the rule Xj ≤ tj that minimizes N(v1)Q(v1) + N(v2)Q(v2) in j and set this best rule to node v
  until all tips v are homogeneous (Q(v) = 0)
  set the labels of all tips
where Q is Gini's index
Q(vi) = Σ_{y=1}^{M} p̂(vi, y) {1 - p̂(vi, y)} .
(an R illustration of the split search follows)
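A small R illustration of the split search at a single node, using Gini's index Q as above; a sketch on made-up data, not the CART implementation itself.

  # Gini index of a vector of labels y (M classes)
  gini <- function(y) { p <- prop.table(table(y)); sum(p * (1 - p)) }

  # Best split of one node: scan every covariate Xj and threshold tj,
  # minimising N(v1) Q(v1) + N(v2) Q(v2) over the two daughter nodes
  best_split <- function(X, y) {
    best <- list(score = Inf)
    for (j in seq_len(ncol(X))) {
      for (t in unique(X[, j])) {
        left <- X[, j] <= t
        if (!any(left) || all(left)) next                 # skip empty daughter nodes
        score <- sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])
        if (score < best$score) best <- list(j = j, t = t, score = score)
      }
    }
    best
  }

  X <- matrix(rnorm(200 * 5), ncol = 5)                    # toy covariates
  y <- factor(ifelse(X[, 2] > 0.3, "m1", "m2"))            # labels driven by X2 only
  best_split(X, y)                                         # should pick j = 2, t near 0.3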
Growing the forest
Breiman's solution for inducing random features in the trees of the forest:
- bootstrap resampling of the dataset and
- random subset-ing [of size √t] of the covariates driving the classification at every node of each tree
Covariate x that drives the node separation
x ≤ c
and the separation bound c chosen by minimising entropy or Gini index
Breiman and Cutler's algorithm
Algorithm 2 Random forests
  for t = 1 to T do
    //*T is the number of trees*//
    Draw a bootstrap sample of size nboot ≤ n
    Grow an unpruned decision tree
    for b = 1 to B do
      //*B is the number of nodes*//
      Select ntry of the predictors at random
      Determine the best split from among those predictors
    end for
  end for
  Predict new data by aggregating the predictions of the T trees
(R call sketched below)
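In practice this is what the R package randomForest implements; a sketch of a call mirroring the tuning parameters above (T trees, ntry predictors per split, subsampled bootstrap of size nboot, unpruned trees), on a made-up reference table ref.

  library(randomForest)

  # Hypothetical reference table: factor m (model index) + 30 summary statistics
  ref <- data.frame(m = factor(sample(1:3, 10000, replace = TRUE)),
                    matrix(rnorm(10000 * 30), ncol = 30))

  rf <- randomForest(m ~ ., data = ref,
                     ntree    = 500,                    # T: number of trees
                     mtry     = floor(sqrt(30)),        # ntry predictors tried at each split
                     sampsize = 1000,                   # nboot = o(n): subsampled bootstrap
                     nodesize = 1)                      # unpruned trees, grown down to single observations
  predict(rf, newdata = ref[1:5, -1])                   # aggregated (majority-vote) predictions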
Subsampling
Due to both large datasets [practical] and the theoretical recommendations of Scornet et al. (2014), from independence between trees to convergence issues, bootstrap sample of much smaller size than original data size
nboot = o(n)
Each CART tree stops when the number of observations per node is 1: no culling of the branches
ABC with random forests
Idea: Starting with
- possibly large collection of summary statistics (s1i, . . . , spi) (from scientific theory input to available statistical software, to machine-learning alternatives, to pure noise)
- ABC reference table involving model index, parameter values and summary statistics for the associated simulated pseudo-data
run R randomForest to infer M from (s1i, . . . , spi)
at each step O(√p) indices sampled at random and the most discriminating statistic selected, by minimising entropy/Gini loss
ABC with random forests
Idea: Starting with
- possibly large collection of summary statistics (s1i, . . . , spi) (from scientific theory input to available statistical software, to machine-learning alternatives, to pure noise)
- ABC reference table involving model index, parameter values and summary statistics for the associated simulated pseudo-data
run R randomForest to infer M from (s1i, . . . , spi)
Average of the trees is resulting summary statistics, highly non-linear predictor of the model index
(see the R sketch below)
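Putting the pieces together, a minimal ABC-RF sketch in R; simulate_pseudo_data and summarise are hypothetical placeholders standing in for the real demo-genetic simulator and summary statistics.

  library(randomForest)

  # Hypothetical stand-ins for the real simulator and summaries
  simulate_pseudo_data <- function(m, theta) rnorm(100, mean = theta, sd = m)   # placeholder model m
  summarise <- function(z) c(mean(z), var(z), min(z), max(z))                   # (s1, ..., sp)

  # ABC reference table: model index m, parameter theta, summaries of pseudo-data
  n_ref <- 20000
  m     <- sample(1:2, n_ref, replace = TRUE)                                   # m ~ pi(m)
  theta <- runif(n_ref, 0, 5)                                                   # theta ~ pi_m(theta)
  S     <- t(mapply(function(mi, ti) summarise(simulate_pseudo_data(mi, ti)), m, theta))
  colnames(S) <- paste0("s", seq_len(ncol(S)))
  ref   <- data.frame(m = factor(m), S)

  # Random forest trained on summaries vs model index; observed data = one new point
  rf    <- randomForest(m ~ ., data = ref, ntree = 500)
  s_obs <- as.data.frame(rbind(summarise(simulate_pseudo_data(2, 3))))          # pretend observation
  colnames(s_obs) <- colnames(S)
  predict(rf, newdata = s_obs)                                                  # MAP model index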
Outcome of ABC-RF
Random forest predicts a (MAP) model index, from the observed dataset: the predictor provided by the forest is sufficient to select the most likely model but not to derive the associated posterior probability
- exploit entire forest by computing how many trees lead to picking each of the models under comparison, but variability too high to be trusted
- frequency of trees associated with majority model is no proper substitute to the true posterior probability
- usual ABC-MC approximation equally highly variable and hard to assess
- random forests define a natural distance for ABC sample via agreement frequency
Posterior predictive expected losses
We suggest replacing the unstable approximation of
P(M = m | xo)
with xo observed sample and m model index, by the average of the selection errors across all models given the data xo,
P(M̂(X) ≠ M | xo)
where the pair (M, X) is generated from the predictive
∫ f(x|θ) π(θ, M | xo) dθ
and M̂(x) denotes the random forest model (MAP) predictor
Posterior predictive expected losses
Arguments:
- Bayesian estimate of the posterior error
- integrates error over most likely part of the parameter space
- gives an averaged error rather than the posterior probability of the null hypothesis
- easily computed: given ABC subsample of parameters from reference table, simulate pseudo-samples associated with those and derive error frequency
Algorithmic implementation
Algorithm 3 Computation of the posterior error
(a) Use the trained RF to compute proximity between each (m, θ, S(x)) of the reference table and s0 = S(x0)
(b) Select the k simulations with the highest proximity to s0
(c) For each (m, θ) in the latter set, compute Npp new simulations S(y) from f(y|θ, m)
(d) Return the frequency of erroneous RF predictions over these k × Npp simulations
(a minimal R sketch follows)
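A sketch of steps (a)-(d) in R, continuing the made-up objects (rf, ref, S, m, theta, s_obs, simulate_pseudo_data, summarise) from the previous sketch; as a simplification, proximity is measured here by Euclidean distance in summary space rather than by the forest's own proximity measure.

  k   <- 100      # number of nearest reference simulations retained
  Npp <- 10       # pseudo-samples generated per retained (m, theta)

  # (a)-(b) proximity of s_obs to the reference table (Euclidean distance stand-in)
  d    <- sqrt(rowSums(sweep(S, 2, unlist(s_obs))^2))
  near <- order(d)[1:k]

  # (c)-(d) regenerate Npp pseudo-samples for each retained pair and count RF errors
  errors <- 0
  for (i in near) {
    for (r in 1:Npp) {
      z   <- simulate_pseudo_data(m[i], theta[i])
      s_z <- as.data.frame(rbind(summarise(z))); colnames(s_z) <- colnames(S)
      errors <- errors + (as.character(predict(rf, newdata = s_z)) != as.character(m[i]))
    }
  }
  errors / (k * Npp)        # estimated posterior predictive error rate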
toy: MA(1) vs. MA(2)
Comparing an MA(1) and an MA(2) model:
x_t = ε_t - ϑ1 ε_{t-1} [- ϑ2 ε_{t-2}]
Earlier illustration using first two autocorrelations as S(x)
[Marin et al., Stat. & Comp., 2011]
Result #1: values of p(m|x) [obtained by numerical integration] and p(m|S(x)) [obtained by mixing ABC outcome and density estimation] highly differ!
(see the simulation sketch below)
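A sketch in R of how such an MA(1)/MA(2) reference table with the first two autocorrelations as S(x) can be simulated; the priors here are illustrative, not the exact setup of Marin et al. (2011).

  # Simulate one MA(q) series and return its first two autocorrelations as S(x)
  ma_summaries <- function(theta, n = 100) {
    x <- arima.sim(model = list(ma = -theta), n = n)     # arima.sim uses x_t = e_t + ma_1 e_{t-1} + ..., hence -theta
    as.numeric(acf(x, lag.max = 2, plot = FALSE)$acf[2:3])
  }

  # Reference table mixing MA(1) (m = 1) and MA(2) (m = 2) simulations
  n_ref <- 5000
  m     <- sample(1:2, n_ref, replace = TRUE)
  S     <- t(sapply(m, function(mi) ma_summaries(runif(mi, -1, 1))))  # crude uniform prior on the coefficients
  colnames(S) <- c("rho1", "rho2")
  ref   <- data.frame(m = factor(m), S)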
toy: MA(1) vs. MA(2)
Difference between the posterior probability of MA(2) given either x or S(x). Blue stands for data from MA(1), orange for data from MA(2).
toy: MA(1) vs. MA(2)
Comparing an MA(1) and an MA(2) model:
x_t = ε_t - ϑ1 ε_{t-1} [- ϑ2 ε_{t-2}]
Earlier illustration using two autocorrelations as S(x)
[Marin et al., Stat. & Comp., 2011]
Result #2: Embedded models, with simulations from MA(1) within those from MA(2), hence linear classification...