Reliable ABC model choice via random forests

Pierre Pudlo∗†, Jean-Michel Marin∗†, Arnaud Estoup‡, Jean-Marie Cornuet‡, Mathieu Gauthier‡, and Christian P. Robert§¶

∗Université de Montpellier 2, I3M, Montpellier, France; †Institut de Biologie Computationnelle (IBC), Montpellier, France; ‡CBGP, INRA, Montpellier, France; §Université Paris Dauphine, CEREMADE, Paris, France; and ¶University of Warwick, Coventry, UK

Submitted to Proceedings of the National Academy of Sciences of the United States of America

Author contributions: PP, JMM, AE and CPR designed and performed research; PP, JMM, AE, JMC and MG analysed data; and PP, JMM, AE and CPR wrote the paper.
Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities are poorly evaluated by ABC. We propose a novel approach based on a machine learning tool named random forests to conduct selection among the highly complex models covered by ABC algorithms. We strongly shift the way Bayesian model selection is both understood and operated, since we replace the evidential use of model posterior probabilities by predicting the model that best fits the data with random forests and computing an associated posterior error rate. Compared with past implementations of ABC model choice, the ABC random forest approach offers several improvements: (i) it has a larger discriminative power among the competing models, (ii) it is robust to the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a minimum gain in computational efficiency of around a factor of fifty), and (iv) it includes an embedded and cost-free error evaluation conditional on the actual analyzed dataset. Random forests will undoubtedly extend the range of sizes of datasets and complexity of models that ABC can handle. We illustrate the power of the ABC random forest methodology by analyzing controlled experiments as well as real population genetics datasets.
Approximate Bayesian computation | model selection | summary statistics | k-nearest neighbors | likelihood-free methods | random forests | posterior predictive | error rate | Harlequin ladybird | Bayesian model choice
Abbreviations: ABC, approximate Bayesian computation; RF, random forest; LDA, linear discriminant analysis; MAP, maximum a posteriori; nn, nearest neighbors; CART, classification and regression tree; SNP, single nucleotide polymorphism
Since its introduction (1, 2, 3), the approximate Bayesian computation (ABC) method has found an ever increasing range of applications covering diverse types of complex models (see, e.g., 4, 5, 6, 7). The principle of ABC is to conduct Bayesian inference on a dataset through comparisons with numerous simulated datasets. However, it suffers from two major difficulties. First, to ensure reliability of the method, the number of simulations must be large; hence, it proves difficult to apply ABC to large datasets (e.g., in population genomics, where ten to a hundred thousand markers are commonly genotyped). Second, calibration has always been a critical step in ABC implementation (8, 9). More specifically, the major feature of this calibration process is the selection of a vector of summary statistics that quantifies the difference between the observed data and the simulated data. The construction of this vector is therefore paramount, and examples abound of poor performance of ABC algorithms tied to specific choices of those statistics. In particular, in the setting of ABC model choice, the summaries play a crucial role in providing consistent or inconsistent inference (10, 11, 12).
We advocate here a drastic modification of the way ABC model selection is conducted: we propose both to step away from a mere mimicking of exact Bayesian solutions like posterior probabilities, and to reconsider the very problem of constructing efficient summary statistics. First, given an arbitrary pool of available statistics, we now completely bypass the selection of a subset of those. This new perspective directly proceeds from machine learning methodology. Second, we also entirely bypass the ABC estimation of model posterior probabilities, as we deem the numerical ABC approximations of such probabilities fundamentally untrustworthy, even though the approximations can preserve the proper ordering of the compared models. Having abandoned approximations of posterior probabilities, we make the crucial shift of using posterior error rates to assess the reliability of the selection made by the classifier. The statistical technique of random forests (RF) (13) represents a trustworthy machine learning tool, well adapted to the complex settings typical of ABC treatments, which allows an efficient computation of posterior error rates. We show here how RF improves upon existing classification methods by significantly reducing both the classification error and the computational expense.
Model choice 
Bayesian model choice (14, 15) compares the fit of M models to an observed dataset x0. It relies on hierarchical modelling, setting first prior probabilities π(m) on the model indices m ∈ {1, …, M} and then prior distributions π(θ | m) on the parameter θ of each model, characterized by a likelihood function f(x | m, θ). Inferences and decisions are based on the posterior probabilities of each model, π(m | x0).
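Written out (a standard Bayesian identity, stated here in the notation just introduced), the target of inference is

$$
\pi(m \mid x_0) \;=\; \frac{\pi(m)\,\int f(x_0 \mid m, \theta)\,\pi(\theta \mid m)\,\mathrm{d}\theta}{\sum_{m'=1}^{M} \pi(m')\,\int f(x_0 \mid m', \theta)\,\pi(\theta \mid m')\,\mathrm{d}\theta}\,,
$$

where each integral is the evidence (marginal likelihood) of the corresponding model. For the complex models targeted by ABC, neither the likelihoods nor these integrals are available in closed form, which is what motivates the simulation-based approximations below.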
ABC algorithms for model choice. To approximate the posterior probabilities of competing models, ABC methods (16) compare the observed data with a massive collection of pseudo-data generated from the prior; the comparison proceeds via a normalized Euclidean distance on a vector of statistics S(x) computed for both observed and simulated data. Standard ABC estimates the posterior probabilities π(m | x0) at stage (B) of Algorithm 1 below as the frequencies of the models within the k nearest-to-x0 simulations, proximity being defined by the distance between s0 and the simulated S(x)'s.
Selecting a model means choosing the model with the highest frequency in the sample of size k produced by ABC, such frequencies being approximations to the posterior probabilities of the models. We stress that this solution amounts to a k-nearest neighbor (k-nn) estimate of those probabilities, based on the set of simulations drawn at stage (A), whose records constitute
the so-called reference table. In fact, this interpretation provides a useful path to convergence properties of ABC parameter estimators (17) and properties of summary statistics used to compare hidden Markov random fields (18).
Algorithm 1 General ABC algorithm
(A) Generate Nref simulations (m, θ, S(x)) from the joint distribution π(m) π(θ | m) f(x | m, θ).
(B) Learn from this set to infer about m or θ at s0 = S(x0).
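To make Algorithm 1 concrete, the following minimal sketch implements stages (A) and (B) in their k-nn version. Everything specific in it (two toy models with Gaussian and Laplace likelihoods, the three summary statistics, and the tuning constants) is an illustrative assumption of ours, not a setting taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_REF, K, N_OBS = 10_000, 100, 50

def summaries(x):
    # illustrative summary statistics: mean, st. dev., median absolute deviation
    return np.array([x.mean(), x.std(), np.median(np.abs(x - np.median(x)))])

def simulate(m):
    # stage (A): draw theta from its prior, then data x from model m
    theta = rng.normal(0.0, 2.0)                 # prior pi(theta | m), shared by both models
    if m == 0:
        x = rng.normal(theta, 1.0, size=N_OBS)   # model 0: Gaussian likelihood
    else:
        x = rng.laplace(theta, 1.0, size=N_OBS)  # model 1: Laplace likelihood
    return summaries(x)

# reference table: model indices drawn from a uniform prior pi(m)
models = rng.integers(0, 2, size=N_REF)
table = np.array([simulate(m) for m in models])

# rescale each statistic (the 'normalized Euclidean distance' of the text)
scale = table.std(axis=0)

# stage (B), k-nn version: model frequencies among the k nearest simulations
x0 = rng.laplace(1.0, 1.0, size=N_OBS)           # pseudo-observed data from model 1
s0 = summaries(x0)
dist = np.linalg.norm((table - s0) / scale, axis=1)
nearest = models[np.argsort(dist)[:K]]
print("ABC estimate of pi(m | x0):", np.bincount(nearest, minlength=2) / K)
```

Choosing the model with the highest of the printed frequencies is exactly the k-nn model selection rule discussed above.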
A major calibration issue with ABC is the selection of the summary statistics S(x). When considering the specific goal of model selection, the ABC approximation to the posterior probabilities will eventually produce a correct ordering of the fit of the competing models to the observed data, and thus select the right model, for a specific class of statistics, when the information carried by the data becomes important (12). The state of the art for ABC model choice is thus that some statistics produce nonsensical decisions and that there exist sufficient conditions for statistics to produce consistent model prediction, albeit at the cost of an information loss, due to the summaries, that may be substantial. The toy example comparing MA(1) and MA(2) models in the SI and Fig. 1 clearly exhibits this potential loss.
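The MA(q) comparison itself is developed in the SI and is not reproduced here; the sketch below only shows, under assumptions of our own (illustrative uniform priors, the first sample autocovariances as summaries, series length 200, not the SI's exact choices), how data from the two competing models can be simulated and summarized.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200  # length of the observed series (illustrative)

def simulate_ma(thetas):
    # MA(q) process: x_t = eps_t + theta_1*eps_{t-1} + ... + theta_q*eps_{t-q}
    eps = rng.standard_normal(T + len(thetas))
    return np.convolve(eps, np.concatenate(([1.0], thetas)), mode="valid")

def summaries(x, lags=3):
    # first sample autocovariances, a natural (if lossy) summary for MA models
    xc = x - x.mean()
    return np.array([xc[: T - l] @ xc[l:] / T for l in range(lags)])

# one draw from each model; a real analysis would restrict the MA(2) prior to
# the identifiability (invertibility) region rather than use [-1, 1]^2
s_ma1 = summaries(simulate_ma(rng.uniform(-1.0, 1.0, size=1)))
s_ma2 = summaries(simulate_ma(rng.uniform(-1.0, 1.0, size=2)))
print("MA(1) summaries:", s_ma1)
print("MA(2) summaries:", s_ma2)
```

Since an MA(2) model with a small second coefficient produces autocovariances nearly identical to those of an MA(1) model, summaries of this kind cannot always separate the two models, which is the information loss at stake.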
It may seem tempting to collect the largest possible number of summary statistics so as to capture more information from the data. However, ABC algorithms, like k-nn and other local methods, suffer from the curse of dimensionality (see, e.g., Section 2.5 in (19)) and yield poor results when the number of statistics is large, as the sketch following this paragraph illustrates. Selecting summary statistics is therefore paramount, as shown by the literature of recent years (see (9) for a survey focused on ABC parameter estimation). Excursions into machine learning have so far been limited, mostly amounting to dimension reduction devices that preserve the recourse to k-nn methods; see, e.g., the call to boosting in (20) for selecting statistics in problems pertaining to parameter estimation (21).
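This degradation is easy to witness on synthetic data. The experiment below is our own illustration (arbitrary toy classes, scikit-learn's k-nn implementation), not one from the paper: a single informative statistic is padded with growing numbers of pure-noise statistics, and the cross-validated error of a k-nn classifier is tracked.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 5_000
labels = rng.integers(0, 2, size=n)
informative = rng.normal(labels, 1.0)[:, None]   # one summary that separates the classes

for extra in (0, 5, 50, 500):
    noise = rng.standard_normal((n, extra))      # irrelevant summary statistics
    X = np.hstack([informative, noise])
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=50), X, labels, cv=5).mean()
    print(f"{1 + extra:4d} statistics -> k-nn error {1 - acc:.3f}")
```

On such runs the k-nn error typically increases markedly as noise dimensions accumulate, whereas the forest-based approach advocated below is, per improvement (ii) of the abstract, largely insensitive to this padding.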
For model choice, two projection techniques have been considered. First, (22) showed that the Bayes factor itself is an acceptable summary (of dimension one) when comparing two models, but its practical evaluation via a pilot ABC simulation induces a poor approximation of the model evidences (10, 11). Second, the recourse to a regression layer like linear discriminant analysis (LDA) (23) is discussed below and in the SI (Classification method section). Given the fundamental difficulty of producing reliable tools for model choice based on summary statistics (11), we now propose to switch to a better adapted machine learning approach based on random forest (RF) classifiers.
ABC model choice via random forests. The SI provides a review of classification methods. The so-called Bayesian classifier