Reliable ABC model choice via random forests

Pierre Pudlo∗†, Jean-Michel Marin∗†, Arnaud Estoup‡, Jean-Marie Cornuet‡, Mathieu Gauthier‡, and Christian P. Robert§¶

∗Université de Montpellier 2, I3M, Montpellier, France; †Institut de Biologie Computationnelle (IBC), Montpellier, France; ‡CBGP, INRA, Montpellier, France; §Université Paris Dauphine, CEREMADE, Paris, France; and ¶University of Warwick, Coventry, UK

Submitted to Proceedings of the National Academy of Sciences of the United States of America

Author contributions: PP, JMM, AE and CPR designed and performed research; PP, JMM, AE, JMC and MG analysed data; and PP, JMM, AE and CPR wrote the paper.
Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities are poorly evaluated by ABC. We propose a novel approach based on a machine learning tool named random forests to conduct selection among the highly complex models covered by ABC algorithms. We strongly shift the way Bayesian model selection is both understood and operated, since we replace the evidential use of model posterior probabilities by predicting the model that best fits the data with random forests and computing an associated posterior error rate. Compared with past implementations of ABC model choice, the ABC random forest approach offers several improvements: (i) it has a larger discriminative power among the competing models, (ii) it is robust to the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a minimum gain in computational efficiency of around a factor of fifty), and (iv) it includes an embedded and cost-free error evaluation conditional on the actual analyzed dataset. Random forests will undoubtedly extend the range of sizes of datasets and complexity of models that ABC can handle. We illustrate the power of the ABC random forest methodology by analyzing controlled experiments as well as real population genetics datasets.
Approximate Bayesian computation | model selection | summary statistics | k-nearest neighbors | likelihood-free methods | random forests | posterior predictive | error rate | Harlequin ladybird | Bayesian model choice
Abbreviations: ABC, approximate Bayesian computation; RF, random forest; LDA, linear discriminant analysis; MAP, maximum a posteriori; nn, nearest neighbors; CART, classification and regression tree; SNP, single nucleotide polymorphism
Since its introduction (1, 2, 3), the approximate Bayesian computation (ABC) method has found an ever increasing range of applications covering diverse types of complex models (see, e.g., 4, 5, 6, 7). The principle of ABC is to conduct Bayesian inference on a dataset through comparisons with numerous simulated datasets. However, it suffers from two major difficulties. First, to ensure reliability of the method, the number of simulations must be large; hence, it proves difficult to apply ABC to large datasets (e.g., in population genomics, where ten to a hundred thousand markers are commonly genotyped). Second, calibration has always been a critical step in ABC implementation (8, 9). More specifically, the major feature of this calibration process is the selection of a vector of summary statistics that quantifies the difference between the observed data and the simulated data. The construction of this vector is therefore paramount, and examples abound of poor performance of ABC algorithms tied to specific choices of those statistics. In particular, in the setting of ABC model choice, the summaries play a crucial role in providing consistent or inconsistent inference (10, 11, 12).
We advocate here a drastic modification of the way ABC model selection is conducted: we propose both to step away from a mere mimicking of exact Bayesian solutions like posterior probabilities, and to reconsider the very problem of constructing efficient summary statistics. First, given an arbitrary pool of available statistics, we now completely bypass the selection of a subset of those. This new perspective directly proceeds from machine learning methodology. Second, we also entirely bypass the ABC estimation of model posterior probabilities, as we deem the numerical ABC approximations of such probabilities fundamentally untrustworthy, even though the approximations can preserve the proper ordering of the compared models. Having abandoned approximations of posterior probabilities, we make the crucial shift of using posterior error rates to assess the reliability of the selection made by the classifier. The statistical technique of random forests (RF) (13) represents a trustworthy machine learning tool, well adapted to the complex settings typical of ABC treatments, which allows an efficient computation of posterior error rates. We show here how RF improves upon existing classification methods by significantly reducing both the classification error and the computational expense.
Model choice 
Bayesian model choice (14, 15) compares the fit of M models to an observed dataset x0. It relies on hierarchical modelling, setting first prior probabilities π(m) on the model indices m ∈ {1, …, M} and then prior distributions π(θ | m) on the parameter θ of each model, characterized by a likelihood function f(x | m, θ). Inferences and decisions are based on the posterior probabilities of each model, π(m | x0).
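Written out (a standard Bayesian identity, stated here in the notation just introduced), the target of inference is

$$
\pi(m \mid x_0) \;=\; \frac{\pi(m)\,\int f(x_0 \mid m, \theta)\,\pi(\theta \mid m)\,\mathrm{d}\theta}{\sum_{m'=1}^{M} \pi(m')\,\int f(x_0 \mid m', \theta)\,\pi(\theta \mid m')\,\mathrm{d}\theta}\,,
$$

where each integral is the evidence (marginal likelihood) of the corresponding model. For the complex models targeted by ABC, neither the likelihoods nor these integrals are available in closed form, which is what motivates the simulation-based approximations below.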
ABC algorithms for model choice. To approximate the posterior probabilities of competing models, ABC methods (16) compare the observed data with a massive collection of pseudo-data generated from the prior; the comparison proceeds via a normalized Euclidean distance on a vector of statistics S(x) computed for both observed and simulated data. Standard ABC estimates the posterior probabilities π(m | x0) at stage (B) of Algorithm 1 below as the frequencies of the models within the k nearest-to-x0 simulations, proximity being defined by the distance between s0 and the simulated S(x)'s.
Selecting a model means choosing the model with the highest frequency in the sample of size k produced by ABC, such frequencies being approximations to the posterior probabilities of the models. We stress that this solution amounts to a k-nearest neighbor (k-nn) estimate of those probabilities, based on the set of simulations drawn at stage (A), whose records constitute
the so-called reference table. In fact, this interpretation provides a useful path to convergence properties of ABC parameter estimators (17) and properties of summary statistics used to compare hidden Markov random fields (18).
Algorithm 1 General ABC algorithm
(A) Generate Nref simulations (m, θ, S(x)) from the joint distribution π(m) π(θ | m) f(x | m, θ).
(B) Learn from this set to infer about m or θ at s0 = S(x0).
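To make Algorithm 1 concrete, the following minimal sketch implements stages (A) and (B) in their k-nn version. Everything specific in it (two toy models with Gaussian and Laplace likelihoods, the three summary statistics, and the tuning constants) is an illustrative assumption of ours, not a setting taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_REF, K, N_OBS = 10_000, 100, 50

def summaries(x):
    # illustrative summary statistics: mean, st. dev., median absolute deviation
    return np.array([x.mean(), x.std(), np.median(np.abs(x - np.median(x)))])

def simulate(m):
    # stage (A): draw theta from its prior, then data x from model m
    theta = rng.normal(0.0, 2.0)                 # prior pi(theta | m), shared by both models
    if m == 0:
        x = rng.normal(theta, 1.0, size=N_OBS)   # model 0: Gaussian likelihood
    else:
        x = rng.laplace(theta, 1.0, size=N_OBS)  # model 1: Laplace likelihood
    return summaries(x)

# reference table: model indices drawn from a uniform prior pi(m)
models = rng.integers(0, 2, size=N_REF)
table = np.array([simulate(m) for m in models])

# rescale each statistic (the 'normalized Euclidean distance' of the text)
scale = table.std(axis=0)

# stage (B), k-nn version: model frequencies among the k nearest simulations
x0 = rng.laplace(1.0, 1.0, size=N_OBS)           # pseudo-observed data from model 1
s0 = summaries(x0)
dist = np.linalg.norm((table - s0) / scale, axis=1)
nearest = models[np.argsort(dist)[:K]]
print("ABC estimate of pi(m | x0):", np.bincount(nearest, minlength=2) / K)
```

Choosing the model with the highest of the printed frequencies is exactly the k-nn model selection rule discussed above.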
A major calibration issue with ABC is the selection of the summary statistics S(x). When considering the specific goal of model selection, the ABC approximation to the posterior probabilities will eventually produce a correct ordering of the fit of the competing models to the observed data, and thus select the right model, for a specific class of statistics, when the information carried by the data becomes important (12). The state of the art for ABC model choice is thus that some statistics produce nonsensical decisions and that there exist sufficient conditions for statistics to produce consistent model prediction, albeit at the cost of an information loss, due to the summaries, that may be substantial. The toy example comparing MA(1) and MA(2) models in the SI and Fig. 1 clearly exhibits this potential loss.
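The MA(q) comparison itself is developed in the SI and is not reproduced here; the sketch below only shows, under assumptions of our own (illustrative uniform priors, the first sample autocovariances as summaries, series length 200, not the SI's exact choices), how data from the two competing models can be simulated and summarized.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200  # length of the observed series (illustrative)

def simulate_ma(thetas):
    # MA(q) process: x_t = eps_t + theta_1*eps_{t-1} + ... + theta_q*eps_{t-q}
    eps = rng.standard_normal(T + len(thetas))
    return np.convolve(eps, np.concatenate(([1.0], thetas)), mode="valid")

def summaries(x, lags=3):
    # first sample autocovariances, a natural (if lossy) summary for MA models
    xc = x - x.mean()
    return np.array([xc[: T - l] @ xc[l:] / T for l in range(lags)])

# one draw from each model; a real analysis would restrict the MA(2) prior to
# the identifiability (invertibility) region rather than use [-1, 1]^2
s_ma1 = summaries(simulate_ma(rng.uniform(-1.0, 1.0, size=1)))
s_ma2 = summaries(simulate_ma(rng.uniform(-1.0, 1.0, size=2)))
print("MA(1) summaries:", s_ma1)
print("MA(2) summaries:", s_ma2)
```

Since an MA(2) model with a small second coefficient produces autocovariances nearly identical to those of an MA(1) model, summaries of this kind cannot always separate the two models, which is the information loss at stake.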
It may seem tempting to collect the largest possible number of summary statistics so as to capture more information from the data. However, ABC algorithms, like k-nn and other local methods, suffer from the curse of dimensionality (see, e.g., Section 2.5 in (19)) and yield poor results when the number of statistics is large, as the sketch following this paragraph illustrates. Selecting summary statistics is therefore paramount, as shown by the literature of recent years (see (9) for a survey focused on ABC parameter estimation). Excursions into machine learning have so far been limited, mostly amounting to dimension reduction devices that preserve the recourse to k-nn methods; see, e.g., the call to boosting in (20) for selecting statistics in problems pertaining to parameter estimation (21).
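This degradation is easy to witness on synthetic data. The experiment below is our own illustration (arbitrary toy classes, scikit-learn's k-nn implementation), not one from the paper: a single informative statistic is padded with growing numbers of pure-noise statistics, and the cross-validated error of a k-nn classifier is tracked.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 5_000
labels = rng.integers(0, 2, size=n)
informative = rng.normal(labels, 1.0)[:, None]   # one summary that separates the classes

for extra in (0, 5, 50, 500):
    noise = rng.standard_normal((n, extra))      # irrelevant summary statistics
    X = np.hstack([informative, noise])
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=50), X, labels, cv=5).mean()
    print(f"{1 + extra:4d} statistics -> k-nn error {1 - acc:.3f}")
```

On such runs the k-nn error typically increases markedly as noise dimensions accumulate, whereas the forest-based approach advocated below is, per improvement (ii) of the abstract, largely insensitive to this padding.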
For model choice, two projection techniques have been considered. First, (22) showed that the Bayes factor itself is an acceptable summary (of dimension one) when comparing two models, but its practical evaluation via a pilot ABC simulation induces a poor approximation of the model evidences (10, 11). Second, the recourse to a regression layer like linear discriminant analysis (LDA) (23) is discussed below and in the SI (Classification method section). Given the fundamental difficulty of producing reliable tools for model choice based on summary statistics (11), we now propose to switch to a better adapted machine learning approach based on random forest (RF) classifiers.
ABC model choice via random forests. The SI provides a review of classification methods. The so-called Bayesian classifier