This document summarizes a presentation on testing hypotheses as mixture estimation and the challenges of Bayesian testing. The key points are:
1) Bayesian hypothesis testing faces challenges including the dependence on prior distributions, difficulties interpreting Bayes factors, and the inability to use improper priors in most situations.
2) Testing via mixtures is proposed as a paradigm shift that frames hypothesis testing as a model selection problem involving mixture models rather than distinct hypotheses.
3) Traditional Bayesian testing using Bayes factors and posterior probabilities depends strongly on prior distributions and choices that are difficult to justify, while not providing measures of uncertainty around decisions. Alternative approaches are needed to address these issues.
This document proposes representing hypothesis testing problems as estimating mixture models. Specifically, two competing models are embedded within an encompassing mixture model with a weight parameter between 0 and 1. Inference is then drawn on the mixture representation, treating each observation as coming from the mixture model. This avoids difficulties with traditional Bayesian testing approaches like computing marginal likelihoods. It also allows for a more intuitive interpretation of the weight parameter compared to posterior model probabilities. The weight parameter can be estimated using standard mixture estimation algorithms like Gibbs sampling or Metropolis-Hastings. Several illustrations of the approach are provided, including comparisons of Poisson and geometric distributions.
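The weight-estimation idea above lends itself to a short illustration. Below is a minimal sketch, assuming fixed component parameters (the Poisson mean lam and geometric success probability p) and a uniform prior on the weight; the actual approach would also update the component parameters within the sampler, and all data here are simulated for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy count data whose origin (Poisson vs geometric) is in question.
x = rng.poisson(2.5, size=100)

# Moment-matched component parameters, held fixed for this sketch.
lam = x.mean()
p = 1.0 / (1.0 + x.mean())

def log_lik(alpha):
    # Each observation comes from alpha * Poisson(lam) + (1 - alpha) * Geom(p).
    a = np.log(alpha) + stats.poisson.logpmf(x, lam)
    b = np.log1p(-alpha) + stats.geom.logpmf(x + 1, p)  # shift: support {0,1,...}
    return np.logaddexp(a, b).sum()

# Random-walk Metropolis-Hastings on the mixture weight alpha.
alpha, draws = 0.5, []
for _ in range(5000):
    prop = alpha + 0.05 * rng.standard_normal()
    if 0.0 < prop < 1.0:  # flat prior on (0, 1); reject proposals outside
        if np.log(rng.uniform()) < log_lik(prop) - log_lik(alpha):
            alpha = prop
    draws.append(alpha)

# A posterior for alpha concentrating near 1 favors the Poisson component.
print("posterior mean weight of the Poisson component:", np.mean(draws[1000:]))
```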
Discussion of Persi Diaconis' lecture at ISBA 2016 (Christian Robert)
This document discusses Monte Carlo methods for numerical integration and estimating normalizing constants. It summarizes several approaches: estimating normalizing constants using samples; reverse logistic regression for estimating constants in mixtures; Xiao-Li Meng's maximum likelihood formulation of Monte Carlo integration; and Persi Diaconis' probabilistic numerics, which attach uncertainties to numerical calculations. The document advocates first approximating the distribution of an integrand before estimating its expectation, in order to incorporate non-parametric information and account for multiple estimators.
This document discusses Bayesian hypothesis testing and some of the challenges associated with it. It makes three key points:
1) There is tension between using posterior probabilities derived from a loss-function approach and using Bayes factors, which remove the dependence on the prior probabilities of the hypotheses but have no direct connection to the posterior.
2) Bayesian hypothesis testing relies on choosing prior probabilities for hypotheses and prior distributions for parameters, which can strongly impact results and are often arbitrary.
3) Common Bayesian testing procedures like using Bayes factors can produce paradoxical results in some cases, like Lindley's paradox, where the Bayes factor increasingly favors the null hypothesis as the sample size grows even though the p-value indicates evidence against it.
On the vexing dilemma of hypothesis testing and the predicted demise of the Bayes factor (Christian Robert)
The document discusses hypothesis testing from both frequentist and Bayesian perspectives. It introduces the concept of statistical tests as functions that output accept or reject decisions for hypotheses. P-values are presented as a way to quantify uncertainty in these decisions. Bayes' original 1763 paper on Bayesian statistics is summarized, introducing the concept of the posterior distribution. Bayesian hypothesis testing is then discussed, including the optimal Bayes test and the use of Bayes factors to compare hypotheses without requiring prior probabilities on the hypotheses.
- Approximate Bayesian computation (ABC) is a technique used when the likelihood function is intractable or unavailable. It approximates the Bayesian posterior distribution in a likelihood-free manner.
- ABC works by simulating parameter values from the prior and simulating pseudo-data. Parameter values are accepted if the simulated pseudo-data are "close" to the observed data according to some distance measure and tolerance level (a minimal sketch follows this list).
- ABC originated in population genetics models where genealogies are considered nuisance parameters that cannot be integrated out of the likelihood. It has since been applied to other fields like econometrics for models with complex or undefined likelihoods.
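Here is the sketch referenced above: a minimal ABC rejection sampler, with an assumed Poisson model, an exponential prior, the sample mean as summary statistic, and an arbitrary tolerance. Everything is illustrative rather than drawn from the documents themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed data and its summary statistic (here, the sample mean).
x_obs = rng.poisson(3.0, size=50)
s_obs = x_obs.mean()

def abc_rejection(n_sims=100_000, tol=0.1):
    """Keep prior draws whose simulated pseudo-data are 'close' to the data."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.exponential(5.0)           # simulate from the prior
        pseudo = rng.poisson(theta, size=50)   # simulate pseudo-data
        if abs(pseudo.mean() - s_obs) < tol:   # distance on summaries vs tolerance
            accepted.append(theta)
    return np.array(accepted)

post = abc_rejection()
print(f"{post.size} draws accepted; approximate posterior mean: {post.mean():.3f}")
```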
This lecture introduces Bayesian hypothesis testing. It discusses an example comparing HIV infection rates between a treatment and placebo group. A Bayesian analysis is presented that calculates posterior probabilities for the null and alternative hypotheses using prior probabilities and Bayes factors. The lecture outlines general notation for Bayesian testing and discusses issues like choosing prior distributions and testing precise versus imprecise hypotheses. It also discusses interpreting Bayes factors and relates posterior probabilities to p-values in some cases.
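For reference, the standard identity such an analysis relies on: writing m0(x) and m1(x) for the marginal likelihoods under the two hypotheses and π(H0) for the prior probability of the null, the Bayes factor converts to a posterior probability as

```latex
B_{01} = \frac{m_0(x)}{m_1(x)}
       = \frac{\int_{\Theta_0} f(x \mid \theta)\,\pi_0(\theta)\,\mathrm{d}\theta}
              {\int_{\Theta_1} f(x \mid \theta)\,\pi_1(\theta)\,\mathrm{d}\theta},
\qquad
P(H_0 \mid x) = \left( 1 + \frac{1 - \pi(H_0)}{\pi(H_0)} \cdot \frac{1}{B_{01}} \right)^{-1}.
```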
The document describes a course on model uncertainty taught in the fall of 2018 over 12 weeks by two lecturers. Weekly topics include an introduction to uncertainty, statistical and mathematical model uncertainty, Bayesian hypothesis testing and model uncertainty, priors for Bayesian model uncertainty, approximations and computation, representation of model inputs and outputs, model calibration, Gaussian processes, surrogate models, sampling techniques, sensitivity analysis, and model discrepancy.
"reflections on the probability space induced by moment conditions with impli...Christian Robert
This document discusses using moment conditions to perform Bayesian inference when the likelihood function is intractable or unknown. It outlines some approaches that have been proposed, including approximating the likelihood using empirical likelihood or pseudo-likelihoods. However, these approaches do not guarantee the same consistency as a true likelihood. Alternative approximate Bayesian methods are also discussed, such as Approximate Bayesian Computation, Integrated Nested Laplace Approximation, and variational Bayes. The empirical likelihood method constructs a likelihood from generalized moment conditions, but its use in Bayesian inference requires further analysis of consistency in each application.
This document discusses Bayesian approaches to combining evidence from multiple data sources or models. It recommends combining data probabilistically using Bayes' theorem rather than averaging. It provides an example of combining three data sources on annual rainfall measurements by treating the data sources as independent measurements and deriving the posterior distribution of rainfall amounts given the data. It also discusses challenges that arise when combining dependent data sources or models, and presents examples of hierarchical modeling approaches.
random forests for ABC model choice and parameter estimation (Christian Robert)
The document discusses Approximate Bayesian Computation (ABC). It begins by introducing ABC as a likelihood-free method for Bayesian inference when the likelihood function is unavailable or computationally intractable. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data based on a distance measure.
The document then discusses advances in ABC, including modifying the proposal distribution to increase efficiency, viewing it as a conditional density estimation problem, and including measurement error in the framework. It also discusses the consistency of ABC as the number of simulations increases and sample size grows large. Finally, it discusses applications of ABC to model selection by treating the model index as an additional parameter.
This document discusses various methods for estimating normalizing constants that arise when evaluating integrals numerically. It begins by noting there are many computational methods for approximating normalizing constants across different communities. It then lists the topics that will be covered in the upcoming workshop, including discussions on estimating constants using Monte Carlo methods and Bayesian versus frequentist approaches. The document provides examples of estimating normalizing constants using Monte Carlo integration, reverse logistic regression, and Xiao-Li Meng's maximum likelihood estimation approach. It concludes by discussing some of the challenges in bringing a statistical framework to constant estimation problems.
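As a baseline for the Monte Carlo approaches listed above, here is a minimal importance-sampling estimate of a normalizing constant, using an assumed bimodal unnormalized density and a Gaussian proposal chosen purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def f_unnorm(x):
    # Hypothetical unnormalized target: two Gaussian bumps on the real line.
    return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

# Z = integral of f = E_q[f(X)/q(X)] for any proposal q covering f's support.
q = stats.norm(0.0, 3.0)
xs = q.rvs(size=200_000, random_state=rng)
weights = f_unnorm(xs) / q.pdf(xs)

print("Z estimate:", weights.mean())          # exact value: 2*sqrt(2*pi) ~= 5.013
print("Monte Carlo s.e.:", weights.std() / np.sqrt(xs.size))
```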
This document discusses challenges and recent advances in Approximate Bayesian Computation (ABC) methods. ABC methods are used when the likelihood function is intractable or unavailable in closed form. The core ABC algorithm involves simulating parameters from the prior and simulating data, retaining simulations where the simulated and observed data are close according to a distance measure on summary statistics. The document outlines key issues like scalability to large datasets, assessment of uncertainty, and model choice, and discusses advances such as modified proposals, nonparametric methods, and perspectives that include summary construction in the framework. Validation of ABC model choice and selection of summary statistics remains an open challenge.
Approximate Bayesian model choice via random forests (Christian Robert)
The document describes approximate Bayesian computation (ABC) methods for model choice when likelihoods are intractable. ABC generates parameter-dataset pairs from the prior and retains those where the simulated and observed datasets are similar according to a distance measure on summary statistics. For model choice, ABC approximates posterior model probabilities by the proportion of simulations from each model that are retained. Machine learning techniques can also be used to infer the most likely model directly from the simulated summary statistics.
better together? statistical learning in models made of modules (Christian Robert)
The document discusses statistical models composed of components called modules. Each module may be developed independently and represent different data modalities or domains of knowledge. Joint Bayesian updating treats all modules simultaneously, but misspecification of one module can impact the others. Alternative approaches are proposed to allow uncertainty propagation between modules while preventing feedback that could lead to misspecification. Candidate distributions for the modules are discussed, along with strategies for choosing among them based on predictive performance.
This document provides lecture notes on hypothesis testing. It begins with an introduction to hypothesis testing and how it differs from estimation in its hypothetical reasoning approach. It then discusses Fisher's significance testing approach, including defining a test statistic, its sampling distribution under the null hypothesis, and calculating a p-value. It provides examples of applying this approach. Finally, it discusses some weaknesses of Fisher's approach identified by Neyman and Pearson and how their approach improved upon it by introducing the concept of alternative hypotheses and pre-data error probabilities.
This document discusses several perspectives on and solutions to Bayesian hypothesis testing. It outlines issues with Bayesian testing such as the dependence on prior distributions and difficulties interpreting Bayesian measures like posterior probabilities and Bayes factors. It discusses how Bayesian testing compares models rather than identifying a single true model. Several solutions to these challenges are discussed, like using Bayes factors, which eliminate the dependence on prior model probabilities but introduce other issues. The document also discusses testing under specific models, like comparing a point null hypothesis to alternatives. Overall it presents both Bayesian and frequentist views on hypothesis testing and some of the open controversies in the field.
1. The document proposes a method for making approximate Bayesian computation (ABC) inferences accurate by modeling the distribution of summary statistics calculated from simulated and observed data.
2. It involves constructing an auxiliary probability space (ρ-space) based on these summary values, and performing classification on ρ-space to determine whether simulated and observed data are from the same population.
3. Indirect inference is then used to link ρ-space back to the original parameter space, allowing the ABC approximation to match the true posterior distribution if the ABC tolerances and number of simulations are properly calibrated.
This document discusses approximate Bayesian computation (ABC) techniques for performing Bayesian inference when the likelihood function is not available in closed form. It covers the basic ABC algorithm and discusses challenges with high-dimensional data. It also summarizes recent advances in ABC that incorporate nonparametric regression, reproducing kernel Hilbert spaces, and neural networks to help address these challenges.
Big Data analysis involves building predictive models from high-dimensional data using techniques like variable selection, cross-validation, and regularization to avoid overfitting. The document discusses an example analyzing web browsing data to predict online spending, highlighting challenges with large numbers of variables. It also covers summarizing high-dimensional data through dimension reduction and model building for prediction versus causal inference.
1) Likelihood-free Bayesian experimental design is discussed as an intractable likelihood optimization problem, where the goal is to find the optimal design d that minimizes expected loss without using the full posterior distribution.
2) Several Bayesian tools are proposed to make the design problem more Bayesian, including Bayesian non-parametrics, annealing algorithms, and placing a posterior on the design d.
3) Gaussian processes are a default modeling choice for complex unknown functions in these problems, but their accuracy is difficult to assess and they may suffer from the curse of dimensionality.
This document discusses various methods for approximating marginal likelihoods and Bayes factors, including:
1. Geyer's 1994 logistic regression approach for approximating marginal likelihoods using importance sampling.
2. Bridge sampling and its connection to Geyer's approach. Optimal bridge sampling requires knowledge of unknown normalizing constants.
3. Using mixtures of importance distributions and the target distribution as proposals to estimate marginal likelihoods through Rao-Blackwellization. This connects to bridge sampling estimates.
4. The historical development of these approximation techniques and the connections between them (a minimal bridge-sampling sketch follows this list).
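To fix ideas on point 2, here is a minimal sketch of the Meng-Wong iterative bridge-sampling estimator of a ratio of normalizing constants, applied to two unnormalized Gaussian densities whose constants are known so the answer can be checked. The setup is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Unnormalized densities with known constants, to verify the estimator:
# q1 = exp(-x^2/2) has Z1 = sqrt(2*pi);  q2 = exp(-x^2/8) has Z2 = 2*sqrt(2*pi).
q1 = lambda x: np.exp(-0.5 * x ** 2)
q2 = lambda x: np.exp(-x ** 2 / 8.0)
x1 = rng.normal(0.0, 1.0, size=50_000)   # draws from q1's normalized version
x2 = rng.normal(0.0, 2.0, size=50_000)   # draws from q2's normalized version

# Meng & Wong iteration for r = Z1/Z2; it only needs density *ratios*,
# which is how the unknown constants drop out.
n1, n2 = x1.size, x2.size
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
l1 = q1(x1) / q2(x1)   # ratios at draws from the first distribution
l2 = q1(x2) / q2(x2)   # ratios at draws from the second distribution
r = 1.0
for _ in range(100):
    r = np.mean(l2 / (s1 * l2 + s2 * r)) / np.mean(1.0 / (s1 * l1 + s2 * r))

print("bridge estimate of Z1/Z2:", r)    # exact value: 0.5
```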
The document summarizes Approximate Bayesian Computation (ABC). It discusses how ABC provides a way to approximate Bayesian inference when the likelihood function is intractable or too computationally expensive to evaluate directly. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data according to a distance measure and tolerance level. Key points discussed include:
- ABC provides an approximation to the posterior distribution by sampling from simulations that fall within a tolerance of the observed data.
- Summary statistics are often used to reduce the dimension of the data and improve the signal-to-noise ratio when applying the tolerance criterion.
- Random forests can help select informative summary statistics and provide semi-automated ABC (a minimal sketch follows this list).
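A minimal sketch of the random-forest idea from the last bullet: train a classifier on summaries of simulations from each candidate model, then read off its prediction at the observed summaries. The models, prior, and summaries below are assumptions for illustration; the actual proposals also calibrate an error rate for the forest's vote rather than reading the output as a posterior probability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)

def summaries(x):
    # Candidate summary statistics; the forest decides which are informative.
    return [x.mean(), x.var(), np.median(x), (x == 0).mean()]

# Reference table: simulate from both models (0 = Poisson, 1 = geometric)
# with parameters drawn from a hypothetical exponential prior.
X, y = [], []
for _ in range(10_000):
    m = int(rng.integers(2))
    theta = rng.exponential(3.0)
    data = (rng.poisson(theta, 50) if m == 0
            else rng.geometric(1.0 / (1.0 + theta), 50) - 1)  # support {0,1,...}
    X.append(summaries(data))
    y.append(m)

clf = RandomForestClassifier(n_estimators=500).fit(X, y)
x_obs = rng.poisson(2.0, 50)    # stand-in for the observed data
print("forest votes [Poisson, geometric]:", clf.predict_proba([summaries(x_obs)]))
```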
This document discusses various importance sampling methods for approximating Bayes factors, which are used for Bayesian model selection. It compares regular importance sampling, bridge sampling, harmonic means, bridge sampling via mixtures, and Chib's solution. An example application to probit modeling of diabetes in Pima Indian women is presented to illustrate regular importance sampling. Markov chain Monte Carlo methods like the Metropolis-Hastings algorithm and Gibbs sampling can be used to sample from the probit models.
This document summarizes approximate Bayesian computation (ABC) methods. It begins with an overview of ABC, which provides a likelihood-free rejection technique for Bayesian inference when the likelihood function is intractable. The ABC algorithm works by simulating parameters and data until the simulated and observed data are close according to some distance measure and tolerance level. The document then discusses the asymptotic properties of ABC, including consistency of ABC posteriors and rates of convergence under certain assumptions. It also notes relationships between ABC and k-nearest neighbor methods. Examples applying ABC to autoregressive time series models are provided.
This document discusses approximate Bayesian computation (ABC) for model choice between multiple models. It introduces the ABC algorithm for model choice, which approximates the posterior probabilities of models given the data by simulating parameters from the prior and accepting simulations based on the distance between simulated and observed sufficient statistics. Issues with choosing sufficient statistics that apply to all models are discussed. The document also examines the limiting behavior of the ABC approximation to the Bayes factor as the tolerance approaches 0 and infinity. It notes that discrepancies can arise if sufficient statistics are not cross-model sufficient. An example comparing Poisson and geometric models demonstrates this.
Cointegration analysis: Modelling the complex interdependencies between financial assets (Edward Thomas Jones)
1) The document discusses cointegration analysis, which models the complex interdependencies between financial assets. It examines the non-stationary nature of financial time series data and explores vector autoregressive (VAR) models and cointegration techniques to analyze relationships between non-stationary variables.
2) VAR models provide a framework for modeling dynamic relationships between stationary time series variables. The document outlines univariate and multivariate VAR models and discusses estimation and lag order selection for VAR models.
3) Cointegration techniques allow modeling of relationships between non-stationary time series variables. The document reviews tests for identifying stationary and non-stationary time series, including the Augmented Dickey-Fuller and Phillips-Perron tests (a minimal sketch follows this list).
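A minimal sketch of the unit-root and cointegration tests mentioned in point 3, run on simulated series. statsmodels ships the Augmented Dickey-Fuller and Engle-Granger tests; the Phillips-Perron test lives in other packages such as arch.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, coint

rng = np.random.default_rng(5)

# A random walk (non-stationary) and a second series cointegrated with it.
x = np.cumsum(rng.standard_normal(500))
y = 0.8 * x + rng.standard_normal(500)

adf_stat, adf_p, *_ = adfuller(x)
print(f"ADF on x: stat={adf_stat:.2f}, p={adf_p:.3f} (unit root not rejected)")

eg_stat, eg_p, _ = coint(x, y)
print(f"Engle-Granger on (x, y): stat={eg_stat:.2f}, p={eg_p:.3f}")
```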
We describe different approaches for specifying models and prior distributions for estimating heterogeneous treatment effects using Bayesian nonparametric models. We make an affirmative case for direct, informative (or partially informative) prior distributions on heterogeneous treatment effects, especially when treatment effect size and treatment effect variation are small relative to other sources of variability. We also consider how to provide scientifically meaningful summaries of complicated, high-dimensional posterior distributions over heterogeneous treatment effects with appropriate measures of uncertainty.
This document provides an introduction to statistical model selection. It discusses various approaches to model selection including predictive risk, Bayesian methods, information theoretic measures like AIC and MDL, and adaptive methods. The key goals of model selection are to understand the bias-variance tradeoff and select models that offer the best guaranteed predictive performance on new data. Model selection aims to find the right level of complexity to explain patterns in available data while avoiding overfitting.
This document provides an overview of statistical concepts for analyzing experimental data, including z-tests, t-tests, and ANOVAs. It discusses developing experimental hypotheses and distinguishing between null and alternative hypotheses. Key concepts explained include p-values, type I and type II errors, and determining statistical significance. Examples are given of applying a t-test and ANOVA to compare brain volume changes before and after childbirth. Limitations of statistical analyses when data cover entire populations are also noted.
This document discusses linear correlation and linear regression. It defines linear correlation as showing the linear relationship between two continuous variables, while linear regression is a technique, used when the outcome is continuous, that estimates slopes. Linear regression assumes a linear relationship between the independent and dependent variables, normally distributed values of the dependent variable, equal variances, and independence of observations. Least squares estimation is used to calculate the intercept and slope that minimize the squared differences between observed and predicted values of the dependent variable. The slope's significance can be tested using a t-test.
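The least-squares and t-test calculations just described fit in a few lines; here is a sketch on simulated data (all numbers are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical continuous predictor and outcome.
x = rng.uniform(0.0, 10.0, size=40)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.5, size=40)

# Least squares: intercept and slope minimizing the sum of squared residuals.
sxx = np.sum((x - x.mean()) ** 2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x.mean()

# t-test of H0: slope = 0, using the residual standard error on n - 2 df.
resid = y - (intercept + slope * x)
se_slope = np.sqrt(resid @ resid / (x.size - 2) / sxx)
t = slope / se_slope
p = 2.0 * stats.t.sf(abs(t), df=x.size - 2)
print(f"slope = {slope:.3f}, t = {t:.2f}, p = {p:.4f}")
```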
This document discusses linear correlation and linear regression. It defines linear correlation as showing the linear relationship between two continuous variables, while linear regression analyzes the relationship between a continuous outcome (dependent) variable and one or more independent (predictor) variables. Linear regression finds the line of best fit to model this relationship and estimates coefficients that can be tested for statistical significance. The assumptions of linear regression include a linear relationship between variables, normally distributed errors, homogeneity of variance, and independent observations.
This document discusses linear correlation and linear regression. It defines linear correlation as showing the linear relationship between two continuous variables, while linear regression analyzes the relationship between a continuous outcome (dependent) variable and one or more independent (predictor) variables. Linear regression finds the line of best fit to model this relationship and estimates coefficients that can be used to predict the outcome variable based on the independent variables. Key assumptions of linear regression include a linear relationship between variables, normally distributed errors, homogeneity of variance, and independence of observations. The significance of regression coefficients can be tested using t-tests and the standard error of the coefficients is also discussed.
Slideset Simple Linear Regression models.ppt (rahulrkmgb09)
This document discusses linear correlation and linear regression. It defines linear correlation as showing the linear relationship between two continuous variables, while linear regression is a technique, used when the outcome is continuous, that estimates slopes. Linear regression assumes a linear relationship between the independent and dependent variables, normally distributed values of the dependent variable, equal variances, and independence of observations. It estimates a slope and intercept through least squares estimation, minimizing the squared distances between observed and predicted values of the dependent variable. The significance of the estimated slope can be tested using a t-test.
This document discusses linear correlation and linear regression. It defines linear correlation as showing the linear relationship between two continuous variables, while linear regression analyzes the relationship between a continuous outcome (dependent) variable and one or more independent (predictor) variables. Linear regression finds the line of best fit to model this relationship and estimates coefficients that can be tested for statistical significance. The assumptions of linear regression include a linear relationship between variables, normally distributed errors, homogeneity of variance, and independent observations.
This document discusses linear correlation and linear regression. It defines linear correlation as showing the linear relationship between two continuous variables, while linear regression is a technique, used when the outcome is continuous, that estimates slopes. Linear regression assumes a linear relationship between the independent and dependent variables, normally distributed errors, equal variances, and independence of observations. The slope is estimated using least squares to minimize the squared differences between observed and predicted values of the dependent variable. Significance of the slope is tested using a t-test.
This document discusses linear correlation and linear regression. It defines linear correlation as showing the linear relationship between two continuous variables, while linear regression is a technique, used when the outcome is continuous, that estimates slopes. Linear regression assumes a linear relationship between the predictor and outcome variables, normality of the outcome at each value of the predictor, equal variances of the outcome, and independence of observations. It also discusses calculating the slope and intercept via least squares estimation, finding the line that best fits the data by minimizing the residuals.
This document provides instructions for Homework 1 for the course 6.867. It is due on September 28 and will be 10% off for each day late. The homework involves exploring bias-variance tradeoffs in estimating the mean of different distributions from sample data. It provides questions to answer about maximum likelihood estimators for the mean of uniform distributions on intervals of different lengths. It also covers Bayesian estimation of probabilities for a "thick coin" that can land on heads, tails, or edge. Finally, it includes questions on Gaussian distributions, analyzing a presidential debate poll, and decision theory concepts.
These are some slides I use in my Multivariate Statistics course to teach psychology graduate student the basics of structural equation modeling using the lavaan package in R. Topics are at an introductory level, for someone without prior experience with the topic.
This document provides a summary of Chapter 14 from Aris Spanos' book on frequentist hypothesis testing. It begins with an overview of some of the inherent difficulties in teaching statistical testing, including that it introduces many new concepts and can be confusing. The document then provides a brief historical overview of the development of hypothesis testing, summarizing the contributions of Francis Edgeworth, Karl Pearson, and William Gosset. Edgeworth introduced the concepts of a hypothesis of interest, a standardized distance test statistic, and a threshold for significance. Pearson broadened the scope of hypotheses to include distributional assumptions and introduced the chi-square test and p-values. Gosset's work provided the foundation for modern statistical inference.
Bayesian Variable Selection in Linear Regression and A Comparison (Atilla YARDIMCI)
In this study, Bayesian approaches, such as Zellner, Occam’s Window and Gibbs sampling, have been compared in terms of selecting the correct subset for the variable selection in a linear regression model. The aim of this comparison is to analyze Bayesian variable selection and the behavior of classical criteria by taking into consideration the different values of β and σ and prior expected levels.
1) The document discusses Bayesian structural equation modeling (SEM), beginning with an introduction to SEM and outlining the key differences between the frequentist and Bayesian paradigms for SEM.
2) It provides examples of measurement models (CFA) and structural models, demonstrating how Bayesian SEM estimates parameters while accounting for prior distributions.
3) The summary highlights how Bayesian SEM allows incorporating prior information to strengthen inferences from SEM, compared to traditional maximum likelihood estimation approaches.
Asymptotic properties of bayes factor in one way repeated measurements model (Alexander Decker)
1) The document discusses asymptotic properties of Bayes factors for testing linear models in one-way repeated measurements designs.
2) It considers a linear mixed model with one within-subject factor and one between-subject factor, including random unit effects and error.
3) The authors investigate the consistency of the Bayes factor for testing a fixed effects model against this mixed model alternative. Under certain conditions on priors and design matrices, they derive the analytic form of the Bayes factor and show it is consistent.
Asymptotic properties of bayes factor in one way repeated measurements model (Alexander Decker)
1) The document discusses asymptotic properties of Bayes factors for testing linear models in one-way repeated measurements designs.
2) It considers a linear mixed model with one within-subject factor and one between-subject factor that incorporates random effects and error terms.
3) Under certain conditions on the prior distributions and design matrix, the document identifies the analytic form of the Bayes factor and shows that it is consistent as sample size increases.
Similar to Testing as estimation: the demise of the Bayes factor (20)
This document discusses differentially private distributed Bayesian linear regression with Markov chain Monte Carlo (MCMC) methods. It proposes adding noise to the summaries (S) and coefficients (z) of local linear regression models on different devices to provide differential privacy. Gibbs sampling is used to simulate the genuine posterior distribution over the linear model parameters (theta, sigma_y, Sigma_x, z1:J, S1:J) in a distributed manner while maintaining privacy. Alternative approaches like exploiting approximate posteriors from all devices or learning iteratively are also mentioned.
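A minimal sketch of the privatize-the-summaries idea described above: add Gaussian noise to the local statistics S = X'X and z = X'y before release. The noise scale below is an arbitrary placeholder (a real deployment calibrates it to the statistics' sensitivity and the privacy budget), and the naive solve at the end merely stands in for the paper's Gibbs sampler, which targets the posterior that properly accounts for the added noise.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical local device data, assumed preprocessed to bounded norm.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(0.0, 0.5, size=200)

# Sufficient statistics of the local linear model.
S = X.T @ X
z = X.T @ y

# Gaussian mechanism: release noisy copies of S and z.
sigma_dp = 1.0                                   # placeholder noise scale
S_noisy = S + rng.normal(0.0, sigma_dp, S.shape)
S_noisy = (S_noisy + S_noisy.T) / 2.0            # keep the released matrix symmetric
z_noisy = z + rng.normal(0.0, sigma_dp, z.shape)

# Naive point estimate from privatized summaries alone (for comparison only).
theta_hat = np.linalg.solve(S_noisy, z_noisy)
print("estimate from noisy summaries:", theta_hat)
```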
This document discusses mixture models and approximations to computing model evidence. It contains:
1) An overview of mixtures of distributions and common priors used for mixtures.
2) Approximations to computing marginal likelihoods or model evidence using Chib's representation and Rao-Blackwellization. Permutations are used to address label switching issues.
3) Methods for more efficient sampling for computing model evidence, including iterative bridge sampling and dual importance sampling with approximations to reduce the number of permutations considered.
Sequential Monte Carlo is also briefly mentioned as an alternative approach.
This document describes the adaptive restore algorithm, a non-reversible Markov chain Monte Carlo method. It begins with an overview of the restore process, which takes regenerations from an underlying diffusion or jump process to construct a reversible Markov chain with a target distribution. The adaptive restore process enriches this by allowing the regeneration distribution to adapt over time. It converges almost surely to the minimal regeneration distribution. Parameters like the initial regeneration distribution and rates are discussed. Examples are provided for the adaptive Brownian restore algorithm and calibrating the parameters.
This document summarizes techniques for approximating marginal likelihoods and Bayes factors, which are important quantities in Bayesian inference. It discusses Geyer's 1994 logistic regression approach, links to bridge sampling, and how mixtures can be used as importance sampling proposals. Specifically, it shows how optimizing the logistic pseudo-likelihood relates to the bridge sampling optimal estimator. It also discusses non-parametric maximum likelihood estimation based on simulations.
This document discusses Bayesian restricted likelihood methods for situations where the likelihood cannot be fully trusted. It presents several approaches including empirical likelihood, Bayesian empirical likelihood, using insufficient statistics, approximate Bayesian computation (ABC), and MCMC on manifolds. The key ideas are developing Bayesian tools that are robust to model misspecification by questioning the likelihood, prior, and other assumptions.
1. The document discusses approximate Bayesian computation (ABC), a technique used when the likelihood function is intractable. ABC works by simulating parameters from the prior and simulating data, rejecting simulations that are not close to the observed data based on a tolerance level.
2. Random forests can be used in ABC to select informative summary statistics from a large set of possibilities and estimate parameters. The random forests classify simulations as accepted or rejected based on the summaries, implicitly selecting important summaries.
3. Calibrating the tolerance level in ABC is important but difficult, as it determines how close simulations must be to the observed data. Methods discussed include using quantiles of prior predictive simulations or asymptotic convergence properties.
This document describes a new method called component-wise approximate Bayesian computation (ABCG or ABC-Gibbs) that combines approximate Bayesian computation (ABC) with Gibbs sampling. ABCG aims to more efficiently explore parameter spaces when the number of parameters is large. It works by alternately sampling each parameter from its ABC-approximated conditional distribution given current values of other parameters. The document provides theoretical analysis showing ABCG converges to a stationary distribution under certain conditions. It also presents examples demonstrating ABCG can better separate estimates from the prior compared to simple ABC, especially for hierarchical models.
ABC stands for approximate Bayesian computation. It is a method for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC produces samples from an approximate posterior distribution by simulating parameter and summary statistic values that match the observed summary statistics within a tolerance level. The choice of summary statistics is important but difficult, as there is typically no sufficient statistic. Several strategies have been developed for selecting good summary statistics, including using random forests or the Lasso to evaluate and select from a large set of potential summaries.
The document describes a new method called component-wise approximate Bayesian computation (ABC) that combines ABC with Gibbs sampling. It aims to improve ABC's ability to efficiently explore parameter spaces when the number of parameters is large. The method works by alternating sampling from each parameter's ABC posterior conditional distribution given current values of other parameters and the observed data. The method is proven to converge to a stationary distribution under certain assumptions, especially for hierarchical models where conditional distributions are often simplified. Numerical experiments on toy examples demonstrate the method can provide a better approximation of the true posterior than vanilla ABC.
The document describes Approximate Bayesian Computation (ABC), a technique for performing Bayesian inference when the likelihood function is intractable or impossible to evaluate directly. ABC works by simulating data under different parameter values, and accepting simulations that are close to the observed data according to a distance measure and tolerance level. ABC provides an approximation to the posterior distribution that improves as the tolerance level decreases and more informative summary statistics are used. The document discusses the ABC algorithm, properties of the exact ABC posterior distribution, and challenges in selecting appropriate summary statistics.
The document discusses Approximate Bayesian Computation (ABC), a simulation-based method for conducting Bayesian inference when the likelihood function is intractable or unavailable. ABC works by simulating data from the model, accepting simulations that are close to the observed data based on a distance measure and tolerance level. This provides samples from an approximation of the posterior distribution. The document provides examples that motivate ABC and outlines the basic ABC algorithm. It also discusses extensions and improvements to the standard ABC method.
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models (Christian Robert)
This document discusses Bayesian estimation of conditional moment models. It presents several approaches for completing conditional moment models for Bayesian processing, including using non-parametric parts, empirical likelihood Bayesian tools, or maximum entropy alternatives. It also discusses simplistic ABC alternatives and innovative aspects of introducing tolerance parameters for misspecification and cancelling conditional aspects. Unconditional and conditional model comparison using empirical likelihoods and Bayes factors is proposed.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Westerlund 1 and 2 Open Clusters Survey (Sérgio Sacani)
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars. The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically, the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec. Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution, with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDSSérgio Sacani
The pathway(s) to seeding the massive black holes (MBHs) that exist at the heart of galaxies in the present and distant Universe remains an unsolved problem. Here we categorise, describe and quantitatively discuss the formation pathways of both light and heavy seeds. We emphasise that the most recent computational models suggest that rather than a bimodal-like mass spectrum between light and heavy seeds with light at one end and heavy at the other that instead a continuum exists. Light seeds being more ubiquitous and the heavier seeds becoming less and less abundant due the rarer environmental conditions required for their formation. We therefore examine the different mechanisms that give rise to different seed mass spectrums. We show how and why the mechanisms that produce the heaviest seeds are also among the rarest events in the Universe and are hence extremely unlikely to be the seeds for the vast majority of the MBH population. We quantify, within the limits of the current large uncertainties in the seeding processes, the expected number densities of the seed mass spectrum. We argue that light seeds must be at least 103 to 105 times more numerous than heavy seeds to explain the MBH population as a whole. Based on our current understanding of the seed population this makes heavy seeds (Mseed > 103 M⊙) a significantly more likely pathway given that heavy seeds have an abundance pattern than is close to and likely in excess of 10−4 compared to light seeds. Finally, we examine the current state-of-the-art in numerical calculations and recent observations and plot a path forward for near-future advances in both domains.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Authoring a personal GPT for your research and practice: How we created the Q...Leonel Morgado
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...Advanced-Concepts-Team
Presentation in the Science Coffee of the Advanced Concepts Team of the European Space Agency on the 07.06.2024.
Speaker: Diego Blas (IFAE/ICREA)
Title: Gravitational wave detection with orbital motion of Moon and artificial
Abstract:
In this talk I will describe some recent ideas to find gravitational waves from supermassive black holes or of primordial origin by studying their secular effect on the orbital motion of the Moon or satellites that are laser ranged.
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in a host galaxy JADES-GS
+
53.13485
−
27.82088
with a host spectroscopic redshift of
2.903
±
0.007
. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (
�
(
�
−
�
)
∼
0.9
) despite a host galaxy with low-extinction and has a high Ca II velocity (
19
,
000
±
2
,
000
km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-
�
Ca-rich population. Although such an object is too red for any low-
�
cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (
≲
1
�
) with
Λ
CDM. Therefore unlike low-
�
Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-
�
truly diverge from their low-
�
counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...Sérgio Sacani
Wereport the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index by 1.0 ±0.1 between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary BH. We then ask why we have not seen this phenomenon before. We show that OJ287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ287 as well as the dense monitoring sample of Krakow.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Testing as estimation: the demise of the Bayes factor
1. Testing as estimation: the demise of the Bayes factors
Testing as estimation: the demise of the Bayes
factors
Christian P. Robert
Université Paris-Dauphine and University of Warwick
arXiv:1412.2044
with K. Kamary, K. Mengersen, and J. Rousseau
2. Testing as estimation: the demise of the Bayes factors
Outline
Introduction
Testing problems as estimating mixture
models
Illustrations
Asymptotic consistency
Conclusion
3. Testing as estimation: the demise of the Bayes factors
Introduction
Testing hypotheses
Hypothesis testing
central problem of statistical inference
dramatically differentiating feature between classical and
Bayesian paradigms
wide open to controversy and divergent opinions, including
within the Bayesian community
non-informative Bayesian testing case mostly unresolved,
witness the Jeffreys–Lindley paradox
[Berger (2003), Mayo & Cox (2006), Gelman (2008)]
4. Testing as estimation: the demise of the Bayes factors
Introduction
Besting hypotheses
Bayesian model selection as comparison of k potential
statistical models towards the selection of model that fits the
data “best”
mostly accepted perspective: it does not primarily seek to
identify which model is “true”, but compares fits
tools like Bayes factor naturally include a penalisation
addressing model complexity, mimicked by Bayes Information
(BIC) and Deviance Information (DIC) criteria
posterior predictive tools successfully advocated in Gelman et
al. (2013) even though they involve double use of data
6. Testing as estimation: the demise of the Bayes factors
Introduction
Bayesian modelling
Standard Bayesian approach to testing: consider two families of
models, one for each of the hypotheses under comparison,
M1 : x ∼ f1(x|θ1) , θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2) , θ2 ∈ Θ2 ,
and associate with each model a prior distribution,
θ1 ∼ π1(θ1) and θ2 ∼ π2(θ2) ,
[Jeffreys, 1939]
7. Testing as estimation: the demise of the Bayes factors
Introduction
Bayesian modelling
Standard Bayesian approach to testing: consider two families of
models, one for each of the hypotheses under comparison,
M1 : x ∼ f1(x|θ1) , θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2) , θ2 ∈ Θ2 ,
in order to compare the marginal likelihoods
m1(x) = ∫Θ1 f1(x|θ1) π1(θ1) dθ1 and m2(x) = ∫Θ2 f2(x|θ2) π2(θ2) dθ2
[Jeffreys, 1939]
8. Testing as estimation: the demise of the Bayes factors
Introduction
Bayesian modelling
Standard Bayesian approach to testing: consider two families of
models, one for each of the hypotheses under comparison,
M1 : x ∼ f1(x|θ1) , θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2) , θ2 ∈ Θ2 ,
either through Bayes factor or posterior probability,
B12 = m1(x)/m2(x) ,  P(M1|x) = ω1 m1(x) / {ω1 m1(x) + ω2 m2(x)} ;
the latter depends on the prior weights ωi
[Jeffreys, 1939]
9. Testing as estimation: the demise of the Bayes factors
Introduction
Bayesian modelling
Standard Bayesian approach to testing: consider two families of
models, one for each of the hypotheses under comparison,
M1 : x ∼ f1(x|θ1) , θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2) , θ2 ∈ Θ2 ,
Bayesian decision step
comparing Bayes factor B12 with threshold value of one or
comparing posterior probability P(M1|x) with bound α
[Jeffreys, 1939]
10. Testing as estimation: the demise of the Bayes factors
Introduction
Some difficulties
tension between (i) posterior probabilities justified by binary
loss but depending on unnatural prior weights, and (ii) Bayes
factors that eliminate dependence but lose direct connection
with posterior, unless prior weights are integrated within loss
delicate interpretation (or calibration) of the strength of the
Bayes factor towards supporting a given hypothesis or model:
not a Bayesian decision rule!
difficulty with posterior probabilities: tendency to interpret
them as p-values while they only report respective strengths
of fitting to both models
11. Testing as estimation: the demise of the Bayes factors
Introduction
Some further difficulties
long-lasting impact of the prior modeling, i.e., choice of prior
distributions on both parameter spaces under comparison,
despite overall consistency for Bayes factor
major discontinuity in use of improper priors, not justified in
most testing situations, leading to ad hoc solutions (zoo),
where data is either used twice or split artificially
binary (accept vs. reject) outcome more suited for immediate
decision (if any) than for model evaluation, connected with
rudimentary loss function
12. Testing as estimation: the demise of the Bayes factors
Introduction
Lindley’s paradox
In a normal mean testing problem,
x̄n ∼ N(θ, σ²/n) ,  H0 : θ = θ0 ,
under Jeffreys's prior θ ∼ N(θ0, σ²), the Bayes factor
B01(tn) = (1 + n)^{1/2} exp{−n tn² / 2(1 + n)} ,
where tn = √n |x̄n − θ0|/σ, satisfies
B01(tn) −→ ∞ as n −→ ∞
[assuming a fixed tn]
[Lindley, 1957]
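As a quick numerical illustration (a sketch using the reconstructed formula above, with the test statistic tn held fixed at the "significant" value 1.96), the Bayes factor B01 nevertheless grows without bound as n increases:

```python
# Lindley's paradox numerically: with t_n fixed at 1.96, the Bayes factor
# B01 = (1+n)^{1/2} exp(-n t_n^2 / (2 (1+n))) favours H0 ever more strongly.
import numpy as np

t = 1.96
for n in (10, 100, 10_000, 1_000_000):
    print(n, np.sqrt(1.0 + n) * np.exp(-n * t * t / (2.0 * (1.0 + n))))
```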
13. Testing as estimation: the demise of the Bayes factors
Introduction
A strong impropriety
Improper priors not allowed in Bayes factors:
If ∫Θ1 π1(dθ1) = ∞ or ∫Θ2 π2(dθ2) = ∞
then π1 or π2 cannot be coherently normalised while the
normalisation matters in the Bayes factor B12
Lack of mathematical justification for a “common nuisance
parameter” [and for its prior]
[Berger, Pericchi, and Varshavsky, 1998; Marin and Robert, 2013]
15. Testing as estimation: the demise of the Bayes factors
Testing problems as estimating mixture models
Paradigm shift
New proposal as paradigm shift in Bayesian processing of
hypothesis testing and of model selection
convergent and naturally interpretable solution
extended use of improper priors
abandonment of the Neyman-Pearson decision framework
natural strength of evidence
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
17. Testing as estimation: the demise of the Bayes factors
Testing problems as estimating mixture models
Paradigm shift
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
Approach inspired from consistency result of Rousseau and
Mengersen (2011) on estimated overfitting mixtures
Mixture representation not equivalent to use of a posterior
probability
More natural approach to testing, while sparse in parameters
Calibration of the posterior distribution of mixture weight,
while moving away from artificial notion of the posterior
probability of a model
18. Testing as estimation: the demise of the Bayes factors
Testing problems as estimating mixture models
Encompassing mixture model
Idea: Given two statistical models,
M1 : x ∼ f1(x|θ1) , θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2) , θ2 ∈ Θ2 ,
embed both within an encompassing mixture
Mα : x ∼ α f1(x|θ1) + (1 − α) f2(x|θ2) ,  0 ⩽ α ⩽ 1   (1)
Note: Both models correspond to special cases of (1), one for
α = 1 and one for α = 0
Draw inference on mixture representation (1), as if each
observation was individually and independently produced by the
mixture model
21. Testing as estimation: the demise of the Bayes factors
Testing problems as estimating mixture models
Inferential motivations
Sounds like an approximation to the real problem, but there are
definitive advantages to the shift:
Bayes estimate of the weight α replaces posterior probability
of model M1, equally convergent indicator of which model is
“true”, while avoiding artificial prior probabilities on model
indices, ω1 and ω2, and 0 − 1 loss setting
posterior on α provides measure of proximity to models, while
being interpretable as data propensity to stand within one
model
further allows for alternative perspectives on testing and
model choice, like predictive tools, cross-validation, and
information indices like WAIC
22. Testing as estimation: the demise of the Bayes factors
Testing problems as estimating mixture models
Computational motivations
avoids problematic computations of marginal likelihoods, since
standard algorithms are available for Bayesian mixture
estimation
straightforward extension to a finite collection of models, which
considers all models at once and eliminates least likely models
by simulation
eliminates famous difficulty of label switching that plagues
both Bayes estimation and computation: components are no
longer exchangeable
posterior distribution on α evaluates more thoroughly the strength
of support for a given model than the single-figure posterior
probability
variability of posterior distribution on α allows for a more
thorough assessment of the strength of this support
23. Testing as estimation: the demise of the Bayes factors
Testing problems as estimating mixture models
Noninformative motivations
novel Bayesian feature: a mixture model acknowledges
possibility that, for a finite dataset, both models or none
could be acceptable
standard (proper and informative) prior modeling can be
processed in this setting, but non-informative (improper)
priors are also manageable, provided both models are first
reparameterised in terms of shared parameters, e.g. location and
scale parameters
in the special case when all parameters are common
Mα : x ∼ α f1(x|θ) + (1 − α) f2(x|θ) ,  0 ⩽ α ⩽ 1
if θ is a location parameter, a flat prior π(θ) ∝ 1 is available
24. Testing as estimation: the demise of the Bayes factors
Testing problems as estimating mixture models
Weakly informative motivations
using the same parameters or some identical parameters on
both components highlights that opposition between the two
components is not an issue of enjoying different parameters
common parameters are nuisance parameters, easily integrated
prior model weights ωi rarely discussed in classical Bayesian
approach, with linear impact on posterior probabilities
prior modeling only involves selecting a prior on α, e.g.,
α ∼ B(a0, a0)
while a0 impacts posterior on α, it always leads to mass
accumulation near 1 or 0, i.e. favours most likely model
sensitivity analysis straightforward to carry out
approach easily calibrated by parametric bootstrap providing
reference posterior of α under each model
natural Metropolis–Hastings alternative
25. Testing as estimation: the demise of the Bayes factors
Illustrations
Poisson/Geometric example
choice between Poisson P(λ) and Geometric Geo(p)
distribution
mixture with common parameter λ
Mα : α P(λ) + (1 − α) Geo(1/(1+λ))
Allows for Jeffreys prior since resulting posterior is proper
independent Metropolis–within–Gibbs with proposal
distribution on λ equal to Poisson posterior (with acceptance
rate larger than 75%)
28. Testing as estimation: the demise of the Bayes factors
Illustrations
Beta prior
Under a Be(a0, a0) prior on α, the full conditional posterior given the
allocations ζ is
α | ζ ∼ Be(n1(ζ) + a0, n2(ζ) + a0)
where nj(ζ) denotes the number of observations allocated to component j
Exact Bayes factor opposing Poisson and Geometric
B12 = n^{n x̄n} ∏_{i=1}^n xi! Γ(n + 2 + ∑_{i=1}^n xi) / Γ(n + 2)
although undefined from a purely mathematical viewpoint
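To make the sampling scheme of the last few slides concrete, here is a minimal Python sketch (not the authors' code) of the Metropolis-within-Gibbs sampler for the Poisson/geometric mixture. It assumes a Jeffreys-type prior π(λ) ∝ λ^{−1/2} and uses the Poisson-only posterior Gamma(∑ xi + 1/2, n) as independence proposal, as described on the Poisson/Geometric slide; the function names are illustrative.

```python
# Minimal sketch of Metropolis-within-Gibbs for
#   x ~ alpha * Poisson(lam) + (1 - alpha) * Geometric(1/(1 + lam)),
# with alpha ~ Be(a0, a0) and an assumed Jeffreys-type prior on lam.
import numpy as np
from scipy import stats

def log_lam_target(lam, x, zeta):
    """Log full conditional of lam given allocations zeta (1=Poisson, 2=geometric)."""
    if lam <= 0:
        return -np.inf
    lp = -0.5 * np.log(lam)                               # Jeffreys-type prior (assumption)
    lp += stats.poisson.logpmf(x[zeta == 1], lam).sum()   # Poisson component
    p = 1.0 / (1.0 + lam)                                 # geometric pmf: p (1-p)^x
    lp += (np.log(p) + x[zeta == 2] * np.log1p(-p)).sum()
    return lp

def gibbs_mixture(x, a0=0.5, n_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n, alpha, lam = x.size, 0.5, max(x.mean(), 0.1)
    # independence proposal = Poisson-only posterior, Gamma(sum x + 1/2, n)
    prop = stats.gamma(a=x.sum() + 0.5, scale=1.0 / n)
    alphas = np.empty(n_iter)
    for t in range(n_iter):
        # 1. allocations zeta_i, sampled from their (log-scale) posterior odds
        lw1 = np.log(alpha) + stats.poisson.logpmf(x, lam)
        p = 1.0 / (1.0 + lam)
        lw2 = np.log1p(-alpha) + np.log(p) + x * np.log1p(-p)
        prob1 = np.exp(lw1 - np.logaddexp(lw1, lw2))
        zeta = np.where(rng.random(n) < prob1, 1, 2)
        # 2. weight: Be(n1 + a0, n2 + a0) full conditional
        n1 = int((zeta == 1).sum())
        alpha = rng.beta(n1 + a0, n - n1 + a0)
        # 3. independent Metropolis-Hastings step on lam
        lam_new = prop.rvs(random_state=rng)
        log_acc = (log_lam_target(lam_new, x, zeta) - log_lam_target(lam, x, zeta)
                   + prop.logpdf(lam) - prop.logpdf(lam_new))
        if np.log(rng.random()) < log_acc:
            lam = lam_new
        alphas[t] = alpha
    return alphas

x = np.random.default_rng(1).poisson(4, size=1000)   # data from P(4)
print(np.median(gibbs_mixture(x, a0=0.1)[2000:]))    # posterior median of alpha, near 1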
29. Testing as estimation: the demise of the Bayes factors
Illustrations
Weight estimation
[Figure] Posterior medians of α for 100 Poisson P(4) datasets of size
n = 1000, for a0 = .0001, .001, .01, .1, .2, .3, .4, .5. Each posterior
approximation is based on 10⁴ Metropolis–Hastings iterations.
30. Testing as estimation: the demise of the Bayes factors
Illustrations
Consistency
[Figure: three panels, a0 = .1, .3, .5; x-axis log(sample size)]
Posterior means (sky-blue) and medians (grey-dotted) of α, over 100
Poisson P(4) datasets for sample sizes from 1 to 1000.
31. Testing as estimation: the demise of the Bayes factors
Illustrations
Behaviour of Bayes factor
[Figure: three panels; x-axis log(sample size)] Comparison between
P(M1|x) (red dotted area) and posterior medians of α (grey zone) for 100
Poisson P(4) datasets with sample sizes n between 1 and 1000, for
a0 = .001, .1, .5.
32. Testing as estimation: the demise of the Bayes factors
Illustrations
Normal-normal comparison
comparison of a normal N(θ1, 1) with a normal N(θ2, 2)
distribution
mixture with identical location parameter θ
αN(θ, 1) + (1 − α)N(θ, 2)
Jeffreys prior π(θ) = 1 can be used, since posterior is proper
Reference (improper) Bayes factor
B12 = 2^{(n−1)/2} exp{−(1/4) ∑_{i=1}^n (xi − x̄)²}
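As a quick sanity check of this closed-form (improper) Bayes factor, one can evaluate it on simulated data; this sketch assumes equal prior weights ω1 = ω2 when converting B12 into P(M1|x), and relies on the reconstruction of the formula above.

```python
# Sketch: closed-form improper Bayes factor for N(theta,1) vs N(theta,2)
# under the flat prior pi(theta) = 1 (constant as reconstructed above).
import numpy as np

x = np.random.default_rng(0).normal(0.0, 1.0, size=100)   # data from M1
s2 = ((x - x.mean()) ** 2).sum()
log_B12 = 0.5 * (x.size - 1) * np.log(2.0) - s2 / 4.0
post_M1 = 1.0 / (1.0 + np.exp(-log_B12))                  # assumes omega1 = omega2
print(log_B12, post_M1)                                   # typically favours M1
```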
35. Testing as estimation: the demise of the Bayes factors
Illustrations
Comparison with posterior probability
[Figure: four panels, a0 = .1, .3, .4, .5; x-axis sample size]
Plots of ranges of log(1 − E[α|x]) (grey) and log(1 − p(M1|x)) (red
dotted) over 100 N(0, 1) samples as sample size n grows from 1 to 500,
where α is the weight of N(0, 1) in the mixture model. The shaded areas
indicate the range of the estimates; each plot is based on a Beta prior
with a0 = .1, .2, .3, .4, .5, 1 and each posterior approximation on 10⁴
iterations.
36. Testing as estimation: the demise of the Bayes factors
Illustrations
Comments
convergence to one boundary value as sample size n grows
impact of hyperparameter a0 slowly vanishes as n increases, but
remains present for moderate sample sizes
when simulated sample is neither from N(θ1, 1) nor from
N(θ2, 2), behaviour of posterior varies, depending on which
distribution is closest
37. Testing as estimation: the demise of the Bayes factors
Illustrations
Logit or Probit?
binary dataset: R benchmark dataset about diabetes in 200 Pima Indian
women, with body mass index as explanatory variable
comparison of logit and probit fits could be suitable; we thus
compare both fits via our method
M1 : yi | xi, θ1 ∼ B(1, pi) where pi = exp(xiᵀθ1) / {1 + exp(xiᵀθ1)}
M2 : yi | xi, θ2 ∼ B(1, qi) where qi = Φ(xiᵀθ2)
38. Testing as estimation: the demise of the Bayes factors
Illustrations
Common parameterisation
Local reparameterisation strategy that rescales parameters of the
probit model M2 so that the MLEs of both models coincide.
[Choudhury et al., 2007]
Φ(xiᵀθ2) ≈ exp(k xiᵀθ2) / {1 + exp(k xiᵀθ2)}
and use best estimate of k to bring both parameters into coherency:
(k0, k1) = (θ01/θ02, θ11/θ12) ,
reparameterise M1 and M2 as
M1 : yi | xi, θ ∼ B(1, pi) where pi = exp(xiᵀθ) / {1 + exp(xiᵀθ)}
M2 : yi | xi, θ ∼ B(1, qi) where qi = Φ(xiᵀ(κ⁻¹θ)) ,
with κ⁻¹θ = (θ0/k0, θ1/k1).
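A sketch of this rescaling step, with simulated data standing in for the Pima benchmark (the coefficients, sample size, and variable names here are assumptions for illustration only):

```python
# Sketch of the local reparameterisation: fit both links by maximum
# likelihood, then set k_j = theta_logit_j / theta_probit_j componentwise.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))   # intercept + one covariate (e.g. BMI)
eta = X @ np.array([-0.5, 1.0])                  # assumed "true" logistic coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

theta_logit = sm.Logit(y, X).fit(disp=0).params
theta_probit = sm.Probit(y, X).fit(disp=0).params
k = theta_logit / theta_probit                   # (k0, k1), each near 1.6-1.8
print(k)
```

The componentwise ratio recovers the classical logit/probit scale factor of roughly 1.7, which is what makes the two rescaled parameterisations coherent.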
39. Testing as estimation: the demise of the Bayes factors
Illustrations
Prior modelling
Under default g-prior
θ ∼ N2(0, n(XᵀX)⁻¹)
the full conditional posterior distribution given the allocations ζ is
π(θ | y, X, ζ) ∝ exp{ ∑i I(ζi = 1) yi xiᵀθ } / ∏_{i; ζi=1} [1 + exp(xiᵀθ)]
× ∏_{i; ζi=2} Φ(xiᵀ(κ⁻¹θ))^{yi} (1 − Φ(xiᵀ(κ⁻¹θ)))^{1−yi} × exp{ −θᵀ(XᵀX)θ/2n }
hence posterior distribution clearly defined
40. Testing as estimation: the demise of the Bayes factors
Illustrations
Results
             Logistic           Probit
a0      α     θ0      θ1      θ0/k0    θ1/k1
.1    .352  −4.06    .103    −2.51    .064
.2    .427  −4.03    .103    −2.49    .064
.3    .440  −4.02    .102    −2.49    .063
.4    .456  −4.01    .102    −2.48    .063
.5    .449  −4.05    .103    −2.51    .064
Histograms of posteriors of α in favour of the logistic model, where
a0 = .1, .2, .3, .4, .5, for (a) the Pima dataset, (b) data simulated from
the logistic model, (c) data simulated from the probit model
41. Testing as estimation: the demise of the Bayes factors
Illustrations
Survival analysis models
Testing hypothesis that data comes from a
1. log-Normal(φ, κ2),
2. Weibull(α, λ), or
3. log-Logistic(γ, δ)
distribution
Corresponding mixture given by the density
f(x) = α1 exp{−(log x − φ)²/2κ²} / {√(2π) x κ}
+ α2 (α/λ) (x/λ)^{α−1} exp{−(x/λ)^α}
+ α3 (δ/γ) (x/γ)^{δ−1} / {1 + (x/γ)^δ}²
where α1 + α2 + α3 = 1
43. Testing as estimation: the demise of the Bayes factors
Illustrations
Reparameterisation
Looking for common parameter(s):
φ = µ + γβ = ξ
σ² = π²β²/6 = ζ²π²/3
where γ ≈ 0.5772 is the Euler–Mascheroni constant.
Allows for a noninformative prior on the common location–scale
parameter,
π(φ, σ²) = 1/σ²
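A quick numerical sanity check of this moment matching, stated in the Weibull form (for X ∼ Weibull(shape a, scale λ), log X is Gumbel-distributed with mean log λ − γ/a and variance π²/6a²; the sign of the γβ term on the slide depends on whether a min- or max-Gumbel convention is used):

```python
# Sanity check: for X ~ Weibull(shape a, scale lam), log X has
# mean log(lam) - gamma/a and variance pi^2 / (6 a^2).
import numpy as np

gamma_em = 0.5772156649                          # Euler-Mascheroni constant
rng = np.random.default_rng(0)
a, lam = 2.0, 3.0
logx = np.log(lam * rng.weibull(a, size=1_000_000))
print(logx.mean(), np.log(lam) - gamma_em / a)   # both ~0.810
print(logx.var(), np.pi**2 / (6 * a * a))        # both ~0.411
```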
45. Testing as estimation: the demise of the Bayes factors
Illustrations
Recovery
[Figure: two panels] Boxplots of the posterior distributions of the
Normal weight α1 under the two scenarios: truth = Normal (left panel),
truth = Gumbel (right panel), a0 = .01, .1, 1.0, 10.0 (from left to right
in each panel) and n = 10,000 simulated observations.
46. Testing as estimation: the demise of the Bayes factors
Asymptotic consistency
Asymptotic consistency
Posterior consistency holds for mixture testing procedure [under
minor conditions]
Two different cases
the two models, M1 and M2, are well separated
model M1 is a submodel of M2.
48. Testing as estimation: the demise of the Bayes factors
Asymptotic consistency
Separated models
Assumption: Models are separated, i.e. identifiability holds:
∀α, α′ ∈ [0, 1], ∀θj, θ′j, j = 1, 2 :  Pθ,α = Pθ′,α′ ⇒ α = α′, θ = θ′
theorem
Under the above assumptions, for all ε > 0,
π[ |α − α∗| > ε | xn ] = op(1)
49. Testing as estimation: the demise of the Bayes factors
Asymptotic consistency
Separated models
Assumption: Models are separated, i.e. identifiability holds:
∀α, α′ ∈ [0, 1], ∀θj, θ′j, j = 1, 2 :  Pθ,α = Pθ′,α′ ⇒ α = α′, θ = θ′
theorem
If
θj → fj,θj is C² around θ∗j , j = 1, 2,
f1,θ∗1 − f2,θ∗2 , ḟ1,θ∗1 , ḟ2,θ∗2 are linearly independent in y, and
there exists δ > 0 such that
ḟ1,θ∗1 , ḟ2,θ∗2 , sup_{|θ1−θ∗1|<δ} |D²f1,θ1| , sup_{|θ2−θ∗2|<δ} |D²f2,θ2| ∈ L¹
then
π[ |α − α∗| > M √(log n/n) | xn ] = op(1).
50. Testing as estimation: the demise of the Bayes factors
Asymptotic consistency
Separated models
Assumption: Models are separated, i.e. identifiability holds:
∀α, α′ ∈ [0, 1], ∀θj, θ′j, j = 1, 2 :  Pθ,α = Pθ′,α′ ⇒ α = α′, θ = θ′
The theorem allows for an interpretation of α under the posterior: if the
data xn are generated from model M1, then the posterior on α concentrates
around α = 1
51. Testing as estimation: the demise of the Bayes factors
Asymptotic consistency
Embedded case
Here M1 is a submodel of M2, i.e.
θ2 = (θ1, ψ) , with θ2 = (θ1, ψ0 = 0) corresponding to f2,θ2 ∈ M1
Same posterior concentration rate
√(log n/n)
for estimating α when α∗ ∈ (0, 1) and ψ∗ ≠ 0.
52. Testing as estimation: the demise of the Bayes factors
Asymptotic consistency
Null case
Case where ψ∗ = 0, i.e., f ∗ is in model M1
Two possible paths to approximate f ∗: either α goes to 1
(path 1) or ψ goes to 0 (path 2)
New identifiability condition: Pθ,α = P∗ only if
α = 1, θ1 = θ∗1, θ2 = (θ∗1, ψ)  or  α ⩽ 1, θ1 = θ∗1, θ2 = (θ∗1, 0)
Prior
π(α, θ) = πα(α) π1(θ1) πψ(ψ) , θ2 = (θ1, ψ)
with a common prior on θ1
54. Testing as estimation: the demise of the Bayes factors
Asymptotic consistency
Consistency
theorem
Given the mixture fθ1,ψ,α = α f1,θ1 + (1 − α) f2,θ1,ψ and a sample
xn = (x1, · · · , xn) issued from f1,θ∗1 , under regularity assumptions,
there exists M > 0 such that
π[ (α, θ); ‖fθ,α − f∗‖1 > M √(log n/n) | xn ] = op(1).
If α ∼ Be(a1, a2), with a2 < d2 (d1 and d2 the dimensions of θ1 and ψ),
and if the prior πθ1,ψ is absolutely continuous with positive and
continuous density at (θ∗1, 0), then for Mn −→ ∞
π[ |α − α∗| > Mn (log n)^γ / √n | xn ] = op(1) ,  γ = max{(d1 + a2)/(d2 − a2), 1}/2
55. Testing as estimation: the demise of the Bayes factors
Conclusion
Conclusion
many applications of the Bayesian paradigm concentrate on
the comparison of scientific theories and on testing of null
hypotheses
natural tendency to default to Bayes factors
poorly understood sensitivity to prior modeling and posterior
calibration
Time is ripe for a paradigm shift
56. Testing as estimation: the demise of the Bayes factors
Conclusion
Conclusion
Time is ripe for a paradigm shift
original testing problem replaced with a better controlled
estimation target
allow for posterior variability over the component frequency as
opposed to deterministic Bayes factors
range of acceptance, rejection and indecision conclusions
easily calibrated by simulation
posterior medians quickly settling near the boundary values of
0 and 1
potential derivation of a Bayesian b-value by looking at the
posterior area under the tail of the distribution of the weight
57. Testing as estimation: the demise of the Bayes factors
Conclusion
Prior modelling
Time is ripe for a paradigm shift
Partly common parameterisation always feasible and hence
allows for reference priors
removal of the absolute prohibition of improper priors in
hypothesis testing
prior on the weight α shows sensitivity that naturally vanishes
as the sample size increases
default value of a0 = 0.5 in the Beta prior
58. Testing as estimation: the demise of the Bayes factors
Conclusion
Computing aspects
Time is ripe for a paradigm shift
proposal that does not induce additional computational strain
when algorithmic solutions exist for both models, they can be
recycled towards estimating the encompassing mixture
easier than in standard mixture problems due to common
parameters that allow for original MCMC samplers to be
turned into proposals
Gibbs sampling completions useful for assessing potential
outliers but not essential to achieve a conclusion about the
overall problem