Several perspectives and solutions on Bayesian
testing of hypotheses
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
Outline
Bayesian testing of hypotheses
Significance tests: one new parameter
Noninformative solutions
Jeffreys-Lindley paradox
Deviance (information criterion)
Testing under incomplete information
Posterior predictive checking
Testing via mixtures
Paradigm shift
Testing issues
Hypothesis testing
central problem of statistical inference
witness the recent ASA’s statement on p-values (Wasserstein,
2016)
dramatically differentiating feature between classical and
Bayesian paradigms
wide open to controversy and divergent opinions, including
within the Bayesian community
non-informative Bayesian testing case mostly unresolved,
witness the Jeffreys–Lindley paradox
[Berger (2003), Mayo & Cox (2006), Gelman (2008)]
“proper use and interpretation of the p-value”
“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”
“By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.”
[Wasserstein, 2016]
Bayesian testing of hypotheses
Bayesian model selection as comparison of k potential
statistical models towards the selection of model that fits the
data “best”
mostly accepted perspective: it does not primarily seek to
identify which model is “true”, but compares fits
tools like Bayes factor naturally include a penalisation
addressing model complexity, mimicked by Bayes Information
(BIC) and Deviance Information (DIC) criteria
posterior predictive tools successfully advocated in Gelman et
al. (2013) even though they involve double use of data
Some difficulties
tension between using (i) posterior probabilities justified by
binary loss function but depending on unnatural prior weights,
and (ii) Bayes factors that eliminate dependence but escape
direct connection with posterior, unless prior weights are
integrated within loss
delicate interpretation (or calibration) of strength of the Bayes
factor towards supporting a given hypothesis or model,
because not a Bayesian decision rule
similar difficulty with posterior probabilities, with tendency to
interpret them as p-values: only report of respective strengths
of fitting data to both models
referring to a fixed and arbitrary cutoff value falls into the
same difficulties as regular p-values
no “third way” like opting out from a decision
Some further difficulties
long-lasting impact of prior modeling, i.e., choice of prior
distributions on parameters of both models, despite overall
consistency proof for Bayes factor
discontinuity in valid use of improper priors since they are not
justified in most testing situations, leading to many alternative
and ad hoc solutions, where data is either used twice or split
in artificial ways [or further tortured into confession]
binary (accept vs. reject) outcome more suited for immediate
decision (if any) than for model evaluation, in connection with
rudimentary loss function [atavistic remnant of the
Neyman-Pearson formalism]
Some additional difficulties
related impossibility to ascertain simultaneous misfit or to
detect outliers
no assessment of uncertainty associated with decision itself
besides posterior probability
difficult computation of marginal likelihoods in most settings
with further controversies about which algorithm to adopt
strong dependence of posterior probabilities on conditioning
statistics (ABC), which undermines their validity for model
assessment
temptation to create pseudo-frequentist equivalents such as
q-values with even less Bayesian justification
⇒ time for a paradigm shift
back to some solutions
Bayesian tests 101
Associated with the risk
R(θ, δ) = Eθ[L(θ, δ(x))]
        = Pθ(δ(x) = 0) if θ ∈ Θ0,
          Pθ(δ(x) = 1) otherwise,
Bayes test
The Bayes estimator associated with π and with the 0 − 1 loss is
δ^π(x) = 1 if P(θ ∈ Θ0|x) > P(θ ∉ Θ0|x),
         0 otherwise,
Bayesian tests 102
Weights errors differently under both hypotheses:
Theorem (Optimal Bayes decision)
Under the 0 − 1 loss function
L(θ, d) = 0   if d = I_Θ0(θ)
          a0  if d = 1 and θ ∉ Θ0
          a1  if d = 0 and θ ∈ Θ0
the Bayes procedure is
δ^π(x) = 1 if P(θ ∈ Θ0|x) ≥ a0/(a0 + a1)
         0 otherwise
A function of posterior probabilities
Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0
B01 = [π(Θ0|x)/π(Θ0^c|x)] / [π(Θ0)/π(Θ0^c)] = ∫_Θ0 f(x|θ)π0(θ)dθ / ∫_Θ0^c f(x|θ)π1(θ)dθ
[Jeffreys, ToP, 1939, V, §5.01]
Bayes rule under 0 − 1 loss: acceptance if
B01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
self-contained concept
Outside decision-theoretic environment:
eliminates choice of π(Θ0)
but depends on the choice of (π0, π1)
Bayesian/marginal equivalent to the likelihood ratio
Jeffreys’ scale of evidence:
if log10(B^π_10) between 0 and 0.5, evidence against H0 weak,
if log10(B^π_10) between 0.5 and 1, evidence substantial,
if log10(B^π_10) between 1 and 2, evidence strong and
if log10(B^π_10) above 2, evidence decisive
[...fairly arbitrary!]
consistency
Example of a normal x̄n ∼ N(µ, 1/n) when µ ∼ N(0, 1), leading to
B01 = (1 + n)^{1/2} exp{−n²x̄n²/2(1 + n)}
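A minimal numerical sketch (our illustration, not from the slides) of this consistency: B01 grows like √n when the data are generated under H0 and collapses to 0 under a fixed alternative.

```python
# Sketch (assumed example) of B01 = (1 + n)^{1/2} exp{-n^2 xbar^2 / 2(1 + n)}
# for xbar_n ~ N(mu, 1/n) with prior mu ~ N(0, 1) under the alternative.
import math
import random

def bayes_factor_01(xbar, n):
    """B01 for H0: mu = 0 against mu ~ N(0, 1), based on xbar ~ N(mu, 1/n)."""
    return math.sqrt(1 + n) * math.exp(-n**2 * xbar**2 / (2 * (1 + n)))

random.seed(1)
for n in (10, 100, 1000, 10000):
    xbar_null = random.gauss(0, 1 / math.sqrt(n))       # data generated under H0
    xbar_alt = 0.5 + random.gauss(0, 1 / math.sqrt(n))  # data generated with mu = 0.5
    print(n, bayes_factor_01(xbar_null, n), bayes_factor_01(xbar_alt, n))
# B01 grows (like sqrt(n)) under H0 and collapses to 0 under the alternative
```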
Point null hypotheses
“Is it of the slightest use to reject a hypothesis until we have some
idea of what to put in its place?” H. Jeffreys, ToP (p.390)
Particular case H0 : θ = θ0
Take ρ0 = Pr^π(θ = θ0) and g1 prior density under H0^c.
Posterior probability of H0
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
and marginal under H0^c
m1(x) = ∫_Θ1 f(x|θ)g1(θ) dθ.
and
B^π_01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0/(1 − ρ0)] = f(x|θ0)/m1(x)
“Significance tests: one new parameter”
Bayesian testing of hypotheses
Significance tests: one new parameter
Bayesian tests
Bayes factors
Improper priors for tests
Conclusion
Noninformative solutions
Jeffreys-Lindley paradox
Deviance (information criterion)
Testing under incomplete information
Fundamental setting
Is the new parameter supported by the observations or is
any variation expressible by it better interpreted as
random? Thus we must set two hypotheses for
comparison, the more complicated having the smaller
initial probability (Jeffreys, ToP, V, §5.0)
...compare a specially suggested value of a new
parameter, often 0 [q], with the aggregate of other
possible values [q′]. We shall call q the null hypothesis
and q′ the alternative hypothesis [and] we must take
P(q|H) = P(q′|H) = 1/2 .
Construction of Bayes tests
Definition (Test)
Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a
statistical model, a test is a statistical procedure that takes its
values in {0, 1}.
Type–one and type–two errors
Associated with the risk
R(θ, δ) = Eθ[L(θ, δ(x))]
        = Pθ(δ(x) = 0) if θ ∈ Θ0,
          Pθ(δ(x) = 1) otherwise,
Theorem (Bayes test)
The Bayes estimator associated with π and with the 0 − 1 loss is
δ^π(x) = 1 if P(θ ∈ Θ0|x) > P(θ ∉ Θ0|x),
         0 otherwise,
Jeffreys’ example (§5.0)
Testing whether the mean α of a normal observation is zero:
P(q|aH) ∝ exp{−a²/2s²}
P(q′dα|aH) ∝ exp{−(a − α)²/2s²} f(α)dα
P(q′|aH) ∝ ∫ exp{−(a − α)²/2s²} f(α)dα
A (small) point of contention
Jeffreys asserts
Suppose that there is one old parameter α; the new
parameter is β and is 0 on q. In q′ we could replace α by
α′, any function of α and β: but to make it explicit that
q′ reduces to q when β = 0 we shall require that α′ = α
when β = 0 (V, §5.0).
This amounts to assuming identical parameters in both models, a
controversial principle for model choice, or at best to making
α and β dependent a priori, a choice contradicted by the next
paragraph in ToP
Orthogonal parameters
If
I(α, β) = ( gαα  0
            0   gββ ),
α and β orthogonal, but not [a posteriori] independent, contrary to
ToP assertions
...the result will be nearly independent on previous
information on old parameters (V, §5.01).
and
K = (1/f(b, a)) √(ngββ/2π) exp{−½ ngββ b²}
[where] h(α) is irrelevant (V, §5.01)
Acknowledgement in ToP
In practice it is rather unusual for a set of parameters to
arise in such a way that each can be treated as irrelevant
to the presence of any other. More usual cases are (...)
where some parameters are so closely associated that one
could hardly occur without the others (V, §5.04).
Generalisation
Theorem (Optimal Bayes decision)
Under the 0 − 1 loss function
L(θ, d) = 0   if d = I_Θ0(θ)
          a0  if d = 1 and θ ∉ Θ0
          a1  if d = 0 and θ ∈ Θ0
the Bayes procedure is
δ^π(x) = 1 if Pr^π(θ ∈ Θ0|x) ≥ a0/(a0 + a1)
         0 otherwise
Bound comparison
Determination of a0/a1 depends on consequences of “wrong
decision” under both circumstances
Often difficult to assess in practice, leading to replacement with
“golden” default bounds like .05, biased towards H0
A function of posterior probabilities
Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0
B01 = [π(Θ0|x)/π(Θ0^c|x)] / [π(Θ0)/π(Θ0^c)] = ∫_Θ0 f(x|θ)π0(θ)dθ / ∫_Θ0^c f(x|θ)π1(θ)dθ
[Good, 1958 & ToP, V, §5.01]
Equivalent to Bayes rule: acceptance if
B01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
Self-contained concept
Outside decision-theoretic environment:
eliminates choice of π(Θ0)
but depends on the choice of (π0, π1)
Bayesian/marginal equivalent to the likelihood ratio
Jeffreys’ scale of evidence:
if log10(B^π_10) between 0 and 0.5, evidence against H0 weak,
if log10(B^π_10) between 0.5 and 1, evidence substantial,
if log10(B^π_10) between 1 and 2, evidence strong and
if log10(B^π_10) above 2, evidence decisive
A major modification
When the null hypothesis is supported by a set of measure 0
against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous
prior distribution
[End of the story?!]
Suppose we are considering whether a location parameter
α is 0. The estimation prior probability for it is uniform
and we should have to take f (α) = 0 and K[= B10]
would always be infinite (V, §5.02)
Point null refurbishment
Requirement
Defined prior distributions under both assumptions,
π0(θ) ∝ π(θ)IΘ0 (θ), π1(θ) ∝ π(θ)IΘ1 (θ),
(under the standard dominating measures on Θ0 and Θ1)
Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = ρ1,
π(θ) = ρ0π0(θ) + ρ1π1(θ).
Note If Θ0 = {θ0}, π0 is the Dirac mass in θ0
Point null hypotheses
Particular case H0 : θ = θ0
Take ρ0 = Pr^π(θ = θ0) and g1 prior density under Ha.
Posterior probability of H0
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
and marginal under Ha
m1(x) = ∫_Θ1 f(x|θ)g1(θ) dθ.
Point null hypotheses (cont’d)
Dual representation
π(Θ0|x) = [1 + (1 − ρ0)/ρ0 · m1(x)/f(x|θ0)]^{−1}.
and
B^π_01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0/(1 − ρ0)] = f(x|θ0)/m1(x)
Connection
π(Θ0|x) = [1 + (1 − ρ0)/ρ0 · 1/B^π_01(x)]^{−1}.
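As a quick illustration of this connection, a minimal sketch (the helper name is ours, not from the slides) mapping a Bayes factor and prior weight ρ0 into π(Θ0|x):

```python
# Sketch of pi(Theta0|x) = [1 + (1 - rho0)/rho0 * 1/B01(x)]^{-1}
def posterior_prob_H0(B01, rho0=0.5):
    """Posterior probability of H0 given the Bayes factor B01 and prior weight rho0."""
    return 1.0 / (1.0 + (1.0 - rho0) / rho0 / B01)

# With equal prior weights, B01 = 1 gives probability 1/2; B01 = 10 gives 10/11
for b in (0.1, 1.0, 10.0):
    print(b, posterior_prob_H0(b))
```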
A further difficulty
Improper priors are not allowed here
If
∫_Θ1 π1(dθ1) = ∞ or ∫_Θ2 π2(dθ2) = ∞
then π1 or π2 cannot be coherently normalised, while the
normalisation matters in the Bayes factor [remember the Bayes factor?]
ToP unaware of the problem?
A. Not entirely, as improper priors keep being used on nuisance
parameters
Example of testing for a zero normal mean:
If σ is the standard error and λ the true value, λ is 0 on
q. We want a suitable form for its prior on q′. (...) Then
we should take
P(q dσ|H) ∝ dσ/σ
P(q′ dσ dλ|H) ∝ f(λ/σ) dσ/σ · dλ/λ
where f [is a true density] (V, §5.2).
Fallacy of the “same” σ!
Not enough information
If s = 0 [!!!], then [for σ = |x̄|/τ, λ = σv]
P(q|θH) ∝ ∫_0^∞ (τ/|x̄|)^n exp{−½nτ²} dτ/τ,
P(q′|θH) ∝ ∫_0^∞ dτ/τ ∫_{−∞}^{+∞} (τ/|x̄|)^n f(v) exp{−½n(v − τ)²} dv.
If n = 1 and f(v) is any even [density],
P(q′|θH) ∝ (1/2)√(2π)/|x̄| and P(q|θH) ∝ (1/2)√(2π)/|x̄|
and therefore K = 1 (V, §5.2).
Strange constraints
If n ≥ 2, the condition that K = 0 for s = 0, x̄ ≠ 0 is
equivalent to
∫_0^∞ f(v)v^{n−1} dv = ∞ .
The function satisfying this condition for [all] n is
f(v) = 1/π(1 + v²)
This is the prior recommended by Jeffreys hereafter.
But, first, many other families of densities satisfy this constraint
and a scale of 1 cannot be universal!
Second, s = 0 is a zero probability event...
Comments
ToP very imprecise about choice of priors in the setting of
tests (despite existence of Susie’s Jeffreys’ conventional partly
proper priors)
ToP misses the difficulty of improper priors [coherent with
earlier stance]
but this problem still generates debates within the B
community
Some degree of goodness-of-fit testing but against fixed
alternatives
Persistence of the form
K ≈ √(πν/2) (1 + t²/ν)^{−ν/2+1/2}
but ν not so clearly defined...
Noninformative proposals
Bayesian testing of hypotheses
Significance tests: one new
parameter
Noninformative solutions
Jeffreys-Lindley paradox
Deviance (information criterion)
Testing under incomplete
information
Posterior predictive checking
what’s special about the Bayes factor?!
“The priors do not represent substantive knowledge of the
parameters within the model
Using Bayes’ theorem, these priors can then be updated to
posteriors conditioned on the data that were actually observed
In general, the fact that different priors result in different
Bayes factors should not come as a surprise
The Bayes factor (...) balances the tension between parsimony
and goodness of fit, (...) against overfitting the data
In induction there is no harm in being occasionally wrong; it is
inevitable that we shall be”
[Jeffreys, 1939; Ly et al., 2015]
what’s wrong with the Bayes factor?!
(1/2, 1/2) partition between hypotheses has very little to
suggest in terms of extensions
central difficulty stands with issue of picking a prior
probability of a model
unfortunate impossibility of using improper priors in most
settings
Bayes factors lack direct scaling associated with posterior
probability and loss function
twofold dependence on subjective prior measure, first in prior
weights of models and second in lasting impact of prior
modelling on the parameters
Bayes factor offers no window into uncertainty associated with
decision
further reasons in the summary
[Robert, 2016]
A strong impropriety
Improper priors not allowed in Bayes factors:
If
∫_Θ1 π1(dθ1) = ∞ or ∫_Θ2 π2(dθ2) = ∞
then π1 or π2 cannot be coherently normalised while the
normalisation matters in the Bayes factor B12
Lack of mathematical justification for “common nuisance
parameter” [and prior of]
[Berger, Pericchi, and Varshavsky, 1998; Marin and Robert, 2013]
On some resolutions of the paradox
use of pseudo-Bayes factors, fractional Bayes factors, &tc,
which lack complete proper Bayesian justification
[Berger & Pericchi, 2001]
use of identical improper priors on nuisance parameters, a
notion already entertained by Jeffreys
[Berger et al., 1998; Marin & Robert, 2013]
Péché de jeunesse: equating the values of the prior densities
at the point-null value θ0,
ρ0 = (1 − ρ0)π1(θ0)
[Robert, 1993]
calibration via the posterior predictive distribution, which uses
the data twice
matching priors, whose sole purpose is to bring frequentist
and Bayesian coverages as close as possible
[Datta & Mukerjee, 2004]
use of score functions extending the log score function
log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2) ,
that are independent of the normalising constant
[Dawid et al., 2013; Dawid & Musio, 2015]
non-local priors correcting default priors towards more
balanced error rates
[Johnson & Rossell, 2010; Consonni et al., 2013]
Pseudo-Bayes factors
Idea
Use one part x[i] of the data x to make the prior proper:
πi improper but πi(·|x[i]) proper
and
∫ fi(x[n/i]|θi) πi(θi|x[i]) dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i]) dθj
independent of normalizing constant
Use remaining x[n/i] to run test as if πj(θj|x[i]) is the true prior
Motivation
Provides a working principle for improper priors
Gather enough information from data to achieve properness
and use this properness to run the test on remaining data
does not use x twice as in Aitkin (1991)
Issues
depends on the choice of x[i]
many ways of combining pseudo-Bayes factors
AIBF = B^N_ji · (1/L) Σ_ℓ Bij(x[ℓ])
MIBF = B^N_ji · med[Bij(x[ℓ])]
GIBF = B^N_ji · exp{(1/L) Σ_ℓ log Bij(x[ℓ])}
not often an exact Bayes factor
and thus lacking inner coherence
B12 ≠ B10B02 and B01 ≠ 1/B10 .
[Berger & Pericchi, 1996]
Fractional Bayes factor
Idea
use directly the likelihood to separate training sample from testing
sample
B^F_12 = B12(x) × ∫ L2^b(θ2)π2(θ2)dθ2 / ∫ L1^b(θ1)π1(θ1)dθ1
[O’Hagan, 1995]
Proportion b of the sample used to gain properness
Fractional Bayes factor (cont’d)
Example (Normal mean)
B^F_12 = (1/√b) e^{n(b−1)x̄n²/2}
corresponds to exact Bayes factor for the prior N(0, (1 − b)/nb)
If b constant, prior variance goes to 0
If b = 1/n, prior variance stabilises around 1
If b = n^{−α}, α < 1, prior variance goes to 0 too.
⇒ Call to external principles to pick the order of b
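A small sketch (ours, on the normal-mean example above) of how these three choices of b drive the implicit prior variance (1 − b)/nb:

```python
# Illustration (assumed example) of the implicit prior variance (1 - b)/(n b)
def implicit_prior_variance(n, b):
    return (1 - b) / (n * b)

for n in (10, 100, 1000):
    print(n,
          implicit_prior_variance(n, 0.5),       # b constant: variance -> 0
          implicit_prior_variance(n, 1 / n),     # b = 1/n: variance -> 1
          implicit_prior_variance(n, n**-0.5))   # b = n^{-alpha}, alpha < 1: -> 0
```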
Jeffreys–Lindley paradox
Bayesian testing of hypotheses
Significance tests: one new parameter
Noninformative solutions
Jeffreys-Lindley paradox
Lindley’s paradox
dual versions of the paradox
Bayesian resolutions
Deviance (information criterion)
Testing under incomplete information
Lindley’s paradox
In a normal mean testing problem,
x̄n ∼ N(θ, σ²/n) , H0 : θ = θ0 ,
under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor
B01(tn) = (1 + n)^{1/2} exp{−nt_n²/2[1 + n]} ,
where tn = √n|x̄n − θ0|/σ, satisfies
B01(tn) → ∞ as n → ∞
[assuming a fixed tn]
[Lindley, 1957]
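A minimal sketch (ours) of the official version of the paradox: holding tn at the 5% frequentist cutoff 1.96, the p-value stays fixed while B01(tn) diverges with n.

```python
# Sketch (assumed example): fixed t_n, growing n
import math

def B01(t, n):
    return math.sqrt(1 + n) * math.exp(-n * t**2 / (2 * (1 + n)))

def two_sided_p_value(t):
    return math.erfc(t / math.sqrt(2))  # 2 * (1 - Phi(t)) for a N(0,1) statistic

t = 1.96
for n in (10, 100, 10**4, 10**6):
    print(n, two_sided_p_value(t), B01(t, n))
# the Bayes factor eventually supports H0 however "significant" the p-value looks
```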
Two versions of the paradox
“the weight of Lindley’s paradoxical result (...) burdens
proponents of the Bayesian practice”.
[Lad, 2003]
official version, opposing frequentist and Bayesian assessments
[Lindley, 1957]
intra-Bayesian version, blaming vague and improper priors for
the Bayes factor misbehaviour:
if π1(·|σ) depends on a scale parameter σ, it is often the case
that
B01(x) → +∞ as σ → ∞
for a given x, meaning H0 is always accepted
[Robert, 1992, 2013]
where does it matter?
In the normal case, Z ∼ N(θ, 1), θ ∼ N(0, α²), Bayes factor
B10(z) = e^{z²α²/2(1+α²)} / √(1 + α²) = √(1 − λ) exp{λz²/2}
with λ = α²/(1 + α²)
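And a matching sketch (ours) of the intra-Bayesian version: for a fixed observation z, B10(z) vanishes as the prior scale α grows, so H0 always ends up accepted.

```python
# Sketch (assumed example): fixed z, growing prior scale alpha
import math

def B10(z, alpha):
    lam = alpha**2 / (1 + alpha**2)
    return math.sqrt(1 - lam) * math.exp(lam * z**2 / 2)

z = 2.5
for alpha in (1, 10, 100, 1000):
    print(alpha, B10(z, alpha))   # decreases towards 0 with alpha
```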
Evacuation of the first version
Two paradigms [(b) versus (f)]
one (b) operates on the parameter space Θ, while the other
(f) is produced from the sample space
one (f) relies solely on the point-null hypothesis H0 and the
corresponding sampling distribution, while the other
(b) opposes H0 to a (predictive) marginal version of H1
one (f) could reject “a hypothesis that may be true (...)
because it has not predicted observable results that have not
occurred” (Jeffreys, ToP, VII, §7.2) while the other
(b) conditions upon the observed value x_obs
one (f) cannot agree with the likelihood principle, while the
other (b) is almost uniformly in agreement with it
one (f) resorts to an arbitrary fixed bound α on the p-value,
while the other (b) refers to the (default) boundary probability
of 1/2
Nothing’s wrong with the second version
n, prior’s scale factor: prior variance n times larger than the
observation variance and when n goes to ∞, Bayes factor
goes to ∞ no matter what the observation is
n becomes what Lindley (1957) calls “a measure of lack of
conviction about the null hypothesis”
when prior diffuseness under H1 increases, only relevant
information becomes that θ could be equal to θ0, and this
overwhelms any evidence to the contrary contained in the data
mass of the prior distribution in the vicinity of any fixed
neighbourhood of the null hypothesis vanishes to zero under
H1
⇒ deep coherence in the outcome: being indecisive about
the alternative hypothesis means we should not choose it
On some resolutions of the second version
use of pseudo-Bayes factors, fractional Bayes factors, &tc,
which lack complete proper Bayesian justification
[Berger & Pericchi, 2001]
use of identical improper priors on nuisance parameters, a
notion already entertained by Jeffreys
[Berger et al., 1998; Marin & Robert, 2013]
Péché de jeunesse: equating the values of the prior densities
at the point-null value θ0,
ρ0 = (1 − ρ0)π1(θ0)
[Robert, 1993]
use of the posterior predictive distribution, which uses the
data twice
matching priors, whose sole purpose is to bring frequentist
and Bayesian coverages as close as possible
[Datta & Mukerjee, 2004]
use of score functions extending the log score function
log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2) ,
that are independent of the normalising constant
[Dawid et al., 2013; Dawid & Musio, 2015]
non-local priors correcting default priors towards more
balanced error rates
[Johnson & Rossell, 2010; Consonni et al., 2013]
Deviance (information criterion)
Bayesian testing of hypotheses
Significance tests: one new parameter
Noninformative solutions
Jeffreys-Lindley paradox
Deviance (information criterion)
Testing under incomplete information
Posterior predictive checking
Testing via mixtures
Paradigm shift
DIC as in Dayesian?
Deviance defined by
D(θ) = −2 log(p(y|θ)) ,
Effective number of parameters computed as
pD = D̄ − D(θ̄) ,
with D̄ posterior expectation of D and θ̄ estimate of θ
Deviance information criterion (DIC) defined by
DIC = pD + D̄ = D(θ̄) + 2pD
Models with smaller DIC better supported by the data
[Spiegelhalter et al., 2002]
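A minimal sketch of the DIC computation from posterior draws, on an assumed normal-mean example with flat prior (our illustration, not from the slides):

```python
# Sketch: D(theta) = -2 log p(y|theta), pD = Dbar - D(theta_bar), DIC = Dbar + pD
import math
import random

random.seed(0)
y = [random.gauss(1.0, 1.0) for _ in range(50)]

def log_lik(theta):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta)**2 for yi in y)

# under a flat prior the posterior for theta is N(ybar, 1/n): draw from it directly
n, ybar = len(y), sum(y) / len(y)
draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(5000)]

D = [-2 * log_lik(t) for t in draws]
Dbar = sum(D) / len(D)
theta_bar = sum(draws) / len(draws)   # posterior mean as the estimate of theta
pD = Dbar - (-2 * log_lik(theta_bar))
DIC = Dbar + pD
print(pD, DIC)   # pD should be close to 1, the number of free parameters
```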
“thou shalt not use the data twice”
The data is used twice in the DIC method:
1. y used once to produce the posterior π(θ|y), and the
associated estimate, θ̃(y)
2. y used a second time to compute the posterior expectation
of the observed likelihood p(y|θ),
∫ log p(y|θ) π(dθ|y) ∝ ∫ log p(y|θ) p(y|θ) π(dθ) ,
DIC for missing data models
Framework of missing data models
f(y|θ) = ∫ f(y, z|θ) dz ,
with observed data y = (y1, . . . , yn) and corresponding missing
data z = (z1, . . . , zn)
How do we define DIC in such settings?
how many DICs can you fit in a mixture?
Q: How many giraffes can you fit in a VW bug?
A: None, the elephants are in there.
1. observed DICs
DIC1 = −4Eθ[log f(y|θ)|y] + 2 log f(y|Eθ[θ|y]),
often a poor choice in case of unidentifiability
DIC2 = −4Eθ[log f(y|θ)|y] + 2 log f(y|θ̂(y)),
which uses the posterior mode instead
DIC3 = −4Eθ[log f(y|θ)|y] + 2 log f̂(y),
which instead relies on the MCMC density estimate
2. complete DICs based on f(y, z|θ)
DIC4 = EZ[DIC(y, Z)|y]
     = −4Eθ,Z[log f(y, Z|θ)|y] + 2EZ[log f(y, Z|Eθ[θ|y, Z])|y]
DIC5 = −4Eθ,Z[log f(y, Z|θ)|y] + 2 log f(y, ẑ(y)|θ̂(y)),
using Z as an additional parameter
DIC6 = −4Eθ,Z[log f(y, Z|θ)|y] + 2EZ[log f(y, Z|θ̂(y))|y, θ̂(y)],
in analogy with EM, θ̂ being an EM fixed point
3. conditional DICs based on f(y|z, θ)
DIC7 = −4Eθ,Z[log f(y|Z, θ)|y] + 2 log f(y|ẑ(y), θ̂(y)),
using MAP estimates
DIC8 = −4Eθ,Z[log f(y|Z, θ)|y] + 2EZ[log f(y|Z, θ̂(y, Z))|y],
conditioning first on Z and then integrating over Z conditional on y
[Celeux et al., BA, 2006]
Galactic DICs
Example of the galaxy mixture dataset, with pD in parentheses:
K   DIC2 (pD2)    DIC3 (pD3)   DIC4 (pD4)   DIC5 (pD5)     DIC6 (pD6)    DIC7 (pD7)    DIC8 (pD8)
2   453 (5.56)    451 (3.66)   502 (5.50)   705 (207.88)   501 (4.48)    417 (11.07)   410 (4.09)
3   440 (9.23)    436 (4.94)   461 (6.40)   622 (167.28)   471 (15.80)   378 (13.59)   372 (7.43)
4   446 (11.58)   439 (5.41)   473 (7.52)   649 (183.48)   482 (16.51)   388 (17.47)   382 (11.37)
5   447 (10.80)   442 (5.48)   485 (7.58)   658 (180.73)   511 (33.29)   395 (20.00)   390 (15.15)
6   449 (11.26)   444 (5.49)   494 (8.49)   676 (191.10)   532 (46.83)   407 (28.23)   398 (19.34)
7   460 (19.26)   446 (5.83)   508 (8.93)   700 (200.35)   571 (71.26)   425 (40.51)   409 (24.57)
questions
what is the behaviour of DIC under model misspecification?
is there an absolute scale to the DIC values, i.e. when is a
difference in DICs significant?
how can DIC handle small n versus large p settings?
should pD be defined as var(D|y)/2 [Gelman’s suggestion]?
is WAIC (Gelman and Vehtari, 2013) making a difference for
being based on expected posterior predictive?
In an era of complex models, is DIC applicable?
[Robert, 2013]
Testing under incomplete information
Bayesian testing of hypotheses
Significance tests: one new parameter
Noninformative solutions
Jeffreys-Lindley paradox
Deviance (information criterion)
Testing under incomplete information
Posterior predictive checking
Testing via mixtures
Paradigm shift
Likelihood-free settings
Cases when the likelihood function f(y|θ) is unavailable (in
analytic and numerical senses) and when the completion step
f(y|θ) = ∫_Z f(y, z|θ) dz
is impossible or too costly because of the dimension of z
⇒ MCMC cannot be implemented!
The ABC method
Bayesian setting: target is π(θ)f(x|θ)
When likelihood f(x|θ) not in closed form, likelihood-free rejection
technique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly
simulating
θ′ ∼ π(θ) , z ∼ f(z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y.
[Tavaré et al., 1997]
A as A...pproximative
When y is a continuous random variable, strict equality z = y is
replaced with a tolerance zone
ρ(y, z) ≤ ε
where ρ is a distance
Output distributed from
π(θ)Pθ{ρ(y, z) < ε} ∝ π(θ|ρ(y, z) < ε)
[Pritchard et al., 1999]
ABC algorithm
In most implementations, further degree of A...pproximation:
Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ′
  end for
where η(y) defines a (not necessarily sufficient) statistic
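A minimal sketch (ours) of Algorithm 1 on an assumed normal-mean toy model, with η(y) = ȳ as summary and the absolute difference as the distance ρ:

```python
# Sketch of likelihood-free rejection sampling: y ~ N(theta, 1), theta ~ N(0, 10)
import math
import random

random.seed(2)
n = 25
theta_true = 1.5
y = [random.gauss(theta_true, 1) for _ in range(n)]
eta_y = sum(y) / n                                  # eta(y) = sample mean

def abc_rejection(N=500, eps=0.1):
    sample = []
    while len(sample) < N:
        theta = random.gauss(0, math.sqrt(10))          # theta' ~ prior pi(.)
        z = [random.gauss(theta, 1) for _ in range(n)]  # z ~ f(.|theta')
        if abs(sum(z) / n - eta_y) <= eps:              # rho{eta(z), eta(y)} <= eps
            sample.append(theta)
    return sample

post = abc_rejection()
print(sum(post) / len(post))  # close to the exact posterior mean 10*n*ybar/(10*n + 1)
```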
Which summary η(·)?
Fundamental difficulty of the choice of the summary statistic when
there is no non-trivial sufficient statistics
Loss of statistical information balanced against gain in data
roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation
may be imposed for external/practical reasons (e.g., LDA)
may gather several non-B point estimates
can learn about efficient combination
[Estoup et al., 2012, Genetics]
Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)
  for t = 1 to T do
    repeat
      Generate m from the prior π(M = m)
      Generate θm from the prior πm(θm)
      Generate z from the model fm(z|θm)
    until ρ{η(z), η(y)} < ε
    Set m(t) = m and θ(t) = θm
  end for
[Grelaud et al., 2009]
ABC estimates
Posterior probability π(M = m|y) approximated by the frequency
of acceptances from model m
(1/T) Σ_{t=1}^T I_{m(t)=m} .
Issues with implementation:
should tolerances ε be the same for all models?
should summary statistics vary across models (incl. their
dimension)?
should the distance measure ρ vary as well?
Extension to a weighted polychotomous logistic regression estimate
of π(M = m|y), with non-parametric kernel weights
[Cornuet et al., DIYABC, 2009]
Formalised framework
Central question to the validation of ABC for model choice:
When is a Bayes factor based on an insufficient statistic
T(y) consistent?
Note: the conclusion drawn on T(y) through B^T_12(y) necessarily
differs from the conclusion drawn on y through B12(y)
A benchmark if toy example
Comparison suggested by referee of PNAS paper [thanks]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]
Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2),
Laplace distribution with mean θ2 and scale parameter 1/√2
(variance one).
Four possible statistics
1. sample mean ȳ (sufficient for M1 if not M2);
2. sample median med(y) (insufficient);
3. sample variance var(y) (ancillary);
4. median absolute deviation mad(y) = med(|y − med(y)|);
[Figure: density of the estimated posterior probabilities (x-axis: probability, y-axis: density)]
[Figure: boxplots of the estimated posterior probabilities under data from the Gauss and Laplace models, on a 0–1 scale, n = 100]
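A hypothetical ABC-MC sketch (ours) of this benchmark with η = mad(y); the prior θ ∼ N(0, 4) under both models and the tolerance are our assumptions for illustration:

```python
# Sketch of ABC model choice: M1 Gaussian vs M2 Laplace(scale 1/sqrt(2))
import math
import random
import statistics

random.seed(3)
n = 100
y = [random.gauss(0, 1) for _ in range(n)]    # data actually from the Gaussian model

def mad(x):
    m = statistics.median(x)
    return statistics.median([abs(xi - m) for xi in x])

def simulate(m, theta):
    if m == 1:
        return [random.gauss(theta, 1) for _ in range(n)]
    b = 1 / math.sqrt(2)  # Laplace scale giving variance one
    return [theta + random.choice((-1, 1)) * random.expovariate(1 / b)
            for _ in range(n)]

eta_y, accepted, eps = mad(y), [], 0.02
while len(accepted) < 500:
    m = random.choice((1, 2))                 # m ~ prior on the model index
    theta = random.gauss(0, 2)                # theta_m ~ prior (our assumption)
    if abs(mad(simulate(m, theta)) - eta_y) < eps:
        accepted.append(m)

print(accepted.count(1) / len(accepted))      # ABC estimate of pi(M = 1 | y)
```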
Consistency theorem
If Pn belongs to one of the two models and if µ0 = E[T] cannot
be attained by the other one:
0 = min_{i=1,2} inf{|µ0 − µi(θi)|; θi ∈ Θi} < max_{i=1,2} inf{|µ0 − µi(θi)|; θi ∈ Θi} ,
then the Bayes factor B^T_12 is consistent
Posterior predictive checking
Bayesian testing of hypotheses
Significance tests: one new parameter
Noninformative solutions
Jeffreys-Lindley paradox
Deviance (information criterion)
Testing under incomplete information
Posterior predictive checking
Testing via mixtures
Paradigm shift
Bayesian predictive
“If the model fits, then replicated data generated under
the model should look similar to observed data. To put it
another way, the observed data should look plausible
under the posterior predictive distribution. This is really a
self-consistency check: an observed discrepancy can be
due to model misfit or chance.” (BDA, p.143)
Use of posterior predictive
p(y^rep|y) = ∫ p(y^rep|θ)π(θ|y) dθ
and measure of discrepancy T(·, ·)
Replacing p-value
p(y|θ) = P(T(y^rep, θ) ≥ T(y, θ)|θ)
with Bayesian posterior p-value
P(T(y^rep, θ) ≥ T(y, θ)|y) = ∫ p(y|θ)π(θ|y) dθ
Issues
“the posterior predictive p-value is such a [Bayesian]
probability statement, conditional on the model and data,
about what might be expected in future replications.” (BDA, p.151)
sounds too much like a p-value...!
relies on choice of T(·, ·)
seems to favour overfitting
(again) using the data twice (once for the posterior and twice
in the p-value)
needs to be calibrated (back to 0.05?)
general difficulty in interpreting
where is the penalty for model complexity?
Example
Normal-normal mean model:
X ∼ N(θ, 1) , θ ∼ N(0, 10)
Bayesian posterior p-value for T(x) = x², m(x), B10(x)^{−1}:
∫ P(|X| ≥ |x| | θ, x) π(θ|x) dθ
which interpretation?
gets down as x gets away from 0... while discrepancy based on
B10(x) increases mildly
[Figure: histograms of posterior predictive p-values for n = 10, 100, 1000]
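A minimal sketch (ours) of the posterior predictive p-value in this normal-normal model, using the discrepancy T(x) = |x|:

```python
# Sketch: X ~ N(theta, 1), theta ~ N(0, 10), posterior is N(10x/11, 10/11)
import math
import random

random.seed(4)
x = 2.0                                        # the observed value
post_mean = 10 / 11 * x
post_sd = math.sqrt(10 / 11)

count, M = 0, 100000
for _ in range(M):
    theta = random.gauss(post_mean, post_sd)   # theta ~ pi(theta | x)
    x_rep = random.gauss(theta, 1)             # x_rep ~ f(. | theta)
    if abs(x_rep) >= abs(x):
        count += 1
print(count / M)   # P(|X_rep| >= |x| given x); drifts with x, hard to calibrate
```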
goodness-of-fit [only?]
“A model is suspect if a discrepancy is of practical importance and
its observed value has a tail-area probability near 0 or 1, indicating
that the observed pattern would be unlikely to be seen in
replications of the data if the model were true. An extreme p-value
implies that the model cannot be expected to capture this aspect of
the data. A p-value is a posterior probability and can therefore be
interpreted directly—although not as Pr(model is true | data).
Major failures of the model (...) can be addressed by expanding the
model appropriately.” BDA, p.150
not helpful in comparing models (both may be deficient)
anti-Ockham? i.e., may favour larger dimensions (if prior
concentrated enough)
lingering worries about using the data twice and favourable
bias
impact of the prior (only under the current model) but allows
for improper priors
Changing the testing perspective
Bayesian testing of hypotheses
Significance tests: one new parameter
Noninformative solutions
Jeffreys-Lindley paradox
Deviance (information criterion)
Testing under incomplete information
Posterior predictive checking
Testing via mixtures
Paradigm shift
Paradigm shift
New proposal for a paradigm shift (!) in the Bayesian processing of
hypothesis testing and of model selection
convergent and naturally interpretable solution
more extended use of improper priors
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
Paradigm shift
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
Approach inspired from consistency result of Rousseau and
Mengersen (2011) on estimated overfitting mixtures
Mixture representation not directly equivalent to the use of a
posterior probability
Potential of a better approach to testing, while not expanding
number of parameters
Calibration of posterior distribution of the weight of a model,
moving from artificial notion of posterior probability of a
model
Encompassing mixture model
Idea: Given two statistical models,
M1 : x ∼ f1(x|θ1) , θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2) , θ2 ∈ Θ2 ,
embed both within an encompassing mixture
Mα : x ∼ αf1(x|θ1) + (1 − α)f2(x|θ2) , 0 ≤ α ≤ 1 (1)
Note: Both models correspond to special cases of (1), one for
α = 1 and one for α = 0
Draw inference on mixture representation (1), as if each
observation was individually and independently produced by the
mixture model
Inferential motivations
Sounds like an approximation to the real model, but there are several
definitive advantages to this paradigm shift:
Bayes estimate of the weight α replaces posterior probability
of model M1, equally convergent indicator of which model is
“true”, while avoiding artificial prior probabilities on model
indices, ω1 and ω2
interpretation of estimator of α at least as natural as handling
the posterior probability, while avoiding zero-one loss setting
α and its posterior distribution provide measure of proximity
to the models, while being interpretable as data propensity to
stand within one model
further allows for alternative perspectives on testing and
model choice, like predictive tools, cross-validation, and
information indices like WAIC
Computational motivations
avoids highly problematic computations of the marginal
likelihoods, since standard algorithms are available for
Bayesian mixture estimation
straightforward extension to a finite collection of models, with
a larger number of components, which considers all models at
once and eliminates least likely models by simulation
eliminates difficulty of label switching that plagues both
Bayesian estimation and Bayesian computation, since
components are no longer exchangeable
posterior distribution of α evaluates more thoroughly strength
of support for a given model than the single figure outcome of
a posterior probability
variability of posterior distribution on α allows for a more
thorough assessment of the strength of this support
Noninformative motivations
additional feature missing from traditional Bayesian answers:
a mixture model acknowledges possibility that, for a finite
dataset, both models or none could be acceptable
standard (proper and informative) prior modeling can be
reproduced in this setting, but non-informative (improper)
priors are also manageable therein, provided both models are
first reparameterised towards shared parameters, e.g. location
and scale parameters
in the special case when all parameters are common
Mα : x ∼ αf1(x|θ) + (1 − α)f2(x|θ) , 0 ≤ α ≤ 1
if θ is a location parameter, a flat prior π(θ) ∝ 1 is available
Weakly informative motivations
using the same parameters or some identical parameters on
both components highlights that opposition between the two
components is not an issue of enjoying different parameters
those common parameters are nuisance parameters, to be
integrated out [unlike Lindley’s paradox]
prior model weights ωi rarely discussed in classical Bayesian
approach, even though they have a linear impact on posterior
probabilities. Here, prior modeling only involves selecting a
prior on α, e.g., α ∼ B(a0, a0)
while a0 impacts posterior on α, it always leads to mass
accumulation near 1 or 0, i.e. favours most likely model
sensitivity analysis straightforward to carry out
approach easily calibrated by parametric bootstrap providing
reference posterior of α under each model
natural Metropolis–Hastings alternative
Poisson/Geometric
choice between Poisson P(λ) and Geometric Geo(p)
distributions
mixture with common parameter λ
Mα : αP(λ) + (1 − α)Geo(1/(1+λ))
Allows for Jeffreys prior since resulting posterior is proper
independent Metropolis–within–Gibbs with proposal
distribution on λ equal to Poisson posterior (with acceptance
rate larger than 75%)
Beta prior
When α ∼ Be(a0, a0), the full conditional posterior given the latent
allocations ζ is α | ζ ∼ Be(n1(ζ) + a0, n2(ζ) + a0), where nj(ζ) denotes
the number of observations allocated to component j
Exact Bayes factor opposing Poisson and Geometric
$$B_{12} = \frac{\Gamma\left(n + 2 + \sum_{i=1}^{n} x_i\right)}{n^{\,n\bar{x}_n}\,\prod_{i=1}^{n} x_i!\;\Gamma(n+2)}$$
although arbitrary from a purely mathematical viewpoint
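To make the three conditional steps concrete, here is a minimal Metropolis-within-Gibbs sketch for this Poisson/Geometric mixture; it assumes the improper prior π(λ) ∝ 1/λ and uses the corresponding Gamma "Poisson posterior" as independent proposal, so details beyond the slides (prior constant, initialisation, burn-in) are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def log_cond_lambda(lam, x, zeta):
    """Log full conditional of lambda given allocations zeta (1 = Poisson,
    2 = Geometric), under the (assumed) improper prior pi(lambda) = 1/lambda."""
    lp = -np.log(lam)
    lp += stats.poisson.logpmf(x[zeta == 1], lam).sum()
    # Geo(1/(1+lambda)) on {0,1,...}; scipy's geom lives on {1,2,...}, hence x+1
    lp += stats.geom.logpmf(x[zeta == 2] + 1, 1.0 / (1.0 + lam)).sum()
    return lp

def gibbs(x, a0=0.5, n_iter=5000):
    n, S = len(x), x.sum()
    alpha, lam = 0.5, max(x.mean(), 0.1)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # 1. allocations: P(zeta_i = 1) proportional to alpha f_1(x_i | lambda)
        p1 = alpha * stats.poisson.pmf(x, lam)
        p2 = (1.0 - alpha) * stats.geom.pmf(x + 1, 1.0 / (1.0 + lam))
        zeta = np.where(rng.random(n) < p1 / (p1 + p2), 1, 2)
        n1 = (zeta == 1).sum()
        # 2. weight: Beta full conditional Be(n1 + a0, n2 + a0)
        alpha = rng.beta(n1 + a0, n - n1 + a0)
        # 3. lambda: independent Metropolis step with Gamma(S, n) proposal,
        #    i.e. the all-Poisson posterior under pi(lambda) = 1/lambda
        prop = rng.gamma(S, 1.0 / n)
        log_ratio = (log_cond_lambda(prop, x, zeta) - log_cond_lambda(lam, x, zeta)
                     + stats.gamma.logpdf(lam, S, scale=1.0 / n)
                     - stats.gamma.logpdf(prop, S, scale=1.0 / n))
        if np.log(rng.random()) < log_ratio:
            lam = prop
        draws[t] = alpha, lam
    return draws

x = rng.poisson(4, size=1000)               # data from the Poisson component
out = gibbs(x)
print("posterior median of alpha:", np.median(out[1000:, 0]))
```

Rerunning the same sampler on datasets simulated from each component separately provides the parametric-bootstrap calibration of the posterior of α mentioned on the previous slide.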
Parameter estimation
[Two figure panels: posterior means of λ and posterior medians of α for 100
Poisson P(4) datasets of size n = 1000, for a0 = .0001, .001, .01, .1, .2, .3,
.4, .5; each posterior approximation is based on 10⁴ Metropolis-Hastings
iterations.]
Consistency
[Figure, three panels for a0 = .1, .3, .5, x-axis log(sample size): posterior
means (sky-blue) and medians (grey-dotted) of α, over 100 Poisson P(4)
datasets for sample sizes from 1 to 1000.]
Behaviour of Bayes factor
[Figure, three panels, x-axis log(sample size): comparison between P(M1|x)
(red dotted area) and posterior medians of α (grey zone) for 100 Poisson
P(4) datasets with sample sizes n between 1 and 1000, for a0 = .001, .1, .5.]
Normal-normal comparison
comparison of a normal N(θ1, 1) with a normal N(θ2, 2)
distribution
mixture with identical location parameter θ
αN(θ, 1) + (1 − α)N(θ, 2)
Jeffreys prior π(θ) = 1 can be used, since posterior is proper
Reference (improper) Bayes factor
$$B_{12} = 2^{(n-1)/2} \exp\left\{-\frac{1}{4}\sum_{i=1}^{n}(x_i - \bar{x})^2\right\}$$
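A quick numerical check of this closed form (a minimal sketch; it reads the second component as having variance 2 and takes B12 in favour of the N(θ, 1) model):

```python
import numpy as np

def log_B12(x):
    # log reference Bayes factor of N(theta, 1) against N(theta, 2),
    # computed under the flat prior pi(theta) = 1
    s = np.sum((x - x.mean()) ** 2)
    return (len(x) - 1) / 2 * np.log(2.0) - s / 4.0

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)    # data generated from the first model
print(np.exp(log_B12(x)))             # typically well above 1, favouring N(theta, 1)
```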
Consistency
[Figure, four panels for n = 15, 50, 100, 500: posterior means (wheat) and
medians of α (dark wheat), compared with posterior probabilities of M0
(gray), for a N(0, 1) sample, derived from 100 datasets; each posterior
approximation is based on 10⁴ MCMC iterations.]
Comparison with posterior probability
[Figure: ranges of log(1 − E[α|x]) (gray) and log(1 − p(M1|x)) (red dotted)
over 100 N(0, 1) samples as sample size n grows from 1 to 500, where α is
the weight of N(0, 1) in the mixture model; the shaded areas indicate the
range of the estimates; each plot is based on a Beta prior with
a0 = .1, .2, .3, .4, .5, 1 and each posterior approximation is based on 10⁴
iterations.]
Comments
convergence to one boundary value as sample size n grows
impact of hyperparameter a0 slowly vanishes as n increases, but
remains present for moderate sample sizes
when simulated sample is neither from N(θ1, 1) nor from
N(θ2, 2), behaviour of posterior varies, depending on which
distribution is closest
Logit or Probit?
binary dataset, R dataset about diabetes in 200 Pima Indian
women with body mass index as explanatory variable
comparison of logit and probit fits could be suitable. We are
thus comparing both fits via our method
$$M_1 : y_i \mid x_i, \theta_1 \sim \mathcal{B}(1, p_i) \quad\text{where}\quad p_i = \frac{\exp(x_i^T\theta_1)}{1 + \exp(x_i^T\theta_1)}$$
$$M_2 : y_i \mid x_i, \theta_2 \sim \mathcal{B}(1, q_i) \quad\text{where}\quad q_i = \Phi(x_i^T\theta_2)$$
Common parameterisation
Local reparameterisation strategy that rescales the parameters of the
probit model M2 so that the MLE's of both models coincide
[Choudhuri et al., 2007]
$$\Phi(x_i^T\theta_2) \approx \frac{\exp(k\, x_i^T\theta_2)}{1 + \exp(k\, x_i^T\theta_2)}$$
and use best estimate of k to bring both parameters into coherency:
$$(k_0, k_1) = (\theta_{01}/\theta_{02},\; \theta_{11}/\theta_{12})\,,$$
reparameterise M1 and M2 as
$$M_1 : y_i \mid x_i, \theta \sim \mathcal{B}(1, p_i) \quad\text{where}\quad p_i = \frac{\exp(x_i^T\theta)}{1 + \exp(x_i^T\theta)}$$
$$M_2 : y_i \mid x_i, \theta \sim \mathcal{B}(1, q_i) \quad\text{where}\quad q_i = \Phi(x_i^T(\kappa^{-1}\theta))\,,$$
with κ⁻¹θ = (θ0/k0, θ1/k1).
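As an illustration of the rescaling step, here is a sketch on simulated data (the Pima dataset itself is not reproduced here; statsmodels is used for the two MLE fits, and the covariate values are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
bmi = rng.normal(32.0, 6.0, size=200)      # stand-in for the body mass index covariate
X = sm.add_constant(bmi)
theta = np.array([-4.0, 0.1])              # hypothetical logistic coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta)))

theta1_hat = sm.Logit(y, X).fit(disp=0).params    # (theta_01, theta_11)
theta2_hat = sm.Probit(y, X).fit(disp=0).params   # (theta_02, theta_12)

# componentwise ratios aligning the two parameterisations
k0, k1 = theta1_hat / theta2_hat
print("(k0, k1) =", (k0, k1))   # close to the classical logit/probit scaling of about 1.6-1.7
```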
Prior modelling
Under default g-prior
$$\theta \sim \mathcal{N}_2(0,\, n(X^TX)^{-1})$$
full conditional posterior distribution given the allocations ζ
$$\pi(\theta \mid y, X, \zeta) \propto \frac{\exp\left\{\sum_{i;\zeta_i=1} y_i x_i^T\theta\right\}}{\prod_{i;\zeta_i=1}\left[1 + \exp(x_i^T\theta)\right]}\,\exp\left\{-\theta^T(X^TX)\theta/2n\right\}
\times \prod_{i;\zeta_i=2} \Phi(x_i^T(\kappa^{-1}\theta))^{y_i}\left(1 - \Phi(x_i^T(\kappa^{-1}\theta))\right)^{1-y_i}$$
hence posterior distribution clearly defined
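In code, the (unnormalised) log of this full conditional can be evaluated directly; a minimal sketch, with Φ the standard normal cdf and kappa_inv the rescaling vector (1/k0, 1/k1):

```python
import numpy as np
from scipy.stats import norm

def log_post_theta(theta, y, X, zeta, kappa_inv, n):
    """Unnormalised log full conditional of theta given allocations zeta,
    mirroring the g-prior x logit x rescaled-probit factorisation above."""
    lp = -theta @ (X.T @ X) @ theta / (2.0 * n)        # g-prior N(0, n (X'X)^{-1})
    eta = X @ theta
    in1, in2 = zeta == 1, zeta == 2
    lp += np.sum(y[in1] * eta[in1] - np.log1p(np.exp(eta[in1])))    # logit component
    q = norm.cdf(X[in2] @ (kappa_inv * theta))                      # rescaled probit
    lp += np.sum(y[in2] * np.log(q) + (1 - y[in2]) * np.log(1.0 - q))
    return lp
```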
Results

       |      |   Logistic    |     Probit
  a0   |  α   |  θ0     θ1    |  θ0/k0   θ1/k1
  .1   | .352 | -4.06   .103  |  -2.51   .064
  .2   | .427 | -4.03   .103  |  -2.49   .064
  .3   | .440 | -4.02   .102  |  -2.49   .063
  .4   | .456 | -4.01   .102  |  -2.48   .063
  .5   | .449 | -4.05   .103  |  -2.51   .064

[Figure: histograms of posteriors of α in favour of the logistic model, for
a0 = .1, .2, .3, .4, .5: (a) Pima dataset, (b) data from logistic model,
(c) data from probit model]
Survival analysis
Testing hypothesis that data comes from a
1. log-Normal(φ, κ2),
2. Weibull(α, λ), or
3. log-Logistic(γ, δ)
distribution
Corresponding mixture given by the density
$$\alpha_1 \exp\{-(\log x - \phi)^2/2\kappa^2\}\big/\sqrt{2\pi}\,x\kappa
\;+\; \alpha_2 \frac{\alpha}{\lambda}\,(x/\lambda)^{\alpha-1}\exp\{-(x/\lambda)^{\alpha}\}
\;+\; \alpha_3 (\delta/\gamma)\,(x/\gamma)^{\delta-1}\big/\big(1 + (x/\gamma)^{\delta}\big)^2$$
where α1 + α2 + α3 = 1
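For reference, this three-component density can be assembled from standard scipy distributions (a sketch; scipy's lognorm, weibull_min, and fisk match the log-Normal, Weibull, and log-Logistic parameterisations above, and the parameter values are hypothetical):

```python
import numpy as np
from scipy import stats

def mixture_pdf(x, w, phi, kappa, alpha, lam, gamma_, delta):
    """Density of w[0] logN(phi, kappa^2) + w[1] Weibull(alpha, lam)
    + w[2] logLogistic(gamma_, delta), with w summing to 1."""
    f1 = stats.lognorm.pdf(x, s=kappa, scale=np.exp(phi))
    f2 = stats.weibull_min.pdf(x, alpha, scale=lam)
    f3 = stats.fisk.pdf(x, delta, scale=gamma_)
    return w[0] * f1 + w[1] * f2 + w[2] * f3

x = np.linspace(0.5, 10.0, 4)
print(mixture_pdf(x, (0.3, 0.3, 0.4), phi=1.0, kappa=0.5,
                  alpha=2.0, lam=3.0, gamma_=2.5, delta=3.0))
```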
Reparameterisation
Looking for common parameter(s):
$$\phi = \mu + \gamma\beta = \xi\,, \qquad \sigma^2 = \pi^2\beta^2/6 = \zeta^2\pi^2/3$$
where γ ≈ 0.5772 is the Euler–Mascheroni constant.
Allows for a noninformative prior on the common location-scale
parameter,
$$\pi(\phi, \sigma^2) = 1/\sigma^2$$
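A quick simulation check of the Weibull side of this matching, with µ = log λ and β = 1/α (a sketch; note that under scipy's Gumbel-min convention the mean of log X is µ − γβ, so the sign of the γβ term depends on the Gumbel orientation adopted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha_w, lam = 2.0, 3.0          # hypothetical Weibull(alpha, lambda) parameters
logx = np.log(stats.weibull_min.rvs(alpha_w, scale=lam, size=10**6, random_state=rng))

gamma_em, beta = 0.5772156649, 1.0 / alpha_w
print(logx.mean(), np.log(lam) - gamma_em * beta)   # location: mu -/+ gamma * beta
print(logx.var(), np.pi**2 * beta**2 / 6.0)         # scale matching: sigma^2 = pi^2 beta^2 / 6
```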
Recovery
[Figure: boxplots of the posterior distributions of the Normal weight α1
under the two scenarios: truth = Normal (left panel), truth = Gumbel (right
panel), a0 = 0.01, 0.1, 1.0, 10.0 (from left to right in each panel) and
n = 10,000 simulated observations.]
Asymptotic consistency
Posterior consistency holds for mixture testing procedure [under
minor conditions]
Two different cases
the two models, M1 and M2, are well separated
model M1 is a submodel of M2.
Posterior concentration rate
Let π be the prior and xⁿ = (x1, · · · , xn) a sample with true
density f ∗

proposition
Assume that, for all c > 0, there exist Θn ⊂ Θ1 × Θ2 and B > 0 such that
$$\pi[\Theta_n^c] \lesssim n^{-c}\,, \qquad \Theta_n \subset \{\|\theta_1\| + \|\theta_2\| \le n^B\}$$
and that there exist H ≥ 0 and L, δ > 0 such that, for j = 1, 2,
$$\sup_{\theta,\theta' \in \Theta_n} \|f_{j,\theta_j} - f_{j,\theta'_j}\|_1 \le L\, n^H \|\theta_j - \theta'_j\|\,, \quad \theta = (\theta_1, \theta_2),\ \theta' = (\theta'_1, \theta'_2)\,,$$
$$\forall\, \|\theta_j - \theta_j^*\| \le \delta:\quad \mathrm{KL}(f_{j,\theta_j}, f_{j,\theta_j^*}) \lesssim \|\theta_j - \theta_j^*\|\,.$$
Then, when f ∗ = f_{θ∗,α∗}, with α∗ ∈ [0, 1], there exists M > 0 such that
$$\pi\left[(\alpha,\theta);\ \|f_{\theta,\alpha} - f^*\|_1 > M\sqrt{\log n/n} \;\middle|\; x^n\right] = o_p(1)\,.$$
Separated models
Assumption: Models are separated, i.e. identifiability holds:
$$\forall\, \alpha, \alpha' \in [0, 1],\ \forall\, \theta_j, \theta'_j,\ j = 1, 2:\quad P_{\theta,\alpha} = P_{\theta',\alpha'} \;\Rightarrow\; \alpha = \alpha',\ \theta = \theta'$$
Further
$$\inf_{\theta_1 \in \Theta_1}\, \inf_{\theta_2 \in \Theta_2} \|f_{1,\theta_1} - f_{2,\theta_2}\|_1 > 0$$
and, for θ∗j ∈ Θj, if Pθj weakly converges to Pθ∗j, then θj −→ θ∗j
in the Euclidean topology

theorem
Under the above assumptions, for all ε > 0,
$$\pi\left[\,|\alpha - \alpha^*| > \epsilon \;\middle|\; x^n\,\right] = o_p(1)$$

theorem
If
θj → fj,θj is C2 around θ∗j, j = 1, 2,
f1,θ∗1 − f2,θ∗2, ∇f1,θ∗1, ∇f2,θ∗2 are linearly independent in y, and
there exists δ > 0 such that
$$\nabla f_{1,\theta_1^*},\ \nabla f_{2,\theta_2^*},\ \sup_{|\theta_1-\theta_1^*|<\delta} |D^2 f_{1,\theta_1}|,\ \sup_{|\theta_2-\theta_2^*|<\delta} |D^2 f_{2,\theta_2}| \;\in\; L^1$$
then
$$\pi\left[\,|\alpha - \alpha^*| > M\sqrt{\log n/n} \;\middle|\; x^n\,\right] = o_p(1)\,.$$

The theorem allows for an interpretation of α under the posterior: if the
data xⁿ is generated from model M1, then the posterior on α concentrates
around α = 1
Embedded case
Here M1 is a submodel of M2, i.e.
θ2 = (θ1, ψ), with θ2 = (θ1, ψ0 = 0)
corresponding to f2,θ2 ∈ M1
Same posterior concentration rate
$$\sqrt{\log n/n}$$
for estimating α when α∗ ∈ (0, 1) and ψ∗ ≠ 0.
Null case
Case where ψ∗ = 0, i.e., f ∗ is in model M1
Two possible paths to approximate f ∗: either α goes to 1
(path 1) or ψ goes to 0 (path 2)
New identifiability condition: Pθ,α = P∗ only if
$$\alpha = 1,\ \theta_1 = \theta_1^*,\ \theta_2 = (\theta_1^*, \psi) \quad\text{or}\quad \alpha \le 1,\ \theta_1 = \theta_1^*,\ \theta_2 = (\theta_1^*, 0)$$
Prior
$$\pi(\alpha, \theta) = \pi_\alpha(\alpha)\,\pi_1(\theta_1)\,\pi_\psi(\psi)\,, \qquad \theta_2 = (\theta_1, \psi)$$
with common (prior on) θ1
Assumptions
[B1] Regularity: Assume that θ1 → f1,θ1 and θ2 → f2,θ2 are 3
times continuously differentiable and that
$$F^*\!\left[\frac{\bar{f}_{1,\theta_1^*}^3}{\underline{f}_{1,\theta_1^*}^3}\right] < +\infty\,, \qquad \bar{f}_{1,\theta_1^*} = \sup_{|\theta_1 - \theta_1^*| < \delta} f_{1,\theta_1}\,, \quad \underline{f}_{1,\theta_1^*} = \inf_{|\theta_1 - \theta_1^*| < \delta} f_{1,\theta_1}\,,$$
$$F^*\!\left[\frac{\sup_{|\theta_1 - \theta_1^*| < \delta} |\nabla f_{1,\theta_1^*}|^3}{\underline{f}_{1,\theta_1^*}^3}\right] < +\infty\,, \qquad F^*\!\left[\frac{|\nabla f_{1,\theta_1^*}|^4}{\underline{f}_{1,\theta_1^*}^4}\right] < +\infty\,,$$
$$F^*\!\left[\frac{\sup_{|\theta_1 - \theta_1^*| < \delta} |D^2 f_{1,\theta_1^*}|^2}{\underline{f}_{1,\theta_1^*}^2}\right] < +\infty\,, \qquad F^*\!\left[\frac{\sup_{|\theta_1 - \theta_1^*| < \delta} |D^3 f_{1,\theta_1^*}|}{\underline{f}_{1,\theta_1^*}}\right] < +\infty$$
Assumptions
[B2] Integrability: There exists
$$S_0 \subset S \cap \{|\psi| > \delta_0\}$$
for some positive δ0 and satisfying Leb(S0) > 0, and such that for
all ψ ∈ S0,
$$F^*\!\left[\frac{\sup_{|\theta_1 - \theta_1^*| < \delta} f_{2,\theta_1,\psi}}{\underline{f}_{1,\theta_1^*}^4}\right] < +\infty\,, \qquad F^*\!\left[\frac{\sup_{|\theta_1 - \theta_1^*| < \delta} f_{2,\theta_1,\psi}^3}{\underline{f}_{1,\theta_1^*}^3}\right] < +\infty$$
Assumptions
[B3] Stronger identifiability: Set
$$\nabla f_{2,\theta_1^*,\psi^*}(x) = \left(\nabla_{\theta_1} f_{2,\theta_1^*,\psi^*}(x)^T,\; \nabla_{\psi} f_{2,\theta_1^*,\psi^*}(x)^T\right)^T.$$
Then for all ψ ∈ S with ψ ≠ 0, if η0 ∈ R, η1 ∈ R^{d1},
$$\eta_0\,(f_{1,\theta_1^*} - f_{2,\theta_1^*,\psi}) + \eta_1^T\, \nabla_{\theta_1} f_{1,\theta_1^*} = 0 \;\Leftrightarrow\; \eta_0 = 0,\ \eta_1 = 0$$
Consistency
theorem
Given the mixture fθ1,ψ,α = αf1,θ1 + (1 − α)f2,θ1,ψ and a sample
xⁿ = (x1, · · · , xn) issued from f1,θ∗1, under assumptions B1–B3 there
exists M > 0 such that
$$\pi\left[(\alpha, \theta);\ \|f_{\theta,\alpha} - f^*\|_1 > M\sqrt{\log n/n} \;\middle|\; x^n\right] = o_p(1)\,.$$
If α ∼ B(a1, a2), with a2 < d2, and if the prior πθ1,ψ is absolutely
continuous with positive and continuous density at (θ∗1, 0), then for
Mn −→ ∞
$$\pi\left[\,|\alpha - \alpha^*| > M_n (\log n)^{\gamma}/\sqrt{n} \;\middle|\; x^n\right] = o_p(1)\,, \qquad \gamma = \max\{(d_1 + a_2)/(d_2 - a_2),\, 1\}/2$$
M-open case
When the true model behind the data is neither of the tested
models, what happens?
issue mostly bypassed by classical Bayesian procedures
theoretically produces an α∗ away from both 0 and 1
possible (recommended?) inclusion of a Bayesian
non-parametric model within alternatives
Towards which decision?
And if we have to make a decision?
soft: consider behaviour of the posterior under prior predictives,
or under the posterior predictive [e.g., when the prior predictive
does not exist], bootstrapping behaviour, comparison with a
Bayesian non-parametric solution
hard: rethink the loss function
Conclusion
many applications of the Bayesian paradigm concentrate on
the comparison of scientific theories and on testing of null
hypotheses
natural tendency to default to Bayes factors
poorly understood sensitivity to prior modeling and posterior
calibration
Time is ripe for a paradigm shift
Down with Bayes factors!
Time is ripe for a paradigm shift
original testing problem replaced with a better controlled
estimation target
allow for posterior variability over the component frequency as
opposed to deterministic Bayes factors
range of acceptance, rejection and indecision conclusions
easily calibrated by simulation
posterior medians quickly settling near the boundary values of
0 and 1
potential derivation of a Bayesian b-value by looking at the
posterior area under the tail of the distribution of the weight
Prior modelling
Time is ripe for a paradigm shift
Partly common parameterisation always feasible and hence
allows for reference priors
removal of the absolute prohibition of improper priors in
hypothesis testing
prior on the weight α shows sensitivity that naturally vanishes
as the sample size increases
default value of a0 = 0.5 in the Beta prior
Computing aspects
Time is ripe for a paradigm shift
proposal that does not induce additional computational strain
when algorithmic solutions exist for both models, they can be
recycled towards estimating the encompassing mixture
easier than in standard mixture problems due to common
parameters that allow for original MCMC samplers to be
turned into proposals
Gibbs sampling completions useful for assessing potential
outliers but not essential to achieve a conclusion about the
overall problem

More Related Content

What's hot

Ml3 logistic regression-and_classification_error_metrics
Ml3 logistic regression-and_classification_error_metricsMl3 logistic regression-and_classification_error_metrics
Ml3 logistic regression-and_classification_error_metrics
ankit_ppt
 
Solution to the practice test ch 8 hypothesis testing ch 9 two populations
Solution to the practice test ch 8 hypothesis testing ch 9 two populationsSolution to the practice test ch 8 hypothesis testing ch 9 two populations
Solution to the practice test ch 8 hypothesis testing ch 9 two populations
Long Beach City College
 
Introduction to Generalized Linear Models
Introduction to Generalized Linear ModelsIntroduction to Generalized Linear Models
Introduction to Generalized Linear Models
richardchandler
 
Correlation & Regression
Correlation & RegressionCorrelation & Regression
Correlation & RegressionGrant Heller
 
Logistic regression with SPSS
Logistic regression with SPSSLogistic regression with SPSS
Logistic regression with SPSS
LNIPE
 
Regression analysis
Regression analysisRegression analysis
Regression analysissaba khan
 
4.5. logistic regression
4.5. logistic regression4.5. logistic regression
4.5. logistic regression
A M
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
Santosh Bhaskar
 
Regression analysis
Regression analysisRegression analysis
Regression analysisRavi shankar
 
Simple Linear Regression (simplified)
Simple Linear Regression (simplified)Simple Linear Regression (simplified)
Simple Linear Regression (simplified)
Haoran Zhang
 
An Overview of Simple Linear Regression
An Overview of Simple Linear RegressionAn Overview of Simple Linear Regression
An Overview of Simple Linear Regression
Georgian Court University
 
Bayesian statistics
Bayesian statisticsBayesian statistics
Bayesian statistics
Alberto Labarga
 
Chi square
Chi squareChi square
correlation.ppt
correlation.pptcorrelation.ppt
correlation.ppt
NayanPatil59
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
Shravan Vasishth
 
regression and correlation
regression and correlationregression and correlation
regression and correlation
Priya Sharma
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
Khaled Abd Elaziz
 
Ordinal logistic regression
Ordinal logistic regression Ordinal logistic regression
Ordinal logistic regression
Dr Athar Khan
 
Statistics Presentation week 7
Statistics Presentation week 7Statistics Presentation week 7
Statistics Presentation week 7krookroo
 

What's hot (20)

Confidence Intervals
Confidence IntervalsConfidence Intervals
Confidence Intervals
 
Ml3 logistic regression-and_classification_error_metrics
Ml3 logistic regression-and_classification_error_metricsMl3 logistic regression-and_classification_error_metrics
Ml3 logistic regression-and_classification_error_metrics
 
Solution to the practice test ch 8 hypothesis testing ch 9 two populations
Solution to the practice test ch 8 hypothesis testing ch 9 two populationsSolution to the practice test ch 8 hypothesis testing ch 9 two populations
Solution to the practice test ch 8 hypothesis testing ch 9 two populations
 
Introduction to Generalized Linear Models
Introduction to Generalized Linear ModelsIntroduction to Generalized Linear Models
Introduction to Generalized Linear Models
 
Correlation & Regression
Correlation & RegressionCorrelation & Regression
Correlation & Regression
 
Logistic regression with SPSS
Logistic regression with SPSSLogistic regression with SPSS
Logistic regression with SPSS
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
4.5. logistic regression
4.5. logistic regression4.5. logistic regression
4.5. logistic regression
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Simple Linear Regression (simplified)
Simple Linear Regression (simplified)Simple Linear Regression (simplified)
Simple Linear Regression (simplified)
 
An Overview of Simple Linear Regression
An Overview of Simple Linear RegressionAn Overview of Simple Linear Regression
An Overview of Simple Linear Regression
 
Bayesian statistics
Bayesian statisticsBayesian statistics
Bayesian statistics
 
Chi square
Chi squareChi square
Chi square
 
correlation.ppt
correlation.pptcorrelation.ppt
correlation.ppt
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
regression and correlation
regression and correlationregression and correlation
regression and correlation
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Ordinal logistic regression
Ordinal logistic regression Ordinal logistic regression
Ordinal logistic regression
 
Statistics Presentation week 7
Statistics Presentation week 7Statistics Presentation week 7
Statistics Presentation week 7
 

Viewers also liked

Convergence of ABC methods
Convergence of ABC methodsConvergence of ABC methods
Convergence of ABC methods
Christian Robert
 
Evaluating Classifiers' Performance KDD2002
Evaluating Classifiers' Performance KDD2002Evaluating Classifiers' Performance KDD2002
Evaluating Classifiers' Performance KDD2002Anna Olecka
 
Chapter 20 and 21 combined testing hypotheses about proportions 2013
Chapter 20 and 21 combined testing hypotheses about proportions 2013Chapter 20 and 21 combined testing hypotheses about proportions 2013
Chapter 20 and 21 combined testing hypotheses about proportions 2013calculistictt
 
ABC workshop: 17w5025
ABC workshop: 17w5025ABC workshop: 17w5025
ABC workshop: 17w5025
Christian Robert
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialBilkent University
 
Markov Chain Monte Carlo explained
Markov Chain Monte Carlo explainedMarkov Chain Monte Carlo explained
Markov Chain Monte Carlo explained
dariodigiuni
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methods
Christian Robert
 
Monte Carlo Simulation
Monte Carlo SimulationMonte Carlo Simulation
Monte Carlo Simulation
Ayman Hassan
 
Lecture1
Lecture1Lecture1
Lecture1rjaeh
 
Access lesson 02 Creating a Database
Access lesson 02 Creating a DatabaseAccess lesson 02 Creating a Database
Access lesson 02 Creating a DatabaseAram SE
 
Communication skills in english
Communication skills in englishCommunication skills in english
Communication skills in englishAqib Memon
 
Access lesson 06 Integrating Access
Access lesson 06  Integrating AccessAccess lesson 06  Integrating Access
Access lesson 06 Integrating AccessAram SE
 
Database and Access Power Point
Database and Access Power PointDatabase and Access Power Point
Database and Access Power PointAyee_Its_Bailey
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
John Holden
 
OWASP Khartoum Cyber Security Session
OWASP Khartoum Cyber Security SessionOWASP Khartoum Cyber Security Session
OWASP Khartoum Cyber Security Session
OWASP Khartoum
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
Christian Robert
 
01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20world01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20worldAqib Memon
 
Chapter 4 microsoft access 2010
Chapter 4 microsoft access 2010Chapter 4 microsoft access 2010
Chapter 4 microsoft access 2010
home
 

Viewers also liked (20)

Convergence of ABC methods
Convergence of ABC methodsConvergence of ABC methods
Convergence of ABC methods
 
Evaluating Classifiers' Performance KDD2002
Evaluating Classifiers' Performance KDD2002Evaluating Classifiers' Performance KDD2002
Evaluating Classifiers' Performance KDD2002
 
Chapter 20 and 21 combined testing hypotheses about proportions 2013
Chapter 20 and 21 combined testing hypotheses about proportions 2013Chapter 20 and 21 combined testing hypotheses about proportions 2013
Chapter 20 and 21 combined testing hypotheses about proportions 2013
 
Hyphothesis
HyphothesisHyphothesis
Hyphothesis
 
ABC workshop: 17w5025
ABC workshop: 17w5025ABC workshop: 17w5025
ABC workshop: 17w5025
 
Performance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorialPerformance Evaluation for Classifiers tutorial
Performance Evaluation for Classifiers tutorial
 
Markov Chain Monte Carlo explained
Markov Chain Monte Carlo explainedMarkov Chain Monte Carlo explained
Markov Chain Monte Carlo explained
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methods
 
Monte Carlo Simulation
Monte Carlo SimulationMonte Carlo Simulation
Monte Carlo Simulation
 
Lecture1
Lecture1Lecture1
Lecture1
 
Access lesson 02 Creating a Database
Access lesson 02 Creating a DatabaseAccess lesson 02 Creating a Database
Access lesson 02 Creating a Database
 
Communication skills in english
Communication skills in englishCommunication skills in english
Communication skills in english
 
Access lesson 06 Integrating Access
Access lesson 06  Integrating AccessAccess lesson 06  Integrating Access
Access lesson 06 Integrating Access
 
Database and Access Power Point
Database and Access Power PointDatabase and Access Power Point
Database and Access Power Point
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
OWASP Khartoum Cyber Security Session
OWASP Khartoum Cyber Security SessionOWASP Khartoum Cyber Security Session
OWASP Khartoum Cyber Security Session
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
 
01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20world01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20world
 
Chapter 4 microsoft access 2010
Chapter 4 microsoft access 2010Chapter 4 microsoft access 2010
Chapter 4 microsoft access 2010
 
Model inquiri
Model inquiriModel inquiri
Model inquiri
 

Similar to An overview of Bayesian testing

Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
Christian Robert
 
ISBA 2016: Foundations
ISBA 2016: FoundationsISBA 2016: Foundations
ISBA 2016: Foundations
Christian Robert
 
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
Christian Robert
 
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
Christian Robert
 
Yes III: Computational methods for model choice
Yes III: Computational methods for model choiceYes III: Computational methods for model choice
Yes III: Computational methods for model choice
Christian Robert
 
Course on Bayesian computational methods
Course on Bayesian computational methodsCourse on Bayesian computational methods
Course on Bayesian computational methods
Christian Robert
 
San Antonio short course, March 2010
San Antonio short course, March 2010San Antonio short course, March 2010
San Antonio short course, March 2010
Christian Robert
 
original
originaloriginal
originalbutest
 
Bayesian statistics using r intro
Bayesian statistics using r   introBayesian statistics using r   intro
Bayesian statistics using r intro
BayesLaplace1
 
Testing as estimation: the demise of the Bayes factor
Testing as estimation: the demise of the Bayes factorTesting as estimation: the demise of the Bayes factor
Testing as estimation: the demise of the Bayes factor
Christian Robert
 
Stat451 - Life Distribution
Stat451 - Life DistributionStat451 - Life Distribution
Stat451 - Life Distributionnpxquynh
 
MaxEnt 2009 talk
MaxEnt 2009 talkMaxEnt 2009 talk
MaxEnt 2009 talk
Christian Robert
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classification
sathish sak
 
from model uncertainty to ABC
from model uncertainty to ABCfrom model uncertainty to ABC
from model uncertainty to ABC
Christian Robert
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical researchBhaswat Chakraborty
 
“I. Conjugate priors”
“I. Conjugate priors”“I. Conjugate priors”
“I. Conjugate priors”
Steven Scott
 
01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folder01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folder
Steven Scott
 
01.conditional prob
01.conditional prob01.conditional prob
01.conditional prob
Steven Scott
 

Similar to An overview of Bayesian testing (20)

Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
 
ISBA 2016: Foundations
ISBA 2016: FoundationsISBA 2016: Foundations
ISBA 2016: Foundations
 
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
 
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
 
Yes III: Computational methods for model choice
Yes III: Computational methods for model choiceYes III: Computational methods for model choice
Yes III: Computational methods for model choice
 
Course on Bayesian computational methods
Course on Bayesian computational methodsCourse on Bayesian computational methods
Course on Bayesian computational methods
 
San Antonio short course, March 2010
San Antonio short course, March 2010San Antonio short course, March 2010
San Antonio short course, March 2010
 
original
originaloriginal
original
 
JSM 2011 round table
JSM 2011 round tableJSM 2011 round table
JSM 2011 round table
 
JSM 2011 round table
JSM 2011 round tableJSM 2011 round table
JSM 2011 round table
 
Bayesian statistics using r intro
Bayesian statistics using r   introBayesian statistics using r   intro
Bayesian statistics using r intro
 
Testing as estimation: the demise of the Bayes factor
Testing as estimation: the demise of the Bayes factorTesting as estimation: the demise of the Bayes factor
Testing as estimation: the demise of the Bayes factor
 
Stat451 - Life Distribution
Stat451 - Life DistributionStat451 - Life Distribution
Stat451 - Life Distribution
 
MaxEnt 2009 talk
MaxEnt 2009 talkMaxEnt 2009 talk
MaxEnt 2009 talk
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classification
 
from model uncertainty to ABC
from model uncertainty to ABCfrom model uncertainty to ABC
from model uncertainty to ABC
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical research
 
“I. Conjugate priors”
“I. Conjugate priors”“I. Conjugate priors”
“I. Conjugate priors”
 
01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folder01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folder
 
01.conditional prob
01.conditional prob01.conditional prob
01.conditional prob
 

More from Christian Robert

Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
Christian Robert
 
Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
Christian Robert
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
Christian Robert
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
Christian Robert
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
Christian Robert
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
Christian Robert
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
Christian Robert
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
Christian Robert
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
Christian Robert
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
Christian Robert
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
Christian Robert
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
Christian Robert
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
Christian Robert
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
Christian Robert
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
Christian Robert
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
Christian Robert
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
Christian Robert
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
Christian Robert
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
Christian Robert
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
Christian Robert
 

More from Christian Robert (20)

Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
 
Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
 

Recently uploaded

Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
PedroFerreira53928
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
Col Mukteshwar Prasad
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
GeoBlogs
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 

Recently uploaded (20)

Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Basic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumersBasic phrases for greeting and assisting costumers
Basic phrases for greeting and assisting costumers
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
Fish and Chips - have they had their chips
Fish and Chips - have they had their chipsFish and Chips - have they had their chips
Fish and Chips - have they had their chips
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 

An overview of Bayesian testing

  • 1. Several perspectives and solutions on Bayesian testing of hypotheses Christian P. Robert Universit´e Paris-Dauphine, Paris & University of Warwick, Coventry
  • 2. Outline Bayesian testing of hypotheses Significance tests: one new parameter Noninformative solutions Jeffreys-Lindley paradox Deviance (information criterion) Testing under incomplete information Posterior predictive checking Testing via mixtures Paradigm shift
  • 3. Testing issues Hypothesis testing central problem of statistical inference witness the recent ASA’s statement on p-values (Wasserstein, 2016) dramatically differentiating feature between classical and Bayesian paradigms wide open to controversy and divergent opinions, includ. within the Bayesian community non-informative Bayesian testing case mostly unresolved, witness the Jeffreys–Lindley paradox [Berger (2003), Mayo & Cox (2006), Gelman (2008)]
  • 4. ”proper use and interpretation of the p-value” ”Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” ”By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.” [Wasserstein, 2016]
  • 5. ”proper use and interpretation of the p-value” ”Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” ”By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.” [Wasserstein, 2016]
  • 6. Bayesian testing of hypotheses Bayesian model selection as comparison of k potential statistical models towards the selection of model that fits the data “best” mostly accepted perspective: it does not primarily seek to identify which model is “true”, but compares fits tools like Bayes factor naturally include a penalisation addressing model complexity, mimicked by Bayes Information (BIC) and Deviance Information (DIC) criteria posterior predictive tools successfully advocated in Gelman et al. (2013) even though they involve double use of data
  • 7. Bayesian testing of hypotheses Bayesian model selection as comparison of k potential statistical models towards the selection of model that fits the data “best” mostly accepted perspective: it does not primarily seek to identify which model is “true”, but compares fits tools like Bayes factor naturally include a penalisation addressing model complexity, mimicked by Bayes Information (BIC) and Deviance Information (DIC) criteria posterior predictive tools successfully advocated in Gelman et al. (2013) even though they involve double use of data
  • 8. Some difficulties tension between using (i) posterior probabilities justified by binary loss function but depending on unnatural prior weights, and (ii) Bayes factors that eliminate dependence but escape direct connection with posterior, unless prior weights are integrated within loss delicate interpretation (or calibration) of strength of the Bayes factor towards supporting a given hypothesis or model, because not a Bayesian decision rule similar difficulty with posterior probabilities, with tendency to interpret them as p-values: only report of respective strengths of fitting data to both models
  • 9. Some difficulties tension between using (i) posterior probabilities justified by binary loss function but depending on unnatural prior weights, and (ii) Bayes factors that eliminate dependence but escape direct connection with posterior, unless prior weights are integrated within loss referring to a fixed and arbitrary cuttoff value falls into the same difficulties as regular p-values no “third way” like opting out from a decision
  • 10. Some further difficulties long-lasting impact of prior modeling, i.e., choice of prior distributions on parameters of both models, despite overall consistency proof for Bayes factor discontinuity in valid use of improper priors since they are not justified in most testing situations, leading to many alternative and ad hoc solutions, where data is either used twice or split in artificial ways [or further tortured into confession] binary (accept vs. reject) outcome more suited for immediate decision (if any) than for model evaluation, in connection with rudimentary loss function [atavistic remain of Neyman-Pearson formalism]
  • 11. Some additional difficulties related impossibility to ascertain simultaneous misfit or to detect outliers no assessment of uncertainty associated with decision itself besides posterior probability difficult computation of marginal likelihoods in most settings with further controversies about which algorithm to adopt strong dependence of posterior probabilities on conditioning statistics (ABC), which undermines their validity for model assessment temptation to create pseudo-frequentist equivalents such as q-values with even less Bayesian justifications c time for a paradigm shift back to some solutions
  • 12. Bayesian tests 101 Associated with the risk R(θ, δ) = Eθ[L(θ, δ(x))] = Pθ(δ(x) = 0) if θ ∈ Θ0, Pθ(δ(x) = 1) otherwise, Bayes test The Bayes estimator associated with π and with the 0 − 1 loss is δπ (x) = 1 if P(θ ∈ Θ0|x) > P(θ ∈ Θ0|x), 0 otherwise,
  • 13. Bayesian tests 101 Associated with the risk R(θ, δ) = Eθ[L(θ, δ(x))] = Pθ(δ(x) = 0) if θ ∈ Θ0, Pθ(δ(x) = 1) otherwise, Bayes test The Bayes estimator associated with π and with the 0 − 1 loss is δπ (x) = 1 if P(θ ∈ Θ0|x) > P(θ ∈ Θ0|x), 0 otherwise,
  • 14. Bayesian tests 102 Weights errors differently under both hypotheses: Theorem (Optimal Bayes decision) Under the 0 − 1 loss function L(θ, d) =    0 if d = IΘ0 (θ) a0 if d = 1 and θ ∈ Θ0 a1 if d = 0 and θ ∈ Θ0 the Bayes procedure is δπ (x) = 1 if P(θ ∈ Θ0|x) a0/(a0 + a1) 0 otherwise
  • 15. Bayesian tests 102 Weights errors differently under both hypotheses: Theorem (Optimal Bayes decision) Under the 0 − 1 loss function L(θ, d) =    0 if d = IΘ0 (θ) a0 if d = 1 and θ ∈ Θ0 a1 if d = 0 and θ ∈ Θ0 the Bayes procedure is δπ (x) = 1 if P(θ ∈ Θ0|x) a0/(a0 + a1) 0 otherwise
  • 16. A function of posterior probabilities Definition (Bayes factors) For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0 B01 = π(Θ0|x) π(Θc 0|x) π(Θ0) π(Θc 0) = Θ0 f (x|θ)π0(θ)dθ Θc 0 f (x|θ)π1(θ)dθ [Jeffreys, ToP, 1939, V, §5.01] Bayes rule under 0 − 1 loss: acceptance if B01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
  • 17. self-contained concept Outside decision-theoretic environment: eliminates choice of π(Θ0) but depends on the choice of (π0, π1) Bayesian/marginal equivalent to the likelihood ratio Jeffreys’ scale of evidence: if log10(Bπ 10) between 0 and 0.5, evidence against H0 weak, if log10(Bπ 10) 0.5 and 1, evidence substantial, if log10(Bπ 10) 1 and 2, evidence strong and if log10(Bπ 10) above 2, evidence decisive [...fairly arbitrary!]
  • 18. self-contained concept Outside decision-theoretic environment: eliminates choice of π(Θ0) but depends on the choice of (π0, π1) Bayesian/marginal equivalent to the likelihood ratio Jeffreys’ scale of evidence: if log10(Bπ 10) between 0 and 0.5, evidence against H0 weak, if log10(Bπ 10) 0.5 and 1, evidence substantial, if log10(Bπ 10) 1 and 2, evidence strong and if log10(Bπ 10) above 2, evidence decisive [...fairly arbitrary!]
  • 19. self-contained concept Outside decision-theoretic environment: eliminates choice of π(Θ0) but depends on the choice of (π0, π1) Bayesian/marginal equivalent to the likelihood ratio Jeffreys’ scale of evidence: if log10(Bπ 10) between 0 and 0.5, evidence against H0 weak, if log10(Bπ 10) 0.5 and 1, evidence substantial, if log10(Bπ 10) 1 and 2, evidence strong and if log10(Bπ 10) above 2, evidence decisive [...fairly arbitrary!]
  • 20. self-contained concept Outside decision-theoretic environment: eliminates choice of π(Θ0) but depends on the choice of (π0, π1) Bayesian/marginal equivalent to the likelihood ratio Jeffreys’ scale of evidence: if log10(Bπ 10) between 0 and 0.5, evidence against H0 weak, if log10(Bπ 10) 0.5 and 1, evidence substantial, if log10(Bπ 10) 1 and 2, evidence strong and if log10(Bπ 10) above 2, evidence decisive [...fairly arbitrary!]
  • 21. consistency Example of a normal ¯Xn ∼ N(µ, 1/n) when µ ∼ N(0, 1), leading to B01 = (1 + n)−1/2 exp{n2 ¯x2 n /2(1 + n)}
  • 22. Point null hypotheses
"Is it of the slightest use to reject a hypothesis until we have some idea of what to put in its place?" H. Jeffreys, ToP (p.390)
Particular case H0 : θ = θ0. Take ρ0 = Pr^π(θ = θ0) and g1 prior density under H0^c.
Posterior probability of H0:
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
with marginal under H0^c
m1(x) = ∫_{Θ1} f(x|θ)g1(θ) dθ
and
B^π_01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0/(1 − ρ0)] = f(x|θ0)/m1(x)
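These formulas are directly computable; a minimal sketch for the normal case, assuming g1 = N(0, τ²) (an illustrative choice of alternative prior):

    from scipy.stats import norm

    def post_prob_H0(x, rho0=0.5, tau=1.0):
        """pi(H0 | x) for x ~ N(theta, 1), H0: theta = 0, g1 = N(0, tau^2)."""
        f0 = norm.pdf(x, 0.0, 1.0)                      # f(x | theta_0)
        m1 = norm.pdf(x, 0.0, (1.0 + tau**2) ** 0.5)    # marginal under the alternative
        return 1.0 / (1.0 + (1.0 - rho0) / rho0 * m1 / f0)

    print(post_prob_H0(0.0))   # about 0.59
    print(post_prob_H0(3.0))   # about 0.13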
  • 24. “Significance tests: one new parameter” Bayesian testing of hypotheses Significance tests: one new parameter Bayesian tests Bayes factors Improper priors for tests Conclusion Noninformative solutions Jeffreys-Lindley paradox Deviance (information criterion) Testing under incomplete information
  • 25. Fundamental setting
"Is the new parameter supported by the observations or is any variation expressible by it better interpreted as random? Thus we must set two hypotheses for comparison, the more complicated having the smaller initial probability" (Jeffreys, ToP, V, §5.0)
"...compare a specially suggested value of a new parameter, often 0 [q], with the aggregate of other possible values [q′]. We shall call q the null hypothesis and q′ the alternative hypothesis [and] we must take P(q|H) = P(q′|H) = 1/2."
  • 26. Construction of Bayes tests
Definition (Test). Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}.
  • 27. Type–one and type–two errors
Associated with the risk
R(θ, δ) = E_θ[L(θ, δ(x))] = P_θ(δ(x) = 0) if θ ∈ Θ0, P_θ(δ(x) = 1) otherwise
Theorem (Bayes test). The Bayes estimator associated with π and with the 0–1 loss is
δ^π(x) = 1 if P(θ ∈ Θ0|x) > P(θ ∉ Θ0|x), and 0 otherwise
  • 29. Jeffreys' example (§5.0)
Testing whether the mean α of a normal observation is zero:
P(q|aH) ∝ exp{−a²/2s²}
P(q′dα|aH) ∝ exp{−(a − α)²/2s²} f(α)dα
P(q′|aH) ∝ ∫ exp{−(a − α)²/2s²} f(α)dα
  • 30. A (small) point of contention
Jeffreys asserts: "Suppose that there is one old parameter α; the new parameter is β and is 0 on q. In q′ we could replace α by α′, any function of α and β: but to make it explicit that q′ reduces to q when β = 0 we shall require that α′ = α when β = 0" (V, §5.0).
This amounts to assuming identical parameters in both models, a controversial principle for model choice, or at the very best to making α and β dependent a priori, a choice contradicted by the next paragraph in ToP
  • 32. Orthogonal parameters
If
I(α, β) = diag(g_αα, g_ββ),
α and β are orthogonal, but not [a posteriori] independent, contrary to ToP assertions that "...the result will be nearly independent on previous information on old parameters" (V, §5.01), and
K = (1/f(b, a)) √(ng_ββ/2π) exp{−½ ng_ββ b²}
"[where] h(α) is irrelevant" (V, §5.01)
  • 34. Acknowledgement in ToP In practice it is rather unusual for a set of parameters to arise in such a way that each can be treated as irrelevant to the presence of any other. More usual cases are (...) where some parameters are so closely associated that one could hardly occur without the others (V, §5.04).
  • 35. Generalisation
Theorem (Optimal Bayes decision). Under the 0–1 loss function
L(θ, d) = 0 if d = I_{Θ0}(θ), a0 if d = 1 and θ ∉ Θ0, a1 if d = 0 and θ ∈ Θ0,
the Bayes procedure is
δ^π(x) = 1 if Pr^π(θ ∈ Θ0|x) ≥ a0/(a0 + a1), and 0 otherwise
  • 37. Bound comparison
Determination of a0/a1 depends on the consequences of a "wrong decision" under both circumstances. Often difficult to assess in practice, hence replaced with "golden" default bounds like .05, biased towards H0
  • 39. A function of posterior probabilities
Definition (Bayes factors). For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,
B01 = [π(Θ0|x)/π(Θ0^c|x)] / [π(Θ0)/π(Θ0^c)] = ∫_{Θ0} f(x|θ)π0(θ)dθ / ∫_{Θ0^c} f(x|θ)π1(θ)dθ
[Good, 1958 & ToP, V, §5.01]
Equivalent to Bayes rule: acceptance if B01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
  • 40. Self-contained concept
Outside decision-theoretic environment:
– eliminates choice of π(Θ0) but depends on the choice of (π0, π1)
– Bayesian/marginal equivalent to the likelihood ratio
– Jeffreys' scale of evidence: if log10(B^π_10) is between 0 and 0.5, evidence against H0 weak; if between 0.5 and 1, evidence substantial; if between 1 and 2, evidence strong; and if above 2, evidence decisive
  • 44. A major modification
When the null hypothesis is supported by a set of measure 0 against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous prior distribution [End of the story?!]
"Suppose we are considering whether a location parameter α is 0. The estimation prior probability for it is uniform and we should have to take f(α) = 0 and K[= B10] would always be infinite" (V, §5.02)
  • 46. Point null refurbishment
Requirement: defined prior distributions under both assumptions,
π0(θ) ∝ π(θ)I_{Θ0}(θ), π1(θ) ∝ π(θ)I_{Θ1}(θ),
(under the standard dominating measures on Θ0 and Θ1). Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = ρ1,
π(θ) = ρ0π0(θ) + ρ1π1(θ)
Note: if Θ0 = {θ0}, π0 is the Dirac mass at θ0
  • 48. Point null hypotheses
Particular case H0 : θ = θ0. Take ρ0 = Pr^π(θ = θ0) and g1 prior density under Ha.
Posterior probability of H0:
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
with marginal under Ha
m1(x) = ∫_{Θ1} f(x|θ)g1(θ) dθ
  • 50. Point null hypotheses (cont'd)
Dual representation:
π(Θ0|x) = [1 + (1 − ρ0)/ρ0 · m1(x)/f(x|θ0)]^{−1}
and
B^π_01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0/(1 − ρ0)] = f(x|θ0)/m1(x)
Connection:
π(Θ0|x) = [1 + (1 − ρ0)/ρ0 · 1/B^π_01(x)]^{−1}
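The connection gives a one-line conversion between Bayes factor and posterior probability, sketched below with illustrative inputs:

    def post_from_B01(B01, rho0=0.5):
        """pi(Theta0 | x) = [1 + (1 - rho0)/rho0 * 1/B01]^{-1}"""
        return 1.0 / (1.0 + (1.0 - rho0) / rho0 / B01)

    print(post_from_B01(1.0))    # 0.5 under equal prior weights
    print(post_from_B01(10.0))   # about 0.91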
  • 52. A further difficulty
Improper priors are not allowed here: if
∫_{Θ1} π1(dθ1) = ∞ or ∫_{Θ2} π2(dθ2) = ∞
then π1 or π2 cannot be coherently normalised, while the normalisation matters in the Bayes factor... remember the Bayes factor?
  • 54. ToP unaware of the problem?
A. Not entirely, as improper priors keep being used on nuisance parameters
Example of testing for a zero normal mean: "If σ is the standard error and λ the true value, λ is 0 on q. We want a suitable form for its prior on q′. (...) Then we should take
P(q dσ|H) ∝ dσ/σ
P(q′ dσ dλ|H) ∝ f(λ/σ) dσ/σ dλ/λ
where f [is a true density]" (V, §5.2).
Fallacy of the "same" σ!
  • 56. Not enough information
If s = 0 [!!!], then [for σ = |x̄|/τ, λ = σv]
P(q|θH) ∝ ∫_0^∞ (τ/|x̄|)^n exp{−½nτ²} dτ/τ,
P(q′|θH) ∝ ∫_0^∞ dτ/τ ∫_{−∞}^{∞} (τ/|x̄|)^n f(v) exp{−½n(v − τ)²} dv.
If n = 1 and f(v) is any even [density],
P(q′|θH) ∝ √(2π)/2|x̄| and P(q|θH) ∝ √(2π)/2|x̄|
and therefore K = 1 (V, §5.2).
  • 57. Strange constraints
If n ≥ 2, the condition that K = 0 for s = 0, x̄ ≠ 0 is equivalent to
∫_0^∞ f(v)v^{n−1} dv = ∞.
The function satisfying this condition for [all] n is
f(v) = 1/π(1 + v²)
This is the prior recommended by Jeffreys hereafter. But, first, many other families of densities satisfy this constraint and a scale of 1 cannot be universal! Second, s = 0 is a zero probability event...
  • 60. Comments
– ToP very imprecise about the choice of priors in the setting of tests (despite existence of Susie's Jeffreys' conventional partly proper priors)
– ToP misses the difficulty of improper priors [coherent with earlier stance], but this problem still generates debates within the B community
– some degree of goodness-of-fit testing, but against fixed alternatives
– persistence of the form K ≈ √(πn/2) (1 + t²/ν)^{−ν/2+1/2}, but ν not so clearly defined...
  • 61. Noninformative proposals Bayesian testing of hypotheses Significance tests: one new parameter Noninformative solutions Jeffreys-Lindley paradox Deviance (information criterion) Testing under incomplete information Posterior predictive checking
  • 62. what's special about the Bayes factor?!
– "The priors do not represent substantive knowledge of the parameters within the model"
– "Using Bayes' theorem, these priors can then be updated to posteriors conditioned on the data that were actually observed"
– "In general, the fact that different priors result in different Bayes factors should not come as a surprise"
– "The Bayes factor (...) balances the tension between parsimony and goodness of fit, (...) against overfitting the data"
– "In induction there is no harm in being occasionally wrong; it is inevitable that we shall be"
[Jeffreys, 1939; Ly et al., 2015]
  • 63. what's wrong with the Bayes factor?!
– the (1/2, 1/2) partition between hypotheses has very little to suggest in terms of extensions
– central difficulty stands with the issue of picking a prior probability of a model
– unfortunate impossibility of using improper priors in most settings
– Bayes factors lack direct scaling associated with posterior probability and loss function
– twofold dependence on subjective prior measure, first in prior weights of models and second in lasting impact of prior modelling on the parameters
– Bayes factor offers no window into uncertainty associated with the decision
– further reasons in the summary
[Robert, 2016]
  • 64. Lindley's paradox
In a normal mean testing problem,
x̄n ∼ N(θ, σ²/n), H0 : θ = θ0,
under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor
B01(tn) = (1 + n)^{1/2} exp{−nt²n/2(1 + n)},
where tn = √n|x̄n − θ0|/σ, satisfies
B01(tn) → ∞ as n → ∞ [assuming a fixed tn]
[Lindley, 1957]
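The divergence is easy to check numerically; a minimal sketch, holding tn at the 5% frequentist cutoff 1.96 (an illustrative choice):

    import numpy as np

    def B01(t, n):
        """B01(t_n) = (1 + n)^{1/2} exp(-n t_n^2 / (2(1 + n)))"""
        return np.sqrt(1.0 + n) * np.exp(-n * t**2 / (2 * (1 + n)))

    for n in (10, 10**3, 10**5, 10**7):
        print(n, B01(1.96, n))
    # grows like sqrt(n) exp(-1.96^2 / 2): borderline 5% significance coexists
    # with overwhelming support for H0 as n increases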
  • 65. A strong impropriety
Improper priors not allowed in Bayes factors: if
∫_{Θ1} π1(dθ1) = ∞ or ∫_{Θ2} π2(dθ2) = ∞
then π1 or π2 cannot be coherently normalised, while the normalisation matters in the Bayes factor B12
Lack of mathematical justification for "common nuisance parameter" [and prior of]
[Berger, Pericchi, and Varshavsky, 1998; Marin and Robert, 2013]
  • 67. On some resolutions of the paradox
– use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lack complete proper Bayesian justification [Berger & Pericchi, 2001]
– use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013]; péché de jeunesse: equating the values of the prior densities at the point-null value θ0, ρ0 = (1 − ρ0)π1(θ0) [Robert, 1993]
– calibration via the posterior predictive distribution, which uses the data twice
– matching priors, whose sole purpose is to bring frequentist and Bayesian coverages as close as possible [Datta & Mukerjee, 2004]
– use of score functions extending the log score function, log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2), that are independent of the normalising constant [Dawid et al., 2013; Dawid & Musio, 2015]
– non-local priors correcting default priors towards more balanced error rates [Johnson & Rossell, 2010; Consonni et al., 2013]
  • 74. Pseudo-Bayes factors
Idea: use one part x[i] of the data x to make the prior proper: πi improper but πi(·|x[i]) proper, and
∫ fi(x[n/i]|θi) πi(θi|x[i])dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i])dθj
independent of the normalizing constant. Use the remaining x[n/i] to run the test as if πj(θj|x[i]) were the true prior
  • 77. Motivation
– provides a working principle for improper priors
– gather enough information from data to achieve properness, and use this properness to run the test on the remaining data
– does not use x twice as in Aitkin's (1991)
  • 80. Issues
– depends on the choice of x[i]
– many ways of combining pseudo-Bayes factors:
AIBF = B^N_ji (1/L) Σ_ℓ B_ij(x[ℓ])
MIBF = B^N_ji med[B_ij(x[ℓ])]
GIBF = B^N_ji exp{(1/L) Σ_ℓ log B_ij(x[ℓ])}
– not often an exact Bayes factor and thus lacking inner coherence: B12 ≠ B10B02 and B01 ≠ 1/B10
[Berger & Pericchi, 1996]
  • 82. Fractional Bayes factor
Idea: use directly the likelihood to separate the training sample from the testing sample:
B^F_12 = B12(x) × ∫ L^b_2(θ2)π2(θ2)dθ2 / ∫ L^b_1(θ1)π1(θ1)dθ1
[O'Hagan, 1995]
Proportion b of the sample used to gain properness
  • 84. Fractional Bayes factor (cont'd)
Example (Normal mean):
B^F_12 = (1/√b) e^{n(b−1)x̄n²/2}
corresponds to the exact Bayes factor for the prior N(0, (1 − b)/nb):
– if b constant, prior variance goes to 0
– if b = 1/n, prior variance stabilises around 1
– if b = n^{−α}, α < 1, prior variance goes to 0 too
Call to external principles to pick the order of b
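A small sketch of how the answer moves with b in this normal-mean example (the values of n and x̄n are illustrative):

    import numpy as np

    def log_FBF12(xbar, n, b):
        """log B^F_12 = -(1/2) log b + n (b - 1) xbar^2 / 2"""
        return -0.5 * np.log(b) + n * (b - 1) * xbar**2 / 2

    n, xbar = 100, 0.3
    for b in (1 / n, n ** -0.5, 0.5):
        print(b, log_FBF12(xbar, n, b))
    # the outcome varies with b, hence the call to external principles to pick it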
  • 85. Bayesian predictive
"If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution. This is really a self-consistency check: an observed discrepancy can be due to model misfit or chance." (BDA, p.143)
Use of the posterior predictive
p(y^rep|y) = ∫ p(y^rep|θ)π(θ|y) dθ
and of a measure of discrepancy T(·, ·), replacing the p-value
p(y|θ) = P(T(y^rep, θ) ≥ T(y, θ)|θ)
with the Bayesian posterior p-value
P(T(y^rep, θ) ≥ T(y, θ)|y) = ∫ p(y|θ)π(θ|y) dθ
  • 87. Issues
"the posterior predictive p-value is such a [Bayesian] probability statement, conditional on the model and data, about what might be expected in future replications." (BDA, p.151)
– sounds too much like a p-value...!
– relies on choice of T(·, ·)
– seems to favour overfitting
– (again) using the data twice (once for the posterior and twice in the p-value)
– needs to be calibrated (back to 0.05?)
– general difficulty in interpreting
– where is the penalty for model complexity?
  • 88. Jeffreys–Lindley paradox Bayesian testing of hypotheses Significance tests: one new parameter Noninformative solutions Jeffreys-Lindley paradox Lindley’s paradox dual versions of the paradox Bayesian resolutions Deviance (information criterion) Testing under incomplete information
  • 89. Lindley's paradox
In a normal mean testing problem,
x̄n ∼ N(θ, σ²/n), H0 : θ = θ0,
under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor
B01(tn) = (1 + n)^{1/2} exp{−nt²n/2(1 + n)},
where tn = √n|x̄n − θ0|/σ, satisfies
B01(tn) → ∞ as n → ∞ [assuming a fixed tn]
[Lindley, 1957]
  • 90. Two versions of the paradox
"the weight of Lindley's paradoxical result (...) burdens proponents of the Bayesian practice" [Lad, 2003]
– official version, opposing frequentist and Bayesian assessments [Lindley, 1957]
– intra-Bayesian version, blaming vague and improper priors for the Bayes factor misbehaviour: if π1(·|σ) depends on a scale parameter σ, it is often the case that B01(x) → +∞ as σ → ∞ for a given x, meaning H0 is always accepted [Robert, 1992, 2013]
  • 91. where does it matter?
In the normal case, Z ∼ N(θ, 1), θ ∼ N(0, α²), with Bayes factor
B10(z) = exp{z²α²/2(1 + α²)} / √(1 + α²) = √(1 − λ) exp{λz²/2}
[where λ = α²/(1 + α²)]
  • 92. Evacuation of the first version
Two paradigms [(b) versus (f)]:
– one (b) operates on the parameter space Θ, while the other (f) is produced from the sample space
– one (f) relies solely on the point-null hypothesis H0 and the corresponding sampling distribution, while the other (b) opposes H0 to a (predictive) marginal version of H1
– one (f) could reject "a hypothesis that may be true (...) because it has not predicted observable results that have not occurred" (Jeffreys, ToP, VII, §7.2), while the other (b) conditions upon the observed value x^obs
– one (f) cannot agree with the likelihood principle, while the other (b) is almost uniformly in agreement with it
– one (f) resorts to an arbitrary fixed bound α on the p-value, while the other (b) refers to the (default) boundary probability of 1/2
  • 93. Nothing's wrong with the second version
– n, the prior's scale factor: prior variance n times larger than the observation variance, and when n goes to ∞ the Bayes factor goes to ∞ no matter what the observation is; n becomes what Lindley (1957) calls "a measure of lack of conviction about the null hypothesis"
– when prior diffuseness under H1 increases, the only relevant information becomes that θ could be equal to θ0, and this overwhelms any evidence to the contrary contained in the data
– the mass of the prior distribution in any fixed neighbourhood of the null hypothesis vanishes under H1
Deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not choose it
  • 95. On some resolutions of the second version
– use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lack complete proper Bayesian justification [Berger & Pericchi, 2001]
– use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013]; péché de jeunesse: equating the values of the prior densities at the point-null value θ0, ρ0 = (1 − ρ0)π1(θ0) [Robert, 1993]
– use of the posterior predictive distribution, which uses the data twice
– matching priors, whose sole purpose is to bring frequentist and Bayesian coverages as close as possible [Datta & Mukerjee, 2004]
– use of score functions extending the log score function, log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2), that are independent of the normalising constant [Dawid et al., 2013; Dawid & Musio, 2015]
– non-local priors correcting default priors towards more balanced error rates [Johnson & Rossell, 2010; Consonni et al., 2013]
  • 102. Deviance (information criterion) Bayesian testing of hypotheses Significance tests: one new parameter Noninformative solutions Jeffreys-Lindley paradox Deviance (information criterion) Testing under incomplete information Posterior predictive checking Testing via mixtures Paradigm shift
  • 103. DIC as in Dayesian?
Deviance defined by
D(θ) = −2 log p(y|θ)
Effective number of parameters computed as
pD = D̄ − D(θ̄),
with D̄ the posterior expectation of D and θ̄ an estimate of θ
Deviance information criterion (DIC) defined by
DIC = pD + D̄ = D(θ̄) + 2pD
Models with smaller DIC better supported by the data
[Spiegelhalter et al., 2002]
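A minimal sketch of the DIC computation from posterior draws, for the conjugate normal-mean model under a flat prior (all modelling choices here are illustrative):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    y = rng.normal(1.0, 1.0, 50)

    # posterior for theta under y_i ~ N(theta, 1) and a flat prior: N(ybar, 1/n)
    draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), 5_000)

    def deviance(theta):
        # D(theta) = -2 log p(y | theta), vectorised over posterior draws
        return -2 * norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)

    D_bar = deviance(draws).mean()                 # posterior expectation of D
    D_hat = deviance(np.array([y.mean()]))[0]      # D at the posterior mean
    p_D = D_bar - D_hat
    print("pD =", p_D, "DIC =", D_hat + 2 * p_D)   # pD close to 1, as expected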
  • 104. "thou shalt not use the data twice"
The data is used twice in the DIC method:
1. y used once to produce the posterior π(θ|y) and the associated estimate θ̃(y)
2. y used a second time to compute the posterior expectation of the observed likelihood p(y|θ),
∫ log p(y|θ)π(dθ|y) ∝ ∫ log p(y|θ)p(y|θ)π(dθ)
  • 105. DIC for missing data models
Framework of missing data models
f(y|θ) = ∫ f(y, z|θ)dz,
with observed data y = (y1, . . . , yn) and corresponding missing data z = (z1, . . . , zn)
How do we define DIC in such settings?
  • 107. how many DICs can you fit in a mixture?
Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there.
1. observed DICs:
DIC1 = −4E_θ[log f(y|θ)|y] + 2 log f(y|E_θ[θ|y]), often a poor choice in case of unidentifiability
DIC2 = −4E_θ[log f(y|θ)|y] + 2 log f(y|θ̂(y)), which uses the posterior mode instead
DIC3 = −4E_θ[log f(y|θ)|y] + 2 log f̂(y), which instead relies on the MCMC density estimate
2. complete DICs, based on f(y, z|θ):
DIC4 = E_Z[DIC(y, Z)|y] = −4E_{θ,Z}[log f(y, Z|θ)|y] + 2E_Z[log f(y, Z|E_θ[θ|y, Z])|y]
DIC5 = −4E_{θ,Z}[log f(y, Z|θ)|y] + 2 log f(y, ẑ(y)|θ̂(y)), using Z as an additional parameter
DIC6 = −4E_{θ,Z}[log f(y, Z|θ)|y] + 2E_Z[log f(y, Z|θ̂(y))|y, θ̂(y)], in analogy with EM, θ̂ being an EM fixed point
3. conditional DICs, based on f(y|z, θ):
DIC7 = −4E_{θ,Z}[log f(y|Z, θ)|y] + 2 log f(y|ẑ(y), θ̂(y)), using MAP estimates
DIC8 = −4E_{θ,Z}[log f(y|Z, θ)|y] + 2E_Z[log f(y|Z, θ̂(y, Z))|y], conditioning first on Z and then integrating over Z conditional on y
[Celeux et al., BA, 2006]
  • 115. Galactic DICs
Example of the galaxy mixture dataset, DIC (pD):

K   DIC2 (pD2)    DIC3 (pD3)   DIC4 (pD4)   DIC5 (pD5)     DIC6 (pD6)    DIC7 (pD7)    DIC8 (pD8)
2   453 (5.56)    451 (3.66)   502 (5.50)   705 (207.88)   501 (4.48)    417 (11.07)   410 (4.09)
3   440 (9.23)    436 (4.94)   461 (6.40)   622 (167.28)   471 (15.80)   378 (13.59)   372 (7.43)
4   446 (11.58)   439 (5.41)   473 (7.52)   649 (183.48)   482 (16.51)   388 (17.47)   382 (11.37)
5   447 (10.80)   442 (5.48)   485 (7.58)   658 (180.73)   511 (33.29)   395 (20.00)   390 (15.15)
6   449 (11.26)   444 (5.49)   494 (8.49)   676 (191.10)   532 (46.83)   407 (28.23)   398 (19.34)
7   460 (19.26)   446 (5.83)   508 (8.93)   700 (200.35)   571 (71.26)   425 (40.51)   409 (24.57)
  • 116. questions
– what is the behaviour of DIC under model misspecification?
– is there an absolute scale to the DIC values, i.e. when is a difference in DICs significant?
– how can DIC handle small n's versus p's?
– should pD be defined as var(D|y)/2 [Gelman's suggestion]?
– is WAIC (Gelman and Vehtari, 2013) making a difference for being based on the expected posterior predictive?
– in an era of complex models, is DIC applicable?
[Robert, 2013]
  • 118. Testing under incomplete information Bayesian testing of hypotheses Significance tests: one new parameter Noninformative solutions Jeffreys-Lindley paradox Deviance (information criterion) Testing under incomplete information Posterior predictive checking Testing via mixtures Paradigm shift
  • 119. Likelihood-free settings
Cases when the likelihood function f(y|θ) is unavailable (in analytic and numerical senses) and when the completion step
f(y|θ) = ∫_Z f(y, z|θ) dz
is impossible or too costly because of the dimension of z
MCMC cannot be implemented!
  • 120. The ABC method
Bayesian setting: target is π(θ)f(x|θ)
When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:
ABC algorithm: for an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating
θ′ ∼ π(θ), z ∼ f(z|θ′),
until the auxiliary variable z is equal to the observed value, z = y
[Tavaré et al., 1997]
  • 123. A as A...pproximative
When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone
ρ(y, z) ≤ ε
where ρ is a distance. Output distributed from
π(θ)P_θ{ρ(y, z) < ε} ∝ π(θ|ρ(y, z) < ε)
[Pritchard et al., 1999]
  • 125. ABC algorithm
In most implementations, a further degree of A...pproximation:
Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ′
  end for
where η(y) defines a (not necessarily sufficient) statistic
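A minimal runnable sketch of this rejection sampler for a normal location model; the prior, summary statistic, and tolerance are illustrative choices, not the deck's:

    import numpy as np

    rng = np.random.default_rng(3)
    y = rng.normal(2.0, 1.0, 100)                        # "observed" sample
    eta = lambda z: np.array([z.mean(), z.std()])        # summary statistic eta(.)
    eta_y, eps = eta(y), 0.1

    accepted = []
    while len(accepted) < 200:
        theta = rng.normal(0.0, 10.0)                    # prior N(0, 100)
        z = rng.normal(theta, 1.0, len(y))               # pseudo-data from f(.|theta)
        if np.linalg.norm(eta(z) - eta_y) <= eps:        # rho{eta(z), eta(y)} <= eps
            accepted.append(theta)
    print(np.mean(accepted), np.std(accepted))           # approximate posterior moments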
  • 126. Which summary η(·)?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic:
– loss of statistical information balanced against gain in data roughening
– approximation error and information loss remain unknown
– choice of statistics induces choice of distance function towards standardisation
– may be imposed for external/practical reasons (e.g., LDA)
– may gather several non-B point estimates
– can learn about efficient combination
[Estoup et al., 2012, Genetics]
  • 129. Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)
  for t = 1 to T do
    repeat
      generate m from the prior π(M = m)
      generate θm from the prior πm(θm)
      generate z from the model fm(z|θm)
    until ρ{η(z), η(y)} < ε
    set m(t) = m and θ(t) = θm
  end for
[Grelaud et al., 2009]
  • 130. ABC estimates
Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m,
(1/T) Σ_{t=1}^T I_{m(t)=m}
Issues with implementation:
– should tolerances ε be the same for all models?
– should summary statistics vary across models (incl. their dimension)?
– should the distance measure ρ vary as well?
  • 131. ABC estimates (cont'd)
Extension to a weighted polychotomous logistic regression estimate of π(M = m|y), with non-parametric kernel weights
[Cornuet et al., DIYABC, 2009]
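A sketch of this acceptance-frequency estimate on the Gauss-versus-Laplace benchmark discussed below, with mad(y) as summary; the priors and tolerance are again illustrative:

    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.normal(0.0, 1.0, 100)                    # observed data, here from the normal model
    mad = lambda z: np.median(np.abs(z - np.median(z)))
    s_obs, eps, counts = mad(y), 0.02, [0, 0]

    for _ in range(20_000):
        m = rng.integers(2)                          # uniform prior on the model index
        theta = rng.normal(0.0, 2.0)                 # illustrative prior on the location
        if m == 0:
            z = rng.normal(theta, 1.0, len(y))
        else:
            z = rng.laplace(theta, 1 / np.sqrt(2), len(y))   # variance one, as in the benchmark
        if abs(mad(z) - s_obs) < eps:
            counts[m] += 1

    print("pi(M = normal | y) approx", counts[0] / max(sum(counts), 1))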
  • 132. Formalised framework
Central question to the validation of ABC for model choice:
When is a Bayes factor based on an insufficient statistic T(y) consistent?
Note: the conclusion drawn on T(y) through B^T_12(y) necessarily differs from the conclusion drawn on y through B12(y)
  • 134. A benchmark, if toy, example
Comparison suggested by a referee of the PNAS paper [thanks!]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]
Model M1: y ∼ N(θ1, 1), opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one).
  • 135. A benchmark, if toy, example (cont'd)
Four possible statistics:
1. sample mean ȳ (sufficient for M1 if not M2);
2. sample median med(y) (insufficient);
3. sample variance var(y) (ancillary);
4. median absolute deviation mad(y) = med(|y − med(y)|)
  • 136. A benchmark, if toy, example (cont'd) [figure: density of the probability values]
  • 137. A benchmark, if toy, example (cont'd) [figure: Gauss vs. Laplace comparison, n = 100]
  • 138. Consistency theorem
If P^n belongs to one of the two models and if µ0 = E[T] cannot be attained by the other one:
0 = min(inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) < max(inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2),
then the Bayes factor B^T_12 is consistent
  • 139. Posterior predictive checking Bayesian testing of hypotheses Significance tests: one new parameter Noninformative solutions Jeffreys-Lindley paradox Deviance (information criterion) Testing under incomplete information Posterior predictive checking Testing via mixtures Paradigm shift
  • 140. Bayesian predictive
"If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution. This is really a self-consistency check: an observed discrepancy can be due to model misfit or chance." (BDA, p.143)
Use of the posterior predictive
p(y^rep|y) = ∫ p(y^rep|θ)π(θ|y) dθ
and of a measure of discrepancy T(·, ·), replacing the p-value
p(y|θ) = P(T(y^rep, θ) ≥ T(y, θ)|θ)
with the Bayesian posterior p-value
P(T(y^rep, θ) ≥ T(y, θ)|y) = ∫ p(y|θ)π(θ|y) dθ
  • 142. Issues
"the posterior predictive p-value is such a [Bayesian] probability statement, conditional on the model and data, about what might be expected in future replications." (BDA, p.151)
– sounds too much like a p-value...!
– relies on choice of T(·, ·)
– seems to favour overfitting
– (again) using the data twice (once for the posterior and twice in the p-value)
– needs to be calibrated (back to 0.05?)
– general difficulty in interpreting
– where is the penalty for model complexity?
  • 143. Example
Normal-normal mean model:
X ∼ N(θ, 1), θ ∼ N(0, 10)
Bayesian posterior p-value for T(x) = x², m(x), B10(x)^{−1}:
∫ P(|X| ≥ |x| | θ, x)π(θ|x) dθ
  • 144. Example (cont'd)
Which interpretation? The posterior predictive p-value gets down as x gets away from 0... while the discrepancy based on B10(x) increases only mildly
[Figures: histograms of posterior predictive p-values for n = 10, 100, 1000, concentrated below 0.5]
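A simulation sketch reproducing this behaviour, with discrepancy T(x) = |x| and M posterior draws (τ² = 10 as on the slide):

    import numpy as np

    rng = np.random.default_rng(5)

    def ppp(x, tau2=10.0, M=100_000):
        """Posterior predictive p-value P(|X_rep| >= |x| | x) for
        X ~ N(theta, 1), theta ~ N(0, tau2)."""
        post_mean = tau2 / (1 + tau2) * x
        post_sd = np.sqrt(tau2 / (1 + tau2))
        theta = rng.normal(post_mean, post_sd, M)   # theta ~ pi(theta | x)
        x_rep = rng.normal(theta, 1.0)              # X_rep ~ posterior predictive
        return np.mean(np.abs(x_rep) >= abs(x))

    for x in (0.0, 1.0, 3.0, 5.0):
        print(x, ppp(x))
    # decreases from 1 at x = 0, but only slowly as |x| grows: hard to calibrate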
  • 146. goodness-of-fit [only?]
"A model is suspect if a discrepancy is of practical importance and its observed value has a tail-area probability near 0 or 1, indicating that the observed pattern would be unlikely to be seen in replications of the data if the model were true. An extreme p-value implies that the model cannot be expected to capture this aspect of the data. A p-value is a posterior probability and can therefore be interpreted directly—although not as Pr(model is true | data). Major failures of the model (...) can be addressed by expanding the model appropriately." (BDA, p.150)
– not helpful in comparing models (both may be deficient)
– anti-Ockham? i.e., may favour larger dimensions (if prior concentrated enough)
– lingering worries about using the data twice and favourable bias
– impact of the prior (only under the current model), but allows for improper priors
  • 148. Changing the testing perspective Bayesian testing of hypotheses Significance tests: one new parameter Noninformative solutions Jeffreys-Lindley paradox Deviance (information criterion) Testing under incomplete information Posterior predictive checking Testing via mixtures Paradigm shift
  • 149. Paradigm shift
New proposal for a paradigm shift (!) in the Bayesian processing of hypothesis testing and of model selection:
– convergent and naturally interpretable solution
– more extended use of improper priors
Simple representation of the testing problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1
  • 151. Paradigm shift
Simple representation of the testing problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1:
– approach inspired from the consistency result of Rousseau and Mengersen (2011) on estimated overfitting mixtures
– mixture representation not directly equivalent to the use of a posterior probability
– potential of a better approach to testing, while not expanding the number of parameters
– calibration of the posterior distribution of the weight of a model, moving away from the artificial notion of the posterior probability of a model
  • 152. Encompassing mixture model
Idea: given two statistical models,
M1 : x ∼ f1(x|θ1), θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2), θ2 ∈ Θ2,
embed both within an encompassing mixture
Mα : x ∼ αf1(x|θ1) + (1 − α)f2(x|θ2), 0 ≤ α ≤ 1 (1)
Note: both models correspond to special cases of (1), one for α = 1 and one for α = 0
Draw inference on the mixture representation (1), as if each observation was individually and independently produced by the mixture model
  • 155. Inferential motivations
Sounds like an approximation to the real model, but several definitive advantages to this paradigm shift:
– Bayes estimate of the weight α replaces the posterior probability of model M1, an equally convergent indicator of which model is "true", while avoiding artificial prior probabilities on model indices, ω1 and ω2
– interpretation of the estimator of α at least as natural as handling the posterior probability, while avoiding the zero-one loss setting
– α and its posterior distribution provide a measure of proximity to the models, while being interpretable as data propensity to stand within one model
– further allows for alternative perspectives on testing and model choice, like predictive tools, cross-validation, and information indices like WAIC
  • 156. Computational motivations
– avoids highly problematic computations of the marginal likelihoods, since standard algorithms are available for Bayesian mixture estimation
– straightforward extension to a finite collection of models, with a larger number of components, which considers all models at once and eliminates least likely models by simulation
– eliminates the difficulty of label switching that plagues both Bayesian estimation and Bayesian computation, since components are no longer exchangeable
– posterior distribution of α evaluates more thoroughly the strength of support for a given model than the single-figure outcome of a posterior probability
– variability of the posterior distribution on α allows for a more thorough assessment of the strength of this support
  • 157. Noninformative motivations
– additional feature missing from traditional Bayesian answers: a mixture model acknowledges the possibility that, for a finite dataset, both models or none could be acceptable
– standard (proper and informative) prior modeling can be reproduced in this setting, but non-informative (improper) priors also are manageable therein, provided both models are first reparameterised towards shared parameters, e.g. location and scale parameters
– in the special case when all parameters are common,
Mα : x ∼ αf1(x|θ) + (1 − α)f2(x|θ), 0 ≤ α ≤ 1
if θ is a location parameter, a flat prior π(θ) ∝ 1 is available
  • 158. Weakly informative motivations
– using the same parameters or some identical parameters on both components highlights that the opposition between the two components is not an issue of enjoying different parameters
– those common parameters are nuisance parameters, to be integrated out [unlike Lindley's paradox]
– prior model weights ωi rarely discussed in the classical Bayesian approach, even though they have a linear impact on posterior probabilities; here, prior modeling only involves selecting a prior on α, e.g., α ∼ Be(a0, a0)
– while a0 impacts the posterior on α, it always leads to mass accumulation near 1 or 0, i.e. favours the most likely model
– sensitivity analysis straightforward to carry out
– approach easily calibrated by parametric bootstrap providing a reference posterior of α under each model
– natural Metropolis–Hastings alternative
  • 159. Poisson/Geometric
Choice between Poisson P(λ) and Geometric Geo(p) distributions; mixture with common parameter λ:
Mα : αP(λ) + (1 − α)Geo(1/(1+λ))
Allows for Jeffreys prior since the resulting posterior is proper
Independent Metropolis-within-Gibbs with proposal distribution on λ equal to the Poisson posterior (with acceptance rate larger than 75%)
  • 162. Beta prior
When α ∼ Be(a0, a0) prior, full conditional posterior
α ∼ Be(n1(ζ) + a0, n2(ζ) + a0)
Exact Bayes factor opposing Poisson and Geometric:
B12 = [n^{n x̄n} / ∏_{i=1}^n xi!] × Γ(n + 2 + Σ_{i=1}^n xi) / Γ(n + 2)
although arbitrary from a purely mathematical viewpoint
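A sketch of the sampler described on the previous slides for this Poisson/Geometric mixture, with an independent Metropolis step on λ proposed from the Poisson posterior; the λ^(-1/2) prior form and the value a0 = 0.5 are assumptions of the sketch, not the deck's exact choices:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.poisson(4, size=100)           # simulated data, nominally Poisson
    n, s, a0 = len(x), x.sum(), 0.5        # Be(a0, a0) prior on alpha

    def log_f1(xi, lam):                   # Poisson component
        return stats.poisson.logpmf(xi, lam)

    def log_f2(xi, lam):                   # Geometric component, p = 1/(1 + lam)
        p = 1.0 / (1.0 + lam)
        return np.log(p) + xi * np.log1p(-p)

    def log_target(lam, z):                # conditional of lambda given allocations z
        return (-0.5 * np.log(lam)         # Jeffreys-type prior lam^(-1/2) (assumption)
                + log_f1(x[z == 1], lam).sum() + log_f2(x[z == 2], lam).sum())

    def log_q(lam):                        # Poisson-posterior proposal Gamma(s + 1/2, n)
        return stats.gamma.logpdf(lam, s + 0.5, scale=1.0 / n)

    alpha, lam, z, keep = 0.5, x.mean(), np.ones(n, dtype=int), []
    for t in range(10_000):
        l1 = np.log(alpha) + log_f1(x, lam)            # allocation step
        l2 = np.log1p(-alpha) + log_f2(x, lam)
        z = np.where(rng.random(n) < 1 / (1 + np.exp(l2 - l1)), 1, 2)
        n1 = (z == 1).sum()                            # conjugate Beta step on alpha
        alpha = rng.beta(n1 + a0, n - n1 + a0)
        prop = rng.gamma(s + 0.5, 1.0 / n)             # independent MH step on lambda
        if np.log(rng.random()) < (log_target(prop, z) - log_target(lam, z)
                                   + log_q(lam) - log_q(prop)):
            lam = prop
        keep.append(alpha)
    print("posterior median of alpha:", np.median(keep[5_000:]))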
  • 163. Parameter estimation
[Figures: posterior means of λ and posterior medians of α for 100 Poisson P(4) datasets of size n = 1000, for a0 = .0001, .001, .01, .1, .2, .3, .4, .5; each posterior approximation is based on 10^4 Metropolis-Hastings iterations]
  • 165. Consistency
[Figure: posterior means (sky-blue) and medians (grey-dotted) of α over 100 Poisson P(4) datasets, for sample sizes from 1 to 1000 and a0 = .1, .3, .5]
  • 166. Behaviour of Bayes factor
[Figure: comparison between P(M1|x) (red dotted area) and posterior medians of α (grey zone) for 100 Poisson P(4) datasets with sample sizes n between 1 and 1000, for a0 = .001, .1, .5]
  • 167. Normal-normal comparison
Comparison of a normal N(θ1, 1) with a normal N(θ2, 2) distribution; mixture with identical location parameter θ:
αN(θ, 1) + (1 − α)N(θ, 2)
Jeffreys prior π(θ) = 1 can be used, since the posterior is proper
Reference (improper) Bayes factor
B12 = 2^{(n−1)/2} exp{−¼ Σ_{i=1}^n (xi − x̄)²}
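A quick check of this reference Bayes factor on simulated data (with the sign of the exponent as reconstructed above):

    import numpy as np

    rng = np.random.default_rng(6)

    def log_B12(x):
        """log B12 = ((n - 1)/2) log 2 - sum((x - xbar)^2) / 4"""
        n = len(x)
        return (n - 1) / 2 * np.log(2) - ((x - x.mean()) ** 2).sum() / 4

    for n in (15, 100, 500):
        x1 = rng.normal(0.0, 1.0, n)             # from N(theta, 1)
        x2 = rng.normal(0.0, np.sqrt(2.0), n)    # from N(theta, 2)
        print(n, log_B12(x1), log_B12(x2))
    # log B12 grows under M1 and collapses under M2, consistently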
  • 170. Consistency
[Figure: posterior means (wheat) and medians (dark wheat) of α, compared with posterior probabilities of M0 (gray), for a N(0, 1) sample, derived from 100 datasets with sample sizes 15, 50, 100, 500; each posterior approximation is based on 10^4 MCMC iterations]
  • 171. Comparison with posterior probability
[Figure] Ranges of log(1 − E[α|x]) (gray) and log(1 − p(M1|x)) (red dotted) over 100 N(0, 1) samples as the sample size n grows from 1 to 500, where α is the weight of N(0, 1) in the mixture model; panels correspond to Beta priors with a0 = .1, .3, .4, .5. The shaded areas indicate the range of the estimates, and each posterior approximation is based on 10⁴ iterations.
  • 172. Comments
convergence to one boundary value as sample size n grows
impact of the hyperparameter a0 slowly vanishes as n increases, but is present for moderate sample sizes
when the simulated sample is neither from N(θ1, 1) nor from N(θ2, 2), the behaviour of the posterior varies, depending on which distribution is closest
  • 173. Logit or Probit?
binary dataset, R dataset about diabetes in 200 Pima Indian women, with body mass index as explanatory variable
comparison of logit and probit fits could be suitable; we are thus comparing both fits via our method
  M1 : yi | xi, θ1 ∼ B(1, pi) where pi = exp(xiθ1) / [1 + exp(xiθ1)]
  M2 : yi | xi, θ2 ∼ B(1, qi) where qi = Φ(xiθ2)
  • 174. Common parameterisation
Local reparameterisation strategy that rescales the parameters of the probit model M2 so that the MLE's of both models coincide [Choudhury et al., 2007]
  Φ(xiθ2) ≈ exp(k xiθ2) / [1 + exp(k xiθ2)]
and use the best estimate of k to bring both parameters into coherency:
  (k0, k1) = (θ01/θ02, θ11/θ12)
reparameterise M1 and M2 as
  M1 : yi | xi, θ ∼ B(1, pi) where pi = exp(xiθ) / [1 + exp(xiθ)]
  M2 : yi | xi, θ ∼ B(1, qi) where qi = Φ(xi(κ⁻¹θ)), with κ⁻¹θ = (θ0/k0, θ1/k1)
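In practice the rescaling constants can be read off the two maximum-likelihood fits. A hedged sketch in Python, assuming the Pima response y and the covariate bmi are already loaded (statsmodels' Logit and Probit supply the MLEs; variable names are ours):

import statsmodels.api as sm

X = sm.add_constant(bmi)                     # design matrix [1, bmi]
theta1 = sm.Logit(y, X).fit(disp=0).params   # logit MLE (theta_01, theta_11)
theta2 = sm.Probit(y, X).fit(disp=0).params  # probit MLE (theta_02, theta_12)
k0, k1 = theta1 / theta2                     # rescaling constants of the slide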
  • 175. Prior modelling
Under the default g-prior
  θ ∼ N2(0, n(XᵀX)⁻¹)
the full conditional posterior distribution given the allocations ζ is
  π(θ | y, X, ζ) ∝ exp{ ∑_{i; ζi=1} yi xiθ } / ∏_{i; ζi=1} [1 + exp(xiθ)] × exp{ −θᵀ(XᵀX)θ / 2n } × ∏_{i; ζi=2} Φ(xi(κ⁻¹θ))^{yi} (1 − Φ(xi(κ⁻¹θ)))^{1−yi}
hence the posterior distribution is clearly defined
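This full conditional is straightforward to evaluate pointwise, e.g. inside a Metropolis step; a minimal sketch (function and variable names are ours, with k holding the rescaling constants (k0, k1)):

import numpy as np
from scipy.stats import norm

def log_cond_theta(theta, y, X, zeta, k):
    # log pi(theta | y, X, zeta) up to an additive constant, following the
    # display above: logit likelihood on {i : zeta_i = 1}, probit likelihood
    # on {i : zeta_i = 2}, plus the g-prior N_2(0, n (X'X)^{-1}) term
    n = X.shape[0]
    eta = X @ theta
    i1, i2 = zeta == 1, zeta == 2
    log_logit = np.sum(y[i1] * eta[i1] - np.log1p(np.exp(eta[i1])))
    q = np.clip(norm.cdf(X[i2] @ (theta / k)), 1e-12, 1 - 1e-12)  # kappa^{-1} theta
    log_probit = np.sum(y[i2] * np.log(q) + (1 - y[i2]) * np.log(1 - q))
    log_prior = -theta @ (X.T @ X) @ theta / (2.0 * n)
    return log_logit + log_probit + log_prior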
  • 176. Results

          Logistic         Probit
  a0      α     θ0     θ1     θ0/k0   θ1/k1
  .1    .352  -4.06   .103   -2.51    .064
  .2    .427  -4.03   .103   -2.49    .064
  .3    .440  -4.02   .102   -2.49    .063
  .4    .456  -4.01   .102   -2.48    .063
  .5    .449  -4.05   .103   -2.51    .064

[Figure] Histograms of posteriors of α in favour of the logistic model, for a0 = .1, .2, .3, .4, .5, on (a) the Pima dataset, (b) data from the logistic model, (c) data from the probit model.
  • 177. Survival analysis
Testing the hypothesis that the data come from a
1. log-Normal(φ, κ²),
2. Weibull(α, λ), or
3. log-Logistic(γ, δ)
distribution
Corresponding mixture given by the density
  α1 exp{−(log x − φ)²/2κ²} / √(2π)xκ + α2 (α/λ)(x/λ)^{α−1} exp{−(x/λ)^α} + α3 (δ/γ)(x/γ)^{δ−1} / [1 + (x/γ)^δ]²
where α1 + α2 + α3 = 1
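For reference, the mixture density translates directly into code; a small sketch using the slide's parameterisation, with weights a = (α1, α2, α3) summing to one:

import numpy as np

def survival_mixture_pdf(x, a, phi, kappa, alpha, lam, gamma, delta):
    # three-component mixture: log-Normal + Weibull + log-Logistic
    x = np.asarray(x, dtype=float)
    lognormal = np.exp(-(np.log(x) - phi) ** 2 / (2 * kappa ** 2)) / (np.sqrt(2 * np.pi) * x * kappa)
    weibull = (alpha / lam) * (x / lam) ** (alpha - 1) * np.exp(-((x / lam) ** alpha))
    loglogistic = (delta / gamma) * (x / gamma) ** (delta - 1) / (1 + (x / gamma) ** delta) ** 2
    return a[0] * lognormal + a[1] * weibull + a[2] * loglogistic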
  • 179. Reparameterisation
Looking for common parameter(s):
  φ = µ + γβ = ξ
  σ² = π²β²/6 = ζ²π²/3
where γ ≈ 0.5772 is the Euler–Mascheroni constant
Allows for a noninformative prior on the common location-scale parameter,
  π(φ, σ²) = 1/σ²
  • 181. Recovery
[Figure] Boxplots of the posterior distributions of the Normal weight α1 under the two scenarios: truth = Normal (left panel), truth = Gumbel (right panel), a0 = 0.01, 0.1, 1.0, 10.0 (from left to right in each panel) and n = 10,000 simulated observations.
  • 182. Asymptotic consistency
Posterior consistency holds for the mixture testing procedure [under minor conditions]
Two different cases:
  the two models, M1 and M2, are well separated
  model M1 is a submodel of M2
  • 184. Posterior concentration rate
Let π be the prior and xⁿ = (x1, ..., xn) a sample with true density f*
proposition
Assume that, for all c > 0, there exist Θn ⊂ Θ1 × Θ2 and B > 0 such that
  π[Θn^c] ≤ n^{−c},  Θn ⊂ {‖θ1‖ + ‖θ2‖ ≤ n^B}
and that there exist H ≥ 0 and L, δ > 0 such that, for j = 1, 2,
  sup_{θ,θ′∈Θn} ‖f_{j,θj} − f_{j,θ′j}‖₁ ≤ L n^H ‖θj − θ′j‖,  θ = (θ1, θ2), θ′ = (θ′1, θ′2)
  KL(f_{j,θj}, f_{j,θ*j}) ≲ ‖θj − θ*j‖ for all ‖θj − θ*j‖ ≤ δ
Then, when f* = f_{θ*,α*} with α* ∈ [0, 1], there exists M > 0 such that
  π[(α, θ); ‖f_{θ,α} − f*‖₁ > M √(log n / n) | xⁿ] = op(1)
  • 185. Separated models
Assumption: models are separated, i.e. identifiability holds:
  ∀ α, α′ ∈ [0, 1], ∀ θj, θ′j, j = 1, 2 :  P_{θ,α} = P_{θ′,α′} ⇒ α = α′, θ = θ′
Further
  inf_{θ1∈Θ1} inf_{θ2∈Θ2} ‖f_{1,θ1} − f_{2,θ2}‖₁ > 0
and, for θ*j ∈ Θj, if P_{θj} weakly converges to P_{θ*j}, then θj → θ*j in the Euclidean topology
  • 186. Separated models
Assumption: models are separated, i.e. identifiability holds:
  ∀ α, α′ ∈ [0, 1], ∀ θj, θ′j, j = 1, 2 :  P_{θ,α} = P_{θ′,α′} ⇒ α = α′, θ = θ′
theorem
Under the above assumptions, for all ε > 0,
  π[|α − α*| > ε | xⁿ] = op(1)
  • 187. Separated models
Assumption: models are separated, i.e. identifiability holds:
  ∀ α, α′ ∈ [0, 1], ∀ θj, θ′j, j = 1, 2 :  P_{θ,α} = P_{θ′,α′} ⇒ α = α′, θ = θ′
theorem
If θj → f_{j,θj} is C² around θ*j, j = 1, 2, if f_{1,θ*1} − f_{2,θ*2}, ∇f_{1,θ*1} and ∇f_{2,θ*2} are linearly independent in y, and if there exists δ > 0 such that
  ∇f_{1,θ*1}, ∇f_{2,θ*2}, sup_{|θ1−θ*1|<δ} |D²f_{1,θ1}|, sup_{|θ2−θ*2|<δ} |D²f_{2,θ2}| ∈ L¹
then
  π[|α − α*| > M √(log n / n) | xⁿ] = op(1)
  • 188. Separated models
Assumption: models are separated, i.e. identifiability holds:
  ∀ α, α′ ∈ [0, 1], ∀ θj, θ′j, j = 1, 2 :  P_{θ,α} = P_{θ′,α′} ⇒ α = α′, θ = θ′
The theorem allows for an interpretation of α under the posterior: if the data xⁿ are generated from model M1, then the posterior on α concentrates around α = 1
  • 189. Embedded case
Here M1 is a submodel of M2, i.e. θ2 = (θ1, ψ), and θ2 = (θ1, ψ0 = 0) corresponds to f_{2,θ2} ∈ M1
Same posterior concentration rate √(log n / n) for estimating α when α* ∈ (0, 1) and ψ* ≠ 0
  • 190. Null case
Case where ψ* = 0, i.e., f* is in model M1
Two possible paths to approximate f*: either α goes to 1 (path 1) or ψ goes to 0 (path 2)
New identifiability condition: P_{θ,α} = P* only if
  α = 1, θ1 = θ*1, θ2 = (θ*1, ψ)   or   α ≤ 1, θ1 = θ*1, θ2 = (θ*1, 0)
Prior
  π(α, θ) = πα(α) π1(θ1) πψ(ψ),  θ2 = (θ1, ψ)
with common (prior on) θ1
  • 192. Assumptions
[B1] Regularity: assume that θ1 → f_{1,θ1} and θ2 → f_{2,θ2} are 3 times continuously differentiable and that
  F*( f̄³_{1,θ*1} / f̲³_{1,θ*1} ) < +∞, where f̄_{1,θ*1} = sup_{|θ1−θ*1|<δ} f_{1,θ1} and f̲_{1,θ*1} = inf_{|θ1−θ*1|<δ} f_{1,θ1}
  F*( sup_{|θ1−θ*1|<δ} |∇f_{1,θ*1}|³ / f̲³_{1,θ*1} ) < +∞,  F*( |∇f_{1,θ*1}|⁴ / f̲⁴_{1,θ*1} ) < +∞
  F*( sup_{|θ1−θ*1|<δ} |D²f_{1,θ*1}|² / f̲²_{1,θ*1} ) < +∞,  F*( sup_{|θ1−θ*1|<δ} |D³f_{1,θ*1}| / f̲_{1,θ*1} ) < +∞
  • 193. Assumptions
[B2] Integrability: there exists S0 ⊂ S ∩ {|ψ| > δ0}, for some positive δ0 and with Leb(S0) > 0, such that for all ψ ∈ S0
  F*( sup_{|θ1−θ*1|<δ} f_{2,θ1,ψ} / f̲⁴_{1,θ*1} ) < +∞,  F*( sup_{|θ1−θ*1|<δ} f³_{2,θ1,ψ} / f̲³_{1,θ*1} ) < +∞
  • 194. Assumptions
[B3] Stronger identifiability: set
  ∇f_{2,θ*1,ψ*}(x) = ( ∇θ1 f_{2,θ*1,ψ*}(x)ᵀ, ∇ψ f_{2,θ*1,ψ*}(x)ᵀ )ᵀ
Then, for all ψ ∈ S with ψ ≠ 0 and η0 ∈ R, η1 ∈ R^{d1},
  η0 (f_{1,θ*1} − f_{2,θ*1,ψ}) + η1ᵀ ∇θ1 f_{1,θ*1} = 0  ⇔  η0 = 0, η1 = 0
  • 195. Consistency
theorem
Given the mixture f_{θ1,ψ,α} = α f_{1,θ1} + (1 − α) f_{2,θ1,ψ} and a sample xⁿ = (x1, ..., xn) issued from f_{1,θ*1}, under assumptions B1–B3 there exists M > 0 such that
  π[(α, θ); ‖f_{θ,α} − f*‖₁ > M √(log n / n) | xⁿ] = op(1)
If α ∼ B(a1, a2) with a2 < d2, and if the prior π_{θ1,ψ} is absolutely continuous with positive and continuous density at (θ*1, 0), then for Mn → ∞
  π[|α − α*| > Mn (log n)^γ / √n | xⁿ] = op(1),  γ = max((d1 + a2)/(d2 − a2), 1)/2
  • 196. M-open case
When the true model behind the data is neither of the tested models, what happens?
  issue mostly bypassed by classical Bayesian procedures
  theoretically produces an α* away from both 0 and 1
  possible (recommended?) inclusion of a Bayesian non-parametric model within the alternatives
  • 198. Towards which decision?
And if we have to make a decision?
soft: consider the behaviour of the posterior under the prior predictives, or the posterior predictive [e.g., when the prior predictive does not exist]; bootstrapping behaviour; comparison with a Bayesian non-parametric solution
hard: rethink the loss function
  • 199. Conclusion
many applications of the Bayesian paradigm concentrate on the comparison of scientific theories and on testing of null hypotheses
natural tendency to default to Bayes factors
poorly understood sensitivity to prior modeling and posterior calibration
© Time is ripe for a paradigm shift
  • 200. Down with Bayes factors!
© Time is ripe for a paradigm shift
original testing problem replaced with a better controlled estimation target
allows for posterior variability over the component frequency, as opposed to deterministic Bayes factors
range of acceptance, rejection and indecision conclusions easily calibrated by simulation
posterior medians quickly settling near the boundary values of 0 and 1
potential derivation of a Bayesian b-value by looking at the posterior area under the tail of the distribution of the weight (see the sketch below)
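A hedged sketch of how such a b-value could be read off MCMC output; the cutoff ε and the function name are illustrative choices, not fixed by the slides:

import numpy as np

def b_value(alpha_draws, eps=0.5):
    # posterior tail area of the weight of M1: mass falling below eps
    return np.mean(np.asarray(alpha_draws) < eps)

# e.g. b_value(draws[:, 1]) on the Poisson/Geometric sampler output above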
  • 201. Prior modelling
© Time is ripe for a paradigm shift
partly common parameterisation always feasible, hence allowing for reference priors
removal of the absolute prohibition of improper priors in hypothesis testing
prior on the weight α shows sensitivity that naturally vanishes as the sample size increases
default value of a0 = 0.5 in the Beta prior
  • 202. Computing aspects
© Time is ripe for a paradigm shift
proposal that does not induce additional computational strain
when algorithmic solutions exist for both models, they can be recycled towards estimating the encompassing mixture
easier than in standard mixture problems, due to common parameters that allow the original MCMC samplers to be turned into proposals
Gibbs sampling completions useful for assessing potential outliers, but not essential to reach a conclusion about the overall problem