The vexing dilemma of Bayes tests of hypotheses
and the predicted demise of the Bayes factor
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
Outline
Summary
Significance tests
Noninformative solutions
Jeffreys-Lindley paradox
Testing via mixtures
Testing issues
Hypothesis testing
central problem of statistical inference
dramatically differentiating feature between classical and
Bayesian paradigms
wide open to controversy and divergent opinions, including
within the Bayesian community
non-informative Bayesian testing case mostly unresolved,
witness the Jeffreys–Lindley paradox
[Berger (2003), Mayo & Cox (2006), Gelman (2008)]
Testing hypotheses
Bayesian model selection as comparison of k potential
statistical models towards the selection of the model that fits the
data “best”
mostly accepted perspective: it does not primarily seek to
identify which model is “true”, but compares fits
tools like Bayes factor naturally include a penalisation
addressing model complexity, mimicked by Bayes Information
(BIC) and Deviance Information (DIC) criteria
posterior predictive tools successfully advocated in Gelman et
al. (2013) even though they involve double use of data
Some difficulties
tension between using (i) posterior probabilities justified by a
binary loss function but depending on unnatural prior weights,
and (ii) Bayes factors that eliminate this dependence but
escape direct connection with posterior distribution, unless
prior weights are integrated within the loss
subsequent and delicate interpretation (or calibration) of the
strength of the Bayes factor towards supporting a given
hypothesis or model, because it is not a Bayesian decision rule
similar difficulty with posterior probabilities, with a tendency to
interpret them as p-values (rather than the opposite!) when
they only report, through a marginal likelihood ratio, the
respective strengths of fitting the data to both models
Some further difficulties
long-lasting impact of the prior modeling, meaning the choice
of the prior distributions on the parameter spaces of both
models under comparison, despite overall consistency proof for
Bayes factor
discontinuity in use of improper priors since they are not
justified in most testing situations, leading to many alternative
if ad hoc solutions, where data is either used twice or split in
artificial ways
binary (accept vs. reject) outcome more suited for immediate
decision (if any) than for model evaluation, in connection with
rudimentary loss function
Some additional difficulties
related impossibility to ascertain simultaneous misfit or to
detect presence of outliers
no assessment of uncertainty associated with decision itself
difficult computation of marginal likelihoods in most settings
with further controversies about which algorithm to adopt
strong dependence of posterior probabilities on conditioning
statistics, which in turn undermines their validity for model
assessment, as exhibited in ABC model choice
temptation to create pseudo-frequentist equivalents such as
q-values with even less Bayesian justification
time for a paradigm shift
back to some solutions
Significance tests
Summary
Significance tests
Statistical tests
Bayesian principles
Bayesian tests
Bayes factors
Noninformative solutions
Jeffreys-Lindley paradox
Testing via mixtures
data related questions
Given a dataset
x1, . . . , xn
is it possible to answer a question related to the mechanism
producing this data?
[Answer: No!]
For instance, is E[X] > 0? Or, is xn+1 = 10⁴ possible?
[Under some assumptions...]
Statistical tests
Questions are called hypotheses, like H0 : E[X] > 0, while models
are generally indexed by parameters, e.g.,
X1 ∼ f(x|θ), θ ∈ Θ,
meaning H0 can be expressed as
H0 : θ ∈ Θ0
A statistical test of an hypothesis H0 is a function δ of the data
x1, . . . , xn taking values in {0, 1} (yes or no).
Sounds easy: estimate θ by θ̂(x1, . . . , xn) and check whether
θ̂ ∈ Θ0
just... misses random fluctuations
type x errors
Test δ associated with average error:
R(θ, δ) = Eθ[L(θ, δ(X))] = Pθ(δ(X) = 0) if θ ∈ Θ0, and Pθ(δ(X) = 1) otherwise
False rejection (Type I) and false acceptance (Type II) rates are
antagonistic: one goes up when the other goes down
A binomial example
Sequence of binary [survey] replies
x1, . . . , xn ∼iid B(θ)
and question about position of θ against 0.5:
H0 : θ ≥ 1/2
If decision based on x̄n ≥ 1/2, Type I error approximately
Φ({1/2 − p}/{p(1 − p)/n}^{1/2})
[Figure: Type I error for n = 10, 50, 100, 200, 500, 10³]
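A minimal numerical check of this normal approximation (Python sketch; the null value p = 0.55 and the accept-iff-x̄n ≥ 1/2 rule are illustrative assumptions):

    import numpy as np
    from scipy.stats import norm

    # Normal approximation to the Type I error of "accept H0: theta >= 1/2
    # iff xbar_n >= 1/2", evaluated at the (illustrative) null value p = 0.55
    p = 0.55
    for n in [10, 50, 100, 200, 500, 1000]:
        err = norm.cdf((0.5 - p) / np.sqrt(p * (1 - p) / n))
        print(f"n = {n:4d}   approx. Type I error = {err:.4f}")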
Uncertain decisions
While tests are [hard] decisions in {0, 1}, there is uncertainty about
the reliability of the decision and the error is unknown
A p-value is the probability of
obtaining results at least as extreme
as the observed one under the null:
p(x) = P_{H0}(T(X) ≥ T(x))
Binomial example:
p(x1, . . . , xn) = P_{1/2}(X̄n ≥ x̄n)
[Figure: p-value for n = 10, 50, 100, 200, 500, 10³]
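The binomial p-value can be computed exactly (Python sketch; the observed frequency x̄n = 0.54 is an illustrative choice):

    import numpy as np
    from scipy.stats import binom

    # Exact p-value P_{1/2}(Xbar_n >= xbar_n) of the binomial test,
    # for a fixed observed frequency xbar_n = 0.54 and growing n
    xbar = 0.54
    for n in [10, 50, 100, 200, 500, 1000]:
        k = int(np.ceil(n * xbar))       # smallest count with xbar_n >= 0.54
        pval = binom.sf(k - 1, n, 0.5)   # P(X >= k) under theta = 1/2
        print(f"n = {n:4d}   p-value = {pval:.4f}")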
Bayesian statistics
Publication on Dec. 23, 1763 of
“An Essay towards solving a Problem in the Doctrine of Chances”
by the late Rev. Mr. Bayes, communicated by Mr. Price in the
Philosophical Transactions of the Royal Society of London,
with original title “A Method of Calculating the Exact Probability
of All Conclusions founded on Induction”,
and maybe intended as a reply to Hume’s (1748) evaluation of the
probability of miracles
[Stigler, 2014]
Bayes’ 1763 paper
Billiard ball W rolled on a line of length one, with a uniform
probability of stopping anywhere: W stops at p.
Second ball O then rolled n times under the same assumptions. X
denotes the number of times the ball O stopped on the left of W .
“Given the number of times in which an unknown event has
happened and failed; Required the chance that the probability of
its happening in a single trial lies somewhere between any two
degrees of probability that can be named.” T. Bayes, 1763
Resolution
“Given the number of times in which an unknown event has happened and
failed; Required the chance that the probability of its happening in a single trial
lies somewhere between any two degrees of probability that can be named.” T.
Bayes, 1763
Since
P(X = x|p) = (n choose x) p^x (1 − p)^{n−x},
P(a < p < b and X = x) = ∫_a^b (n choose x) p^x (1 − p)^{n−x} dp
and
P(X = x) = ∫_0^1 (n choose x) p^x (1 − p)^{n−x} dp,
Resolution (2)
“Given the number of times in which an unknown event has happened and
failed; Required the chance that the probability of its happening in a single trial
lies somewhere between any two degrees of probability that can be named.” T.
Bayes, 1763
then
P(a < p < b|X = x) = ∫_a^b (n choose x) p^x (1 − p)^{n−x} dp / ∫_0^1 (n choose x) p^x (1 − p)^{n−x} dp
= ∫_a^b p^x (1 − p)^{n−x} dp / B(x + 1, n − x + 1),
i.e.
p|x ∼ Be(x + 1, n − x + 1)
[posterior distribution on p]
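This posterior probability is directly available from the Beta distribution function (Python sketch; the values of n, x, a, b are illustrative):

    from scipy.stats import beta

    # Bayes' "chance that p lies between a and b": the Be(x+1, n-x+1)
    # posterior probability of the interval (a, b), under the uniform prior
    n, x = 20, 12          # illustrative data: 12 successes out of 20
    a, b = 0.4, 0.7        # illustrative "degrees of probability"
    prob = beta.cdf(b, x + 1, n - x + 1) - beta.cdf(a, x + 1, n - x + 1)
    print(f"P({a} < p < {b} | X = {x}) = {prob:.4f}")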
Bayes 101
Given a statistical model attached to data x1, . . . , xn, with
Xi ∼ f(x|θ), the Bayesian paradigm introduces a probability distribution
π(θ) called prior
without turning constant parameters into random variates (!)
representing prior beliefs and available information
reference measure on parameter space Θ
inducing a posterior distribution
π(θ|x1, . . . , xn) ∝ π(θ)f (x1|θ) . . . f (xn|θ)
which entirely drives Bayesian inference
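As a concrete illustration of this product form (a minimal Python sketch; the N(θ, 1) model and N(0, 10) prior are illustrative assumptions, not taken from the slides):

    import numpy as np
    from scipy.stats import norm

    # Grid evaluation of pi(theta | x) ∝ pi(theta) prod_i f(x_i | theta),
    # for an illustrative N(theta, 1) model and N(0, 10) prior
    rng = np.random.default_rng(1)
    x = rng.normal(0.8, 1.0, size=25)
    theta = np.linspace(-3, 3, 601)
    log_post = norm.logpdf(theta, 0, np.sqrt(10)) \
             + norm.logpdf(x[:, None], theta, 1).sum(axis=0)
    post = np.exp(log_post - log_post.max())
    dt = theta[1] - theta[0]
    post /= post.sum() * dt              # normalise on the grid
    print("posterior mean ≈", (theta * post).sum() * dt)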
Historical appearance of Bayesian tests
Is the new parameter supported by the observations or is
any variation expressible by it better interpreted as
random? Thus we must set two hypotheses for
comparison, the more complicated having the smaller
initial probability
...compare a specially suggested value of a new
parameter, often 0 [q], with the aggregate of other
possible values [q′]. We shall call q the null hypothesis
and q′ the alternative hypothesis [and] we must take
P(q|H) = P(q′|H) = 1/2 .
(Jeffreys, ToP, 1939, V, §5.0)
Bayesian tests 101
Associated with the risk
R(θ, δ) = Eθ[L(θ, δ(x))] = Pθ(δ(x) = 0) if θ ∈ Θ0, and Pθ(δ(x) = 1) otherwise
Bayes test
The Bayes estimator associated with π and with the 0 − 1 loss is
δ^π(x) = 1 if P(θ ∈ Θ0|x) > P(θ ∉ Θ0|x), and 0 otherwise
Generalisation
Weighting errors differently under both hypotheses:
Theorem (Optimal Bayes decision)
Under the 0 − 1 loss function
L(θ, d) = 0 if d = IΘ0(θ), a0 if d = 1 and θ ∉ Θ0, a1 if d = 0 and θ ∈ Θ0,
the Bayes procedure is
δ^π(x) = 1 if P(θ ∈ Θ0|x) ≥ a0/(a0 + a1), and 0 otherwise
A function of posterior probabilities
Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,
B01 = [π(Θ0|x)/π(Θ0^c|x)] / [π(Θ0)/π(Θ0^c)] = ∫_{Θ0} f(x|θ)π0(θ)dθ / ∫_{Θ0^c} f(x|θ)π1(θ)dθ
[Jeffreys, ToP, 1939, V, §5.01]
Bayes rule: acceptance if
B01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
Self-contained concept
Outside decision-theoretic environment:
eliminates choice of π(Θ0)
but depends on the choice of (π0, π1)
Bayesian/marginal equivalent to the likelihood ratio
Jeffreys’ scale of evidence:
if log10(B^π_10) between 0 and 0.5, evidence against H0 weak,
if log10(B^π_10) between 0.5 and 1, evidence substantial,
if log10(B^π_10) between 1 and 2, evidence strong, and
if log10(B^π_10) above 2, evidence decisive
[...fairly arbitrary!]
regressive illustration
caterpillar dataset from Bayesian Essentials (2013): predicting
density of caterpillar nests from 10 covariates
Estimate Post. Var. log10(BF)
(Intercept) 10.8895 6.8229 2.1873 (****)
X1 -0.0044 2e-06 1.1571 (***)
X2 -0.0533 0.0003 0.6667 (**)
X3 0.0673 0.0072 -0.8585
X4 -1.2808 0.2316 0.4726 (*)
X5 0.2293 0.0079 0.3861 (*)
X6 -0.3532 1.7877 -0.9860
X7 -0.2351 0.7373 -0.9848
X8 0.1793 0.0408 -0.8223
X9 -1.2726 0.5449 -0.3461
X10 -0.4288 0.3934 -0.8949
evidence against H0: (****) decisive, (***) strong,
(**) substantial, (*) poor
A major refurbishment
Suppose we are considering whether a location parameter
α is 0. The estimation prior probability for it is uniform
and we should have to take f (α) = 0 and K[= B10]
would always be infinite (Jeffreys, ToP, V, §5.02)
When the null hypothesis is supported by a set of measure 0
against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous
prior distribution
[End of the story?!]
Requirement
Defined prior distributions under both assumptions,
π0(θ) ∝ π(θ)IΘ0(θ), π1(θ) ∝ π(θ)IΘ1(θ)
(under the standard dominating measures on Θ0 and Θ1)
Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = ρ1,
π(θ) = ρ0π0(θ) + ρ1π1(θ)
Point null hypotheses
“Is it of the slightest use to reject a hypothesis until we have some
idea of what to put in its place?” H. Jeffreys, ToP (p.390)
Particular case H0 : θ = θ0
Take ρ0 = Pr^π(θ = θ0) and g1 prior density under H0^c.
Posterior probability of H0:
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
and marginal under H0^c
m1(x) = ∫_{Θ1} f(x|θ)g1(θ) dθ
and
B^π_01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] ÷ [ρ0/(1 − ρ0)] = f(x|θ0)/m1(x)
Normal example
Testing whether the mean α of a normal observation X ∼ N(α, s²) is zero:
P(H0|x) ∝ exp{−x²/2s²}
P(H0^c|x) ∝ ∫ exp{−(x − α)²/2s²} f(α) dα
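A concrete instance of this point-null machinery (Python sketch; ρ0 = 1/2 and the N(0, τ²) choice for f(α) are illustrative assumptions, as the slide leaves f(α) generic):

    import numpy as np

    # Posterior probability of H0: alpha = 0 with rho0 = 1/2 and an
    # (assumed) N(0, tau^2) density f(alpha) under the alternative, so that
    #   m1(x) = N(x; 0, s^2 + tau^2)  and  B01(x) = f(x|0) / m1(x)
    def post_prob_H0(x, s=1.0, tau=1.0, rho0=0.5):
        f0 = np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
        m1 = np.exp(-x**2 / (2 * (s**2 + tau**2))) \
             / np.sqrt(2 * np.pi * (s**2 + tau**2))
        return rho0 * f0 / (rho0 * f0 + (1 - rho0) * m1)

    for x in [0.0, 1.0, 1.96, 3.0]:
        print(f"x = {x:4.2f}   P(H0 | x) = {post_prob_H0(x):.3f}")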
Noninformative proposals
Summary
Significance tests
Noninformative solutions
Jeffreys-Lindley paradox
Testing via mixtures
what’s special about the Bayes factor?!
“The priors do not represent substantive knowledge of the
parameters within the model”
“Using Bayes’ theorem, these priors can then be updated to
posteriors conditioned on the data that were actually
observed.”
“In general, the fact that different priors result in different
Bayes factors should not come as a surprise.”
“The Bayes factor (...) balances the tension between
parsimony and goodness of fit, (...) against overfitting the
data.”
“In induction there is no harm in being occasionally wrong; it
is inevitable that we shall be.”
[Jeffreys, 1939; Ly et al., 2015]
what’s wrong with the Bayes factor?!
(1/2, 1/2) partition between hypotheses has very little to
suggest in terms of extensions
central difficulty stands with issue of picking a prior
probability of a model
unfortunate impossibility of using improper priors in most
settings
Bayes factors lack direct scaling associated with posterior
probability and loss function
twofold dependence on subjective prior measure, first in prior
weights of models and second in lasting impact of prior
modelling on the parameters
Bayes factor offers no window into uncertainty associated with
decision
further reasons in the summary
[Robert, 2016]
Vague proper priors are not the solution
Taking a proper prior with a “very large” variance (e.g., in
BUGS) will most often result in an undefined or ill-defined limit
Example (Lindley’s paradox)
If testing H0 : θ = 0 when observing x ∼ N(θ, 1), under a normal
N(0, α) prior π1(θ),
B01(x) −→ +∞ as α −→ ∞
Learning from the sample
Definition (Learning sample)
Given an improper prior π, (x1, . . . , xn) is a learning sample if
π(·|x1, . . . , xn) is proper and a minimal learning sample if none of
its subsamples is a learning sample
There is just enough information in a minimal learning sample to
make inference about θ under the prior π
Pseudo-Bayes factors
Idea
Use one part x[i] of the data x to make the prior proper:
πi improper but πi(·|x[i]) proper
and
∫ fi(x[n/i]|θi) πi(θi|x[i]) dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i]) dθj
independent of normalizing constant
Use remaining x[n/i] to run test as if πj(θj|x[i]) is the true prior
Motivation
Provides a working principle for improper priors
Gather enough information from data to achieve properness
and use this properness to run the test on remaining data
does not use x twice as in Aitkin’s (1991)
Unexpected problems!
depends on the choice of x[i]
many ways of combining pseudo-Bayes factors
AIBF = B^N_ji × (1/L) ∑_ℓ Bij(x[ℓ])
MIBF = B^N_ji × med[Bij(x[ℓ])]
GIBF = B^N_ji × exp{(1/L) ∑_ℓ log Bij(x[ℓ])}
not often an exact Bayes factor
and thus lacking inner coherence:
B12 ≠ B10B02 and B01 ≠ 1/B10 .
[Berger & Pericchi, 1996]
Fractional Bayes factor
Idea
use directly the likelihood to separate training sample from testing
sample
B^F_12 = B12(x) × ∫ L2^b(θ2)π2(θ2)dθ2 / ∫ L1^b(θ1)π1(θ1)dθ1
[O’Hagan, 1995]
Proportion b of the sample used to gain properness
Fractional Bayes factor (cont’d)
Example (Normal mean)
B^F_12 = (1/√b) e^{n(b−1)x̄n²/2}
corresponds to exact Bayes factor for the prior N(0, (1 − b)/nb)
If b constant, prior variance goes to 0
If b = 1/n, prior variance stabilises around 1
If b = n^{−α}, α < 1, prior variance goes to 0 too.
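The three regimes for b can be checked numerically (Python sketch; n and x̄n are illustrative values):

    import numpy as np

    # Fractional Bayes factor term of the normal-mean example,
    #   B12^F = b^{-1/2} exp{n (b - 1) xbar_n^2 / 2},
    # for the three regimes of b discussed above
    def fbf(n, xbar, b):
        return np.exp(n * (b - 1) * xbar**2 / 2) / np.sqrt(b)

    n, xbar = 100, 0.3
    for b in [0.5, 1 / n, n**-0.5]:
        print(f"b = {b:.4f}   B12^F = {fbf(n, xbar, b):.4f}")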
Jeffreys–Lindley paradox
Summary
Significance tests
Noninformative solutions
Jeffreys-Lindley paradox
Lindley’s paradox
dual versions of the paradox
Bayesian resolutions
Testing via mixtures
Lindley’s paradox
In a normal mean testing problem,
x̄n ∼ N(θ, σ²/n), H0 : θ = θ0,
under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor
B01(tn) = (1 + n)^{1/2} exp{−n tn²/2(1 + n)},
where tn = √n |x̄n − θ0|/σ, satisfies
B01(tn) −→ ∞ as n −→ ∞
[assuming a fixed tn]
[Lindley, 1957]
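The paradox is easy to reproduce numerically: freezing tn at the 5% frequentist boundary, the p-value never moves while B01(tn) diverges (Python sketch):

    import numpy as np
    from scipy.stats import norm

    # Jeffreys-Lindley paradox: with t_n fixed at the 5% frequentist
    # boundary, the p-value stays constant while B01(t_n) diverges with n
    t = 1.96
    print("two-sided p-value:", round(2 * norm.sf(t), 4))
    for n in [10, 100, 1000, 10**4, 10**6]:
        B01 = np.sqrt(1 + n) * np.exp(-n * t**2 / (2 * (1 + n)))
        print(f"n = {n:7d}   B01 = {B01:10.2f}")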
Two versions of the paradox
“the weight of Lindley’s paradoxical result (...) burdens
proponents of the Bayesian practice”.
[Lad, 2003]
official version, opposing frequentist and Bayesian assessments
[Lindley, 1957]
intra-Bayesian version, blaming vague and improper priors for
the Bayes factor misbehaviour:
if π1(·|σ) depends on a scale parameter σ, it is often the case
that
B01(x) −→ +∞ as σ −→ ∞,
for a given x, meaning H0 is always accepted
[Robert, 1992, 2013]
where does it matter?
In the normal case, Z ∼ N(θ, 1), θ ∼ N(0, α²), the Bayes factor is
B10(z) = exp{z²α²/2(1 + α²)} / √(1 + α²) = √(1 − λ) exp{λz²/2}, with λ = α²/(1 + α²)
Evacuation of the first version
Two paradigms [(b) versus (f)]
one (b) operates on the parameter space Θ, while the other
(f) is produced from the sample space
one (f) relies solely on the point-null hypothesis H0 and the
corresponding sampling distribution, while the other
(b) opposes H0 to a (predictive) marginal version of H1
one (f) could reject “a hypothesis that may be true (...)
because it has not predicted observable results that have not
occurred” (Jeffreys, ToP, VII, §7.2) while the other
(b) conditions upon the observed value xobs
one (f) cannot agree with the likelihood principle, while the
other (b) is almost uniformly in agreement with it
one (f) resorts to an arbitrary fixed bound α on the p-value,
while the other (b) refers to the (default) boundary probability
of 1/2
Nothing’s wrong with the second version
n, prior’s scale factor: prior variance n times larger than the
observation variance; when n goes to ∞, the Bayes factor
goes to ∞ no matter what the observation is
n becomes what Lindley (1957) calls “a measure of lack of
conviction about the null hypothesis”
when prior diffuseness under H1 increases, only relevant
information becomes that θ could be equal to θ0, and this
overwhelms any evidence to the contrary contained in the data
mass of the prior distribution in the vicinity of any fixed
neighbourhood of the null hypothesis vanishes to zero under H1
deep coherence in the outcome: being indecisive about
the alternative hypothesis means we should not choose it
On some resolutions of the second version
use of pseudo-Bayes factors, fractional Bayes factors, &tc,
which lack complete proper Bayesian justification
[Berger & Pericchi, 2001]
use of identical improper priors on nuisance parameters, a
notion already entertained by Jeffreys
[Berger et al., 1998; Marin & Robert, 2013]
Péché de jeunesse: equating the values of the prior densities
at the point-null value θ0,
ρ0 = (1 − ρ0)π1(θ0)
[Robert, 1993]
use of the posterior predictive distribution, which uses the
data twice
matching priors, whose sole purpose is to bring frequentist
and Bayesian coverages as close as possible
[Datta & Mukerjee, 2004]
use of score functions extending the log score function
log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2),
which are independent of the normalising constant
[Dawid et al., 2013; Dawid & Musio, 2015]
non-local priors correcting default priors towards more
balanced error rates
[Johnson & Rossell, 2010; Consonni et al., 2013]
Changing the testing perspective
Summary
Significance tests
Noninformative solutions
Jeffreys-Lindley paradox
Testing via mixtures
Paradigm shift
New proposal for a paradigm shift in the Bayesian processing of
hypothesis testing and of model selection
convergent and naturally interpretable solution
more extended use of improper priors
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
Paradigm shift
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
Approach inspired from consistency result of Rousseau and
Mengersen (2011) on estimated overfitting mixtures
Mixture representation not directly equivalent to the use of a
posterior probability
Potential of a better approach to testing, while not expanding
number of parameters
Calibration of the posterior distribution of the weight of a
model, while moving from the artificial notion of the posterior
probability of a model
Encompassing mixture model
Idea: Given two statistical models,
M1 : x ∼ f1(x|θ1), θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2), θ2 ∈ Θ2,
embed both within an encompassing mixture
Mα : x ∼ αf1(x|θ1) + (1 − α)f2(x|θ2), 0 ≤ α ≤ 1 (1)
Note: Both models correspond to special cases of (1), one for
α = 1 and one for α = 0
Draw inference on mixture representation (1), as if each
observation was individually and independently produced by the
mixture model
Inferential motivations
Sounds like an approximation to the real model, but several
definitive advantages to this paradigm shift:
Bayes estimate of the weight α replaces posterior probability
of model M1, equally convergent indicator of which model is
“true”, while avoiding artificial prior probabilities on model
indices, ω1 and ω2
interpretation of estimator of α at least as natural as handling
the posterior probability, while avoiding zero-one loss setting
α and its posterior distribution provide measure of proximity
to the models, while being interpretable as data propensity to
stand within one model
further allows for alternative perspectives on testing and
model choice, like predictive tools, cross-validation, and
information indices like WAIC
Computational motivations
avoids highly problematic computations of the marginal
likelihoods, since standard algorithms are available for
Bayesian mixture estimation
straightforward extension to a finite collection of models, with
a larger number of components, which considers all models at
once and eliminates least likely models by simulation
eliminates difficulty of label switching that plagues both
Bayesian estimation and Bayesian computation, since
components are no longer exchangeable
posterior distribution of α evaluates more thoroughly strength
of support for a given model than the single figure outcome of
a posterior probability
variability of posterior distribution on α allows for a more
thorough assessment of the strength of this support
Noninformative motivations
additional feature missing from traditional Bayesian answers:
a mixture model acknowledges possibility that, for a finite
dataset, both models or none could be acceptable
standard (proper and informative) prior modeling can be
reproduced in this setting, but non-informative (improper)
priors are also manageable therein, provided both models are
first reparameterised towards shared parameters, e.g. location
and scale parameters
in special case when all parameters are common
Mα : x ∼ αf1(x|θ) + (1 − α)f2(x|θ), 0 ≤ α ≤ 1
if θ is a location parameter, a flat prior π(θ) ∝ 1 is available
Weakly informative motivations
using the same parameters or some identical parameters on
both components highlights that opposition between the two
components is not an issue of enjoying different parameters
those common parameters are nuisance parameters, to be
integrated out
prior model weights ωi rarely discussed in classical Bayesian
approach, even though linear impact on posterior probabilities.
Here, prior modeling only involves selecting a prior on α, e.g.,
α ∼ B(a0, a0)
while a0 impacts posterior on α, it always leads to mass
accumulation near 1 or 0, i.e. favours most likely model
sensitivity analysis straightforward to carry out
approach easily calibrated by parametric bootstrap providing
reference posterior of α under each model
natural Metropolis–Hastings alternative
Poisson/Geometric
choice between Poisson P(λ) and Geometric Geo(p)
distributions
mixture with common parameter λ
Mα : αP(λ) + (1 − α)Geo(1/(1 + λ))
Allows for Jeffreys prior since resulting posterior is proper
independent Metropolis–within–Gibbs with proposal
distribution on λ equal to Poisson posterior (with acceptance
rate larger than 75%)
Beta prior
With an α ∼ Be(a0, a0) prior, the full conditional posterior is
α ∼ Be(n1(ζ) + a0, n2(ζ) + a0)
Exact Bayes factor opposing Poisson and Geometric
B12 = n^{n x̄n} ∏_{i=1}^n xi! Γ(n + 2 + ∑_{i=1}^n xi) / Γ(n + 2)
although undefined from a purely mathematical viewpoint
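A minimal sketch of this Metropolis-within-Gibbs sampler (Python; the Jeffreys-type λ^{−1/2} prior and the Ga(∑xi + 1/2, n) Poisson-posterior proposal are assumptions consistent with the slides, not necessarily the authors' exact implementation):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.poisson(4, size=200)      # synthetic data from P(4)
    n, S = len(x), x.sum()
    a0 = 0.5                          # Be(a0, a0) prior on alpha

    def log_geom(x, lam):
        # Geo(1/(1+lam)) pmf on {0,1,...}: (1/(1+lam)) (lam/(1+lam))**x
        return x * np.log(lam / (1 + lam)) - np.log(1 + lam)

    def log_target(lam, z):
        # assumed Jeffreys-type prior pi(lam) ∝ lam**(-1/2)
        return (-0.5 * np.log(lam)
                + stats.poisson.logpmf(x[z == 1], lam).sum()
                + log_geom(x[z == 2], lam).sum())

    alpha, lam = 0.5, x.mean()
    for t in range(10_000):
        # 1. allocations zeta_i given (alpha, lambda)
        lp1 = np.log(alpha) + stats.poisson.logpmf(x, lam)
        lp2 = np.log(1 - alpha) + log_geom(x, lam)
        z = np.where(rng.random(n) < 1 / (1 + np.exp(lp2 - lp1)), 1, 2)
        n1 = (z == 1).sum()
        # 2. alpha from its Beta full conditional
        alpha = rng.beta(n1 + a0, n - n1 + a0)
        # 3. lambda: independent MH step with Poisson-posterior proposal
        prop = rng.gamma(S + 0.5, 1 / n)
        log_r = (log_target(prop, z) - log_target(lam, z)
                 + stats.gamma.logpdf(lam, S + 0.5, scale=1 / n)
                 - stats.gamma.logpdf(prop, S + 0.5, scale=1 / n))
        if np.log(rng.random()) < log_r:
            lam = prop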
Parameter estimation
[Figures: posterior means of λ and posterior medians of α for 100 Poisson P(4) datasets of size n = 1000, for a0 = .0001, .001, .01, .1, .2, .3, .4, .5. Each posterior approximation is based on 10⁴ Metropolis-Hastings iterations.]
Consistency
[Figure: posterior means (sky-blue) and medians (grey-dotted) of α against log(sample size), over 100 Poisson P(4) datasets for sample sizes from 1 to 1000, with panels a0 = .1, .3, .5.]
Behaviour of Bayes factor
[Figure: comparison between P(M1|x) (red dotted area) and posterior medians of α (grey zone), against log(sample size), for 100 Poisson P(4) datasets with sample sizes n between 1 and 1000, for a0 = .001, .1, .5.]
Normal-normal comparison
comparison of a normal N(θ1, 1) with a normal N(θ2, 2)
distribution
mixture with identical location parameter θ
αN(θ, 1) + (1 − α)N(θ, 2)
Jeffreys prior π(θ) = 1 can be used, since posterior is proper
Reference (improper) Bayes factor
B12 = 2^{(n−1)/2} exp{−(1/4) ∑_{i=1}^n (xi − x̄)²}
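A quick consistency check of this Bayes factor (Python sketch, computed on the log scale to avoid overflow; the minus sign in the exponent is the reconstruction above, obtained by a direct computation under the flat prior, so that large B12 favours M1):

    import numpy as np

    # log10 of B12 = 2^{(n-1)/2} exp{-(1/4) sum_i (x_i - xbar)^2}
    def log10_B12(x):
        s = ((x - x.mean())**2).sum()
        return ((len(x) - 1) / 2 * np.log(2) - s / 4) / np.log(10)

    rng = np.random.default_rng(2)
    for n in [15, 50, 100, 500]:
        x1 = rng.normal(0.0, 1.0, size=n)             # data from model M1
        x2 = rng.normal(0.0, np.sqrt(2.0), size=n)    # data from model M2
        print(f"n = {n:3d}   log10 B12: M1 data {log10_B12(x1):7.2f}, "
              f"M2 data {log10_B12(x2):7.2f}")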
Consistency
[Figure: posterior means (wheat) and medians (dark wheat) of α, compared with posterior probabilities of M0 (gray) and plotted against p(M1|x), for a N(0, 1) sample, derived from 100 datasets for sample sizes equal to 15, 50, 100, 500. Each posterior approximation is based on 10⁴ MCMC iterations.]
Comparison with posterior probability
[Figure: ranges of log(n) log(1 − E[α|x]) (gray) and log(1 − p(M1|x)) (red dotted) over 100 N(0, 1) samples as sample size n grows from 1 to 500, where α is the weight of N(0, 1) in the mixture model; panels a0 = .1, .3, .4, .5. The shaded areas indicate the range of the estimations; each plot is based on a Beta prior with a0 = .1, .2, .3, .4, .5, 1 and each posterior approximation on 10⁴ iterations.]
Comments
convergence to one boundary value as sample size n grows
impact of hyperparameter a0 slowly vanishes as n increases, but
is present for moderate sample sizes
when simulated sample is neither from N(θ1, 1) nor from
N(θ2, 2), behaviour of posterior varies, depending on which
distribution is closest
Logit or Probit?
binary dataset, R dataset about diabetes in 200 Pima Indian
women with body mass index as explanatory variable
comparison of logit and probit fits could be suitable. We are
thus comparing both fits via our method
M1 : yi | xi, θ1 ∼ B(1, pi) where pi = exp(xi θ1)/(1 + exp(xi θ1))
M2 : yi | xi, θ2 ∼ B(1, qi) where qi = Φ(xi θ2)
Common parameterisation
Local reparameterisation strategy that rescales parameters of the
probit model M2 so that the MLE’s of both models coincide.
[Choudhuty et al., 2007]
Φ(xi θ2) ≈ exp(k xi θ2)/(1 + exp(k xi θ2))
and use best estimate of k to bring both parameters into coherency:
(k0, k1) = (θ01/θ02, θ11/θ12),
reparameterise M1 and M2 as
M1 : yi | xi, θ ∼ B(1, pi) where pi = exp(xi θ)/(1 + exp(xi θ))
M2 : yi | xi, θ ∼ B(1, qi) where qi = Φ(xi(κ⁻¹θ)),
with κ⁻¹θ = (θ0/k0, θ1/k1).
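A sketch of the rescaling step (Python; synthetic data stands in for the Pima dataset, and plain maximum-likelihood fits via scipy stand in for whatever estimation routine the slides assume):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # Fit logit and probit models by ML on synthetic (Pima-like) data and
    # set k_j = theta_j^{logit} / theta_j^{probit} componentwise
    rng = np.random.default_rng(3)
    X = np.column_stack([np.ones(200), rng.normal(32, 6, size=200)])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([-4.0, 0.1]))))

    def nll_logit(t):
        eta = X @ t
        return np.sum(np.logaddexp(0, eta) - y * eta)

    def nll_probit(t):
        q = np.clip(norm.cdf(X @ t), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))

    t_logit = minimize(nll_logit, np.zeros(2)).x
    t_probit = minimize(nll_probit, np.zeros(2)).x
    k = t_logit / t_probit        # (k0, k1) of the slide
    print("k =", k)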
Prior modelling
Under default g-prior
θ ∼ N2(0, n(XᵀX)⁻¹)
full conditional posterior distributions given allocations
π(θ | y, X, ζ) ∝ [exp{∑_{i:ζi=1} yi xi θ} / ∏_{i:ζi=1} (1 + exp(xi θ))] exp{−θᵀ(XᵀX)θ/2n}
× ∏_{i:ζi=2} Φ(xi(κ⁻¹θ))^{yi} (1 − Φ(xi(κ⁻¹θ)))^{1−yi}
hence posterior distribution clearly defined
Results
a0     α      Logistic θ0   Logistic θ1   Probit θ0/k0   Probit θ1/k1
.1     .352   -4.06         .103          -2.51          .064
.2     .427   -4.03         .103          -2.49          .064
.3     .440   -4.02         .102          -2.49          .063
.4     .456   -4.01         .102          -2.48          .063
.5     .449   -4.05         .103          -2.51          .064
[Histograms of posteriors of α in favour of the logistic model for a0 = .1, .2, .3, .4, .5, for (a) the Pima dataset, (b) data from the logistic model, (c) data from the probit model]
Survival analysis
Testing hypothesis that data comes from a
1. log-Normal(φ, κ²),
2. Weibull(α, λ), or
3. log-Logistic(γ, δ)
distribution
Corresponding mixture given by the density
α1 exp{−(log x − φ)²/2κ²}/(√(2π) xκ) + α2 (α/λ)(x/λ)^{α−1} exp{−(x/λ)^α} + α3 (δ/γ)(x/γ)^{δ−1}/(1 + (x/γ)^δ)²
where α1 + α2 + α3 = 1
Reparameterisation
Looking for common parameter(s):
φ = µ + γβ = ξ
σ² = π²β²/6 = ζ²π²/3
where γ ≈ 0.5772 is the Euler–Mascheroni constant.
Allows for a noninformative prior on the common location-scale
parameter,
π(φ, σ²) = 1/σ²
Recovery
[Figure: boxplots of the posterior distributions of the Normal weight α1 under the two scenarios: truth = Normal (left panel), truth = Gumbel (right panel), a0 = 0.01, 0.1, 1.0, 10.0 (from left to right in each panel) and n = 10,000 simulated observations.]
Asymptotic consistency
Posterior consistency holds for mixture testing procedure [under
minor conditions]
Two different cases
the two models, M1 and M2, are well separated
model M1 is a submodel of M2.
Posterior concentration rate
Let π be the prior and xn = (x1, · · · , xn) a sample with true
density f∗
proposition
Assume that, for all c > 0, there exist Θn ⊂ Θ1 × Θ2 and B > 0 such that
π[Θn^c] ≤ n^{−c}, Θn ⊂ {‖θ1‖ + ‖θ2‖ ≤ n^B}
and that there exist H ≥ 0 and L, δ > 0 such that, for j = 1, 2,
sup_{θ,θ′∈Θn} ‖fj,θj − fj,θ′j‖1 ≤ L n^H ‖θj − θ′j‖, θ = (θ1, θ2), θ′ = (θ′1, θ′2),
∀ ‖θj − θj∗‖ ≤ δ : KL(fj,θj, fj,θj∗) ≲ ‖θj − θj∗‖.
Then, when f∗ = fθ∗,α∗, with α∗ ∈ [0, 1], there exists M > 0 such that
π[(α, θ); ‖fθ,α − f∗‖1 > M √(log n/n) | xn] = op(1).
Separated models
Assumption: Models are separated, i.e. identifiability holds:
∀α, α′ ∈ [0, 1], ∀θj, θ′j, j = 1, 2 : Pθ,α = Pθ′,α′ ⇒ α = α′, θ = θ′
Further
inf_{θ1∈Θ1} inf_{θ2∈Θ2} ‖f1,θ1 − f2,θ2‖1 > 0
and, for θj∗ ∈ Θj, if Pθj weakly converges to Pθj∗, then
θj −→ θj∗ in the Euclidean topology
theorem
Under the above assumptions, for all ε > 0,
π[|α − α∗| > ε | xn] = op(1)
theorem
If, in addition,
θj → fj,θj is C² around θj∗, j = 1, 2,
f1,θ1∗ − f2,θ2∗, ∇f1,θ1∗, ∇f2,θ2∗ are linearly independent in y, and
there exists δ > 0 such that
∇f1,θ1∗, ∇f2,θ2∗, sup_{|θ1−θ1∗|<δ} |D² f1,θ1|, sup_{|θ2−θ2∗|<δ} |D² f2,θ2| ∈ L1
then
π[|α − α∗| > M √(log n/n) | xn] = op(1).
This theorem allows for an interpretation of α under the posterior: if data
xn is generated from model M1, then the posterior on α concentrates
around α = 1
Embedded case
Here M1 is a submodel of M2, i.e.
θ2 = (θ1, ψ) and θ2 = (θ1, ψ0 = 0)
corresponds to f2,θ2 ∈ M1
Same posterior concentration rate
√(log n/n)
for estimating α when α∗ ∈ (0, 1) and ψ∗ ≠ 0.
Null case
Case where ψ∗ = 0, i.e., f∗ is in model M1
Two possible paths to approximate f∗: either α goes to 1
(path 1) or ψ goes to 0 (path 2)
New identifiability condition: Pθ,α = P∗ only if
α = 1, θ1 = θ1∗, θ2 = (θ1∗, ψ) or α ≤ 1, θ1 = θ1∗, θ2 = (θ1∗, 0)
Prior
π(α, θ) = πα(α)π1(θ1)πψ(ψ), θ2 = (θ1, ψ)
with common (prior on) θ1
Assumptions
[B1] Regularity: Assume that θ1 → f1,θ1 and θ2 → f2,θ2 are 3
times continuously differentiable and that
F∗[ f̄1,θ1∗³ / f̲1,θ1∗³ ] < +∞,
where f̄1,θ1∗ = sup_{|θ1−θ1∗|<δ} f1,θ1 and f̲1,θ1∗ = inf_{|θ1−θ1∗|<δ} f1,θ1,
F∗[ sup_{|θ1−θ1∗|<δ} |∇f1,θ1∗|³ / f̲1,θ1∗³ ] < +∞, F∗[ |∇f1,θ1∗|⁴ / f̲1,θ1∗⁴ ] < +∞,
F∗[ sup_{|θ1−θ1∗|<δ} |D² f1,θ1∗|² / f̲1,θ1∗² ] < +∞, F∗[ sup_{|θ1−θ1∗|<δ} |D³ f1,θ1∗| / f̲1,θ1∗ ] < +∞
Assumptions
[B2] Integrability: There exists
S0 ⊂ S ∩ {|ψ| > δ0}
for some positive δ0 and satisfying Leb(S0) > 0, such that for
all ψ ∈ S0,
F∗[ sup_{|θ1−θ1∗|<δ} f2,θ1,ψ / f̲1,θ1∗⁴ ] < +∞, F∗[ sup_{|θ1−θ1∗|<δ} f2,θ1,ψ³ / f̲1,θ1∗³ ] < +∞
Assumptions
[B3] Stronger identifiability: Set
∇f2,θ1∗,ψ∗(x) = (∇θ1 f2,θ1∗,ψ∗(x)ᵀ, ∇ψ f2,θ1∗,ψ∗(x)ᵀ)ᵀ.
Then for all ψ ∈ S with ψ ≠ 0, if η0 ∈ R, η1 ∈ R^{d1},
η0(f1,θ1∗ − f2,θ1∗,ψ) + η1ᵀ ∇θ1 f1,θ1∗ = 0 ⇔ η0 = 0, η1 = 0
Consistency
theorem
Given the mixture fθ1,ψ,α = αf1,θ1 + (1 − α)f2,θ1,ψ and a sample
xn = (x1, · · · , xn) issued from f1,θ1∗, under assumptions B1–B3
there exists M > 0 such that
π[(α, θ); ‖fθ,α − f∗‖1 > M √(log n/n) | xn] = op(1).
If α ∼ B(a1, a2), with a2 < d2, and if the prior πθ1,ψ is absolutely
continuous with positive and continuous density at (θ1∗, 0), then for
Mn −→ ∞
π[|α − α∗| > Mn (log n)^γ / √n | xn] = op(1), γ = max((d1 + a2)/(d2 − a2), 1)/2
Testing for mixtures at BNP 13
Christian Robert
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
Christian Robert
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
Christian Robert
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
Christian Robert
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
Christian Robert
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
Christian Robert
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
Christian Robert
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
Christian Robert
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
Christian Robert
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
Christian Robert
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergence
Christian Robert
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
Christian Robert
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distances
Christian Robert
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conference
Christian Robert
 

More from Christian Robert (20)

Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
 
Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergence
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distances
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conference
 

Recently uploaded

BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
NoelManyise1
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
ronaldlakony0
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 

Recently uploaded (20)

BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 

On the vexing dilemma of hypothesis testing and the predicted demise of the Bayes factor

  • 1. The vexing dilemma of Bayes tests of hypotheses and the predicted demise of the Bayes factor Christian P. Robert Université Paris-Dauphine, Paris & University of Warwick, Coventry
  • 3. Testing issues Hypothesis testing central problem of statistical inference dramatically differentiating feature between classical and Bayesian paradigms wide open to controversy and divergent opinions, includ. within the Bayesian community non-informative Bayesian testing case mostly unresolved, witness the Jeffreys–Lindley paradox [Berger (2003), Mayo & Cox (2006), Gelman (2008)]
  • 4. Testing hypotheses Bayesian model selection as comparison of k potential statistical models towards the selection of model that fits the data “best” mostly accepted perspective: it does not primarily seek to identify which model is “true”, but compares fits tools like Bayes factor naturally include a penalisation addressing model complexity, mimicked by Bayes Information (BIC) and Deviance Information (DIC) criteria posterior predictive tools successfully advocated in Gelman et al. (2013) even though they involve double use of data
  • 6. Some difficulties tension between using (i) posterior probabilities justified by a binary loss function but depending on unnatural prior weights, and (ii) Bayes factors that eliminate this dependence but escape direct connection with posterior distribution, unless prior weights are integrated within the loss subsequent and delicate interpretation (or calibration) of the strength of the Bayes factor towards supporting a given hypothesis or model, because it is not a Bayesian decision rule similar difficulty with posterior probabilities, with tendency to interpret them as p-values (rather than the opposite!) when they only report through a marginal likelihood ratio the respective strengths of fitting the data to both models
  • 7. Some further difficulties long-lasting impact of the prior modeling, meaning the choice of the prior distributions on the parameter spaces of both models under comparison, despite overall consistency proof for Bayes factor discontinuity in use of improper priors since they are not justified in most testing situations, leading to many alternative if ad hoc solutions, where data is either used twice or split in artificial ways binary (accept vs. reject) outcome more suited for immediate decision (if any) than for model evaluation, in connection with rudimentary loss function
  • 8. Some additional difficulties related impossibility to ascertain simultaneous misfit or to detect presence of outliers no assessment of uncertainty associated with decision itself difficult computation of marginal likelihoods in most settings with further controversies about which algorithm to adopt strong dependence of posterior probabilities on conditioning statistics, which in turn undermines their validity for model assessment, as exhibited in ABC model choice temptation to create pseudo-frequentist equivalents such as q-values with even less Bayesian justifications time for a paradigm shift back to some solutions
  • 9. Significance tests Summary Significance tests Statistical tests Bayesian principles Bayesian tests Bayes factors Noninformative solutions Jeffreys-Lindley paradox Testing via mixtures
  • 10. data related questions Given a dataset $x_1, \ldots, x_n$, is it possible to answer a question related to the mechanism producing this data? [Answer: No!] For instance, is $\mathbb{E}[X] > 0$? Or, is $x_{n+1} = 10^4$ possible? [Under some assumptions...]
  • 12. Statistical tests Questions are called hypotheses, like $H_0: \mathbb{E}[X] > 0$, while models are generally indexed by parameters, e.g., $X_1 \sim f(x|\theta)$, $\theta \in \Theta_0$, meaning $H_0$ can be expressed as $H_0: \theta \in \Theta_0$. A statistical test of an hypothesis $H_0$ is a function $\delta$ of the data $x_1, \ldots, x_n$ taking values in $\{0, 1\}$ (yes or no). Sounds easy: estimate $\theta$ by $\hat\theta(x_1, \ldots, x_n)$ and check whether $\hat\theta \in \Theta_0$; just... misses random fluctuations
  • 15. Type I & II errors Test $\delta$ associated with average error $R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(X))] = P_\theta(\delta(X) = 0)$ if $\theta \in \Theta_0$, $P_\theta(\delta(X) = 1)$ otherwise. False rejection (Type I) and false acceptance (Type II) rates are antagonistic: one goes up when the other goes down
  • 16. A binomial example Sequence of binary [survey] replies $x_1, \ldots, x_n \overset{\text{iid}}{\sim} \mathcal{B}(\theta)$ and question about position of $\theta$ against 0.5: $H_0: \theta \geq 1/2$. If decision based on $\bar{x}_n < 1/2$, Type I error approximately $\Phi\left(\{1/2 - p\}\big/\{p(1-p)/n\}^{1/2}\right)$ [figure: Type I error for n = 10, 50, 100, 200, 500, 10^3]
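A quick numerical check of that normal approximation; this is an illustrative sketch, not part of the deck, where the rejection rule (reject $H_0: \theta \geq 1/2$ when the sample mean falls below 1/2) and the evaluation point p = 0.55 are my own choices:

```python
import numpy as np
from scipy.stats import binom, norm

p = 0.55  # a value inside H0: theta >= 1/2
for n in (10, 50, 100, 200, 500, 1000):
    # exact Type I error: P_p(sum x_i < n/2)
    exact = binom.cdf(int(np.ceil(n / 2)) - 1, n, p)
    # normal approximation from the slide
    approx = norm.cdf((0.5 - p) / np.sqrt(p * (1 - p) / n))
    print(f"n = {n:4d}: exact = {exact:.4f}, approx = {approx:.4f}")
```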
  • 18. Uncertain decisions While tests are [hard] decisions in $\{0, 1\}$, there is uncertainty about reliability of decision and error is unknown. A p-value is the probability of obtaining results at least as extreme as the observed one under the null: $p(x) = P_{H_0}(T(X) \geq T(x))$. Binomial example: $p(x_1, \ldots, x_n) = P_{1/2}(\bar{X}_n \geq \bar{x}_n)$ [figure: p-value for n = 10, 50, 100, 200, 500, 10^3]
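The binomial p-value in one line (a sketch under the same conventions; the observed count of 60 successes out of 100 is a made-up example):

```python
from scipy.stats import binom

def binomial_p_value(successes: int, n: int) -> float:
    # P_{1/2}(sum X_i >= observed sum), i.e. survival function at successes - 1
    return binom.sf(successes - 1, n, 0.5)

print(binomial_p_value(60, 100))  # ~0.028 for 60 heads in 100 tosses
```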
  • 21. Bayesian statistics Publication on Dec. 23, 1763 of “An Essay towards solving a Problem in the Doctrine of Chances” by the late Rev. Mr. Bayes, communicated by Mr. Price in the Philosophical Transactions of the Royal Society of London.
  • 22. Bayesian statistics with original title “A Method of Calculating the Exact Probability of All Conclusions founded on Induction” [Stigler, 2014]
  • 23. Bayesian statistics and maybe intended as a reply to Hume’s (1748) evaluation of the probability of miracles [Stigler, 2014]
  • 24. Bayes’ 1763 paper Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p. Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W . “Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.” T. Bayes, 1763
  • 26. Resolution “Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.” T. Bayes, 1763 Since $P(X = x|p) = \binom{n}{x} p^x (1-p)^{n-x}$, $P(a < p < b \text{ and } X = x) = \int_a^b \binom{n}{x} p^x (1-p)^{n-x}\,\mathrm{d}p$ and $P(X = x) = \int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\,\mathrm{d}p$,
  • 27. Resolution (2) then $P(a < p < b \mid X = x) = \dfrac{\int_a^b \binom{n}{x} p^x (1-p)^{n-x}\,\mathrm{d}p}{\int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\,\mathrm{d}p} = \dfrac{\int_a^b p^x (1-p)^{n-x}\,\mathrm{d}p}{B(x+1, n-x+1)}$, i.e. $p|x \sim \mathcal{B}e(x+1, n-x+1)$ [posterior distribution on p]
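Bayes' resolution is a two-line computation with a modern Beta distribution (a sketch; the values a = 0.4, b = 0.6, x = 7, n = 10 are arbitrary):

```python
from scipy.stats import beta

def bayes_1763(a: float, b: float, x: int, n: int) -> float:
    # uniform prior on p + binomial likelihood => posterior Beta(x+1, n-x+1)
    posterior = beta(x + 1, n - x + 1)
    return posterior.cdf(b) - posterior.cdf(a)

print(bayes_1763(0.4, 0.6, x=7, n=10))  # chance that p lies in (0.4, 0.6)
```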
  • 28. Bayes 101 Given a statistical model attached to data $x_1, \ldots, x_n$, with $X_i \sim f(x|\theta)$, the Bayesian paradigm introduces a probability distribution $\pi(\theta)$ called prior, without turning constant parameters into random variates (!), representing prior beliefs and available information, a reference measure on the parameter space $\Theta$, inducing a posterior distribution $\pi(\theta|x_1, \ldots, x_n) \propto \pi(\theta) f(x_1|\theta) \cdots f(x_n|\theta)$ which entirely drives Bayesian inference
  • 29. Historical appearance of Bayesian tests Is the new parameter supported by the observations or is any variation expressible by it better interpreted as random? Thus we must set two hypotheses for comparison, the more complicated having the smaller initial probability ...compare a specially suggested value of a new parameter, often 0 [q], with the aggregate of other possible values [q′]. We shall call q the null hypothesis and q′ the alternative hypothesis [and] we must take P(q|H) = P(q′|H) = 1/2. (Jeffreys, ToP, 1939, V, §5.0)
  • 30. Bayesian tests 101 Associated with the risk $R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(x))] = P_\theta(\delta(x) = 0)$ if $\theta \in \Theta_0$, $P_\theta(\delta(x) = 1)$ otherwise. Bayes test: the Bayes estimator associated with $\pi$ and with the $0-1$ loss is $\delta^\pi(x) = 1$ if $P(\theta \in \Theta_0|x) > P(\theta \notin \Theta_0|x)$, $0$ otherwise
  • 32. Generalisation Weights errors differently under both hypotheses: Theorem (Optimal Bayes decision) Under the $0-1$ loss function $L(\theta, d) = 0$ if $d = \mathbb{I}_{\Theta_0}(\theta)$, $a_0$ if $d = 1$ and $\theta \notin \Theta_0$, $a_1$ if $d = 0$ and $\theta \in \Theta_0$, the Bayes procedure is $\delta^\pi(x) = 1$ if $P(\theta \in \Theta_0|x) \geq a_0/(a_0 + a_1)$, $0$ otherwise
  • 34. A function of posterior probabilities Definition (Bayes factors) For hypotheses $H_0: \theta \in \Theta_0$ vs. $H_a: \theta \notin \Theta_0$, $B_{01} = \dfrac{\pi(\Theta_0|x)}{\pi(\Theta_0^c|x)} \Big/ \dfrac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \dfrac{\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\,\mathrm{d}\theta}{\int_{\Theta_0^c} f(x|\theta)\pi_1(\theta)\,\mathrm{d}\theta}$ [Jeffreys, ToP, 1939, V, §5.01] Bayes rule: acceptance if $B_{01} > \{(1 - \pi(\Theta_0))/a_1\}/\{\pi(\Theta_0)/a_0\}$
  • 35. Self-contained concept Outside decision-theoretic environment: eliminates choice of $\pi(\Theta_0)$ but depends on the choice of $(\pi_0, \pi_1)$; Bayesian/marginal equivalent to the likelihood ratio; Jeffreys’ scale of evidence: if $\log_{10}(B_{10}^\pi)$ is between 0 and 0.5, evidence against $H_0$ weak, if between 0.5 and 1, evidence substantial, if between 1 and 2, evidence strong, and if above 2, evidence decisive [...fairly arbitrary!]
  • 39. regressive illustration caterpillar dataset from Bayesian Essentials (2013): predicting density of caterpillar nests from 10 covariates
               Estimate   Post. Var.  log10(BF)
  (Intercept)   10.8895     6.8229     2.1873 (****)
  X1            -0.0044     2e-06      1.1571 (***)
  X2            -0.0533     0.0003     0.6667 (**)
  X3             0.0673     0.0072    -0.8585
  X4            -1.2808     0.2316     0.4726 (*)
  X5             0.2293     0.0079     0.3861 (*)
  X6            -0.3532     1.7877    -0.9860
  X7            -0.2351     0.7373    -0.9848
  X8             0.1793     0.0408    -0.8223
  X9            -1.2726     0.5449    -0.3461
  X10           -0.4288     0.3934    -0.8949
  evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor
  • 40. A major refurbishment Suppose we are considering whether a location parameter α is 0. The estimation prior probability for it is uniform and we should have to take f(α) = 0 and K[= B10] would always be infinite (Jeffreys, ToP, V, §5.02) When the null hypothesis is supported by a set of measure 0 against Lebesgue measure, $\pi(\Theta_0) = 0$ for an absolutely continuous prior distribution [End of the story?!] Requirement: defined prior distributions under both assumptions, $\pi_0(\theta) \propto \pi(\theta)\mathbb{I}_{\Theta_0}(\theta)$, $\pi_1(\theta) \propto \pi(\theta)\mathbb{I}_{\Theta_1}(\theta)$ (under the standard dominating measures on $\Theta_0$ and $\Theta_1$) and, using the prior probabilities $\pi(\Theta_0) = \rho_0$ and $\pi(\Theta_1) = \rho_1$, $\pi(\theta) = \rho_0\pi_0(\theta) + \rho_1\pi_1(\theta)$
  • 43. Point null hypotheses “Is it of the slightest use to reject a hypothesis until we have some idea of what to put in its place?” H. Jeffreys, ToP (p.390) Particular case $H_0: \theta = \theta_0$. Take $\rho_0 = \Pr^\pi(\theta = \theta_0)$ and $g_1$ prior density under $H_0^c$. Posterior probability of $H_0$: $\pi(\Theta_0|x) = \dfrac{f(x|\theta_0)\rho_0}{\int f(x|\theta)\pi(\theta)\,\mathrm{d}\theta} = \dfrac{f(x|\theta_0)\rho_0}{f(x|\theta_0)\rho_0 + (1-\rho_0)m_1(x)}$, with marginal under $H_0^c$: $m_1(x) = \int_{\Theta_1} f(x|\theta)g_1(\theta)\,\mathrm{d}\theta$, and $B_{01}^\pi(x) = \dfrac{f(x|\theta_0)\rho_0}{m_1(x)(1-\rho_0)} \Big/ \dfrac{\rho_0}{1-\rho_0} = \dfrac{f(x|\theta_0)}{m_1(x)}$
  • 45. Normal example Testing whether the mean α of a normal observation $X \sim N(\alpha, s^2)$ is zero: $P(H_0|x) \propto \exp\left\{-\dfrac{x^2}{2s^2}\right\}$, $P(H_0^c|x) \propto \int \exp\left\{-\dfrac{(x-\alpha)^2}{2s^2}\right\} f(\alpha)\,\mathrm{d}\alpha$
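This becomes fully explicit once a proper prior is chosen for α; a minimal sketch, assuming α ~ N(0, τ²) under H1 and equal prior weights (both assumptions mine), where the marginal under H1 is N(0, s² + τ²):

```python
import numpy as np
from scipy.stats import norm

def posterior_prob_H0(x: float, s: float = 1.0, tau: float = 1.0,
                      rho0: float = 0.5) -> float:
    # B01 = f(x | alpha = 0) / m1(x), with m1 = N(0, s^2 + tau^2)
    b01 = norm.pdf(x, 0, s) / norm.pdf(x, 0, np.sqrt(s**2 + tau**2))
    odds = rho0 / (1 - rho0) * b01          # posterior odds of H0
    return odds / (1 + odds)

print(posterior_prob_H0(1.96))  # ~0.35 at the classical 5% boundary
```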
  • 46. regressive illustration caterpillar dataset from Bayesian Essentials (2013): predicting density of caterpillar nests from 10 covariates
               Estimate   Post. Var.  log10(BF)
  (Intercept)   10.8895     6.8229     2.1873 (****)
  X1            -0.0044     2e-06      1.1571 (***)
  X2            -0.0533     0.0003     0.6667 (**)
  X3             0.0673     0.0072    -0.8585
  X4            -1.2808     0.2316     0.4726 (*)
  X5             0.2293     0.0079     0.3861 (*)
  X6            -0.3532     1.7877    -0.9860
  X7            -0.2351     0.7373    -0.9848
  X8             0.1793     0.0408    -0.8223
  X9            -1.2726     0.5449    -0.3461
  X10           -0.4288     0.3934    -0.8949
  evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor
  • 47. Noninformative proposals Summary Significance tests Noninformative solutions Jeffreys-Lindley paradox Testing via mixtures
  • 48. what’s special about the Bayes factor?! “The priors do not represent substantive knowledge of the parameters within the model” “Using Bayes’ theorem, these priors can then be updated to posteriors conditioned on the data that were actually observed.” “In general, the fact that different priors result in different Bayes factors should not come as a surprise.” “The Bayes factor (...) balances the tension between parsimony and goodness of fit, (...) against overfitting the data.” “In induction there is no harm in being occasionally wrong; it is inevitable that we shall be.” [Jeffreys, 1939; Ly et al., 2015]
  • 49. what’s wrong with the Bayes factor?! (1/2, 1/2) partition between hypotheses has very little to suggest in terms of extensions central difficulty stands with issue of picking a prior probability of a model unfortunate impossibility of using improper priors in most settings Bayes factors lack direct scaling associated with posterior probability and loss function twofold dependence on subjective prior measure, first in prior weights of models and second in lasting impact of prior modelling on the parameters Bayes factor offers no window into uncertainty associated with decision further reasons in the summary [Robert, 2016]
  • 50. Vague proper priors are not the solution Taking a proper prior with a “very large” variance (e.g., BUGS) will most often result in an undefined or ill-defined limit Example (Lindley’s paradox) If testing $H_0: \theta = 0$ when observing $x \sim N(\theta, 1)$, under a normal $N(0, \alpha)$ prior $\pi_1(\theta)$, $B_{01}(x) \xrightarrow{\alpha \to \infty} +\infty$
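A numerical rendering of that limit (an illustrative sketch; the fixed observation x = 1.5 is arbitrary):

```python
import numpy as np
from scipy.stats import norm

x = 1.5
for alpha in (1, 10, 100, 1e4, 1e6):  # prior variance under H1
    # B01 = f(x | theta = 0) / m1(x), with m1 = N(0, 1 + alpha)
    b01 = norm.pdf(x, 0, 1) / norm.pdf(x, 0, np.sqrt(1 + alpha))
    print(f"prior variance {alpha:>9.0f}: B01 = {b01:10.2f}")
```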
  • 53. Learning from the sample Definition (Learning sample) Given an improper prior π, (x1, . . . , xn) is a learning sample if π(·|x1, . . . , xn) is proper and a minimal learning sample if none of its subsamples is a learning sample There is just enough information in a minimal learning sample to make inference about θ under the prior π
  • 55. Pseudo-Bayes factors Idea Use one part $x_{[i]}$ of the data $x$ to make the prior proper: $\pi_i$ improper but $\pi_i(\cdot|x_{[i]})$ proper, and $\dfrac{\int f_i(x_{[n/i]}|\theta_i)\,\pi_i(\theta_i|x_{[i]})\,\mathrm{d}\theta_i}{\int f_j(x_{[n/i]}|\theta_j)\,\pi_j(\theta_j|x_{[i]})\,\mathrm{d}\theta_j}$ independent of normalizing constant. Use remaining $x_{[n/i]}$ to run test as if $\pi_j(\theta_j|x_{[i]})$ is the true prior
  • 58. Motivation Provides a working principle for improper priors: gather enough information from data to achieve properness and use this properness to run the test on remaining data; does not use $x$ twice as in Aitkin (1991)
  • 61. Unexpected problems! depends on the choice of $x_{[i]}$; many ways of combining pseudo-Bayes factors: $\text{AIBF} = B_{ji}^N\,\dfrac{1}{L}\sum_\ell B_{ij}(x_{[\ell]})$, $\text{MIBF} = B_{ji}^N\,\text{med}_\ell\,B_{ij}(x_{[\ell]})$, $\text{GIBF} = B_{ji}^N \exp\left\{\dfrac{1}{L}\sum_\ell \log B_{ij}(x_{[\ell]})\right\}$; not often an exact Bayes factor and thus lacking the inner coherence relations $B_{12} = B_{10}B_{02}$ and $B_{01} = 1/B_{10}$ [Berger & Pericchi, 1996]
  • 63. Fractional Bayes factor Idea use directly the likelihood to separate training sample from testing sample: $B_{12}^F = B_{12}(x)\,\dfrac{\int L_2^b(\theta_2)\pi_2(\theta_2)\,\mathrm{d}\theta_2}{\int L_1^b(\theta_1)\pi_1(\theta_1)\,\mathrm{d}\theta_1}$ [O’Hagan, 1995] Proportion $b$ of the sample used to gain properness
  • 65. Fractional Bayes factor (cont’d) Example (Normal mean) $B_{12}^F = \dfrac{1}{\sqrt{b}}\,e^{n(b-1)\bar{x}_n^2/2}$ corresponds to exact Bayes factor for the prior $N\!\left(0, \dfrac{1-b}{nb}\right)$: if $b$ constant, prior variance goes to 0; if $b = \dfrac{1}{n}$, prior variance stabilises around 1; if $b = n^{-\alpha}$, $\alpha < 1$, prior variance goes to 0 too.
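A short check of how the training fraction b drives this factor (a sketch; the sample size and mean are made up):

```python
import numpy as np

def fractional_bf(xbar: float, n: int, b: float) -> float:
    # B12^F = b^{-1/2} exp{ n (b - 1) xbar^2 / 2 }, as on the slide
    return np.exp(n * (b - 1) * xbar**2 / 2) / np.sqrt(b)

n, xbar = 100, 0.3
for b in (0.5, 0.1, 1 / n):
    var = (1 - b) / (n * b)  # variance of the matching prior N(0, (1-b)/nb)
    print(f"b = {b:5.3f}: B12^F = {fractional_bf(xbar, n, b):8.4f}, "
          f"matching prior variance = {var:.3f}")
```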
  • 66. Jeffreys–Lindley paradox Summary Significance tests Noninformative solutions Jeffreys-Lindley paradox Lindley’s paradox dual versions of the paradox Bayesian resolutions Testing via mixtures
  • 67. Lindley’s paradox In a normal mean testing problem, $\bar{x}_n \sim N(\theta, \sigma^2/n)$, $H_0: \theta = \theta_0$, under Jeffreys prior, $\theta \sim N(\theta_0, \sigma^2)$, the Bayes factor $B_{01}(t_n) = (1+n)^{1/2} \exp\left(-\dfrac{n t_n^2}{2[1+n]}\right)$, where $t_n = \sqrt{n}\,|\bar{x}_n - \theta_0|/\sigma$, satisfies $B_{01}(t_n) \xrightarrow{n \to \infty} \infty$ [assuming a fixed $t_n$] [Lindley, 1957]
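The paradox in numbers (a sketch): hold the statistic at the classical 5% boundary, t_n = 1.96, and let n grow; the Bayes factor in favour of H0 diverges while the p-value stays fixed.

```python
import numpy as np

t = 1.96  # fixed test statistic, i.e. a p-value pinned near 0.05
for n in (10, 100, 1000, 10**4, 10**6):
    b01 = np.sqrt(1 + n) * np.exp(-n * t**2 / (2 * (1 + n)))
    print(f"n = {n:>7d}: B01 = {b01:10.2f}")  # grows without bound
```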
  • 68. Two versions of the paradox “the weight of Lindley’s paradoxical result (...) burdens proponents of the Bayesian practice”. [Lad, 2003] official version, opposing frequentist and Bayesian assessments [Lindley, 1957] intra-Bayesian version, blaming vague and improper priors for the Bayes factor misbehaviour: if $\pi_1(\cdot|\sigma)$ depends on a scale parameter $\sigma$, it is often the case that $B_{01}(x) \xrightarrow{\sigma \to \infty} +\infty$ for a given $x$, meaning $H_0$ is always accepted [Robert, 1992, 2013]
  • 69. where does it matter? In the normal case, $Z \sim N(\theta, 1)$, $\theta \sim N(0, \alpha^2)$, Bayes factor $B_{10}(z) = \dfrac{e^{z^2\alpha^2/2(1+\alpha^2)}}{\sqrt{1+\alpha^2}} = \sqrt{1-\lambda}\,\exp\{\lambda z^2/2\}$, with $\lambda = \alpha^2/(1+\alpha^2)$
  • 70. Evacuation of the first version Two paradigms [(b) versus (f)]: one (b) operates on the parameter space Θ, while the other (f) is produced from the sample space; one (f) relies solely on the point-null hypothesis H0 and the corresponding sampling distribution, while the other (b) opposes H0 to a (predictive) marginal version of H1; one (f) could reject “a hypothesis that may be true (...) because it has not predicted observable results that have not occurred” (Jeffreys, ToP, VII, §7.2), while the other (b) conditions upon the observed value $x^{\text{obs}}$; one (f) cannot agree with the likelihood principle, while the other (b) is almost uniformly in agreement with it; one (f) resorts to an arbitrary fixed bound α on the p-value, while the other (b) refers to the (default) boundary probability of 1/2
  • 71. Nothing’s wrong with the second version n, prior’s scale factor: prior variance n times larger than the observation variance and when n goes to ∞, Bayes factor goes to ∞ no matter what the observation is; n becomes what Lindley (1957) calls “a measure of lack of conviction about the null hypothesis”; when prior diffuseness under H1 increases, only relevant information becomes that θ could be equal to θ0, and this overwhelms any evidence to the contrary contained in the data; mass of the prior distribution in the vicinity of any fixed neighbourhood of the null hypothesis vanishes to zero under H1; deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not choose it
  • 73. On some resolutions of the second version
  – use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lack complete proper Bayesian justification [Berger & Pericchi, 2001]
  – use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013]; péché de jeunesse (youthful sin): equating the values of the prior densities at the point-null value θ0, ρ0 = (1 − ρ0)π1(θ0) [Robert, 1993]
  – use of the posterior predictive distribution, which uses the data twice
  – matching priors, whose sole purpose is to bring frequentist and Bayesian coverages as close as possible [Datta & Mukerjee, 2004]
  – use of score functions extending the log score function, $\log B_{12}(x) = \log m_1(x) - \log m_2(x) = S_0(x, m_1) - S_0(x, m_2)$, that are independent of the normalising constant [Dawid et al., 2013; Dawid & Musio, 2015]
  – non-local priors correcting default priors towards more balanced error rates [Johnson & Rossell, 2010; Consonni et al., 2013]
  • 80. Changing the testing perspective Summary Significance tests Noninformative solutions Jeffreys-Lindley paradox Testing via mixtures
  • 81. Paradigm shift New proposal for a paradigm shift in the Bayesian processing of hypothesis testing and of model selection: convergent and naturally interpretable solution; more extended use of improper priors. Simple representation of the testing problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1
  • 83. Paradigm shift Simple representation of the testing problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1. Approach inspired by the consistency result of Rousseau and Mengersen (2011) on estimated overfitting mixtures; mixture representation not directly equivalent to the use of a posterior probability; potential of a better approach to testing, while not expanding number of parameters; calibration of the posterior distribution of the weight of a model, while moving away from the artificial notion of the posterior probability of a model
  • 84. Encompassing mixture model Idea: Given two statistical models, $M_1: x \sim f_1(x|\theta_1)$, $\theta_1 \in \Theta_1$, and $M_2: x \sim f_2(x|\theta_2)$, $\theta_2 \in \Theta_2$, embed both within an encompassing mixture $M_\alpha: x \sim \alpha f_1(x|\theta_1) + (1-\alpha) f_2(x|\theta_2)$, $0 \leq \alpha \leq 1$ (1). Note: Both models correspond to special cases of (1), one for α = 1 and one for α = 0. Draw inference on mixture representation (1), as if each observation was individually and independently produced by the mixture model
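A minimal sketch of the encompassing model as code, instantiated with the Poisson/Geometric pair used a few slides below (the shift in the geometric pmf accounts for scipy's convention that its geometric law sits on {1, 2, ...} while the slides' Geo(p) sits on {0, 1, ...}):

```python
import numpy as np
from scipy.stats import poisson, geom

def mixture_loglik(alpha: float, lam: float, x: np.ndarray) -> float:
    f1 = poisson.pmf(x, lam)                 # f1 = P(lambda)
    f2 = geom.pmf(x + 1, 1 / (1 + lam))      # f2 = Geo(1/(1+lambda)) on {0,1,...}
    return np.log(alpha * f1 + (1 - alpha) * f2).sum()

x = np.random.default_rng(0).poisson(4, size=100)
print(mixture_loglik(0.5, 4.0, x))
```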
  • 87. Inferential motivations Sounds like an approximation to the real model, but several definitive advantages to this paradigm shift: Bayes estimate of the weight α replaces posterior probability of model M1, equally convergent indicator of which model is “true”, while avoiding artificial prior probabilities on model indices, ω1 and ω2 interpretation of estimator of α at least as natural as handling the posterior probability, while avoiding zero-one loss setting α and its posterior distribution provide measure of proximity to the models, while being interpretable as data propensity to stand within one model further allows for alternative perspectives on testing and model choice, like predictive tools, cross-validation, and information indices like WAIC
  • 88. Computational motivations avoids highly problematic computations of the marginal likelihoods, since standard algorithms are available for Bayesian mixture estimation straightforward extension to a finite collection of models, with a larger number of components, which considers all models at once and eliminates least likely models by simulation eliminates difficulty of label switching that plagues both Bayesian estimation and Bayesian computation, since components are no longer exchangeable posterior distribution of α evaluates more thoroughly strength of support for a given model than the single figure outcome of a posterior probability variability of posterior distribution on α allows for a more thorough assessment of the strength of this support
  • 89. Noninformative motivations additional feature missing from traditional Bayesian answers: a mixture model acknowledges possibility that, for a finite dataset, both models or none could be acceptable; standard (proper and informative) prior modeling can be reproduced in this setting, but non-informative (improper) priors also are manageable therein, provided both models are first reparameterised towards shared parameters, e.g. location and scale parameters; in the special case when all parameters are common, $M_\alpha: x \sim \alpha f_1(x|\theta) + (1-\alpha) f_2(x|\theta)$, $0 \leq \alpha \leq 1$, if θ is a location parameter, a flat prior π(θ) ∝ 1 is available
  • 90. Weakly informative motivations using the same parameters or some identical parameters on both components highlights that opposition between the two components is not an issue of enjoying different parameters; those common parameters are nuisance parameters, to be integrated out; prior model weights ωi rarely discussed in classical Bayesian approach, even though linear impact on posterior probabilities. Here, prior modeling only involves selecting a prior on α, e.g., α ∼ Be(a0, a0); while a0 impacts posterior on α, it always leads to mass accumulation near 1 or 0, i.e. favours most likely model; sensitivity analysis straightforward to carry out; approach easily calibrated by parametric bootstrap providing reference posterior of α under each model; natural Metropolis–Hastings alternative
  • 91. Poisson/Geometric choice between Poisson P(λ) and Geometric Geo(p) distribution; mixture with common parameter λ, $M_\alpha: \alpha\,\mathcal{P}(\lambda) + (1-\alpha)\,\mathcal{G}eo(1/(1+\lambda))$; allows for Jeffreys prior since resulting posterior is proper; independent Metropolis–within–Gibbs with proposal distribution on λ equal to Poisson posterior (with acceptance rate larger than 75%)
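A compact sketch of that sampler, under assumptions of mine: a Be(a0, a0) prior on α, a Jeffreys-type prior π(λ) ∝ λ^{-1/2}, and an independent proposal for λ equal to the Gamma posterior obtained by treating the whole sample as Poisson:

```python
import numpy as np
from scipy.stats import poisson, geom, gamma

rng = np.random.default_rng(1)
x = rng.poisson(4, size=200)            # simulated data, true model Poisson(4)
n, a0 = x.size, 0.5
alpha, lam = 0.5, x.mean()

def log_target(lam, z):
    # pi(lambda) ~ lambda^{-1/2} (assumed) times the allocated likelihoods
    f1 = poisson.logpmf(x[z == 1], lam).sum()
    f2 = geom.logpmf(x[z == 2] + 1, 1 / (1 + lam)).sum()
    return -0.5 * np.log(lam) + f1 + f2

prop = gamma(a=x.sum() + 0.5, scale=1 / n)   # all-Poisson posterior proposal

for _ in range(5000):
    # 1. allocations z_i given (alpha, lambda)
    w1 = alpha * poisson.pmf(x, lam)
    w2 = (1 - alpha) * geom.pmf(x + 1, 1 / (1 + lam))
    z = np.where(rng.uniform(size=n) < w1 / (w1 + w2), 1, 2)
    # 2. exact Beta full conditional for alpha
    n1 = (z == 1).sum()
    alpha = rng.beta(n1 + a0, n - n1 + a0)
    # 3. independent Metropolis-Hastings step for lambda
    lam_new = prop.rvs(random_state=rng)
    log_r = (log_target(lam_new, z) - log_target(lam, z)
             + prop.logpdf(lam) - prop.logpdf(lam_new))
    if np.log(rng.uniform()) < log_r:
        lam = lam_new

print(f"final draws: alpha = {alpha:.3f}, lambda = {lam:.3f}")
```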
  • 94. Beta prior When $\alpha \sim \mathcal{B}e(a_0, a_0)$ prior, full conditional posterior $\alpha \sim \mathcal{B}e(n_1(\zeta) + a_0, n_2(\zeta) + a_0)$. Exact Bayes factor opposing Poisson and Geometric: $B_{12} = \dfrac{\Gamma\!\left(n + 2 + \sum_{i=1}^n x_i\right)}{n^{n\bar{x}_n}\,\prod_{i=1}^n x_i!\;\Gamma(n+2)}$, although undefined from a purely mathematical viewpoint
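Evaluating that closed form on the log scale with gammaln avoids overflow (a sketch of the formula as reconstructed above; numpy's geometric generator starts at 1, hence the shift):

```python
import numpy as np
from scipy.special import gammaln

def log_B12(x: np.ndarray) -> float:
    n, s = x.size, x.sum()
    return (gammaln(n + 2 + s) - gammaln(n + 2)
            - s * np.log(n) - gammaln(x + 1).sum())

rng = np.random.default_rng(2)
print(log_B12(rng.poisson(4, size=100)))           # Poisson-generated data
print(log_B12(rng.geometric(0.2, size=100) - 1))   # Geometric-generated data
```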
  • 95. Parameter estimation [two plots] Posterior means of λ and medians of α for 100 Poisson P(4) datasets of size n = 1000, for a0 = .0001, .001, .01, .1, .2, .3, .4, .5. Each posterior approximation is based on 10^4 Metropolis-Hastings iterations.
  • 97. Consistency [plots for a0 = .1, .3, .5] Posterior means (sky-blue) and medians (grey-dotted) of α, over 100 Poisson P(4) datasets for sample sizes from 1 to 1000.
  • 98. Behaviour of Bayes factor [plots] Comparison between P(M1|x) (red dotted area) and posterior medians of α (grey zone) for 100 Poisson P(4) datasets with sample sizes n between 1 and 1000, for a0 = .001, .1, .5
  • 99. Normal-normal comparison comparison of a normal N(θ1, 1) with a normal N(θ2, 2) distribution; mixture with identical location parameter θ, $\alpha N(\theta, 1) + (1-\alpha) N(\theta, 2)$; Jeffreys prior π(θ) = 1 can be used, since posterior is proper; reference (improper) Bayes factor $B_{12} = 2^{(n-1)/2} \exp\left\{-\frac{1}{4}\sum_{i=1}^n (x_i - \bar{x})^2\right\}$
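That closed form (as reconstructed above, on the log scale) separates the two components cleanly; a sketch on data simulated from each:

```python
import numpy as np

def log_B12_normal(x: np.ndarray) -> float:
    s = ((x - x.mean()) ** 2).sum()
    return (x.size - 1) / 2 * np.log(2) - s / 4

rng = np.random.default_rng(3)
print(log_B12_normal(rng.normal(0, 1, size=100)))           # typically > 0
print(log_B12_normal(rng.normal(0, np.sqrt(2), size=100)))  # typically < 0
```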
  • 102. Consistency
[Figure: four panels for n = 15, 50, 100, 500, x-axis a0 ∈ {.1, .2, .3, .4, .5}] Posterior means (wheat) and medians of α (dark wheat), compared with posterior probabilities of M0 (gray), for a N(0, 1) sample, derived from 100 datasets for sample sizes equal to 15, 50, 100, 500. Each posterior approximation is based on 10⁴ MCMC iterations.
  • 103. Comparison with posterior probability
[Figure: four panels for a0 = .1, .3, .4, .5, x-axis sample size up to 500] Plots of ranges of log(n) · log(1 − E[α|x]) (gray) and log(1 − p(M1|x)) (red dotted) over 100 N(0, 1) samples as sample size n grows from 1 to 500; α is the weight of N(0, 1) in the mixture model. The shaded areas indicate the range of the estimations; each plot is based on a Be(a0, a0) prior and each posterior approximation on 10⁴ iterations.
  • 104. Comments
convergence to one boundary value as sample size n grows
impact of the hyperparameter a0 slowly vanishes as n increases, but remains present for moderate sample sizes
when the simulated sample is neither from N(θ1, 1) nor from N(θ2, 2), the behaviour of the posterior varies, depending on which distribution is closest
  • 105. Logit or Probit?
binary dataset: R dataset about diabetes in 200 Pima Indian women, with body mass index as explanatory variable
comparison of logit and probit fits could be suitable; we are thus comparing both fits via our method
M1 : yi | xi, θ1 ∼ B(1, pi) where pi = exp(xi θ1) / [1 + exp(xi θ1)]
M2 : yi | xi, θ2 ∼ B(1, qi) where qi = Φ(xi θ2)
  • 106. Common parameterisation
local reparameterisation strategy that rescales the parameters of the probit model M2 so that the MLEs of both models coincide [Choudhuty et al., 2007]
Φ(xi θ2) ≈ exp(k xi θ2) / [1 + exp(k xi θ2)]
use the best estimate of k to bring both parameters into coherency:
(k0, k1) = (θ01/θ02, θ11/θ12)
and reparameterise M1 and M2 as
M1 : yi | xi, θ ∼ B(1, pi) where pi = exp(xi θ) / [1 + exp(xi θ)]
M2 : yi | xi, θ ∼ B(1, qi) where qi = Φ(xi (κ⁻¹θ)), with κ⁻¹θ = (θ0/k0, θ1/k1)
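The rescaling step is a plain double optimisation: fit both MLEs and take componentwise ratios. A sketch under the stated strategy (neg_loglik and matching_scale are illustrative names; X is assumed to carry an intercept column plus the BMI covariate):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def neg_loglik(th, X, y, link):
        # Bernoulli negative log-likelihood under a logit or probit link
        eta = X @ th
        p = 1.0 / (1.0 + np.exp(-eta)) if link == "logit" else norm.cdf(eta)
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

    def matching_scale(X, y):
        # componentwise (k_0, k_1) = (theta_01/theta_02, theta_11/theta_12)
        th0 = np.zeros(X.shape[1])
        logit_mle = minimize(neg_loglik, th0, args=(X, y, "logit")).x
        probit_mle = minimize(neg_loglik, th0, args=(X, y, "probit")).x
        return logit_mle / probit_mle

The resulting ratios sit near the classical ≈1.6 scaling between logistic and probit slopes.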
  • 107. Prior modelling
under the default g-prior θ ∼ N2(0, n(XᵀX)⁻¹), the full conditional posterior distribution given the allocations ζ is
π(θ | y, X, ζ) ∝ exp{∑_{i; ζi=1} yi xi θ} / ∏_{i; ζi=1} [1 + exp(xi θ)] × exp{−θᵀ(XᵀX)θ / 2n} × ∏_{i; ζi=2} Φ(xi (κ⁻¹θ))^{yi} (1 − Φ(xi (κ⁻¹θ)))^{1−yi}
hence the posterior distribution is clearly defined
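The display above translates directly into a log-density; a sketch reusing the imports from the previous block (z flags the observations allocated to the logit component, k holds (k0, k1)):

    def log_post_theta(theta, X, y, z, k):
        # log pi(theta | y, X, zeta) under the g-prior N2(0, n (X'X)^{-1}), up to a constant
        n = len(y)
        eta = X @ theta
        ll_logit = (y[z] * eta[z] - np.logaddexp(0.0, eta[z])).sum()
        q = np.clip(norm.cdf(X[~z] @ (theta / k)), 1e-12, 1.0 - 1e-12)
        ll_probit = (y[~z] * np.log(q) + (1 - y[~z]) * np.log(1 - q)).sum()
        return ll_logit + ll_probit - theta @ (X.T @ X) @ theta / (2.0 * n)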
  • 108. Results

  a0    α      Logistic θ0   Logistic θ1   Probit θ0/k0   Probit θ1/k1
  .1    .352     −4.06          .103          −2.51           .064
  .2    .427     −4.03          .103          −2.49           .064
  .3    .440     −4.02          .102          −2.49           .063
  .4    .456     −4.01          .102          −2.48           .063
  .5    .449     −4.05          .103          −2.51           .064

Histograms of posteriors of α in favour of the logistic model, for a0 = .1, .2, .3, .4, .5, for (a) the Pima dataset, (b) data from the logistic model, (c) data from the probit model
  • 109. Survival analysis
testing the hypothesis that the data come from a
1. log-Normal(φ, κ²),
2. Weibull(α, λ), or
3. log-Logistic(γ, δ)
distribution
corresponding mixture given by the density
α1 exp{−(log x − φ)² / 2κ²} / (√(2π) x κ) + α2 (α/λ) (x/λ)^{α−1} exp{−(x/λ)^α} + α3 (δ/γ) (x/γ)^{δ−1} / [1 + (x/γ)^δ]²
where α1 + α2 + α3 = 1
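This three-component density transcribes directly; a sketch in which survival_mixture_pdf and its argument names are illustrative (the slide reuses α for both the Weibull shape and the weights, so the code renames the shape a):

    import numpy as np

    def survival_mixture_pdf(x, w1, w2, w3, phi, kappa, a, lam, gam, delta):
        # w1*logNormal(phi, kappa^2) + w2*Weibull(a, lam) + w3*logLogistic(gam, delta)
        f1 = (np.exp(-(np.log(x) - phi) ** 2 / (2.0 * kappa ** 2))
              / (np.sqrt(2.0 * np.pi) * x * kappa))
        f2 = (a / lam) * (x / lam) ** (a - 1.0) * np.exp(-(x / lam) ** a)
        f3 = (delta / gam) * (x / gam) ** (delta - 1.0) / (1.0 + (x / gam) ** delta) ** 2
        return w1 * f1 + w2 * f2 + w3 * f3    # with w1 + w2 + w3 = 1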
  • 111. Reparameterisation
looking for common parameter(s), by matching the mean and the variance of log X across the three models:
φ = µ + γβ = ξ
σ² = π²β²/6 = ζ²π²/3
where γ ≈ 0.5772 is the Euler–Mascheroni constant
allows for a noninformative prior on the common location-scale parameter, π(φ, σ²) = 1/σ²
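Inverting the two identities above recovers each component's own location and scale from the common (φ, σ²); a minimal sketch, with names chosen for readability rather than taken from the paper:

    import numpy as np

    EULER_GAMMA = 0.5772156649015329

    def component_params(phi, sigma2):
        kappa = np.sqrt(sigma2)                  # log-normal: log X ~ N(phi, kappa^2)
        beta = np.sqrt(6.0 * sigma2) / np.pi     # from sigma^2 = pi^2 beta^2 / 6
        mu = phi - EULER_GAMMA * beta            # from phi = mu + gamma * beta
        zeta = np.sqrt(3.0 * sigma2) / np.pi     # from sigma^2 = zeta^2 pi^2 / 3
        xi = phi                                 # logistic location equals common phi
        return (phi, kappa), (mu, beta), (xi, zeta)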
  • 113. Recovery
[Figure: two panels of boxplots] Boxplots of the posterior distributions of the Normal weight α1 under two scenarios: truth = Normal (left panel), truth = Gumbel (right panel), for a0 = 0.01, 0.1, 1.0, 10.0 (from left to right in each panel) and n = 10,000 simulated observations.
  • 114. Asymptotic consistency
posterior consistency holds for the mixture testing procedure [under minor conditions]
two different cases:
the two models, M1 and M2, are well separated
model M1 is a submodel of M2
  • 116. Posterior concentration rate
let π be the prior and xⁿ = (x1, · · · , xn) a sample with true density f*
Proposition: assume that, for all c > 0, there exist Θn ⊂ Θ1 × Θ2 and B > 0 such that
π[Θnᶜ] ≤ n⁻ᶜ and Θn ⊂ {‖θ1‖ + ‖θ2‖ ≤ n^B},
and that there exist H ≥ 0 and L, δ > 0 such that, for j = 1, 2,
sup_{θ,θ′∈Θn} ‖fj,θj − fj,θ′j‖1 ≤ L n^H ‖θj − θ′j‖, θ = (θ1, θ2), θ′ = (θ′1, θ′2),
and, for all ‖θj − θ*j‖ ≤ δ, KL(fj,θj, fj,θ*j) ≲ ‖θj − θ*j‖.
Then, when f* = fθ*,α* with α* ∈ [0, 1], there exists M > 0 such that
π[(α, θ); ‖fθ,α − f*‖1 > M √(log n / n) | xⁿ] = op(1)
  • 117. Separated models
assumption: the models are separated, i.e. identifiability holds:
∀α, α′ ∈ [0, 1], ∀θj, θ′j, j = 1, 2 : Pθ,α = Pθ′,α′ ⇒ α = α′, θ = θ′
further
inf_{θ1∈Θ1} inf_{θ2∈Θ2} ‖f1,θ1 − f2,θ2‖1 > 0
and, for θ*j ∈ Θj, if Pθj converges weakly to Pθ*j, then θj −→ θ*j in the Euclidean topology
  • 118. Separated models
assumption: the models are separated, i.e. identifiability holds:
∀α, α′ ∈ [0, 1], ∀θj, θ′j, j = 1, 2 : Pθ,α = Pθ′,α′ ⇒ α = α′, θ = θ′
Theorem: under the above assumptions, for all ε > 0,
π[|α − α*| > ε | xⁿ] = op(1)
  • 119. Separated models
under the same separation assumption,
Theorem: if θj → fj,θj is C² around θ*j, j = 1, 2, if f1,θ*1 − f2,θ*2, ∇f1,θ*1 and ∇f2,θ*2 are linearly independent in y, and if there exists δ > 0 such that
∇f1,θ*1, ∇f2,θ*2, sup_{|θ1−θ*1|<δ} |D²f1,θ1|, sup_{|θ2−θ*2|<δ} |D²f2,θ2| ∈ L¹,
then
π[|α − α*| > M √(log n / n) | xⁿ] = op(1)
  • 120. Separated models
the theorem allows for an interpretation of α under the posterior: if the data xⁿ are generated from model M1, then the posterior on α concentrates around α = 1
  • 121. Embedded case
here M1 is a submodel of M2, i.e. θ2 = (θ1, ψ), and θ2 = (θ1, ψ0 = 0) corresponds to f2,θ2 ∈ M1
same posterior concentration rate √(log n / n) for estimating α when α* ∈ (0, 1) and ψ* ≠ 0
  • 122. Null case
case where ψ* = 0, i.e. f* is in model M1
two possible paths to approximate f*: either α goes to 1 (path 1) or ψ goes to 0 (path 2)
new identifiability condition: Pθ,α = P* only if
α = 1, θ1 = θ*1, θ2 = (θ*1, ψ), or α ≤ 1, θ1 = θ*1, θ2 = (θ*1, 0)
prior π(α, θ) = πα(α)π1(θ1)πψ(ψ), with θ2 = (θ1, ψ) and a common prior on θ1
  • 124. Assumptions
[B1] Regularity: assume that θ1 → f1,θ1 and θ2 → f2,θ2 are 3 times continuously differentiable and that, with
f̄1,θ*1 = sup_{|θ1−θ*1|<δ} f1,θ1 and f̲1,θ*1 = inf_{|θ1−θ*1|<δ} f1,θ1,
F*( f̄1,θ*1³ / f̲1,θ*1³ ) < +∞,
F*( sup_{|θ1−θ*1|<δ} |∇f1,θ*1|³ / f̲1,θ*1³ ) < +∞, F*( |∇f1,θ*1|⁴ / f̲1,θ*1⁴ ) < +∞,
F*( sup_{|θ1−θ*1|<δ} |D²f1,θ*1|² / f̲1,θ*1² ) < +∞, F*( sup_{|θ1−θ*1|<δ} |D³f1,θ*1| / f̲1,θ*1 ) < +∞
  • 125. Assumptions
[B2] Integrability: there exists S0 ⊂ S ∩ {|ψ| > δ0}, for some positive δ0, satisfying Leb(S0) > 0 and such that, for all ψ ∈ S0,
F*( sup_{|θ1−θ*1|<δ} f2,θ1,ψ / f̲1,θ*1⁴ ) < +∞ and F*( sup_{|θ1−θ*1|<δ} f2,θ1,ψ³ / f̲1,θ*1³ ) < +∞
  • 126. Assumptions
[B3] Stronger identifiability: set
∇f2,θ*1,ψ*(x) = (∇θ1 f2,θ*1,ψ*(x)ᵀ, ∇ψ f2,θ*1,ψ*(x)ᵀ)ᵀ.
Then, for all ψ ∈ S with ψ ≠ 0, and for η0 ∈ R, η1 ∈ R^{d1}:
η0 (f1,θ*1 − f2,θ*1,ψ) + η1ᵀ ∇θ1 f1,θ*1 = 0 ⇔ η0 = 0, η1 = 0
  • 127. Consistency
Theorem: given the mixture fθ1,ψ,α = α f1,θ1 + (1 − α) f2,θ1,ψ and a sample xⁿ = (x1, · · · , xn) issued from f1,θ*1, under assumptions B1–B3 there exists M > 0 such that
π[(α, θ); ‖fθ,α − f*‖1 > M √(log n / n) | xⁿ] = op(1).
Moreover, if α ∼ Be(a1, a2) with a2 < d2, and if the prior πθ1,ψ is absolutely continuous with positive and continuous density at (θ*1, 0), then for any Mn −→ ∞
π[|α − α*| > Mn (log n)^γ / √n | xⁿ] = op(1), with γ = max((d1 + a2)/(d2 − a2), 1)/2