Bayesian tests of hypotheses
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
Joint work with K. Kamary, K. Mengersen & J. Rousseau
Outline
Bayesian testing of hypotheses
Noninformative solutions
Testing via mixtures
Paradigm shift
Testing issues
Hypothesis testing
central problem of statistical inference
witness the recent ASA statement on p-values (Wasserstein,
2016)
dramatically differentiating feature between classical and
Bayesian paradigms
wide open to controversy and divergent opinions, including
within the Bayesian community
non-informative Bayesian testing case mostly unresolved,
witness the Jeffreys–Lindley paradox
[Berger (2003), Mayo & Cox (2006), Gelman (2008)]
“proper use and interpretation of the p-value”
“Scientific conclusions and
business or policy decisions
should not be based only on
whether a p-value passes a
specific threshold.”
“By itself, a p-value does not
provide a good measure of
evidence regarding a model or
hypothesis.”
[Wasserstein, 2016]
Bayesian testing of hypotheses
Bayesian model selection as comparison of k potential
statistical models towards the selection of model that fits the
data “best”
mostly accepted perspective: it does not primarily seek to
identify which model is “true”, but compares fits
tools like Bayes factor naturally include a penalisation
addressing model complexity, mimicked by Bayes Information
(BIC) and Deviance Information (DIC) criteria
posterior predictive tools successfully advocated in Gelman et
al. (2013) even though they involve double use of data
Bayesian tests 101
Associated with the risk
R(θ, δ) = E_θ[L(θ, δ(x))] = P_θ(δ(x) = 0) if θ ∈ Θ0, and P_θ(δ(x) = 1) otherwise
Bayes test
The Bayes estimator associated with π and with the 0–1 loss is
δ^π(x) = 1 if P(θ ∈ Θ0|x) > P(θ ∈ Θ0^c|x), and 0 otherwise
Bayesian tests 102
Weights errors differently under both hypotheses:
Theorem (Optimal Bayes decision)
Under the 0–1 loss function
L(θ, d) = 0 if d = I_{Θ0}(θ), a0 if d = 1 and θ ∉ Θ0, a1 if d = 0 and θ ∈ Θ0,
the Bayes procedure is
δ^π(x) = 1 if P(θ ∈ Θ0|x) > a0/(a0 + a1), and 0 otherwise
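As a minimal illustration (a sketch; the function name and the symmetric default costs a0 = a1 = 1 are mine), the procedure reduces to comparing the posterior probability of H0 with the threshold a0/(a0 + a1):

def bayes_decision(post_prob_H0, a0=1.0, a1=1.0):
    """Return 1 (accept H0) when P(theta in Theta0 | x) exceeds a0/(a0+a1), else 0."""
    return 1 if post_prob_H0 > a0 / (a0 + a1) else 0

# with symmetric costs the rule accepts H0 whenever its posterior probability exceeds 1/2
print(bayes_decision(0.6), bayes_decision(0.3))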
A function of posterior probabilities
Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0^c,
B01 = [π(Θ0|x) / π(Θ0^c|x)] / [π(Θ0) / π(Θ0^c)] = ∫_{Θ0} f(x|θ)π0(θ)dθ / ∫_{Θ0^c} f(x|θ)π1(θ)dθ
[Jeffreys, ToP, 1939, V, §5.01]
Bayes rule under 0–1 loss: acceptance if
B01 > {(1 − π(Θ0))/a1} / {π(Θ0)/a0}
self-contained concept
Outside decision-theoretic environment:
eliminates choice of π(Θ0)
but depends on the choice of (π0, π1)
Bayesian/marginal equivalent to the likelihood ratio
Jeffreys’ scale of evidence (coded in the small helper below):
if log10(B^π_10) between 0 and 0.5, evidence against H0 weak,
if log10(B^π_10) between 0.5 and 1, evidence substantial,
if log10(B^π_10) between 1 and 2, evidence strong, and
if log10(B^π_10) above 2, evidence decisive
[...fairly arbitrary!]
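For illustration only, Jeffreys’ scale can be coded as a small helper mapping log10(B10) to the labels above (a sketch; the function name is mine and negative values are simply reported as favouring H0):

def jeffreys_label(log10_B10):
    """Qualitative reading of log10(B10) on Jeffreys' scale of evidence against H0."""
    if log10_B10 < 0:
        return "favours H0"
    if log10_B10 <= 0.5:
        return "weak evidence against H0"
    if log10_B10 <= 1:
        return "substantial evidence against H0"
    if log10_B10 <= 2:
        return "strong evidence against H0"
    return "decisive evidence against H0"

print(jeffreys_label(0.7))   # 'substantial evidence against H0'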
consistency
Example of a normal X̄n ∼ N(µ, 1/n) when µ ∼ N(0, 1), leading to
B01 = (1 + n)^{1/2} exp{−n² x̄n² / 2(1 + n)}
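A short numerical check of this consistency, based on the closed form above (a sketch; the seed, the sample sizes and the alternative value µ = 0.5 are arbitrary choices of mine): B01 grows without bound when the data are generated under H0 and collapses to zero otherwise.

import numpy as np

rng = np.random.default_rng(0)

def bayes_factor_01(xbar, n):
    # closed form for H0: mu = 0 versus mu ~ N(0, 1), with xbar ~ N(mu, 1/n)
    return np.sqrt(1 + n) * np.exp(-n**2 * xbar**2 / (2 * (1 + n)))

for n in (10, 100, 1000, 10000):
    xbar_H0 = rng.normal(0.0, 1.0 / np.sqrt(n))   # sample mean generated under H0
    xbar_H1 = rng.normal(0.5, 1.0 / np.sqrt(n))   # sample mean generated under mu = 0.5
    print(n, bayes_factor_01(xbar_H0, n), bayes_factor_01(xbar_H1, n))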
Some difficulties
tension between using (i) posterior probabilities justified by
binary loss function but depending on unnatural prior weights,
and (ii) Bayes factors that eliminate dependence but escape
direct connection with posterior, unless prior weights are
integrated within loss
delicate interpretation (or calibration) of strength of the Bayes
factor towards supporting a given hypothesis or model,
because not a Bayesian decision rule
similar difficulty with posterior probabilities, with tendency to
interpret them as p-values: only report of respective strengths
of fitting data to both models
referring to a fixed and arbitrary cutoff value falls into the
same difficulties as regular p-values
no “third way” like opting out from a decision
Some further difficulties
long-lasting impact of prior modeling, i.e., choice of prior
distributions on parameters of both models, despite overall
consistency proof for Bayes factor
discontinuity in valid use of improper priors since they are not
justified in most testing situations, leading to many alternative
and ad hoc solutions, where data is either used twice or split
in artificial ways [or further tortured into confession]
binary (accept vs. reject) outcome more suited for immediate
decision (if any) than for model evaluation, in connection with
rudimentary loss function [atavistic remnant of the
Neyman–Pearson formalism]
Some additional difficulties
related impossibility to ascertain simultaneous misfit or to
detect outliers
no assessment of uncertainty associated with decision itself
besides posterior probability
difficult computation of marginal likelihoods in most settings
with further controversies about which algorithm to adopt
strong dependence of posterior probabilities on conditioning
statistics (ABC), which undermines their validity for model
assessment
temptation to create pseudo-frequentist equivalents such as
q-values with even less Bayesian justifications
Time for a paradigm shift
back to some solutions
Historical appearance of Bayesian tests
Is the new parameter supported by the observations or is
any variation expressible by it better interpreted as
random? Thus we must set two hypotheses for
comparison, the more complicated having the smaller
initial probability
...compare a specially suggested value of a new
parameter, often 0 [q], with the aggregate of other
possible values [q′]. We shall call q the null hypothesis
and q′ the alternative hypothesis [and] we must take
P(q|H) = P(q′|H) = 1/2 .
(Jeffreys, ToP, 1939, V, §5.0)
A major refurbishment
Suppose we are considering whether a location parameter
α is 0. The estimation prior probability for it is uniform
and we should have to take f (α) = 0 and K[= B10]
would always be infinite (Jeffreys, ToP, V, §5.02)
When the null hypothesis is supported by a set of measure 0
against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous
prior distribution
[End of the story?!]
Requirement
Defined prior distributions under both assumptions,
π0(θ) ∝ π(θ) I_{Θ0}(θ),  π1(θ) ∝ π(θ) I_{Θ1}(θ)
(under the standard dominating measures on Θ0 and Θ1)
Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = ρ1,
π(θ) = ρ0 π0(θ) + ρ1 π1(θ)
Point null hypotheses
“Is it of the slightest use to reject a hypothesis until we have some
idea of what to put in its place?” H. Jeffreys, ToP (p.390)
Particular case H0 : θ = θ0
Take ρ0 = Pr^π(θ = θ0) and g1 prior density under H0^c.
Posterior probability of H0
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
and marginal under H0^c
m1(x) = ∫_{Θ1} f(x|θ) g1(θ) dθ
and
B^π_01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0 / (1 − ρ0)] = f(x|θ0) / m1(x)
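These formulas translate directly into code. In the sketch below, the N(θ, 1) likelihood for a single observation, the alternative prior g1 = N(θ0, τ²) and the default ρ0 = 1/2 are illustrative choices of mine; m1(x) is obtained by numerical integration, even though in this conjugate case it is also available as the N(θ0, 1 + τ²) density.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def point_null_summary(x, theta0=0.0, rho0=0.5, tau=1.0):
    """Posterior probability of H0: theta = theta0 and Bayes factor B01
    for one observation x ~ N(theta, 1), with g1 = N(theta0, tau^2) under H1."""
    f0 = stats.norm.pdf(x, loc=theta0, scale=1.0)
    # marginal under the alternative, m1(x) = int f(x|theta) g1(theta) dtheta
    m1, _ = quad(lambda t: stats.norm.pdf(x, loc=t, scale=1.0)
                 * stats.norm.pdf(t, loc=theta0, scale=tau), -np.inf, np.inf)
    B01 = f0 / m1
    post0 = rho0 * f0 / (rho0 * f0 + (1 - rho0) * m1)
    return post0, B01

print(point_null_summary(1.96))   # a borderline "significant" observation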
Noninformative proposals
Bayesian testing of hypotheses
Noninformative solutions
Testing via mixtures
Paradigm shift
what’s special about the Bayes factor?!
“The priors do not represent substantive knowledge of the
parameters within the model
Using Bayes’ theorem, these priors can then be updated to
posteriors conditioned on the data that were actually observed
In general, the fact that different priors result in different
Bayes factors should not come as a surprise
The Bayes factor (...) balances the tension between parsimony
and goodness of fit, (...) against overfitting the data
In induction there is no harm in being occasionally wrong; it is
inevitable that we shall be”
[Jeffreys, 1939; Ly et al., 2015]
what’s wrong with the Bayes factor?!
(1/2, 1/2) partition between hypotheses has very little to
suggest in terms of extensions
central difficulty stands with issue of picking a prior
probability of a model
unfortunate impossibility of using improper priors in most
settings
Bayes factors lack direct scaling associated with posterior
probability and loss function
twofold dependence on subjective prior measure, first in prior
weights of models and second in lasting impact of prior
modelling on the parameters
Bayes factor offers no window into uncertainty associated with
decision
further reasons in the summary
[Robert, 2016]
Lindley’s paradox
In a normal mean testing problem,
x̄n ∼ N(θ, σ²/n),  H0 : θ = θ0,
under Jeffreys’ prior θ ∼ N(θ0, σ²), the Bayes factor
B01(tn) = (1 + n)^{1/2} exp{−n tn² / 2[1 + n]},
where tn = √n |x̄n − θ0| / σ, satisfies
B01(tn) −→ ∞ as n −→ ∞
[assuming a fixed tn]
[Lindley, 1957]
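A numerical illustration of the paradox (a sketch; keeping tn = 1.96, i.e. a two-sided p-value of about 0.05, fixed while n grows is my choice of scenario): the same borderline statistic produces ever stronger support for H0.

import numpy as np

def B01(t_n, n):
    # Bayes factor from the slide, as a function of the standardized statistic t_n
    return np.sqrt(1 + n) * np.exp(-n * t_n**2 / (2 * (1 + n)))

t = 1.96   # fixed, borderline 5% two-sided statistic
for n in (10, 100, 1000, 100000, 10000000):
    print(n, B01(t, n))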
A strong impropriety
Improper priors not allowed in Bayes factors:
If ∫_{Θ1} π1(dθ1) = ∞ or ∫_{Θ2} π2(dθ2) = ∞
then π1 or π2 cannot be coherently normalised, while the
normalisation matters in the Bayes factor B12
Lack of mathematical justification for “common nuisance
parameter” [and prior of]
[Berger, Pericchi, and Varshavsky, 1998; Marin and Robert, 2013]
On some resolutions of the paradox
use of pseudo-Bayes factors, fractional Bayes factors, &tc,
which lack complete proper Bayesian justification
[Berger & Pericchi, 2001]
use of identical improper priors on nuisance parameters, a
notion already entertained by Jeffreys
[Berger et al., 1998; Marin & Robert, 2013]
Péché de jeunesse: equating the values of the prior densities
at the point-null value θ0,
ρ0 = (1 − ρ0) π1(θ0)
[Robert, 1993]
calibration via the posterior predictive distribution, which uses
the data twice
matching priors, whose sole purpose is to bring frequentist
and Bayesian coverages as close as possible
[Datta & Mukerjee, 2004]
use of score functions extending the log score function
log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2),
that are independent of the normalising constant
[Dawid et al., 2013; Dawid & Musio, 2015]
non-local priors correcting default priors towards more
balanced error rates
[Johnson & Rossell, 2010; Consonni et al., 2013]
Pseudo-Bayes factors
Idea
Use one part x[i] of the data x to make the prior proper:
πi improper but πi(·|x[i]) proper
and
∫ fi(x[n/i]|θi) πi(θi|x[i]) dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i]) dθj
independent of normalizing constant
Use remaining x[n/i] to run test as if πj(θj|x[i]) is the true prior
Motivation
Provides a working principle for improper priors
Gather enough information from data to achieve properness
and use this properness to run the test on remaining data
does not use x twice, unlike Aitkin (1991)
Issues
depends on the choice of x[i]
many ways of combining pseudo-Bayes factors
AIBF = B^N_ji × (1/L) Σℓ Bij(x[ℓ])
MIBF = B^N_ji × med[Bij(x[ℓ])]
GIBF = B^N_ji × exp{(1/L) Σℓ log Bij(x[ℓ])}
not often an exact Bayes factor and thus lacking inner coherence:
B12 ≠ B10 B02 and B01 ≠ 1/B10
[Berger & Pericchi, 1996]
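The three combinations are straightforward to code (a sketch with made-up toy numbers; here B_N_ji stands for the Bayes factor computed from the full data under the improper priors and partial_B_ij for the L partial Bayes factors Bij(x[ℓ])):

import numpy as np

def intrinsic_bayes_factors(B_N_ji, partial_B_ij):
    """Arithmetic, median and geometric intrinsic Bayes factors, combining
    partial Bayes factors B_ij(x[l]) over L training samples."""
    b = np.asarray(partial_B_ij, dtype=float)
    return {"AIBF": B_N_ji * b.mean(),
            "MIBF": B_N_ji * np.median(b),
            "GIBF": B_N_ji * np.exp(np.mean(np.log(b)))}

print(intrinsic_bayes_factors(2.3, [0.8, 1.1, 0.9, 1.4, 0.7]))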
Fractional Bayes factor
Idea
use directly the likelihood to separate training sample from testing
sample
B^F_12 = B12(x) × [∫ L2^b(θ2) π2(θ2) dθ2] / [∫ L1^b(θ1) π1(θ1) dθ1]
[O’Hagan, 1995]
Proportion b of the sample used to gain properness
Fractional Bayes factor (cont’d)
Example (Normal mean)
B^F_12 = (1/√b) exp{n(b − 1) x̄n² / 2}
corresponds to exact Bayes factor for the prior N(0, (1 − b)/nb)
If b constant, prior variance goes to 0
If b = 1/n, prior variance stabilises around 1
If b = n^{−α}, α < 1, prior variance goes to 0 too
Call to external principles to pick the order of b
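The closed form above is easy to explore numerically (a sketch; the values x̄n = 0.2, n = 100 and the exponent α = 1/2 are arbitrary illustrations of the three regimes for b):

import numpy as np

def fractional_BF_12(xbar_n, n, b):
    # closed form of the fractional Bayes factor in the normal-mean example
    return np.exp(n * (b - 1) * xbar_n**2 / 2) / np.sqrt(b)

xbar, n = 0.2, 100
for b in (0.5, 1.0 / n, n ** -0.5):   # b constant, b = 1/n, b = n^{-alpha} with alpha = 1/2
    print(b, fractional_BF_12(xbar, n, b))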
Bayesian predictive
“If the model fits, then replicated data generated under
the model should look similar to observed data. To put it
another way, the observed data should look plausible
under the posterior predictive distribution. This is really a
self-consistency check: an observed discrepancy can be
due to model misfit or chance.” (BDA, p.143)
Use of posterior predictive
p(y^rep|y) = ∫ p(y^rep|θ) π(θ|y) dθ
and measure of discrepancy T(·, ·)
Replacing p-value
p(y|θ) = P(T(y^rep, θ) ≥ T(y, θ) | θ)
with Bayesian posterior p-value
P(T(y^rep, θ) ≥ T(y, θ) | y) = ∫ p(y|θ) π(θ|y) dθ
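A simulation sketch of this Bayesian posterior p-value for a toy normal model; all modelling choices below, N(µ, 1) observations, an N(0, 10²) prior on µ and a χ²-type discrepancy T, are mine for illustration:

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.3, 1.0, size=50)                 # "observed" data, for illustration
n = y.size

# conjugate posterior for mu when y_i ~ N(mu, 1) and mu ~ N(0, 10^2)
post_var = 1.0 / (n + 1.0 / 100)
post_mean = post_var * y.sum()

def T(data, theta):
    # chi-square type discrepancy between data and parameter
    return np.sum((data - theta) ** 2)

M, hits = 5000, 0
for _ in range(M):
    theta = rng.normal(post_mean, np.sqrt(post_var))   # theta ~ pi(theta | y)
    y_rep = rng.normal(theta, 1.0, size=n)             # y_rep ~ p(y_rep | theta)
    hits += T(y_rep, theta) >= T(y, theta)
print("posterior predictive p-value:", hits / M)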
Issues
“the posterior predictive p-value is such a [Bayesian]
probability statement, conditional on the model and data,
about what might be expected in future replications.
(BDA, p.151)
sounds too much like a p-value...!
relies on choice of T(·, ·)
seems to favour overfitting
(again) using the data twice (once for the posterior and twice
in the p-value)
needs to be calibrated (back to 0.05?)
general difficulty in interpreting
where is the penalty for model complexity?
Changing the testing perspective
Bayesian testing of hypotheses
Noninformative solutions
Testing via mixtures
Paradigm shift
Paradigm shift
New proposal for a paradigm shift (!) in the Bayesian processing of
hypothesis testing and of model selection
convergent and naturally interpretable solution
more extended use of improper priors
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
Paradigm shift
Simple representation of the testing problem as a
two-component mixture estimation problem where the
weights are formally equal to 0 or 1
Approach inspired from consistency result of Rousseau and
Mengersen (2011) on estimated overfitting mixtures
Mixture representation not directly equivalent to the use of a
posterior probability
Potential of a better approach to testing, while not expanding
number of parameters
Calibration of posterior distribution of the weight of a model,
moving from artificial notion of posterior probability of a
model
Encompassing mixture model
Idea: Given two statistical models,
M1 : x ∼ f1(x|θ1), θ1 ∈ Θ1   and   M2 : x ∼ f2(x|θ2), θ2 ∈ Θ2,
embed both within an encompassing mixture
Mα : x ∼ α f1(x|θ1) + (1 − α) f2(x|θ2), 0 ≤ α ≤ 1 (1)
Note: Both models correspond to special cases of (1), one for
α = 1 and one for α = 0
Draw inference on mixture representation (1), as if each
observation was individually and independently produced by the
mixture model
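A minimal Gibbs sampler for the weight α in (1), in the toy case where both components are fully specified, f1 = N(0, 1) and f2 = N(1, 1), with α ∼ Beta(a0, a0); these component choices are mine, and the general approach also updates θ1 and θ2 within the same sampler:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def posterior_alpha(x, a0=0.5, iters=5000):
    """Gibbs sampler for alpha in x ~ alpha N(0,1) + (1-alpha) N(1,1),
    with a Beta(a0, a0) prior on alpha and known component parameters."""
    f1 = stats.norm.pdf(x, 0.0, 1.0)
    f2 = stats.norm.pdf(x, 1.0, 1.0)
    alpha, draws = 0.5, np.empty(iters)
    for t in range(iters):
        # allocate each observation to component 1 with its conditional probability
        p1 = alpha * f1 / (alpha * f1 + (1 - alpha) * f2)
        z = rng.random(x.size) < p1
        # conjugate Beta update of alpha given the allocations
        alpha = rng.beta(a0 + z.sum(), a0 + (~z).sum())
        draws[t] = alpha
    return draws

x = rng.normal(0.0, 1.0, size=100)        # data actually generated from model M1
a_draws = posterior_alpha(x)
print("posterior median of alpha:", np.median(a_draws[1000:]))

With data generated from M1, the posterior mass of α accumulates near 1 as the sample size grows, in line with the consistency properties invoked above.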
Inferential motivations
Sounds like approximation to the real model, but several definitive
advantages to this paradigm shift:
Bayes estimate of the weight α replaces posterior probability
of model M1, equally convergent indicator of which model is
“true”, while avoiding artificial prior probabilities on model
indices, ω1 and ω2
interpretation of estimator of α at least as natural as handling
the posterior probability, while avoiding zero-one loss setting
α and its posterior distribution provide measure of proximity
to the models, while being interpretable as data propensity to
stand within one model
further allows for alternative perspectives on testing and
model choice, like predictive tools, cross-validation, and
information indices like WAIC
Computational motivations
avoids highly problematic computations of the marginal
likelihoods, since standard algorithms are available for
Bayesian mixture estimation
straightforward extension to a finite collection of models, with
a larger number of components, which considers all models at
once and eliminates least likely models by simulation
eliminates difficulty of label switching that plagues both
Bayesian estimation and Bayesian computation, since
components are no longer exchangeable
posterior distribution of α evaluates more thoroughly strength
of support for a given model than the single figure outcome of
a posterior probability
variability of posterior distribution on α allows for a more
thorough assessment of the strength of this support
Noninformative motivations
additional feature missing from traditional Bayesian answers:
a mixture model acknowledges possibility that, for a finite
dataset, both models or none could be acceptable
standard (proper and informative) prior modeling can be
reproduced in this setting, but non-informative (improper)
priors are also manageable therein, provided both models are
first reparameterised towards shared parameters, e.g. location
and scale parameters
in the special case when all parameters are common,
Mα : x ∼ α f1(x|θ) + (1 − α) f2(x|θ), 0 ≤ α ≤ 1
if θ is a location parameter, a flat prior π(θ) ∝ 1 is available
Weakly informative motivations
using the same parameters or some identical parameters on
both components highlights that opposition between the two
components is not an issue of enjoying different parameters
those common parameters are nuisance parameters, to be
integrated out [unlike Lindley’s paradox]
prior model weights ωi rarely discussed in classical Bayesian
approach, even though linear impact on posterior probabilities.
Here, prior modeling only involves selecting a prior on α, e.g.,
α ∼ B(a0, a0)
while a0 impacts posterior on α, it always leads to mass
accumulation near 1 or 0, i.e. favours most likely model
sensitivity analysis straightforward to carry out
approach easily calibrated by parametric bootstrap providing
reference posterior of α under each model (see the sketch below)
natural Metropolis–Hastings alternative
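A parametric bootstrap calibration along the lines suggested above, reusing the posterior_alpha sketch from the encompassing-mixture slide (the number of replicates, the sample size and the use of the posterior median as summary are arbitrary choices of mine): simulate datasets under M1, record the posterior summary of α for each, and compare the value obtained on the actual data with this reference distribution.

import numpy as np

rng = np.random.default_rng(3)

# reference distribution of the posterior median of alpha when M1 = N(0,1) holds,
# reusing posterior_alpha() from the mixture sketch above
ref_medians = [np.median(posterior_alpha(rng.normal(0.0, 1.0, size=100),
                                         a0=0.5, iters=2000)[500:])
               for _ in range(100)]
print("5% reference quantile under M1:", np.quantile(ref_medians, 0.05))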
Comparison with posterior probability
[Figure: four panels, a0 = .1, .3, .4, .5, plotted against sample size]
Plots of ranges of log(n) log(1 − E[α|x]) (gray color) and log(1 − p(M1|x)) (red
dotted) over 100 N(0, 1) samples as sample size n grows from 1 to 500, where α
is the weight of N(0, 1) in the mixture model. The shaded areas indicate the
range of the estimations and each plot is based on a Beta prior with
a0 = .1, .2, .3, .4, .5, 1; each posterior approximation is based on 10^4
iterations.
Towards which decision?
And if we have to make a decision?
soft: consider behaviour of posterior under prior predictives
or posterior predictive [e.g., when the prior predictive does not exist],
bootstrapping behaviour,
comparison with Bayesian non-parametric solution
hard: rethink the loss function
Conclusion
many applications of the Bayesian paradigm concentrate on
the comparison of scientific theories and on testing of null
hypotheses
natural tendency to default to Bayes factors
poorly understood sensitivity to prior modeling and posterior
calibration
Time is ripe for a paradigm shift
Down with Bayes factors!
Time is ripe for a paradigm shift
original testing problem replaced with a better controlled
estimation target
allow for posterior variability over the component frequency as
opposed to deterministic Bayes factors
range of acceptance, rejection and indecision conclusions
easily calibrated by simulation
posterior medians quickly settling near the boundary values of
0 and 1
potential derivation of a Bayesian b-value by looking at the
posterior area under the tail of the distribution of the weight
Prior modelling
Time is ripe for a paradigm shift
Partly common parameterisation always feasible and hence
allows for reference priors
removal of the absolute prohibition of improper priors in
hypothesis testing
prior on the weight α shows sensitivity that naturally vanishes
as the sample size increases
default value of a0 = 0.5 in the Beta prior
Computing aspects
Time is ripe for a paradigm shift
proposal that does not induce additional computational strain
when algorithmic solutions exist for both models, they can be
recycled towards estimating the encompassing mixture
easier than in standard mixture problems due to common
parameters that allow for original MCMC samplers to be
turned into proposals
Gibbs sampling completions useful for assessing potential
outliers but not essential to achieve a conclusion about the
overall problem

More Related Content

What's hot

An overview of Bayesian testing
An overview of Bayesian testingAn overview of Bayesian testing
An overview of Bayesian testingChristian Robert
 
from model uncertainty to ABC
from model uncertainty to ABCfrom model uncertainty to ABC
from model uncertainty to ABCChristian Robert
 
Approximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsApproximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsChristian Robert
 
random forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimationrandom forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimationChristian Robert
 
Bayesian statistics intro using r
Bayesian statistics intro using rBayesian statistics intro using r
Bayesian statistics intro using rJosue Guzman
 
accurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannaccurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannolli0601
 
Bayesian statistics using r intro
Bayesian statistics using r   introBayesian statistics using r   intro
Bayesian statistics using r introBayesLaplace1
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Christian Robert
 
ABC short course: introduction chapters
ABC short course: introduction chaptersABC short course: introduction chapters
ABC short course: introduction chaptersChristian Robert
 
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrap
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrapStatistics (1): estimation, Chapter 2: Empirical distribution and bootstrap
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrapChristian Robert
 
Convergence of ABC methods
Convergence of ABC methodsConvergence of ABC methods
Convergence of ABC methodsChristian Robert
 
better together? statistical learning in models made of modules
better together? statistical learning in models made of modulesbetter together? statistical learning in models made of modules
better together? statistical learning in models made of modulesChristian Robert
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceChristian Robert
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussionChristian Robert
 
Multiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximationsMultiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximationsChristian Robert
 

What's hot (20)

An overview of Bayesian testing
An overview of Bayesian testingAn overview of Bayesian testing
An overview of Bayesian testing
 
from model uncertainty to ABC
from model uncertainty to ABCfrom model uncertainty to ABC
from model uncertainty to ABC
 
Approximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsApproximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forests
 
random forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimationrandom forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimation
 
Big model, big data
Big model, big dataBig model, big data
Big model, big data
 
Bayesian statistics intro using r
Bayesian statistics intro using rBayesian statistics intro using r
Bayesian statistics intro using r
 
ABC workshop: 17w5025
ABC workshop: 17w5025ABC workshop: 17w5025
ABC workshop: 17w5025
 
accurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannaccurate ABC Oliver Ratmann
accurate ABC Oliver Ratmann
 
Intractable likelihoods
Intractable likelihoodsIntractable likelihoods
Intractable likelihoods
 
Bayesian statistics using r intro
Bayesian statistics using r   introBayesian statistics using r   intro
Bayesian statistics using r intro
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
ABC short course: introduction chapters
ABC short course: introduction chaptersABC short course: introduction chapters
ABC short course: introduction chapters
 
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrap
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrapStatistics (1): estimation, Chapter 2: Empirical distribution and bootstrap
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrap
 
Convergence of ABC methods
Convergence of ABC methodsConvergence of ABC methods
Convergence of ABC methods
 
better together? statistical learning in models made of modules
better together? statistical learning in models made of modulesbetter together? statistical learning in models made of modules
better together? statistical learning in models made of modules
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergence
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Multiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximationsMultiple estimators for Monte Carlo approximations
Multiple estimators for Monte Carlo approximations
 

Viewers also liked

Reading the Lasso 1996 paper by Robert Tibshirani
Reading the Lasso 1996 paper by Robert TibshiraniReading the Lasso 1996 paper by Robert Tibshirani
Reading the Lasso 1996 paper by Robert TibshiraniChristian Robert
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methodsChristian Robert
 
Phylogenetic models and MCMC methods for the reconstruction of language history
Phylogenetic models and MCMC methods for the reconstruction of language historyPhylogenetic models and MCMC methods for the reconstruction of language history
Phylogenetic models and MCMC methods for the reconstruction of language historyRobin Ryder
 
Omiros' talk on the Bernoulli factory problem
Omiros' talk on the  Bernoulli factory problemOmiros' talk on the  Bernoulli factory problem
Omiros' talk on the Bernoulli factory problemBigMC
 
3rd NIPS Workshop on PROBABILISTIC PROGRAMMING
3rd NIPS Workshop on PROBABILISTIC PROGRAMMING3rd NIPS Workshop on PROBABILISTIC PROGRAMMING
3rd NIPS Workshop on PROBABILISTIC PROGRAMMINGChristian Robert
 
folding Markov chains: the origaMCMC
folding Markov chains: the origaMCMCfolding Markov chains: the origaMCMC
folding Markov chains: the origaMCMCChristian Robert
 
Reading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li ChenluReading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li ChenluChristian Robert
 
Uniform and non-uniform pseudo random numbers generators for high dimensional...
Uniform and non-uniform pseudo random numbers generators for high dimensional...Uniform and non-uniform pseudo random numbers generators for high dimensional...
Uniform and non-uniform pseudo random numbers generators for high dimensional...LEBRUN Régis
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)Christian Robert
 

Viewers also liked (11)

Reading the Lasso 1996 paper by Robert Tibshirani
Reading the Lasso 1996 paper by Robert TibshiraniReading the Lasso 1996 paper by Robert Tibshirani
Reading the Lasso 1996 paper by Robert Tibshirani
 
Introduction to MCMC methods
Introduction to MCMC methodsIntroduction to MCMC methods
Introduction to MCMC methods
 
Phylogenetic models and MCMC methods for the reconstruction of language history
Phylogenetic models and MCMC methods for the reconstruction of language historyPhylogenetic models and MCMC methods for the reconstruction of language history
Phylogenetic models and MCMC methods for the reconstruction of language history
 
Omiros' talk on the Bernoulli factory problem
Omiros' talk on the  Bernoulli factory problemOmiros' talk on the  Bernoulli factory problem
Omiros' talk on the Bernoulli factory problem
 
Richard Everitt's slides
Richard Everitt's slidesRichard Everitt's slides
Richard Everitt's slides
 
3rd NIPS Workshop on PROBABILISTIC PROGRAMMING
3rd NIPS Workshop on PROBABILISTIC PROGRAMMING3rd NIPS Workshop on PROBABILISTIC PROGRAMMING
3rd NIPS Workshop on PROBABILISTIC PROGRAMMING
 
folding Markov chains: the origaMCMC
folding Markov chains: the origaMCMCfolding Markov chains: the origaMCMC
folding Markov chains: the origaMCMC
 
Reading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li ChenluReading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li Chenlu
 
Uniform and non-uniform pseudo random numbers generators for high dimensional...
Uniform and non-uniform pseudo random numbers generators for high dimensional...Uniform and non-uniform pseudo random numbers generators for high dimensional...
Uniform and non-uniform pseudo random numbers generators for high dimensional...
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)Simulation (AMSI Public Lecture)
Simulation (AMSI Public Lecture)
 

Similar to ISBA 2016: Foundations

Course on Bayesian computational methods
Course on Bayesian computational methodsCourse on Bayesian computational methods
Course on Bayesian computational methodsChristian Robert
 
San Antonio short course, March 2010
San Antonio short course, March 2010San Antonio short course, March 2010
San Antonio short course, March 2010Christian Robert
 
“I. Conjugate priors”
“I. Conjugate priors”“I. Conjugate priors”
“I. Conjugate priors”Steven Scott
 
01.conditional prob
01.conditional prob01.conditional prob
01.conditional probSteven Scott
 
01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folder01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folderSteven Scott
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classificationsathish sak
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical researchBhaswat Chakraborty
 
02.bayesian learning
02.bayesian learning02.bayesian learning
02.bayesian learningSteven Scott
 
02.bayesian learning
02.bayesian learning02.bayesian learning
02.bayesian learningSteven Scott
 
Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?
Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?
Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?jemille6
 
original
originaloriginal
originalbutest
 
Yes III: Computational methods for model choice
Yes III: Computational methods for model choiceYes III: Computational methods for model choice
Yes III: Computational methods for model choiceChristian Robert
 
Severe Testing: The Key to Error Correction
Severe Testing: The Key to Error CorrectionSevere Testing: The Key to Error Correction
Severe Testing: The Key to Error Correctionjemille6
 
Unit-2 Bayes Decision Theory.pptx
Unit-2 Bayes Decision Theory.pptxUnit-2 Bayes Decision Theory.pptx
Unit-2 Bayes Decision Theory.pptxavinashBajpayee1
 

Similar to ISBA 2016: Foundations (20)

Course on Bayesian computational methods
Course on Bayesian computational methodsCourse on Bayesian computational methods
Course on Bayesian computational methods
 
San Antonio short course, March 2010
San Antonio short course, March 2010San Antonio short course, March 2010
San Antonio short course, March 2010
 
“I. Conjugate priors”
“I. Conjugate priors”“I. Conjugate priors”
“I. Conjugate priors”
 
01.conditional prob
01.conditional prob01.conditional prob
01.conditional prob
 
01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folder01.ConditionalProb.pdf in the Bayes_intro folder
01.ConditionalProb.pdf in the Bayes_intro folder
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classification
 
Bayesian decision making in clinical research
Bayesian decision making in clinical researchBayesian decision making in clinical research
Bayesian decision making in clinical research
 
JSM 2011 round table
JSM 2011 round tableJSM 2011 round table
JSM 2011 round table
 
JSM 2011 round table
JSM 2011 round tableJSM 2011 round table
JSM 2011 round table
 
02.bayesian learning
02.bayesian learning02.bayesian learning
02.bayesian learning
 
02.bayesian learning
02.bayesian learning02.bayesian learning
02.bayesian learning
 
bayesian learning
bayesian learningbayesian learning
bayesian learning
 
Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?
Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?
Excursion 4 Tour II: Rejection Fallacies: Whose Exaggerating What?
 
original
originaloriginal
original
 
Bayesian Statistics.pdf
Bayesian Statistics.pdfBayesian Statistics.pdf
Bayesian Statistics.pdf
 
Bayesian data analysis1
Bayesian data analysis1Bayesian data analysis1
Bayesian data analysis1
 
Bayesian statistics
Bayesian statisticsBayesian statistics
Bayesian statistics
 
Yes III: Computational methods for model choice
Yes III: Computational methods for model choiceYes III: Computational methods for model choice
Yes III: Computational methods for model choice
 
Severe Testing: The Key to Error Correction
Severe Testing: The Key to Error CorrectionSevere Testing: The Key to Error Correction
Severe Testing: The Key to Error Correction
 
Unit-2 Bayes Decision Theory.pptx
Unit-2 Bayes Decision Theory.pptxUnit-2 Bayes Decision Theory.pptx
Unit-2 Bayes Decision Theory.pptx
 

More from Christian Robert

Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceChristian Robert
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinChristian Robert
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?Christian Robert
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Christian Robert
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Christian Robert
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihoodChristian Robert
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)Christian Robert
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerChristian Robert
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsChristian Robert
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distancesChristian Robert
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferenceChristian Robert
 

More from Christian Robert (20)

Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
 
ABC based on Wasserstein distances
ABC based on Wasserstein distancesABC based on Wasserstein distances
ABC based on Wasserstein distances
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conference
 

Recently uploaded

Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsHajira Mahmood
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxVarshiniMK
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxdharshini369nike
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Heredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsHeredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsCharlene Llagas
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxFarihaAbdulRasheed
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
ISBA 2016: Foundations

  • 1. Bayesian tests of hypotheses Christian P. Robert Universit´e Paris-Dauphine, Paris & University of Warwick, Coventry Joint work with K. Kamary, K. Mengersen & J. Rousseau
  • 2. Outline Bayesian testing of hypotheses Noninformative solutions Testing via mixtures Paradigm shift
  • 3. Testing issues Hypothesis testing central problem of statistical inference witness the recent ASA’s statement on p-values (Wasserstein, 2016) dramatically differentiating feature between classical and Bayesian paradigms wide open to controversy and divergent opinions, includ. within the Bayesian community non-informative Bayesian testing case mostly unresolved, witness the Jeffreys–Lindley paradox [Berger (2003), Mayo & Cox (2006), Gelman (2008)]
  • 4. ”proper use and interpretation of the p-value” ”Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” ”By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.” [Wasserstein, 2016]
  • 6. Bayesian testing of hypotheses Bayesian model selection as comparison of k potential statistical models towards the selection of model that fits the data “best” mostly accepted perspective: it does not primarily seek to identify which model is “true”, but compares fits tools like Bayes factor naturally include a penalisation addressing model complexity, mimicked by Bayes Information (BIC) and Deviance Information (DIC) criteria posterior predictive tools successfully advocated in Gelman et al. (2013) even though they involve double use of data
  • 8. Bayesian tests 101 Associated with the risk R(θ, δ) = Eθ[L(θ, δ(x))] = Pθ(δ(x) = 0) if θ ∈ Θ0, Pθ(δ(x) = 1) otherwise, Bayes test The Bayes estimator associated with π and with the 0 − 1 loss is δπ (x) = 1 if P(θ ∈ Θ0|x) > P(θ ∈ Θ0|x), 0 otherwise,
  • 10. Bayesian tests 102 Weights errors differently under both hypotheses: Theorem (Optimal Bayes decision) Under the 0 − 1 loss function L(θ, d) =    0 if d = IΘ0 (θ) a0 if d = 1 and θ ∈ Θ0 a1 if d = 0 and θ ∈ Θ0 the Bayes procedure is δπ (x) = 1 if P(θ ∈ Θ0|x) a0/(a0 + a1) 0 otherwise
  • 12. A function of posterior probabilities Definition (Bayes factors) For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, B01 = [π(Θ0|x)/π(Θ0^c|x)] / [π(Θ0)/π(Θ0^c)] = ∫_{Θ0} f(x|θ)π0(θ)dθ / ∫_{Θ0^c} f(x|θ)π1(θ)dθ [Jeffreys, ToP, 1939, V, §5.01] Bayes rule under 0 − 1 loss: acceptance if B01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
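As a concrete illustration of the definition above, the two restricted marginals can be computed by numerical integration. The sketch below is a toy example only: the N(θ, 1) likelihood, the N(0, 1) base prior and the split Θ0 = (−∞, 0] vs Θ0^c = (0, ∞) are illustrative choices, not taken from the slides.

# Toy numerical illustration (not from the slides): Bayes factor for
# H0: theta <= 0 vs H1: theta > 0, with x | theta ~ N(theta, 1) and a
# N(0, 1) prior truncated to each hypothesis.
import numpy as np
from scipy import integrate, stats

def bayes_factor_01(x):
    like = lambda th: stats.norm.pdf(x, loc=th, scale=1.0)
    prior = lambda th: stats.norm.pdf(th)
    # normalising constants of the truncated priors pi_0 and pi_1
    c0 = stats.norm.cdf(0.0)        # prior mass of N(0,1) on Theta_0 = (-inf, 0]
    c1 = 1.0 - c0                   # prior mass on Theta_0^c = (0, +inf)
    m0, _ = integrate.quad(lambda th: like(th) * prior(th) / c0, -np.inf, 0.0)
    m1, _ = integrate.quad(lambda th: like(th) * prior(th) / c1, 0.0, np.inf)
    return m0 / m1                  # B_01 = marginal under H0 / marginal under H1

print(bayes_factor_01(x=1.5))       # small value: the data lean towards H1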
  • 13. self-contained concept Outside decision-theoretic environment: eliminates choice of π(Θ0) but depends on the choice of (π0, π1) Bayesian/marginal equivalent to the likelihood ratio Jeffreys’ scale of evidence: if log10(B^π_10) is between 0 and 0.5, evidence against H0 is weak, if log10(B^π_10) is between 0.5 and 1, evidence is substantial, if log10(B^π_10) is between 1 and 2, evidence is strong, and if log10(B^π_10) is above 2, evidence is decisive [...fairly arbitrary!]
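A small helper translating Jeffreys' (fairly arbitrary) scale into words; a minimal sketch with the thresholds listed on the slide above, the verbal labels being paraphrases rather than Jeffreys' exact wording.

# Minimal mapping of log10(B_10) to Jeffreys' verbal scale, as listed above.
def jeffreys_scale(log10_B10):
    if log10_B10 < 0:
        return "points towards H0 (not covered by the scale above)"
    if log10_B10 <= 0.5:
        return "weak evidence against H0"
    if log10_B10 <= 1:
        return "substantial evidence against H0"
    if log10_B10 <= 2:
        return "strong evidence against H0"
    return "decisive evidence against H0"

print(jeffreys_scale(1.3))   # 'strong evidence against H0'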
  • 17. consistency Example of a normal X̄n ∼ N(µ, 1/n) when µ ∼ N(0, 1), leading to B01 = (1 + n)^{1/2} exp{−n² x̄n²/(2(1 + n))}
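A quick numerical check of this consistency (my own illustrative simulation settings, not part of the slides): with x̄n simulated under each regime, B01 diverges when µ = 0 and collapses to zero otherwise.

# Consistency check for B_01 = (1+n)^{1/2} exp{-n^2 xbar^2 / (2(1+n))}
# in the normal example above; sample sizes and the alternative mu = 0.5
# are illustrative choices.
import numpy as np

def B01(xbar, n):
    return np.sqrt(1.0 + n) * np.exp(-n**2 * xbar**2 / (2.0 * (1.0 + n)))

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10000):
    xbar_H0 = rng.normal(0.0, 1.0 / np.sqrt(n))   # xbar_n under mu = 0
    xbar_H1 = rng.normal(0.5, 1.0 / np.sqrt(n))   # xbar_n under mu = 0.5
    print(n, B01(xbar_H0, n), B01(xbar_H1, n))    # first grows, second vanishes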
  • 18. Some difficulties tension between using (i) posterior probabilities justified by binary loss function but depending on unnatural prior weights, and (ii) Bayes factors that eliminate dependence but escape direct connection with posterior, unless prior weights are integrated within loss delicate interpretation (or calibration) of strength of the Bayes factor towards supporting a given hypothesis or model, because not a Bayesian decision rule similar difficulty with posterior probabilities, with tendency to interpret them as p-values: only report of respective strengths of fitting data to both models
  • 19. Some difficulties tension between using (i) posterior probabilities justified by binary loss function but depending on unnatural prior weights, and (ii) Bayes factors that eliminate dependence but escape direct connection with posterior, unless prior weights are integrated within loss referring to a fixed and arbitrary cutoff value falls into the same difficulties as regular p-values no “third way” like opting out from a decision
  • 20. Some further difficulties long-lasting impact of prior modeling, i.e., choice of prior distributions on parameters of both models, despite overall consistency proof for Bayes factor discontinuity in valid use of improper priors since they are not justified in most testing situations, leading to many alternative and ad hoc solutions, where data is either used twice or split in artificial ways [or further tortured into confession] binary (accept vs. reject) outcome more suited for immediate decision (if any) than for model evaluation, in connection with rudimentary loss function [atavistic remnant of Neyman-Pearson formalism]
  • 21. Some additional difficulties related impossibility to ascertain simultaneous misfit or to detect outliers no assessment of uncertainty associated with decision itself besides posterior probability difficult computation of marginal likelihoods in most settings with further controversies about which algorithm to adopt strong dependence of posterior probabilities on conditioning statistics (ABC), which undermines their validity for model assessment temptation to create pseudo-frequentist equivalents such as q-values with even less Bayesian justifications c time for a paradigm shift back to some solutions
  • 22. Historical appearance of Bayesian tests Is the new parameter supported by the observations or is any variation expressible by it better interpreted as random? Thus we must set two hypotheses for comparison, the more complicated having the smaller initial probability ...compare a specially suggested value of a new parameter, often 0 [q], with the aggregate of other possible values [q′]. We shall call q the null hypothesis and q′ the alternative hypothesis [and] we must take P(q|H) = P(q′|H) = 1/2. (Jeffreys, ToP, 1939, V, §5.0)
  • 23. A major refurbishment Suppose we are considering whether a location parameter α is 0. The estimation prior probability for it is uniform and we should have to take f (α) = 0 and K[= B10] would always be infinite (Jeffreys, ToP, V, §5.02) When the null hypothesis is supported by a set of measure 0 against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous prior distribution [End of the story?!]
  • 24. A major refurbishment When the null hypothesis is supported by a set of measure 0 against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous prior distribution [End of the story?!] Requirement Defined prior distributions under both assumptions, π0(θ) ∝ π(θ)IΘ0 (θ), π1(θ) ∝ π(θ)IΘ1 (θ), (under the standard dominating measures on Θ0 and Θ1)
  • 25. A major refurbishment When the null hypothesis is supported by a set of measure 0 against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous prior distribution [End of the story?!] Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = ρ1, π(θ) = ρ0π0(θ) + ρ1π1(θ).
  • 26. Point null hypotheses “Is it of the slightest use to reject a hypothesis until we have some idea of what to put in its place?” H. Jeffreys, ToP, p. 390 Particular case H0 : θ = θ0 Take ρ0 = Pr^π(θ = θ0) and g1 prior density under H0^c. Posterior probability of H0: π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / {f(x|θ0)ρ0 + (1 − ρ0)m1(x)}, with marginal under H0^c m1(x) = ∫_{Θ1} f(x|θ)g1(θ) dθ, and B^π_01(x) = [f(x|θ0)ρ0 / {m1(x)(1 − ρ0)}] / [ρ0/(1 − ρ0)] = f(x|θ0)/m1(x)
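For the normal case the quantities above are available in closed form; the following sketch computes π(Θ0|x) and B01 for a point null, with a N(θ0, τ²) prior under the alternative. The numerical values of θ0, σ, τ, ρ0 and the data are illustrative assumptions.

# Point-null posterior probability for xbar | theta ~ N(theta, sigma^2/n),
# H0: theta = theta0, with g1 = N(theta0, tau^2) under H1. All settings
# below are illustrative, not taken from the slides.
import numpy as np
from scipy import stats

def point_null(xbar, n, theta0=0.0, sigma=1.0, tau=1.0, rho0=0.5):
    f0 = stats.norm.pdf(xbar, loc=theta0, scale=sigma / np.sqrt(n))     # f(x|theta0)
    m1 = stats.norm.pdf(xbar, loc=theta0,
                        scale=np.sqrt(sigma**2 / n + tau**2))           # marginal m1(x) under H1
    B01 = f0 / m1                                                       # Bayes factor
    post_H0 = rho0 * f0 / (rho0 * f0 + (1.0 - rho0) * m1)               # pi(Theta_0 | x)
    return B01, post_H0

print(point_null(xbar=0.3, n=50))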
  • 28. Noninformative proposals Bayesian testing of hypotheses Noninformative solutions Testing via mixtures Paradigm shift
  • 29. what’s special about the Bayes factor?! “The priors do not represent substantive knowledge of the parameters within the model. Using Bayes’ theorem, these priors can then be updated to posteriors conditioned on the data that were actually observed. In general, the fact that different priors result in different Bayes factors should not come as a surprise. The Bayes factor (...) balances the tension between parsimony and goodness of fit, (...) against overfitting the data. In induction there is no harm in being occasionally wrong; it is inevitable that we shall be.” [Jeffreys, 1939; Ly et al., 2015]
  • 30. what’s wrong with the Bayes factor?! (1/2, 1/2) partition between hypotheses has very little to suggest in terms of extensions central difficulty stands with issue of picking a prior probability of a model unfortunate impossibility of using improper priors in most settings Bayes factors lack direct scaling associated with posterior probability and loss function twofold dependence on subjective prior measure, first in prior weights of models and second in lasting impact of prior modelling on the parameters Bayes factor offers no window into uncertainty associated with decision further reasons in the summary [Robert, 2016]
  • 31. Lindley’s paradox In a normal mean testing problem, x̄n ∼ N(θ, σ²/n), H0 : θ = θ0, under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor B01(tn) = (1 + n)^{1/2} exp{−n tn²/(2(1 + n))}, where tn = √n |x̄n − θ0|/σ, satisfies B01(tn) −→ ∞ as n −→ ∞ [assuming a fixed tn] [Lindley, 1957]
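The paradox is easy to reproduce numerically: holding tn fixed at a value a frequentist test would flag as significant (tn = 1.96 here, an illustrative choice), B01(tn) still grows without bound in n. A minimal sketch:

# Jeffreys-Lindley paradox: B_01(t_n) = (1+n)^{1/2} exp{-n t_n^2 / (2(1+n))}
# diverges as n grows for a fixed "significant" t_n (1.96, illustrative).
import numpy as np

def B01(t, n):
    return np.sqrt(1.0 + n) * np.exp(-n * t**2 / (2.0 * (1.0 + n)))

for n in (10, 100, 10_000, 1_000_000):
    print(n, B01(1.96, n))   # increases with n although the p-value stays near 0.05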
  • 32. A strong impropriety Improper priors not allowed in Bayes factors: If ∫_{Θ1} π1(dθ1) = ∞ or ∫_{Θ2} π2(dθ2) = ∞ then π1 or π2 cannot be coherently normalised, while the normalisation matters in the Bayes factor B12 Lack of mathematical justification for “common nuisance parameter” [and prior of] [Berger, Pericchi, and Varshavsky, 1998; Marin and Robert, 2013]
  • 34. On some resolutions of the paradox use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lacks complete proper Bayesian justification [Berger & Pericchi, 2001] use of identical improper priors on nuisance parameters, calibration via the posterior predictive distribution, matching priors, use of score functions extending the log score function non-local priors correcting default priors
  • 35. On some resolutions of the paradox use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013] calibration via the posterior predictive distribution, matching priors, use of score functions extending the log score function non-local priors correcting default priors
  • 36. On some resolutions of the paradox use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, Péché de jeunesse [a youthful sin]: equating the values of the prior densities at the point-null value θ0, ρ0 = (1 − ρ0)π1(θ0) [Robert, 1993] calibration via the posterior predictive distribution, matching priors, use of score functions extending the log score function non-local priors correcting default priors
  • 37. On some resolutions of the paradox use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, calibration via the posterior predictive distribution, which uses the data twice matching priors, use of score functions extending the log score function non-local priors correcting default priors
  • 38. On some resolutions of the paradox use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, calibration via the posterior predictive distribution, matching priors, whose sole purpose is to bring frequentist and Bayesian coverages as close as possible [Datta & Mukerjee, 2004] use of score functions extending the log score function non-local priors correcting default priors
  • 39. On some resolutions of the paradox use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, calibration via the posterior predictive distribution, matching priors, use of score functions extending the log score function log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2) , that are independent of the normalising constant [Dawid et al., 2013; Dawid & Musio, 2015] non-local priors correcting default priors
  • 40. On some resolutions of the paradox use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, calibration via the posterior predictive distribution, matching priors, use of score functions extending the log score function non-local priors correcting default priors towards more balanced error rates [Johnson & Rossell, 2010; Consonni et al., 2013]
  • 41. Pseudo-Bayes factors Idea Use one part x[i] of the data x to make the prior proper: πi improper but πi(·|x[i]) proper and ∫ fi(x[n/i]|θi) πi(θi|x[i]) dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i]) dθj independent of normalizing constant Use remaining x[n/i] to run test as if πj(θj|x[i]) is the true prior
  • 44. Motivation Provides a working principle for improper priors Gather enough information from data to achieve properness and use this properness to run the test on remaining data does not use x twice, unlike Aitkin (1991)
  • 47. Issues depends on the choice of x[i] many ways of combining pseudo-Bayes factors AIBF = B^N_ji × (1/L) Σℓ Bij(x[ℓ]) MIBF = B^N_ji × medℓ[Bij(x[ℓ])] GIBF = B^N_ji × exp{(1/L) Σℓ log Bij(x[ℓ])} not often an exact Bayes factor and thus lacking inner coherence B12 ≠ B10B02 and B01 ≠ 1/B10. [Berger & Pericchi, 1996]
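Assuming the whole-sample factor B^N_ji and the training-sample factors Bij(x[ℓ]) have been computed elsewhere, the three combinations above are one-liners; the sketch below uses placeholder numerical values purely for illustration.

# Combining pseudo-Bayes factors as above; BN_ji and B_ij_train are assumed
# to come from a separate routine (placeholder values used here).
import numpy as np

def intrinsic_bayes_factors(BN_ji, B_ij_train):
    B = np.asarray(B_ij_train, dtype=float)      # B_ij(x[l]) over L training samples
    aibf = BN_ji * B.mean()                      # arithmetic intrinsic Bayes factor
    mibf = BN_ji * np.median(B)                  # median intrinsic Bayes factor
    gibf = BN_ji * np.exp(np.mean(np.log(B)))    # geometric intrinsic Bayes factor
    return aibf, mibf, gibf

print(intrinsic_bayes_factors(BN_ji=2.4, B_ij_train=[0.3, 0.5, 0.4, 0.6]))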
  • 49. Fractional Bayes factor Idea use directly the likelihood to separate training sample from testing sample B^F_12 = B12(x) × ∫ L2^b(θ2)π2(θ2)dθ2 / ∫ L1^b(θ1)π1(θ1)dθ1 [O’Hagan, 1995] Proportion b of the sample used to gain properness
  • 51. Fractional Bayes factor (cont’d) Example (Normal mean) B^F_12 = (1/√b) e^{n(b−1)x̄n²/2} corresponds to exact Bayes factor for the prior N(0, (1 − b)/(nb)) If b constant, prior variance goes to 0 If b = 1/n, prior variance stabilises around 1 If b = n^{−α}, α < 1, prior variance goes to 0 too. c Call to external principles to pick the order of b
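A numerical look at the normal-mean fractional Bayes factor above, across the three regimes of b discussed on the slide; the sample mean x̄n = 0.2 is an arbitrary illustration.

# Fractional Bayes factor B^F_12 = b^{-1/2} exp{n (b-1) xbar^2 / 2} in the
# normal-mean example, for the three choices of b discussed above.
import numpy as np

def fbf(xbar, n, b):
    return np.exp(n * (b - 1.0) * xbar**2 / 2.0) / np.sqrt(b)

xbar = 0.2                                   # illustrative sample mean
for n in (10, 100, 1000):
    for b in (0.1, 1.0 / n, n**-0.5):        # constant b, b = 1/n, b = n^{-alpha}
        print(n, round(b, 4), fbf(xbar, n, b))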
  • 52. Bayesian predictive “If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution. This is really a self-consistency check: an observed discrepancy can be due to model misfit or chance.” (BDA, p.143) Use of posterior predictive p(y^rep|y) = ∫ p(y^rep|θ)π(θ|y) dθ and measure of discrepancy T(·, ·) Replacing p-value p(y|θ) = P(T(y^rep, θ) ≥ T(y, θ)|θ) with Bayesian posterior p-value P(T(y^rep, θ) ≥ T(y, θ)|y) = ∫ p(y|θ)π(θ|y) dθ
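The posterior predictive p-value is usually approximated by simulation: draw θ from the posterior, draw y^rep from the likelihood at θ, and compare discrepancies. The sketch below uses a conjugate normal toy model and the discrepancy T(y, θ) = max|yi − θ| purely for illustration; neither choice comes from the slides.

# Monte Carlo approximation of the posterior predictive p-value
# P(T(y_rep, theta) >= T(y, theta) | y) for a toy model: y_i ~ N(theta, 1)
# with a flat prior on theta, so theta | y ~ N(ybar, 1/n).
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=30)                 # "observed" data (simulated here)
n = y.size

def T(data, theta):
    return np.max(np.abs(data - theta))           # illustrative discrepancy measure

count, n_sims = 0, 5000
for _ in range(n_sims):
    theta = rng.normal(y.mean(), 1.0 / np.sqrt(n))     # posterior draw (flat prior)
    y_rep = rng.normal(theta, 1.0, size=n)             # replicated data
    count += T(y_rep, theta) >= T(y, theta)
print(count / n_sims)                                  # posterior predictive p-value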
  • 54. Issues “the posterior predictive p-value is such a [Bayesian] probability statement, conditional on the model and data, about what might be expected in future replications. (BDA, p.151) sounds too much like a p-value...! relies on choice of T(·, ·) seems to favour overfitting (again) using the data twice (once for the posterior and twice in the p-value) needs to be calibrated (back to 0.05?) general difficulty in interpreting where is the penalty for model complexity?
  • 55. Changing the testing perspective Bayesian testing of hypotheses Noninformative solutions Testing via mixtures Paradigm shift
  • 56. Paradigm shift New proposal for a paradigm shift (!) in the Bayesian processing of hypothesis testing and of model selection convergent and naturally interpretable solution more extended use of improper priors Simple representation of the testing problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1
  • 58. Paradigm shift Simple representation of the testing problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1 Approach inspired from consistency result of Rousseau and Mengersen (2011) on estimated overfitting mixtures Mixture representation not directly equivalent to the use of a posterior probability Potential of a better approach to testing, while not expanding number of parameters Calibration of posterior distribution of the weight of a model, moving from artificial notion of posterior probability of a model
  • 59. Encompassing mixture model Idea: Given two statistical models, M1 : x ∼ f1(x|θ1), θ1 ∈ Θ1 and M2 : x ∼ f2(x|θ2), θ2 ∈ Θ2, embed both within an encompassing mixture Mα : x ∼ αf1(x|θ1) + (1 − α)f2(x|θ2), 0 ≤ α ≤ 1 (1) Note: Both models correspond to special cases of (1), one for α = 1 and one for α = 0 Draw inference on mixture representation (1), as if each observation was individually and independently produced by the mixture model
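When θ1 and θ2 are known (or held fixed), inference on α in (1) reduces to a standard two-component Gibbs sampler with a Beta(a0, a0) prior on α. The sketch below, with illustrative component densities N(0, 1) vs N(1, 1) and a0 = 0.5, is only meant to convey the mechanics, not the full procedure with unknown parameters.

# Gibbs sampler for the weight alpha in the encompassing mixture (1),
# with fixed components f1 = N(0,1) and f2 = N(1,1) (illustrative) and an
# alpha ~ Beta(a0, a0) prior: sample latent allocations z_i, then
# alpha | z ~ Beta(a0 + n1, a0 + n2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=100)      # data actually generated from model M1
a0 = 0.5
f1 = stats.norm.pdf(x, 0.0, 1.0)
f2 = stats.norm.pdf(x, 1.0, 1.0)

alpha, draws = 0.5, []
for _ in range(5000):
    p1 = alpha * f1 / (alpha * f1 + (1.0 - alpha) * f2)   # P(z_i = 1 | alpha, x_i)
    z = rng.random(x.size) < p1
    alpha = rng.beta(a0 + z.sum(), a0 + (~z).sum())
    draws.append(alpha)

print(np.mean(draws[1000:]))            # posterior mean of alpha, expected near 1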
  • 62. Inferential motivations Sounds like approximation to the real model, but several definitive advantages to this paradigm shift: Bayes estimate of the weight α replaces posterior probability of model M1, equally convergent indicator of which model is “true”, while avoiding artificial prior probabilities on model indices, ω1 and ω2 interpretation of estimator of α at least as natural as handling the posterior probability, while avoiding zero-one loss setting α and its posterior distribution provide measure of proximity to the models, while being interpretable as data propensity to stand within one model further allows for alternative perspectives on testing and model choice, like predictive tools, cross-validation, and information indices like WAIC
  • 63. Computational motivations avoids highly problematic computations of the marginal likelihoods, since standard algorithms are available for Bayesian mixture estimation straightforward extension to a finite collection of models, with a larger number of components, which considers all models at once and eliminates least likely models by simulation eliminates difficulty of label switching that plagues both Bayesian estimation and Bayesian computation, since components are no longer exchangeable posterior distribution of α evaluates more thoroughly strength of support for a given model than the single figure outcome of a posterior probability variability of posterior distribution on α allows for a more thorough assessment of the strength of this support
  • 64. Noninformative motivations additional feature missing from traditional Bayesian answers: a mixture model acknowledges possibility that, for a finite dataset, both models or none could be acceptable standard (proper and informative) prior modeling can be reproduced in this setting, but non-informative (improper) priors also are manageable therein, provided both models first reparameterised towards shared parameters, e.g. location and scale parameters in special case when all parameters are common Mα : x ∼ αf1(x|θ) + (1 − α)f2(x|θ), 0 ≤ α ≤ 1 if θ is a location parameter, a flat prior π(θ) ∝ 1 is available
  • 65. Weakly informative motivations using the same parameters or some identical parameters on both components highlights that opposition between the two components is not an issue of enjoying different parameters those common parameters are nuisance parameters, to be integrated out [unlike Lindley’s paradox] prior model weights ωi rarely discussed in classical Bayesian approach, even though linear impact on posterior probabilities. Here, prior modeling only involves selecting a prior on α, e.g., α ∼ B(a0, a0) while a0 impacts posterior on α, it always leads to mass accumulation near 1 or 0, i.e. favours most likely model sensitivity analysis straightforward to carry out approach easily calibrated by parametric bootstrap providing reference posterior of α under each model natural Metropolis–Hastings alternative
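The parametric-bootstrap calibration mentioned above can be sketched as: simulate datasets under each fitted model, rerun the α-sampler on each, and use the two resulting distributions of E[α|x] as references. Everything below (component densities, a0, replicate counts) is an illustrative assumption, reusing the fixed-parameter Gibbs step from the earlier sketch.

# Parametric bootstrap calibration of the posterior mean of alpha:
# simulate datasets under each model (here N(0,1) vs N(1,1), illustrative),
# estimate alpha on each replicate, and compare the two reference ranges.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a0, n = 0.5, 100

def post_mean_alpha(x, iters=2000):
    f1, f2 = stats.norm.pdf(x, 0.0, 1.0), stats.norm.pdf(x, 1.0, 1.0)
    alpha, out = 0.5, []
    for _ in range(iters):
        p1 = alpha * f1 / (alpha * f1 + (1.0 - alpha) * f2)
        z = rng.random(x.size) < p1
        alpha = rng.beta(a0 + z.sum(), a0 + (~z).sum())
        out.append(alpha)
    return np.mean(out[iters // 4:])          # discard burn-in

ref_M1 = [post_mean_alpha(rng.normal(0.0, 1.0, n)) for _ in range(20)]
ref_M2 = [post_mean_alpha(rng.normal(1.0, 1.0, n)) for _ in range(20)]
print(np.quantile(ref_M1, [0.05, 0.95]), np.quantile(ref_M2, [0.05, 0.95]))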
  • 66. Comparison with posterior probability [Figure: four panels, a0 = .1, .3, .4, .5, with sample size 0–500 on the horizontal axis and a vertical scale from −50 to 0] Plots of ranges of log(n) log(1 − E[α|x]) (gray color) and log(1 − p(M1|x)) (red dotted) over 100 N(0, 1) samples as sample size n grows from 1 to 500, where α is the weight of N(0, 1) in the mixture model. The shaded areas indicate the range of the estimations and each plot is based on a Beta prior with a0 = .1, .2, .3, .4, .5, 1 and each posterior approximation is based on 10^4 iterations.
  • 67. Towards which decision? And if we have to make a decision? soft consider behaviour of posterior under prior predictives or posterior predictive [e.g., prior predictive does not exist] bootstrapping behaviour comparison with Bayesian non-parametric solution hard rethink the loss function
  • 68. Conclusion many applications of the Bayesian paradigm concentrate on the comparison of scientific theories and on testing of null hypotheses natural tendency to default to Bayes factors poorly understood sensitivity to prior modeling and posterior calibration c Time is ripe for a paradigm shift
  • 69. Down with Bayes factors! c Time is ripe for a paradigm shift original testing problem replaced with a better controlled estimation target allow for posterior variability over the component frequency as opposed to deterministic Bayes factors range of acceptance, rejection and indecision conclusions easily calibrated by simulation posterior medians quickly settling near the boundary values of 0 and 1 potential derivation of a Bayesian b-value by looking at the posterior area under the tail of the distribution of the weight
  • 70. Prior modelling c Time is ripe for a paradigm shift Partly common parameterisation always feasible and hence allows for reference priors removal of the absolute prohibition of improper priors in hypothesis testing prior on the weight α shows sensitivity that naturally vanishes as the sample size increases default value of a0 = 0.5 in the Beta prior
  • 71. Computing aspects c Time is ripe for a paradigm shift proposal that does not induce additional computational strain when algorithmic solutions exist for both models, they can be recycled towards estimating the encompassing mixture easier than in standard mixture problems due to common parameters that allow for original MCMC samplers to be turned into proposals Gibbs sampling completions useful for assessing potential outliers but not essential to achieve a conclusion about the overall problem