This lecture introduces Bayesian hypothesis testing. It discusses an example comparing HIV infection rates between a treatment and placebo group. A Bayesian analysis is presented that calculates posterior probabilities for the null and alternative hypotheses using prior probabilities and Bayes factors. The lecture outlines general notation for Bayesian testing and discusses issues like choosing prior distributions and testing precise versus imprecise hypotheses. It also discusses interpreting Bayes factors and relates posterior probabilities to p-values in some cases.
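The mechanics described here (posterior probabilities from prior probabilities and Bayes factors) can be sketched in a few lines. This is a minimal illustration, not the lecture's code; it assumes the Bayes factor BF01 = m0(x)/m1(x) has already been computed:

```python
def posterior_prob_null(prior_null, bayes_factor_01):
    """Posterior probability of H0 from the prior probability of H0
    and the Bayes factor BF01 = m0(x) / m1(x)."""
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * bayes_factor_01
    return posterior_odds / (1.0 + posterior_odds)

# Equal prior weight on both hypotheses and a Bayes factor of 3 in favour of H0
print(round(posterior_prob_null(0.5, 3.0), 3))  # 0.75
```

With equal prior probabilities, the posterior odds equal the Bayes factor, which is why the two quantities are so often conflated in practice.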
The document describes a course on model uncertainty taught in the fall of 2018. It covers topics like statistical and mathematical model uncertainty, Bayesian hypothesis testing and model uncertainty, priors for Bayesian model uncertainty, approximations and computation, model inputs and outputs, model calibration, Gaussian processes, surrogate models, sensitivity analysis, and model discrepancy. The course is taught over 12 weeks by two lecturers and includes weekly topics like introduction to uncertainty, Bayesian analysis, representation of inputs/outputs, calibration, Gaussian processes, surrogate models, sampling techniques, and sensitivity analysis.
This document discusses Bayesian approaches to combining evidence from multiple data sources or models. It recommends combining data probabilistically using Bayes' theorem rather than averaging. It provides an example of combining three data sources on annual rainfall measurements by treating the data sources as independent measurements and deriving the posterior distribution of rainfall amounts given the data. It also discusses challenges that arise when combining dependent data sources or models, and presents examples of hierarchical modeling approaches.
On the vexing dilemma of hypothesis testing and the predicted demise of the B... (Christian Robert)
The document discusses hypothesis testing from both frequentist and Bayesian perspectives. It introduces the concept of statistical tests as functions that output accept or reject decisions for hypotheses. P-values are presented as a way to quantify uncertainty in these decisions. Bayes' original 1763 paper on Bayesian statistics is summarized, introducing the concept of the posterior distribution. Bayesian hypothesis testing is then discussed, including the optimal Bayes test and the use of Bayes factors to compare hypotheses without requiring prior probabilities on the hypotheses.
This document summarizes a presentation on testing hypotheses as mixture estimation and the challenges of Bayesian testing. The key points are:
1) Bayesian hypothesis testing faces challenges including the dependence on prior distributions, difficulties interpreting Bayes factors, and the inability to use improper priors in most situations.
2) Testing via mixtures is proposed as a paradigm shift that frames hypothesis testing as a model selection problem involving mixture models rather than distinct hypotheses.
3) Traditional Bayesian testing using Bayes factors and posterior probabilities depends strongly on prior distributions and choices that are difficult to justify, while not providing measures of uncertainty around decisions. Alternative approaches are needed to address these issues.
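The mixture-estimation idea in point 2 can be sketched with a toy grid posterior on the mixture weight. The two fixed Normal components and the data below are invented for illustration and are simpler than the paper's setup, which keeps free parameters inside each component:

```python
import math

def mixture_weight_posterior(data, f0, f1, grid_size=201):
    """Grid posterior on the weight alpha in alpha*f1 + (1-alpha)*f0,
    with a uniform prior on alpha. Testing is recast as estimating
    alpha: posterior mass near 0 favours f0, mass near 1 favours f1."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    loglik = [sum(math.log(a * f1(x) + (1 - a) * f0(x)) for x in data)
              for a in grid]
    m = max(loglik)
    w = [math.exp(lp - m) for lp in loglik]
    z = sum(w)
    return grid, [wi / z for wi in w]

def phi(x, mu):
    """Standard-variance Normal density centred at mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Toy data centred near 0, compared under N(0,1) vs N(2,1) components
data = [0.1, -0.3, 0.2, 0.0, -0.1, 0.15, -0.2, 0.05]
grid, post = mixture_weight_posterior(data,
                                      lambda x: phi(x, 0.0),
                                      lambda x: phi(x, 2.0))
post_mean = sum(a * p for a, p in zip(grid, post))
print(round(post_mean, 3))  # small: the data favour the mu = 0 component
```

Unlike a Bayes factor, the posterior on alpha comes with a full distribution, so the uncertainty around the "decision" is directly visible.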
- Approximate Bayesian computation (ABC) is a technique used when the likelihood function is intractable or unavailable. It approximates the Bayesian posterior distribution in a likelihood-free manner.
- ABC works by simulating parameter values from the prior and simulating pseudo-data. Parameter values are accepted if the simulated pseudo-data are "close" to the observed data according to some distance measure and tolerance level.
- ABC originated in population genetics models where genealogies are considered nuisance parameters that cannot be integrated out of the likelihood. It has since been applied to other fields like econometrics for models with complex or undefined likelihoods.
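The rejection scheme described in these bullets fits in a few lines. The Normal model, prior, summary statistic, and tolerance below are arbitrary choices for the sketch, not taken from the document:

```python
import random
import statistics

def abc_rejection(observed, n_draws=20000, eps=0.1, seed=1):
    """Minimal ABC rejection sampler for the mean of a Normal(theta, 1)
    model under a Normal(0, 10) prior; the sample mean is the summary
    statistic and eps the tolerance."""
    rng = random.Random(seed)
    s_obs = statistics.fmean(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 10.0)                        # draw from the prior
        pseudo = [rng.gauss(theta, 1.0) for _ in observed]  # simulate pseudo-data
        if abs(statistics.fmean(pseudo) - s_obs) < eps:     # keep if summaries are close
            accepted.append(theta)
    return accepted

obs = [2.1, 1.8, 2.4, 2.0, 1.9]
post = abc_rejection(obs)
print(len(post), round(statistics.fmean(post), 2))
```

The accepted `theta` values approximate draws from the posterior given the summary statistic; shrinking `eps` tightens the approximation at the cost of fewer acceptances.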
Discussion of Persi Diaconis' lecture at ISBA 2016 (Christian Robert)
This document discusses Monte Carlo methods for numerical integration and estimating normalizing constants. It summarizes several approaches: estimating normalizing constants using samples; reverse logistic regression for estimating constants in mixtures; Xiao-Li Meng's maximum likelihood formulation of Monte Carlo integration; and Persi Diaconis' probabilistic numerics, which provide uncertainties for numerical calculations. The document advocates first approximating the distribution of an integrand before estimating its expectation, so as to incorporate non-parametric information and account for multiple estimators.
This document discusses challenges and recent advances in Approximate Bayesian Computation (ABC) methods. ABC methods are used when the likelihood function is intractable or unavailable in closed form. The core ABC algorithm involves simulating parameters from the prior and simulating data, retaining simulations where the simulated and observed data are close according to a distance measure on summary statistics. The document outlines key issues like scalability to large datasets, assessment of uncertainty, and model choice, and discusses advances such as modified proposals, nonparametric methods, and perspectives that include summary construction in the framework. Validation of ABC model choice and selection of summary statistics remains an open challenge.
The document discusses using random forests for approximate Bayesian computation (ABC) model choice. ABC can be framed as a machine learning problem where simulated datasets are used to learn which model is most appropriate. Random forests are well-suited for this as they can handle many correlated summary statistics without information loss. The random forest predicts the most likely model but not posterior probabilities. Instead, the posterior predictive expected error rate across models is proposed to evaluate model selection performance without unstable probability approximations. An example comparing MA(1) and MA(2) time series models illustrates the approach.
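The MA(1)-vs-MA(2) illustration can be sketched with scikit-learn. Everything below (series length, autocovariance summaries, uniform priors on the MA coefficients, forest size) is invented for the sketch rather than taken from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def simulate_ma(thetas, n=200):
    """Simulate an MA(q) series x_t = e_t + sum_k theta_k * e_{t-k}."""
    q = len(thetas)
    e = rng.standard_normal(n + q)
    x = e[q:].copy()
    for k, th in enumerate(thetas, start=1):
        x += th * e[q - k : q - k + n]
    return x

def summaries(x, max_lag=3):
    """Empirical autocovariances up to max_lag as summary statistics."""
    x = x - x.mean()
    return [float(x[: len(x) - k] @ x[k:]) / len(x) for k in range(max_lag + 1)]

def draw_dataset(n_sims):
    """Simulate labelled (summaries, model index) pairs from both models."""
    X, y = [], []
    for _ in range(n_sims):
        model = int(rng.integers(1, 3))            # model index: 1 = MA(1), 2 = MA(2)
        thetas = rng.uniform(-1, 1, size=model)    # uniform prior on MA coefficients
        X.append(summaries(simulate_ma(list(thetas))))
        y.append(model)
    return np.array(X), np.array(y)

X_train, y_train = draw_dataset(2000)
X_test, y_test = draw_dataset(400)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
acc = rf.score(X_test, y_test)            # 1 - acc is the predictive error rate
obs = summaries(simulate_ma([0.6, 0.5]))  # pseudo-observed MA(2) series
print(round(acc, 3), rf.predict([obs])[0])
```

The held-out error rate plays the role of the confidence measure: it is estimated from fresh simulations rather than from unstable approximations of posterior model probabilities.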
Approximate Bayesian model choice via random forests (Christian Robert)
The document describes approximate Bayesian computation (ABC) methods for model choice when likelihoods are intractable. ABC generates parameter-dataset pairs from the prior and retains those where the simulated and observed datasets are similar according to a distance measure on summary statistics. For model choice, ABC approximates posterior model probabilities by the proportion of simulations from each model that are retained. Machine learning techniques can also be used to infer the most likely model directly from the simulated summary statistics.
The document proposes using random forests (RF), a machine learning tool, for approximate Bayesian computation (ABC) model choice rather than estimating model posterior probabilities. RF improves on existing ABC model choice methods by having greater discriminative power among models, being robust to the choice and number of summary statistics, requiring less computation, and providing an error rate to evaluate confidence in the model choice. The authors illustrate the power of the RF-based ABC methodology on controlled experiments and real population genetics datasets.
random forests for ABC model choice and parameter estimation (Christian Robert)
The document discusses Approximate Bayesian Computation (ABC). It begins by introducing ABC as a likelihood-free method for Bayesian inference when the likelihood function is unavailable or computationally intractable. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data based on a distance measure.
The document then discusses advances in ABC, including modifying the proposal distribution to increase efficiency, viewing it as a conditional density estimation problem, and including measurement error in the framework. It also discusses the consistency of ABC as the number of simulations increases and sample size grows large. Finally, it discusses applications of ABC to model selection by treating the model index as an additional parameter.
This document discusses approximate Bayesian computation (ABC) techniques for performing Bayesian inference when the likelihood function is not available in closed form. It covers the basic ABC algorithm and discusses challenges with high-dimensional data. It also summarizes recent advances in ABC that incorporate nonparametric regression, reproducing kernel Hilbert spaces, and neural networks to help address these challenges.
This document discusses several perspectives and solutions to Bayesian hypothesis testing. It outlines issues with Bayesian testing such as the dependence on prior distributions and difficulties interpreting Bayesian measures like posterior probabilities and Bayes factors. It discusses how Bayesian testing compares models rather than identifying a single true model. Several solutions to challenges are discussed, like using Bayes factors which eliminate the dependence on prior model probabilities but introduce other issues. The document also discusses testing under specific models like comparing a point null hypothesis to alternatives. Overall it presents both Bayesian and frequentist views on hypothesis testing and some of the open controversies in the field.
This document provides lecture notes on hypothesis testing. It begins with an introduction to hypothesis testing and how it differs from estimation in its hypothetical reasoning approach. It then discusses Fisher's significance testing approach, including defining a test statistic, its sampling distribution under the null hypothesis, and calculating a p-value. It provides examples of applying this approach. Finally, it discusses some weaknesses of Fisher's approach identified by Neyman and Pearson and how their approach improved upon it by introducing the concept of alternative hypotheses and pre-data error probabilities.
1) Likelihood-free Bayesian experimental design is discussed as an intractable likelihood optimization problem, where the goal is to find the optimal design d that minimizes expected loss without using the full posterior distribution.
2) Several Bayesian tools are proposed to make the design problem more Bayesian, including Bayesian non-parametrics, annealing algorithms, and placing a posterior on the design d.
3) Gaussian processes are a default modeling choice for complex unknown functions in these problems, but their accuracy is difficult to assess and they can suffer from the curse of dimensionality.
This document provides an introduction to Bayesian statistics and machine learning. It discusses key concepts like conditional probability, Bayes' theorem, Bayesian inference, Bayesian model comparison, and Bayesian learning. Conditional probability is fundamental in probability theory and looks at the probability of event A given event B. Bayes' theorem allows updating beliefs with new evidence and can be visualized with diagrams. Bayesian inference involves specifying prior distributions over parameters and updating them based on observed data to obtain posterior distributions. Bayesian models can be compared using Bayes factors, which are ratios of marginal likelihoods. Bayesian learning techniques include Markov chain Monte Carlo methods and hierarchical Bayesian models.
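The belief-updating step at the heart of this introduction is easy to show concretely. This is a generic discrete illustration with made-up numbers (a diagnostic test with 99% sensitivity, 95% specificity, and a 1% base rate), not an example from the document:

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses via Bayes' theorem, from a prior dict
    and the likelihood of the observed evidence under each hypothesis."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())          # total probability of the evidence
    return {h: v / z for h, v in unnorm.items()}

post = bayes_update({"disease": 0.01, "healthy": 0.99},
                    {"disease": 0.99, "healthy": 0.05})
print(round(post["disease"], 3))  # 0.167
```

Despite the strong test, the posterior probability of disease is only about 1/6, because the low base rate dominates: exactly the kind of intuition-correcting result Bayes' theorem is usually introduced with.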
An Introduction to Mis-Specification (M-S) Testing (jemille6)
This document provides an introduction to mis-specification (M-S) testing, which is a methodology for validating statistical models. It discusses how statistical misspecification can render statistical inference unreliable by distorting nominal error probabilities. As an example, it shows how violating the independence assumption in a normal model can increase the actual type I error rate and reduce power. It argues that model validation through M-S testing is important for ensuring reliable inference, but is often neglected due to misunderstandings. All statistical methods rely on an underlying statistical model, so any misspecification impacts reliability regardless of the method used.
This document summarizes approximate Bayesian computation (ABC) methods. It begins with an overview of ABC, which provides a likelihood-free rejection technique for Bayesian inference when the likelihood function is intractable. The ABC algorithm works by simulating parameters and data until the simulated and observed data are close according to some distance measure and tolerance level. The document then discusses the asymptotic properties of ABC, including consistency of ABC posteriors and rates of convergence under certain assumptions. It also notes relationships between ABC and k-nearest neighbor methods. Examples applying ABC to autoregressive time series models are provided.
The document discusses Hessian matrices in statistics. It begins by introducing the Hessian matrix and describing relevant statistical concepts like maximum likelihood estimation and the likelihood function. It then provides an example of calculating the Hessian matrix for a Gaussian linear regression model estimated using maximum likelihood. The positive definiteness of the Hessian of the negative log-likelihood confirms that the MLE maximizes the likelihood function (equivalently, minimizes the negative log-likelihood). Larger sample sizes improve MLE estimates: the estimator converges to the true parameter value as the sample size increases.
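The Gaussian linear regression case can be checked numerically: the Hessian of the negative log-likelihood with respect to the coefficients is X'X / sigma^2, and its positive definiteness certifies the OLS/MLE solution. The simulated design and coefficients below are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
sigma2 = 1.0
y = X @ beta_true + rng.standard_normal(n) * np.sqrt(sigma2)

# Under Gaussian errors the MLE for beta is the least-squares solution
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hessian of the negative log-likelihood w.r.t. beta: X'X / sigma^2
H = X.T @ X / sigma2
eigvals = np.linalg.eigvalsh(H)
print(eigvals.min() > 0)  # positive definite => beta_hat minimizes the NLL
```

The same Hessian, inverted, is the classical estimate of the covariance of `beta_hat`, which is why it grows with n and shrinks the standard errors.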
This document discusses various importance sampling methods for approximating Bayes factors, which are used for Bayesian model selection. It compares regular importance sampling, bridge sampling, harmonic mean estimators, bridge sampling with mixtures, and Chib's solution. An example application to probit modeling of diabetes in Pima Indian women is presented to illustrate regular importance sampling. Markov chain Monte Carlo methods like the Metropolis-Hastings algorithm and Gibbs sampling can be used to sample from the probit models.
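Regular importance sampling for a marginal likelihood (the ingredient of a Bayes factor) reduces, when the prior is the proposal, to averaging the likelihood over prior draws. The toy conjugate model below is chosen so the estimate can be checked against a closed form; it is an illustration, not the document's probit example:

```python
import math
import random

def norm_pdf(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def marginal_likelihood_is(y, n_draws=200000, seed=3):
    """Estimate m(y) = integral of N(y|theta,1) N(theta|0,1) dtheta by
    importance sampling with the prior as proposal, so each weight is
    just the likelihood N(y|theta,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 1.0)        # draw theta from the prior
        total += norm_pdf(y, theta, 1.0)   # accumulate the likelihood weight
    return total / n_draws

y = 1.0
est = marginal_likelihood_is(y)
exact = norm_pdf(y, 0.0, 2.0)  # closed form for this model: N(y | 0, 2)
print(round(est, 4), round(exact, 4))
```

Prior-as-proposal sampling is the baseline the other estimators (bridge sampling, Chib's method) improve on: it degrades quickly when the posterior is far from the prior, since most weights are then negligible.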
This document provides an overview of ABC methodology and applications. It begins with examples from population genetics and econometrics that are well-suited for ABC. It then describes the basic ABC algorithm for Bayesian inference using simulation: specifying prior distributions, simulating data under different parameter values, and accepting simulations that best match the observed data. Indirect inference is also discussed as a method for choosing informative summary statistics for ABC. The document traces the origins of ABC to population genetics models from the late 1990s and highlights ongoing contributions from that field to ABC methodology.
- The document provides information about statisticshomeworkhelper.com, a service that offers probability and statistics assignment help. It lists their website, email, and phone number for contacting them.
- It then provides an example of a multi-part statistics problem involving hypothesis testing on coin flips and dice data. It asks the reader to conduct various statistical tests and interpret the results.
- Finally, it lists some additional practice problems involving chi-square tests, ANOVA, and other statistical analyses for the reader to work through.
An introduction to Bayesian Statistics using Python (freshdatabos)
This document provides an introduction to Bayesian statistics and inference through examples. It begins with an overview of Bayes' Theorem and probability concepts. An example problem about cookies in bowls is used to demonstrate applying Bayes' Theorem to update beliefs based on new data. The document introduces the Pmf class for representing probability mass functions and working through examples numerically. Further examples involving dice and trains reinforce how to build likelihood functions and update distributions. The document concludes with a real-world example of analyzing whether a coin is biased based on spin results.
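The cookie example and the Pmf idea can be reproduced with a bare-bones class. This sketch is in the spirit of the Pmf class from Think Bayes but is not the book's actual implementation; the bowl contents are the standard ones for that example (bowl 1: 30 vanilla / 10 chocolate; bowl 2: 20 of each):

```python
class Pmf(dict):
    """A bare-bones probability mass function: hypothesis -> probability."""
    def normalize(self):
        z = sum(self.values())
        for k in self:
            self[k] /= z
    def update(self, likelihood):
        """Multiply in the likelihood of the data under each hypothesis."""
        for k in self:
            self[k] *= likelihood[k]
        self.normalize()

# We draw one vanilla cookie from a randomly chosen bowl
pmf = Pmf({"bowl 1": 0.5, "bowl 2": 0.5})
pmf.update({"bowl 1": 0.75, "bowl 2": 0.5})  # P(vanilla | bowl)
print(pmf["bowl 1"])  # 0.6
```

The same two-method pattern (multiply by likelihoods, renormalize) scales unchanged to the dice, train, and biased-coin examples the document mentions.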
1. The document provides steps for conducting hypothesis tests using the p-value method, including defining hypotheses, calculating test statistics, finding p-values, and making conclusions.
2. It outlines the procedures for z-tests of means and proportions, t-tests of means, and chi-squared tests of variances. The key differences between these tests are whether the population standard deviation is known or unknown.
3. For each test, the null and alternative hypotheses are stated, the test statistic is calculated using the appropriate formula, the p-value is obtained and used to determine whether to reject or fail to reject the null hypothesis at a given significance level.
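The p-value method for the known-sigma case can be condensed into one function; the numbers in the usage line are invented for illustration:

```python
import math

def z_test_mean(xbar, mu0, sigma, n, two_sided=True):
    """One-sample z-test of H0: mu = mu0 when sigma is known.
    Returns the test statistic and its p-value."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF at |z|
    p = 2 * (1 - phi) if two_sided else 1 - phi
    return z, p

# Sample mean 52 from n = 25 observations, sigma = 5, testing H0: mu = 50
z, p = z_test_mean(52, 50, 5, 25)
print(round(z, 2), round(p, 4))  # z = 2.0, p ≈ 0.0455: reject at alpha = 0.05
```

Swapping in the t-distribution CDF when sigma is estimated gives the corresponding t-test, which is exactly the known-vs-unknown distinction the document draws between the procedures.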
The document discusses approximate Bayesian computation (ABC), a simulation-based method for conducting Bayesian inference when the likelihood function is intractable or impossible to compute directly. ABC works by simulating data under different parameter values, and accepting simulations that are close to the observed data according to some distance measure. The document covers the basic ABC algorithm, convergence properties as the tolerance approaches zero, examples of ABC for probit models and MA time series models, and advances such as modifying the proposal distribution to increase efficiency.
This document provides an overview of hypothesis testing basics:
A) Hypothesis testing involves stating a null hypothesis (H0) and alternative hypothesis (Ha) based on a research question. H0 assumes no effect or difference, while Ha claims an effect.
B) A test statistic is calculated from sample data and compared to a theoretical distribution to evaluate H0. For a one-sample z-test with known standard deviation, the test statistic is a z-score.
C) The p-value represents the probability of observing the test statistic or a more extreme value if H0 is true. Small p-values provide evidence against H0. Conventionally, p ≤ 0.05 is considered significant.
Bayesian statistics uses probability to represent uncertainty about unknown parameters in statistical models. It differs from classical statistics in that parameters are treated as random variables rather than fixed unknown constants. Bayesian probability represents a degree of belief in an event rather than the physical probability of an event. The Bayes' formula provides a way to update beliefs based on new evidence or data using conditional probability. Bayesian networks are graphical models that compactly represent joint probability distributions over many variables and allow for efficient inference.
The document discusses using random forests for approximate Bayesian computation (ABC) model choice. ABC can be framed as a machine learning problem where simulated datasets are used to learn which model is most appropriate. Random forests are well-suited for this as they can handle many correlated summary statistics without information loss. The random forest predicts the most likely model but not posterior probabilities. Instead, the posterior predictive expected error rate across models is proposed to evaluate model selection performance without unstable probability approximations. An example comparing MA(1) and MA(2) time series models illustrates the approach.
Approximate Bayesian model choice via random forestsChristian Robert
The document describes approximate Bayesian computation (ABC) methods for model choice when likelihoods are intractable. ABC generates parameter-dataset pairs from the prior and retains those where the simulated and observed datasets are similar according to a distance measure on summary statistics. For model choice, ABC approximates posterior model probabilities by the proportion of simulations from each model that are retained. Machine learning techniques can also be used to infer the most likely model directly from the simulated summary statistics.
The document proposes using random forests (RF), a machine learning tool, for approximate Bayesian computation (ABC) model choice rather than estimating model posterior probabilities. RF improves on existing ABC model choice methods by having greater discriminative power among models, being robust to the choice and number of summary statistics, requiring less computation, and providing an error rate to evaluate confidence in the model choice. The authors illustrate the power of the RF-based ABC methodology on controlled experiments and real population genetics datasets.
random forests for ABC model choice and parameter estimationChristian Robert
The document discusses Approximate Bayesian Computation (ABC). It begins by introducing ABC as a likelihood-free method for Bayesian inference when the likelihood function is unavailable or computationally intractable. ABC works by simulating data under different parameter values and accepting simulations that are close to the observed data based on a distance measure.
The document then discusses advances in ABC, including modifying the proposal distribution to increase efficiency, viewing it as a conditional density estimation problem, and including measurement error in the framework. It also discusses the consistency of ABC as the number of simulations increases and sample size grows large. Finally, it discusses applications of ABC to model selection by treating the model index as an additional parameter.
This document discusses approximate Bayesian computation (ABC) techniques for performing Bayesian inference when the likelihood function is not available in closed form. It covers the basic ABC algorithm and discusses challenges with high-dimensional data. It also summarizes recent advances in ABC that incorporate nonparametric regression, reproducing kernel Hilbert spaces, and neural networks to help address these challenges.
This document discusses several perspectives and solutions to Bayesian hypothesis testing. It outlines issues with Bayesian testing such as the dependence on prior distributions and difficulties interpreting Bayesian measures like posterior probabilities and Bayes factors. It discusses how Bayesian testing compares models rather than identifying a single true model. Several solutions to challenges are discussed, like using Bayes factors which eliminate the dependence on prior model probabilities but introduce other issues. The document also discusses testing under specific models like comparing a point null hypothesis to alternatives. Overall it presents both Bayesian and frequentist views on hypothesis testing and some of the open controversies in the field.
This document provides lecture notes on hypothesis testing. It begins with an introduction to hypothesis testing and how it differs from estimation in its hypothetical reasoning approach. It then discusses Fisher's significance testing approach, including defining a test statistic, its sampling distribution under the null hypothesis, and calculating a p-value. It provides examples of applying this approach. Finally, it discusses some weaknesses of Fisher's approach identified by Neyman and Pearson and how their approach improved upon it by introducing the concept of alternative hypotheses and pre-data error probabilities.
1) Likelihood-free Bayesian experimental design is discussed as an intractable likelihood optimization problem, where the goal is to find the optimal design d that minimizes expected loss without using the full posterior distribution.
2) Several Bayesian tools are proposed to make the design problem more Bayesian, including Bayesian non-parametrics, annealing algorithms, and placing a posterior on the design d.
3) Gaussian processes are a default modeling choice for complex unknown functions in these problems, but their accuracy is difficult to assess and they may incur a dimension curse.
This document provides an introduction to Bayesian statistics and machine learning. It discusses key concepts like conditional probability, Bayes' theorem, Bayesian inference, Bayesian model comparison, and Bayesian learning. Conditional probability is fundamental in probability theory and looks at the probability of event A given event B. Bayes' theorem allows updating beliefs with new evidence and can be visualized with diagrams. Bayesian inference involves specifying prior distributions over parameters and updating them based on observed data to obtain posterior distributions. Bayesian models can be compared using Bayes factors, which are ratios of marginal likelihoods. Bayesian learning techniques include Markov chain Monte Carlo methods and hierarchical Bayesian models.
An Introduction to Mis-Specification (M-S) Testingjemille6
This document provides an introduction to mis-specification (M-S) testing, which is a methodology for validating statistical models. It discusses how statistical misspecification can render statistical inference unreliable by distorting nominal error probabilities. As an example, it shows how violating the independence assumption in a normal model can increase the actual type I error rate and reduce power. It argues that model validation through M-S testing is important for ensuring reliable inference, but is often neglected due to misunderstandings. All statistical methods rely on an underlying statistical model, so any misspecification impacts reliability regardless of the method used.
This document summarizes approximate Bayesian computation (ABC) methods. It begins with an overview of ABC, which provides a likelihood-free rejection technique for Bayesian inference when the likelihood function is intractable. The ABC algorithm works by simulating parameters and data until the simulated and observed data are close according to some distance measure and tolerance level. The document then discusses the asymptotic properties of ABC, including consistency of ABC posteriors and rates of convergence under certain assumptions. It also notes relationships between ABC and k-nearest neighbor methods. Examples applying ABC to autoregressive time series models are provided.
The document discusses Hessian matrices in statistics. It begins by introducing the Hessian matrix and describing relevant statistical concepts like maximum likelihood estimation and the likelihood function. It then provides an example of calculating the Hessian matrix for a Gaussian linear regression model estimated using maximum likelihood. The Hessian shows that the MLE solution maximizes the likelihood function and is the minimum. Larger sample sizes improve MLE estimates, as the estimate converges to the true parameter value as the sample size increases.
This document discusses various importance sampling methods for approximating Bayes factors, which are used for Bayesian model selection. It compares regular importance sampling, bridge sampling, harmonic means, mixtures to bridge sampling, and Chib's solution. An example application to probit modeling of diabetes in Pima Indian women is presented to illustrate regular importance sampling. Markov chain Monte Carlo methods like the Metropolis-Hastings algorithm and Gibbs sampling can be used to sample from the probit models.
This document provides an overview of ABC methodology and applications. It begins with examples from population genetics and econometrics that are well-suited for ABC. It then describes the basic ABC algorithm for Bayesian inference using simulation: specifying prior distributions, simulating data under different parameter values, and accepting simulations that best match the observed data. Indirect inference is also discussed as a method for choosing informative summary statistics for ABC. The document traces the origins of ABC to population genetics models from the late 1990s and highlights ongoing contributions from that field to ABC methodology.
- The document provides information about statisticshomeworkhelper.com, a service that offers probability and statistics assignment help. It lists their website, email, and phone number for contacting them.
- It then provides an example of a multi-part statistics problem involving hypothesis testing on coin flips and dice data. It asks the reader to conduct various statistical tests and interpret the results.
- Finally, it lists some additional practice problems involving chi-square tests, ANOVA, and other statistical analyses for the reader to work through.
An introduction to Bayesian Statistics using Python
This document provides an introduction to Bayesian statistics and inference through examples. It begins with an overview of Bayes' Theorem and probability concepts. An example problem about cookies in bowls is used to demonstrate applying Bayes' Theorem to update beliefs based on new data. The document introduces the Pmf class for representing probability mass functions and working through examples numerically. Further examples involving dice and trains reinforce how to build likelihood functions and update distributions. The document concludes with a real-world example of analyzing whether a coin is biased based on spin results.
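Assuming the usual setup for the cookie problem (Bowl 1 holds 30 vanilla and 10 chocolate cookies, Bowl 2 holds 20 of each; the document's exact numbers may differ), the Bayesian update can be done numerically:

```python
# Priors: each bowl equally likely a priori; likelihoods of drawing vanilla.
prior = {"bowl1": 0.5, "bowl2": 0.5}
like_vanilla = {"bowl1": 30 / 40, "bowl2": 20 / 40}

# Bayes' theorem: posterior is proportional to prior x likelihood, then normalize.
unnorm = {b: prior[b] * like_vanilla[b] for b in prior}
total = sum(unnorm.values())
posterior = {b: v / total for b, v in unnorm.items()}
print(posterior["bowl1"])  # 0.6
```

This multiply-and-normalize pattern is exactly what the Pmf class in the document wraps up for reuse across the dice, train, and coin examples.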
1. The document provides steps for conducting hypothesis tests using the p-value method, including defining hypotheses, calculating test statistics, finding p-values, and making conclusions.
2. It outlines the procedures for z-tests of means and proportions, t-tests of means, and chi-squared tests of variances. The key differences between these tests are whether the population standard deviation is known or unknown.
3. For each test, the null and alternative hypotheses are stated, the test statistic is calculated using the appropriate formula, the p-value is obtained and used to determine whether to reject or fail to reject the null hypothesis at a given significance level.
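The p-value method's steps can be sketched for a one-sample z-test of a mean with known sigma (the hypotheses and numbers below are hypothetical):

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Step 1: H0: mu = 100 vs Ha: mu != 100, known sigma = 15,
# sample of n = 36 with observed mean 105.
mu0, sigma, n, xbar, alpha = 100, 15, 36, 105, 0.05

z = (xbar - mu0) / (sigma / math.sqrt(n))       # step 2: test statistic
p_value = 2 * (1 - norm_cdf(abs(z)))            # step 3: two-tailed p-value
reject = p_value <= alpha                        # step 4: conclusion

print(round(z, 2), round(p_value, 4), reject)   # 2.0 0.0455 True
```

Swapping the test statistic and reference distribution (t, chi-squared) adapts the same four-step skeleton to the other tests the document covers.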
The document discusses approximate Bayesian computation (ABC), a simulation-based method for conducting Bayesian inference when the likelihood function is intractable or impossible to compute directly. ABC works by simulating data under different parameter values, and accepting simulations that are close to the observed data according to some distance measure. The document covers the basic ABC algorithm, convergence properties as the tolerance approaches zero, examples of ABC for probit models and MA time series models, and advances such as modifying the proposal distribution to increase efficiency.
This document provides an overview of hypothesis testing basics:
A) Hypothesis testing involves stating a null hypothesis (H0) and alternative hypothesis (Ha) based on a research question. H0 assumes no effect or difference, while Ha claims an effect.
B) A test statistic is calculated from sample data and compared to a theoretical distribution to evaluate H0. For a one-sample z-test with known standard deviation, the test statistic is a z-score.
C) The p-value represents the probability of observing the test statistic or a more extreme value if H0 is true. Small p-values provide evidence against H0. Conventionally, p ≤ 0.05 is considered significant.
Bayesian statistics uses probability to represent uncertainty about unknown parameters in statistical models. It differs from classical statistics in that parameters are treated as random variables rather than fixed unknown constants. Bayesian probability represents a degree of belief in an event rather than the physical probability of an event. The Bayes' formula provides a way to update beliefs based on new evidence or data using conditional probability. Bayesian networks are graphical models that compactly represent joint probability distributions over many variables and allow for efficient inference.
Hypothesis testing lecture notes by Amity University
This document provides an overview of hypothesis testing and the key steps involved:
1. The null and alternative hypotheses are stated, with the null usually claiming "no difference" and the alternative contradicting the null.
2. A test statistic is calculated from the sample data and compared to the distribution assumed by the null hypothesis. For a one-sample z-test, this involves calculating the z-score.
3. The p-value is derived as the probability of obtaining a test statistic at least as extreme as what was observed, assuming the null is true. Small p-values provide strong evidence against the null.
4. Factors like statistical power and sample size requirements are also discussed to ensure reliable results.
This document provides an overview of hypothesis testing and the steps involved. It discusses:
1) Defining the null and alternative hypotheses based on the research question. The null hypothesis represents "no difference" while the alternative hypothesis claims the null is false.
2) Calculating the test statistic, which is used to test the null hypothesis. For a one-sample z-test, this involves calculating the z-score when the population standard deviation is known.
3) Computing the p-value, which is the probability of observing a test statistic as extreme or more extreme than what was observed, assuming the null hypothesis is true. Small p-values provide strong evidence against the null.
4) Interpreting the results.
This document provides an overview of hypothesis testing and the steps involved. It introduces:
1) The concepts of the null and alternative hypotheses, which are used to frame the research question. The null hypothesis represents "no difference" while the alternative hypothesis claims the null is false.
2) How to calculate the test statistic, which is used to evaluate the null hypothesis based on the sample data. For a one-sample z-test, this involves calculating the z-score.
3) How to determine the p-value, which represents the probability of observing the test statistic or one more extreme, assuming the null hypothesis is true. A small p-value provides evidence against the null.
4) How to interpret the result, rejecting or failing to reject the null hypothesis.
This document discusses hypothesis testing, including:
1) The objectives are to formulate statistical hypotheses, discuss types of errors, establish decision rules, and choose appropriate tests.
2) Key symbols and concepts are defined, such as the null and alternative hypotheses, Type I and Type II errors, test statistics like z and t, means, variances, sample sizes, and significance levels.
3) The two types of errors in hypothesis testing are discussed. Hypothesis tests can result in correct decisions or two types of errors when the null hypothesis is true or false.
4) Steps in hypothesis testing are outlined, including formulating hypotheses, specifying a significance level, choosing a test statistic, establishing a decision rule, and drawing conclusions.
Example: BMI (kg/m2) values for 14 subjects: 22.1, 23.4, 24.8, 26.2, 27.6, 28.9, 30.3, 31.6, 32.9, 34.2, 35.5, 36.8, 38.1, 39.4. The sample mean is 29.1 kg/m2 and the sample standard deviation is 4.2 kg/m2. Test the hypothesis that the population mean BMI is 30 kg/m2 at the 5% level of significance.
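A sketch of the corresponding one-sample t-test, using the stated summary statistics and taking n = 14 from the listed values (the population standard deviation is unknown, so the t-distribution with n - 1 degrees of freedom applies):

```python
import math

# Summary statistics as stated in the problem.
n, xbar, s, mu0 = 14, 29.1, 4.2, 30.0

t = (xbar - mu0) / (s / math.sqrt(n))   # test statistic, df = 13
# Two-tailed critical value t_{0.025, 13} from standard tables:
t_crit = 2.160

print(round(t, 3))                      # about -0.802
print(abs(t) > t_crit)                  # False: fail to reject H0 at the 5% level
```

Since |t| is well below the critical value, the data are consistent with a population mean BMI of 30 kg/m2.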
This document provides information about statistics and hypothesis testing concepts. It defines key terms like population, sample, parameters, statistics, standard error, random sampling, critical region, acceptance region, one-tailed and two-tailed tests, null and alternative hypotheses, type I and type II errors. It also describes common statistical tests like t-test, F-test, chi-square test and provides their assumptions and uses. Several examples of hypothesis testing problems and their solutions are given to illustrate statistical concepts and procedures.
This document provides an introduction to hypothesis testing including:
1. The 5 steps in a hypothesis test: set up null and alternative hypotheses, define test procedure, collect data, decide whether to reject null hypothesis, interpret results.
2. Large sample tests for the mean involve testing if the population mean is equal to or not equal to a specified value using a test statistic that follows a normal distribution.
3. Type I and Type II errors occur when the decision made based on the hypothesis test does not match the actual truth: a Type I error rejects the null hypothesis when it is true, and a Type II error fails to reject the null when it is false. The probability of each error can be reduced by choosing an appropriate significance level and sample size.
This document discusses hypothesis testing. It defines key terms like the null hypothesis (Ho), alternative hypothesis (H1), type 1 and type 2 errors, significance level, test statistics, critical values, rejection regions, and one-tailed vs two-tailed tests. It provides examples of how to formulate hypotheses, determine appropriate test statistics, establish critical regions, and make conclusions based on computed test values for both known and unknown population variances with one and two sample tests concerning means.
This document discusses hypothesis testing and the t-test. It covers:
1) The basics of hypothesis testing including null and alternative hypotheses, types of hypotheses, and types of errors.
2) The t-test, which is used for small samples from a normally distributed population. It relies on the t-distribution and the degree of freedom.
3) Applications of the t-test including testing the significance of a single mean, difference between two means, and paired t-tests.
4) When sample sizes are large, the normal distribution can be used instead in Z-tests for similar applications.
Excursion 4 Tour II: Rejection Fallacies: Who's Exaggerating What?
This document discusses criticisms of p-values and proposes reforms based on Bayesian statistics. It summarizes debates between Fisher and Bayesians regarding p-values exaggerating evidence against the null hypothesis when using certain priors. When a lump prior of 0.5 is given to the null and the remaining 0.5 spread over the alternative, as the sample size increases, a statistically significant result can correspond to a posterior probability for the null that exceeds the prior of 0.5. Reforms are proposed based on likelihood ratios and Bayes factors to define statistical significance in a way more consistent with Bayesian evidence standards.
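The lump-prior calculation can be reproduced for a normal mean test. Assuming, as is standard in this literature, a N(0, tau^2) prior on theta under the alternative (sigma = tau = 1 is our choice of scale), the Bayes factor in favor of H0 given z standard errors is sqrt(1 + n tau^2/sigma^2) * exp(-(z^2/2) * n tau^2 / (sigma^2 + n tau^2)):

```python
import math

def posterior_null(z, n, sigma=1.0, tau=1.0, p0=0.5):
    """Posterior P(H0 | data) with a lump prior p0 on H0: theta = 0
    and theta ~ N(0, tau^2) under H1, for xbar ~ N(theta, sigma^2/n)."""
    r = n * tau ** 2 / sigma ** 2
    bf01 = math.sqrt(1 + r) * math.exp(-0.5 * z * z * r / (1 + r))
    return p0 * bf01 / (p0 * bf01 + (1 - p0))

# A result just significant at the 5% level (z = 1.96):
for n in (10, 100, 1000):
    print(n, round(posterior_null(1.96, n), 3))
# As n grows the posterior for H0 climbs past the prior of 0.5
# despite "statistical significance" -- the phenomenon the document debates.
```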
This document provides information about the t-test and chi-square test. It defines the t-test as a test used to compare the means of two samples when the population standard deviation is unknown. It lists the assumptions of the t-test and provides the formula. An example t-test problem and solution is given. Chi-square is introduced as a test used with categorical and numerical data to test for independence and goodness of fit. The chi-square test statistic, degrees of freedom, and hypothesis testing process are outlined. An example chi-square goodness of fit problem and solution is also provided.
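A minimal goodness-of-fit sketch in the spirit of that example, testing whether a die is fair (the counts below are hypothetical, not the document's own problem):

```python
# Goodness of fit for a die: H0 says all six faces are equally likely.
observed = [22, 17, 20, 26, 22, 13]        # hypothetical counts, n = 120
expected = sum(observed) / 6               # 20 per face under H0

chi2 = sum((o - expected) ** 2 / expected for o in observed)
df = 6 - 1
chi2_crit = 11.070                         # chi-square table, df = 5, alpha = 0.05

print(round(chi2, 2), chi2 > chi2_crit)    # 5.1 False: fail to reject fairness
```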
PG STAT 531 Lecture 6 Test of Significance, z TestAashish Patel
The document summarizes key concepts related to tests of significance. It discusses:
1) The difference between population parameters and sample statistics. Parameters describe the population while statistics describe samples.
2) The goal of tests of significance is to determine if an observed difference between a sample and population statistic is statistically significant or likely due to chance. Common tests include z-tests, t-tests, chi-square tests, and F-tests.
3) All tests of significance involve a null hypothesis (H0), which is tested against an alternative hypothesis (Ha). The outcome is either rejecting or failing to reject the null hypothesis based on a significance level like alpha=0.05.
4) Type I and Type II errors are also discussed.
The document summarizes key concepts in hypothesis testing including:
- The null and alternative hypotheses are formulated, with the null hypothesis stating the parameter equals a specific value and the alternative allowing other values.
- There are two types of errors: a type I error rejects the null when it is true, and a type II error fails to reject the null when it is false. Tests aim to minimize both.
- The power of a test is the probability it correctly rejects the null when an alternative is true.
- One-tailed tests have critical regions in one tail, two-tailed in both. P-values are used to determine if results are significant.
- Steps of hypothesis testing are outlined along with examples of tests for single and two means/proportions.
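The power concept mentioned above can be made concrete for a one-sided z-test with known sigma (the numbers are hypothetical): power is the probability the test statistic lands in the rejection region when the true mean is some alternative value mu1.

```python
import math

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# One-sided test of H0: mu = mu0 vs Ha: mu > mu0, true mean mu1.
mu0, mu1, sigma, n, alpha = 100, 105, 15, 36, 0.05
z_alpha = 1.645                                # upper 5% point of N(0, 1)

# Reject when xbar > mu0 + z_alpha * sigma/sqrt(n); power is the chance
# of that event when xbar ~ N(mu1, sigma^2/n).
shift = (mu1 - mu0) / (sigma / math.sqrt(n))   # alternative is 2 SEs away
power = 1 - norm_cdf(z_alpha - shift)
print(round(power, 3))                         # about 0.64
```

Increasing n or the effect size raises `shift` and hence the power, which is why sample-size planning fixes a target power in advance.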
1) This document introduces concepts of probability and statistics that will be covered in the course, including defining probability, discrete vs. continuous probability distributions, and how to calculate mean, mode, median, and variance.
2) It discusses the differences between accuracy and precision in measurements and explains common sources of measurement errors like statistical and systematic uncertainties.
3) Key points on experimental results are presented, such as how statistical and systematic errors are combined and various ways of quoting measurements and their uncertainties.
This document provides an overview of hypothesis testing. It defines key terms like the null hypothesis (H0), alternative hypothesis (H1), type I and type II errors, significance level, p-values, and test statistics. It explains the basic steps in hypothesis testing as testing a claim about a population parameter by collecting a sample, determining the appropriate test statistic based on the sampling distribution, and comparing it to critical values to reject or fail to reject the null hypothesis. Examples are provided to demonstrate hypothesis tests for a mean when the population standard deviation is known or unknown.
8. Testing of hypothesis for variable & attribute data
The document discusses hypothesis testing for continuous variable and attribute data. It begins by defining key concepts in statistical inference like the null and alternative hypotheses. The three types of hypotheses are explained - two-tailed, left-tailed, and right-tailed. The document then discusses hypothesis testing steps including defining the hypotheses, determining the sampling risk of type I and type II errors, calculating the p-value, and making a decision to accept or reject the null hypothesis based on the p-value and significance level. Specific parametric statistical tests are explained like the one sample t-test, two sample t-test, and ANOVA. Examples of each test are provided and how to interpret the results.
Categorical data analysis full lecture note PPT.pptx
This document provides an overview of categorical data analysis techniques. It discusses categorical and quantitative variables, different types of categorical variables, and common distributions for categorical data like binomial and multinomial. Methods for categorical data like chi-square tests, logistic regression, and Poisson regression are presented. Examples are provided to illustrate hypothesis testing, confidence intervals, and likelihood ratio tests for categorical proportions.
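One of the chi-square methods mentioned, the test of independence on a 2x2 contingency table, can be sketched with hypothetical counts:

```python
# 2x2 test of independence between group and outcome (hypothetical counts).
table = [[30, 20],   # group A: (success, failure)
         [18, 32]]   # group B: (success, failure)

row = [sum(r) for r in table]            # row totals
col = [sum(c) for c in zip(*table)]      # column totals
n = sum(row)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n          # expected count under independence
        chi2 += (table[i][j] - e) ** 2 / e

chi2_crit = 3.841                        # df = (2-1)(2-1) = 1, alpha = 0.05
print(round(chi2, 3), chi2 > chi2_crit)  # 5.769 True: reject independence
```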
Similar to 2018 MUMS Fall Course - Essentials of Bayesian Hypothesis Testing - Jim Berger, September 4, 2018
Recently, the machine learning community has expressed strong interest in applying latent variable modeling strategies to causal inference problems with unobserved confounding. Here, I discuss one of the big debates that occurred over the past year, and how we can move forward. I will focus specifically on the failure of point identification in this setting, and discuss how this can be used to design flexible sensitivity analyses that cleanly separate identified and unidentified components of the causal model.
I will discuss paradigmatic statistical models of inference and learning from high dimensional data, such as sparse PCA and the perceptron neural network, in the sub-linear sparsity regime. In this limit the underlying hidden signal, i.e., the low-rank matrix in PCA or the neural network weights, has a number of non-zero components that scales sub-linearly with the total dimension of the vector. I will provide explicit low-dimensional variational formulas for the asymptotic mutual information between the signal and the data in suitable sparse limits. In the setting of support recovery these formulas imply sharp 0-1 phase transitions for the asymptotic minimum mean-square-error (or generalization error in the neural network setting). A similar phase transition was analyzed recently in the context of sparse high-dimensional linear regression by Reeves et al.
Many different measurement techniques are used to record neural activity in the brains of different organisms, including fMRI, EEG, MEG, lightsheet microscopy and direct recordings with electrodes. Each of these measurement modes have their advantages and disadvantages concerning the resolution of the data in space and time, the directness of measurement of the neural activity and which organisms they can be applied to. For some of these modes and for some organisms, significant amounts of data are now available in large standardized open-source datasets. I will report on our efforts to apply causal discovery algorithms to, among others, fMRI data from the Human Connectome Project, and to lightsheet microscopy data from zebrafish larvae. In particular, I will focus on the challenges we have faced both in terms of the nature of the data and the computational features of the discovery algorithms, as well as the modeling of experimental interventions.
1) The document presents a statistical modeling approach called targeted smooth Bayesian causal forests (tsbcf) to smoothly estimate heterogeneous treatment effects over gestational age using observational data from early medical abortion regimens.
2) The tsbcf method extends Bayesian additive regression trees (BART) to estimate treatment effects that evolve smoothly over gestational age, while allowing for heterogeneous effects across patient subgroups.
3) The tsbcf analysis of early medical abortion regimen data found the simultaneous administration to be similarly effective overall to the interval administration, but identified some patient subgroups where effectiveness may vary more over gestational age.
Difference-in-differences is a widely used evaluation strategy that draws causal inference from observational panel data. Its causal identification relies on the assumption of parallel trends, which is scale-dependent and may be questionable in some applications. A common alternative is a regression model that adjusts for the lagged dependent variable, which rests on the assumption of ignorability conditional on past outcomes. In the context of linear models, Angrist and Pischke (2009) show that the difference-in-differences and lagged-dependent-variable regression estimates have a bracketing relationship. Namely, for a true positive effect, if ignorability is correct, then mistakenly assuming parallel trends will overestimate the effect; in contrast, if the parallel trends assumption is correct, then mistakenly assuming ignorability will underestimate the effect. We show that the same bracketing relationship holds in general nonparametric (model-free) settings. We also extend the result to semiparametric estimation based on inverse probability weighting.
We develop sensitivity analyses for weak nulls in matched observational studies while allowing unit-level treatment effects to vary. In contrast to randomized experiments and paired observational studies, we show for general matched designs that over a large class of test statistics, any valid sensitivity analysis for the weak null must be unnecessarily conservative if Fisher's sharp null of no treatment effect for any individual also holds. We present a sensitivity analysis valid for the weak null, and illustrate why it is conservative if the sharp null holds through connections to inverse probability weighted estimators. An alternative procedure is presented that is asymptotically sharp if treatment effects are constant, and is valid for the weak null under additional assumptions which may be deemed reasonable by practitioners. The methods may be applied to matched observational studies constructed using any optimal without-replacement matching algorithm, allowing practitioners to assess robustness to hidden bias while allowing for treatment effect heterogeneity.
This document discusses difference-in-differences (DiD) analysis, a quasi-experimental method used to estimate treatment effects. The author notes that while widely applicable, DiD relies on strong assumptions about the counterfactual. She recommends approaches like matching on observed variables between similar populations, thoughtfully specifying regression models to adjust for confounding factors, testing for parallel pre-treatment trends under different assumptions, and considering more complex models that allow for different types of changes over time. The overall message is that DiD requires careful consideration and testing of its underlying assumptions to draw valid causal conclusions.
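The basic DiD estimate reduces to a difference of two differences; a minimal sketch with hypothetical group means (under the parallel-trends assumption the document stresses, the control group's change stands in for the treated group's counterfactual change):

```python
# Difference-in-differences from four group means (hypothetical numbers):
# average outcomes for treated/control units, before/after the policy.
treated_pre, treated_post = 10.0, 14.0
control_pre, control_post = 9.0, 11.0

# Control change (11 - 9 = 2) is the assumed counterfactual trend, so:
did = (treated_post - treated_pre) - (control_post - control_pre)
print(did)  # 2.0 -> estimated treatment effect
```

If parallel trends fails, this number absorbs whatever differential trend exists, which is why the document emphasizes pre-trend testing and sensitivity checks.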
We present recent advances and statistical developments for evaluating Dynamic Treatment Regimes (DTR), which allow the treatment to be dynamically tailored according to evolving subject-level data. Identification of an optimal DTR is a key component for precision medicine and personalized health care. Specific topics covered in this talk include several recent projects with robust and flexible methods developed for the above research area. We will first introduce a dynamic statistical learning method, adaptive contrast weighted learning (ACWL), which combines doubly robust semiparametric regression estimators with flexible machine learning methods. We will further develop a tree-based reinforcement learning (T-RL) method, which builds an unsupervised decision tree that maintains the nature of batch-mode reinforcement learning. Unlike ACWL, T-RL handles the optimization problem with multiple treatment comparisons directly through a purity measure constructed with augmented inverse probability weighted estimators. T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs. However, ACWL seems more robust against tree-type misspecification than T-RL when the true optimal DTR is non-tree-type. At the end of this talk, we will also present a new Stochastic-Tree Search method called ST-RL for evaluating optimal DTRs.
A fundamental feature of evaluating causal health effects of air quality regulations is that air pollution moves through space, rendering health outcomes at a particular population location dependent upon regulatory actions taken at multiple, possibly distant, pollution sources. Motivated by studies of the public-health impacts of power plant regulations in the U.S., this talk introduces the novel setting of bipartite causal inference with interference, which arises when 1) treatments are defined on observational units that are distinct from those at which outcomes are measured and 2) there is interference between units in the sense that outcomes for some units depend on the treatments assigned to many other units. Interference in this setting arises due to complex exposure patterns dictated by physical-chemical atmospheric processes of pollution transport, with intervention effects framed as propagating across a bipartite network of power plants and residential zip codes. New causal estimands are introduced for the bipartite setting, along with an estimation approach based on generalized propensity scores for treatments on a network. The new methods are deployed to estimate how emission-reduction technologies implemented at coal-fired power plants causally affect health outcomes among Medicare beneficiaries in the U.S.
Laine Thomas presented information about how causal inference is being used to determine the cost/benefit of the two most common surgical treatments for women - hysterectomy and myomectomy.
We provide an overview of some recent developments in machine learning tools for dynamic treatment regime discovery in precision medicine. The first development is a new off-policy reinforcement learning tool for continual learning in mobile health to enable patients with type 1 diabetes to exercise safely. The second development is a new inverse reinforcement learning tool that enables the use of observational data to learn how clinicians balance competing priorities for treating depression and mania in patients with bipolar disorder. Both practical and technical challenges are discussed.
The method of differences-in-differences (DID) is widely used to estimate causal effects. The primary advantage of DID is that it can account for time-invariant bias from unobserved confounders. However, the standard DID estimator will be biased if there is an interaction between history in the after period and the groups. That is, bias will be present if an event besides the treatment occurs at the same time and affects the treated group in a differential fashion. We present a method of bounds based on DID that accounts for an unmeasured confounder that has a differential effect in the post-treatment time period. These DID bracketing bounds are simple to implement and only require partitioning the controls into two separate groups. We also develop two key extensions for DID bracketing bounds. First, we develop a new falsification test to probe the key assumption that is necessary for the bounds estimator to provide consistent estimates of the treatment effect. Next, we develop a method of sensitivity analysis that adjusts the bounds for possible bias based on differences between the treated and control units from the pretreatment period. We apply these DID bracketing bounds and the new methods we develop to an application on the effect of voter identification laws on turnout. Specifically, we focus estimating whether the enactment of voter identification laws in Georgia and Indiana had an effect on voter turnout.
This document summarizes a simulation study evaluating causal inference methods for assessing the effects of opioid and gun policies. The study used real US state-level data to simulate the adoption of policies by some states and estimated the effects using different statistical models. It found that with fewer adopting states, type 1 error rates were too high, and most models lacked power. It recommends using cluster-robust standard errors and lagged outcomes to improve model performance. The study aims to help identify best practices for policy evaluation studies.
We study experimental design in large-scale stochastic systems with substantial uncertainty and structured cross-unit interference. We consider the problem of a platform that seeks to optimize supply-side payments p in a centralized marketplace where different suppliers interact via their effects on the overall supply-demand equilibrium, and propose a class of local experimentation schemes that can be used to optimize these payments without perturbing the overall market equilibrium. We show that, as the system size grows, our scheme can estimate the gradient of the platform’s utility with respect to p while perturbing the overall market equilibrium by only a vanishingly small amount. We can then use these gradient estimates to optimize p via any stochastic first-order optimization method. These results stem from the insight that, while the system involves a large number of interacting units, any interference can only be channeled through a small number of key statistics, and this structure allows us to accurately predict feedback effects that arise from global system changes using only information collected while remaining in equilibrium.
We discuss a general roadmap for generating causal inference based on observational studies used to generate real-world evidence. We review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand for realistic (i.e., infinite-dimensional) statistical models. TMLE is a two-stage procedure that first uses ensemble machine learning, termed super-learning, to estimate the relevant stochastic relations between the treatment, censoring, covariates, and outcome of interest. The super-learner allows one to fully utilize all the advances in machine learning (in addition to more conventional parametric-model-based estimators) to build a single most powerful ensemble machine learning algorithm. We present the Highly Adaptive Lasso as an important machine learning algorithm to include.
In the second step, TMLE maximizes a parametric likelihood along a so-called least favorable parametric model through the super-learner fit of the relevant stochastic relations in the observed data. This second step bridges the state of the art in machine learning to estimators of target estimands for which statistical inference is available (i.e., confidence intervals, p-values, etc.). We also review recent advances in collaborative TMLE, in which the fit of the treatment and censoring mechanism is tailored with respect to the performance of the TMLE, and discuss asymptotically valid bootstrap-based inference. Simulations and data analyses are provided as demonstrations.
We describe different approaches for specifying models and prior distributions for estimating heterogeneous treatment effects using Bayesian nonparametric models. We make an affirmative case for direct, informative (or partially informative) prior distributions on heterogeneous treatment effects, especially when treatment effect size and treatment effect variation is small relative to other sources of variability. We also consider how to provide scientifically meaningful summaries of complicated, high-dimensional posterior distributions over heterogeneous treatment effects with appropriate measures of uncertainty.
Climate change mitigation has traditionally been analyzed as some version of a public goods game (PGG) in which a group is most successful if everybody contributes, but players are best off individually by not contributing anything (i.e., “free-riding”)—thereby creating a social dilemma. Analysis of climate change using the PGG and its variants has helped explain why global cooperation on GHG reductions is so difficult, as nations have an incentive to free-ride on the reductions of others. Rather than inspire collective action, it seems that the lack of progress in addressing the climate crisis is driving the search for a “quick fix” technological solution that circumvents the need for cooperation.
This document discusses various types of academic writing and provides tips for effective academic writing. It outlines common academic writing formats such as journal papers, books, and reports. It also lists writing necessities like having a clear purpose, understanding your audience, using proper grammar and being concise. The document cautions against plagiarism and not proofreading. It provides additional dos and don'ts for writing, such as using simple language and avoiding filler words. Overall, the key message is that academic writing requires selling your ideas effectively to the reader.
Machine learning (including deep and reinforcement learning) and blockchain are two of the most noticeable technologies in recent years. The first is the foundation of artificial intelligence and big data; the second has significantly disrupted the financial industry. Both technologies are data-driven, so there is rapidly growing interest in integrating them for more secure and efficient data sharing and analysis. In this paper, we review the research on combining blockchain and machine learning technologies and demonstrate that they can collaborate efficiently and effectively. In the end, we point out some future directions and expect more research on deeper integration of the two promising technologies.
In this talk, we discuss QuTrack, a Blockchain-based approach to track experiment and model changes primarily for AI and ML models. In addition, we discuss how change analytics can be used for process improvement and to enhance the model development and deployment processes.
More from The Statistical and Applied Mathematical Sciences Institute (20)
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumMJDuyan
(𝐓𝐋𝐄 𝟏𝟎𝟎) (𝐋𝐞𝐬𝐬𝐨𝐧 𝟏)-𝐏𝐫𝐞𝐥𝐢𝐦𝐬
𝐃𝐢𝐬𝐜𝐮𝐬𝐬 𝐭𝐡𝐞 𝐄𝐏𝐏 𝐂𝐮𝐫𝐫𝐢𝐜𝐮𝐥𝐮𝐦 𝐢𝐧 𝐭𝐡𝐞 𝐏𝐡𝐢𝐥𝐢𝐩𝐩𝐢𝐧𝐞𝐬:
- Understand the goals and objectives of the Edukasyong Pantahanan at Pangkabuhayan (EPP) curriculum, recognizing its importance in fostering practical life skills and values among students. Students will also be able to identify the key components and subjects covered, such as agriculture, home economics, industrial arts, and information and communication technology.
𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐭𝐡𝐞 𝐍𝐚𝐭𝐮𝐫𝐞 𝐚𝐧𝐝 𝐒𝐜𝐨𝐩𝐞 𝐨𝐟 𝐚𝐧 𝐄𝐧𝐭𝐫𝐞𝐩𝐫𝐞𝐧𝐞𝐮𝐫:
-Define entrepreneurship, distinguishing it from general business activities by emphasizing its focus on innovation, risk-taking, and value creation. Students will describe the characteristics and traits of successful entrepreneurs, including their roles and responsibilities, and discuss the broader economic and social impacts of entrepreneurial activities on both local and global scales.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxEduSkills OECD
Iván Bornacelly, Policy Analyst at the OECD Centre for Skills, OECD, presents at the webinar 'Tackling job market gaps with a skills-first approach' on 12 June 2024
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
Gender and Mental Health - Counselling and Family Therapy Applications and In...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instrument used Object of levelling, Methods of levelling in brief, and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. . Noise Pollution Harmful Effects of noise pollution, control of noise pollution, Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
This document provides an overview of wound healing, its functions, stages, mechanisms, factors affecting it, and complications.
A wound is a break in the integrity of the skin or tissues, which may be associated with disruption of the structure and function.
Healing is the body’s response to injury in an attempt to restore normal structure and functions.
Healing can occur in two ways: Regeneration and Repair
There are 4 phases of wound healing: hemostasis, inflammation, proliferation, and remodeling. This document also describes the mechanism of wound healing. Factors that affect healing include infection, uncontrolled diabetes, poor nutrition, age, anemia, the presence of foreign bodies, etc.
Complications of wound healing like infection, hyperpigmentation of scar, contractures, and keloid formation.
B. Ed Syllabus for babasaheb ambedkar education university.pdf
2018 MUMS Fall Course - Essentials of Bayesian Hypothesis Testing - Jim Berger, September 4, 2018
1. Model Uncertainty (SAMSI – Fall 2018)

Lecture 2: Essentials of Bayesian hypothesis testing
2. Outline
• An introductory example
• General notation
• Precise and imprecise hypotheses
• Choice of prior distributions for testing
• Paradoxes
• Ockham's Razor
• Multiple hypotheses and sequential testing
• Psychokinesis example
4. Hypotheses and data:
• Alvac had shown no effect (in many studies) as a vaccine against HIV.
• Aidsvax had shown no effect (in many studies) as a vaccine against HIV.
Question: Would Alvac as a primer and Aidsvax as a booster work?
The Study: Conducted in Thailand with 16,395 individuals from the general (not high-risk) population:
• 74 HIV cases reported in the 8198 individuals receiving placebos
• 51 HIV cases reported in the 8197 individuals receiving the treatment
5. The test that was performed:
• Let p1 and p2 denote the probability of HIV in the placebo and treatment populations, respectively. The usual estimates are
    p̂1 = 74/8198 = 0.009027,   p̂2 = 51/8197 = 0.006222.
• Test H0 : p1 = p2 versus H1 : p1 > p2.
• The normal approximation is adequate, so
    z = (p̂1 − p̂2) / σ̂{p̂1−p̂2} = (0.009027 − 0.006222) / 0.001359 = 2.06
  is approximately N(θ, 1), where θ = (p1 − p2)/0.001359. We thus test H0 : θ = 0 versus H1 : θ > 0, based on z.
• Observed z = 2.06, so the p-value is P(Z ≥ 2.06) = 1 − Φ(2.06) = 0.02 (where Z is N(0, 1) and Φ is the standard normal cdf).
• The problem: interpreting this as the odds being 50 to 1 in favor of H1.
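The z-statistic and p-value above can be reproduced in a few lines of Python. One assumption on my part: the unpooled standard error is used, since it matches the quoted 0.001359.

```python
import math

def norm_cdf(x):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Observed counts from the Thai vaccine trial.
x1, n1 = 74, 8198   # placebo
x2, n2 = 51, 8197   # treatment

p1_hat = x1 / n1
p2_hat = x2 / n2

# Unpooled standard error of the difference (reproduces the slide's 0.001359).
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

z = (p1_hat - p2_hat) / se   # about 2.06
p_value = 1.0 - norm_cdf(z)  # about 0.02
```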
6. Bayesian analysis:
• Assign prior probabilities Pr(H0) and Pr(H1).
• Under H1, specify a prior distribution π1(θ) for θ ∈ (0, ∞).
• The prior probability of H0 and observing z is Pr(H0) f(z | 0), where f(z | 0) is the standard normal density.
• The prior probability of H1 and observing z is Pr(H1) m(z | π1), where
    m(z | π1) = ∫₀^∞ f(z | θ) π1(θ) dθ
  is the marginal density of z under H1 and the prior π1.
• By Bayes' theorem,
    Pr(H0 | z) = Pr(H0) f(z | 0) / [Pr(H0) f(z | 0) + Pr(H1) m(z | π1)]
               = 1 / [1 + (Pr(H1)/Pr(H0)) B10],
  where B10(z) = m(z | π1)/f(z | 0) is the Bayes factor of H1 to H0.
7. The nonincreasing prior π1 most favorable to H1 is π1(θ) = Uniform(0, 2.95), which yields
    B10(2.06) = [∫₀^2.95 (1/√(2π)) e^{−(2.06−θ)²/2} (1/2.95) dθ] / [(1/√(2π)) e^{−(2.06−0)²/2}] = 5.63,
so that
    Pr(H0 | z = 2.06) = 1 / [1 + (Pr(H1)/Pr(H0)) × 5.63].
Here is this posterior probability for various values of Pr(H0):

Pr(H0)         0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Pr(H0 | 2.06)  0.03  0.06  0.10  0.14  0.20  0.28  0.37  0.50  0.70

Hence the common misinterpretation that p = 0.02 implies odds of 50 to 1 in favor of the alternative is simply very wrong.
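The Bayes factor B10(2.06) = 5.63 has a closed form, because the numerator integral is a difference of normal cdfs; a minimal sketch:

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / SQRT2PI

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

z = 2.06
c = 2.95  # upper end of the Uniform(0, c) prior on theta under H1

# Marginal of z under H1: integral of N(z | theta, 1) * (1/c) over (0, c),
# which equals [Phi(z) - Phi(z - c)] / c.
m1 = (norm_cdf(z) - norm_cdf(z - c)) / c

B10 = m1 / norm_pdf(z)  # about 5.63
```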
8. The following studies
• look at large collections of published studies where 0 < p < 0.05;
• compute a Bayes factor B01 for each study;
• graph the Bayes factors versus the corresponding p-values.
The first two graphs are for 272 'significant' epidemiological studies with two different choices of the prior; the third is for 50 'significant' meta-analyses (these three from J.P. Ioannidis, Am J Epidemiology, 2008); and the last is for 314 ecological studies (reported in Elgersma and Green, 2011).
10. General Notation
• X | θ ∼ f(x | θ).
• To test: H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1.
• Prior distribution:
  – Prior probabilities Pr(H0) and Pr(H1) of the hypotheses.
  – Prior densities (often proper), π0(θ) and π1(θ), on Θ0 and Θ1.
    ∗ πi(θ) would be a point mass if Θi is a point.
• Marginal likelihoods under the hypotheses:
    m(x | Hi) = ∫_{Θi} f(x | θ) πi(θ) dθ,   i = 0, 1.
• Bayes factor of H0 to H1:
    B01 = m(x | H0) / m(x | H1).
12. Conclusions from posterior probabilities or Bayes factors:
• In some sense, H0 would be accepted if Pr(H0 | x) > Pr(H1 | x), but there is no real need to state this decision.
• Normally, just the Bayes factor B01 (or B10) is presented because
  – it can be combined with differing personal prior odds;
  – the 'default' Pr(H0) = Pr(H1) can be used.
• For interpreting B10, Jeffreys (1961) suggested the scale

  B10             Strength of evidence
  1:1 to 3:1      Barely worth mentioning
  3:1 to 10:1     Substantial
  10:1 to 30:1    Strong
  30:1 to 100:1   Very strong
  > 100:1         Decisive
13. Precise or Imprecise Hypothesis (often Point Null or One-Sided)
A Key Issue: Is the precise hypothesis being tested plausible?
A precise hypothesis is a hypothesis of lower dimension than the alternative (e.g., H0 : µ = 0 versus H1 : µ ≠ 0).
A precise hypothesis is plausible if it has a reasonable prior probability of being true. H0 : "there is no Higgs boson particle" is plausible.
Example: Let θ denote the difference in mean treatment effects for cancer treatments A and B, and test H0 : θ = 0 versus H1 : θ ≠ 0.
Scenario 1: Treatment A = standard chemotherapy; Treatment B = standard chemotherapy + steroids.
Scenario 2: Treatment A = standard chemotherapy; Treatment B = a new radiation therapy.
H0 : θ = 0 is plausible in Scenario 1, but not in Scenario 2; in the latter case, instead test H0 : θ < 0 versus H1 : θ > 0.
14. Plausible precise null hypotheses:
• H0 : Gene A is not associated with Disease B.
• H0 : There is no psychokinetic effect.
• H0 : Vitamin C has no effect on the common cold.
• H0 : A new HIV vaccine has no effect.
• H0 : Cosmic microwave background radiation is isotropic.
• H0 : Males and females have the same distribution of eye color.
• H0 : Pollutant A does not cause disease B.
Implausible precise null hypotheses:
• H0 : Small mammals are as abundant on livestock grazing land as on non-grazing land.
• H0 : Bird abundance does not depend on the type of forest habitat they occupy.
• H0 : Children of different ages react the same to a given stimulus.
15. Approximating a believable precise hypothesis by an exact precise null hypothesis
A precise null, like H0 : θ = θ0, is typically never true exactly; rather, it is used as a surrogate for a 'real null'
    H0^ε : |θ − θ0| < ε,   ε small.
(Even if θ = θ0 in nature, the experiment studying θ will typically have a small unknown bias, introducing an ε.)
Result (Berger and Delampady, 1987, Statistical Science): Robust Bayesian theory can be used to show that, under reasonable conditions, if ε < σ_θ̂ / 4, where σ_θ̂ is the standard error of the estimate of θ, then
    Pr(H0^ε | x) ≈ Pr(H0 | x).
Note: Typically σ_θ̂ ≈ c/√n, where n is the sample size, so for large n the above condition can be violated, and using a precise null may not be appropriate, even if the real null is believable.
16. Posterior probabilities can equal p-values in one-sided testing:
Normal Example:
• X | θ ∼ N(x | θ, σ²).
• One-sided testing: H0 : θ ≤ θ0 versus H1 : θ > θ0.
• Choose the usual objective estimation prior π(θ) = 1, for which the posterior distribution π(θ | x) can be shown to be N(θ | x, σ²).
• Posterior probability of H0:
    Pr(H0 | x) = Pr(θ ≤ θ0 | x) = Φ((θ0 − x)/σ) = 1 − Φ((x − θ0)/σ) = Pr(X > x | θ0) = p-value.
• It has been argued (e.g., Berger and Mortera, JASA 1999) that objective testing in one-sided cases should use priors more concentrated than π(θ) = 1, so p-values are still questionable in one-sided testing.
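The equality of the one-sided posterior probability and the p-value under the flat prior is easy to verify numerically; the particular numbers (θ0 = 0, σ = 1, x = 1.7) are illustrative assumptions:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Illustrative values (assumptions, not from the slide):
theta0, sigma, x = 0.0, 1.0, 1.7

# Posterior Pr(H0 | x) under the flat prior: theta | x ~ N(x, sigma^2).
post_H0 = norm_cdf((theta0 - x) / sigma)

# Frequentist p-value: Pr(X > x | theta0) with X ~ N(theta0, sigma^2).
p_value = 1.0 - norm_cdf((x - theta0) / sigma)

# The two quantities agree exactly.
```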
17. Choice of Prior Distributions in Testing
A. Choosing priors for "common parameters" in testing:
Common parameters in the densities under two hypotheses are parameters that have the same role in, and are present in, both.
Example: If the data xi are i.i.d. N(xi | θ, σ²), and it is desired to test
    H0 : θ = 0 versus H1 : θ ≠ 0,
the density under H0 is N(xi | 0, σ²) and that under H1 is N(xi | θ, σ²), so σ² is a common parameter to both densities and has the same role in each.
Priors for common parameters: Use the standard 'objective prior' for the common parameters.
Example: For the normal testing problem, the standard objective prior for the variance is π(σ²) = 1/σ².
18. Example: An example requiring parameter transformation to obtain 'common parameters' (Dass and Berger, 2003):
• Vehicle emissions data from McDonald et al. (1995).
• The data consist of 3 types of emissions, hydrocarbons (HC), carbon monoxide (CO) and nitrogen oxides (NOx), at 4 different mileage states: 0, 4000, 24,000(b) and 24,000(a).

Data for 4,000 miles
HC   0.26  0.48  0.40  0.38  0.31  0.49  0.25  0.23
CO   1.16  1.75  1.64  1.54  1.45  2.59  1.39  1.26
NOx  1.99  1.90  1.89  2.45  1.54  2.01  1.95  2.17
HC   0.39  0.21  0.22  0.45  0.39  0.36  0.44  0.22
CO   2.72  2.23  3.94  1.88  1.49  1.81  2.90  1.16
NOx  1.93  2.58  2.12  1.80  1.46  1.89  1.85  2.21
19. Goal: Based on independent data X = (X1, . . . , Xn), test whether the i.i.d. Xi follow the Weibull or the lognormal distribution, given respectively by
    H0 : fW(x | β, γ) = (γ/β) (x/β)^{γ−1} exp{−(x/β)^γ},   x > 0, β > 0, γ > 0;
    H1 : fL(x | µ, σ²) = [1/(x √(2πσ²))] exp{−(log x − µ)²/(2σ²)},   x > 0, σ > 0.
Both distributions are location-scale distributions in y = log x, i.e., are of the form
    (1/σ) g((y − µ)/σ)
for some density g(·). To see this for the Weibull, define y = log x, µ = log β, and σ = 1/γ; then
    fW(y | µ, σ) = (1/σ) e^{(y−µ)/σ} e^{−e^{(y−µ)/σ}}.
20. Berger, Pericchi and Varshavsky (1998) argue that, for two hypotheses (models) with the same invariance structure (here location-scale invariance), one can use the right-Haar prior (usually improper) for both. Here the right-Haar prior is
    π^RH(µ, σ) dµ dσ = (1/σ) dσ dµ.
The justification is called predictive matching and goes as follows (related to arguments in Jeffreys, 1961):
• With only two observations (y1, y2), one cannot possibly distinguish between fW(y | µ, σ) and fL(y | µ, σ), so we should have B01(y1, y2) = 1.
• Lemma:
    ∫₀^∞ ∫_{−∞}^∞ (1/σ) g((y1 − µ)/σ) (1/σ) g((y2 − µ)/σ) π^RH(µ, σ) dµ dσ = 1/(2|y1 − y2|)
  for any density g(·), implying B01(y1, y2) = 1 (as desired) for two location-scale densities when the right-Haar prior is used.
21. Using the right-Haar prior for both models, calculus then yields that the Bayes factor of H0 to H1 is
    B01(X) = [2 Γ(n) n^{n/2} π^{(n−1)/2} / Γ((n−1)/2)] ∫₀^∞ v^{n−2} [ Σ_{i=1}^n exp{(yi − ȳ) v / s_y} ]^{−n} dv,
where ȳ = (1/n) Σ_{i=1}^n yi and s_y² = (1/n) Σ_{i=1}^n (yi − ȳ)². Note: B01(X) is also the classical UMP invariant test statistic.
As an example, consider four of the car emission data sets, each giving the carbon monoxide emission at a different mileage level. For testing H0 : Lognormal versus H1 : Weibull, the results were as follows:

Data set at mileage level   0      4000   24,000  30,000
B01                         0.404  0.110  0.161   0.410
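The Bayes factor reduces to a one-dimensional integral that is straightforward to evaluate numerically. In the sketch below the normalizing constant is re-derived from the right-Haar marginal likelihoods (so treat it as an assumption rather than the slide's verbatim formula); the two-observation predictive-matching property B01(y1, y2) = 1 provides a built-in sanity check:

```python
import math

def bayes_factor_weibull_vs_lognormal(y, steps=20000, vmax=40.0):
    # One-dimensional integral form of the Weibull-to-lognormal Bayes
    # factor under right-Haar priors; the constant in front is my own
    # re-derivation and should be treated as an assumption.
    n = len(y)
    ybar = sum(y) / n
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)
    dev = [(yi - ybar) / sy for yi in y]

    def integrand(v):
        return v ** (n - 2) * sum(math.exp(d * v) for d in dev) ** (-n)

    # Composite Simpson's rule on (0, vmax].
    h = vmax / steps
    total = integrand(1e-12) + integrand(vmax)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    integral = total * h / 3.0

    const = (2 * math.gamma(n) * n ** (n / 2.0)
             * math.pi ** ((n - 1) / 2.0) / math.gamma((n - 1) / 2.0))
    return const * integral

# Predictive-matching check: with exactly two observations the Bayes
# factor must equal 1 for any pair (y1, y2).
B = bayes_factor_weibull_vs_lognormal([0.0, 1.0])
```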
22. B. Choosing priors for non-common parameters
If subjective choice is not possible, be aware that:
• Vague proper priors are often horrible: for instance, if X ∼ N(x | µ, 1) and we test H0 : µ = 0 versus H1 : µ ≠ 0 with a Uniform(−c, c) prior for µ, the Bayes factor is
    B01(c) = f(x | 0) / [∫_{−c}^{c} f(x | µ) (2c)^{−1} dµ] ≈ 2c f(x | 0) / ∫_{−∞}^{∞} f(x | µ) dµ = 2c f(x | 0)
  for large c, which depends dramatically on the choice of c.
• Improper priors are problematic, because they are unnormalized; should we use π(µ) = 1 or π(µ) = 2, yielding
    B01 = f(x | 0) / ∫_{−∞}^{∞} f(x | µ) (1) dµ = f(x | 0)   or   B01 = f(x | 0) / ∫_{−∞}^{∞} f(x | µ) (2) dµ = (1/2) f(x | 0) ?
• It is curious here that use of vague proper priors is much worse than use of objective improper priors (though neither can be justified).
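The c-dependence of the vague-proper-prior Bayes factor is easy to see numerically; the observed value x = 1 is an illustrative assumption:

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def B01(x, c):
    # Uniform(-c, c) prior on mu under H1; the marginal density of x is
    # [Phi(x + c) - Phi(x - c)] / (2c).
    m1 = (norm_cdf(x + c) - norm_cdf(x - c)) / (2.0 * c)
    return norm_pdf(x) / m1

x = 1.0  # illustrative observed value (assumption)
# B01 grows roughly linearly in c: the vaguer the proper prior,
# the more the Bayes factor spuriously favors H0.
```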
23. Various proposed default priors for non-common parameters
• Conventional priors
  – Jeffreys' choice
  – The 'robust prior' and Bayesian t-test
• Priors induced from a single prior
• Intrinsic priors (derived from data or imaginary data)
24. Conventional priors
Jeffreys' choices for the normal testing problem:
• Data: X = (X1, X2, . . . , Xn).
• We are testing H0 : Xi ∼ N(xi | 0, σ0²) versus H1 : Xi ∼ N(xi | µ, σ1²).
• We thus seek π0(σ0²) and π1(µ, σ1²) = π1(µ | σ1²) π1(σ1²).
• Since σ0² and σ1² are common parameters with the same role, Jeffreys used the same objective prior π0(σ0²) = 1/σ0² and π1(σ1²) = 1/σ1².
• π1(µ | σ1²) must be proper (and not vague), since µ only occurs in H1. Jeffreys argued that it
  – should be centered at zero (H0);
  – should have scale σ1 (the 'natural' scale of the problem);
  – should be symmetric around zero;
  – should have no moments (more on this later).
The 'simplest prior' satisfying these is the Cauchy(µ | 0, σ1) prior,
25. resulting in Jeffreys' proposal:
    π0(σ0²) = 1/σ0²,   π1(µ, σ1²) = [1/(π σ1 (1 + (µ/σ1)²))] · (1/σ1²).
Predictive matching argument for these priors:
For any location-scale density (1/σ) g((y − µ)/σ) and one observation y:
under H0 : µ = 0,
    m0(y) = ∫ (1/σ) g(y/σ) (1/σ²) dσ² = 2/|y|;
under H1 : µ ≠ 0 and for any proper prior of the form (1/σ) h(µ/σ),
    m1(y) = ∫∫ (1/σ) g((y − µ)/σ) (1/σ) h(µ/σ) (1/σ²) dµ dσ² = 2/|y|,
so that B01 = 1 for one observation, as should be the case. (Of course, this doesn't say that the prior for µ should be Cauchy.)
26. The robust prior and Bayesian t-test
• Computation of B01 for Jeffreys' choice of prior requires one-dimensional numerical integration. (Jeffreys gave a not-very-good numerical approximation.)
• An alternative is the 'robust prior' from Berger (1985) (a generalization of the Strawderman (1971) prior), to be discussed in later lectures.
  – This prior satisfies all the desiderata of Jeffreys;
  – has identical tails to, and varies little from, the Cauchy prior;
  – yields an exact expression for the Bayes factor (Pericchi and Berger):
    B01 = √(2/(n+1)) · [ ((n−2)/(n−1)) t² (1 + t²/(n−1))^{−n/2} ] / [ 1 − (1 + 2t²/(n²−1))^{−(n/2−1)} ],
    where t = √n x̄ / √(Σ(xi − x̄)²/(n − 1)) is the usual t-statistic. As t → 0, B01 → √(2(n+1)). For n = 2, this is to be interpreted as
    B01 = 2√2 t² / [√3 (1 + t²) log(1 + 2t²/3)].
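A small implementation of this Bayes factor (as reconstructed here; the exact constants should be treated as an assumption rather than the slide's verbatim formula). The small-t limit √(2(n+1)) grows with n, an Ockham's-razor effect that the code can check directly:

```python
import math

def robust_B01(t, n):
    # Exact Bayes factor for the Bayesian t-test under the 'robust prior',
    # as reconstructed (treat the normalization as an assumption);
    # valid for n > 2.
    num = ((n - 2) / (n - 1)) * t * t * (1 + t * t / (n - 1)) ** (-n / 2.0)
    den = 1 - (1 + 2 * t * t / (n * n - 1)) ** (-(n / 2.0 - 1))
    return math.sqrt(2.0 / (n + 1)) * num / den

# As t -> 0 the Bayes factor tends to sqrt(2(n+1)), increasingly
# favoring the null for larger n; for large t it tends to 0.
```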
27. Encompassing approach: inducing priors on hypotheses from a single prior
Sometimes, instead of separately assessing Pr(H0), Pr(H1), π0, π1, it is possible to start with an overall prior π(θ) and deduce them:
    Pr(H0) = ∫_{Θ0} π(θ) dθ   and   Pr(H1) = ∫_{Θ1} π(θ) dθ;
    π0(θ) = π(θ) 1_{Θ0}(θ) / Pr(H0)   and   π1(θ) = π(θ) 1_{Θ1}(θ) / Pr(H1).
Note: To be sensible, the induced Pr(H0), Pr(H1), π0, π1 must themselves be sensible.
Example: Intelligence testing:
• X | θ ∼ N(x | θ, 100), and we observe x = 115.
• To test 'below average' versus 'above average':
    H0 : θ ≤ 100 versus H1 : θ > 100.
28. • It is 'known' that θ ∼ N(θ | 100, 225).
• Induced prior probabilities of the hypotheses:
    Pr(H0) = Pr(θ ≤ 100) = 1/2 = Pr(H1).
• Induced densities under each hypothesis:
    π0(θ) = 2 N(θ | 100, 225) I_{(−∞,100)}(θ),
    π1(θ) = 2 N(θ | 100, 225) I_{(100,∞)}(θ).
• Of course, we would not have needed to formally derive these.
  – From the original encompassing prior π(θ), we can derive the posterior: θ | x = 115 ∼ N(110.39, 69.23).
  – Then directly compute the posterior probabilities:
    Pr(H0 | x = 115) = Pr(θ ≤ 100 | x = 115) = 0.106,
    Pr(H1 | x = 115) = Pr(θ > 100 | x = 115) = 0.894.
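The posterior N(110.39, 69.23) and the probabilities 0.106/0.894 follow from the standard normal-normal conjugate update; a minimal sketch:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Normal-normal conjugate update: prior N(100, 225), likelihood N(x | theta, 100).
prior_mean, prior_var = 100.0, 225.0
obs_var = 100.0
x = 115.0

post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)             # about 69.23
post_mean = post_var * (prior_mean / prior_var + x / obs_var)  # about 110.39

pr_H0 = norm_cdf((100.0 - post_mean) / math.sqrt(post_var))    # about 0.106
pr_H1 = 1.0 - pr_H0                                            # about 0.894
```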
29. Intrinsic priors
Discussion of these can be found in Berger and Pericchi (2001). One popular such prior, which applies to our testing problem, is the intrinsic prior, defined as follows:
• Let π^O(θ) be a good objective estimation prior (using a constant prior will almost always work fine), with resulting posterior distribution and marginal distribution for the data x given, respectively, by
    π^O(θ | x) = f(x | θ) π^O(θ) / m^O(x),   m^O(x) = ∫ f(x | θ) π^O(θ) dθ.
• Then the intrinsic prior (which will be proper) is
    π^I(θ) = ∫ π^O(θ | x*) f(x* | θ0) dx*,
with x* = (x*1, . . . , x*q) being imaginary data of the smallest sample size q such that m^O(x*) < ∞ (this is an imaginary bootstrap construction).
30. π^I(θ) is often available in closed form, but even if not, computation of the resulting Bayes factor is often a straightforward numerical exercise.
• The resulting Bayes factor is
    B01(x) = f(x | θ0) / ∫ f(x | θ) π^I(θ) dθ = f(x | θ0) / ∫ m^O(x | x*) f(x* | θ0) dx*.
Example (Higgs boson): Test H0 : θ = 0 versus H1 : θ > 0, based on Xi ∼ f(xi | θ) = (θ + b) exp{−(θ + b) xi}, where b is known.
• Suppose we choose π^O(θ) = 1/(θ + b) (the more natural square root is harder to work with).
• A minimal sample size for the resulting posterior to be proper is q = 1.
• Computation then yields π^I(θ) = ∫ π^O(θ | x*1) f(x*1 | 0) dx*1 = b/(θ + b)².
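The closed form b/(θ + b)² can be checked by direct numerical integration; the values b = 2 and θ = 1 below are illustrative assumptions:

```python
import math

def intrinsic_prior(theta, b, steps=50000, xmax=50.0):
    # pi^I(theta) = integral of pi^O(theta | x*) f(x* | 0) dx*, where for
    # this exponential model the posterior from one imaginary observation
    # is pi^O(theta | x*) = x* exp(-theta x*) (the prior 1/(theta+b)
    # cancels the (theta+b) factor of the likelihood), and
    # f(x* | 0) = b exp(-b x*).
    def integrand(x):
        return x * math.exp(-theta * x) * b * math.exp(-b * x)
    # Composite Simpson's rule on [0, xmax]; the tail beyond xmax is negligible.
    h = xmax / steps
    total = integrand(0.0) + integrand(xmax)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0

b, theta = 2.0, 1.0  # illustrative values (assumptions)
numeric = intrinsic_prior(theta, b)
closed_form = b / (theta + b) ** 2  # = 2/9
```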
32. Normal Example (used to illustrate the various "paradoxes"):
• Xi | θ i.i.d. ∼ N(xi | θ, σ²), σ² known.
• Test H0 : θ = 0 versus H1 : θ ≠ 0.
• Can reduce to the sufficient statistic x̄ ∼ N(x̄ | θ, σ²/n).
• Prior on H1: π1(θ) = N(θ | 0, v0²).
• Marginal likelihood under H1: m1(x̄) = N(x̄ | 0, v0² + σ²/n).
• Posterior probability:
    Pr(H0 | x̄) = [1 + (Pr(H1)/Pr(H0)) · N(x̄ | 0, v0² + σ²/n) / N(x̄ | 0, σ²/n)]^{−1}
               = [1 + (Pr(H1)/Pr(H0)) · exp{(z²/2)[1 + σ²/(n v0²)]^{−1}} / {1 + n v0²/σ²}^{1/2}]^{−1},
where z = |x̄|/(σ/√n) is the usual (frequentist) test statistic for this problem.
33. An Aside: Comparing Pr(H0 | x̄) with the p-value for various z and n, with v0² = σ² (the 'unit information' prior):

z      p-value  n = 5  n = 20  n = 100
1.645  0.1      0.44   0.56    0.72
1.960  0.05     0.33   0.42    0.60
2.576  0.01     0.13   0.16    0.27
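The table entries follow directly from the posterior-probability formula with v0² = σ² and equal prior probabilities; a minimal sketch:

```python
import math

def pr_H0(z, n, prior_odds=1.0, v0sq_over_sigmasq=1.0):
    # Posterior probability of H0 in the normal point-null problem with a
    # N(0, v0^2) prior on theta under H1; here v0^2 = sigma^2 (the 'unit
    # information' prior) and equal prior probabilities of the hypotheses.
    r = v0sq_over_sigmasq
    bayes_term = (math.exp(0.5 * z * z / (1.0 + 1.0 / (n * r)))
                  / math.sqrt(1.0 + n * r))
    return 1.0 / (1.0 + prior_odds * bayes_term)

# For z = 1.96 (p-value 0.05) and n = 20 this gives about 0.42,
# an order of magnitude away from the p-value.
```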
34. The Jeffreys-Lindley 'Paradox'
In the normal testing example, for fixed z and large n,
    Pr(H0 | x̄) = [1 + (Pr(H1)/Pr(H0)) · exp{(z²/2)[1 + σ²/(n v0²)]^{−1}} / {1 + n v0²/σ²}^{1/2}]^{−1}
              ≈ 1 − (Pr(H1)/Pr(H0)) · (σ/(√n v0)) · exp{z²/2} −→ 1 as n → ∞,
so that a classical test can strongly reject H0 (which happens when z is moderately large) and the Bayesian analysis can, at the same time, strongly support H0 (if n is so large that exp{z²/2}/√n is small, even though z is moderately large); reaching opposite conclusions is the 'paradox.'
• This is not a paradox in the true sense, since it is just mathematics.
• Robust Bayesian resolution: H0 : θ = 0 is just an approximation to H0 : |θ| < ε, where ε can reflect reality or just experimental bias. The approximation is only accurate when ε < σ/(4√n) (Berger and Delampady, 1987); thus, for very large n, it is not reasonable to use H0 : θ = 0 as the null hypothesis, so the 'paradox' becomes vacuous.
35. The Jeffreys-Lindley 'Paradox' and Experimental Bias
Suppose that H0 is truly precise (e.g., zero psychic effect, or 'no Higgs boson'), but that the experiment has some bias b ∼ N(b | 0, δ²). Then
    Pr(H0 | x̄) = [1 + (Pr(H1)/Pr(H0)) · exp{(z_b²/2)[1 + (δ² + σ²/n)/v0²]^{−1}} / {1 + v0²/(δ² + σ²/n)}^{1/2}]^{−1}
              ≈ 1 − (Pr(H1)/Pr(H0)) · (√(δ² + σ²/n)/v0) · exp{z_b²/2}
when δ² + σ²/n is small, and where z_b = |x̄|/√(δ² + σ²/n) can be thought of as standard normal under H0 in the presence of the bias. Then
    lim_{n→∞} Pr(H0 | x̄) = 1 − (Pr(H1)/Pr(H0)) · (δ/v0) · exp{x̄²/(2δ²)},
which does not go to 1. Also
    Pr(H0 | x̄) ≈ 1 − (Pr(H1)/Pr(H0)) · (δ/v0) · exp{z²σ²/(2nδ²)}
for the interesting range of z²σ²/[nδ²].
36. Experimental biases:
Figure 1: Historical record of values of some particle properties published over time, with quoted error bars (Particle Data Group).

37. Figure 2: Historical record of values of some particle properties published over time, with quoted error bars (Particle Data Group).
38. The Bartlett 'Paradox'
In the normal testing example, when the prior variance v0² is large (i.e., a vague proper prior is being used),
    Pr(H0 | x̄) = [1 + (Pr(H1)/Pr(H0)) · exp{(z²/2)[1 + σ²/(n v0²)]^{−1}} / {1 + n v0²/σ²}^{1/2}]^{−1}
              ≈ 1 − (Pr(H1)/Pr(H0)) · (σ/(√n v0)) · exp{z²/2},
so that, if v0² → ∞, then Pr(H0 | x̄) → 1; proper priors in testing cannot be 'arbitrarily flat' (as we saw earlier).
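Bartlett's phenomenon is easy to demonstrate with the posterior-probability formula; the data values (z = 2.5, n = 10) are illustrative assumptions:

```python
import math

def pr_H0(z, n, v0sq, sigmasq=1.0, prior_odds=1.0):
    # Posterior probability of H0 : theta = 0 with a N(0, v0^2) prior
    # under H1 (normal mean, known variance).
    bayes_term = (math.exp(0.5 * z * z / (1.0 + sigmasq / (n * v0sq)))
                  / math.sqrt(1.0 + n * v0sq / sigmasq))
    return 1.0 / (1.0 + prior_odds * bayes_term)

# For fixed data (z = 2.5, n = 10, illustrative assumptions), making the
# proper prior vaguer drives the posterior probability of H0 toward 1.
```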
39. Ockham's Razor
• Attributed to the thirteenth-century Franciscan monk William of Ockham (Occam in Latin):
  "Pluralitas non est ponenda sine necessitate."
  (Plurality must never be posited without necessity.)
  "Frustra fit per plura quod potest fieri per pauciora."
  (It is vain to do with more what can be done with fewer.)
• Preferring the simpler of two hypotheses to the more complex, when both agree with the data, is an old principle in science.
• Regard H0 as simpler than H1 if it makes sharper predictions about what data will be observed.
• Models are more complex if they have extra adjustable parameters that allow them to be tweaked to accommodate a wider variety of data.
  – "coin is fair" is a simpler model than "coin has unknown bias θ"
  – s = a + ut + (1/2)gt² is simpler than s = a + ut + (1/2)gt² + ct³
40. Example: Perihelion of Mercury (with Bill Jefferys)
In the 19th century it was known that there was an unexplained residual motion of Mercury's perihelion (the point in its orbit where the planet is closest to the Sun) in the amount of approximately 43 seconds of arc per century.
Various hypotheses:
• A planet 'Vulcan' close to the sun.
• A ring of matter around the sun.
• Oblateness of the sun.
• The law of gravity is not inverse square but inverse (2 + ε).
All these hypotheses had a parameter that could be adjusted to deal with whatever data on the motion of Mercury existed.
Data in 1920: X = 41.6, where X ∼ N(x | θ, 2²), θ being the perihelion advance of Mercury; the measurement standard deviation is 2.
41. Model Uncertainty SAMSI – Fall,2018
✬
✫
✩
✪
Prior (before data) for gravity model MG: πG(θ) = N(θ | 0, 502
).
• Symmetric about 0 (corresponding to inverse square law).
• Decreasing away from zero; normality is convenient.
• Initially, τ = 50, because a gravity effect which would yield θ > 100
would have had other observed effects.
• We will also consider utilization of classes of priors:
– The class of all N(θ | 0, τ²) priors, τ > 0.
– The class of all symmetric priors that are nonincreasing in |θ|.
General Relativity (1915) model ME: Predicted θE = 42.9, so no prior
is needed. (Thus this is a ‘simpler’ model.)
Bayes factor of ME to MG = 28.6, strongly favoring General Relativity,
even though the gravity model could fit the data better than General
Relativity. (The computation is an exercise.)
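The exercise can be sketched numerically (this sketch is not from the slides): under ME the density of the data is N(x | 42.9, 2²), while under MG the marginal density of X is N(x | 0, 2² + 50²), since X | θ ∼ N(θ, 2²) and θ ∼ N(0, 50²).

```python
# Bayes factor of M_E (General Relativity) to M_G (adjustable gravity law).
from math import sqrt
from statistics import NormalDist

x = 41.6         # observed perihelion advance (arcsec per century)
sigma = 2.0      # measurement standard deviation
theta_E = 42.9   # General Relativity's point prediction
tau = 50.0       # prior standard deviation under M_G

m_E = NormalDist(theta_E, sigma).pdf(x)                 # density under M_E
m_G = NormalDist(0.0, sqrt(sigma**2 + tau**2)).pdf(x)   # marginal under M_G

print(round(m_E / m_G, 1))   # Bayes factor of M_E to M_G: about 28.6
```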
Multiple Hypotheses and Sequential Testing
Two nice features of Bayesian testing:
1. Multiple hypotheses can be easily handled.
Example: In a paired difference experiment, Xi is the observed difference in
effect between a subject receiving Treatment 1 and the paired subject
receiving Treatment 2.
Suppose the Xi are i.i.d. from the N(xi | θ, 1) density, where
θ = (mean effect of Treatment 1) − (mean effect of Treatment 2).
Standard Testing Formulation:
H0 : θ = 0 (no difference in treatments)
Ha : θ ≠ 0 (a difference exists)
A More Revealing Formulation:
H0 : θ = 0 (no difference)
H1 : θ < 0 ( Treatment 2 is better)
H2 : θ > 0 ( Treatment 1 is better)
2. Interim or sequential analysis does not affect the Bayesian answer.
• In interim or sequential analysis, one periodically looks at the
accumulated data during a study, with the option of stopping the study
and drawing a conclusion at each look.
• In classical statistics, one must adjust the error probability used at each
look at the data, making each stage's threshold more stringent (since each
of the analysis stages increases the overall probability of an incorrect rejection).
– Example: In testing whether a normal mean is zero or not (variance
known), when one wants significance at the α = 0.05 level:
∗ If a fixed sample size of 20 is used, reject the null hypothesis if
√20 |x̄20| > 1.96.
∗ If one is going to first take 10 observations and see if rejection is
possible and, if not, take another 10 observations, then one rejects only
if √10 |x̄10| > 2.178 (first stage) or √20 |x̄20| > 2.178 (second stage).
– In the medical literature this is called “spending α for looks at the data.”
• Bayesians do not adjust for looks at the data (more generally called the
stopping rule principle).
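The classical inflation can be checked with a quick Monte Carlo sketch (illustrative, not from the slides): with two looks at the unadjusted 1.96 cutoff, the overall type I error rises to roughly 0.08, while the 2.178 cutoff restores roughly 0.05.

```python
# Monte Carlo sketch: type I error of a two-look test of H0: theta = 0
# with N(theta, 1) data, 10 observations per stage, simulated under H0.
import random

def two_look_error(cutoff, n_sims=100_000, seed=1):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0.0, 1.0) for _ in range(20)]
        z10 = sum(xs[:10]) / 10**0.5   # z-statistic at the first look
        z20 = sum(xs) / 20**0.5        # z-statistic at the second look
        if abs(z10) > cutoff or abs(z20) > cutoff:
            rejections += 1
    return rejections / n_sims

print(two_look_error(1.96))    # unadjusted cutoff: roughly 0.08, not 0.05
print(two_look_error(2.178))   # adjusted cutoff: roughly 0.05
```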
The three-hypothesis testing example, done in a fully sequential
Bayesian way.
• Thus, after each observation is taken, the current posterior probability
of each hypothesis is calculated.
Prior Distribution:
• Assign H0 and Ha prior probabilities of 1/2 each
• On Ha, assign θ the “default” Normal(0,2) distribution (so that
Pr(H1) = Pr(H2) = 1/4)
(The N(0, 2) prior is just for illustration: Berger and Mortera, 1999, use a
much better intrinsic prior here; see also Barbieri and Liseo, 2006).
Posterior Distribution:
After observing x = (x1, x2, . . . , xn), compute the posterior probabilities of
the various hypotheses, i.e.
Pr(H0 | x), Pr(H1 | x), Pr(H2 | x),
and Pr(Ha | x) = Pr(H1 | x) + Pr(H2 | x).
Writing the normal density of the Xi as f(xi | θ, 1), the Bayes factor of H0
to Ha is

Bn = [ ∏_{i=1}^n f(xi | 0, 1) ] / [ ∫ π(θ) ∏_{i=1}^n f(xi | θ, 1) dθ ]
   = √(1 + 2n) e^{−n x̄n² / (2 + 1/n)} .

The posterior probabilities can then be computed to be (where Φ denotes
the standard normal cdf)

Pr(H0 | x) = Bn / (1 + Bn)

Pr(H1 | x) = Φ( −√n x̄n / √(1 + 1/(2n)) ) / (1 + Bn)

Pr(H2 | x) = 1 − Pr(H0 | x) − Pr(H1 | x)
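As an illustrative sketch (with a made-up data vector, not from the slides), the Bayes factor and the three posterior probabilities can be evaluated directly:

```python
# Posterior probabilities of H0: theta = 0, H1: theta < 0, H2: theta > 0
# for N(theta, 1) data, with Pr(H0) = 1/2 and theta ~ N(0, 2) on Ha.
from math import sqrt, exp
from statistics import NormalDist

def posterior_probs(xs):
    n = len(xs)
    xbar = sum(xs) / n
    # Bayes factor of H0 to Ha: B_n = sqrt(1 + 2n) exp(-n xbar^2 / (2 + 1/n))
    B = sqrt(1 + 2 * n) * exp(-n * xbar**2 / (2 + 1 / n))
    Phi = NormalDist().cdf
    p0 = B / (1 + B)
    p1 = Phi(-sqrt(n) * xbar / sqrt(1 + 1 / (2 * n))) / (1 + B)
    p2 = 1 - p0 - p1
    return p0, p1, p2

# Hypothetical paired differences, mildly favoring Treatment 1 (theta > 0):
p0, p1, p2 = posterior_probs([0.7, 1.2, -0.3, 0.9, 1.5])
print(round(p0, 3), round(p1, 3), round(p2, 3))
```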
Comments:
(i) Neither the multiple hypotheses nor the sequential aspect caused
difficulties. There is no penalty (e.g. ‘spending α’) for looks at the data.
(ii) Quantification of the support for H0 : θ = 0 is direct: at the 3rd
observation, Pr(H0 | x) = .453; at the end, Pr(H0 | x) = .082.
(iii) H1 can be ruled out almost immediately.
(iv) For testing H0 : θ = 0 versus Ha : θ ≠ 0, the posterior probabilities are
also what are called conditional frequentist error probabilities, as will
be seen in Lecture 2.
– Thus frequentists also don’t have to adjust for looks at the data, if
they use the correct frequentist procedures.
There are two consequences of this result:
1. Use of the Bayes factor gives experimenters the freedom to employ
optional stopping without penalty.
2. There is no harm if ‘undisclosed optional stopping’ is used (common in
some areas of psychology), as long as the Bayes factor is used to assess
significance. In particular, it is a consequence that an experimenter
cannot fool someone through use of undisclosed optional stopping.
The Philosophical Puzzle: How can there be no penalty for interim analysis?
• Bayesian analysis is just probability theory and so cannot be wrong
foundationally.
• The ‘statistician’s client with a grant application example.’
But it is difficult; as Savage (1961) said “When I first heard the stopping rule
principle from Barnard in the early 50’s, I thought it was scandalous that anyone
in the profession could espouse a principle so obviously wrong, even as today I
find it scandalous that anyone could deny a principle so obviously right.”
The reason the Bayes factor does not depend on the stopping rule:
Optional stopping alters the data density to be

τN(x1, x2, . . . , xN) ∏_{i=1}^N f(xi | θ, 1),

where N is the (random) time at which one stops taking data and
τN(x1, x2, . . . , xN) gives the probability (often 0 or 1) of stopping sampling.
Then

BN = [ τN(x1, . . . , xN) ∏_{i=1}^N f(xi | 0, 1) ] / [ ∫ π(θ) τN(x1, . . . , xN) ∏_{i=1}^N f(xi | θ, 1) dθ ]
   = [ ∏_{i=1}^N f(xi | 0, 1) ] / [ ∫ π(θ) ∏_{i=1}^N f(xi | θ, 1) dθ ],

since τN does not depend on θ and so cancels from numerator and denominator.
Psychokinesis Example
Do people have the ability to perform psychokinesis, affecting objects with
thoughts?
The experiment:
Schmidt, Jahn and Radin (1987) used electronic and
quantum-mechanical random event generators with visual
feedback; the subject with alleged psychokinetic ability tries to
“influence” the generator.
[Diagram: a stream of particles passes through a quantum gate, and each
particle goes to either a red light or a green light. Quantum mechanics
implies the particles are 50/50 to go to each light; the subject tries to
make the particles go to the red light.]
Data and model:
• Each “particle” is a Bernoulli trial (red = 1, green = 0)
θ = probability of “1”
n = 104, 490, 000 trials
X = # “successes” (# of 1’s), X ∼ Binomial(n, θ)
x = 52, 263, 470 is the actual observation
To test H0 : θ = 1/2 (subject has no influence)
versus H1 : θ ≠ 1/2 (subject has influence)
• P-value = P_{θ=1/2}( |X − n/2| ≥ |x − n/2| ) ≈ .0003.
Is there strong evidence against H0 (i.e., strong evidence that the
subject influences the particles) ?
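The p-value can be checked with a short sketch (not from the slides), using the normal approximation to the binomial, which is essentially exact at this n:

```python
# Two-sided p-value for H0: theta = 1/2 via the normal approximation
# to Binomial(n, 1/2); at n of order 10^8 the approximation is essentially exact.
from math import sqrt
from statistics import NormalDist

n = 104_490_000
x = 52_263_470

z = (x - n / 2) / sqrt(n * 0.25)            # standardized deviation from n/2
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 4))       # z about 3.61, p-value about .0003
```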
Bayesian Analysis: (Jefferys, 1990)
Prior distribution:
Pr(Hi) = prior probability that Hi is true, i = 0, 1;
On H1 : θ ≠ 1/2, let π(θ) be the prior density for θ.
Subjective Bayes: choose the Pr(Hi) and π(θ) based on personal beliefs
Objective (or default) Bayes: choose
Pr(H0) = Pr(H1) = 1/2,
π(θ) = 1 (on 0 < θ < 1).
Posterior probability of hypotheses:

Pr(H0 | x) = f(x | θ = 1/2) Pr(H0) / [ Pr(H0) f(x | θ = 1/2) + Pr(H1) ∫ f(x | θ) π(θ) dθ ]

For the objective prior,

B01 = (likelihood of observed data under H0) / (‘average’ likelihood of observed data under H1)
    = f(x | θ = 1/2) / ∫₀¹ f(x | θ) π(θ) dθ ≈ 12

Pr(H0 | x = 52,263,470) ≈ 0.92 (recall, p-value ≈ .0003)

Posterior density on H1 : θ ≠ 1/2 is

π(θ | x, H1) ∝ π(θ) f(x | θ) ∝ 1 × θ^x (1 − θ)^{n−x},

the Be(θ | 52,263,471, 52,226,531) density.
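The ≈ 12 can be verified with exact log-binomial arithmetic (an illustrative check, not from the slides): for the uniform prior, ∫₀¹ f(x | θ) dθ = 1/(n + 1), so B01 = (n + 1) · C(n, x) · 2^{−n}.

```python
# B01 = f(x | 1/2) / (1/(n+1)) = (n+1) * C(n, x) * 2^(-n), computed in logs
# to avoid overflow in the binomial coefficient.
from math import lgamma, log, exp

n = 104_490_000
x = 52_263_470

log_binom = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
log_B01 = log(n + 1) + log_binom - n * log(2)
B01 = exp(log_B01)

print(round(B01, 1))                # about 12
print(round(B01 / (1 + B01), 2))    # Pr(H0 | x) about 0.92
```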
Choice of the prior density or weight function, π, on { θ : θ ≠ 1/2 }:

Consider πr(θ) = U(θ | 1/2 − r, 1/2 + r), the uniform density on (1/2 − r, 1/2 + r).

Subjective interpretation: r is the largest change in success probability
that you would expect, given that ESP exists. And you give equal
probability to all θ in the interval (1/2 − r, 1/2 + r).

Resulting Bayes factor (letting FBe(· | a, b) denote the CDF of the
Beta(a, b) distribution):

B(r) = f(x | 1/2) / ∫₀¹ f(x | θ) πr(θ) dθ
     = [ (n choose x) (n + 1) r / 2^{n−1} ] [FB2 − FB1]^{−1},

where FB2 = FBe(1/2 + r | x + 1, n − x + 1) and
FB1 = FBe(1/2 − r | x + 1, n − x + 1).

For example, B(0.25) ≈ 6.
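A sketch of the computation (not from the slides): the binomial coefficient is handled in log space, and the Beta(x+1, n−x+1) CDF is replaced by a matching normal CDF, an approximation of my own that is essentially exact at this n.

```python
# B(r) = (n choose x)(n + 1) r 2^(1-n) / [F_Be(1/2 + r) - F_Be(1/2 - r)],
# with the Beta(x+1, n-x+1) CDF approximated by a normal CDF with the
# same mean and variance (an assumption; essentially exact for n ~ 10^8).
from math import lgamma, log, exp, sqrt
from statistics import NormalDist

n = 104_490_000
x = 52_263_470

def bayes_factor(r):
    a, b = x + 1, n - x + 1
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    F = NormalDist(mean, sd).cdf          # normal stand-in for the Beta CDF
    mass = F(0.5 + r) - F(0.5 - r)
    log_num = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + log(n + 1) + log(r) + (1 - n) * log(2))
    return exp(log_num) / mass

print(round(bayes_factor(0.25), 1))       # about 6
print(round(1 / bayes_factor(0.00024)))   # roughly 158 (the slides' B = 1/158)
```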
[Figure: plot of the Bayes factor B(r) against r, for r between 0.00016 and
0.00028; B(r) ranges over roughly 0.007 to 0.010.]
r = largest increase in success probability that would be expected, given
ESP exists.

The minimum value of B(r) is 1/158, attained at r = .00024.

Conclusion: Although the p-value is small (.0003), for typical prior beliefs
the data would provide evidence for the simpler model H0 : no ESP.
Only if one believed a priori that |θ − 1/2| ≤ .0021 would the evidence
for H1 be at least 20 to 1.