CHAPTER 14: Frequentist Hypothesis Testing:
a Coherent Account
Aris Spanos [Fall 2013]
1 Inherent difficulties in learning statistical testing
Statistical testing is arguably the most important, but also the most difficult and confusing, chapter of statistical inference, for several reasons, including the following.
(i) The need to introduce numerous new notions, concepts and procedures before one can paint, even in broad brush strokes, a coherent picture of hypothesis testing.
(ii) The current textbook discussion of statistical testing is both
highly confusing and confused. There are several sources of confusion.
¥ (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.
¥ (b) Inadequate knowledge by textbook writers, who often do not have the technical skills to read and understand the original sources, and have to rely on second-hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot's guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student's t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.
¥ (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the decision-theoretic perspective has much greater affinity with Bayesian than with frequentist inference.
¥ (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian statisticians who revel in (unfairly) maligning frequentist inference in their misguided attempt to motivate their preferred viewpoint of statistical inference. You will often hear such Bayesians tell you that "what you really want (require) is probabilities attached to hypotheses" and other such misadvised promptings! In frequentist inference probabilities are always attached to the different possible values of the sample x∈R^n_X; never to hypotheses!
(iii) The discussion of frequentist testing is rather incomplete in so
far as it has been beleaguered by serious foundational problems since
the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the 'corrections' of those chastising the initial 'correctors'!
In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes 'red flag' pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.
2 Francis Edgeworth
A typical example of a testing procedure at the end of the 19th century is provided by Edgeworth (1885). From today's perspective his testing takes the form of viewing data x0:=(x11, x12, ..., x1n; x21, x22, ..., x2n) in the context of the following bivariate simple Normal model:

Xt:=(X1t, X2t)' ~ NIID((μ1, μ2)', diag(σ1², σ2²)), t=1, 2, ..., n,

i.e. E(X1t)=μ1, E(X2t)=μ2, Var(X1t)=σ1², Var(X2t)=σ2², Cov(X1t, X2t)=0.
The hypothesis of interest concerns the equality of the two means:

Hypothesis of interest: μ1 = μ2.

The 2n-dimensional sample is X:=(X11, X12, ..., X1n; X21, X22, ..., X2n).
Common sense, combined with the statistical knowledge at the time, suggested using the difference between the estimated means:

μ̂1 − μ̂2, where μ̂1 = (1/n)Σ_{t=1}^n X1t, μ̂2 = (1/n)Σ_{t=1}^n X2t,

as a basis for deciding whether the two means are different or not. To render the difference (μ̂1 − μ̂2) free of the units of measurement, Edgeworth divided it by its standard deviation:

sd(μ̂1 − μ̂2) = √(σ̂1²/n + σ̂2²/n), where σ̂1² = (1/n)Σ_{t=1}^n (X1t − μ̂1)², σ̂2² = (1/n)Σ_{t=1}^n (X2t − μ̂2)²,
to define the distance function:

τ(X) = |μ̂1 − μ̂2| / √(σ̂1²/n + σ̂2²/n).
The last issue he needed to address is how big the observed distance τ(x0) should be to justify inferring that the two means are different. Edgeworth argued that (μ̂1 − μ̂2) ≠ 0 could not be 'accidental' (due to chance) if:

τ(X) = |μ̂1 − μ̂2| / √(σ̂1²/n + σ̂2²/n) > 2√2.   (1)
¥ Where did the threshold 2√2 come from? The tail area of N(0, 1) beyond ±2√2 is approximately .005, which was viewed as a reasonable value for the probability of a 'chance' error, i.e. the error of inferring a significant discrepancy when none exists. But why the Normal? At the time statistical inference relied heavily on large sample size (asymptotic) results invoking the Central Limit Theorem.
In summary, Edgeworth introduced three features of statistical testing:
(i) a hypothesis of interest: μ1 = μ2,
(ii) the notion of a standardized distance: τ(X), and
(iii) a threshold value for significance: τ(X) > 2√2.
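Edgeworth's three features translate directly into code. The following sketch (Python; the two small data samples are purely illustrative and do not come from the original) computes the standardized distance τ(X) and compares it with the threshold 2√2:

```python
import math

def edgeworth_distance(x1, x2):
    """Edgeworth's standardized distance between two sample means."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    # MLE variance estimates (divide by n), matching the formulas above
    v1 = sum((x - m1) ** 2 for x in x1) / n
    v2 = sum((x - m2) ** 2 for x in x2) / n
    return abs(m1 - m2) / math.sqrt(v1 / n + v2 / n)

THRESHOLD = 2 * math.sqrt(2)   # Edgeworth's significance threshold, about 2.83

# Hypothetical data: two clearly separated samples vs. two identical ones
far_apart = edgeworth_distance([0, 0, 1, 1], [10, 10, 11, 11])
no_gap = edgeworth_distance([1, 2, 3], [1, 2, 3])
```

With the first pair the distance lands far above 2√2 (the difference is not 'accidental'), while for identical samples it is exactly zero.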
3 Karl Pearson
The Pearson approach to statistics can be summarized as follows:

Data x0:=(x1, ..., xn) =⇒ Histogram =⇒ Fitting a frequency curve from the Pearson family f(x; θ̂1, θ̂2, θ̂3, θ̂4).
The Pearson family of frequency curves can be expressed in terms of the following differential equation:

d ln f(x)/dx = (x − θ1)/(θ2 + θ3x + θ4x²).   (2)

Depending on the values taken by the parameters (θ1, θ2, θ3, θ4), this equation can generate numerous frequency curves, such as the Normal, the Student's t, the Beta, the Gamma, the Laplace, the Pareto, etc.
The first four raw data moments, μ̂'1, μ̂'2, μ̂'3 and μ̂'4, are sufficient to determine (θ̂1, θ̂2, θ̂3, θ̂4), and that enables one to select the particular f(x) from the Pearson family.
Having selected a particular frequency curve f0(x) on the basis of f(x; θ̂1, θ̂2, θ̂3, θ̂4) = f(x; θ̂), Karl Pearson proposed to assess the appropriateness of this choice. His hypothesis of interest is of the form:

H: f0(x) = f*(x), f*(x)∈Pearson(θ1, θ2, θ3, θ4), where f*(x) denotes the true density.
Pearson proposed the standardized distance function:

η(X) = Σ_{i=1}^m n[p̂(xi) − p(xi)]²/p(xi) ≈ χ²(m−1),   (3)

where p̂(xi) (i=1, 2, ..., m) and p(xi) (i=1, 2, ..., m) denote the empirical and the assumed (as specified by f0(x)) relative frequencies, and introduced the notion of a p-value:

P(η(X) > η(x0)) = p(x0)   (4)

as a basis for inferring whether the choice of f0(x) was appropriate or not. The smaller the p(x0), the worse the fit.
¥ It is important to note that this hypothesis of interest, f0(x) = f*(x), is not about a particular parameter, but about the adequacy of the choice of the probability model. In this sense, this was the first Mis-Specification (M-S) test for a distributional assumption!
Example. Mendel’s cross-breeding experiments (1865-6) were based
on pea-plants with diﬀerent shape and color, which can be framed in
terms of two Bernoulli random variables:
R
W
z }| {
z }| {
(round)=0 (wrinkled)=1
Y
G
z }| {
z }| {
(yellow)=0 (green)=1
His theory of heredity was based on two assumptions:
(i) the two random variables and are independent,
(ii) round (=0) and yellow ( =0) are dominant traits, but ‘winkled’ (=1) and ‘green’ ( =1) are recessive traits with probabilities:
P(=0)=P( =0) = 75 P(=1)=P( =1) = 25
This substantive theory gives rise to Mendel's model based on the bivariate distribution below:

Mendel's theory model f(x, y):
          y=0     y=1   | fx(x)
  x=0    .5625   .1875  | .750
  x=1    .1875   .0625  | .250
  fy(y)  .750    .250   | 1.00

Data. The 4 gene-pair experiments Mendel carried out with n=556 gave rise to the observed frequencies:

  (R,Y)=(0,0): 315, (R,G)=(0,1): 108, (W,Y)=(1,0): 101, (W,G)=(1,1): 32,

and the corresponding observed relative frequencies:

f̂(x, y):
          y=0     y=1   | f̂x(x)
  x=0    .5666   .1942  | .7608
  x=1    .1817   .0576  | .2393
  f̂y(y)  .7483   .2518  | 1.001
How adequate is Mendel’s theory in light of the data? One can answer
that question using Pearson’s goodness-of-ﬁt chi-square test.
X (( )−( ))2
The chi-square test statistic (X)=
compares
( )
=1
how close are the observed to the expected relative frequencies.
Using the above data this test statistic yields:
³
´
(5666−5625)2
(1942−1875)2
(1817−1875)2
(0576−0625)2
(X)=556
+ 1875 + 1875 + 0625
=463
5625
In light of the fact that the tail area of 2 (3) yields: P((X) 0470)= 927
suggests accordance with Mendel’s theory.
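The goodness-of-fit computation can be replicated directly from Mendel's raw counts. A minimal sketch (Python; the closed-form expression for the χ²(3) tail area is a standard identity, not taken from the text):

```python
import math

observed = [315, 108, 101, 32]            # Mendel's counts: (R,Y), (R,G), (W,Y), (W,G)
theory = [.5625, .1875, .1875, .0625]     # cell probabilities under Mendel's model
n = sum(observed)                         # 556

# Pearson's statistic: eta = sum_i n * (phat_i - p_i)^2 / p_i
eta = sum(n * (o / n - p) ** 2 / p for o, p in zip(observed, theory))

def chi2_sf_3(x):
    """Tail area P(X > x) for X ~ chi-square(3):
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_3(eta)   # large p-value: accordance with Mendel's theory
```

Computed from the raw counts the statistic is .470; the text's .463 arises from rounding the relative frequencies to four decimals.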
Karl Pearson’s primary contributions to testing were:
(a) the broadening of the scope of the hypothesis of interest,
initiating misspeciﬁcation testing with a goodness-of-ﬁt test,
(b) a distance function whose distribution is known
(at least asymptotically), and
(c) the use of the tail probability as a basis of deciding how good
the ﬁt with data x0 is.
4 William Gosset (aka 'Student')
Gosset’s 1908 seminal paper provided the cornerstone upon which Fisher
founded modern statistical inference. At that time it was known that in
the case when (1 2 ) are NIID (the simple Normal model),
P
1
the MLE estimator := = =1 had the following ‘sampling’
b
distribution:
³ 2´
√
v N ⇒ ( −) v N (0 1)
It was also known that in the case where 2 is replaced by its estimator:
P
1
2 = −1 =1 ( −ˆ )2
√
( −)
the distribution of
but asymptotically it can
√
?
( −)
is unknown, i.e.
vD (0 1)
√
( −)
v N (0 1)
be approximated by:
?
Gosset (1908) derived the unknown D () showing that for any 1:
√
( −)
v St(−1)
(5)
where St(n−1) denotes a Student's t distribution with (n−1) degrees of freedom. The subtle question that needs to be answered is: what does the result mean in light of the fact that μ is unknown?
¥ This was the first finite sample result [a distributional result that is valid for any n > 1], and it inspired R. A. Fisher to pioneer modern frequentist inference.
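A small Monte Carlo sketch can corroborate Gosset's result (5) (Python; the design, n = 10 with 20,000 replications, is chosen only for illustration). The studentized quantity should behave like St(9), whose variance is 9/7 ≈ 1.29, noticeably larger than the N(0, 1) value of 1:

```python
import math
import random

def studentized(sample, mu):
    """Gosset's quantity sqrt(n)*(xbar - mu)/s, with the unbiased s^2."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return math.sqrt(n) * (xbar - mu) / math.sqrt(s2)

random.seed(0)
mu, sigma, n, reps = 0.0, 1.0, 10, 20000

# Generate the statistic repeatedly under NIID(mu, sigma^2); by (5) it is St(9)
draws = [studentized([random.gauss(mu, sigma) for _ in range(n)], mu)
         for _ in range(reps)]

# Sample variance of the draws (the mean is 0 by symmetry): close to 9/7, not 1
var = sum(t * t for t in draws) / reps
```

The heavier-than-Normal tails are exactly what the asymptotic N(0, 1) approximation misses for small n.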
5 R. A. Fisher's significance testing
The result (5) was formally proved and extended by Fisher (1915), and used subsequently as a basis for several tests of hypotheses associated with a number of different statistical models in a series of papers.
A key element in Fisher's framework was the introduction of the notion of a statistical model, whose generic form is:

Mθ(x) = {f(x; θ), θ∈Θ}, x∈R^n_X,   (6)

where f(x; θ), x∈R^n_X, denotes the (joint) distribution of the sample X:=(X1, ..., Xn) that encapsulates the prespecified probabilistic structure of the underlying stochastic process {Xt, t∈N}. The link to the phenomenon of interest comes from viewing data x0:=(x1, x2, ..., xn) as a 'truly typical' realization of the process {Xt, t∈N}.
Fisher used the result (5) to construct a test of significance for the:

null hypothesis: H0: μ = μ0   (7)

in the context of the simple Normal model (table 1).
Table 1 - The simple Normal model
Statistical GM: Xt = μ + ut, t∈N
[1] Normal: Xt ~ N(·,·)
[2] Constant mean: E(Xt) = μ for all t∈N
[3] Constant variance: Var(Xt) = σ² for all t∈N
[4] Independence: {Xt, t∈N} is an independent process.
¥ Fisher must have realized that the result in (5) by Gosset could not be used directly as a basis for a test because it involved the unknown parameter μ. One can only speculate, but it must have dawned on him that the result can be rendered operational if it is interpreted in terms of factual reasoning, in the sense that it holds under the True State of Nature (TSN) (μ=μ*, the true value). Note that, in general, the expression 'μ* denotes the true value of μ' is a shorthand for saying that 'data x0 constitute a realization of the sample X with distribution f(x; μ*)'. That is, (5) is a pivot (pivotal quantity):

τ(X; μ) = √n(X̄n − μ*)/s ~ St(n−1) under TSN,   (8)

in the sense that it is a function of both X and μ whose distribution under TSN is known.
Fisher (1956), p. 47, was very explicit about the nature of the reasoning underlying his significance testing:
"In general, tests of significance are based on hypothetical probabilities calculated from their null hypotheses." [emphasis in the original]
His problem was to adapt the result in (8) to hypothetical reasoning with a view to constructing a test. Fisher accomplished that in three steps.
Step 1. He replaced the unknown μ* in (8) with the known null value μ0, transforming the pivot τ(X; μ) into a statistic:

τ(X) = √n(X̄n − μ0)/s.

A statistic is a function of the sample X:=(X1, X2, ..., Xn) only.
Step 2. He adapted the factual reasoning underlying the pivot (8) into hypothetical reasoning by evaluating the test statistic under H0:

τ(X) = √n(X̄n − μ0)/s ~ St(n−1) under H0.   (9)

Step 3. He employed (9) to define the p-value:

P(τ(X) > τ(x0); H0) = p(x0)   (10)

as an indicator of disagreement (inconsistency, contradiction) between data x0 and H0; the bigger the value of τ(x0), the smaller the p-value.
£ It is crucial to note that it is a serious mistake, committed often by misguided authors, to interpret the probabilistic statement defining the p-value as conditional on the null hypothesis H0. Avoid the highly misleading notation:

P(τ(X) > τ(x0) | H0) = p(x0),  ×

which uses the vertical line (|) instead of a semi-colon (;). Conditioning on H0 makes no sense in frequentist inference since θ is not a random variable; the evaluation in (10) is made under the scenario that H0: μ=μ0 is true.
Although it is impossible to pin down Fisher's interpretation of the p-value, because his views changed over time and he held several interpretations at any one time, there is one fixed point among his numerous articulations (Fisher, 1925, 1955):
'a small p-value can be interpreted as a simple logical disjunction: either an extremely rare event has occurred or H0 is not true'.
The focus on "a small p-value" needs to be read in conjunction with Fisher's falsificationist stance about testing, in the sense that significance tests can falsify but never verify hypotheses (Fisher, 1955):
"... tests of significance, when used accurately, are capable of rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but ... they are never capable of establishing them as certainly true."
His logical disjunction combined with the falsificationist stance is not helpful in shedding light on the key issue of learning from data: what do data x0 convey about the truth or falsity of H0?
Occasionally, Fisher would go further and express what sounds more like an aspiration for an evidential interpretation:
"The actual value of p ... indicates the strength of evidence against the hypothesis" (see Fisher, 1925, p. 80),
but he never articulated a coherent evidential interpretation based on the p-value. Indeed, as late as 1956 he offered a 'degree of rational disbelief' interpretation (Fisher, 1956, pp. 46-47).
¥ A more pertinent interpretation of the p-value (section 8) is:
'the p-value is the probability, evaluated under the scenario that H0 is true, of all possible outcomes x∈R^n_X that accord less well with H0 than x0 does.'
In light of the fact that data x0 were generated by the 'true' (θ=θ*) statistical Data Generating Mechanism (DGM):

M*(x) = {f(x; θ*)}, x∈R^n_X,

and the test statistic τ(X) = √n(X̄n − μ0)/s measures the standardized difference between μ* and μ0, the p-value P(τ(X) > τ(x0); H0) = p(x0) seeks to measure the discordance between H0 and M*(x).
This precludes several erroneous interpretations of the p-value:
£ The p-value is the probability that H0 is true.
£ 1 − p(x0) is the probability that H1 is true.
£ The p-value is the probability that the particular result is due to 'chance error'.
£ The p-value is an indicator of the size or the substantive importance of the observed effect.
£ The p-value is the conditional probability of obtaining a test statistic τ(x) more extreme than τ(x0) given H0; there is no conditioning involved.
Example 1. Consider testing the null hypothesis H0: μ = 70 in the context of the simple Normal model (see table 1), using the exam score data in table 1.6 (see chapter 1), where x̄n = 71.686, s² = 13.606 and n = 70. The test statistic (9) yields:

τ(x0) = √70(71.686−70)/√13.606 = 3.824, P(τ(X) > 3.824; H0: μ=70) = .00007,

where p(x0) = .00007 is found from the St(69) tables. The tiny p-value suggests that data x0 indicate strong discordance with H0.
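The arithmetic of this example can be checked with a few lines (Python; only the reported summary statistics are used, and the p-value is still read off the St(69) tables as in the text):

```python
import math

# Exam score data (table 1.6): xbar = 71.686, s^2 = 13.606, n = 70
xbar, s2, n, mu0 = 71.686, 13.606, 70, 70.0

# Fisher's test statistic (9): tau = sqrt(n)*(xbar - mu0)/s
tau = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)   # about 3.824
```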
Table 2 - The simple Bernoulli model
Statistical GM: Xt = θ + ut, t∈N
[1] Bernoulli: Xt ~ Ber(·,·)
[2] Constant mean: E(Xt) = θ for all t∈N
[3] Constant variance: Var(Xt) = θ(1−θ) for all t∈N
[4] Independence: {Xt, t∈N} is an independent process.
Example 2. Arbuthnot’s 1710 conjecture: the ratio of males
to females in newborns might not be ‘fair’. This can be tested using
the statistical null hypothesis:
0 : = 0
where 0 =5 denotes ‘fair’
in the context of the simple Bernoulli model (table 2), based on the
random variable deﬁned by: {male}={=1} {female}={=0}
P
1
Using the MLE of ˆ := = =1 whose sampling distribu
tion:
³
´
(1−)
v Bin ;
one can derive the test statistic:
√
0
(
(X)= √ −0 ) v Bin (0 1; )
(11)
0 (1−0 )
Data: =30762 newborns during the period 1993-5 in Cyprus, out
of which 16029 were boys and 14833 girls.
The test statistic takes the form (11) and ˆ = 16029 =521
30762
(x0 )=
√
30762(521−5)
√
5(5)
=7366 P((X) 7366; =5)=00000017
The tiny -value indicates strong discourdance with 0 . In general one
needs to specify a certain threshold for deciding how small the p-value is
‘small enough’ to falsify 0 . Fisher suggested several such thresholds,
01 025 05, 1, but insisted that the choice has to be made on a case
by case basis.
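The same check for Arbuthnot's data (Python): note that rounding θ̂n to .521, as in the text, gives τ(x0) = 7.366, while carrying the unrounded ratio 16029/30762 gives about 7.39:

```python
import math

n, boys, theta0 = 30762, 16029, 0.5
theta_hat = boys / n   # about .521

# Test statistic (11): tau = sqrt(n)*(theta_hat - theta0)/sqrt(theta0*(1 - theta0))
tau = math.sqrt(n) * (theta_hat - theta0) / math.sqrt(theta0 * (1 - theta0))
```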
The main elements of a Fisher significance test τ:={τ(X), p(x0)} are the following.

Components of a Fisher significance test:
(i) a prespecified statistical model: Mθ(x),
(ii) a null hypothesis (H0: μ=μ0),
(iii) a test statistic (distance function) τ(X),
(iv) the distribution of τ(X) under H0,
(v) the p-value P(τ(X) > τ(x0); H0) = p(x0),
(vi) a threshold value p0 [e.g. .01, .025, .05] such that:
p(x0) ≤ p0 ⇒ x0 falsifies (rejects) H0.
Example. Consider a Fisher test based on the simple (one parameter) Normal model (table 3).

Table 3 - The simple (one parameter) Normal model
Statistical GM: Xt = μ + ut, t∈N
[1] Normal: Xt ~ N(·,·)
[2] Constant mean: E(Xt) = μ for all t∈N
[3] Constant variance: Var(Xt) = σ², known, for all t∈N
[4] Independence: {Xt, t∈N} is an independent process.

(i) Mθ(x): Xt ~ N(μ, σ²) [σ² is known], t = 1, 2, ..., n,
(ii) null hypothesis: H0: μ = μ0 (e.g. μ0 = 0),
(iii) a test statistic: τ(X) = √n(X̄n − μ0)/σ,
(iv) the distribution of τ(X) under H0: τ(X) ~ N(0, 1),
(v) the p-value: P(τ(X) > τ(x0); H0) = p(x0),
(vi) a threshold value for discordance, say p0 = .05.
[Two N(0, 1) density plots marking the tail areas P(Z < −1.64) = .05 and P(Z > 1.96) = .025.]
6 The Neyman-Pearson (N-P) framework
Neyman and Pearson (1933) motivated their own approach to testing as an attempt to improve upon Fisher's testing by addressing what they considered to be ad hoc features of that approach:
[a] Fisher's ad hoc choice of a test statistic τ(X), pertaining to the optimality of tests,
[b] Fisher's use of a post-data [τ(x0) is known] threshold for the p-value to indicate falsification (rejection) of H0, and
[c] Fisher's denial that p(x0) can indicate confirmation (accordance) of H0.
Neyman-Pearson (N-P) proposed solutions for [a]-[c] by:
¥ [i] Introducing the notion of an alternative hypothesis H1 as the complement to the null with respect to Θ, within the same statistical model:

Mθ(x) = {f(x; θ), θ∈Θ}, x∈R^n_X.   (12)

[ii] Replacing the post-data p-value and its threshold with a pre-data [before τ(x0) is available] significance level α determining the decision rules:

(i) Reject H0, (ii) Accept H0,

which are calibrated using the error probabilities:

type I: P(Reject H0 when H0 is true) = α,
type II: P(Accept H0 when H0 is false) = β.

These error probabilities enabled Neyman and Pearson to introduce the key notion of an optimal N-P test: one that minimizes β subject to a fixed α, i.e.
[iii] Selecting the particular τ(X), in conjunction with a rejection region C1(α) for a given α, that has the highest pre-data capacity (1 − β) to detect discrepancies in the direction of H1.
In this sense, the N-P proposed solutions to the perceived weaknesses of Fisher's significance testing changed the original Fisher framework in four important respects:
¥ (a) N-P testing takes place within a prespecified Mθ(x) by partitioning it into the null and the alternative hypotheses,
(b) Fisher’s post-data p-value has been replaced with a pre-data
and
(c) the combination of (a)-(b) render ascertainable the pre-data capacity (power) of the test to detect discrepancies in the direction of
1 ,
(d) Fisher’s evidential aspirations for the p-value based on discordance, were replaced by the more behavioristic accept/reject decision
rules that aim to provide input for alternative actions: when accept 0
take action when reject 0 take action .
As claimed by Neyman and Pearson (1928):
“the tests themselves give no ﬁnal verdict, but as tools help the worker
who is using them to form his ﬁnal decision.”
6.1 The archetypal N-P hypotheses specification
In N-P testing there has been a long debate concerning the proper way to specify the null and alternative hypotheses. It is often thought to be rather arbitrary and subject to abuse. This view stems primarily from an inadequate understanding of the role of statistical vs. substantive information.
¥ It is argued that there is nothing arbitrary about specifying the null and alternative hypotheses in N-P testing. The default alternative is always the complement to the null relative to the parameter space of Mθ(x).
Consider a generic statistical model:

Mθ(x) = {f(x; θ), θ∈Θ}, x∈X:=R^n_X.

The archetypal way to specify the null and alternative hypotheses for N-P testing is:

H0: θ∈Θ0 vs. H1: θ∈Θ1,   (13)

where Θ0 and Θ1 constitute a partition of the parameter space Θ:

Θ0 ∩ Θ1 = ∅, Θ0 ∪ Θ1 = Θ.

In the case where the set Θ0 or Θ1 contains a single point, say Θ0 = {θ0}, that determines f(x; θ0), the hypothesis is said to be simple; otherwise it is composite.
Example. For the simple Bernoulli model, H0: θ = θ0 is simple, but H1: θ ≠ θ0 is composite.
¥ An equivalent formulation, one that brings out most clearly the fact that N-P testing poses questions pertaining to the 'true' statistical Data Generating Mechanism (DGM) M*(x) = {f*(x)}, x∈X, is to re-write (13) in the form:

H0: f*(x)∈M0(x) vs. H1: f*(x)∈M1(x),

where f*(x) denotes the 'true' distribution of the sample, and:

M0(x) = {f(x; θ), θ∈Θ0}, M1(x) = {f(x; θ), θ∈Θ1}, x∈X:=R^n_X,

constitute a partition of Mθ(x). P(x) denotes the set of all possible statistical models that could have given rise to data x0:=(x1, x2, ..., xn).

[Figure: the set P(x), with Mθ(x) partitioned into H0 and H1: N-P testing within Mθ(x).]
¥ When properly understood in the context of a statistical model Mθ(x), N-P testing leaves very little leeway in the specification of the H0 and H1 hypotheses. This is because the whole of Θ [all possible values of θ] is relevant on statistical grounds, despite the fact that only a small subset is often deemed relevant on substantive grounds.
The reasoning underlying this argument is that N-P tests invariably pose hypothetical questions, framed in terms of θ∈Θ, pertaining to the true statistical DGM M*(x)={f*(x)}, x∈X. Partitioning Mθ(x) provides the most effective way to learn from data by zeroing in on the true θ, in the same way one can zero in on an unknown prespecified city on a map using sequential partitioning.
¥ This foregrounds the importance of securing the statistical adequacy of Mθ(x) [validating its probabilistic assumptions] before it provides the basis of any inference. Hence, whenever the union of Θ0 and Θ1, the values of θ postulated by H0 and H1, respectively, does not exhaust Θ, there is always the possibility that the true θ lies in the complement: Θ − (Θ0 ∪ Θ1).
What constitutes an N-P test?
A misleading idea that needs to be done away with immediately is the notion that an N-P test is just a formula with some associated statistical tables! An N-P test is a combination of a test statistic, whose sampling distribution under both the null and the alternative hypotheses is tractable, and a rejection region.
Test statistic. Like an estimator, an N-P test statistic is a mapping from the sample space X:=R^n_X to the real line:

τ(·): X → R,

one that partitions the sample space into an acceptance region C0 and a rejection region C1, with C0 ∩ C1 = ∅, C0 ∪ C1 = X, in a way that corresponds to the partition of Θ into Θ0 and Θ1:

C0 ↔ Θ0, C1 ↔ Θ1.

The N-P decision rules, based on a test statistic τ(X), take the form:

[i] if x0∈C0, accept H0; [ii] if x0∈C1, reject H0,   (14)

and the associated error probabilities are:

type I: P(x0∈C1; H0(θ) true) = α(θ) for θ∈Θ0,
type II: P(x0∈C0; H1(θ1) true) = β(θ1) for θ1∈Θ1.
                     True State of Nature
N-P rule        H0 true          H0 false
Accept H0       √                Type II error
Reject H0       Type I error     √
£ Some misinformed statistics books go to great lengths to explain that the above error probabilities should be viewed as conditional on H0 or H1, and they use the misleading notation (|):

P(x0∈C1 | H0(θ) true) = α(θ) for θ∈Θ0.

Such conditioning makes no sense in a frequentist framework because θ is treated as an unknown constant. Instead, the error probabilities should be interpreted as evaluations of the sampling distribution of the test statistic under different scenarios pertaining to different hypothesized values of θ.
¥ Some statistics books make a big deal of the distinction 'accept H0' vs. 'fail to reject H0', in their commendable attempt to bring out the problem of misinterpreting 'accept H0' as tantamount to 'there is evidence for H0'; this is known as the fallacy of acceptance. By the same token there is an analogous distinction between 'reject H0' vs. 'accept H1', which highlights the problem of misinterpreting 'reject H0' as tantamount to 'there is evidence for H1'; this is known as the fallacy of rejection.
What is objectionable about this practice is that the verbal distinctions by themselves do nothing but perpetuate these fallacies; we need to address them, not play clever with words!
Table 3 - The simple (one parameter) Normal model
Statistical GM: Xt = μ + ut, t∈N
[1] Normal: Xt ~ N(·,·)
[2] Constant mean: E(Xt) = μ for all t∈N
[3] Constant variance: Var(Xt) = σ², known, for all t∈N
[4] Independence: {Xt, t∈N} is an independent process.
Example 1. Consider the following hypotheses in the context of the simple (one parameter) Normal model:

H0: μ ≤ μ0 vs. H1: μ > μ0.   (15)

Let us choose a test Tα:={τ(X), C1(α)} of the form:

test statistic: τ(X) = √n(X̄n − μ0)/σ, where X̄n = (1/n)Σ_{t=1}^n Xt,
rejection region: C1(α) = {x: τ(x) > cα}.   (16)

To evaluate the two types of error probabilities, the distribution of τ(X) under both the null and the alternatives is needed:

[I] under H0 (μ=μ0): τ(X) = √n(X̄n − μ0)/σ ~ N(0, 1),
[II] under H1 (μ=μ1): τ(X) = √n(X̄n − μ0)/σ ~ N(δ1, 1) for all μ1 > μ0,   (17)

where the non-zero mean δ1 takes the form:

δ1 = √n(μ1 − μ0)/σ > 0 for all μ1 > μ0.

The evaluation of the type I error probability is based on [I]:

α = max_{μ≤μ0} P(τ(X) > cα; H0(μ)) = P(τ(X) > cα; μ=μ0).

Notice that in cases where the null is of the form H0: μ ≤ μ0, the type I error probability is defined as the maximum over all values of μ in the interval (−∞, μ0], which turns out to be the one evaluated at the end point of the interval, μ = μ0. Hence, the same test will be optimal also in the case where the null is of the form H0: μ = μ0.
The evaluation of the type II error probabilities is based on [II]:

β(μ1) = P(τ(X) ≤ cα; H1(μ1)) for all μ1 > μ0.

The power [the probability of rejecting the null when false] is equal to 1 − β(μ1), i.e.

π(μ1) = P(τ(X) > cα; H1(μ1)) for all μ1 > μ0.

Why do we care about power? The power π(μ1) measures the pre-data (generic) capacity of the test Tα:={τ(X), C1(α)} to detect a discrepancy, say γ = μ1 − μ0 = .1, when present. Hence, when π(μ1) = .35 this test has very low capacity to detect such a discrepancy.
How is this information helpful in practice? If γ = .1 is the discrepancy of substantive interest, this test is practically useless for that purpose, because we know beforehand that the test does not have enough capacity to detect γ even when present!
What can one do in such a case? The power of the above test is monotonically increasing with δ1 = √n(μ1 − μ0)/σ, and thus increasing the sample size n increases the power.
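This monotonicity is easy to verify numerically. A minimal sketch (Python; the values γ = .1, σ = 1, α = .05 with cα = 1.645, and the sample sizes, are assumed purely for illustration):

```python
import math

def norm_cdf(z):
    """Standard Normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(n, gamma, sigma=1.0, c_alpha=1.645):
    """pi(mu1) = P(tau(X) > c_alpha; mu = mu0 + gamma) = P(Z > c_alpha - delta1)."""
    delta1 = math.sqrt(n) * gamma / sigma
    return 1.0 - norm_cdf(c_alpha - delta1)

# Power against the same discrepancy gamma = .1 grows with the sample size n
p100, p400, p900 = power(100, 0.1), power(400, 0.1), power(900, 0.1)
```

For these assumed values the power climbs from roughly .26 at n = 100 to roughly .91 at n = 900.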
Hypothesis testing reasoning.
¥ As with Fisher's significance testing, the reasoning underlying N-P testing is hypothetical, in the sense that both error probabilities are based on evaluating the relevant test statistic under several hypothetical scenarios, e.g. under H0 (μ=μ0) or under H1 (μ=μ1) for all μ1 > μ0. These hypothetical sampling distributions are then used to compare H0 or H1, via τ(x0), to the True State of Nature (TSN) represented by data x0.
This is in contrast to frequentist estimation, which is factual in nature, i.e. the sampling distribution of an estimator is evaluated under the TSN, μ = μ*.
Significance level vs. p-value
It is important to note that there is a mathematical relationship between the type I error probability (significance level) and the p-value. Placing them side by side:

P(type I error): P(τ(X) > cα; μ=μ0) = α,   (18)
p-value: P(τ(X) > τ(x0); μ=μ0) = p(x0),

it becomes obvious that:
(a) they are both specified in terms of the distribution of the same test statistic τ(X) under H0,
(b) they both evaluate the probability of tail events of the form {x: τ(x) > c}, x∈X, but
(c) they differ in terms of the threshold: cα vs. τ(x0); in that sense α is a pre-data and p(x0) is a post-data error probability.
In light of (a)-(c), three comments are called for.
First, the p-value can be viewed as the smallest significance level α at which H0 would have been rejected with data x0. For that reason the p-value is often referred to as the observed significance level. Hence, it should come as no surprise to learn that the above N-P decision rules (14) can be recast in terms of the p-value:

[i]* if p(x0) > α, accept H0; [ii]* if p(x0) ≤ α, reject H0.
Indeed, practitioners often prefer to use the rules [i]*-[ii]* because the p-value conveys additional information when it is not close to the predesignated threshold α. For instance, rejecting H0 with p(x0) = .0001 seems more informative than just reporting that H0 was rejected at α = .05.
¥ Second, the p-value seems to have an implicit alternative built into the choice of the tail area. For instance, a 2-sided alternative H1: μ ≠ μ0 would require one to evaluate P(|τ(X)| > |τ(x0)|; μ=μ0). How did Fisher get away with such an obvious breach of his own preaching? Speculating on that, his answer might have been: a 2-sided evaluation of the p-value makes no sense because, post-data, the sign of the test statistic τ(x0) determines the direction of departure!
¥ Third, neither the significance level α nor the p-value can be interpreted as probabilities attached to the particular values of θ associated with H0 or H1, since the probabilities in (18) are firmly attached to the sample realizations x∈X. Indeed, attaching probabilities to the unknown constant θ makes no sense in frequentist statistics.
As argued by Fisher (1921), p. 25: "We may discuss the probability of occurrence of quantities which can be observed or deduced from observations, in relation to any hypotheses which may be suggested to explain these observations. We can know nothing of the probability of hypotheses..."
This scotches another two erroneous interpretations of the p-value:
£ p(x0) = .05 does not mean that there is a 5% chance of a Type I error (i.e. a false positive).
£ p(x0) = .05 does not mean that there is a 95% chance that the results would replicate if the study were repeated.
Example 1 (continued). Consider applying the test Tα:={τ(X), C1(α)} for: μ0 = 10, σ = 1, n = 100, x̄n = 10.175, α = .05 ⇒ cα = 1.645:

τ(x0) = √n(x̄n − μ0)/σ = √100(10.175−10)/1 = 1.75.

The p-value is P(τ(X) > 1.75; μ=μ0) = .04 ⇒ Reject H0.

Standard Normal tables:
One-sided values: α=.100: cα=1.28, α=.050: cα=1.645, α=.025: cα=1.96, α=.010: cα=2.33.
Two-sided values: α=.100: cα=1.645, α=.050: cα=1.96, α=.025: cα=2.24, α=.010: cα=2.58.
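The computations of this example can be reproduced as follows (Python; the Normal cdf is obtained from the error function, Φ(z) = ½(1 + erf(z/√2))):

```python
import math

mu0, sigma, n, xbar, c_alpha = 10.0, 1.0, 100, 10.175, 1.645

# Test statistic under the simple (one parameter) Normal model
tau = math.sqrt(n) * (xbar - mu0) / sigma                     # = 1.75

# One-sided p-value P(tau(X) > tau(x0); mu = mu0) under N(0, 1)
p_value = 1.0 - 0.5 * (1.0 + math.erf(tau / math.sqrt(2.0)))  # about .04

reject = tau > c_alpha   # True: tau(x0) falls in the rejection region
```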
The trade-off between type I and II error probabilities
¥ How does the introduction of type I and II error probabilities address the arbitrariness associated with Fisher's choice of a test statistic?
The ideal test is one whose error probabilities are both zero, but no such test exists for a given n; an ideal test, like the ideal estimator, exists only as n → ∞! Worse, there is a trade-off between the above error probabilities: as one increases the other decreases, and vice versa.
Example. In the case of the test Tα:={τ(X), C1(α)} in (16), decreasing the probability of type I error from α=.05 to α=.01 increases the threshold from cα=1.645 to cα=2.33, which makes it easier to accept H0, and this increases the probability of type II error.
To address this trade-off Neyman and Pearson (1933) proposed:
(i) to specify H0 and H1 in such a way as to render the type I error the more serious one, and then
(ii) to choose an α significance level test {τ(X), C1(α)} so as to maximize its power for all θ∈Θ1.
N-P rationale. Consider the analogy with a criminal offense trial, where the jury are instructed by the judge to find the defendant 'not guilty' unless they have been convinced 'beyond any reasonable doubt' by the evidence:

H0: not guilty vs. H1: guilty.

The clause 'beyond any reasonable doubt' amounts to fixing the type I error probability to a very small value, to reduce the risk of sending innocent people to prison, or even to death. At the same time, one would want the system to minimize the risk of letting guilty people get off scot-free.
More formally, the choice of an optimal N-P test is based on fixing an upper bound α for the type I error probability:
  P(x∈C₁; H₀(θ) true) ≤ α, for all θ∈Θ₀,
and then selecting a test {τ(X), C₁(α)} that minimizes the type II error probability, or equivalently, maximizes the power:
  π(θ)=P(x₀∈C₁; H₁(θ) true)=1−β(θ), for all θ∈Θ₁.
■ The general rule is that in selecting an optimal N-P test the whole of the parameter space Θ is relevant. This is why partitioning both Θ and the sample space X:=ℝⁿ using a test statistic τ(X) provides the key to N-P testing.
Optimal properties of tests
The property that sets the gold standard for an optimal test is:
[1] Uniformly Most Powerful (UMP): A test Tα:={τ(X), C₁(α)} is said to be UMP if it has higher power than any other α-level test T̃α for all values θ∈Θ₁, i.e.
  π(θ; Tα) ≥ π(θ; T̃α), for all θ∈Θ₁.
Example. Test Tα:={τ(X), C₁(α)} in (16) is UMP; see Lehmann (1986).
Additional properties of N-P tests
[2] Unbiasedness: A test Tα:={τ(X), C₁(α)} is said to be unbiased if the probability of rejecting H₀ when false is always greater than that of rejecting H₀ when true, i.e.
  max_{θ∈Θ₀} P(x₀∈C₁; H₀(θ)) ≤ P(x₀∈C₁; H₁(θ)), for all θ∈Θ₁.
[3] Consistency: A test Tα:={τ(X), C₁(α)} is said to be consistent if its power goes to one for all θ∈Θ₁ as n → ∞, i.e.
  lim_{n→∞} π(θ) = P(x₀∈C₁; H₁(θ) true)=1, for all θ∈Θ₁.
As in estimation, this is a minimal property for tests.
Evaluating power. How does one evaluate the power of test Tα:={τ(X), C₁(α)} for different discrepancies γ=μ₁−μ₀?
Step 1. In light of the fact that the relevant sampling distribution for the evaluation of π(μ₁) is:
  [II] τ(X)=√n(X̄ₙ−μ₀)/σ ~ N(δ₁, 1) under μ=μ₁, for all μ₁ > μ₀, where δ₁=√n(μ₁−μ₀)/σ,
the use of the N(0,1) tables requires one to split the test statistic into:
  √n(X̄ₙ−μ₀)/σ = √n(X̄ₙ−μ₁)/σ + δ₁,
since under H₁ (μ=μ₁) the distribution of the first component is:
  √n(X̄ₙ−μ₁)/σ = √n(X̄ₙ−μ₀)/σ − δ₁ ~ N(0, 1), for μ₁ > μ₀.
Step 2. Evaluating the power of the test Tα:={τ(X), C₁(α)} for different discrepancies γ=μ₁−μ₀ yields:
  π(μ₁)=P(Z > −δ₁ + cα; μ=μ₁), with δ₁=√n(μ₁−μ₀)/σ:
  γ=.1 (δ₁=1): π(10.1)=P(Z > −1 + 1.645) = .259,
  γ=.2 (δ₁=2): π(10.2)=P(Z > −2 + 1.645) = .639,
  γ=.3 (δ₁=3): π(10.3)=P(Z > −3 + 1.645) = .913,
where Z is a generic standard Normal r.v., i.e. Z ~ N(0, 1).
The power of this test is typical of an optimal test since π(μ₁) increases with the non-zero mean δ₁=√n(μ₁−μ₀)/σ, and thus:
(a) the power increases with the sample size n,
(b) the power increases with the discrepancy γ=(μ₁−μ₀), and
(c) the power decreases with σ.
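The two evaluation steps above amount to a one-line power function; a minimal sketch for this test, using only the standard library:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power(mu1, mu0=10.0, sigma=1.0, n=100, alpha=0.05):
    """Power of the one-sided Normal test H0: mu = mu0 vs H1: mu > mu0."""
    delta1 = sqrt(n) * (mu1 - mu0) / sigma   # non-centrality delta_1
    c_alpha = Z.inv_cdf(1 - alpha)           # 1.645 for alpha = .05
    return 1 - Z.cdf(c_alpha - delta1)       # P(Z > c_alpha - delta_1)

for mu1 in (10.1, 10.2, 10.3):
    print(f"pi({mu1}) = {power(mu1):.3f}")
# Close to the tabulated values .259, .639, .913 above; the power also
# rises with n and falls with sigma, as properties (a) and (c) state.
```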
It is very important to emphasize three features of an optimal test.
First, a test is not just a formula associated with particular statistical tables. It is a combination of a test statistic and a rejection region; hence the notation Tα:={τ(X), C₁(α)}.
Second, the optimality of an N-P test is inextricably bound up with the optimality of the estimator the test statistic is based on. Hence, it is no accident that most optimal N-P tests are based on consistent,
fully efficient and sufficient estimators. Example. In the case of the simple (one parameter) Normal model, consider replacing X̄ₙ with the unbiased estimator μ̂₂=(X₁+Xₙ)/2 ~ N(μ, σ²/2). The resulting test:
  Ťα:={τ̌(X), C₁(α)}, where τ̌(X)=√2(μ̂₂−μ₀)/σ and C₁(α)={x: τ̌(x) > cα},
will not be optimal because its power is much lower than that of Tα, and Ťα is inconsistent! This is because the non-zero mean of the sampling distribution of τ̌(X) under μ=μ₁ will be δ=√2(μ₁−μ₀)/σ, which does not change as n → ∞.
Third, it is important to note that by changing the rejection region one can render an optimal N-P test useless! For instance, replacing the rejection region of Tα:={τ(X), C₁(α)} with:
  C̄₁(α)={x: τ(x) < cα},
the resulting test T̄α:={τ(X), C̄₁(α)} is practically useless because it is biased and its power decreases as the discrepancy γ increases.
Example 2 - the Student's t test. Consider the 1-sided hypotheses:
  (1-s): H₀: μ = μ₀ vs. H₁: μ > μ₀,   (19)
  or (1-s*): H₀: μ ≤ μ₀ vs. H₁: μ > μ₀,
in the context of the simple Normal model (table 1).
In this case, a UMP (consistent) test Tα:={τ(X), C₁(α)} exists. It is the well-known Student's t test, and takes the form:
  test statistic: τ(X)=√n(X̄ₙ−μ₀)/s, where s² = (1/(n−1)) Σᵢ₌₁ⁿ (Xᵢ−X̄ₙ)²,   (20)
  rejection region: C₁(α)={x : τ(x) > cα},
where cα can be evaluated using the Student's t tables.
To evaluate the two types of error probabilities the distribution of τ(X) under both the null and alternatives are needed:
  [I]*  τ(X)=√n(X̄ₙ−μ₀)/s ~ St(n−1) under μ=μ₀,   (21)
  [II]* τ(X)=√n(X̄ₙ−μ₀)/s ~ St(δ₁; n−1) under μ=μ₁, for all μ₁ > μ₀,
where δ₁=√n(μ₁−μ₀)/σ is the non-centrality parameter.
Student's t tables (df = 60)
  One-sided cα values          Two-sided c_{α/2} values
  α=.100 : cα=1.296            α=.100 : c_{α/2}=1.671
  α=.050 : cα=1.671            α=.050 : c_{α/2}=2.000
  α=.010 : cα=2.390            α=.010 : c_{α/2}=2.660
  α=.001 : cα=3.232            α=.001 : c_{α/2}=3.460
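Since the t statistic in (20) is just a formula on the sample, a minimal numerical sketch may help; the data set below is hypothetical (not one of the chapter's examples), and 2.132 is the standard one-sided .05 Student's t value for df = 4:

```python
from math import sqrt
from statistics import mean, stdev

# A tiny hypothetical sample (illustration only), testing H0: mu = 10.
x = [10.5, 9.8, 10.2, 10.9, 10.1]
mu0, n = 10.0, len(x)

s = stdev(x)                          # s^2 = (1/(n-1)) * sum (x_i - xbar)^2
tau = sqrt(n) * (mean(x) - mu0) / s   # Student's t statistic, eq. (20)

print(f"tau(x0) = {tau:.3f}")         # tau(x0) = 1.604
# With df = n-1 = 4 the one-sided .05 threshold is c_alpha = 2.132,
# so tau(x0) = 1.604 < 2.132 and H0 is accepted at alpha = .05.
```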
In summary, the main components of a Neyman-Pearson (N-P) test Tα:={τ(X), C₁(α)} are given below.
Components of a Neyman-Pearson (N-P) test
(i) a prespecified statistical model: M_θ(x),
(ii) a null (H₀) and an alternative (H₁) hypothesis within M_θ(x),
(iii) a test statistic (distance function) τ(X),
(iv) the distribution of τ(X) under H₀,
(v) a prespecified significance level α [.01, .025, .05],
(vi) the rejection region C₁(α),
(vii) the distribution of τ(X) under H₁.
6.2 The N-P Lemma and its extensions
The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Contemplate the simple generic statistical model:
  M_θ(x)={f(x; θ), θ∈Θ:={θ₀, θ₁}}, x∈ℝⁿ,   (22)
and consider the problem of testing the simple hypotheses:
  H₀: θ = θ₀ vs. H₁: θ = θ₁.   (23)
■ The fact that the assumed parameter space is Θ:={θ₀, θ₁} and (23) constitutes a partition is often left out from most statistics textbook discussions of this famous lemma!
Existence. There exists an α-significance level Uniformly Most Powerful (UMP) [α-UMP] test, whose generic form is:
  τ(X)=h(f(x; θ₁)/f(x; θ₀)), C₁(α)={x: τ(x) > cα},   (24)
where h(·) is a monotone function.
Sufficiency. If an α-level test of the form (24) exists, then it is UMP for testing (23).
Necessity. If {τ(X), C₁(α)} is an α-UMP test, then it will be given by (24).
At first sight the N-P lemma seems rather contrived because it is an existence result for a simple statistical model M_θ(x) whose parameter space Θ:={θ₀, θ₁} is artificial, but it fits perfectly into the archetypal formulation. In addition, to implement this result one would need to find h(·) yielding a test statistic τ(X) whose distribution is known under
Despite the difference between (ii) and (iii), the relevant error probabilities coincide. The type I error probability for (ii) is defined as the maximum over all θ ≤ θ₀, but coincides with that of (iii) (θ=θ₀):
  α = max_{θ≤θ₀} P(τ(X) > cα; H₀(θ))=P(τ(X) > cα; θ=θ₀).
Similarly, the p-value is the same because:
  p(x₀)= max_{θ≤θ₀} P(τ(X) > τ(x₀); H₀(θ))=P(τ(X) > τ(x₀); θ=θ₀).
[2] For θ₁ < θ₀ the test Tα:={τ(X), C₁(α)}, where C₁(α)={x: τ(x) < cα}, is α-UMP for the hypotheses:
  (iv) (1-s≤): H₀: θ ≥ θ₀ vs. H₁: θ < θ₀,
  (v) (1-s): H₀: θ = θ₀ vs. H₁: θ < θ₀.
Again, the type I error probability and p-value are defined by:
  α = max_{θ≥θ₀} P(τ(X) < cα; H₀(θ))=P(τ(X) < cα; θ=θ₀),
  p(x₀)= max_{θ≥θ₀} P(τ(X) < τ(x₀); H₀(θ))=P(τ(X) < τ(x₀); θ=θ₀).
The existence of these α-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:
■ [A] The ratio (26) is a monotone function of a statistic Y(X), in the sense that for any two values θ₁ > θ₀, f(x; θ₁)/f(x; θ₀) changes monotonically with Y(x). This implies that f(x; θ₁)/f(x; θ₀) > k if and only if Y(x) > c, for some c.
This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of distributions [Normal, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.], the Uniform, the Exponential, the Logistic, the Hypergeometric, etc.; see Lehmann (1986).
■ [B] The parameter space under H₁, say Θ₁, is convex, i.e. for any two values (θ₁, θ₂) ∈Θ₁ their convex combinations λθ₁ + (1 − λ)θ₂ ∈Θ₁, for any 0 ≤ λ ≤ 1.
When convexity does not hold, as with the 2-sided alternative:
  (vi) (2-s): H₀: θ = θ₀ vs. H₁: θ ≠ θ₀,
[3] the test Tα:={τ(X), C₁(α)}, C₁(α)={x: |τ(x)| > c_{α/2}},
is α-UMPU (Unbiased); the α-level and p-value are:
  α=P(|τ(X)| > c_{α/2}; θ=θ₀), q(x₀)=P(|τ(X)| > |τ(x₀)|; θ=θ₀).
6.3 Constructing optimal tests: the Likelihood Ratio
The likelihood ratio test procedure can be viewed as a generalization of the Neyman-Pearson lemma to more realistic cases where the null and/or the alternative are composite hypotheses.
(a) The hypotheses of interest are of the general form:
  H₀: θ∈Θ₀ vs. H₁: θ∈Θ₁.
(b) The test statistic is related to the 'likelihood' ratio:
  λ(X)= max_{θ∈Θ} L(θ; X) / max_{θ∈Θ₀} L(θ; X).   (27)
Note that the max in the numerator is over all θ∈Θ [yielding the MLE θ̂], but that of the denominator is confined to all values under H₀: θ∈Θ₀ [yielding the constrained MLE θ̃].
(c) The rejection region is defined in terms of a transformed ratio, where h(·) is chosen to yield a known sampling distribution under H₀:
  C₁(α)= {x: τ(X)=h(λ(X)) > cα}.   (28)
In terms of procedures to construct optimal tests, the LR procedure can be shown to yield several optimal tests; see Lehmann (1986). Moreover, in cases where no such h(·) can be found, one can use the asymptotic distribution, which states that under certain restrictions (Wilks, 1938):
  2 ln λ(X) = 2[ln L(θ̂; X) − ln L(θ̃; X)] ~ᵃ χ²(r) under H₀,
where ~ᵃ reads "under H₀ is asymptotically distributed as" and r denotes the number of restrictions involved in Θ₀.
Example. Consider testing the hypotheses:
  H₀: σ² = σ₀² vs. H₁: σ² ≠ σ₀²,
in the context of the simple Normal model (table 1). From the previous chapter we know that the Maximum Likelihood method gives rise to:
  max_{θ∈Θ} L(θ; x)=L(θ̂; x) ⇒ μ̂ = (1/n) Σᵢ₌₁ⁿ xᵢ and σ̂² = (1/n) Σᵢ₌₁ⁿ (xᵢ−μ̂)²,
  max_{θ∈Θ₀} L(θ; x)=L(θ̃; x) ⇒ μ̃ = (1/n) Σᵢ₌₁ⁿ xᵢ and σ̃² = σ₀².
Moreover, the two estimated likelihoods are:
  max_{θ∈Θ} L(θ; x)= (2πσ̂²)^(−n/2) exp{−n/2},
  max_{θ∈Θ₀} L(θ; x)= (2πσ₀²)^(−n/2) exp{−nσ̂²/(2σ₀²)}.
Hence, the likelihood ratio (here written as L(θ̃; x)/L(θ̂; x), the reciprocal of the ratio in (27), so that small values of λ(x) count against H₀) is:
  λ(x)= (2πσ₀²)^(−n/2) exp{−nσ̂²/(2σ₀²)} / [(2πσ̂²)^(−n/2) exp{−n/2}] = [(σ̂²/σ₀²) exp{−(σ̂²/σ₀²)+1}]^(n/2).
The function h(·) that transforms this ratio into a test statistic is:
  τ(X)=h(λ(X))=(nσ̂²/σ₀²) ~ χ²(n−1) under H₀,
and a rejection region of the form:
  C₁(α)={x: τ(X) > c₂ or τ(X) < c₁},
where, for a size-α test, the constants c₁, c₂ are chosen in such a way that:
  ∫_{c₂}^{∞} f(u)du = ∫_{0}^{c₁} f(u)du = α/2,
and f(u) denotes the chi-square density function. The test defined by {τ(X), C₁(α)} turns out to be UMP Unbiased (UMPU); see Lehmann (1986).
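The steps above can be sketched numerically. The data below are hypothetical (illustration only), σ₀² = 1 is an assumed null value, and the quoted chi-square quantiles are the standard df = 7 values:

```python
from statistics import mean

# LR-type chi-square test of H0: sigma^2 = sigma0^2 (illustrative sketch).
x = [10.5, 9.8, 10.2, 10.9, 10.1, 9.4, 10.6, 9.9]   # hypothetical data
n, sigma0_sq = len(x), 1.0

xbar = mean(x)
sigma_hat_sq = sum((xi - xbar) ** 2 for xi in x) / n  # MLE of sigma^2
tau = n * sigma_hat_sq / sigma0_sq                    # ~ chi-square(n-1) under H0

print(f"tau(x0) = {tau:.3f}")                         # tau(x0) = 1.635
# For df = 7 the .025 and .975 chi-square quantiles are c1 = 1.690, c2 = 16.013;
# here tau(x0) = 1.635 < 1.690, so H0 is rejected at alpha = .05
# (the sample variance is too small to be consistent with sigma0^2 = 1).
```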
6.4 Substantive vs. statistical significance in N-P testing
We consider two substantive values of interest that concern the ratio of Boys (B) to Girls (G) in newborns:
  Arbuthnot: #B=#G, Bernoulli: 18B to 17G.   (29)
Statistical hypotheses. The Arbuthnot and Bernoulli conjectured values for the proportion can be embedded into a simple Bernoulli model (table 2), where θ=P(X=1)=P(Boy):
  Arbuthnot: θ_A = 1/2, Bernoulli: θ_B = 18/35.   (30)
How should one proceed to appraise θ_A and θ_B vis-a-vis the above data? The archetypal specification (13) suggests probing each value separately using the difference (θ_B − θ_A) as the discrepancy of interest.
Despite the fact that on substantive grounds the only relevant values of θ are (θ_A, θ_B), on statistical inference grounds the rest of the parameter space is relevant; M_θ(x) provides the relevant inductive premises.
This argument, however, has been questioned in the literature on the grounds that N-P testing should probe these two values using simple-vs-simple hypotheses. After all, the argument goes, the Neyman-Pearson lemma would secure an α-UMP test!
This argument shows insufficient understanding of the Neyman-Pearson lemma, because the parameter space of the simple Bernoulli model is not Θ={θ_A, θ_B} but Θ=[0, 1]. Moreover, the fact that the simple Bernoulli model has a monotone likelihood ratio, i.e.:
  f(x; θ₁)/f(x; θ₀) = [θ₁(1−θ₀)/θ₀(1−θ₁)]^{Σᵢ₌₁ⁿ xᵢ} [(1−θ₁)/(1−θ₀)]ⁿ, for (θ₀, θ₁) s.t. 0 < θ₀ < θ₁ < 1,
which is a monotonically increasing function of the statistic Y = Σᵢ₌₁ⁿ Xᵢ since θ₁(1−θ₀)/θ₀(1−θ₁) > 1, ensures the existence of several α-UMP tests.
In particular, [i] Tα^>:={τ(X), C₁(α)}:
  τ(X)=√n(X̄ₙ−θ₀)/√(θ₀(1−θ₀)) ~ Bin(0, 1; n) under θ=θ₀, C₁(α)={x: τ(x) > cα},   (31)
is an α-UMP test for the N-P hypotheses:
  (ii) (1-s≥): H₀: θ ≤ θ₀ vs. H₁: θ > θ₀,
  (iii) (1-s): H₀: θ = θ₀ vs. H₁: θ > θ₀,
and [ii] Tα^<:={τ(X), C₁(α)}:
  τ(X)=√n(X̄ₙ−θ₀)/√(θ₀(1−θ₀)) ~ Bin(0, 1; n) under θ=θ₀, C₁(α)={x: τ(x) < cα},   (32)
is an α-UMP test for the N-P hypotheses:
  (iv) (1-s≤): H₀: θ ≥ θ₀ vs. H₁: θ < θ₀,
  (v) (1-s): H₀: θ = θ₀ vs. H₁: θ < θ₀.
These results stem from the fact that the formulations (ii)-(v) ensure the convexity of the parameter space under H₁.
Moreover, in the case of the two-sided hypotheses:
  (i) (2-s): H₀: θ = θ₀ vs. H₁: θ ≠ θ₀,
the test [iii] Tα^≠:={τ(X), C₁(α)}:
  τ(X)=√n(X̄ₙ−θ₀)/√(θ₀(1−θ₀)) ~ Bin(0, 1; n) under θ=θ₀, C₁(α)={x: |τ(x)| > c_{α/2}},   (33)
is α-UMP Unbiased; see Lehmann (1986).
■ Randomization? No! In cases where the test statistic has a discrete sampling distribution under H₀, as in (32)-(33), one might not be able to define α exactly.
The traditional way of dealing with this problem is to randomize, but this 'solution' raises more problems than it solves. Instead, the best way to address the discreteness issue is to select a value of α that is attained for the given n, or to approximate the discrete distribution with a continuous one to circumvent the problem. In practice, there are several continuous distributions one can use, depending on whether the original distribution is symmetric or not. In the above case, even for moderately small sample sizes, say n=20, the Normal distribution provides an excellent approximation for values of θ around .5; see the graph below. With n=30762, the approximation is nearly exact.
[Distribution plot omitted: the Binomial(n=20, θ=.5) density (mean 10, st. dev. 2.236) overlaid with the approximating Normal curve.]
Normal approx. of the Binomial: Bin(θ=.5, n=20)
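The quality of the approximation in the graph can be checked directly; a minimal sketch comparing the Bin(n=20, θ=.5) pmf with the N(10, 2.236²) density at a few points:

```python
from math import comb, exp, pi, sqrt

# Compare the Bin(n=20, theta=.5) pmf with its Normal approximation
# N(mean = n*theta = 10, sd = sqrt(n*theta*(1-theta)) = 2.236).
n, theta = 20, 0.5
mu, sd = n * theta, sqrt(n * theta * (1 - theta))

def binom_pmf(k):
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def normal_pdf(x):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

for k in (6, 8, 10, 12, 14):
    print(f"k={k:2d}  Bin={binom_pmf(k):.4f}  Normal={normal_pdf(k):.4f}")
# The two columns agree to roughly two decimal places even at n = 20.
```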
What about the choice between one-sided vs. two-sided? The answer depends crucially on whether there exists reliable substantive information that renders part of the parameter space Θ irrelevant for both substantive and statistical purposes.
6.5 N-P testing of Arbuthnot's value
In this case there is reliable substantive information that the probability of the newborn being a male, θ=P(X=1), is systematically slightly above 1/2; the human sex ratio at birth is approximately θ=.512 (Hardy, 2002). Hence, a statistically more informative way to assess θ=.5 might be to use one-sided (1-s) directional probing.
Data: n=30762 newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14733 girls.
In view of the huge sample size it is advisable to choose a smaller significance level, say α=.01 ⇒ cα=2.326.
Case 1. For testing Arbuthnot's value θ₀ = 1/2 the relevant hypotheses are:
  (1-s): H₀: θ = 1/2 vs. H₁: θ > 1/2,   (34)
  or (1-s*≤): H₀: θ ≤ 1/2 vs. H₁: θ > 1/2,
and Tα^>:= {τ(X), C₁(α)} is an α-UMP test. In view of the fact that:
  τ(x₀)= √30762(16029/30762 − 1/2)/√(.5(.5)) = 7.389,   (35)
  p(x₀)=P(τ(X) > 7.389; H₀) = .0000,
the null hypothesis H₀: θ = 1/2 is strongly rejected.
What if one were to ignore the substantive information and apply a 2-sided test?
Case 2. For testing Arbuthnot's value this takes the form:
  (2-s): H₀: θ = 1/2 vs. H₁: θ ≠ 1/2,   (36)
and Tα^≠:={τ(X), C₁(α)} in (33) is an α-UMPU test; for α=.01, c_{α/2}=2.576.
In view of the fact that:
  τ(x₀)= √30762(16029/30762 − 1/2)/√(.5(.5)) = 7.389 > 2.576,
  q(x₀)=P(|τ(X)| > 7.389; H₀)=.0000,
the null hypothesis H₀: θ = 1/2 is strongly rejected.
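Using the stated data (n=30762, of which 16029 boys), both the 1-sided and 2-sided computations can be reproduced with the standard library:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

# Arbuthnot's value: H0: theta = .5, Cyprus birth data.
n, boys, theta0 = 30762, 16029, 0.5

xbar = boys / n
tau = sqrt(n) * (xbar - theta0) / sqrt(theta0 * (1 - theta0))

p_one_sided = 1 - Z.cdf(tau)              # (1-s): H1: theta > .5
p_two_sided = 2 * (1 - Z.cdf(abs(tau)))   # (2-s): H1: theta != .5

print(f"tau(x0) = {tau:.3f}")             # tau(x0) = 7.389
print(f"1-s p = {p_one_sided:.6f}, 2-s p = {p_two_sided:.6f}")
# Both p-values print as 0.000000: H0 is strongly rejected either way.
```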
6.6 N-P testing of Bernoulli's value
The first question we need to answer is the appropriate form of the alternative in testing Bernoulli's conjecture θ = 18/35.
Using the same substantive information that indicates θ is systematically slightly above 1/2 (Hardy, 2002), it seems reasonable to use a 1-sided alternative in the direction of 1/2.
Case 1. For probing Bernoulli's value the relevant hypotheses are:
  (1-s): H₀: θ = 18/35 vs. H₁: θ < 18/35.   (37)
The optimal (UMP) test for (37) is Tα^<:= {τ(X), C₁(α)}, where for α=.01 ⇒ cα = −2.326. Using the same data:
  τ(x₀)= √30762(16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379, p(x₀)=P(τ(X) < 2.379; H₀)=.991,
the null hypothesis H₀: θ = 18/35 is accepted with a high p-value.
What if one were to ignore the substantive information as pertaining only to Arbuthnot's value, and apply a 2-sided test?
Case 2. For testing Bernoulli's value using the 2-s probing:
  (2-s): H₀: θ = 18/35 vs. H₁: θ ≠ 18/35,   (38)
based on the UMPU test Tα^≠:={τ(X), C₁(α)} in (33):
  τ(x₀)= √30762(16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379 < 2.576,   (39)
  q(x₀)=P(|τ(X)| > 2.379; H₀)=.0174,
yields accepting H₀, but the p-value is considerably smaller than p(x₀)=.991 for the 1-sided test.
The question that arises is which one is the relevant p-value?
Indeed, a moment's reflection suggests that these two are not the only choices. There is also the p-value for:
  (1-s): H₀: θ = 18/35 vs. H₁: θ > 18/35,   (40)
which yields a much smaller p-value:
  p(x₀)=P(τ(X) > 2.379; H₀)=.009,   (41)
reversing the previous results and leading to rejecting H₀.
■ This raises the question: on what basis should one choose the relevant p-value?
The short answer 'on substantive grounds' seems questionable because, first, it ignores the fact that the whole of the parameter space is relevant on statistical grounds, and second, one applies a test to assess the validity of substantive information, not to foist it on the data at the outset. Even more questionable is the argument that one should choose a simple alternative on substantive grounds in order to apply the Neyman-Pearson lemma that guarantees an α-UMP test; this is based on a misapprehension of the lemma.
The longer answer, which takes account of the fact that post-data τ(x₀) is known, and thus there is often a clear direction of departure indicated by data x₀ associated with the sign of τ(x₀), will be elaborated upon in section 8.
7 Foundational issues raised by N-P testing
Let us now appraise how successful the N-P framework was in addressing the perceived weaknesses of Fisher's testing:
[a] his ad hoc choice of a test statistic τ(X),
[b] his use of a post-data [τ(x₀) is known] threshold for the p-value that indicates discordance (reject) with H₀, and
[c] his denial that p(x₀) can indicate accordance with H₀.
Very briefly, the N-P framework partly succeeded in addressing [a] and [c], but did not provide a coherent evidential account that answers the basic question (Mayo, 1996):
  when do data x₀ provide evidence for or against a hypothesis or a claim H?   (42)
This is primarily because one could not interpret the decision results 'Accept H₀' ('Reject H₀') as data x₀ providing evidence for (against) H₀ [and thus providing evidence for H₁]. Why?
(i) A particular test Tα:={τ(X), C₁(α)} could have led to 'Accept H₀' because the power of that test to detect an existing discrepancy γ was very low; the test had no generic capacity to detect the existing discrepancy. This can easily happen when the sample size n is small.
(ii) A particular test Tα:={τ(X), C₁(α)} could have led to 'Reject H₀' simply because the power of that test was high enough to detect 'trivial' discrepancies from H₀, i.e. the generic capacity of the test was extremely high. This can easily happen when the sample size n is very large.
In light of the fact that the N-P framework did not provide a bridge between the accept/reject rules and what scientific inquiry is after, i.e. when data x₀ provide evidence for or against a hypothesis H, it should come as no surprise to learn that most practitioners sought such answers by trying unsuccessfully (and misleadingly) to distill an evidential interpretation out of Fisher's p-value. In their eyes the pre-designated α is equally vulnerable to manipulation [pre-designation to keep practitioners 'honest' is unenforceable in practice], but the p-value is more informative and data-specific.
Is Fisher's p-value better at answering the question (42)? No. A small (large) p-value, relative to a certain threshold, could not be interpreted as evidence for the presence (absence) of a substantive discrepancy γ, for the same reasons as (i)-(ii). The generic capacity of the test (its power) affects the p-value. For instance, a very small p-value can easily arise in the case of a very large sample size n.
Fallacies of Acceptance and Rejection
The issues with the accept/reject H₀ and p-value results raised above can be formalized into two classic fallacies.
(a) The fallacy of acceptance: no evidence against H₀ is misinterpreted as evidence for H₀. This fallacy can easily arise in cases where the test in question has low power to detect discrepancies of interest.
(b) The fallacy of rejection: evidence against H₀ is misinterpreted as evidence for a particular H₁. This fallacy arises in cases where the test in question has high power to detect substantively minor discrepancies. Since the power of a test increases with the sample size n, this renders N-P rejections, as well as tiny p-values, with large n highly susceptible to this fallacy.
In the statistics literature, as well as in the secondary literatures in several applied fields, there have been numerous attempts to circumvent these two fallacies, but none has succeeded.
8 Post-data severity evaluation
A moment's reflection about the inability of the accept/reject H₀ and p-value results to provide a satisfactory answer to the question in (42) suggests that the problem arises primarily when such results are detached from the test itself, and are treated as providing the same evidence for a particular hypothesis (H₀ or H₁) regardless of the generic capacity (the power) of the test in question. That is, whether data x₀ provide evidence for or against a particular hypothesis (H₀ or H₁) depends crucially on the generic capacity of the test (power) in question to detect discrepancies from H₀. This stems from the intuition that a small p-value or a rejection of H₀ based on a test with low power (e.g. a small n) for detecting a particular discrepancy γ provides stronger evidence for the presence of that particular discrepancy γ than using a test with much higher power (e.g. a large n). Mayo (1996) proposed a frequentist evidential account based on harnessing this intuition in the form of a post-data severity evaluation of the accept/reject results. This is based on custom-tailoring the generic capacity of the test to establish the discrepancy γ warranted by data x₀. This evidential account can be used to circumvent the above fallacies, as well as other (misplaced) charges against frequentist testing.
The severity evaluation is a post-data appraisal of the accept/reject and p-value results that revolves around the discrepancy γ from H₀ warranted by data x₀.
■ A hypothesis H passes a severe test Tα with data x₀ if:
(S-1) x₀ accords with H, and
(S-2) with very high probability, test Tα would have produced a result that accords less well with H than x₀ does, if H were false.
Severity can be viewed as a feature of a test Tα as it relates to a particular data x₀ and a specific claim H being considered. Hence, the severity function has three arguments, SEV(Tα, x₀, H), denoting the severity with which H passes Tα with x₀; see Mayo and Spanos (2006).
Example. To explain how the above severity evaluation can be applied in practice, let us return to the problem of assessing the two substantive values of interest:
  Arbuthnot: θ_A = 1/2, Bernoulli: θ_B = 18/35.
When these values are viewed in the context of the error statistical perspective, it becomes clear that the way to frame the probing is to choose one of the values as the null hypothesis, and let the difference between them (θ_B − θ_A) represent the discrepancy of substantive interest.
Data: n=30762 newborns during the period 1993-5 in Cyprus, of which 16029 are boys and 14733 girls.
Let us return to probing the Arbuthnot value based on:
  (1-s): H₀: θ₀ = 1/2 vs. H₁: θ > 1/2,
which gave rise to the observed test statistic:
  τ(x₀)= √30762(16029/30762 − 1/2)/√(.5(.5)) = 7.389,   (43)
leading to a rejection of H₀ at α=.01 ⇒ cα=2.326. Indeed, H₀ would have also been rejected by a (2-s) test, since c_{α/2}=2.576.
An important feature of the severity evaluation is that it is post-data, and thus the sign of the observed test statistic τ(x₀) provides directional information that indicates the directional inferential claims that 'passed'. In relation to the above example, the severity 'accordance' condition (S-1) implies that the rejection of θ₀ = 1/2 with τ(x₀)=7.389 > 0 indicates that the form of the inferential claim that 'passed' is of the generic form:
  θ > θ₁ = θ₀ + γ, for some γ ≥ 0.   (44)
The directional feature of the severity evaluation is very important in addressing several criticisms of N-P testing, including the potential arbitrariness and possible abuse of:
[a] switching between one-sided, two-sided or simple-vs-simple hypotheses,
[b] interchanging the null and alternative hypotheses, and
[c] manipulating the level of significance in an attempt to get the desired testing result.
To establish the particular discrepancy γ warranted by data x₀, the severity post-data 'discordance' condition (S-2) calls for evaluating the probability of the event: "outcomes x that accord less well with θ > θ₁ than x₀ does", i.e. [x: τ(x) ≤ τ(x₀)], yielding:
  SEV(Tα; θ > θ₁) = P(x: τ(x) ≤ τ(x₀); θ > θ₁ is false).   (45)
The evaluation of this probability is based on the sampling distribution:
  τ(X)=√n(X̄ₙ−θ₀)/√(θ₀(1−θ₀)) ~ Bin(δ(θ₁), V(θ₁); n) under θ=θ₁, for θ₁ > θ₀,   (46)
  where δ(θ₁)= √n(θ₁−θ₀)/√(θ₀(1−θ₀)) ≥ 0, V(θ₁)= θ₁(1−θ₁)/(θ₀(1−θ₀)) ≤ 1.
Since the hypothesis that 'passed' is of the form θ > θ₁=θ₀+γ, the primary objective of SEV(Tα; θ > θ₁) is to determine the largest discrepancy γ ≥ 0 warranted by data x₀.
Table 4 evaluates SEV(θ > θ₁) for different values of θ₁, with the evaluations based on the standardized τ(X):
  [τ(X)−δ(θ₁)]/√V(θ₁) ~ Bin(0, 1; n) ≈ N(0, 1) under θ=θ₁.
For example, when θ₁ = .5142:
  τ(x₀) − δ(θ₁) = 7.389 − √30762(.5142−.5)/√(.5(.5)) = 2.408,   (47)
and Φ(2.408)=.992, where Φ(·) is the cumulative distribution function (cdf) of the standard Normal distribution. Note that the scaling √V(θ₁)=.9996 ≈ 1 can be ignored.
Table 4: Reject H₀: θ = .5 vs. H₁: θ > .5
  Relevant claim θ > θ₁=(.5+γ)     Severity P(x: τ(X)≤τ(x₀); θ=θ₁)
  γ=.010,  θ₁=.510  :  .999
  γ=.013,  θ₁=.513  :  .997
  γ=.0142, θ₁=.5142 :  .992
  γ=.0143, θ₁=.5143 :  .991
  γ=.015,  θ₁=.515  :  .983
  γ=.016,  θ₁=.516  :  .962
  γ=.017,  θ₁=.517  :  .923
  γ=.018,  θ₁=.518  :  .859
  γ=.020,  θ₁=.520  :  .645
  γ=.021,  θ₁=.521  :  .500
  γ=.025,  θ₁=.525  :  .084
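The Table 4 column can be reproduced directly from (45)-(47); a minimal sketch using the Normal approximation Bin(0, 1; n) ≈ N(0, 1):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

# Post-data severity for claims theta > theta1 after rejecting H0: theta = .5,
# using the Cyprus data (n = 30762, 16029 boys).
n, boys, theta0 = 30762, 16029, 0.5
tau_obs = sqrt(n) * (boys / n - theta0) / sqrt(theta0 * (1 - theta0))  # 7.389

def severity(theta1):
    """SEV(theta > theta1) = P(tau(X) <= tau(x0); theta = theta1)."""
    delta = sqrt(n) * (theta1 - theta0) / sqrt(theta0 * (1 - theta0))
    v = sqrt(theta1 * (1 - theta1) / (theta0 * (1 - theta0)))  # ~ 1, kept anyway
    return Z.cdf((tau_obs - delta) / v)

for gamma in (0.0142, 0.017, 0.018, 0.020, 0.025):
    print(f"gamma={gamma:.4f}  SEV(theta > {theta0+gamma:.4f}) = {severity(theta0+gamma):.3f}")
# Compare with the Table 4 column: .992, .923, .859, .645, .084
# (the values agree to about +/- .001 under the Normal approximation).
```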
An evidential interpretation. Taking a very high probability, say .95, as a threshold, the largest discrepancy from the null warranted by this data is:
  γ ≤ .01637, since SEV(Tα; θ > .51637) = .950.
Is this discrepancy substantively significant? In general, to answer this question one needs to appeal to substantive subject matter information to assess the warranted discrepancy on substantive grounds. In human biology it is commonly accepted that the sex ratio at birth is approximately 105 boys to 100 girls; see Hardy (2002). Translating this ratio in terms of θ yields θ‡ = 105/205 = .5122, which suggests that the above warranted discrepancy γ ≤ .01637 is substantively significant, since the warranted value θ = .51637 exceeds θ‡. In light of that, the substantive discrepancy of interest between θ_A and θ_B:
  γ* = (18/35) − .5 = .0142857,
yields SEV(Tα; θ > .5143)=.991, indicating that there is excellent evidence for the claim θ > θ₁ = θ₀ + γ*.
■ In terms of the ultimate objective of statistical inference, learning from data, this evidential interpretation seems highly effective because it narrows down an infinite set Θ to a very small subset!
8.1 Revisiting issues pertaining to N-P testing
The above evidential interpretation based on the severity assessment can also be used to shed light on a number of issues raised in the previous sections.
(a) The arbitrariness of N-P hypotheses specification
The question that naturally arises at this stage is whether having good evidence for inferring θ > 18/35 depends on the particular way one has chosen to specify the hypotheses for the N-P test. Intuitively, one would expect that the evidence should not be dependent on the specification of the null and alternative hypotheses as such, but on x₀ and the statistical model M_θ(x), x∈ℝⁿ.
To explore that issue let us focus on probing the Bernoulli value H₀: θ₀ = 18/35, where the test statistic yields:
  τ(x₀)= √30762(16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379.   (48)
This value leads to rejecting H₀ at α=.01 when one uses a (1-s) test [cα=2.326], but accepting H₀ when one uses a (2-s) test [c_{α/2}=2.576]. Which result should one believe?
Irrespective of these conflicting results, the severity 'accordance' condition (S-1) implies that, in light of the observed test statistic τ(x₀)=2.379 > 0, the directional claim that 'passed' on the basis of (48) is of the form:
  θ > θ₁ = θ₀ + γ.
To evaluate the particular discrepancy γ warranted by data x₀, condition (S-2) calls for the evaluation of the probability of the event: 'outcomes x that accord less well with θ > θ₁ than x₀ does', i.e. [x: τ(x) ≤ τ(x₀)], giving rise to (45), but with a different θ₀.
Table 5 provides several such evaluations of SEV(θ > θ₁) for different values of γ; note that these evaluations are also based on (46). A number of negative values of γ are included in order to address the original substantive hypotheses of interest, as well as to bring out the fact that the results of tables 4 and 5 are identical when viewed as inferences pertaining to θ; they simply have different null values θ₀, and thus different discrepancies, but identical θ₁ values.
The severity evaluations in tables 4 and 5 not only render the choice between (2-s), (1-s) and simple-vs-simple irrelevant, they also scotch the well-rehearsed argument pertaining to the asymmetry between the null and alternative hypotheses. The N-P convention of selecting a small α is generally viewed as reflecting a strong bias against rejection of the null. On the other hand, Bayesian critics of frequentist testing argue the exact opposite: N-P testing is biased against accepting the null; see Lindley (1993). The evaluations in tables 4 and 5 demonstrate that the evidential interpretation of frequentist testing based on severe testing addresses all asymmetries, notional or real.
Table 5: Accept/Reject H₀: θ = 18/35 vs. H₁: θ ≷ 18/35
  Relevant claim θ > θ₁=(.5143+γ)   Severity P(x: τ(x)≤τ(x₀); θ=θ₁)
  γ=−.0043, θ₁=.510  :  .999
  γ=−.0013, θ₁=.513  :  .997
  γ=−.0001, θ₁=.5142 :  .992
  γ=.0000,  θ₁=.5143 :  .991
  γ=.0007,  θ₁=.515  :  .983
  γ=.0017,  θ₁=.516  :  .962
  γ=.0027,  θ₁=.517  :  .923
  γ=.0037,  θ₁=.518  :  .859
  γ=.0057,  θ₁=.520  :  .645
  γ=.0067,  θ₁=.521  :  .500
  γ=.0107,  θ₁=.525  :  .084
(b) Addressing the Fallacy of Rejection
The potential arbitrariness of the N-P specification of the null and alternative hypotheses and the associated p-values is brought out in probing the Bernoulli value θ = 18/35 using different alternatives:
  (1-s<): p(x₀)=P(τ(X) < 2.379; H₀)=.991,
  (2-s): q(x₀)=P(|τ(X)| > 2.379; H₀)=.0174,
  (1-s>): p(x₀)=P(τ(X) > 2.379; H₀)=.009.
Can the above severity evaluation explain away these highly conflicting and confusing results?
The first choice, H₁: θ < 18/35, was driven solely by substantive information relating to θ_A = 1/2, which is often a bad idea because it ignores the statistical dimension of inference. The second choice was based on lack of information about the direction of departure, which makes sense pre-data, but not post-data. The third choice reflects the post-data direction of departure indicated by τ(x₀)=2.379 > 0. In light of that, the severity evaluation renders p(x₀)=.009 the relevant p-value, after all!
■ The key problem with the p-value. In addition, the severity evaluation brings out the vulnerability of both the N-P reject rule and the p-value to the fallacy of rejection in cases where n is large. Viewing the relevant p-value (p(x₀)=.009) from the severity vantage point, it is directly related to H₁ passing a severe test, because the probability that test Tα would have produced a result that accords less well with H₁ than x₀ does [x: τ(x) ≤ τ(x₀)], if H₁ were false (H₀ true), is:
  SEV(Tα; x₀; θ > θ₀) = P(τ(X) ≤ τ(x₀); θ ≤ θ₀) = 1−P(τ(X) > τ(x₀); θ=θ₀)=.991,
which is very high; see Mayo (1996). The key problem with the p-value, however, is that it establishes the existence of some discrepancy γ ≥ 0, but provides no information concerning the magnitude licensed by x₀. The severity evaluation remedies that by relating SEV(Tα; x₀; θ > θ₀)=.991 to the inferential claim θ > 18/35, with the implicit discrepancy associated with the p-value being γ=0.
(c) Addressing the Fallacy of Acceptance
Example. Let us return to the hypotheses:
  (1-s): H₀: θ₀ = 1/2 vs. H₁: θ > 1/2,
in the context of the simple Bernoulli model (table 2), and consider applying test Tα with α=.025 ⇒ cα=1.96 to the case where data x₀ are based on n=2000 and yield x̄ₙ=.503:
  τ(x₀)= √2000(.503−.5)/√(.5(.5)) =.268, P(τ(X) > .268; H₀) = .394.
τ(x₀) = .268 < 1.96 leads to accepting H₀, and the p-value indicates no rejection of H₀.
Does this mean that data x₀ provide evidence for H₀? Or, more accurately, does x₀ provide evidence for no substantive discrepancy from H₀? Not necessarily!
To answer that question one needs to apply the post-data severe testing reasoning to evaluate the discrepancy γ ≥ 0, for θ1 = θ0 + γ, warranted by data x0. Test Tα := {d(X), C1(α)} passes θ0, and the idea is to establish the smallest discrepancy γ ≥ 0 from θ0 warranted by data x0 by evaluating (post-data) the claim θ ≤ θ1, for θ1 = θ0 + γ.
Condition (S-1) of severity is satisfied since x0 accords with θ0 because d(x0) ≤ cα, but (S-2) requires one to evaluate the probability of "all outcomes x for which test Tα accords less well with θ0 than x0 does", i.e. (x: d(x) > d(x0)), under the hypothetical scenario that "θ ≤ θ1 is false", or equivalently "θ > θ1 is true":

Sev(Tα; x0; θ ≤ θ1) = P(x: d(x) > d(x0); θ > θ1), for γ ≥ 0.   (49)

The evaluation of (49) relies on the sampling distribution (46), and for different discrepancies (γ ≥ 0) yields the results in table 6. Note that Sev(Tα; x0; θ ≤ θ1) is evaluated at θ = θ1 because the probability increases with θ.
The numerical evaluations are based on the standardized d(X) from (47). For example, when θ1 = .515:

d(x0) − √n(θ1−θ0)/√(θ0(1−θ0)) = .268 − √2000(.515−.5)/√(.5(.5)) = −1.074,

so that 1 − Φ(−1.074) = Φ(1.074) = .859, where Φ(·) is the cumulative distribution function (cdf) of N(0, 1). Note that the scaling factor √(.515(1−.515)/.5(.5)) = .9996 can be ignored because its value is so close to 1.
Table 6 — Severity Evaluation of ‘Accept H0’ with Tα(α; x0)
Relevant claim: θ ≤ θ1 = θ0 + γ

γ:             .001  .005  .01   .015  .017  .0173  .018  .02   .025  .03
Sev(θ ≤ θ1):   .429  .571  .734  .859  .895  .900   .910  .936  .975  .992

Assuming a severity threshold of, say, .90 is high enough, the above results indicate that the ‘smallest’ discrepancy warranted by x0 is γ ≥ .0173, since:

Sev(Tα; x0; θ ≤ .5173) = .9
That is, data x0 in conjunction with test Tα provide evidence (with severity at least .9) for the presence of a discrepancy as large as γ ≥ .0173. Is this discrepancy substantively insignificant? No!
As mentioned above, the accepted ratio at birth in human biology is θ = .512, which suggests that the warranted discrepancy γ ≥ .0173 is, indeed, substantively significant, since it outputs θ ≥ .5173, which exceeds .512.
This is an example where the post-data severity evaluation can be used to address the fallacy of acceptance. The severity assessment indicates that the statistically insignificant result d(x0)=.268 at α=.025 actually provides evidence against H0: θ ≤ .5 and for θ ≥ .5173, with severity at least .90.
(d) The arbitrariness of choosing the significance level
It is often argued by critics of frequentist testing that the choice of the significance level α in the context of the N-P approach, or the threshold for the p-value in the case of Fisherian significance testing, is totally arbitrary and manipulable.
To see how the severity evaluations can address this problem, let us try to ‘manipulate’ the original significance level (α=.01) to a different one that would alter the accept/reject results for H0: θ = 18/35. Looking back at the results for Bernoulli’s conjectured value using the data from Cyprus, it is easy to see that choosing a larger significance level, say α=.02 ⇒ cα=2.053, would lead to rejecting H0 for both N-P formulations:

(2-s) (H1: θ ≠ 18/35) and (1-s) (H1: θ > 18/35),

since:

d(x0) = 2.379 > 2.053,
(2-s): p(x0) = P(|d(X)| > 2.379; H0) = .0174,
(1-s): p(x0) = P(d(X) > 2.379; H0) = .009.

How does this change the original severity evaluations? The short answer is: it doesn’t! The severity evaluations remain invariant to any changes in α because they depend on d(x0)=2.379 > 0 and not on α, which indicates that the generic form of the hypothesis that ‘passed’ is θ1 = θ0 + γ, and the same evaluations apply.
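This invariance is easy to demonstrate numerically (a sketch, not part of the original text): the one-sided rejection threshold cα moves with α, so the accept/reject verdict can be manipulated, but the severity evaluation depends only on d(x0):

```python
from statistics import NormalDist

N = NormalDist()                  # standard Normal
d_x0 = 2.379                      # observed test statistic (Cyprus data)

for alpha in (0.01, 0.02, 0.05):
    c_alpha = N.inv_cdf(1 - alpha)    # one-sided rejection threshold
    reject = d_x0 > c_alpha           # the verdict can flip with alpha ...
    print(alpha, round(c_alpha, 3), reject)

# ... but the severity of 'theta > theta0' is fixed by d(x0) alone:
severity = N.cdf(d_x0)
print(round(severity, 3))             # 0.991
```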
9 Summary and conclusions
In frequentist inference, learning from data x0 about the stochastic phenomenon of interest is accomplished by applying optimal inference procedures with ascertainable error probabilities in the context of a statistical model:

Mθ(x) = {f(x; θ), θ∈Θ}, x∈R^n.   (50)

Hypothesis testing gives rise to learning from data by partitioning Mθ(x) into two subsets:

M0(x) = {f(x; θ), θ∈Θ0} or M1(x) = {f(x; θ), θ∈Θ1}, x∈R^n,   (51)

framed in terms of the parameter(s):

H0: θ∈Θ0 vs. H1: θ∈Θ1,   (52)

and inquiring whether x0 provides evidence in favor of one of the two subsets in (51) using hypothetical reasoning.
Note that one could specify (52) more perceptively as:

H0: f*(x)∈M0(x) = {f(x; θ), θ∈Θ0} vs. H1: f*(x)∈M1(x) = {f(x; θ), θ∈Θ1},

where f*(x) = f(x; θ*) denotes the ‘true’ distribution of the sample. This notation elucidates the basic question posed in hypothesis testing:
I Given that the true M*(x) lies within the boundaries of Mθ(x), can x0 be used to narrow it down to a smaller subset M0(x)?
A test Tα := {d(X), C1(α)} is defined in terms of a test statistic (d(X)) and a rejection region (C1(α)), and its optimality is calibrated in terms of the relevant error probabilities:

type I:  P(x0∈C1; H0(θ) true) ≤ α(θ), for θ∈Θ0,
type II: P(x0∈C0; H1(θ1) true) = β(θ1), for θ1∈Θ1.

These error probabilities specify how often these procedures lead to erroneous inferences, and thus determine the pre-data capacity (power) of the test in question to give rise to valid inferences:

π(θ1) = P(x0∈C1; H1(θ1) true) = 1−β(θ1), for all θ1∈Θ1.

An inference is reached by an inductive procedure which, with high probability, will reach true conclusions from valid (or approximately valid) premises (statistical model). Hence, the trustworthiness of frequentist inference depends on two interrelated pre-conditions:
(a) adopting optimal inference procedures, in the context of:
(b) a statistically adequate model.
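The pre-data capacity (power) is straightforward to compute for the Bernoulli test used throughout. The sketch below uses the chapter's n=2000 and α=.025 with the usual Normal approximation; the particular θ1 values are illustrative assumptions, not the text's own computations:

```python
from statistics import NormalDist

N = NormalDist()
n, theta0, alpha = 2000, 0.5, 0.025
c_alpha = N.inv_cdf(1 - alpha)            # one-sided threshold, about 1.96

def power(theta1):
    # pi(theta1) = P(d(X) > c_alpha; theta = theta1), Normal approximation
    xbar_cut = theta0 + c_alpha * (theta0 * (1 - theta0) / n) ** 0.5
    z = n ** 0.5 * (xbar_cut - theta1) / (theta1 * (1 - theta1)) ** 0.5
    return 1 - N.cdf(z)

# power grows with the discrepancy theta1 - theta0; at theta1 = theta0 it
# equals the type I error probability alpha
for theta1 in (0.51, 0.52, 0.53):
    print(theta1, round(power(theta1), 3))
```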
I How does hypothesis testing relate to the other forms of frequentist inference?
It turns out that testing is inextricably bound up with estimation
(point and interval) in the sense that an optimal estimator is often the
cornerstone of an optimal test and an optimal conﬁdence interval. For
instance, consistent, fully eﬃcient and minimal suﬃcient estimators
provide the basis for most of the optimal tests and conﬁdence intervals!
Having said that, optimal point estimation is considered rather noncontroversial, but optimal interval estimation and hypothesis testing
have raised several foundational issues that have bedeviled frequentist inference since the 1930s. In particular, the use and abuse of the
relevant error probabilities calibrating these latter forms of inductive
inference have contributed signiﬁcantly to the general confusion and
controversy when applying these procedures.
Addressing foundational issues
The first such issue concerns the presupposition of (approximately) true premises, i.e. that the true M*(x) lies within the boundaries of Mθ(x). This can be addressed by securing the statistical adequacy of Mθ(x) vis-à-vis data x0 using trenchant Mis-Specification (M-S) testing before applying frequentist testing; see chapter 15. As shown above, when Mθ(x) is misspecified, the nominal and actual error probabilities can be very different, rendering the reliability of the
test in question, at best unknown and at worst highly misleading. That
is, statistical adequacy ensures the ascertainability of the relevant error
probabilities, and hence the reliability of inference.
The second issue concerns the form and nature of the evidence x0 can provide for θ∈Θ0 or θ∈Θ1. Neither Fisher’s p-value nor the N-P accept/reject rules can provide such an evidential interpretation, primarily because they are vulnerable to two serious fallacies:
(c) Fallacy of acceptance: no evidence against the null is misinterpreted as evidence for it.
(d) Fallacy of rejection: evidence against the null is misinterpreted as evidence for a specific alternative.
As discussed above, these fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing, with a view to determining the discrepancy from the null warranted by data x0. This establishes the warranted inferential claim [and thus, the unwarranted ones] in particular cases, so that it gives rise to learning from data.
The severity perspective sheds ample light on the various controversies relating to p-values vs. type I and II error probabilities, delineating that the p-value is a post-data error probability while the type I and II are pre-data error probabilities, and that, although inadequate for the task of furnishing an evidential interpretation, they were intended to fulfill complementary roles. Pre-data error probabilities, like the type I and II, are used to appraise the generic capacity of testing procedures, and post-data error probabilities, like severity, are used to bridge the gap between the coarse ‘accept/reject’ and the evidence provided by data x0 for the warranted discrepancy from the null, by custom-tailoring the pre-data capacity in light of d(x0).
This refinement/extension of the Fisher-Neyman-Pearson framework can be used to address not only the above-mentioned fallacies, but also several criticisms that have beleaguered frequentist testing since the 1930s, including:
(e) the arbitrariness of selecting the thresholds,
(f) the framing of the null and alternative hypotheses in the context
of a statistical model M (x), and
(g) the numerous misinterpretations of the p-value.
10 Appendix: Revisiting Confidence Intervals (CIs)
Mathematical Duality between Testing and CIs
From a purely mathematical perspective, a point estimator is a mapping θ̂(·) of the form:

θ̂(·): R^n → Θ.

Similarly, an interval estimator is also a mapping C(·) of the form:

C(·): R^n → Θ, such that C⁻¹(θ) = {x: θ∈C(x)} ⊂ F, for θ∈Θ.

Hence, one can assign a probability to C⁻¹(θ) = {x: θ∈C(x)} in the sense that:

P(x: θ∈C(x); θ=θ*) := P(θ∈C(X); θ=θ*)

denotes the probability that the random set C(X) contains θ*, the true value of θ. The expression ‘θ* denotes the true value of θ’ is a shorthand for saying that ‘data x0 constitute a realization of the sample X with distribution f(x; θ*)’. One can define a (1−α) Confidence Interval (CI) for θ by attaching a lower bound to this probability:

P(θ∈C(X); θ=θ*) ≥ (1−α).
Neyman (1937) utilized a mathematical duality between N-P Hypothesis Testing (HT) and Confidence Intervals (CIs) to put forward an optimal theory for the latter.
The α-level test of the hypotheses:

H0: θ = θ0 vs. H1: θ ≠ θ0

takes the form Tα := {d(X; θ0), C1(θ0; α)}, with d(X; θ0) the test statistic and the rejection region defined by:

C1(θ0; α) = {x: |d(x; θ0)| > c_{α/2}} ⊂ R^n.

The complement of the latter with respect to the sample space (R^n) defines the acceptance region:

C0(θ0; α) = {x: |d(x; θ0)| ≤ c_{α/2}} = R^n − C1(θ0; α).

The mathematical duality between HT and CIs stems from the fact that for each x∈R^n one can define a corresponding set C(x; α) on the parameter space by:

C(x; α) = {θ: x∈C0(θ0; α)}, θ∈Θ, e.g. C(x; α) = {θ: |d(x; θ)| ≤ c_{α/2}}.
Conversely, for each C(x; α) there is a CI corresponding to the acceptance region C0(θ0; α) such that:

C0(θ0; α) = {x: θ0∈C(x; α)}, x∈R^n, e.g. C0(θ0; α) = {x: |d(x; θ0)| ≤ c_{α/2}}.

This equivalence is encapsulated by:

x∈C0(θ0; α) ⇔ θ0∈C(x; α), for each x∈R^n and each θ0∈Θ.

When the lower bound θ̂_L(X) and upper bound θ̂_U(X) of a CI, i.e.

P(θ̂_L(X) ≤ θ ≤ θ̂_U(X)) = 1−α,

are monotone increasing functions of X, the equivalence takes the form:

θ∈(θ̂_L(x) ≤ θ ≤ θ̂_U(x)) ⇔ x∈(θ̂_U⁻¹(θ) ≤ x ≤ θ̂_L⁻¹(θ)).
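The duality can be verified directly. The sketch below assumes a simple Normal model with known σ (an illustrative choice, not the text's example): a sample point x̄ falls in the acceptance region C0(θ0; α) exactly when θ0 falls inside the observed CI:

```python
from statistics import NormalDist

# Simple Normal model with known sigma (illustrative assumption)
n, sigma, alpha = 100, 1.0, 0.05
c = NormalDist().inv_cdf(1 - alpha / 2)   # c_{alpha/2}, about 1.96
se = sigma / n ** 0.5

def in_acceptance_region(xbar, theta0):
    # x in C0(theta0; alpha): |d(x; theta0)| <= c_{alpha/2}
    return abs((xbar - theta0) / se) <= c

def in_observed_ci(xbar, theta0):
    # theta0 in C(x; alpha): xbar - c*se <= theta0 <= xbar + c*se
    return xbar - c * se <= theta0 <= xbar + c * se

# The two conditions coincide for every (xbar, theta0) pair on a grid
for i in range(-30, 31):
    for j in range(-30, 31):
        xbar, theta0 = 0.01 * i, 0.01 * j
        assert in_acceptance_region(xbar, theta0) == in_observed_ci(xbar, theta0)
print("duality verified")
```

The check is purely set-theoretic, which is the point of the duality: it holds regardless of which value of θ is true.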
Optimality of Confidence Intervals
The mathematical duality between HT and CIs is particularly useful in defining an optimal CI that is analogous to an optimal N-P test. The key result is that the acceptance region of an α-significance level test that is Uniformly Most Powerful (UMP) gives rise to a (1−α) CI that is Uniformly Most Accurate (UMA), i.e. the error coverage probability that C(x; α) contains a value θ1 ≠ θ*, where θ* is assumed to be the true value, is minimized; Lehmann (1986). Similarly, an α-UMPU test gives rise to a (1−α)-UMAU Confidence Interval.
Moreover, it can be shown that such optimal CIs have the shortest length in the following sense: if [θ̂_L(x), θ̂_U(x)] is a (1−α)-UMAU CI, then its expected length [width] is less than or equal to that of any other (1−α) CI, say [θ̃_L(x), θ̃_U(x)]:

E*(θ̂_U(x) − θ̂_L(x)) ≤ E*(θ̃_U(x) − θ̃_L(x)).
Underlying reasoning: CI vs. HT
Does this mathematical duality render HT and CIs equivalent in
terms of their respective inference results? No!
The primary source of the confusion on this issue is that they share a common pivotal function, and their error probabilities are evaluated using the same tail areas. But that is where the analogy ends.
To bring out how their inferential claims and respective error probabilities are fundamentally different, consider the simple (one parameter) Normal model (table 3), where the common pivot is:

d(X; μ) = √n(X̄−μ)/σ ~ N(0, 1).   (53)

Since μ is unknown, (53) is meaningless as it stands, and it is crucial to understand how the CI and HT procedures render it meaningful using different types of reasoning, referred to as factual and hypothetical, respectively:

d(X; μ*) = √n(X̄−μ*)/σ ~ N(0, 1), under the TSN (the ‘true state of nature’, μ=μ*),
d(X) = √n(X̄−μ0)/σ ~ N(0, 1), under H0 (μ=μ0).   (54)

Putting the (1−α) CI side-by-side with the (1−α) acceptance region:

P(X̄ − c_{α/2}(σ/√n) ≤ μ ≤ X̄ + c_{α/2}(σ/√n); μ=μ*) = 1−α,
P(μ0 − c_{α/2}(σ/√n) ≤ X̄ ≤ μ0 + c_{α/2}(σ/√n); μ=μ0) = 1−α,   (55)

it is clear that their inferential claims are crucially different:
[a] The CI claims that with probability (1−α) the random bounds [X̄ ± c_{α/2}(σ/√n)] will cover (overlay) the true μ* [whatever that happens to be], under the TSN.
[b] The test claims that with probability (1−α) test Tα would yield a result that accords equally well or better with H0 (x: |d(x; μ0)| ≤ c_{α/2}), if H0 were true.
Hence, their claims pertain to the parameter vs. the sample space, and the evaluations under the TSN (μ=μ*) vs. under the null (μ=μ0) render the two statements inferentially very different. The idea that a given (known) μ0 and the true μ* are interchangeable is an illusion.
This can be seen most clearly in the simple Bernoulli case, where the standard deviation needs to be estimated using θ̂ = (1/n)∑_{k=1}^n X_k := X̄ for the UMAU (1−α) CI:

P(X̄ − c_{α/2}√(X̄(1−X̄)/n) ≤ θ ≤ X̄ + c_{α/2}√(X̄(1−X̄)/n); θ=θ*) = 1−α,
but the acceptance region of the α-UMPU test is defined in terms of the hypothesized value θ0:

P(θ0 − c_{α/2}√(θ0(1−θ0)/n) ≤ X̄ ≤ θ0 + c_{α/2}√(θ0(1−θ0)/n); θ=θ0) = 1−α.

The fact is that there is only one θ* and an infinity of values θ0. As a result, the HT hypothetical reasoning is equally well-defined pre-data and post-data. In contrast, there is only one TSN, which has already played out post-data, rendering the coverage error probability degenerate, i.e. the observed CI:

[x̄ − c_{α/2}(σ/√n), x̄ + c_{α/2}(σ/√n)]   (56)

either includes or excludes the true value.
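The pre-data/post-data contrast can be illustrated by simulation. The sketch below uses illustrative parameters (θ*=.512, n=2000, α=.05 are assumptions for the demonstration): pre-data, roughly 95% of the random intervals cover θ*; post-data, each single observed interval simply does or does not contain it:

```python
import random
from statistics import NormalDist

random.seed(1)
theta_star, n, alpha = 0.512, 2000, 0.05
c = NormalDist().inv_cdf(1 - alpha / 2)    # c_{alpha/2}, about 1.96

covered, reps = 0, 1000
for _ in range(reps):
    # draw a Bernoulli sample and form the observed CI
    xbar = sum(random.random() < theta_star for _ in range(n)) / n
    se = (xbar * (1 - xbar) / n) ** 0.5
    lo, hi = xbar - c * se, xbar + c * se
    covered += (lo <= theta_star <= hi)    # each single CI: a 0/1 event

print(covered / reps)   # close to 0.95 pre-data; post-data each CI is 0 or 1
```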
Observed Confidence Intervals and Severity
In addition to addressing the fallacies of acceptance and rejection, the post-data severity evaluation can be used to address the issue of degenerate post-data error probabilities and the inability to distinguish between different values of θ within an observed CI; Mayo and Spanos (2006). This is accomplished by making two key changes:
(a) Replace the factual reasoning underlying CIs, which becomes degenerate post-data, with the hypothetical reasoning underlying the post-data severity evaluation.
(b) Replace any claims pertaining to overlaying the true θ* with severity-based inferential claims of the form:

θ > θ1 = θ0 + γ or θ ≤ θ1 = θ0 + γ, for some γ ≥ 0.   (57)

This can be achieved by relating the observed bounds:

x̄ ± c_{α/2}(σ/√n)   (58)

to particular values of θ1 associated with (57), say θ1 = x̄ − c_{α/2}(σ/√n), and evaluating the post-data severity of the claim:

θ > θ1 = x̄ − c_{α/2}(σ/√n).

A moment's reflection, however, suggests that the connection between establishing the warranted discrepancy from the null value θ0 and the observed CI (58) is more apparent than real, because the severity evaluation probability is attached to the inferential claim (57), and not to the particular value θ1. Hence, at best one would be creating
the illusion that the severity assessment on the boundary of a 1-sided observed CI is related to the confidence level; it is not! The truth of the matter is that the inferential claim (57) does not pertain directly to θ* but to the warranted discrepancy γ in light of x0.
Fallacious arguments for using Confidence Intervals
A. Confidence Intervals are more reliable than p-values
It is often argued in the social science statistical literature that CIs are more reliable than p-values because the latter are vulnerable to the large-n problem: as the sample size n → ∞, the p-value becomes smaller and smaller. Hence, one would reject any null hypothesis given enough observations. What these critics do not seem to realize is that a CI like:

P(X̄ − c_{α/2}(σ/√n) ≤ μ ≤ X̄ + c_{α/2}(σ/√n); μ=μ*) = 1−α

is equally vulnerable to the large-n problem, because the length of the interval:

[X̄ + c_{α/2}(σ/√n)] − [X̄ − c_{α/2}(σ/√n)] = 2c_{α/2}(σ/√n)

shrinks to zero as n → ∞.
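Both vulnerabilities can be seen side by side. The sketch below fixes a tiny discrepancy x̄ − μ0 = .01 (an illustrative assumption) and lets n grow: the two-sided p-value shrinks toward zero, and so does the CI length 2c_{α/2}(σ/√n):

```python
from statistics import NormalDist

N = NormalDist()
sigma, alpha = 1.0, 0.05
c = N.inv_cdf(1 - alpha / 2)          # c_{alpha/2}, about 1.96
delta = 0.01                          # tiny fixed discrepancy xbar - mu0

for n in (100, 10_000, 1_000_000):
    d = n ** 0.5 * delta / sigma      # test statistic for that discrepancy
    p = 2 * (1 - N.cdf(d))            # two-sided p-value -> 0 as n grows
    length = 2 * c * sigma / n ** 0.5 # CI length -> 0 as n grows
    print(n, round(p, 4), round(length, 4))
```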
B. The middle of a Confidence Interval is more probable?
Despite the obvious fact that post-data the probability of coverage is either zero or one, there have been numerous recurring attempts in the literature to discriminate among the different values of μ within an observed interval:
“... the difference between population means is much more likely to be near the middle of the confidence interval than towards the extremes.” (Gardner and Altman, 2000, p. 22)
This claim is clearly fallacious. A moment's reflection suggests that viewing all possible observed CIs (56) as realizations from a single distribution in (54) leads one to the conclusion that, unless the particular x̄ happens (by accident) to coincide with the true value μ*, values to the right or the left of x̄ will be more likely depending on whether x̄ is to the left or the right of μ*. Given that μ* is unknown, however, such an evaluation cannot be made in practice with a single realization. This is illustrated below with a CI for μ in the case of the simple (one parameter) Normal model.
C. Omnibus curves, consonance intervals and all that!
In addition to these obviously fallacious claims, there have been several more sophisticated attempts to relate the coverage error probability with the type I and II error probabilities, including the omnibus confidence curve (Birnbaum, 1961), the consonance interval curve (Kempthorne and Folks, 1971), and the p-value curve (Poole, 1987). However, all these attempts run into major conceptual befuddlements, primarily because they ignore the key difference in the underlying reasoning. All of them end up inadvertently switching from factual to hypothetical reasoning mid-stream, without realizing that this cannot be done while retaining the CI inferential claim.
As shown in Spanos (2005), in the case of a (1−2α) 2-sided CI there is a one-to-one mapping between α∈[0, 1] and μ1∈R via:

([1 − Φ(c_α)] = α) ⇒ (Φ⁻¹(1−α) = c_α),   (59)

which gives rise to:

Φ⁻¹(1−α) = c_α = √n(x̄−μ1)/σ = d(x0; μ1) ⇒ α(μ1) = 1 − Φ(d(x0; μ1)), for each μ1∈R.   (60)

This mapping between μ1∈R and α∈[0, 1] can be seen more clearly by:
(a) placing a particular value of μ, say μ1, on the boundary of the observed CI, i.e.

μ1 = x̄ − c_α(σ/√n) ⇒ c_α = √n(x̄−μ1)/σ = d(x0; μ1);

(b) holding μ1 constant, and solving to show that for each μ1 there exists an α such that c_α = d(x0; μ1). This confirms the relationship between α and the p-value, in the sense that the latter represents the smallest significance level at which the null would have been rejected.
That is, there is a one-to-one relationship between the different significance levels α [or c_α, the associated rejection threshold] and the observed value of the test statistic for a corresponding value of μ1.
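The mapping in (60) is straightforward to compute. The sketch below uses illustrative values (n=100, σ=1, x̄=.2 are assumptions): placing μ1 on the lower boundary of the observed CI and recovering the α at which it sits is an exact round trip:

```python
from statistics import NormalDist

N = NormalDist()
n, sigma, xbar = 100, 1.0, 0.2        # illustrative data (assumptions)

def alpha_of(mu1):
    # alpha(mu1) = 1 - Phi(d(x0; mu1)), the mapping in (60)
    return 1 - N.cdf(n ** 0.5 * (xbar - mu1) / sigma)

def mu1_of(alpha):
    # mu1 on the lower boundary of the observed CI: xbar - c_alpha*(sigma/sqrt(n))
    return xbar - N.inv_cdf(1 - alpha) * sigma / n ** 0.5

# round trip: each mu1 determines a unique alpha, and vice versa
for alpha in (0.01, 0.05, 0.1):
    assert abs(alpha_of(mu1_of(alpha)) - alpha) < 1e-9
print("one-to-one mapping verified")
```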
11 Pioneers of statistical testing
Karl Pearson (1857-1936)
Francis Edgeworth (1845-1926)
William Gosset (1876-1937)
R. A. Fisher (1890-1962)
Jerzy Neyman (1894-1981)
Egon Pearson (1895-1980)
Relevant references:
I Mayo, D. G. and A. Spanos (2006), “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” The British Journal for the Philosophy of Science, 57: 323-357.
I Mayo, D. G. and A. Spanos (2011), “Error Statistics,” pp. 151-196 in Handbook of Philosophy of Science, vol. 7: Philosophy of Statistics, D. Gabbay, P. Thagard, and J. Woods (eds.), Elsevier.
I Spanos, A. (1999), Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press, Cambridge.
I Spanos, A. (2006), “Where Do Statistical Models Come From? Revisiting the Problem of Specification,” pp. 98-119 in Optimality: The Second Erich L. Lehmann Symposium, J. Rojo (ed.), Lecture Notes-Monograph Series, vol. 49, Institute of Mathematical Statistics.
I Spanos, A. (2008), “Review of Stephen T. Ziliak and Deirdre N. McCloskey’s The Cult of Statistical Significance,” Erasmus Journal for Philosophy and Economics, 1: 154-164. http://ejpe.org/pdf/1-1-br2.pdf
I Spanos, A. (2011), “Misplaced Criticisms of Neyman-Pearson (N-P) Testing in the Case of Two Simple Hypotheses,” Advances and Applications in Statistical Science, 6: 229-242.
I Spanos, A. (2012), “Revisiting the Berger Location Model: Fallacious Confidence Interval or a Rigged Example?” Statistical Methodology, 9: 555-561.
I Spanos, A. (2012), “A Frequentist Interpretation of Probability for Model-Based Inductive Inference,” Synthese, 190: 1555-1585.
I Spanos, A. (2013), “Revisiting the Likelihoodist Evidential Account,” Journal of Statistical Theory and Practice, 7: 187-195.
I Spanos, A. (2013), “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” Philosophy of Science, 80: 73-93.
I Spanos, A. (2013), “The ‘Mixed Experiment’ Example Revisited: Fallacious Frequentist Inference or an Improper Statistical Model?” Advances and Applications in Statistical Sciences, 8: 29-47.
I Spanos, A. and A. McGuirk (2001), “The Model Specification Problem from a Probabilistic Reduction Perspective,” Journal of the American Agricultural Association, 83: 1168-1176.