PHIL 6334 - Probability/Statistics Lecture Notes 5:
Post-data severity evaluation
Aris Spanos [Spring 2014]
1 Introduction
Fallacies of Acceptance and Rejection
How is one supposed to interpret ‘accept H₀’ or ‘reject H₀’?
▶ Unfortunately, in fields like econometrics ‘accept H₀’ is routinely, but erroneously, interpreted as ‘data x₀ provide evidence for H₀’, and ‘reject H₀’ is routinely, but erroneously, interpreted as ‘data x₀ provide evidence for some alternative H₁’.
The problem is that neither of these evidential claims can be
justified, since they are both vulnerable to two classic fallacies.
(a) The fallacy of acceptance: no evidence against H₀ is misinterpreted as evidence for H₀.
This fallacy can easily arise in cases where the test in question has low power to detect discrepancies of interest, e.g. a small sample size n.
(b) The fallacy of rejection: evidence against H₀ is misinterpreted as evidence for a particular H₁.
This fallacy can easily arise in cases where the power of a test is very high, e.g. the case of a very large sample size n. This renders N-P rejections, as well as tiny p-values, with large n highly susceptible to this fallacy.
In the statistics literature, as well as in the secondary literatures in several applied fields, there have been numerous attempts to circumvent these two fallacies, but none succeeded.
The first successful attempt was made by Mayo (1996) by introducing the notion of a post-data severity evaluation.
2 The post-data severity evaluation
2.1 The notion of post-data severity
The post-data severity assessment aims to supplement frequentist testing with a view to bridging the gap between the p-value and the accept/reject rules, on the one hand, and providing evidence for or against a hypothesis, in the form of the discrepancy from the null warranted by data x₀, on the other.
▶ Its key difference from the Bayesian and likelihoodist approaches to testing is that it takes into account the generic capacity of the test in establishing the warranted discrepancy γ from the null.
▶ The intuition behind this notion is that a rejection of H₀ using a less (more) powerful test provides better (worse) evidence for a departure from H₀. Similarly, an acceptance of H₀ using a less (more) powerful test provides worse (better) evidence for no departure from H₀.
The severity evaluation is a post-data appraisal of the accept/reject and p-value results with a view to providing an evidential interpretation. It can be used to address not only the fallacies of acceptance and rejection but also several additional criticisms of N-P testing. The discussion that follows relies heavily on Mayo and Spanos (2006).
■ A hypothesis H passes a severe test T_α with data x₀ if:
(S-1) x₀ accords with H, and
(S-2) with very high probability, test T_α would have produced a result that accords less well with H than x₀ does, if H were false.
Severity can be viewed as a feature of a test T_α as it relates to a particular data x₀ and a specific claim H being considered. Hence, the severity function has three arguments, SEV(T_α, x₀, H), denoting the severity with which H passes T_α with x₀.
Example 1. Let us assume that the appropriate statistical model for data x₀ is the simple (one parameter) Normal model, where σ² is known (table 1).
Table 1 - Simple Normal (one parameter) Model
Statistical GM: Xₜ = μ + uₜ, t∈N:={1, 2, ...}
[1] Normality: Xₜ ~ N(., .), μ∈R,
[2] Constant mean: E(Xₜ) = μ,
[3] Constant variance: Var(Xₜ) = σ² [σ² known],
[4] Independence: {Xₜ, t∈N} is an independent process.
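The statistical GM in table 1 can be simulated directly. The following minimal Python sketch (an illustration, not part of the original notes; the values μ=12, σ=2, n=100 anticipate the numerical examples below) generates one realization from the simple Normal model:

```python
import random
from statistics import mean, stdev

# Simple Normal model: X_t = mu + u_t, with u_t ~ NIID(0, sigma^2), sigma known
mu, sigma, n = 12.0, 2.0, 100

random.seed(0)  # for reproducibility
x = [mu + random.gauss(0.0, sigma) for _ in range(n)]

# The sample mean should be close to mu and the sample std close to sigma
print(round(mean(x), 2), round(stdev(x), 2))
```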
Let us consider the hypotheses of interest:
H₀: μ = μ₀  vs.  H₁: μ > μ₀,   (1)
in the context of the simple Normal model (table 1). The optimal (UMP) test for these hypotheses is T_α := {d(X), C₁(α)}:
d(X) = √n(X̄ₙ − μ₀)/σ,  C₁(α) = {x: d(x) > c_α},   (2)
where X̄ₙ = (1/n)Σⁿₜ₌₁ Xₜ and c_α is the threshold rejection value. Given that:
d(X) = √n(X̄ₙ − μ₀)/σ ~ N(0, 1) under μ = μ₀,   (3)
one can evaluate the type I error probability (significance level) using:
P(d(X) > c_α; H₀ true) = α,
where α is the type I error probability; 0 < α < 1. To evaluate the type II error probability one needs to know the sampling distribution of d(X) when H₀ is false. However, since ‘H₀ is false’ refers to H₁: μ > μ₀, this evaluation will involve all values of μ₁ greater than μ₀ (i.e. μ₁ > μ₀):
β(μ₁) = P(d(X) ≤ c_α; H₀ false) = P(d(X) ≤ c_α; μ = μ₁), ∀μ₁ > μ₀.
The relevant sampling distribution takes the form:
d(X) = √n(X̄ₙ − μ₀)/σ ~ N(δ₁, 1) under μ = μ₁, where δ₁ = √n(μ₁ − μ₀)/σ, for all μ₁ > μ₀.   (4)
To use the Normal tables one needs to transform √n(X̄ₙ − μ₀)/σ into √n(X̄ₙ − μ₁)/σ using:
d(X) − δ₁ = √n(X̄ₙ − μ₀)/σ − √n(μ₁ − μ₀)/σ = √n(X̄ₙ − μ₁)/σ ~ N(0, 1) under μ = μ₁, for μ₁ > μ₀.   (5)
The power is defined by P(μ₁) = 1 − β(μ₁):
P(μ₁) = P(d(X) > c_α; μ = μ₁) =
 = P(√n(X̄ₙ − μ₁)/σ > c_α − √n(μ₁ − μ₀)/σ; μ = μ₁) =
 = P(Z > c_α − δ₁; μ = μ₁), for all μ₁ ≥ μ₀,
where Z is a generic standard Normal r.v., i.e. Z ~ N(0, 1).
For μ₀ = 12, σ = 2, α = .025 (c_α = 1.96), n = 100, with γ₁ = μ₁ − μ₀ and δ₁ = √n(μ₁ − μ₀)/σ, the power P(μ₁) = P(Z > c_α − δ₁) takes the values:
γ₁ = .1, δ₁ = .5:  P(12.1) = P(Z > 1.96 − .5)  = .072
γ₁ = .2, δ₁ = 1:   P(12.2) = P(Z > 1.96 − 1)   = .169
γ₁ = .3, δ₁ = 1.5: P(12.3) = P(Z > 1.96 − 1.5) = .323
γ₁ = .5, δ₁ = 2.5: P(12.5) = P(Z > 1.96 − 2.5) = .705
γ₁ = .7, δ₁ = 3.5: P(12.7) = P(Z > 1.96 − 3.5) = .938
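The power values above can be reproduced numerically. The sketch below (an illustration, not part of the original notes) uses Python's statistics.NormalDist for the standard Normal cdf Φ:

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard Normal cdf

def power(mu1, mu0=12.0, sigma=2.0, n=100, c_alpha=1.96):
    """Power P(mu1) = P(Z > c_alpha - delta1), delta1 = sqrt(n)(mu1 - mu0)/sigma."""
    delta1 = (n ** 0.5) * (mu1 - mu0) / sigma
    return 1 - Phi(c_alpha - delta1)

for mu1 in (12.1, 12.2, 12.3, 12.5, 12.7):
    print(mu1, round(power(mu1), 3))
# prints .072, .169, .323, .705, .938, matching the table above
```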
2.2 Severity in the case of reject H₀
Consider the case where μ₀ = 12, σ = 2, n = 100, α = .025 (c_α = 1.96), and x̄ₙ = 12.6.
Evaluating the test statistic yields:
d(x₀) = √100(12.6 − 12)/2 = 3.0,
which results in rejecting H₀: μ = 12. The p-value confirms the rejection since:
p(x₀) = P(d(X) > d(x₀); μ = 12) = .0013.
Let us evaluate the post-data severity in order to establish the discrepancy from the null warranted by test T_α and data x₀ (x̄ₙ = 12.6).
(S-1). The severity ‘accordance’ condition (S-1) implies that the rejection of H₀: μ = 12 with d(x₀) = 3.0 accords with H₁, and the relevant inferential claim is:
μ > μ₁ = μ₀ + γ, for some γ ≥ 0.   (6)
(S-2). To establish the particular discrepancy γ warranted by data x₀, the post-data severity ‘discordance’ condition:
"(S-2): with very high probability, test T_α would have produced a result that accords less well with μ > μ₁ than x₀ does, if μ > μ₁ were false,"
calls for evaluating the probability of the tail events:
"outcomes x that accord less well with μ > μ₁ than x₀ does",
i.e. [x: d(x) ≤ d(x₀)], giving rise to:
SEV(T_α; μ > μ₁) = P(d(X) ≤ d(x₀); ‘μ > μ₁’ false) =
 = P(d(X) ≤ d(x₀); ‘μ ≤ μ₁’ true) =
 = P(d(X) ≤ d(x₀); μ = μ₁).   (7)
To evaluate this probability we need to use the same distribution under the alternative (4) as in the case of the power, but now instead of using c_α as the threshold we will use d(x₀) and adjust it as in (5):
d(x₀) − δ₁ = √n(x̄ₙ − μ₁)/σ.   (8)
For instance, for a discrepancy γ = .1 the severity evaluation is:
SEV(T_α; μ > μ₁ = 12.1) = P(√n(X̄ₙ − 12.1)/σ ≤ √100(12.6 − 12.1)/2; μ = μ₁) = P(Z ≤ 2.5) = .994,
where Z ~ N(0, 1). Similarly, for a discrepancy γ = .5 the severity evaluation is:
SEV(T_α; μ > μ₁ = 12.5) = P(√n(X̄ₙ − 12.5)/σ ≤ √100(12.6 − 12.5)/2; μ = μ₁) = P(Z ≤ .5) = .691.
Table 2 reports several such severity evaluations for different discrepancies γ = .1, ..., 1.0.
The idea of using the post-data severity evaluation in the case of reject H₀ is to establish the largest warranted discrepancy from the null at a certain high threshold, say .90. In this case the discrepancy is γ ≤ .344.
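These evaluations can be checked with a short sketch (illustrative, not from the notes): it computes SEV(μ > μ₀ + γ) = Φ(d(x₀) − δ₁) and inverts it for the largest γ warranted at the .90 threshold:

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard Normal cdf

def sev_reject(gamma, mu0=12.0, sigma=2.0, n=100, xbar=12.6):
    """SEV(mu > mu0 + gamma) = P(Z <= d(x0) - delta1) after a rejection."""
    d_x0 = (n ** 0.5) * (xbar - mu0) / sigma   # = 3.0
    delta1 = (n ** 0.5) * gamma / sigma        # shift under mu = mu1
    return Phi(d_x0 - delta1)

print(round(sev_reject(0.1), 3))   # 0.994
print(round(sev_reject(0.5), 3))   # 0.691

# Largest discrepancy warranted at severity threshold .90:
# solve Phi(3.0 - sqrt(n)*gamma/sigma) = .90 for gamma
gamma_max = (3.0 - NormalDist().inv_cdf(0.90)) * 2.0 / (100 ** 0.5)
print(round(gamma_max, 3))         # 0.344
```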
▶ How does the post-data severity evaluation address the fallacy of rejection? By pointing out the warranted and unwarranted discrepancies from the null and specifying the relevant inferential claim.
2.3 Severity in the case of accept H₀
Consider the case where μ₀ = 12, σ = 2, n = 100, α = .025 (c_α = 1.96), and x̄ₙ = 12.1.
Evaluating the test statistic yields:
d(x₀) = √100(12.1 − 12)/2 = .5,
which results in accepting H₀: μ = 12. The p-value confirms the acceptance since:
p(x₀) = P(d(X) > d(x₀); μ = 12) = .309.
Let us evaluate the post-data severity in order to establish the discrepancy from the null warranted by test T_α and data x₀ yielding x̄ₙ = 12.1.
(S-1). The severity ‘accordance’ condition (S-1) implies that the acceptance of H₀: μ = 12 with d(x₀) = .5 accords with H₀, and the relevant inferential claim is:
μ ≤ μ₁ = μ₀ + γ, for some γ ≥ 0.   (9)
(S-2). To establish the particular discrepancy γ warranted by data x₀, the post-data severity ‘discordance’ condition:
"(S-2): with very high probability, test T_α would have produced a result that accords less well with μ ≤ μ₁ than x₀ does, if μ ≤ μ₁ were false,"
calls for evaluating the probability of the tail events:
"outcomes x that accord less well with μ ≤ μ₁ than x₀ does",
i.e. [x: d(x) > d(x₀)], giving rise to:
SEV(T_α; μ ≤ μ₁) = P(d(X) > d(x₀); ‘μ ≤ μ₁’ false) =
 = P(d(X) > d(x₀); ‘μ > μ₁’ true) =
 = P(d(X) > d(x₀); μ = μ₁).   (10)
For a discrepancy γ = .1 the severity evaluation is:
SEV(T_α; μ ≤ μ₁ = 12.1) = P(√n(X̄ₙ − 12.1)/σ > √100(12.1 − 12.1)/2; μ = μ₁) = P(Z > 0.0) = .500.
Similarly, for a discrepancy γ = .5 the severity evaluation is:
SEV(T_α; μ ≤ μ₁ = 12.5) = P(√n(X̄ₙ − 12.5)/σ > √100(12.1 − 12.5)/2; μ = μ₁) = P(Z > −2.0) = .977.
Table 3 reports several such severity evaluations for different discrepancies γ = −.3, ..., .7.
The idea of using the post-data severity evaluation in the case of accept H₀ is to establish the smallest warranted discrepancy from the null at a certain high threshold, say .90. In this case the discrepancy is γ ≥ .356.
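As in the reject case, these evaluations can be checked numerically; the sketch below (illustrative, not part of the notes) computes SEV(μ ≤ μ₀ + γ) = P(Z > d(x₀) − δ₁) and the smallest γ warranted at the .90 threshold:

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard Normal cdf

def sev_accept(gamma, mu0=12.0, sigma=2.0, n=100, xbar=12.1):
    """SEV(mu <= mu0 + gamma) = P(Z > d(x0) - delta1) after an acceptance."""
    d_x0 = (n ** 0.5) * (xbar - mu0) / sigma   # = 0.5
    delta1 = (n ** 0.5) * gamma / sigma        # shift under mu = mu1
    return 1 - Phi(d_x0 - delta1)

print(round(sev_accept(0.1), 3))   # 0.5
print(round(sev_accept(0.5), 3))   # 0.977

# Smallest discrepancy warranted at severity threshold .90:
# solve 1 - Phi(0.5 - sqrt(n)*gamma/sigma) = .90 for gamma
gamma_min = (0.5 + NormalDist().inv_cdf(0.90)) * 2.0 / (100 ** 0.5)
print(round(gamma_min, 3))         # 0.356
```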
▶ How does the post-data severity evaluation address the fallacy of acceptance? By pointing out the warranted and unwarranted discrepancies from the null and specifying the relevant inferential claim.
2.4 The large n problem
The large n problem was initially raised by Lindley (1957) in the context of the simple Normal model (table 1), where the variance σ₀² is assumed known, by pointing out:
[a] the large n problem: frequentist testing is susceptible to the "fallacious" result that there is always a large enough sample size n for which any point null, say H₀: μ = μ₀, will be rejected by a frequentist α-significance level test.
Lindley claimed that this result is paradoxical because, when viewed from the Bayesian perspective, one can show:
[b] the Jeffreys-Lindley paradox: for certain choices of the prior, the posterior probability of H₀, given a frequentist α-significance level rejection, will approach 1 as n→∞.
Claims [a] and [b] contrast the behavior of a frequentist test (p-value) and the posterior probability of H₀ as n→∞, highlighting a potential conflict between the frequentist and Bayesian accounts of evidence.
[c] Bayesian charge: a hypothesis that is well-supported by the Bayes factor can be (misleadingly) rejected by a frequentist test when n is large; see Berger and Sellke (1987), pp. 112-3.
A paradox? No! From the error statistical perspective:
(i) There is nothing fallacious about a small p-value, or a rejection of H₀, when n is large [it is a feature of a consistent frequentist test].
What is paradoxical is why the posterior probability of H₀ goes to 1 as n→∞, irrespective of the truth or falsity of H₀!
▶ Hence, the real problem does not lie with the p-value or the accept/reject rules as such, but with how such results are transformed into evidence for or against a particular hypothesis. The problem arises when such accept/reject results are detached from the test itself, and are treated as providing the same evidence for a particular alternative H₁, regardless of the power of the test in question, which depends crucially on n.
The large n problem can be addressed using the post-data severity evaluation. To illustrate that, consider the case where α = .025 (c_α = 1.96), σ = 1, and the observed value of the test statistic in (2) is d(x₀) = 1.97. In this case data x₀ result in rejecting H₀: μ = 12, and the p-value is:
p(x₀) = P(d(X) > 1.97; μ = 12) = .024.
In the traditional accounts of frequentist testing this result would be interpreted in the same way, irrespective of whether the sample size was n = 25, n = 100, or n = 400. The post-data severity evaluation, however, takes that into account because n affects the generic capacity (power) of the test. For instance, the severity of inferring μ > 12.1, associated with the same d(x₀) = 1.97, will be different for each sample size:
SEV(T_α; n = 25; μ > 12.1)  = P(Z ≤ 1.97 − √25(12.1 − 12)/1)  = .93
SEV(T_α; n = 100; μ > 12.1) = P(Z ≤ 1.97 − √100(12.1 − 12)/1) = .83
SEV(T_α; n = 400; μ > 12.1) = P(Z ≤ 1.97 − √400(12.1 − 12)/1) = .49
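The three evaluations above can be reproduced with a short sketch (illustrative, not part of the notes), showing how the same observed d(x₀) = 1.97 supports the claim μ > 12.1 less and less as n grows:

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard Normal cdf

def sev_large_n(n, d_x0=1.97, gamma=0.1, sigma=1.0):
    """Severity of inferring mu > mu0 + gamma from the same d(x0), at sample size n."""
    delta1 = (n ** 0.5) * gamma / sigma
    return Phi(d_x0 - delta1)

for n in (25, 100, 400):
    print(n, round(sev_large_n(n), 2))   # .93, .83, .49 respectively
```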
2.5 The problem with the p-value
Viewing the p-value from the severity vantage point, it can be defined as follows:
‘the p-value is the probability of all possible outcomes x∈Rⁿ that accord less well with H₀ than x₀ does, if H₀ were true.’
Hence, a small p-value can be related to the claim μ > μ₀ passing a severe test, because the probability that test T_α would have produced a result that accords less well with μ > μ₀ than x₀ does (x: d(x) ≤ d(x₀)), if μ > μ₀ were false (H₀ true):
SEV(T_α; x₀; μ > μ₀) = P(d(X) ≤ d(x₀); μ ≤ μ₀) = 1 − P(d(X) > d(x₀); μ = μ₀) = 1 − p(x₀)
is very high, i.e. p(x₀) is very low.
▶ Hence, the key problem with the p-value is that it establishes the existence of some discrepancy γ ≥ 0 from the null, but provides no information concerning its magnitude. The severity evaluation remedies that because it revolves around the discrepancy γ, being evaluated under the different values μ₁ associated with the inferential claim μ > μ₀ + γ. In this sense, the p-value can be related to a severity evaluation associated with the inferential claim μ > μ₀, where the implicit discrepancy is γ = 0, i.e. in the case of the p-value, SEV is implicitly evaluated under the null!
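The identity SEV(T_α; x₀; μ > μ₀) = 1 − p(x₀) can be verified numerically; the sketch below (not part of the notes) reuses the reject-H₀ example with d(x₀) = 3.0:

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard Normal cdf

d_x0 = 3.0                   # observed test statistic from the reject-H0 example
p_value = 1 - Phi(d_x0)      # P(d(X) > d(x0); mu = mu0), approximately .0013
sev_at_null = Phi(d_x0)      # SEV(mu > mu0): discrepancy gamma = 0

# The p-value is one minus the severity evaluated under the null
print(round(p_value, 4), round(sev_at_null, 4))
```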
3 Conclusions
Neither Fisher’s p-value, nor the N-P accept/reject rules can
provide such an evidential interpretation, primarily because
they are vulnerable to two serious fallacies.
(a) Fallacy of acceptance: no evidence against the null is
misinterpreted as evidence for it.
(b) Fallacy of rejection: evidence against the null is misinterpreted as evidence for a specific alternative.
These fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing, with a view to determining the discrepancy γ from the null warranted by data x₀. This establishes the warranted inferential claim [and thus, the unwarranted ones].
The severity assessment enables one to address the crucial
fallacies of acceptance and rejection as well as the potential
arbitrariness and possible abuse of:
[c] switching between one-sided, two-sided or simple-vs-
simple hypotheses,
[d] interchanging the null and alternative hypotheses,
[e] manipulating the level of significance in an attempt to
get the desired testing result,
[f] the relevant p-value,
[g] observed confidence intervals vs. severity evaluations.
Doesn’t the post-data severity evaluation change the origi-
nal threshold with a severity threshold? Aren’t both equally
arbitrary?
No! Any choice that can be discussed in the particular context between different modelers is neither arbitrary nor subjective, but it is debatable! The severity curve provides all possible discrepancies from the null, and the modeler can decide which threshold is appropriate in each case.