
# A. spanos slides ch14-2013 (4)


CHAPTER 14: Frequentist Hypothesis Testing: a Coherent Account by Aris Spanos [Fall 2013]

Published in: Technology, Education
### A. spanos slides ch14-2013 (4)

1. 1. CHAPTER 14: Frequentist Hypothesis Testing: a Coherent Account. Aris Spanos [Fall 2013]

1 Inherent difficulties in learning statistical testing

Statistical testing is arguably the most important, but also the most difficult and confusing, chapter of statistical inference for several reasons, including the following.

(i) The need to introduce numerous new notions, concepts and procedures before one can paint, even in broad brushes, a coherent picture of hypothesis testing.

(ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion.

(a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline.

(b) Inadequate knowledge by textbook writers, who often do not have the technical skills to read and understand the original sources and have to rely on second-hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot's guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student's t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background.

(c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact decision theory has much greater affinity with Bayesian than with frequentist inference.

(d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian statisticians who revel in (unfairly) maligning frequentist inference in their misguided attempt to motivate their preferred viewpoint of statistical inference. You will often hear such Bayesians tell you that "what you really want (require) is probabilities attached to hypotheses" and other such misadvised promptings!

In frequentist inference probabilities are always attached to the different possible values of the sample $\mathbf{x}\in\mathbb{R}_X^n$; never to hypotheses!

(iii) The discussion of frequentist testing is rather incomplete in so far as it has been beleaguered by serious foundational problems since
2. 2. the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the 'corrections' of those chastising the initial 'correctors'!

In an attempt to alleviate problem (i), the discussion that follows uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes 'red flag' pointers (¥) designed to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special attention to (iii), addressing some of the key foundational problems.

2 Francis Edgeworth

A typical example of a testing procedure at the end of the 19th century is provided by Edgeworth (1885). From today's perspective his testing takes the form of viewing data $\mathbf{x}_0:=(x_{11},x_{12},\ldots,x_{1n};\ x_{21},x_{22},\ldots,x_{2n})$ in the context of the following bivariate simple Normal model:

$$\mathbf{X}_t:=\begin{pmatrix}X_{1t}\\ X_{2t}\end{pmatrix}\sim\mathsf{NIID}\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix},\begin{pmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{pmatrix}\right),\quad t=1,2,\ldots,n,$$

$$E(X_{1t})=\mu_1,\quad E(X_{2t})=\mu_2,\quad Var(X_{1t})=Var(X_{2t})=\sigma^2,\quad Cov(X_{1t},X_{2t})=0.$$

The hypothesis of interest concerns the equality of the two means:

Hypothesis of interest: $\mu_1=\mu_2$.

The 2-dimensional sample is: $\mathbf{X}:=(X_{11},X_{12},\ldots,X_{1n};\ X_{21},X_{22},\ldots,X_{2n})$.

Common sense, combined with the statistical knowledge at the time, suggested using the difference between the estimated means:

$$\hat{\mu}_1-\hat{\mu}_2,\quad\text{where}\quad \hat{\mu}_1=\frac{1}{n}\sum_{t=1}^{n}X_{1t},\quad \hat{\mu}_2=\frac{1}{n}\sum_{t=1}^{n}X_{2t},$$

as a basis for deciding whether the two means are different or not. To render the difference $(\hat{\mu}_1-\hat{\mu}_2)$ free of the units of measurement, Edgeworth divided it by its standard deviation:

$$\sqrt{\widehat{Var}(\hat{\mu}_1-\hat{\mu}_2)}=\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_2^2},\quad \hat{\sigma}_1^2=\frac{1}{n}\sum_{t=1}^{n}(X_{1t}-\hat{\mu}_1)^2,\quad \hat{\sigma}_2^2=\frac{1}{n}\sum_{t=1}^{n}(X_{2t}-\hat{\mu}_2)^2,$$
3. 3. to define the distance function:

$$\xi(\mathbf{X})=\frac{|\hat{\mu}_1-\hat{\mu}_2|}{\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_2^2}}.$$

The last issue he needed to address is how big the observed distance $\xi(\mathbf{x}_0)$ should be to justify inferring that the two means are different. Edgeworth argued that $(\mu_1-\mu_2)\neq 0$ could not be 'accidental' (due to chance) if:

$$\xi(\mathbf{X})=\frac{|\hat{\mu}_1-\hat{\mu}_2|}{\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_2^2}}>2\sqrt{2}.\qquad(1)$$

Where did the threshold $2\sqrt{2}$ come from? The tail area of $\mathsf{N}(0,1)$ beyond $\pm 2\sqrt{2}$ is approximately 0.005, which was viewed as a reasonable value for the probability of a 'chance' error, i.e. the error of inferring a significant discrepancy when none exists. But why Normal? At the time statistical inference relied heavily on large sample size $n$ (asymptotic) results by invoking the Central Limit Theorem.

In summary, Edgeworth introduced 3 features of statistical testing:

(i) a hypothesis of interest: $\mu_1=\mu_2$,

(ii) the notion of a standardized distance: $\xi(\mathbf{X})$,

(iii) a threshold value for significance: $\xi(\mathbf{X})>2\sqrt{2}$.

3 Karl Pearson

The Pearson approach to statistics can be summarized as follows:

Data $\mathbf{x}_0:=(x_1,\ldots,x_n)$ $\Longrightarrow$ Histogram $\Longrightarrow$ Fitting a frequency curve from the Pearson family $f(x;\hat{\theta}_1,\hat{\theta}_2,\hat{\theta}_3,\hat{\theta}_4)$.

The Pearson family of frequency curves can be expressed in terms of the following differential equation:

$$\frac{d\ln f(x)}{dx}=\frac{(x-\theta_1)}{\theta_2+\theta_3x+\theta_4x^2}.\qquad(2)$$

Depending on the values taken by the parameters $(\theta_1,\theta_2,\theta_3,\theta_4)$, this equation can generate numerous frequency curves such as the Normal, the Student's t, the Beta, the Gamma, the Laplace, the Pareto, etc.
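As a numerical check on Edgeworth's threshold discussed earlier on this slide, the two-sided standard Normal tail area beyond $\pm 2\sqrt{2}$ can be computed directly. The sketch below is illustrative only; the function name `two_sided_normal_tail` is a hypothetical helper, not something taken from the slides.

```python
import math

def two_sided_normal_tail(c: float) -> float:
    """Return P(|Z| > c) for Z ~ N(0, 1).

    Uses the identity P(|Z| > c) = erfc(c / sqrt(2)).
    """
    return math.erfc(c / math.sqrt(2.0))

# Edgeworth's threshold for the standardized distance.
threshold = 2.0 * math.sqrt(2.0)
tail_area = two_sided_normal_tail(threshold)

print(f"threshold = {threshold:.4f}")
print(f"two-sided N(0,1) tail area = {tail_area:.5f}")
```

The computed tail area is close to the 0.005 quoted in the text, which is what motivates reading $2\sqrt{2}$ as a threshold for a 'chance' error.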
4. 4. The first four raw data moments $\hat{\mu}'_1$, $\hat{\mu}'_2$, $\hat{\mu}'_3$ and $\hat{\mu}'_4$ are sufficient to determine $(\hat{\theta}_1,\hat{\theta}_2,\hat{\theta}_3,\hat{\theta}_4)$, and that enables one to select the particular $f(x)$ from the Pearson family.

Having selected a particular frequency curve $f_0(x)$ on the basis of $f(x;\hat{\theta}_1,\hat{\theta}_2,\hat{\theta}_3,\hat{\theta}_4)=f(x;\hat{\boldsymbol{\theta}})$, Karl Pearson proposed to assess the appropriateness of this choice. His hypothesis of interest is of the form:

$$f_0(x)=f^*(x)\in\text{Pearson}(\theta_1,\theta_2,\theta_3,\theta_4),\quad f^*(x)\ \text{the true density}.$$

Pearson proposed the standardized distance function:

$$\eta(\mathbf{X})=\sum_{i=1}^{m}\frac{(\hat{f}_i-f_i)^2}{f_i}=n\sum_{i=1}^{m}\frac{\left((\hat{f}_i/n)-(f_i/n)\right)^2}{(f_i/n)}\sim\chi^2(m),\qquad(3)$$

where $(\hat{f}_i,\ i=1,2,\ldots,m)$ and $(f_i,\ i=1,2,\ldots,m)$ denote the empirical and assumed (as specified by $f_0(x)$) frequencies, and introduced the notion of a p-value:

$$P(\eta(\mathbf{X})>\eta(\mathbf{x}_0))=p(\mathbf{x}_0),\qquad(4)$$

as a basis for inferring whether the choice of $f_0(x)$ was appropriate or not. The smaller the $p(\mathbf{x}_0)$, the worse the fit.

¥ It is important to note that this hypothesis of interest, $f_0(x)=f^*(x)$, is not about a particular parameter, but about the adequacy of the choice of the probability model. In this sense, this was the first Mis-Specification (M-S) test for a distributional assumption!

Example. Mendel's cross-breeding experiments (1865-6) were based on pea-plants with different shape and color, which can be framed in terms of two Bernoulli random variables:

$X$: R (round) $=0$, W (wrinkled) $=1$; $\quad Y$: Y (yellow) $=0$, G (green) $=1$.

His theory of heredity was based on two assumptions:

(i) the two random variables $X$ and $Y$ are independent,

(ii) round ($X=0$) and yellow ($Y=0$) are dominant traits, but 'wrinkled' ($X=1$) and 'green' ($Y=1$) are recessive traits, with probabilities:

$$P(X=0)=P(Y=0)=0.75,\quad P(X=1)=P(Y=1)=0.25.$$

This substantive theory gives rise to Mendel's model based on the bivariate distribution below:
5. 5. Mendel's theory model $f(x,y)$:

| $x \backslash y$ | 0 | 1 | $f_x(x)$ |
|---|---|---|---|
| 0 | 0.5625 | 0.1875 | 0.750 |
| 1 | 0.1875 | 0.0625 | 0.250 |
| $f_y(y)$ | 0.750 | 0.250 | 1.00 |

Data. The 4 gene-pair experiments Mendel carried out with $n=556$ gave rise to the observed frequencies and relative frequencies given below.

Observed frequencies:

| (R,Y) | (R,G) | (W,Y) | (W,G) |
|---|---|---|---|
| (0,0) | (0,1) | (1,0) | (1,1) |
| 315 | 108 | 101 | 32 |

Observed relative frequencies $\hat{f}(x,y)$:

| $x \backslash y$ | 0 | 1 | $\hat{f}_x(x)$ |
|---|---|---|---|
| 0 | 0.5666 | 0.1942 | 0.7608 |
| 1 | 0.1817 | 0.0576 | 0.2393 |
| $\hat{f}_y(y)$ | 0.7483 | 0.2518 | 1.001 |

How adequate is Mendel's theory in light of the data? One can answer that question using Pearson's goodness-of-fit chi-square test. The chi-square test statistic

$$\eta(\mathbf{X})=n\sum_{i=1}^{m}\frac{\left((\hat{f}_i/n)-(f_i/n)\right)^2}{(f_i/n)}$$

measures how close the observed relative frequencies are to the expected ones. Using the above data this test statistic yields:

$$\eta(\mathbf{x}_0)=556\left(\frac{(0.5666-0.5625)^2}{0.5625}+\frac{(0.1942-0.1875)^2}{0.1875}+\frac{(0.1817-0.1875)^2}{0.1875}+\frac{(0.0576-0.0625)^2}{0.0625}\right)=0.463,$$

or 0.470 when computed from the raw counts rather than the rounded relative frequencies. The tail area of $\chi^2(3)$, $P(\eta(\mathbf{X})>0.463)=0.927$, suggests accordance with Mendel's theory.

Karl Pearson's primary contributions to testing were:

(a) the broadening of the scope of the hypothesis of interest, initiating misspecification testing with a goodness-of-fit test,

(b) a distance function whose distribution is known (at least asymptotically), and

(c) the use of the tail probability as a basis for deciding how good the fit with data $\mathbf{x}_0$ is.
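The Mendel calculation above can be reproduced from the raw counts. The following is a minimal sketch using only the Python standard library; the helper names `pearson_chi_square` and `chi2_upper_tail_3df` are hypothetical, and the tail area relies on the closed form of the chi-square distribution with 3 degrees of freedom rather than on tables.

```python
import math

# Observed counts for (R,Y), (R,G), (W,Y), (W,G) and Mendel's theoretical
# cell probabilities 0.5625, 0.1875, 0.1875, 0.0625.
observed = [315, 108, 101, 32]
theory_probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

def pearson_chi_square(counts, probs):
    """Sum of (observed - expected)^2 / expected over all cells."""
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))

def chi2_upper_tail_3df(x):
    """P(chi-square(3) > x), using the closed form valid for 3 df."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

n = sum(observed)
eta = pearson_chi_square(observed, theory_probs)
p_value = chi2_upper_tail_3df(eta)

print(f"n = {n}, eta(x0) = {eta:.3f}, tail area = {p_value:.3f}")
```

Working from the counts avoids the rounding in the relative frequencies, which is why the statistic comes out at about 0.470 rather than 0.463; either way the tail area is large and indicates accordance with the theory.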
6. 6. 4 William Gosset (aka 'Student')

Gosset's 1908 seminal paper provided the cornerstone upon which Fisher founded modern statistical inference. At that time it was known that in the case when $(X_1,X_2,\ldots,X_n)$ are NIID (the simple Normal model), the MLE estimator $\hat{\mu}:=\overline{X}_n=\frac{1}{n}\sum_{t=1}^{n}X_t$ had the following 'sampling' distribution:

$$\overline{X}_n\sim\mathsf{N}\left(\mu,\frac{\sigma^2}{n}\right)\ \Rightarrow\ \frac{\sqrt{n}(\overline{X}_n-\mu)}{\sigma}\sim\mathsf{N}(0,1).$$

It was also known that in the case where $\sigma^2$ is replaced by its estimator $s^2=\frac{1}{n-1}\sum_{t=1}^{n}(X_t-\overline{X}_n)^2$, the distribution of $\frac{\sqrt{n}(\overline{X}_n-\mu)}{s}$ is unknown, i.e.

$$\frac{\sqrt{n}(\overline{X}_n-\mu)}{s}\sim\mathsf{D}^{?}(0,1),$$

but asymptotically it can be approximated by:

$$\frac{\sqrt{n}(\overline{X}_n-\mu)}{s}\approx\mathsf{N}(0,1).$$

Gosset (1908) derived the unknown $\mathsf{D}^{?}(\cdot)$, showing that for any $n>1$:

$$\frac{\sqrt{n}(\overline{X}_n-\mu)}{s}\sim\mathsf{St}(n-1),\qquad(5)$$

where $\mathsf{St}(n-1)$ denotes a Student's t distribution with $(n-1)$ degrees of freedom. The subtle question that needs to be answered is: what does the result mean in light of the fact that $\mu$ is unknown?

This was the first finite sample result [a distributional result that is valid for any $n>1$], and it inspired R. A. Fisher to pioneer modern frequentist inference.

5 R. A. Fisher's significance testing

The result (5) was formally proved and extended by Fisher (1915) and used subsequently as a basis for several tests of hypotheses associated with a number of different statistical models in a series of papers. A key element in Fisher's framework was the introduction of the notion of a statistical model, whose generic form is:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\},\quad \mathbf{x}\in\mathbb{R}_X^n,\qquad(6)$$
7. 7. where $f(\mathbf{x};\boldsymbol{\theta}),\ \mathbf{x}\in\mathbb{R}_X^n$ denotes the (joint) distribution of the sample $\mathbf{X}:=(X_1,\ldots,X_n)$ that encapsulates the prespecified probabilistic structure of the underlying stochastic process $\{X_t,\ t\in\mathbb{N}\}$. The link to the phenomenon of interest comes from viewing data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$ as a 'truly typical' realization of the process $\{X_t,\ t\in\mathbb{N}\}$.

Fisher used the result (5) to construct a test of significance for the null hypothesis:

$$H_0:\ \mu=\mu_0,\qquad(7)$$

in the context of the simple Normal model (table 1).

| Table 1 - The simple Normal model | |
|---|---|
| Statistical GM: | $X_t=\mu+u_t,\ t\in\mathbb{N}$ |
| [1] Normal: | $X_t\sim\mathsf{N}(\cdot,\cdot)$ |
| [2] Constant mean: | $E(X_t)=\mu$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\sigma^2$, for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ is an independent process |

¥ Fisher must have realized that the result in (5) by Gosset could not be used directly as a basis for a test because it involved the unknown parameter $\mu$. One can only speculate, but it must have dawned on him that the result can be rendered operational if it is interpreted in terms of factual reasoning, in the sense that it holds under the True State of Nature (TSN) ($\mu=\mu^*$, the true value). Note that, in general, the expression '$\mu^*$ denotes the true value of $\mu$' is a shorthand for saying that 'data $\mathbf{x}_0$ constitute a realization of the sample $\mathbf{X}$ with distribution $f(\mathbf{x};\mu^*)$'. That is, (5) is a pivot (pivotal quantity):

$$\tau(\mathbf{X};\mu)=\frac{\sqrt{n}(\overline{X}_n-\mu^*)}{s}\overset{\text{TSN}}{\sim}\mathsf{St}(n-1),\qquad(8)$$

in the sense that it is a function of both $\mathbf{X}$ and $\mu$ whose distribution under the TSN is known.

Fisher (1956), p. 47, was very explicit about the nature of reasoning underlying his significance testing: "In general, tests of significance are based on hypothetical probabilities calculated from their null hypotheses." [emphasis in the original]. His problem was to adapt the result in (8) to hypothetical reasoning with a view to constructing a test. Fisher accomplished that in three steps.
8. 8. Step 1. He replaced the unknown $\mu^*$ in (8) with the known null value $\mu_0$, transforming the pivot $\tau(\mathbf{X};\mu)$ into a statistic $\tau(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{s}$. A statistic is a function of the sample $\mathbf{X}:=(X_1,X_2,\ldots,X_n)$ only.

Step 2. He adapted the factual reasoning underlying the pivot (8) into hypothetical reasoning by evaluating the test statistic under $H_0$:

$$\tau(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{s}\overset{H_0}{\sim}\mathsf{St}(n-1).\qquad(9)$$

Step 3. He employed (9) to define the p-value:

$$P(\tau(\mathbf{X})>\tau(\mathbf{x}_0);\ H_0)=p(\mathbf{x}_0),\qquad(10)$$

as an indicator of disagreement (inconsistency, contradiction) between data $\mathbf{x}_0$ and $H_0$; the bigger the value of $\tau(\mathbf{x}_0)$, the smaller the p-value.

£ It is crucial to note that it is a serious mistake, committed often by misguided authors, to interpret the above probabilistic statement defining the p-value as conditional on the null hypothesis $H_0$. Avoid the highly misleading notation

$$P(\tau(\mathbf{X})>\tau(\mathbf{x}_0)\mid H_0)=p(\mathbf{x}_0),$$

which uses the vertical line (|) instead of a semi-colon (;). Conditioning on $H_0$ makes no sense in frequentist inference since $\mu$ is not a random variable; the evaluation in (10) is made under the scenario that $H_0$: $\mu=\mu_0$ is true.

Although it is impossible to pin down Fisher's interpretation of the p-value, because his views changed over time and he held several interpretations at any one time, there is one fixed point among his numerous articulations (Fisher, 1925, 1955): 'a small p-value can be interpreted as a simple logical disjunction: either an extremely rare event has occurred or $H_0$ is not true'.

The focus on "a small p-value" needs to be read in conjunction with Fisher's falsificationist stance about testing, in the sense that significance tests can falsify but never verify hypotheses (Fisher, 1955): "... tests of significance, when used accurately, are capable of rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but that they are never capable of establishing them as certainly true."
9. 9. His logical disjunction combined with the falsificationist stance is not helpful in shedding light on the key issue of learning from data: what do data $\mathbf{x}_0$ convey about the truth or falsity of $H_0$? Occasionally, Fisher would go further and express what sounds more like an aspiration for an evidential interpretation: "The actual value of p ... indicates the strength of evidence against the hypothesis" (see Fisher, 1925, p. 80), but he never articulated a coherent evidential interpretation based on the p-value. Indeed, as late as 1956 he offered 'a degree of rational disbelief' interpretation (Fisher, 1956, pp. 46-47).

¥ A more pertinent interpretation of the p-value (section 8) is: 'the p-value is the probability, evaluated under the scenario that $H_0$ is true, of all possible outcomes $\mathbf{x}\in\mathbb{R}_X^n$ that accord less well with $H_0$ than $\mathbf{x}_0$ does.'

In light of the fact that data $\mathbf{x}_0$ were generated by the 'true' ($\boldsymbol{\theta}=\boldsymbol{\theta}^*$) statistical Data Generating Mechanism (DGM):

$$\mathcal{M}^*(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}^*)\},\quad \mathbf{x}\in\mathbb{R}_X^n,$$

and the test statistic $\tau(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{s}$, in which $\overline{X}_n$ estimates $\mu^*$, measures the standardized difference between $\mu^*$ and $\mu_0$, the p-value $P(\tau(\mathbf{X})>\tau(\mathbf{x}_0);\ H_0)=p(\mathbf{x}_0)$ seeks to measure the discordance between $H_0$ and $\mathcal{M}^*(\mathbf{x})$.

This precludes several erroneous interpretations of the p-value:

£ The p-value is the probability that $H_0$ is true.

£ $1-p(\mathbf{x}_0)$ is the probability that $H_1$ is true.

£ The p-value is the probability that the particular result is due to 'chance error'.

£ The p-value is an indicator of the size or the substantive importance of the observed effect.

£ The p-value is the conditional probability of obtaining a test statistic $\tau(\mathbf{x})$ more extreme than $\tau(\mathbf{x}_0)$ given $H_0$; there is no conditioning involved.

Example 1. Consider testing the null hypothesis $H_0$: $\mu=70$ in the context of the simple Normal model (see table 1), using the exam score data in table 1.6 (see chapter 1), where $\hat{\mu}_n=71.686$, $s^2=13.606$ and $n=70$. The test statistic (9) yields:

$$\tau(\mathbf{x}_0)=\frac{\sqrt{70}(71.686-70)}{\sqrt{13.606}}=3.824,\qquad P(\tau(\mathbf{X})>3.824;\ \mu_0=70)=0.00007,$$
10. 10. where $p(\mathbf{x}_0)=0.00007$ is found from the $\mathsf{St}(69)$ tables. The tiny p-value suggests that $\mathbf{x}_0$ indicate strong discordance with $H_0$.

| Table 2 - The simple Bernoulli model | |
|---|---|
| Statistical GM: | $X_t=\theta+u_t,\ t\in\mathbb{N}$ |
| [1] Bernoulli: | $X_t\sim\mathsf{Ber}(\cdot,\cdot)$ |
| [2] Constant mean: | $E(X_t)=\theta$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\theta(1-\theta)$, for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ is an independent process |

Example 2. Arbuthnot's 1710 conjecture: the ratio of males to females in newborns might not be 'fair'. This can be tested using the statistical null hypothesis:

$$H_0:\ \theta=\theta_0,\quad\text{where } \theta_0=0.5 \text{ denotes 'fair'},$$

in the context of the simple Bernoulli model (table 2), based on the random variable $X$ defined by: {male} $=\{X=1\}$, {female} $=\{X=0\}$.

Using the MLE of $\theta$, $\hat{\theta}_n:=\overline{X}_n=\frac{1}{n}\sum_{t=1}^{n}X_t$, whose sampling distribution is:

$$\overline{X}_n\sim\mathsf{Bin}\left(\theta,\frac{\theta(1-\theta)}{n};\ n\right),$$

one can derive the test statistic:

$$d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\overset{H_0}{\sim}\mathsf{Bin}(0,1;\ n).\qquad(11)$$

Data: $n=30762$ newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14833 girls. The test statistic takes the form (11), and with $\hat{\theta}_n=\frac{16029}{30762}=0.521$:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}(0.521-0.5)}{\sqrt{0.5(0.5)}}=7.366,\qquad P(d(\mathbf{X})>7.366;\ \theta=0.5)=0.00000017.$$

The tiny p-value indicates strong discordance with $H_0$.

In general one needs to specify a certain threshold for deciding when the p-value is 'small enough' to falsify $H_0$. Fisher suggested several such thresholds, 0.01, 0.025, 0.05, 0.1, but insisted that the choice has to be made on a case by case basis.
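The arithmetic behind the test statistic (11) in Example 2 can be sketched as follows. The function name `bernoulli_test_statistic` is a hypothetical helper; the sketch only reproduces the standardization and shows how much the result depends on rounding $\hat{\theta}_n$ to three decimals.

```python
import math

def bernoulli_test_statistic(theta_hat: float, theta0: float, n: int) -> float:
    """Standardized distance sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))."""
    return math.sqrt(n) * (theta_hat - theta0) / math.sqrt(theta0 * (1.0 - theta0))

n_births = 30762      # sample size quoted in the example
boys = 16029          # number of male births quoted in the example
theta0 = 0.5          # 'fair' ratio under the null

theta_hat = boys / n_births

# Statistic with the estimate rounded to three decimals, as in the example.
d_rounded = bernoulli_test_statistic(round(theta_hat, 3), theta0, n_births)

# Statistic with the unrounded estimate.
d_unrounded = bernoulli_test_statistic(theta_hat, theta0, n_births)

print(f"theta_hat = {theta_hat:.5f}")
print(f"d(x0), rounded estimate   = {d_rounded:.3f}")
print(f"d(x0), unrounded estimate = {d_unrounded:.3f}")
```

Both versions place the observed proportion more than seven standard deviations away from 0.5, so the conclusion of strong discordance with the null does not hinge on the rounding.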
11. 11. The main elements of a Fisher significance test are $\{\tau(\mathbf{X}),\ p(\mathbf{x}_0)\}$.

Components of a Fisher significance test:

(i) a prespecified statistical model: $\mathcal{M}_{\theta}(\mathbf{x})$,

(ii) a null hypothesis ($H_0$: $\theta=\theta_0$),

(iii) a test statistic (distance function) $\tau(\mathbf{X})$,

(iv) the distribution of $\tau(\mathbf{X})$ under $H_0$,

(v) the p-value $P(\tau(\mathbf{X})>\tau(\mathbf{x}_0);\ H_0)=p(\mathbf{x}_0)$,

(vi) a threshold value $c_0$ [e.g. 0.01, 0.025, 0.05], such that: $p(\mathbf{x}_0)<c_0$ $\Rightarrow$ $\mathbf{x}_0$ falsifies (rejects) $H_0$.

Example. Consider a Fisher test based on a simple (one parameter) Normal model (table 3).

| Table 3 - The simple (one parameter) Normal model | |
|---|---|
| Statistical GM: | $X_t=\mu+u_t,\ t\in\mathbb{N}$ |
| [1] Normal: | $X_t\sim\mathsf{N}(\cdot,\cdot)$ |
| [2] Constant mean: | $E(X_t)=\mu$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\sigma^2$ (known), for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ is an independent process |

(i) $\mathcal{M}_{\theta}(\mathbf{x})$: $X_t\sim\mathsf{N}(\mu,\sigma^2)$ [$\sigma^2$ is known], $t=1,2,\ldots,n$,

(ii) null hypothesis: $H_0$: $\mu=\mu_0$ (e.g. $\mu_0=0$),

(iii) a test statistic: $d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma}$,

(iv) the distribution of $d(\mathbf{X})$ under $H_0$: $d(\mathbf{X})\overset{H_0}{\sim}\mathsf{N}(0,1)$,

(v) the p-value: $P(d(\mathbf{X})>d(\mathbf{x}_0);\ H_0)=p(\mathbf{x}_0)$,

(vi) a threshold value for discordance, say $c_0=0.05$.

[Figure: two standard Normal density plots (Mean=0, StDev=1). Left panel: $P(X<-1.64)=0.05$. Right panel: $P(X>1.96)=0.025$.]
12. 12. 6 The Neyman-Pearson (N-P) framework

Neyman and Pearson (1933) motivated their own approach to testing as an attempt to improve upon Fisher's testing by addressing what they considered to be ad hoc features of that approach:

[a] Fisher's ad hoc choice of a test statistic $d(\mathbf{X})$, which pertains to the optimality of tests,

[b] Fisher's use of a post-data [$d(\mathbf{x}_0)$ is known] threshold for the p-value to indicate falsification (rejection) of $H_0$, and

[c] Fisher's denial that $d(\mathbf{x}_0)$ can indicate confirmation (accordance) of $H_0$.

Neyman and Pearson proposed solutions for [a]-[c] by:

¥ [i] Introducing the notion of an alternative hypothesis $H_1$ as the complement to the null with respect to $\Theta$, within the same statistical model:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\},\quad \mathbf{x}\in\mathbb{R}_X^n.\qquad(12)$$

[ii] Replacing the post-data p-value and its threshold with a pre-data [before $d(\mathbf{x}_0)$ is available] significance level $\alpha$ determining the decision rules: (i) Reject $H_0$, (ii) Accept $H_0$, which are calibrated using the error probabilities:

type I: $P(\text{Reject } H_0 \text{ when } H_0 \text{ is true})=\alpha$,

type II: $P(\text{Accept } H_0 \text{ when } H_0 \text{ is false})=\beta$.

These error probabilities enabled Neyman and Pearson to introduce the key notion of an optimal N-P test: one that minimizes $\beta$ subject to a fixed $\alpha$, i.e.

[iii] Selecting the particular $d(\mathbf{X})$, in conjunction with a rejection region for a given $\alpha$, that has the highest pre-data capacity $(1-\beta)$ to detect discrepancies in the direction of $H_1$.

In this sense, the N-P proposed solutions to the perceived weaknesses of Fisher's significance testing have changed the original Fisher framework in four important respects:

¥ (a) N-P testing takes place within a prespecified $\mathcal{M}_{\theta}(\mathbf{x})$ by partitioning it into the null and alternative hypotheses,
13. 13. (b) Fisher's post-data p-value has been replaced with a pre-data $\alpha$ and $\beta$,

(c) the combination of (a)-(b) renders ascertainable the pre-data capacity (power) of the test to detect discrepancies in the direction of $H_1$,

(d) Fisher's evidential aspirations for the p-value based on discordance were replaced by the more behavioristic accept/reject decision rules that aim to provide input for alternative actions: when accepting $H_0$ take action A, when rejecting $H_0$ take action B.

As claimed by Neyman and Pearson (1928): "the tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision."

6.1 The archetypal N-P hypotheses specification

In N-P testing there has been a long debate concerning the proper way to specify the null and alternative hypotheses. The specification is often thought to be rather arbitrary and subject to abuse. This view stems primarily from an inadequate understanding of the role of statistical vs. substantive information.

¥ It is argued that there is nothing arbitrary about specifying the null and alternative hypotheses in N-P testing. The default alternative is always the complement to the null relative to the parameter space of $\mathcal{M}_{\theta}(\mathbf{x})$.

Consider a generic statistical model:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\},\quad \mathbf{x}\in\mathcal{X}:=\mathbb{R}_X^n.$$

The archetypal way to specify the null and alternative hypotheses for N-P testing is:

$$H_0:\ \theta\in\Theta_0\quad\text{vs.}\quad H_1:\ \theta\in\Theta_1,\qquad(13)$$

where $\Theta_0$ and $\Theta_1$ constitute a partition of the parameter space $\Theta$: $\Theta_0\cap\Theta_1=\emptyset$, $\Theta_0\cup\Theta_1=\Theta$. In the case where the set $\Theta_0$ or $\Theta_1$ contains a single point, say $\Theta_0=\{\theta_0\}$, that determines $f(\mathbf{x};\theta_0)$, the hypothesis is said to be simple; otherwise it is composite.

Example. For the simple Bernoulli model, $H_0$: $\theta=\theta_0$ is simple, but $H_1$: $\theta\neq\theta_0$ is composite.
14. 14. ¥ An equivalent formulation, which brings out most clearly the fact that N-P testing poses questions pertaining to the 'true' statistical Data Generating Mechanism (DGM) $\mathcal{M}^*(\mathbf{x})=\{f^*(\mathbf{x})\},\ \mathbf{x}\in\mathcal{X}$, is to re-write (13) in the equivalent form:

$$H_0:\ f^*(\mathbf{x})\in\mathcal{M}_0(\mathbf{x})\quad\text{vs.}\quad H_1:\ f^*(\mathbf{x})\in\mathcal{M}_1(\mathbf{x}),$$

where $f^*(\mathbf{x})$ denotes the 'true' distribution of the sample, and:

$$\mathcal{M}_0(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta_0\}\quad\text{and}\quad\mathcal{M}_1(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta_1\},\quad \mathbf{x}\in\mathcal{X}:=\mathbb{R}_X^n,$$

constitute a partition of $\mathcal{M}_{\theta}(\mathbf{x})$. $\mathcal{P}(\mathbf{x})$ denotes the set of all possible statistical models that could have given rise to data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$.

[Figure: N-P testing within $\mathcal{M}_{\theta}(\mathbf{x})$: the set $\mathcal{P}(\mathbf{x})$, with $\mathcal{M}_{\theta}(\mathbf{x})$ partitioned into $H_0$ and $H_1$.]

¥ When properly understood in the context of a statistical model $\mathcal{M}_{\theta}(\mathbf{x})$, N-P testing leaves very little leeway in the specification of the $H_0$ and $H_1$ hypotheses. This is because the whole of $\Theta$ [all possible values of $\theta$] is relevant on statistical grounds, despite the fact that only a small subset is often deemed relevant on substantive grounds. The reasoning underlying this argument is that N-P tests invariably pose hypothetical questions (framed in terms of $\theta\in\Theta$) pertaining to the true statistical DGM $\mathcal{M}^*(\mathbf{x})=\{f^*(\mathbf{x})\},\ \mathbf{x}\in\mathcal{X}$. Partitioning $\mathcal{M}_{\theta}(\mathbf{x})$ provides the most effective way to learn from data by zeroing in on the true $\theta$, in the same way one can zero in on an unknown prespecified city on a map using sequential partitioning.

¥ This foregrounds the importance of securing the statistical adequacy of $\mathcal{M}_{\theta}(\mathbf{x})$ [validating its probabilistic assumptions] before it provides the basis of any inference. Hence, whenever the union of $\Theta_0$ and $\Theta_1$, the values of $\theta$ postulated by $H_0$ and $H_1$, respectively, do not
15. 15. exhaust $\Theta$, there is always the possibility that the true $\theta$ lies in the complement: $\Theta-(\Theta_0\cup\Theta_1)$.

What constitutes an N-P test? A misleading idea that needs to be done away with immediately is that an N-P test is a formula with some associated statistical tables. It is not! It is a combination of a test statistic, whose sampling distribution under the null and alternative hypotheses is tractable, in conjunction with a rejection region.

Test statistic. Like an estimator, an N-P test statistic is a mapping from the sample space ($\mathcal{X}:=\mathbb{R}_X^n$) to the parameter space ($\Theta$):

$$d(\cdot):\ \mathcal{X}\rightarrow\mathbb{R},$$

that partitions the sample space into an acceptance region $C_0$ and a rejection region $C_1$: $C_0\cap C_1=\emptyset$, $C_0\cup C_1=\mathcal{X}$, in a way that corresponds to $\Theta_0$ and $\Theta_1$, respectively:

$$\mathcal{X}=\left\{\begin{matrix}C_0\leftrightarrow\Theta_0\\ C_1\leftrightarrow\Theta_1\end{matrix}\right\}=\Theta.$$

The N-P decision rules, based on a test statistic $d(\mathbf{X})$, take the form:

$$\text{[i] if } \mathbf{x}_0\in C_0,\ \text{accept } H_0;\qquad\text{[ii] if } \mathbf{x}_0\in C_1,\ \text{reject } H_0,\qquad(14)$$

and the associated error probabilities are:

type I: $P(\mathbf{x}_0\in C_1;\ H_0(\theta)\text{ true})=\alpha(\theta)$, for $\theta\in\Theta_0$,

type II: $P(\mathbf{x}_0\in C_0;\ H_1(\theta_1)\text{ true})=\beta(\theta_1)$, for $\theta_1\in\Theta_1$.

| True State of Nature | N-P rule: Accept $H_0$ | N-P rule: Reject $H_0$ |
|---|---|---|
| $H_0$ true | √ | Type I error |
| $H_0$ false | Type II error | √ |

£ Some misinformed statistics books go to great lengths to explain that the above error probabilities should be viewed as conditional on $H_0$ or $H_1$, and they use the misleading notation (|):

$$P(\mathbf{x}_0\in C_1\mid H_0(\theta)\text{ true})=\alpha(\theta),\ \text{for }\theta\in\Theta_0.$$
16. 16. Such conditioning makes no sense in a frequentist framework because $\theta$ is treated as an unknown constant. Instead, the error probabilities should be interpreted as evaluations of the sampling distribution of the test statistic under different scenarios pertaining to different hypothesized values of $\theta$.

¥ Some statistics books make a big deal of the distinction 'accept $H_0$' vs. 'fail to reject $H_0$', in their commendable attempt to bring out the problem of misinterpreting 'accept $H_0$' as tantamount to 'there is evidence for $H_0$'; this is known as the fallacy of acceptance. By the same token there is an analogous distinction between 'reject $H_0$' vs. 'accept $H_1$', which highlights the problem of misinterpreting 'reject $H_0$' as tantamount to 'there is evidence for $H_1$'; this is known as the fallacy of rejection. What is objectionable about this practice is that the verbal distinctions by themselves do nothing but perpetuate these fallacies; we need to address them, not play clever with words!

| Table 3 - The simple (one parameter) Normal model | |
|---|---|
| Statistical GM: | $X_t=\mu+u_t,\ t\in\mathbb{N}$ |
| [1] Normal: | $X_t\sim\mathsf{N}(\cdot,\cdot)$ |
| [2] Constant mean: | $E(X_t)=\mu$, for all $t\in\mathbb{N}$ |
| [3] Constant variance: | $Var(X_t)=\sigma^2$ (known), for all $t\in\mathbb{N}$ |
| [4] Independence: | $\{X_t,\ t\in\mathbb{N}\}$ is an independent process |

Example 1. Consider the following hypotheses in the context of the simple (one parameter) Normal model:

$$H_0:\ \mu\leq\mu_0\quad\text{vs.}\quad H_1:\ \mu>\mu_0.$$

Let us choose a test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ of the form:

$$\text{test statistic: } d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma},\ \text{where } \overline{X}_n=\frac{1}{n}\sum_{t=1}^{n}X_t,\qquad(15)$$

$$\text{rejection region: } C_1(\alpha)=\{\mathbf{x}:\ d(\mathbf{x})>c_\alpha\}.\qquad(16)$$

To evaluate the two types of error probabilities, the distribution of $d(\mathbf{X})$ under both the null and the alternatives is needed:

$$\text{[I] } d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma}\overset{\mu=\mu_0}{\sim}\mathsf{N}(0,1),\qquad\text{[II] } d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma}\overset{\mu=\mu_1}{\sim}\mathsf{N}(\delta_1,1),\ \text{for all }\mu_1>\mu_0,\qquad(17)$$
17. 17. where the non-zero mean $\delta_1$ takes the form:

$$\delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}>0,\ \text{for all }\mu_1>\mu_0.$$

The evaluation of the type I error probability is based on (I):

$$\alpha=\max_{\mu\leq\mu_0}P(d(\mathbf{X})>c_\alpha;\ H_0(\mu))=P(d(\mathbf{X})>c_\alpha;\ \mu=\mu_0).$$

Notice that in cases where the null is of the form $H_0$: $\mu\leq\mu_0$, the type I error probability is defined as the maximum over all values in the interval $(-\infty,\mu_0]$, which turns out to be the one evaluated at the last point of the interval, $\mu=\mu_0$. Hence, the same test will also be optimal in the case where the null is of the form $H_0$: $\mu=\mu_0$.

The evaluation of the type II error probabilities is based on (II):

$$\beta(\mu_1)=P(d(\mathbf{X})\leq c_\alpha;\ H_1(\mu_1)),\ \text{for all }\mu_1>\mu_0.$$

The power [the probability of rejecting the null when false] is equal to $1-\beta(\mu_1)$, i.e.

$$\pi(\mu_1)=P(d(\mathbf{X})>c_\alpha;\ H_1(\mu_1)),\ \text{for all }\mu_1>\mu_0.$$

Why do we care about power? The power $\pi(\mu_1)$ measures the pre-data (generic) capacity of test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ to detect a discrepancy, say $\gamma=\mu_1-\mu_0=0.1$, when present. Hence, when $\pi(\mu_1)=0.35$ this test has very low capacity to detect such a discrepancy.

How is this information helpful in practice? If $\gamma=0.1$ is the discrepancy of substantive interest, this test is practically useless for that purpose, because we know beforehand that it does not have enough capacity to detect $\gamma$ even if present!
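The power expression above follows from (II): since $d(\mathbf{X})\sim\mathsf{N}(\delta_1,1)$ under $\mu=\mu_1$, the power equals $1-\Phi(c_\alpha-\delta_1)$. A minimal sketch of that calculation is given below; the function name `power_one_sided_z` and the specific values ($\mu_0=10$, $\sigma=1$, $\gamma=0.1$, the grid of sample sizes) are illustrative assumptions, not figures from the slides.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # N(0, 1)

def power_one_sided_z(mu1: float, mu0: float, sigma: float, n: int, alpha: float) -> float:
    """Power of the test that rejects when d(X) > c_alpha, evaluated at mu = mu1.

    Under mu = mu1, d(X) ~ N(delta1, 1) with delta1 = sqrt(n)(mu1 - mu0)/sigma,
    so P(d(X) > c_alpha) = 1 - Phi(c_alpha - delta1).
    """
    c_alpha = STANDARD_NORMAL.inv_cdf(1.0 - alpha)
    delta1 = math.sqrt(n) * (mu1 - mu0) / sigma
    return 1.0 - STANDARD_NORMAL.cdf(c_alpha - delta1)

# Hypothetical setting: discrepancy gamma = 0.1 with sigma = 1, alpha = 0.05.
mu0, sigma, gamma, alpha = 10.0, 1.0, 0.1, 0.05
sample_sizes = [25, 100, 400, 1600]
powers = [power_one_sided_z(mu0 + gamma, mu0, sigma, n, alpha) for n in sample_sizes]

for n, pw in zip(sample_sizes, powers):
    print(f"n = {n:5d}: power to detect gamma = {gamma} is {pw:.3f}")
```

Under these assumed values the power is low for small samples and rises with $n$, which is the sense in which a test can be known, before any data are seen, to lack the capacity to detect a discrepancy of interest.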
18. 18. What can one do in such a case? The power of the above test is monotonically increasing with $\delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}$, and thus increasing the sample size $n$ increases the power.

Hypothesis testing reasoning.

¥ As with Fisher's significance testing, the reasoning underlying N-P testing is hypothetical, in the sense that both error probabilities are based on evaluating the relevant test statistic under several hypothetical scenarios, e.g. under $H_0$ ($\mu=\mu_0$) or under $H_1$ ($\mu=\mu_1$) for all $\mu_1>\mu_0$. These hypothetical sampling distributions are then used to compare $H_0$ or $H_1$, via $d(\mathbf{x}_0)$, to the True State of Nature (TSN) represented by data $\mathbf{x}_0$. This is in contrast to frequentist estimation, which is factual in nature, i.e. the sampling distributions of estimators are evaluated under the TSN, i.e. $\mu=\mu^*$.

Significance level vs. p-value

It is important to note that there is a mathematical relationship between the type I error probability (significance level) and the p-value. Placing them side by side:

$$\begin{array}{ll}P(\text{type I error}): & P(d(\mathbf{X})>c_\alpha;\ \mu=\mu_0)=\alpha,\\ \text{p-value}: & P(d(\mathbf{X})>d(\mathbf{x}_0);\ \mu=\mu_0)=p(\mathbf{x}_0),\end{array}\qquad(18)$$

it becomes obvious that:

(a) they are both specified in terms of the distribution of the same test statistic $d(\mathbf{X})$ under $H_0$,

(b) they both evaluate the probability of tail events of the form $\{\mathbf{x}:\ d(\mathbf{x})>c\},\ \mathbf{x}\in\mathcal{X}$, but

(c) they differ in terms of the threshold $c$: $c_\alpha$ vs. $d(\mathbf{x}_0)$; in that sense $\alpha$ is a pre-data and $p(\mathbf{x}_0)$ is a post-data error probability.

In light of (a)-(c), three comments are called for.

First, the p-value can be viewed as the smallest significance level $\alpha$ at which $H_0$ would have been rejected with data $\mathbf{x}_0$. For that reason the p-value is often referred to as the observed significance level. Hence, it should come as no surprise to learn that the above N-P decision rules (14) could be recast in terms of the p-value:

[i]* if $p(\mathbf{x}_0)>\alpha$, accept $H_0$; [ii]* if $p(\mathbf{x}_0)\leq\alpha$, reject $H_0$.
19. 19. Indeed, practitioners often prefer to use the rules [i]*-[ii]* because the p-value conveys additional information when it is not close to the predesignated threshold . For instance, rejecting 0 with (x0 )=0001 seems more informative than just 0 was rejected at  = 05 ¥ Second, the p-value seems to have an implicit alternative built into the choice of the tail area. For instance, a 2-sided alternative 1 : 6=0 would require one to evaluate P(|(X)|  |(x0 )|; =0 ) How did Fisher get away with such an obvious breach of his own preaching? Speculating on that, his answer might have been: 2-sided evaluation of the p-value makes no sense, because, post-data, the sign of the test statistic (x0 ) determines the direction of departure! ¥ Third, neither the signiﬁcance level  nor the p-value can be interpreted as probabilities attached to particular values of  associated with 0 or 1 , since the probabilities in (18) are ﬁrmly attached to the sample realizations x∈X. Indeed, attaching probabilities to the unknown constant  makes no sense in frequentist statistics. As argued by Fisher (1921), p. 25: “We may discuss the probability of occurrence of quantities which can be observed or deduced from observations, in relation to any hypotheses which may be suggested to explain these observations. We can know nothing of the probability of hypotheses...” This scotches another two erroneous interpretations of the p-value: £ (x0 ) = 005 does not mean that there is a 5% chance of a Type I error (i.e. false positive). £ (x0 ) = 005 does not mean that there is a 95% chance that the results would replicate if the study were repeated. Example 1 (continued). 
Consider applying test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ for: $\mu_0=10$, $\sigma=1$, $n=100$, $\overline{x}_n=10.175$, $\alpha=.05 \Rightarrow c_\alpha=1.645$:

$$d(\mathbf{x}_0)=\frac{\sqrt{n}(\overline{x}_n-\mu_0)}{\sigma}=\frac{\sqrt{100}(10.175-10)}{1}=1.75.$$

The p-value is $P(d(\mathbf{X})>1.75;\ \mu=\mu_0)=.04$. Since $d(\mathbf{x}_0)=1.75>1.645$ (equivalently, $.04<.05$): Reject $H_0$.

Standard Normal tables

| $\alpha$ | One-sided value $c_\alpha$ | Two-sided value $c_{\alpha/2}$ |
|---|---|---|
| .100 | 1.28 | 1.645 |
| .050 | 1.645 | 1.96 |
| .025 | 1.96 | 2.24 |
| .010 | 2.33 | 2.58 |
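The arithmetic of Example 1 can be checked numerically. The following is a minimal sketch using only Python's standard library; the function names (`d_stat`, `p_value_upper`, `critical_value`) are illustrative choices, not part of the original notes.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal N(0, 1)

def d_stat(xbar, mu0, sigma, n):
    """Test statistic d(x0) = sqrt(n) * (xbar - mu0) / sigma."""
    return sqrt(n) * (xbar - mu0) / sigma

def p_value_upper(d_obs):
    """One-sided p-value P(d(X) > d(x0); mu = mu0)."""
    return 1.0 - Z.cdf(d_obs)

def critical_value(alpha, two_sided=False):
    """Threshold c_alpha (one-sided) or c_{alpha/2} (two-sided)."""
    tail = alpha / 2 if two_sided else alpha
    return Z.inv_cdf(1.0 - tail)

# Example 1 values: mu0 = 10, sigma = 1, n = 100, observed mean 10.175
d_obs = d_stat(xbar=10.175, mu0=10, sigma=1, n=100)
p_obs = p_value_upper(d_obs)
alpha = 0.05

print(f"d(x0) = {d_obs:.3f}, p-value = {p_obs:.4f}")
# Rules [i]*-[ii]*: reject H0 when the p-value does not exceed alpha
print("reject H0" if p_obs <= alpha else "accept H0")

# One-sided and two-sided thresholds for the tabulated alpha values
for a in (0.100, 0.050, 0.025, 0.010):
    print(a, round(critical_value(a), 3), round(critical_value(a, True), 3))
```

Comparing `p_obs` with `alpha` and comparing `d_obs` with `critical_value(alpha)` always give the same accept/reject answer, which is the equivalence between the N-P rule and its p-value form.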
20. The trade-off between type I and II error probabilities
► How does the introduction of type I and II error probabilities address the arbitrariness associated with Fisher's choice of a test statistic?
The ideal test is one whose error probabilities are zero, but no such test exists for a given $n$; an ideal test, like the ideal estimator, exists only as $n\to\infty$! Worse, there is a trade-off between the above error probabilities: as one increases the other decreases, and vice versa.

Example. In the case of test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ in (16), decreasing the probability of type I error from $\alpha=.05$ to $\alpha=.01$ increases the threshold from $c_\alpha=1.645$ to $c_\alpha=2.33$, which makes it easier to accept $H_0$, and this increases the probability of type II error.

To address this trade-off Neyman and Pearson (1933) proposed:
(i) to specify $H_0$ and $H_1$ in such a way so as to render the type I error the more serious one, and then
(ii) to choose an $\alpha$-significance level test $\{d(\mathbf{X}),\ C_1(\alpha)\}$ so as to maximize its power for all $\theta\in\Theta_1$.

N-P rationale. Consider the analogy with a criminal offense trial, where the jury are instructed by the judge to find the defendant 'not guilty' unless they have been convinced 'beyond any reasonable doubt' by the evidence:
$H_0$: not guilty, vs. $H_1$: guilty.
The clause 'beyond any reasonable doubt' amounts to fixing the type I error to a very small value, to reduce the risk of sending innocent people to prison, or even death. At the same time, one would want the system to minimize the risk of letting guilty people get off 'scot-free'.
More formally, the choice of an optimal N-P test is based on fixing an upper bound $\alpha$ for the type I error probability:

$$P(\mathbf{x}\in C_1;\ H_0(\theta)\text{ true})\leq\alpha,\ \text{ for all } \theta\in\Theta_0,$$

and then selecting a test $\{d(\mathbf{X}),\ C_1(\alpha)\}$ that minimizes the type II error probability, or equivalently, maximizes the power:

$$\pi(\theta)=P(\mathbf{x}_0\in C_1;\ H_1(\theta)\text{ true})=1-\beta(\theta),\ \text{ for all } \theta\in\Theta_1.$$

► The general rule is that in selecting an optimal N-P test the whole of the parameter space $\Theta$ is relevant. This is why partitioning both $\Theta$ and the sample space $\mathcal{X}:=\mathbb{R}_X^n$ using a test statistic $d(\mathbf{X})$ provides the key to N-P testing.
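The trade-off can be made concrete for the one-sided Normal test of Example 1 ($\mu_0=10$, $\sigma=1$, $n=100$). The sketch below is only an illustration: the alternative value `mu1 = 10.2` and the function name `type_II_error` are assumptions chosen for the example, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal N(0, 1)

def type_II_error(alpha, mu1, mu0=10.0, sigma=1.0, n=100):
    """beta(mu1) = P(d(X) <= c_alpha; mu = mu1) for the test that
    rejects H0 when d(X) > c_alpha."""
    c_alpha = Z.inv_cdf(1.0 - alpha)
    delta1 = sqrt(n) * (mu1 - mu0) / sigma  # mean of d(X) when mu = mu1
    return Z.cdf(c_alpha - delta1)

mu1 = 10.2  # hypothetical alternative value, chosen only for illustration
for alpha in (0.05, 0.01):
    beta = type_II_error(alpha, mu1)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

Lowering $\alpha$ raises the threshold $c_\alpha$, so the printed type II error probability goes up and the power goes down, which is the trade-off that fixing $\alpha$ first and then maximizing power is meant to manage.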
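21. Optimal properties of tests
The property that sets the gold standard for an optimal test is:
[1] Uniformly Most Powerful (UMP): A test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ is said to be UMP if it has higher power than any other $\alpha$-level test $\widetilde{T}_\alpha$ for all values $\theta\in\Theta_1$, i.e.

$$\pi(\theta;\ T_\alpha)\geq\pi(\theta;\ \widetilde{T}_\alpha),\ \text{ for all } \theta\in\Theta_1.$$

Example. Test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ in (16) is UMP; Lehmann (1986).

Additional properties of N-P tests
[2] Unbiasedness: A test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ is said to be unbiased if the probability of rejecting $H_0$ when false is always greater than that of rejecting $H_0$ when true, i.e.

$$\max_{\theta\in\Theta_0}P(\mathbf{x}_0\in C_1;\ H_0(\theta))\leq\alpha<P(\mathbf{x}_0\in C_1;\ H_1(\theta)),\ \text{ for all } \theta\in\Theta_1.$$

[3] Consistency: A test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ is said to be consistent if its power goes to one for all $\theta\in\Theta_1$ as $n\to\infty$, i.e.

$$\lim_{n\to\infty}\pi_n(\theta)=P(\mathbf{x}_0\in C_1;\ H_1(\theta)\text{ true})=1,\ \text{ for all } \theta\in\Theta_1.$$

As in estimation, this is a minimal property for tests.

Evaluating power. How does one evaluate the power of test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ for different discrepancies $\gamma=\mu_1-\mu_0$?

Step 1. The relevant sampling distribution for the evaluation of $\pi(\mu_1)$ is:

$$\text{[II]}\quad d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma}\ \overset{\mu=\mu_1}{\sim}\ N(\delta_1,\ 1),\ \text{ for all } \mu_1>\mu_0,$$

so the use of the $N(0,1)$ tables requires one to split the test statistic into:

$$\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma}=\frac{\sqrt{n}(\overline{X}_n-\mu_1)}{\sigma}+\delta_1,\qquad \delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma},$$

since under $H_1(\mu_1)$ the distribution of the first component is:

$$\frac{\sqrt{n}(\overline{X}_n-\mu_1)}{\sigma}=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma}-\delta_1\ \overset{\mu=\mu_1}{\sim}\ N(0,\ 1),\ \text{ for } \mu_1>\mu_0.$$

Step 2. Evaluating the power of the test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$, i.e. $\pi(\mu_1)=P\left(\frac{\sqrt{n}(\overline{X}_n-\mu_1)}{\sigma}>-\delta_1+c_\alpha;\ \mu=\mu_1\right)$, for different discrepancies yields:

| $\gamma=\mu_1-\mu_0$ | $\delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}$ | $\pi(\mu_1)$ |
|---|---|---|
| $\gamma=.1$ | $\delta_1=1$ | $\pi(10.1)=P(Z>-1+1.645)=.259$ |
| $\gamma=.2$ | $\delta_1=2$ | $\pi(10.2)=P(Z>-2+1.645)=.639$ |
| $\gamma=.3$ | $\delta_1=3$ | $\pi(10.3)=P(Z>-3+1.645)=.913$ |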
22. where $Z$ is a generic standard Normal r.v., i.e. $Z\sim N(0,1)$.

The power of this test is typical of an optimal test since $\pi(\mu_1)$ increases with the non-zero mean $\delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}$, and thus:
(a) the power increases with the sample size $n$,
(b) the power increases with the discrepancy $\gamma=(\mu_1-\mu_0)$, and
(c) the power decreases with $\sigma$.

It is very important to emphasize three features of an optimal test.
First, a test is not just a formula associated with particular statistical tables. It is a combination of a test statistic and a rejection region; hence the notation $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$.
Second, the optimality of an N-P test is inextricably bound up with the optimality of the estimator the test statistic is based on. Hence, it is no accident that most optimal N-P tests are based on consistent, fully efficient and sufficient estimators.

Example. In the case of the simple (one parameter) Normal model, consider replacing $\overline{X}_n$ with the unbiased estimator $\widehat{\mu}_2=(X_1+X_n)/2\sim N(\mu,\ \frac{\sigma^2}{2})$. The resulting test:

$$\check{T}_\alpha:=\{\check{d}(\mathbf{X}),\ C_1(\alpha)\},\ \text{ where } \check{d}(\mathbf{X})=\frac{\sqrt{2}(\widehat{\mu}_2-\mu_0)}{\sigma}\ \text{ and } C_1(\alpha)=\{\mathbf{x}: \check{d}(\mathbf{x})>c_\alpha\},$$

will not be optimal because its power is much lower than that of $T_\alpha$, and $\check{T}_\alpha$ is inconsistent! This is because the non-zero mean of the sampling distribution under $\mu=\mu_1$ will be $\delta=\frac{\sqrt{2}(\mu_1-\mu_0)}{\sigma}$, which does not change as $n\to\infty$.

Third, it is important to note that by changing the rejection region one can render an optimal N-P test useless! For instance, replacing the
23. rejection region of $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ with:

$$\overline{C}_1(\alpha)=\{\mathbf{x}: d(\mathbf{x})<c_\alpha\},$$

the resulting test $\breve{T}_\alpha:=\{d(\mathbf{X}),\ \overline{C}_1(\alpha)\}$ is practically useless because it is biased and its power decreases as the discrepancy $\gamma$ increases.

Example 2 - the Student's t test. Consider the 1-sided hypotheses:

$$\text{(1-s>)}:\ H_0: \mu=\mu_0\ \text{ vs. }\ H_1: \mu>\mu_0,\qquad\text{or}\qquad \text{(1-s*>)}:\ H_0: \mu\leq\mu_0\ \text{ vs. }\ H_1: \mu>\mu_0, \tag{19}$$

in the context of the simple Normal model (table 1).
In this case, a UMP (consistent) test $T_\alpha:=\{\tau(\mathbf{X}),\ C_1(\alpha)\}$ exists. It is the well-known Student's t test, and takes the form:

$$\text{test statistic: } \tau(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{s},\quad s^2=\frac{1}{n-1}\sum_{k=1}^{n}(X_k-\overline{X}_n)^2;\qquad \text{rejection region: } C_1(\alpha)=\{\mathbf{x}: \tau(\mathbf{x})>c_\alpha\}, \tag{20}$$

where $c_\alpha$ can be evaluated using the Student's t tables.
To evaluate the two types of error probabilities, the distributions of $\tau(\mathbf{X})$ under both the null and the alternatives are needed:

$$\begin{array}{ll}\text{[I]*} & \tau(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{s}\ \overset{H_0(\mu_0)}{\sim}\ \text{St}(n-1),\\ \text{[II]*} & \tau(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{s}\ \overset{H_1(\mu_1)}{\sim}\ \text{St}(\delta_1;\ n-1),\ \text{ for all } \mu_1>\mu_0,\end{array} \tag{21}$$

where $\delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}$ is the non-centrality parameter.

Student's t tables ($\nu=60$)

| $\alpha$ | One-sided value $c_\alpha$ | Two-sided value $c_{\alpha/2}$ |
|---|---|---|
| .100 | 1.296 | 1.671 |
| .050 | 1.671 | 2.000 |
| .010 | 2.390 | 2.660 |
| .001 | 3.232 | 3.460 |

In summary, the main components of a Neyman-Pearson (N-P) test $\{\tau(\mathbf{X}),\ C_1(\alpha)\}$ are given below.
24. Components of a Neyman-Pearson (N-P) test
(i) a prespecified statistical model: $\mathcal{M}_\theta(\mathbf{x})$,
(ii) a null ($H_0$) and an alternative ($H_1$) within $\mathcal{M}_\theta(\mathbf{x})$,
(iii) a test statistic (distance function) $\tau(\mathbf{X})$,
(iv) the distribution of $\tau(\mathbf{X})$ under $H_0$,
(v) a prespecified significance level $\alpha$ [.01, .025, .05],
(vi) the rejection region $C_1(\alpha)$,
(vii) the distribution of $\tau(\mathbf{X})$ under $H_1$.

6.2 The N-P Lemma and its extensions
The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Contemplate the simple generic statistical model:

$$\mathcal{M}_\theta(\mathbf{x})=\{f(\mathbf{x};\theta)\},\ \theta\in\Theta:=\{\theta_0,\ \theta_1\},\ \mathbf{x}\in\mathbb{R}_X^n, \tag{22}$$

and consider the problem of testing the simple hypotheses:

$$H_0: \theta=\theta_0\ \text{ vs. }\ H_1: \theta=\theta_1. \tag{23}$$

■ The fact that the assumed parameter space is $\Theta:=\{\theta_0,\ \theta_1\}$, so that (23) constitutes a partition, is often left out of statistics textbook discussions of this famous lemma!

Existence. There exists an $\alpha$-significance level Uniformly Most Powerful (UMP) [$\alpha$-UMP] test, whose generic form is:

$$d(\mathbf{X})=h\left(\frac{f(\mathbf{x};\theta_1)}{f(\mathbf{x};\theta_0)}\right),\qquad C_1(\alpha)=\{\mathbf{x}: d(\mathbf{x})>c_\alpha\}, \tag{24}$$

where $h(\cdot)$ is a monotone function.
Sufficiency. If an $\alpha$-level test of the form (24) exists, then it is UMP for testing (23).
Necessity. If $\{d(\mathbf{X}),\ C_1(\alpha)\}$ is an $\alpha$-UMP test, then it will be given by (24).

At first sight the N-P lemma seems rather contrived because it is an existence result for a simple statistical model $\mathcal{M}_\theta(\mathbf{x})$ whose parameter space $\Theta:=\{\theta_0,\ \theta_1\}$ is artificial, but it fits perfectly into the archetypal formulation. In addition, to implement this result one would need to find $h(\cdot)$ yielding a test statistic $d(\mathbf{X})$ whose distribution is known under
25. both $H_0$ and $H_1$. Worse, this lemma is often misconstrued as suggesting that for an $\alpha$-UMP test to exist one needs to confine testing to simple-vs-simple cases even when $\Theta$ is uncountable!

■ The truth of the matter is that the construction of an $\alpha$-UMP test in more realistic cases has nothing to do with simple-vs-simple hypotheses; instead it is invariably based on the archetypal N-P testing formulation in (13), and relies primarily on monotone likelihood ratios and other features of the prespecified statistical model $\mathcal{M}_\theta(\mathbf{x})$.

Example. To illustrate these comments consider the simple-vs-simple hypotheses:

$$\text{(i) (1-1)}:\ H_0: \mu=\mu_0\ \text{ vs. }\ H_1: \mu=\mu_1, \tag{25}$$

in the context of a simple Normal (one parameter) model (table 3). Applying the N-P lemma requires setting up the ratio:

$$\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}=\exp\left\{\frac{n}{\sigma^2}(\mu_1-\mu_0)\overline{X}_n-\frac{n}{2\sigma^2}(\mu_1^2-\mu_0^2)\right\}, \tag{26}$$

which is clearly not a test statistic, as it stands. However, there exists a monotone function $h(\cdot)$ which transforms (26) into a familiar test statistic (Spanos, 1999, pp. 708-9):

$$d(\mathbf{X})=h\left(\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}\right)=\left[\frac{1}{\delta_1}\ln\left(\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}\right)+\frac{\delta_1}{2}\right]=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma},$$

with ascertainable type I and II error probabilities based on:

$$d(\mathbf{X})\ \overset{\mu=\mu_0}{\sim}\ N(0,\ 1),\qquad d(\mathbf{X})\ \overset{\mu=\mu_1}{\sim}\ N(\delta_1,\ 1),\quad \delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma}.$$

In this particular case, it is clear that the Neyman-Pearson lemma does not apply because the hypotheses in (25) do not constitute a partition of the parameter space $\Theta=\mathbb{R}$, as in (13). However, it turns out that when the test statistic $d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\mu_0)}{\sigma}$ is combined with information relating to the values $\mu_0$ and $\mu_1$, it can provide the basis for constructing several optimal tests (Lehmann, 1986):

[1] For $\mu_1>\mu_0$, the test $T_\alpha^{>}:=\{d(\mathbf{X}),\ C_1^{>}(\alpha)\}$, where $C_1^{>}(\alpha)=\{\mathbf{x}: d(\mathbf{x})>c_\alpha\}$, is $\alpha$-UMP for the hypotheses:
(ii) (1-s≥): $H_0: \mu\leq\mu_0$ vs. $H_1: \mu>\mu_0$,
(iii) (1-s>): $H_0: \mu=\mu_0$ vs. $H_1: \mu>\mu_0$.
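The algebra that takes the ratio in (26) to the test statistic can be spelled out step by step. The following is a sketch of that derivation for the simple Normal (one parameter) model, writing $\delta_1=\sqrt{n}(\mu_1-\mu_0)/\sigma$ and assuming $\mu_1\neq\mu_0$ so that dividing by $\delta_1$ is legitimate:

```latex
\begin{aligned}
\ln\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}
  &= \frac{n}{\sigma^2}(\mu_1-\mu_0)\,\overline{X}_n
     -\frac{n}{2\sigma^2}(\mu_1-\mu_0)(\mu_1+\mu_0), \\[4pt]
\frac{1}{\delta_1}\ln\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}
  &= \frac{\sqrt{n}}{\sigma}\,\overline{X}_n
     -\frac{\sqrt{n}\,(\mu_1+\mu_0)}{2\sigma}, \\[4pt]
\frac{1}{\delta_1}\ln\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}+\frac{\delta_1}{2}
  &= \frac{\sqrt{n}}{\sigma}\,\overline{X}_n
     -\frac{\sqrt{n}\,(\mu_1+\mu_0)}{2\sigma}
     +\frac{\sqrt{n}\,(\mu_1-\mu_0)}{2\sigma}
   = \frac{\sqrt{n}\,(\overline{X}_n-\mu_0)}{\sigma}.
\end{aligned}
```

The first line uses $\mu_1^2-\mu_0^2=(\mu_1-\mu_0)(\mu_1+\mu_0)$. For $\mu_1>\mu_0$ the map $h(\cdot)$ is increasing, so large values of the ratio correspond to large values of $d(\mathbf{X})$; the final expression no longer involves $\mu_1$, which is what allows the same statistic to serve the one-sided composite hypotheses.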
26. Despite the difference between (ii) and (iii), the relevant error probabilities coincide. The type I error probability for (ii) is defined as the maximum over all $\mu\leq\mu_0$, but coincides with that of (iii) ($\mu=\mu_0$):

$$\alpha=\max_{\mu\leq\mu_0}P(d(\mathbf{X})>c_\alpha;\ H_0(\mu))=P(d(\mathbf{X})>c_\alpha;\ \mu=\mu_0).$$

Similarly, the p-value is the same because:

$$p_{>}(\mathbf{x}_0)=\max_{\mu\leq\mu_0}P(d(\mathbf{X})>d(\mathbf{x}_0);\ H_0(\mu))=P(d(\mathbf{X})>d(\mathbf{x}_0);\ \mu=\mu_0).$$

[2] For $\mu_1<\mu_0$, the test $T_\alpha^{<}:=\{d(\mathbf{X}),\ C_1^{<}(\alpha)\}$, where $C_1^{<}(\alpha)=\{\mathbf{x}: d(\mathbf{x})<c_\alpha\}$, is $\alpha$-UMP for the hypotheses:
(iv) (1-s≤): $H_0: \mu\geq\mu_0$ vs. $H_1: \mu<\mu_0$,
(v) (1-s<): $H_0: \mu=\mu_0$ vs. $H_1: \mu<\mu_0$.
Again, the type I error probability and p-value are defined by:

$$\alpha=\max_{\mu\geq\mu_0}P(d(\mathbf{X})<c_\alpha;\ H_0(\mu))=P(d(\mathbf{X})<c_\alpha;\ \mu=\mu_0),$$

$$p_{<}(\mathbf{x}_0)=\max_{\mu\geq\mu_0}P(d(\mathbf{X})<d(\mathbf{x}_0);\ H_0(\mu))=P(d(\mathbf{X})<d(\mathbf{x}_0);\ \mu=\mu_0).$$

The existence of these $\alpha$-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:

■ [A] The ratio (26) is a monotone function of the statistic $\overline{X}_n$, in the sense that for any two values $\mu_1>\mu_0$, $\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}$ changes monotonically with $\overline{X}_n$. This implies that $\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}>k$ if and only if $\overline{X}_n>c$ for some $c$.
This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of distributions [Normal, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.], the Uniform, the Exponential, the Logistic, the Hypergeometric etc.; Lehmann (1986).

■ [B] The parameter space under $H_1$, say $\Theta_1$, is convex, i.e. for any two values $(\mu_1,\ \mu_2)\in\Theta_1$, their convex combinations $\lambda\mu_1+(1-\lambda)\mu_2\in\Theta_1$, for any $0\leq\lambda\leq 1$.
When convexity does not hold, as with the 2-sided alternative:
(vi) (2-s): $H_0: \mu=\mu_0$ vs. $H_1: \mu\neq\mu_0$,
[3] the test $T_\alpha^{\neq}:=\{d(\mathbf{X}),\ C_1^{\neq}(\alpha)\}$, $C_1^{\neq}(\alpha)=\{\mathbf{x}: |d(\mathbf{x})|>c_{\alpha/2}\}$, is $\alpha$-UMPU (Unbiased); the $\alpha$-level and p-value are:

$$\alpha=P(|d(\mathbf{X})|>c_{\alpha/2};\ \mu=\mu_0),\qquad p_{\neq}(\mathbf{x}_0)=P(|d(\mathbf{X})|>|d(\mathbf{x}_0)|;\ \mu=\mu_0).$$
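The claim that the maximum type I error probability over the composite null $\mu\leq\mu_0$ is attained at the boundary $\mu=\mu_0$ can be checked numerically. The sketch below reuses the Example 1 setting ($\mu_0=10$, $\sigma=1$, $n=100$, $\alpha=.05$); the grid of null values and the function name `rejection_prob` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal N(0, 1)

def rejection_prob(mu, mu0=10.0, sigma=1.0, n=100, alpha=0.05):
    """P(d(X) > c_alpha; mu), using d(X) ~ N(sqrt(n)(mu - mu0)/sigma, 1)."""
    c_alpha = Z.inv_cdf(1.0 - alpha)
    shift = sqrt(n) * (mu - mu0) / sigma
    return 1.0 - Z.cdf(c_alpha - shift)

# Grid of values inside the composite null mu <= mu0 (illustrative choice)
null_grid = [9.7, 9.8, 9.9, 9.95, 10.0]
probs = [rejection_prob(mu) for mu in null_grid]

for mu, p in zip(null_grid, probs):
    print(f"mu={mu:<5}  P(reject H0)={p:.4f}")
```

The rejection probability increases with $\mu$, so over $\mu\leq\mu_0$ it peaks at $\mu=\mu_0$, where it equals $\alpha$. This is why the error probabilities for (ii) and (iii) coincide.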
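27. 6.3 Constructing optimal tests: Likelihood Ratio
The likelihood ratio test procedure can be viewed as a generalization of the Neyman-Pearson lemma to more realistic cases where the null and/or the alternative are composite hypotheses.

(a) The hypotheses of interest are of the general form: $H_0: \boldsymbol{\theta}\in\Theta_0$ vs. $H_1: \boldsymbol{\theta}\in\Theta_1$.

(b) The test statistic is related to the 'likelihood' ratio:

$$\lambda_n(\mathbf{X})=\frac{\max_{\boldsymbol{\theta}\in\Theta}L(\boldsymbol{\theta};\mathbf{X})}{\max_{\boldsymbol{\theta}\in\Theta_0}L(\boldsymbol{\theta};\mathbf{X})}=\frac{L(\widehat{\boldsymbol{\theta}};\mathbf{X})}{L(\widetilde{\boldsymbol{\theta}};\mathbf{X})}. \tag{27}$$

Note that the max in the numerator is over all $\boldsymbol{\theta}\in\Theta$ [yielding the MLE $\widehat{\boldsymbol{\theta}}$], but that of the denominator is confined to all values under $H_0$: $\boldsymbol{\theta}\in\Theta_0$ [yielding the constrained MLE $\widetilde{\boldsymbol{\theta}}$].

(c) The rejection region is defined in terms of a transformed ratio, where $h(\cdot)$ is chosen to yield a known sampling distribution under $H_0$:

$$C_1(\alpha)=\{\mathbf{x}: d(\mathbf{X})=h(\lambda_n(\mathbf{X}))>c_\alpha\}. \tag{28}$$

In terms of procedures to construct optimal tests, the LR procedure can be shown to yield several optimal tests; Lehmann (1986). Moreover, in cases where no such $h(\cdot)$ can be found, one can use the asymptotic distribution, which states that under certain restrictions (Wilks, 1938):

$$2\ln\lambda_n(\mathbf{X})=2\left(\ln L(\widehat{\boldsymbol{\theta}};\mathbf{X})-\ln L(\widetilde{\boldsymbol{\theta}};\mathbf{X})\right)\ \overset{H_0}{\underset{a}{\sim}}\ \chi^2(r),$$

where $\overset{H_0}{\underset{a}{\sim}}$ reads "under $H_0$ is asymptotically distributed as" and $r$ denotes the number of restrictions involved in $\Theta_0$.

Example. Consider testing the hypotheses:

$$H_0: \sigma^2=\sigma_0^2\ \text{ vs. }\ H_1: \sigma^2\neq\sigma_0^2,$$

in the context of the simple Normal model (table 1). From the previous chapter we know that the Maximum Likelihood method gives rise to:

$$\max_{\boldsymbol{\theta}\in\Theta}L(\boldsymbol{\theta};\mathbf{x})=L(\widehat{\boldsymbol{\theta}};\mathbf{x})\ \Rightarrow\ \widehat{\mu}=\frac{1}{n}\sum_{k=1}^{n}X_k\ \text{ and }\ \widehat{\sigma}^2=\frac{1}{n}\sum_{k=1}^{n}(X_k-\widehat{\mu})^2,$$

$$\max_{\boldsymbol{\theta}\in\Theta_0}L(\boldsymbol{\theta};\mathbf{x})=L(\widetilde{\boldsymbol{\theta}};\mathbf{x})\ \Rightarrow\ \widetilde{\mu}=\frac{1}{n}\sum_{k=1}^{n}X_k\ \text{ and }\ \widetilde{\sigma}^2=\sigma_0^2.$$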
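28. Moreover, the two estimated likelihoods are:

$$\max_{\boldsymbol{\theta}\in\Theta}L(\boldsymbol{\theta};\mathbf{x})=\left(2\pi\widehat{\sigma}^2\right)^{-\frac{n}{2}}\exp\left\{-\frac{n}{2}\right\},\qquad \max_{\boldsymbol{\theta}\in\Theta_0}L(\boldsymbol{\theta};\mathbf{x})=\left(2\pi\sigma_0^2\right)^{-\frac{n}{2}}\exp\left\{-\frac{n\widehat{\sigma}^2}{2\sigma_0^2}\right\}.$$

Hence, the likelihood ratio (written here with the constrained maximum in the numerator, i.e. as the reciprocal of (27)) is:

$$\lambda_n(\mathbf{X})=\frac{L(\widetilde{\boldsymbol{\theta}};\mathbf{X})}{L(\widehat{\boldsymbol{\theta}};\mathbf{X})}=\frac{\left(2\pi\sigma_0^2\right)^{-\frac{n}{2}}\exp\left\{-\frac{n\widehat{\sigma}^2}{2\sigma_0^2}\right\}}{\left(2\pi\widehat{\sigma}^2\right)^{-\frac{n}{2}}\exp\left\{-\frac{n}{2}\right\}}=\left[\left(\frac{\widehat{\sigma}^2}{\sigma_0^2}\right)\exp\left\{-\left(\frac{\widehat{\sigma}^2}{\sigma_0^2}\right)+1\right\}\right]^{\frac{n}{2}}.$$

The function $h(\cdot)$ that transforms this ratio into a test statistic is:

$$v(\mathbf{X})=h(\lambda_n(\mathbf{X}))=\frac{n\widehat{\sigma}^2}{\sigma_0^2}\ \overset{H_0}{\sim}\ \chi^2(n-1),$$

with a rejection region of the form:

$$C_1=\{\mathbf{x}: v(\mathbf{X})>c_2\ \text{ or }\ v(\mathbf{X})<c_1\},$$

where for a size $\alpha$ test the constants $c_1$, $c_2$ are chosen in such a way that:

$$\int_{c_2}^{\infty}\psi(v)\,dv=\int_{0}^{c_1}\psi(v)\,dv=\frac{\alpha}{2},$$

and $\psi(v)$ denotes the chi-square density function. The test defined by $\{v(\mathbf{X}),\ C_1(\alpha)\}$ turns out to be UMP Unbiased (UMPU) (see Lehmann, 1986).

6.4 Substantive vs. statistical significance in N-P testing
We consider two substantive values of interest that concern the ratio of Boys (B) to Girls (G) in newborns:

$$\text{Arbuthnot: } \#B=\#G,\qquad \text{Bernoulli: } 18B \text{ to } 17G. \tag{29}$$

Statistical hypotheses. The Arbuthnot and Bernoulli conjectured values for the proportion can be embedded into a simple Bernoulli model (table 2), where $\theta=P(X=1)=P(\text{Boy})$:

$$\text{Arbuthnot: } \theta_A=\tfrac{1}{2},\qquad \text{Bernoulli: } \theta_B=\tfrac{18}{35}. \tag{30}$$

How should one proceed to appraise $\theta_A$ and $\theta_B$ vis-a-vis the above data? The archetypal specification (13) suggests probing each value separately, using the difference $(\theta_B-\theta_A)$ as the discrepancy of interest.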
29. Despite the fact that on substantive grounds the only relevant values of $\theta$ are $(\theta_A,\ \theta_B)$, on statistical inference grounds the rest of the parameter space is relevant; $\mathcal{M}_\theta(\mathbf{x})$ provides the relevant inductive premises.

This argument, however, has been questioned in the literature on the grounds that N-P testing should probe these two values using simple-vs-simple hypotheses. After all, the argument goes, the Neyman-Pearson lemma would secure an $\alpha$-UMP test!

This argument shows insufficient understanding of the Neyman-Pearson lemma, because the parameter space of the simple Bernoulli model is not $\Theta=\{\theta_A,\ \theta_B\}$ but $\Theta=[0,\ 1]$. Moreover, the simple Bernoulli model has a monotone likelihood ratio, i.e.:

$$\frac{f(\mathbf{x};\theta_1)}{f(\mathbf{x};\theta_0)}=\left(\frac{1-\theta_1}{1-\theta_0}\right)^{n}\left(\frac{\theta_1(1-\theta_0)}{\theta_0(1-\theta_1)}\right)^{\sum_{k=1}^{n}x_k},\ \text{ for } (\theta_0,\ \theta_1)\ \text{ s.t. } 0<\theta_0<\theta_1<1,$$

is a monotonically increasing function of the statistic $Y_n=\sum_{k=1}^{n}X_k$, since $\frac{\theta_1(1-\theta_0)}{\theta_0(1-\theta_1)}>1$, ensuring the existence of several $\alpha$-UMP tests.

In particular, [i] $T_\alpha^{>}:=\{d(\mathbf{X}),\ C_1^{>}(\alpha)\}$:

$$d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{\theta=\theta_0}{\sim}\ \text{Bin}(0,\ 1;\ n),\qquad C_1^{>}(\alpha)=\{\mathbf{x}: d(\mathbf{x})>c_\alpha\}, \tag{31}$$

is an $\alpha$-UMP test for the N-P hypotheses:
(ii) (1-s≥): $H_0: \theta\leq\theta_0$ vs. $H_1: \theta>\theta_0$,
(iii) (1-s>): $H_0: \theta=\theta_0$ vs. $H_1: \theta>\theta_0$,

and [ii] $T_\alpha^{<}:=\{d(\mathbf{X}),\ C_1^{<}(\alpha)\}$:

$$d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{\theta=\theta_0}{\sim}\ \text{Bin}(0,\ 1;\ n),\qquad C_1^{<}(\alpha)=\{\mathbf{x}: d(\mathbf{x})<c_\alpha\}, \tag{32}$$

is an $\alpha$-UMP test for the N-P hypotheses:
(iv) (1-s≤): $H_0: \theta\geq\theta_0$ vs. $H_1: \theta<\theta_0$,
(v) (1-s<): $H_0: \theta=\theta_0$ vs. $H_1: \theta<\theta_0$.

These results stem from the fact that the formulations (ii)-(v) ensure the convexity of the parameter space under $H_1$.
Moreover, in the case of the two-sided hypotheses:
(vi) (2-s): $H_0: \theta=\theta_0$ vs. $H_1: \theta\neq\theta_0$,
30. the test [iii] $T_\alpha^{\neq}:=\{d(\mathbf{X}),\ C_1^{\neq}(\alpha)\}$:

$$d(\mathbf{X})=\frac{\sqrt{n}(\overline{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{\theta=\theta_0}{\sim}\ \text{Bin}(0,\ 1;\ n),\qquad C_1^{\neq}(\alpha)=\{\mathbf{x}: |d(\mathbf{x})|>c_{\alpha/2}\}, \tag{33}$$

is $\alpha$-UMP Unbiased; see Lehmann (1986).

■ Randomization? No! In cases where the test statistic has a discrete sampling distribution under $H_0$, as in (32)-(33), one might not be able to define $c_\alpha$ exactly. The traditional way of dealing with this problem is to randomize, but this 'solution' raises more problems than it solves. Instead, the best way to address the discreteness issue is to select a value $\alpha$ that is attained for the given $n$, or to approximate the discrete distribution with a continuous one to circumvent the problem. In practice, there are several continuous distributions one can use, depending on whether the original distribution is symmetric or not. In the above case, even for moderately small sample sizes, say $n=20$, the Normal distribution provides an excellent approximation for values of $\theta$ around .5; see graph. With $n=30762$, the approximation is nearly exact.

[Figure: distribution plot comparing the Binomial ($n=20$, $p=0.5$) with the Normal (mean 10, StDev 2.236); horizontal axis X, vertical axis Density.]
Normal approx. of Binomial: $f(y;\ \theta=.5,\ n=20)$

What about the choice between one-sided vs. two-sided? The answer depends crucially on whether there exists reliable substantive information that renders part of the parameter space $\Theta$ irrelevant for both substantive and statistical purposes.

6.5 N-P testing of Arbuthnot's value
In this case there is reliable substantive information that the probability of the newborn being a male, $\theta=P(X=1)$, is systematically slightly above $\tfrac{1}{2}$; the human sex ratio at birth is approximately $\theta=.512$ (Hardy,
31. 2002). Hence, a statistically more informative way to assess $\theta=.5$ might be to use one-sided (1-s>) directional probing.

Data: $n=30762$ newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14733 girls.
In view of the huge sample size it is advisable to choose a smaller significance level, say $\alpha=.01 \Rightarrow c_\alpha=2.326$.

Case 1. For testing Arbuthnot's value $\theta_0=\tfrac{1}{2}$ the relevant hypotheses are:

$$\text{(1-s>)}:\ H_0: \theta=\tfrac{1}{2}\ \text{ vs. }\ H_1: \theta>\tfrac{1}{2},\qquad\text{or}\qquad \text{(1-s*≤)}:\ H_0: \theta\leq\tfrac{1}{2}\ \text{ vs. }\ H_1: \theta>\tfrac{1}{2}, \tag{34}$$

and $T_\alpha^{>}:=\{d(\mathbf{X}),\ C_1^{>}(\alpha)\}$ is an $\alpha$-UMP test. In view of the fact that:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{1}{2}\right)}{\sqrt{.5(.5)}}=7.389,\qquad p_{>}(\mathbf{x}_0)=P(d(\mathbf{X})>7.389;\ H_0)=.0000, \tag{35}$$

the null hypothesis $H_0: \theta=\tfrac{1}{2}$ is strongly rejected.

What if one were to ignore the substantive information and apply a 2-sided test?
Case 2. For testing Arbuthnot's value this takes the form:

$$\text{(2-s)}:\ H_0: \theta=\tfrac{1}{2}\ \text{ vs. }\ H_1: \theta\neq\tfrac{1}{2}. \tag{36}$$

$T_\alpha^{\neq}:=\{d(\mathbf{X}),\ C_1^{\neq}(\alpha)\}$ in (33) is an $\alpha$-UMPU test; for $\alpha=.01$, $c_{\alpha/2}=2.576$. In view of the fact that:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{1}{2}\right)}{\sqrt{.5(.5)}}=7.389>2.576,\qquad p_{\neq}(\mathbf{x}_0)=P(|d(\mathbf{X})|>7.389;\ H_0)=.0000,$$

the null hypothesis $H_0: \theta=\tfrac{1}{2}$ is strongly rejected.

6.6 N-P testing of Bernoulli's value
The first question we need to answer is the appropriate form of the alternative in testing Bernoulli's conjecture $\theta=\tfrac{18}{35}$.
Using the same substantive information that indicates $\theta$ is systematically slightly above $\tfrac{1}{2}$ (Hardy, 2002), it seems reasonable to use a 1-sided alternative in the direction of $\tfrac{1}{2}$.

Case 1. For probing Bernoulli's value the relevant hypotheses are:

$$\text{(1-s<)}:\ H_0: \theta=\tfrac{18}{35}\ \text{ vs. }\ H_1: \theta<\tfrac{18}{35}. \tag{37}$$
32. The optimal test (UMP) for (37) is $T_\alpha^{<}:=\{d(\mathbf{X}),\ C_1^{<}(\alpha)\}$, where for $\alpha=.01 \Rightarrow c_\alpha=-2.326$. Using the same data:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{18}{35}\right)}{\sqrt{\frac{18}{35}\left(1-\frac{18}{35}\right)}}=2.379,\qquad p_{<}(\mathbf{x}_0)=P(d(\mathbf{X})<2.379;\ H_0)=.991,$$

the null hypothesis $H_0: \theta=\tfrac{18}{35}$ is accepted with a high p-value.

What if one were to ignore the substantive information as pertaining only to Arbuthnot's value, and apply a 2-sided test?
Case 2. For testing Bernoulli's value using the 2-s probing:

$$\text{(2-s)}:\ H_0: \theta=\tfrac{18}{35}\ \text{ vs. }\ H_1: \theta\neq\tfrac{18}{35}, \tag{38}$$

based on the UMPU test $T_\alpha^{\neq}:=\{d(\mathbf{X}),\ C_1^{\neq}(\alpha)\}$ in (33):

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{18}{35}\right)}{\sqrt{\frac{18}{35}\left(1-\frac{18}{35}\right)}}=2.379<2.576,\qquad p_{\neq}(\mathbf{x}_0)=P(|d(\mathbf{X})|>2.379;\ H_0)=.0174, \tag{39}$$

yields accepting $H_0$, but the p-value is considerably smaller than $p_{<}(\mathbf{x}_0)=.991$ for the 1-sided test.

The question that arises is: which one is the relevant p-value? Indeed, a moment's reflection suggests that these two are not the only choices. There is also the p-value for:

$$\text{(1-s>)}:\ H_0: \theta=\tfrac{18}{35}\ \text{ vs. }\ H_1: \theta>\tfrac{18}{35}, \tag{40}$$

which yields a much smaller p-value:

$$p_{>}(\mathbf{x}_0)=P(d(\mathbf{X})>2.379;\ H_0)=.009, \tag{41}$$

reversing the previous results and leading to rejecting $H_0$.

■ This raises the question: on what basis should one choose the relevant p-value?
The short answer, 'on substantive grounds', seems questionable because, first, it ignores the fact that the whole of the parameter space is relevant on statistical grounds, and second, one applies a test to assess the validity of substantive information, not to foist it on the data at the outset. Even more questionable is the argument that one should choose a simple alternative on substantive grounds in order to apply the Neyman-Pearson lemma that guarantees an $\alpha$-UMP test; this is based on a misapprehension of the lemma.
The longer answer, which takes into account the fact that post-data $d(\mathbf{x}_0)$ is known, and thus there is often a clear direction of departure indicated by data $\mathbf{x}_0$ associated with the sign of $d(\mathbf{x}_0)$, will be elaborated upon in section 8.
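The test statistics and the three competing p-values for the Arbuthnot and Bernoulli values can be recomputed from the data. The sketch below uses the Normal approximation to the standardized Binomial and only the Python standard library; the function names `d_stat` and `p_values` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal N(0, 1)

n, boys = 30762, 16029
theta_hat = boys / n  # observed proportion of boys

def d_stat(theta0):
    """d(x0) = sqrt(n)(theta_hat - theta0) / sqrt(theta0 (1 - theta0))."""
    return sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))

def p_values(d_obs):
    """Normal-approximation p-values for the three forms of alternative."""
    return {
        "greater": 1.0 - Z.cdf(d_obs),                  # H1: theta > theta0
        "less": Z.cdf(d_obs),                           # H1: theta < theta0
        "two_sided": 2.0 * (1.0 - Z.cdf(abs(d_obs))),   # H1: theta != theta0
    }

for label, theta0 in (("Arbuthnot", 1 / 2), ("Bernoulli", 18 / 35)):
    d_obs = d_stat(theta0)
    p = p_values(d_obs)
    summary = "  ".join(f"p_{k}={v:.4f}" for k, v in p.items())
    print(f"{label}: d(x0)={d_obs:.3f}  {summary}")
```

For the Bernoulli value the same $d(\mathbf{x}_0)$ yields a large, a moderate, or a small p-value depending only on which tail is evaluated, which is the ambiguity discussed above.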
33. 7 Foundational issues raised by N-P testing
Let us now appraise how successful the N-P framework was in addressing the perceived weaknesses of Fisher's testing:
[a] his ad hoc choice of a test statistic $d(\mathbf{X})$,
[b] his use of a post-data [$d(\mathbf{x}_0)$ is known] threshold for the p-value that indicates discordance (reject) with $H_0$, and
[c] his denial that $d(\mathbf{x}_0)$ can indicate accordance with $H_0$.

Very briefly, the N-P framework partly succeeded in addressing [a] and [c], but did not provide a coherent evidential account that answers the basic question (Mayo, 1996):

when do data $\mathbf{x}_0$ provide evidence for or against a hypothesis or a claim $H$? (42)

This is primarily because one could not interpret the decision results 'Accept $H_0$' ('Reject $H_0$') as data $\mathbf{x}_0$ providing evidence for (against) $H_0$ [and thus providing evidence for $H_1$]. Why?

(i) A particular test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ could have led to 'Accept $H_0$' because the power of that test to detect an existing discrepancy $\gamma$ was very low; the test had no generic capacity to detect an existing discrepancy. This can easily happen when the sample size $n$ is small.

(ii) A particular test $T_\alpha:=\{d(\mathbf{X}),\ C_1(\alpha)\}$ could have led to 'Reject $H_0$' simply because the power of that test was high enough to detect 'trivial' discrepancies from $H_0$, i.e. the generic capacity of the test was extremely high. This can easily happen when the sample size $n$ is very large.

In light of the fact that the N-P framework did not provide a bridge between the accept/reject rules and what scientific inquiry is after, i.e. when data $\mathbf{x}_0$ provide evidence for or against a hypothesis $H$, it should come as no surprise that most practitioners sought such answers by trying unsuccessfully (and misleadingly) to distill an evidential interpretation out of Fisher's p-value.
In their eyes the predesignated $\alpha$ is equally vulnerable to manipulation [pre-designation to keep practitioners 'honest' is unenforceable in practice], but the p-value is more informative and data-specific.

Is Fisher's p-value better in answering the question (42)? No. A small (large) p-value, relative to a certain threshold $c$, could not be interpreted as evidence for the presence (absence) of a substantive
34. discrepancy $\gamma$ for the same reasons as (i)-(ii). The generic capacity of the test (its power) affects the p-value. For instance, a very small p-value can easily arise in the case of a very large sample size $n$.

Fallacies of Acceptance and Rejection
The issues with the accept/reject $H_0$ and p-value results raised above can be formalized into two classic fallacies.

(a) The fallacy of acceptance: no evidence against $H_0$ is misinterpreted as evidence for $H_0$. This fallacy can easily arise in cases where the test in question has low power to detect discrepancies of interest.

(b) The fallacy of rejection: evidence against $H_0$ is misinterpreted as evidence for a particular $H_1$. This fallacy arises in cases where the test in question has high power to detect substantively minor discrepancies. Since the power of a test increases with the sample size $n$, this renders N-P rejections, as well as tiny p-values, with large $n$ highly susceptible to this fallacy.

In the statistics literature, as well as in the secondary literatures in several applied fields, there have been numerous attempts to circumvent these two fallacies, but none succeeded.

8 Post-data severity evaluation
A moment's reflection about the inability of the accept/reject $H_0$ and p-value results to provide a satisfactory answer to the question in (42) suggests that the problem arises primarily when such results are detached from the test itself, and are treated as providing the same evidence for a particular hypothesis $H$ ($H_0$ or $H_1$), regardless of the generic capacity (the power) of the test in question. That is, whether data $\mathbf{x}_0$ provide evidence for or against a particular hypothesis $H$ ($H_0$ or $H_1$) depends crucially on the generic capacity (power) of the test in question to detect discrepancies from $H_0$. This stems from the intuition that a small p-value or a rejection of $H_0$ based on a test with low power (e.g.
a small $n$) for detecting a particular discrepancy $\gamma$ provides stronger evidence for the presence of a particular discrepancy $\gamma$ than using a test with much higher power (e.g. a large $n$). Mayo (1996) proposed a frequentist evidential account based on harnessing this intuition in the form of a post-data
35. severity evaluation of the accept/reject results. This is based on custom-tailoring the generic capacity of the test to establish the discrepancy $\gamma$ warranted by data $\mathbf{x}_0$. This evidential account can be used to circumvent the above fallacies, as well as other (misplaced) charges against frequentist testing.

The severity evaluation is a post-data appraisal of the accept/reject and p-value results that revolves around the discrepancy $\gamma$ from $H_0$ warranted by data $\mathbf{x}_0$.

■ A hypothesis $H$ passes a severe test $T_\alpha$ with data $\mathbf{x}_0$ if:
(S-1) $\mathbf{x}_0$ accords with $H$, and
(S-2) with very high probability, test $T_\alpha$ would have produced a result that accords less well with $H$ than $\mathbf{x}_0$ does, if $H$ were false.

Severity can be viewed as a feature of a test $T_\alpha$ as it relates to particular data $\mathbf{x}_0$ and a specific claim $H$ being considered. Hence, the severity function has three arguments, $SEV(T_\alpha,\ \mathbf{x}_0,\ H)$, denoting the severity with which $H$ passes $T_\alpha$ with $\mathbf{x}_0$; see Mayo and Spanos (2006).

Example. To explain how the above severity evaluation can be applied in practice, let us return to the problem of assessing the two substantive values of interest:

$$\text{Arbuthnot: } \theta_A=\tfrac{1}{2},\qquad \text{Bernoulli: } \theta_B=\tfrac{18}{35}.$$

When these values are viewed in the context of the error statistical perspective, it becomes clear that the way to frame the probing is to choose one of the values as the null hypothesis, and let the difference between them $(\theta_B-\theta_A)$ represent the discrepancy of substantive interest.

Data: $n=30762$ newborns during the period 1993-5 in Cyprus, of which 16029 are boys and 14733 girls.

Let us return to probing the Arbuthnot value $\theta_A$ based on:

$$\text{(1-s>)}:\ H_0: \theta=\tfrac{1}{2}\ \text{ vs. }\ H_1: \theta>\tfrac{1}{2},$$

that gave rise to the observed test statistic:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{1}{2}\right)}{\sqrt{.5(.5)}}=7.389, \tag{43}$$

leading to a rejection of $H_0$ at $\alpha=.01 \Rightarrow c_\alpha=2.326$. Indeed, $H_0$ would have also been rejected by a (2-s) test since $c_{\alpha/2}=2.576$.
An important feature of the severity evaluation is that it is post-data, and thus the sign of the observed test statistic $d(\mathbf{x}_0)$ provides
36. directional information that indicates the directional inferential claims that 'passed'. In relation to the above example, the severity 'accordance' condition (S-1) implies that the rejection of $\theta_0=\tfrac{1}{2}$ with $d(\mathbf{x}_0)=7.389>0$ indicates that the inferential claim that 'passed' is of the generic form:

$$\theta>\theta_1=\theta_0+\gamma,\ \text{ for some } \gamma\geq 0. \tag{44}$$

The directional feature of the severity evaluation is very important in addressing several criticisms of N-P testing, including the potential arbitrariness and possible abuse of:
[a] switching between one-sided, two-sided or simple-vs-simple hypotheses,
[b] interchanging the null and alternative hypotheses, and
[c] manipulating the level of significance in an attempt to get the desired testing result.

To establish the particular discrepancy $\gamma$ warranted by data $\mathbf{x}_0$, the severity post-data 'discordance' condition (S-2) calls for evaluating the probability of the event "outcomes $\mathbf{x}$ that accord less well with $\theta>\theta_1$ than $\mathbf{x}_0$ does", i.e. $[\mathbf{x}: d(\mathbf{x})\leq d(\mathbf{x}_0)]$, yielding:

$$SEV(T_\alpha^{>};\ \theta>\theta_1)=P(\mathbf{x}: d(\mathbf{x})\leq d(\mathbf{x}_0);\ \theta>\theta_1\text{ is false}). \tag{45}$$

The evaluation of this probability is based on the sampling distribution:

$$d(\mathbf{X})=\frac{\sqrt{n}(\widehat{\theta}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{\theta=\theta_1}{\sim}\ \text{Bin}\left(\delta(\theta_1),\ V(\theta_1);\ n\right),\ \text{ for } \theta_1>\theta_0, \tag{46}$$

$$\delta(\theta_1)=\frac{\sqrt{n}(\theta_1-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\geq 0,\qquad V(\theta_1)=\frac{\theta_1(1-\theta_1)}{\theta_0(1-\theta_0)},\quad 0<V(\theta_1)\leq 1.$$

Since the hypothesis that 'passed' is of the form $\theta>\theta_1=\theta_0+\gamma$, the primary objective of $SEV(T_\alpha^{>};\ \theta>\theta_1)$ is to determine the largest discrepancy $\gamma\geq 0$ warranted by data $\mathbf{x}_0$.

Table 4 evaluates $SEV(T_\alpha^{>};\ \theta>\theta_1)$ for different values of $\gamma$, with the evaluations based on the standardized $d(\mathbf{X})$:

$$\frac{[d(\mathbf{X})-\delta(\theta_1)]}{\sqrt{V(\theta_1)}}\ \overset{\theta=\theta_1}{\sim}\ \text{Bin}(0,\ 1;\ n)\simeq N(0,\ 1).$$

For example, when $\theta_1=.5142$:

$$d(\mathbf{x}_0)-\delta(\theta_1)=\frac{\sqrt{n}(\widehat{\theta}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}-\delta(\theta_1)=7.389-\frac{\sqrt{30762}(.5142-.5)}{\sqrt{.5(.5)}}=2.408, \tag{47}$$
37. and $\Phi(z\leq 2.408)=.992$, where $\Phi(\cdot)$ is the cumulative distribution function (cdf) of the standard Normal distribution. Note that the scaling $\sqrt{V(\theta_1)}=.9996\simeq 1$ can be ignored.

Table 4: Reject $H_0: \theta=.5$ vs. $H_1: \theta>.5$

| $\gamma$ | Relevant claim $\theta>\theta_1=[.5+\gamma]$ | Severity $P(\mathbf{x}: d(\mathbf{X})\leq d(\mathbf{x}_0);\ \theta=\theta_1)$ |
|---|---|---|
| .010 | $\theta>.510$ | .999 |
| .013 | $\theta>.513$ | .997 |
| .0142 | $\theta>.5142$ | .992 |
| .0143 | $\theta>.5143$ | .991 |
| .015 | $\theta>.515$ | .983 |
| .016 | $\theta>.516$ | .962 |
| .017 | $\theta>.517$ | .923 |
| .018 | $\theta>.518$ | .859 |
| .020 | $\theta>.520$ | .645 |
| .021 | $\theta>.521$ | .500 |
| .025 | $\theta>.525$ | .084 |

An evidential interpretation. Taking a very high probability, say .95, as a threshold, the largest discrepancy from the null warranted by this data is:

$$\gamma\leq .01637,\ \text{ since } SEV(T_\alpha^{>};\ \theta>.51637)=.950.$$

Is this discrepancy substantively significant? In general, to answer this question one needs to appeal to substantive subject matter information to assess the warranted discrepancy on substantive grounds.
In human biology it is commonly accepted that the sex ratio at birth is approximately 105 boys to 100 girls; see Hardy (2002). Translating this ratio in terms of $\theta$ yields $\theta=\frac{105}{205}=.5122$, which suggests that the above warranted discrepancy $\gamma\geq .01637$ is substantively significant, since it corresponds to $\theta\geq .5164$, which exceeds .5122.
In light of that, the substantive discrepancy of interest between $\theta_B$ and $\theta_A$:

$$\gamma^*=(18/35)-.5=.0142857,$$

yields $SEV(T_\alpha^{>};\ \theta>.5143)=.991$, indicating that there is excellent evidence for the claim $\theta>\theta_1=\theta_0+\gamma^*$.

► In terms of the ultimate objective of statistical inference, learning from data, this evidential interpretation seems highly effective because it narrows down an infinite set $\Theta$ to a very small subset!
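The severity values in Table 4 can be recomputed as follows. This is a minimal sketch using the Normal approximation and, as in the text, ignoring the scaling $\sqrt{V(\theta_1)}$; the function name `severity` and the variable `gamma_95` are illustrative assumptions, and the printed figures should agree with the table only up to its rounding.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal N(0, 1)

n, boys, theta0 = 30762, 16029, 0.5
theta_hat = boys / n
scale0 = sqrt(theta0 * (1 - theta0))
d_obs = sqrt(n) * (theta_hat - theta0) / scale0  # observed d(x0)

def severity(gamma):
    """SEV(theta > theta0 + gamma) = P(d(X) <= d(x0); theta = theta1),
    with the Normal approximation and sqrt(V(theta1)) treated as 1."""
    delta = sqrt(n) * gamma / scale0
    return Z.cdf(d_obs - delta)

for gamma in (0.010, 0.013, 0.0142, 0.0143, 0.015, 0.016,
              0.017, 0.018, 0.020, 0.021, 0.025):
    print(f"gamma={gamma:.4f}  theta1={theta0 + gamma:.4f}  "
          f"SEV={severity(gamma):.3f}")

# Largest discrepancy warranted at a severity threshold of 0.95:
# solve d(x0) - delta(gamma) = z_{0.95} for gamma
gamma_95 = (d_obs - Z.inv_cdf(0.95)) * scale0 / sqrt(n)
print(f"gamma at SEV = 0.95: {gamma_95:.5f}")
```

Severity falls as the claimed discrepancy grows: claims of small discrepancies from $\theta_0=.5$ pass with high severity, while claims beyond roughly $\gamma=.02$ do not.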
38. 38. 8.1 Revisiting issues pertaining to N-P testing

The above evidential interpretation based on the severity assessment can also be used to shed light on a number of issues raised in the previous sections.

(a) The arbitrariness of N-P hypotheses specification

The question that naturally arises at this stage is whether having good evidence for inferring $\theta > \tfrac{18}{35}$ depends on the particular way one has chosen to specify the hypotheses for the N-P test. Intuitively, one would expect that the evidence should not depend on the specification of the null and alternative hypotheses as such, but on $x_0$ and the statistical model $\mathcal{M}_\theta(x)$, $x \in \mathbb{R}_X^n$.

To explore that issue let us focus on probing the Bernoulli value $H_0: \theta_0 = \tfrac{18}{35}$, where the test statistic yields:

$$d(x_0) = \frac{\sqrt{30762}\left(\tfrac{16029}{30762} - \tfrac{18}{35}\right)}{\sqrt{\tfrac{18}{35}\left(1 - \tfrac{18}{35}\right)}} = 2.379. \tag{48}$$

This value leads to rejecting $H_0$ at $\alpha = .01$ when one uses a (1-s) test [$c_\alpha = 2.326$], but accepting $H_0$ when one uses a (2-s) test [$c_\alpha = 2.576$]. Which result should one believe?

Irrespective of these conflicting results, the severity ‘accordance’ condition (S-1) implies that, in light of the observed test statistic $d(x_0) = 2.379 > 0$, the directional claim that ‘passed’ on the basis of (48) is of the form:

$$\theta > \theta_1 = \theta_0 + \gamma.$$

To evaluate the particular discrepancy $\gamma$ warranted by data $x_0$, condition (S-2) calls for the evaluation of the probability of the event ‘outcomes $x$ that accord less well with $\theta > \theta_1$ than $x_0$ does’, i.e. $[x: d(x) \leq d(x_0)]$, giving rise to (45), but with a different $\theta_0$. Table 5 provides several such evaluations of $SEV(T_\alpha; \theta > \theta_1)$ for different values of $\gamma$; note that these evaluations are also based on (46).
A number of negative values of $\gamma$ are included in order to address the original substantive hypotheses of interest, as well as to bring out the fact that the results of tables 4 and 5 are identical when viewed as inferences pertaining to $\theta$; they simply have different null values $\theta_0$, and thus different discrepancies $\gamma$, but identical $\theta_1$ values.
39. 39. The severity evaluations in tables 4 and 5 not only render the choice between (2-s), (1-s) and simple-vs-simple irrelevant, they also scotch the well-rehearsed argument pertaining to the asymmetry between the null and alternative hypotheses. The N-P convention of selecting a small $\alpha$ is generally viewed as reflecting a strong bias against rejection of the null. On the other hand, Bayesian critics of frequentist testing argue the exact opposite: N-P testing is biased against accepting the null; see Lindley (1993). The evaluations in tables 4 and 5 demonstrate that the evidential interpretation of frequentist testing based on severe testing addresses all asymmetries, notional or real.

Table 5: Accept/Reject $H_0: \theta = \tfrac{18}{35}$ vs. $H_1: \theta \gtrless \tfrac{18}{35}$

| $\gamma$ | Relevant claim $\theta > \theta_1 = [.5143 + \gamma]$ | Severity $P(x: d(x) \leq d(x_0); \theta_1)$ |
|---|---|---|
| −.0043 | $\theta > .510$ | .999 |
| −.0013 | $\theta > .513$ | .997 |
| −.0001 | $\theta > .5142$ | .992 |
| .0000 | $\theta > .5143$ | .991 |
| .0007 | $\theta > .515$ | .983 |
| .0017 | $\theta > .516$ | .962 |
| .0027 | $\theta > .517$ | .923 |
| .0037 | $\theta > .518$ | .859 |
| .0057 | $\theta > .520$ | .645 |
| .0067 | $\theta > .521$ | .500 |
| .0107 | $\theta > .525$ | .084 |

(b) Addressing the Fallacy of Rejection

The potential arbitrariness of the N-P specification of the null and alternative hypotheses and the associated p-values is brought out in probing the Bernoulli value $\theta = \tfrac{18}{35}$ using different alternatives:

(1-s): $p_{<}(x_0) = P(d(X) < 2.379; H_0) = .991$
(2-s): $p_{\neq}(x_0) = P(|d(X)| > 2.379; H_0) = .017$
(1-s): $p_{>}(x_0) = P(d(X) > 2.379; H_0) = .009$

Can the above severity evaluation explain away these highly conflicting and confusing results?
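The three conflicting p-values, and the accept/reject conflict at $\alpha = .01$, can be reproduced with a short sketch. It assumes the Normal approximation to $d(X)$ under $H_0$ and uses the counts from (48); the variable names are illustrative only and are not part of the original slides.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()                  # standard Normal, used as the approximation

n, boys = 30762, 16029            # data appearing in (48)
theta_hat = boys / n
theta0 = 18 / 35                  # the Bernoulli value being probed

# Observed test statistic d(x0), as in (48).
d_x0 = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))

# p-values under the three alternative specifications.
p_lower = Z.cdf(d_x0)                         # H1: theta < 18/35
p_two_sided = 2 * (1 - Z.cdf(abs(d_x0)))      # H1: theta != 18/35
p_upper = 1 - Z.cdf(d_x0)                     # H1: theta > 18/35

# N-P accept/reject decisions at alpha = .01.
alpha = 0.01
c_one_sided = Z.inv_cdf(1 - alpha)            # about 2.326
c_two_sided = Z.inv_cdf(1 - alpha / 2)        # about 2.576
reject_one_sided = d_x0 > c_one_sided
reject_two_sided = abs(d_x0) > c_two_sided

print(f"d(x0) = {d_x0:.3f}")
print(f"p_lower = {p_lower:.3f}, p_two_sided = {p_two_sided:.4f}, "
      f"p_upper = {p_upper:.3f}")
print(f"reject (1-s): {reject_one_sided}, reject (2-s): {reject_two_sided}")
```

The same observed $d(x_0)$ feeds all three calculations; only the framing of the alternative changes, which is the arbitrariness at issue.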
40. 40. The first choice $H_1: \theta < \tfrac{18}{35}$ was driven solely by substantive information relating to $\theta = \tfrac{1}{2}$, which is often a bad idea because it ignores the statistical dimension of inference. The second choice was based on lack of information about the direction of departure, which makes sense pre-data, but not post-data. The third choice reflects the post-data direction of departure indicated by $d(x_0) = 2.379 > 0$. In light of that, the severity evaluation renders $p_{>}(x_0) = .009$ the relevant p-value, after all!

The key problem with the p-value. In addition, the severity evaluation brings out the vulnerability of both the N-P reject rule and the p-value to the fallacy of rejection in cases where $n$ is large. Viewing the relevant p-value ($p_{>}(x_0) = .009$) from the severity vantage point, it is directly related to $H_1$ passing a severe test, because the probability that test $T_\alpha$ would have produced a result that accords less well with $H_1$ than $x_0$ does ($x: d(x) < d(x_0)$), if $H_1$ were false ($H_0$ true):

$$Sev(T_\alpha; x_0; \theta > \theta_0) = P(d(X) < d(x_0); \theta \leq \theta_0) = 1 - P(d(X) > d(x_0); \theta = \theta_0) = .991,$$

is very high; Mayo (1996). The key problem with the p-value, however, is that it establishes the existence of some discrepancy $\gamma \geq 0$ but provides no information concerning the magnitude licensed by $x_0$. The severity evaluation remedies that by relating $Sev(T_\alpha; x_0; \theta > \theta_0) = .991$ to the inferential claim $\theta > \tfrac{18}{35}$, with the implicit discrepancy associated with the p-value being $\gamma = 0$.

(c) Addressing the Fallacy of Acceptance

Example. Let us return to the hypotheses:

(1-s): $H_0: \theta = \tfrac{1}{2}$ vs. $H_1: \theta > \tfrac{1}{2}$

in the context of the simple Bernoulli model (table 2), and consider applying test $T_\alpha$ with $\alpha = .025 \Rightarrow c_\alpha = 1.96$ to the case where data $x_0$ are based on $n = 2000$ and yield $\hat{\theta}_n = .503$:

$$d(x_0) = \frac{\sqrt{2000}\,(.503 - .5)}{\sqrt{.5(.5)}} = .268, \qquad P(d(X) > .268; H_0) = .394.$$

$d(x_0) = .268$ leads to accepting $H_0$, and the p-value indicates no rejection of $H_0$. Does this mean that data $x_0$ provide evidence for $H_0$?
Or, more accurately, does $x_0$ provide evidence for no substantive discrepancy from the null? Not necessarily!
41. 41. To answer that question one needs to apply the post-data severe testing reasoning to evaluate the discrepancy $\gamma \geq 0$ for $\theta_1 = \theta_0 + \gamma$ warranted by data $x_0$. Test $T_\alpha := \{d(X), C_1(\alpha)\}$ passes $H_0$, and the idea is to establish the smallest discrepancy $\gamma \geq 0$ from $H_0$ warranted by data $x_0$ by evaluating (post-data) the claim $\theta \leq \theta_1$ for $\theta_1 = \theta_0 + \gamma$.

Condition (S-1) of severity is satisfied since $x_0$ accords with $H_0$ because $d(x_0) < c_\alpha$, but (S-2) requires one to evaluate the probability of "all outcomes $x$ for which test $T_\alpha$ accords less well with $H_0$ than $x_0$ does", i.e. $(x: d(x) > d(x_0))$, under the hypothetical scenario that "$\theta \leq \theta_1$ is false" or equivalently "$\theta > \theta_1$ is true":

$$Sev(T_\alpha; x_0; \theta \leq \theta_1) = P(x: d(x) > d(x_0); \theta > \theta_1), \quad \text{for } \gamma \geq 0. \tag{49}$$

The evaluation of (49) relies on the sampling distribution (46), and for different discrepancies ($\gamma \geq 0$) yields the results in table 6. Note that $Sev(T_\alpha; x_0; \theta \leq \theta_1)$ is evaluated at $\theta = \theta_1$ because the probability increases with $\theta$. The numerical evaluations are based on the standardized $d(X)$ from (47). For example, when $\theta_1 = .515$:

$$\frac{\sqrt{n}(\hat{\theta}_n - \theta_0)}{\sqrt{\theta_0(1-\theta_0)}} - \delta(\theta_1) = .268 - \frac{\sqrt{2000}\,(.515 - .5)}{\sqrt{.5(.5)}} = -1.074, \qquad P(Z > -1.074) = 1 - \Phi(-1.074) = .859,$$

where $\Phi(\cdot)$ is the cumulative distribution function (cdf) of $N(0, 1)$. Note that the scaling factor $\sqrt{V(\theta_1)} = \sqrt{\tfrac{.515(1 - .515)}{.5(.5)}} = .9996$ can be ignored because its value is so close to 1.

Table 6: Severity Evaluation of ‘Accept $H_0$’ with $(T_\alpha; x_0)$

| Relevant claim: $\theta \leq \theta_1 = \theta_0 + \gamma$, with $\gamma =$ | .001 | .005 | .01 | .015 | .017 | .0173 | .018 | .02 | .025 | .03 |
|---|---|---|---|---|---|---|---|---|---|---|
| $Sev(\theta \leq \theta_1)$ | .429 | .571 | .734 | .859 | .895 | .900 | .910 | .936 | .975 | .992 |

Assuming a severity threshold of, say, .90 is high enough, the above results indicate that the ‘smallest’ discrepancy warranted by $x_0$ is $\gamma \geq .0173$, since:

$$Sev(T_\alpha; x_0; \theta \leq .5173) = .9$$
42. 42. That is, data $x_0$ in conjunction with test $T_\alpha$ provide evidence (with severity at least .9) for the presence of a discrepancy as large as $\gamma \geq .0173$.

Is this discrepancy substantively insignificant? No! As mentioned above, the accepted ratio at birth in human biology is $\theta = .512$, which suggests that the warranted discrepancy $\gamma \geq .0173$ is, indeed, substantively significant, since it implies $\theta \geq .5173$, which exceeds .512. This is an example where the post-data severity evaluation can be used to address the fallacy of acceptance. The severity assessment indicates that the statistically insignificant result $d(x_0) = .268$ at $\alpha = .025$ actually provides evidence against $H_0: \theta \leq .5$ and for $\theta \geq .5173$ with severity at least .90.

(d) The arbitrariness of choosing the significance level

It is often argued by critics of frequentist testing that the choice of the significance level in the context of the N-P approach, or the threshold for the p-value in the case of Fisherian significance testing, is totally arbitrary and manipulable.

To see how the severity evaluations can address this problem, let us try to ‘manipulate’ the original significance level ($\alpha = .01$) to a different one that would alter the accept/reject results for $H_0: \theta = \tfrac{18}{35}$. Looking back at the results for Bernoulli's conjectured value using the data from Cyprus, it is easy to see that choosing a larger significance level, say $\alpha = .02 \Rightarrow c_\alpha = 2.053$, would lead to rejecting $H_0$ for both N-P formulations, (2-s) ($H_1: \theta \neq \tfrac{18}{35}$) and (1-s) ($H_1: \theta > \tfrac{18}{35}$), since $d(x_0) = 2.379 > 2.053$:

(2-s): $p_{\neq}(x_0) = P(|d(X)| > 2.379; H_0) = .0174$
(1-s): $p_{>}(x_0) = P(d(X) > 2.379; H_0) = .009$

How does this change the original severity evaluations? The short answer is: it doesn't! The severity evaluations remain invariant to any changes to $\alpha$ because they depend on $d(x_0) = 2.379 > 0$ and not on $c_\alpha$, which indicates that the generic form of the hypothesis that ‘passed’ is $\theta > \theta_1 = \theta_0 + \gamma$, and the same evaluations apply.
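The ‘accept’ case of example (c) can be sketched in the same way. The snippet below is an illustration under the Normal approximation, with the scaling factor ignored as in the text; the function name `severity_accept` is hypothetical. It evaluates $Sev(\theta \leq \theta_1) = P(d(X) > d(x_0); \theta = \theta_1)$ for $n = 2000$ and $\hat{\theta}_n = .503$, and should come close to the entries of table 6 up to rounding.

```python
from math import sqrt
from statistics import NormalDist


def severity_accept(theta1, theta0=0.5, theta_hat=0.503, n=2000):
    """Approximate Sev(theta <= theta1) = P(d(X) > d(x0); theta = theta1)."""
    se0 = sqrt(theta0 * (1 - theta0))
    d_x0 = sqrt(n) * (theta_hat - theta0) / se0    # observed statistic, ~ .268
    delta = sqrt(n) * (theta1 - theta0) / se0      # mean shift under theta1
    return 1 - NormalDist().cdf(d_x0 - delta)


# p-value of the (1-s) test: P(d(X) > d(x0); theta = theta0).
p_value = severity_accept(0.5)

print(f"p-value ~ {p_value:.3f}")
for gamma in (0.001, 0.005, 0.01, 0.015, 0.017, 0.0173,
              0.018, 0.02, 0.025, 0.03):
    print(f"gamma={gamma:.4f}  Sev(theta <= {0.5 + gamma:.4f}) "
          f"~ {severity_accept(0.5 + gamma):.3f}")
```

Note that the significance level never enters `severity_accept`: only $d(x_0)$ and the claim $\theta \leq \theta_1$ do, which is the invariance point made in (d).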
43. 43. 9 Summary and conclusions

In frequentist inference, learning from data $x_0$ about the stochastic phenomenon of interest is accomplished by applying optimal inference procedures with ascertainable error probabilities in the context of a statistical model:

$$\mathcal{M}_\theta(x) = \{f(x; \theta),\ \theta \in \Theta\}, \quad x \in \mathbb{R}_X^n. \tag{50}$$

Hypothesis testing gives rise to learning from data by partitioning $\mathcal{M}_\theta(x)$ into two subsets:

$$\mathcal{M}_0(x) = \{f(x; \theta),\ \theta \in \Theta_0\} \quad \text{or} \quad \mathcal{M}_1(x) = \{f(x; \theta),\ \theta \in \Theta_1\}, \quad x \in \mathbb{R}_X^n, \tag{51}$$

framed in terms of the parameter(s):

$$H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1, \tag{52}$$

and inquiring of $x_0$ for evidence in favor of one of the two subsets in (51) using hypothetical reasoning. Note that one could specify (52) more perceptively as:

$$H_0: \overbrace{f^*(x) \in \mathcal{M}_0(x) = \{f(x; \theta),\ \theta \in \Theta_0\}}^{\theta \in \Theta_0} \quad \text{vs.} \quad H_1: \overbrace{f^*(x) \in \mathcal{M}_1(x) = \{f(x; \theta),\ \theta \in \Theta_1\}}^{\theta \in \Theta_1},$$

where $f^*(x) = f(x; \theta^*)$ denotes the ‘true’ distribution of the sample. This notation elucidates the basic question posed in hypothesis testing:

Given that the true $\mathcal{M}^*(x)$ lies within the boundaries of $\mathcal{M}_\theta(x)$, can $x_0$ be used to narrow it down to a smaller subset $\mathcal{M}_0(x)$?

A test $T_\alpha := \{d(X), C_1(\alpha)\}$ is defined in terms of a test statistic ($d(X)$) and a rejection region ($C_1(\alpha)$), and its optimality is calibrated in terms of the relevant error probabilities:

type I: $P(x_0 \in C_1; H_0(\theta) \text{ true}) \leq \alpha(\theta)$, for $\theta \in \Theta_0$,
type II: $P(x_0 \in C_0; H_1(\theta_1) \text{ true}) = \beta(\theta_1)$, for $\theta_1 \in \Theta_1$.

These error probabilities specify how often these procedures lead to erroneous inferences, and thus determine the pre-data capacity (power) of the test in question to give rise to valid inferences:

$$\pi(\theta_1) = P(x_0 \in C_1; H_1(\theta_1) \text{ true}) = 1 - \beta(\theta_1), \quad \text{for all } \theta_1 \in \Theta_1.$$

An inference is reached by an inductive procedure which, with high probability, will reach true conclusions from valid (or approximately
44. 44. valid) premises (statistical model). Hence, the trustworthiness of frequentist inference depends on two interrelated pre-conditions:

(a) adopting optimal inference procedures, in the context of:
(b) a statistically adequate model.

How does hypothesis testing relate to the other forms of frequentist inference? It turns out that testing is inextricably bound up with estimation (point and interval), in the sense that an optimal estimator is often the cornerstone of an optimal test and an optimal confidence interval. For instance, consistent, fully efficient and minimal sufficient estimators provide the basis for most of the optimal tests and confidence intervals!

Having said that, optimal point estimation is considered rather noncontroversial, but optimal interval estimation and hypothesis testing have raised several foundational issues that have bedeviled frequentist inference since the 1930s. In particular, the use and abuse of the relevant error probabilities calibrating these latter forms of inductive inference have contributed significantly to the general confusion and controversy when applying these procedures.

Addressing foundational issues

The first such issue concerns the presupposition of the (approximately) true premises, i.e. that the true $\mathcal{M}^*(x)$ lies within the boundaries of $\mathcal{M}_\theta(x)$. This can be addressed by securing the statistical adequacy of $\mathcal{M}_\theta(x)$ vis-a-vis data $x_0$, using trenchant Mis-Specification (MS) testing, before applying frequentist testing; see chapter 15. As shown above, when $\mathcal{M}_\theta(x)$ is misspecified, the nominal and actual error probabilities can be very different, rendering the reliability of the test in question at best unknown and at worst highly misleading. That is, statistical adequacy ensures the ascertainability of the relevant error probabilities, and hence the reliability of inference.
The second issue concerns the form and nature of the evidence $x_0$ can provide for $\theta \in \Theta_0$ or $\theta \in \Theta_1$. Neither Fisher's p-value, nor the N-P accept/reject rules can provide such an evidential interpretation, primarily because they are vulnerable to two serious fallacies:

(c) Fallacy of acceptance: no evidence against the null is misinterpreted as evidence for it.
(d) Fallacy of rejection: evidence against the null is misinterpreted as evidence for a specific alternative.
45. 45. As discussed above, these fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing, with a view to determining the discrepancy $\gamma$ from the null warranted by data $x_0$. This establishes the inferential claim warranted [and thus, the unwarranted ones] in particular cases, so that it gives rise to learning from data.

The severity perspective sheds ample light on the various controversies relating to p-values vs. type I and II error probabilities, delineating that the p-value is a post-data error probability and the type I and II are pre-data error probabilities, and that, although inadequate for the task of furnishing an evidential interpretation, they were intended to fulfill complementary roles. Pre-data error probabilities, like the type I and II, are used to appraise the generic capacity of testing procedures, and post-data error probabilities, like severity, are used to bridge the gap between the coarse ‘accept/reject’ and the evidence provided by data $x_0$ for the warranted discrepancy from the null, by custom-tailoring the pre-data capacity in light of $d(x_0)$.

This refinement/extension of the Fisher-Neyman-Pearson framework can be used to address not only the above mentioned fallacies, but also several criticisms that have beleaguered frequentist testing since the 1930s, including:

(e) the arbitrariness of selecting the thresholds,
(f) the framing of the null and alternative hypotheses in the context of a statistical model $\mathcal{M}_\theta(x)$, and
(g) the numerous misinterpretations of the p-value.
46. 46. 10 Appendix: Revisiting Confidence Intervals (CIs)

Mathematical Duality between Testing and CIs

From a purely mathematical perspective a point estimator is a mapping $h(\cdot)$ of the form:

$$h(\cdot): \mathbb{R}_X^n \to \Theta.$$

Similarly, an interval estimator is also a mapping $C(\cdot)$ of the form:

$$C(\cdot): \mathbb{R}_X^n \to \Theta, \quad \text{such that } C^{-1}(\theta) = \{x: \theta \in C(x)\} \subset \mathcal{F}, \ \text{for } \theta \in \Theta.$$

Hence, one can assign a probability to $C^{-1}(\theta) = \{x: \theta \in C(x)\}$ in the sense that:

$$P(x: \theta \in C(x); \theta = \theta^*) := P(\theta \in C(X); \theta = \theta^*)$$

denotes the probability that the random set $C(X)$ contains $\theta^*$, the true value of $\theta$. The expression ‘$\theta^*$ denotes the true value of $\theta$’ is a shorthand for saying that ‘data $x_0$ constitute a realization of the sample $X$ with distribution $f(x; \theta^*)$’. One can define a $(1-\alpha)$ Confidence Interval (CI) for $\theta$ by attaching a lower bound to this probability:

$$P(\theta \in C(X); \theta = \theta^*) \geq (1 - \alpha).$$

Neyman (1937) utilized a mathematical duality between N-P Hypothesis Testing (HT) and Confidence Intervals (CIs) to put forward an optimal theory for the latter. The $\alpha$-level test of the hypotheses:

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta \neq \theta_0,$$

takes the form $\{d(X; \theta_0), C_1(\theta_0; \alpha)\}$, with $d(X; \theta_0)$ the test statistic and the rejection region defined by:

$$C_1(\theta_0; \alpha) = \{x: |d(x; \theta_0)| > c_{\alpha/2}\} \subset \mathbb{R}_X^n.$$

The complement of the latter with respect to the sample space ($\mathbb{R}_X^n$) defines the acceptance region:

$$C_0(\theta_0; \alpha) = \{x: |d(x; \theta_0)| \leq c_{\alpha/2}\} = \mathbb{R}_X^n - C_1(\theta_0; \alpha).$$

The mathematical duality between HT and CIs stems from the fact that for each $x \in \mathbb{R}_X^n$ one can define a corresponding set $C(x; \alpha)$ on the parameter space by:

$$C(x; \alpha) = \{\theta: x \in C_0(\theta; \alpha)\}, \ \theta \in \Theta, \quad \text{e.g. } C(x; \alpha) = \{\theta: |d(x; \theta)| \leq c_{\alpha/2}\}.$$
47. 47. Conversely, for each $C(x; \alpha)$ there is a CI corresponding to the acceptance region $C_0(\theta_0; \alpha)$ such that:

$$C_0(\theta_0; \alpha) = \{x: \theta_0 \in C(x; \alpha)\}, \ x \in \mathbb{R}_X^n, \quad \text{e.g. } C_0(\theta_0; \alpha) = \{x: |d(x; \theta_0)| \leq c_{\alpha/2}\}.$$

This equivalence is encapsulated by:

$$\text{for each } x \in \mathbb{R}_X^n: \quad x \in C_0(\theta_0; \alpha) \ \Leftrightarrow \ \theta_0 \in C(x; \alpha), \quad \text{for each } \theta_0 \in \Theta.$$

When the lower bound $\hat{\theta}_L(X)$ and upper bound $\hat{\theta}_U(X)$ of a CI, i.e.

$$P\left(\hat{\theta}_L(X) \leq \theta \leq \hat{\theta}_U(X)\right) = 1 - \alpha,$$

are monotone increasing functions of $X$, the equivalence is:

$$\theta \in \left(\hat{\theta}_L(x) \leq \theta \leq \hat{\theta}_U(x)\right) \ \Leftrightarrow \ x \in \left(\hat{\theta}_U^{-1}(\theta) \leq x \leq \hat{\theta}_L^{-1}(\theta)\right).$$

Optimality of Confidence Intervals

The mathematical duality between HT and CIs is particularly useful in defining an optimal CI that is analogous to an optimal N-P test. The key result is that the acceptance region of an $\alpha$-significance level test $T_\alpha$ that is Uniformly Most Powerful (UMP) gives rise to a $(1-\alpha)$ CI that is Uniformly Most Accurate (UMA), i.e. the error coverage probability that $C(x; \alpha)$ contains a value $\theta \neq \theta_0$, where $\theta_0$ is assumed to be the true value, is minimized; Lehmann (1986). Similarly, an $\alpha$-UMPU test gives rise to a UMAU Confidence Interval. Moreover, it can be shown that such optimal CIs have the shortest length in the following sense: if $[\hat{\theta}_L(x), \hat{\theta}_U(x)]$ is a $(1-\alpha)$-UMAU CI, then its length [width] is less than or equal to that of any other $(1-\alpha)$ CI, say $[\tilde{\theta}_L(x), \tilde{\theta}_U(x)]$:

$$E_{\theta^*}\left(\hat{\theta}_U(X) - \hat{\theta}_L(X)\right) \leq E_{\theta^*}\left(\tilde{\theta}_U(X) - \tilde{\theta}_L(X)\right).$$

Underlying reasoning: CI vs. HT

Does this mathematical duality render HT and CIs equivalent in terms of their respective inference results? No!
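The purely mathematical side of the duality, $x \in C_0(\theta_0; \alpha) \Leftrightarrow \theta_0 \in C(x; \alpha)$, can be checked numerically with a small sketch. The example below assumes a simple Normal model with known $\sigma$ and test statistic $d(x; \mu_0) = \sqrt{n}(\bar{x}_n - \mu_0)/\sigma$; the numerical values ($\bar{x}_n = .3$, $\sigma = 1$, $n = 25$) and the function names are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def in_acceptance_region(xbar, mu0, sigma, n, alpha):
    """True if the sample mean falls in C_0(mu0; alpha) of the 2-sided test."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(sqrt(n) * (xbar - mu0) / sigma) <= c


def confidence_interval(xbar, sigma, n, alpha):
    """Observed (1 - alpha) CI: xbar -/+ c_{alpha/2} * sigma / sqrt(n)."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = c * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical observed sample mean and model settings.
xbar, sigma, n, alpha = 0.3, 1.0, 25, 0.05
lower, upper = confidence_interval(xbar, sigma, n, alpha)

# For a grid of hypothesized values mu0, 'x0 accepted under mu0' and
# 'mu0 inside the observed CI' should coincide.
grid = [k / 100 for k in range(-100, 101)]
duality_holds = all(
    in_acceptance_region(xbar, mu0, sigma, n, alpha) == (lower <= mu0 <= upper)
    for mu0 in grid
)
print(f"observed CI: [{lower:.3f}, {upper:.3f}], duality holds: {duality_holds}")
```

The check confirms only that the two sets coincide; it says nothing about the reasoning or the error probabilities attached to each procedure, which is where the two part ways.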
48. 48. The primary source of the confusion on this issue is that they share a common pivotal function, and their error probabilities are evaluated using the same tail areas. But that's where the analogy ends.

To bring out how their inferential claims and respective error probabilities are fundamentally different, consider the simple (one parameter) Normal model (table 3), where the common pivot is:

$$d(X; \mu) = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \sim N(0, 1). \tag{53}$$

Since $\mu$ is unknown, (53) is meaningless as it stands, and it is crucial to understand how the CI and HT procedures render it meaningful using different types of reasoning, referred to as factual and hypothetical, respectively:

$$d(X; \mu) = \frac{\sqrt{n}(\bar{X}_n - \mu^*)}{\sigma} \overset{\text{TSN}}{\sim} N(0, 1), \qquad d(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \overset{H_0}{\sim} N(0, 1). \tag{54}$$

Putting the $(1-\alpha)$ CI side-by-side with the $(1-\alpha)$ acceptance region:

$$\begin{aligned} P\left(\bar{X}_n - c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right) \leq \mu \leq \bar{X}_n + c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right); \mu = \mu^*\right) &= 1 - \alpha, \\ P\left(\mu_0 - c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right) \leq \bar{X}_n \leq \mu_0 + c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right); \mu = \mu_0\right) &= 1 - \alpha, \end{aligned} \tag{55}$$

it is clear that their inferential claims are crucially different:

[a] The CI claims that with probability $(1-\alpha)$ the random bounds $[\bar{X}_n \pm c_{\alpha/2}(\tfrac{\sigma}{\sqrt{n}})]$ will cover (overlay) the true $\mu^*$ [whatever that happens to be], under the TSN.
[b] The test claims that with probability $(1-\alpha)$ test $T_\alpha$ would yield a result that accords equally well or better with $H_0$ ($x: |d(x; \mu_0)| \leq c_{\alpha/2}$), if $H_0$ were true.

Hence, their claims pertain to the parameter vs. the sample space, and the evaluation under the TSN ($\mu = \mu^*$) vs. under the null ($\mu = \mu_0$) renders the two statements inferentially very different. The idea that a given (known) $\mu_0$ and the true $\mu^*$ are interchangeable is an illusion. This can be seen most clearly in the simple Bernoulli case, where $\theta$ needs to be estimated using $\hat{\theta}_n = \tfrac{1}{n}\sum_{k=1}^{n} X_k := \bar{X}_n$ for the UMAU $(1-\alpha)$ CI:

$$P\left(\bar{X}_n - c_{\alpha/2}\left(\tfrac{\sqrt{\bar{X}_n(1-\bar{X}_n)}}{\sqrt{n}}\right) \leq \theta \leq \bar{X}_n + c_{\alpha/2}\left(\tfrac{\sqrt{\bar{X}_n(1-\bar{X}_n)}}{\sqrt{n}}\right); \theta = \theta^*\right) = 1 - \alpha,$$
49. 49. but the acceptance region of the $\alpha$-UMPU test is defined in terms of the hypothesized value $\theta_0$:

$$P\left(\theta_0 - c_{\alpha/2}\left(\tfrac{\sqrt{\theta_0(1-\theta_0)}}{\sqrt{n}}\right) \leq \bar{X}_n \leq \theta_0 + c_{\alpha/2}\left(\tfrac{\sqrt{\theta_0(1-\theta_0)}}{\sqrt{n}}\right); \theta = \theta_0\right) = 1 - \alpha.$$

The fact is that there is only one $\theta^*$ and an infinity of values $\theta_0$. As a result, the HT hypothetical reasoning is equally well-defined pre-data and post-data. In contrast, there is only one TSN, which has already played out post-data, rendering the coverage error probability degenerate, i.e. the observed CI:

$$\left[\bar{x}_n - c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right),\ \bar{x}_n + c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right)\right] \tag{56}$$

either includes or excludes the true value $\mu^*$.

Observed Confidence Intervals and Severity

In addition to addressing the fallacies of acceptance and rejection, the post-data severity evaluation can be used to address the issue of degenerate post-data error probabilities and the inability to distinguish between different values of $\mu$ within an observed CI; Mayo and Spanos (2006). This is accomplished by making two key changes:

(a) Replace the factual reasoning underlying CIs, which becomes degenerate post-data, with the hypothetical reasoning underlying the post-data severity evaluation.
(b) Replace any claims pertaining to overlaying the true $\mu^*$ with severity-based inferential claims of the form:

$$\mu > \mu_1 = \mu_0 + \gamma, \quad \mu \leq \mu_1 = \mu_0 + \gamma, \quad \text{for some } \gamma \geq 0. \tag{57}$$

This can be achieved by relating the observed bounds:

$$\bar{x}_n \pm c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right) \tag{58}$$

to particular values of $\mu_1$ associated with (57), say $\mu_1 = \bar{x}_n - c_{\alpha/2}(\tfrac{\sigma}{\sqrt{n}})$, and evaluating the post-data severity of the claim:

$$\mu > \mu_1 = \bar{x}_n - c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right).$$

A moment's reflection, however, suggests that the connection between establishing the warranted discrepancy $\gamma$ from the null value $\mu_0$ and the observed CI (58) is more apparent than real, because the severity evaluation probability is attached to the inferential claim (57), and not to the particular value $\mu_1$. Hence, at best one would be creating
50. 50. the illusion that the severity assessment on the boundary of a 1-sided observed CI is related to the confidence level; it is not! The truth of the matter is that the inferential claim (58) does not pertain directly to $\mu^*$, but to the warranted discrepancy $\gamma$ in light of $x_0$.

Fallacious arguments for using Confidence Intervals

A. Confidence Intervals are more reliable than p-values

It is often argued in the social science statistical literature that CIs are more reliable than p-values because the latter are vulnerable to the large $n$ problem; as the sample size $n \to \infty$ the p-value becomes smaller and smaller. Hence, one would reject any null hypothesis given enough observations. What these critics do not seem to realize is that a CI like:

$$P\left(\bar{X}_n - c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right) \leq \mu \leq \bar{X}_n + c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right); \mu = \mu^*\right) = 1 - \alpha$$

is equally vulnerable to the large $n$ problem, because the length of the interval:

$$\left[\bar{X}_n + c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right)\right] - \left[\bar{X}_n - c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right)\right] = 2 c_{\alpha/2}\left(\tfrac{\sigma}{\sqrt{n}}\right)$$

shrinks to zero as $n \to \infty$.

B. The middle of a Confidence Interval is more probable?

Despite the obvious fact that post-data the probability of coverage is either zero or one, there have been numerous recurring attempts in the literature to discriminate among the different values of $\mu$ within an observed interval:

“... the difference between population means is much more likely to be near the middle of the confidence interval than towards the extremes.” (Gardner and Altman, 2000, p. 22)

This claim is clearly fallacious. A moment's reflection suggests that viewing all possible observed CIs (56) as realizations from a single distribution in (54) leads one to the conclusion that, unless the particular $\bar{x}_n$ happens (by accident) to coincide with the true value $\mu^*$, values to the right or the left of $\bar{x}_n$ will be more likely, depending on whether $\bar{x}_n$ is to the left or the right of $\mu^*$. Given that $\mu^*$ is unknown, however, such an evaluation cannot be made in practice with a single realization.
This is illustrated below with a CI for $\mu$ in the case of the simple (one parameter) Normal model.
51. 51. [Figure: 20 observed confidence intervals, numbered 1 to 20, each drawn as a bracketed segment around its own $\bar{x}_n$ and set against the position of the true value $\mu^*$.]
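A picture of this kind can be regenerated by simulation. The sketch below is an illustration only: it assumes the simple Normal model with known $\sigma$, and the values $\mu^* = 0$, $\sigma = 1$, $n = 25$, $\alpha = .05$, the seed, and the function name `simulate_cis` are all hypothetical choices, not taken from the slides. Each observed interval either covers $\mu^*$ or it does not, and every interval has the same length $2 c_{\alpha/2}(\sigma/\sqrt{n})$, which shrinks as $n$ grows.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_cis(mu_star, sigma, n, alpha, replications, seed):
    """Draw `replications` samples of size n and return the observed CIs.

    Each entry is (lower, upper, covers), where `covers` records whether
    the observed interval includes the true value mu_star.
    """
    rng = random.Random(seed)
    c = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = c * sigma / sqrt(n)
    intervals = []
    for _ in range(replications):
        xbar = fmean(rng.gauss(mu_star, sigma) for _ in range(n))
        lower, upper = xbar - half_width, xbar + half_width
        intervals.append((lower, upper, lower <= mu_star <= upper))
    return intervals


cis = simulate_cis(mu_star=0.0, sigma=1.0, n=25, alpha=0.05,
                   replications=20, seed=1)
for i, (lower, upper, covers) in enumerate(cis, start=1):
    mark = "covers mu*" if covers else "misses mu*"
    print(f"{i:2d}. [{lower:+.3f}, {upper:+.3f}]  {mark}")
```

With a single observed interval, the sign of $\bar{x}_n - \mu^*$ is unknown, so the simulation output cannot be used to rank values inside any one interval; it only displays the long-run behavior that the coverage probability refers to.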