ABSTRACT: Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties. The philosophical presuppositions behind the meta-research battles remain largely hidden. Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science. Efforts of replication researchers and open science advocates are diminished when so much attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance, no evidence against isn’t evidence for), when erroneous understanding of basic statistical terms goes uncorrected, and when bandwagon effects lead to popular reforms that downplay the importance of error probability control. These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
1. The Statistics Wars: Errors and Casualties
Deborah G. Mayo
Dept of Philosophy, Virginia Tech
Conference: Statistical Reasoning and Scientific Error
MuST Conference in Philosophy of Science
July 1, 2019
3. Role of probability: performance or probabilism? (Frequentist vs. Bayesian)
• End of foundations? (Unifications, peace treaties, and eclecticism)
• Long-standing battles still simmer below the surface (agreement on numbers)
4. Statistical inference as severe testing
• Brush the dust off pivotal debates in relation to today’s statistical crisis in science
5.
• We set sail with a simple tool: a minimal requirement for evidence
• Sufficiently general to apply to any methods now in use
• Excavation tool: you needn’t accept this philosophy to use it to get beyond today’s statistical wars and appraise reforms
6. Statistical reforms
• Several are welcome: preregistration, avoidance of cookbook statistics, calls for more replication research
• Others are quite radical, and even violate our minimal principle of evidence
• These are the errors and casualties I refer to
• Combating paradoxical, self-defeating “reforms” and unthinking bandwagon effects requires taking on very strong ideological leaders
7. In SIST: getting beyond the stat wars requires chutzpah
“You will need to critically evaluate … brilliant leaders, high priests, maybe even royalty. Are they asking the most unbiased questions in examining methods, or are they like admen touting their brand, dragging out howlers to make their favorite method look good? (I am not sparing any of the statistical tribes here.)” (p. 12)
• Statistics wars as proxy wars
8. The most-used tools are the most criticized
“Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. …” (Ioannidis 2005, 696)
Do researchers do that?
9. R. A. Fisher
“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14)
10. Simple significance tests (Fisher)
“p-value. … to test the conformity of the particular data under analysis with H0 in some respect:
… we find a function T = t(y) of the data, the test statistic, such that
• the larger the value of T the more inconsistent are the data with H0;
• T = t(Y) has a known probability distribution when H0 is true.
… the p-value corresponding to any t_obs as
p = p(t) = Pr(T ≥ t_obs; H0)”
(Mayo and Cox 2006, 81)
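To make the definition concrete, here is a minimal sketch in Python (the data, known σ, and null value μ0 are all hypothetical choices for illustration, not from the slides):

```python
import numpy as np
from scipy import stats

# Hypothetical data assumed N(mu, sigma^2) with sigma known
y = np.array([0.43, 0.17, 0.61, 0.38, 0.52, 0.29, 0.44, 0.35, 0.48, 0.41])
sigma = 0.2   # known standard deviation (an assumption of this sketch)
mu0 = 0.3     # H0: mu = mu0, larger T indicating a positive departure

# Test statistic T = t(y): the standardized sample mean, ~ N(0,1) under H0
t_obs = np.sqrt(len(y)) * (y.mean() - mu0) / sigma

# p = Pr(T >= t_obs; H0), computed from the known null distribution of T
p = 1 - stats.norm.cdf(t_obs)
print(f"t_obs = {t_obs:.2f}, p = {p:.4f}")
```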
11. Testing reasoning
• If even larger differences than t_obs occur fairly frequently under H0 (i.e., the P-value is not small), there’s scarcely evidence of incompatibility with H0
• A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than t_obs were H0 true
• This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy: H ⇒ H*
12. Don’t let the tail wag the dog
“It would be a mistake to allow the tail to wag the dog by being overly influenced by flawed statistical inferences” (Cook et al. 2019)
In response to the suggestion that the concept of statistical significance be dropped (Amrhein et al. 2019)
13. Neyman-Pearson (N-P) tests
A null and an alternative hypothesis H0, H1 that are exhaustive*:
H0: μ ≤ 0 vs. H1: μ > 0
“no effect” vs. “some positive effect”
• So this fallacy of rejection (H1 ⇒ H*) is blocked
• Rejecting H0 only indicates the statistical alternative H1 (how discrepant from the null)
*(introduces Type II error, and power)
14. Casualties from (pathological) Fisher vs. N-P battles
• Neyman and Pearson (N-P) give a rationale for Fisherian tests
• All was hunky-dory until Fisher and Neyman began feuding (from 1935), largely due to professional and personality disputes
• A huge philosophical difference is read into their in-fighting
15. Get beyond the “inconsistent hybrid” charge
• Major casualties from the “inconsistent hybrid”: Fisher–inferential; N-P–long-run performance
• Fisherians can’t use power
• N-P testers must adhere to fixed error probabilities (P <); can’t report attained P-values (P =)
16. Erich Lehmann (Neyman’s student)
“It is good practice to determine … the smallest significance level … at which the hypothesis would be rejected for the given observation. … the P-value gives an idea of how strongly the data contradict the hypothesis [and] enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, 63–4)
17. Error Statistics
• Fisher and N-P both fall under tools for “appraising and bounding the probabilities (under respective hypotheses) of seriously misleading interpretations of data” (Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests, resampling, randomization
18. Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking, significance
seeking, multiple testing, post-data subgroups,
trying and trying again—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
• Violates severity
18
19. Severity Requirement
If the test had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H
• Such a test fails a minimal requirement for a stringent or severe test
• N-P and Fisher did not put it in these terms, but our severe tester does
20. Requires a third role for probability
Probabilism. To assign a degree of probability, confirmation, support, or belief in a hypothesis, given data x0 (absolute or comparative) (e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. To ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher (at times))
Only probabilism is thought to be inferential or evidential
21. What happened to using probability to assess error-probing capacity?
• Neither “probabilism” nor “performance” directly captures assessing error-probing capacity
• Good long-run performance is a necessary, not a sufficient, condition for severity
22. Key to solving a key problem for frequentists
• Why is good performance relevant for inference in the case at hand?
• What bothers you about selective reporting, cherry picking, stopping when the data look good, and P-hacking?
• These are not problems about long runs—
23. We cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data
24. A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)
• Performance: unless it stems from a method with a low long-run error rate
• Probativism (severe testing): unless something (a fair amount) has been done to probe the ways we can be wrong about C
25. A severe test: my weight
Informal example: To test if I’ve gained weight between the time I left for Germany and my return, I use a series of well-calibrated and stable scales, both before leaving and upon my return.
All show an over-4 lb gain, and none shows a difference in weighing a known object (EGEK), so I infer:
H: I’ve gained at least 4 pounds
26.
• Properties of the scales are akin to the properties of statistical tests (performance).
• No one claims the justification is merely about the long run, as if it could say nothing about my weight.
• We infer something about the source of the readings from the high capability to reveal if any scales were wrong.
27. The severe tester is assumed to be in a context of wanting to find things out
• I could insist all the scales are wrong—they work fine with weighing known objects—but this would prevent correctly finding out about weight
• What sort of extraordinary circumstance could cause them all to go astray just when we don’t know the weight of the test object?
• Argument from coincidence: it goes beyond being highly improbable
28. Battles about roles of probability trace to philosophies of inference
“According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h.” (Alan Musgrave 1974, p. 2)
29. Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as I’m using the term), the import of the data is via the ratios of likelihoods of hypotheses:
Pr(x0; H0)/Pr(x0; H1)
The data x0 are fixed, while the hypotheses vary
30. Comparative logic of support
• Ian Hacking (1965), “Law of Likelihood”: data x support hypothesis H0 less well than H1 if Pr(x; H0) < Pr(x; H1) (he rejects it in 1980)
• Any hypothesis that perfectly fits the data is maximally likely (even if data-dredged)
• “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129)
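A minimal numerical sketch of Barnard’s point (made-up binomial data; the counts are illustrative assumptions): the “just so” hypothesis that sets the parameter equal to the observed proportion is always at least as likely as H0, no matter what the data are.

```python
from scipy import stats

n, k = 100, 60                 # hypothetical data: 60 successes in 100 trials
lik = lambda p: stats.binom.pmf(k, n, p)

print(lik(0.5))                # likelihood of H0: p = 0.5
print(lik(k / n))              # the data-dredged rival p = 0.6 fits perfectly
print(lik(k / n) / lik(0.5))   # likelihood ratio always favors the rival
```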
31. Error probabilities
• Pr(H0 is less well supported than H1; H0) is high for some H1 or other
“In order to fix a limit between ‘small’ and ‘large’ values of [the likelihood ratio] we must know how often such values appear when we deal with a true hypothesis.” (Pearson and Neyman 1967, 106)
32. Hunting for significance (nominal vs. actual)
“Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ … The actual level of significance is not 5 percent, but 64 percent!” (Selvin 1970, 104)
(Morrison & Henkel’s The Significance Test Controversy, 1970!)
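Selvin’s 64 percent is just the probability that at least one of twenty tests of true nulls reaches nominal 0.05 significance; a minimal sketch (assuming the twenty comparisons are independent):

```python
# Actual (overall) significance level when hunting through 20 independent
# tests of true null hypotheses at nominal level 0.05:
alpha, n_tests = 0.05, 20
actual = 1 - (1 - alpha) ** n_tests
print(f"actual level = {actual:.2f}")   # 0.64, as in Selvin's quote
```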
33. Spurious P-value
The hunter reports: such results would be difficult to achieve under the assumption of H0
When in fact such results are easy to get under the assumption of H0
• There are many more ways to be wrong with hunting (a different sample space)
• Need to adjust P-values, or at least report the multiple testing
34. Some accounts of evidence object
“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or … data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value …
But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010)
(Co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford)
35. On the LP, error probabilities appeal to something irrelevant
“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley 1971, 436)
36. Casualty: error control is lost
“[I]f the sampling plan is ignored, the researcher is able to always reject the null hypothesis, even if it is true. This example is sometimes used to argue that any statistical framework should somehow take the sampling plan into account. … This feeling is, however, contradicted by a mathematical analysis.” (E-J Wagenmakers 2007, 785)
But the “proof” assumes the likelihood principle (LP), by which error probabilities drop out (Edwards, Lindman, and Savage 1963)
A mathematical ‘principle of rationality’ (LP) gets weight over what appears to be common sense
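A simulation makes the casualty concrete: under optional stopping (“try and try again”), the probability that a nominal 0.05 test rejects at some point climbs far above 0.05 even though H0 is true. A minimal sketch (the starting sample size, the cap of 1000 observations, and the number of trials are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def ever_rejects(max_n=1000, alpha=0.05):
    """Draw N(0,1) data (so H0: mu = 0 is true), run a two-sided z-test
    after each new observation, and stop as soon as it rejects."""
    x = rng.standard_normal(max_n)
    for n in range(10, max_n + 1):
        z = np.sqrt(n) * x[:n].mean()            # sigma = 1 known
        if 2 * (1 - stats.norm.cdf(abs(z))) < alpha:
            return True
    return False

trials = 2000
rate = sum(ever_rejects() for _ in range(trials)) / trials
print(f"Pr(reject at some point; H0) = {rate:.2f}")  # well above 0.05
```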
37. At odds with a key way to advance replication: the 21-word solution
“We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study” (Simmons, Nelson, and Simonsohn 2012, 4)
• Replication researchers find flexibility with data-dredging and stopping rules a major source of failed replication (the “forking paths”, Gelman and Loken 2014)
38. Many “reforms” offered as alternatives to significance tests follow the LP
• “Bayes factors can be used in the complete absence of a sampling plan …” (Bayarri, Benjamin, Berger, and Sellke 2016, 100)
• “It seems very strange that a frequentist could not analyze a given set of data … if the stopping rule is not given. … Data should be able to speak for itself.” (Berger and Wolpert 1988, 78; authors of The Likelihood Principle)
• No wonder reformers talk past each other
39. Tension within replication reforms
American Statistical Association 2016 (statement on P-values):
• Principle 4: “P-values and related analyses should not be reported selectively. … Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis.”
• Other approaches: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches … such as likelihood ratios or Bayes Factors” (p. 132)
40. Replication Paradox
• Test critic: It’s too easy to satisfy standard significance thresholds
• You: Why, then, do replicationists find it so hard to achieve significance thresholds (with preregistration)?
• Test critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs)
• You: So the replication researchers want methods that pick up on, adjust for, and block these biasing selection effects
• Test critic: Actually the “reforms” recommend methods where the need to alter P-values due to data dredging vanishes
41. Can probabilists still block intuitively unwarranted inferences (without error probabilities)?
• Supplement with subjective beliefs: “What do I believe?” as opposed to “What is the evidence?” (Royall)
• Likelihoodists + prior probabilities
42. Problems
• An additional source of flexibility: priors as well as biasing selection effects
• Doesn’t show what researchers had done wrong—it becomes a battle of beliefs
• The believability of data-dredged hypotheses is what makes them so seductive
43. Peace treaty (J. Berger 2003, 2006): use “default” priors: unification
• Default priors are supposed to prevent prior beliefs from influencing the posteriors–data dominant
• “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities …” (Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-subjective priors (no “uninformative” priors)
44. Casualties: criticisms of data-dredgers lose force
Wanting to promote an account that downplays error probabilities, Bayesian critics turn to other means: give H0 (no effect) a high prior probability in a Bayesian analysis
• The researcher deserving criticism deflects this, saying: you can always counter an effect this way (the defense violates our minimal principle of evidence)
45. Bem’s “Feeling the Future” (2011): ESP?
• Daryl Bem (2011): subjects do better than chance at predicting the (erotic) picture to be shown in the future
• Some locate the start of the replication crisis with Bem
• Bem admits to data dredging
• Bayesian critics resort to a default Bayesian prior on the (point) null hypothesis
46. Bem’s response
“Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is [here], it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative.
This is known as the Lindley-Jeffreys paradox*: A frequentist [can always] be contradicted by a … Bayesian analysis that concludes that the same data are more likely under the null.” (Bem et al. 2011, 717)
*the Bayes-Fisher disagreement
47. Key to today’s statistics wars: P-values vs. posteriors
The posterior probability Pr(H0|x) can be large while the P-value is small
To a Bayesian, this shows P-values exaggerate the evidence against the null
Significance testers object to highly significant results being interpreted as no evidence against the null–or even evidence for it! (High Type II error)
48. Bayes (Jeffreys)/Fisher disagreement (“spike and smear”)
• The “P-values exaggerate” arguments refer to testing a point null hypothesis, with a lump of prior probability given to H0 (or a tiny region around 0):
Xi ~ N(μ, σ²), H0: μ = 0 vs. H1: μ ≠ 0
• With the rest appropriately spread over the alternative, an α-significant result can correspond to Pr(H0|x) = (1 – α)! (e.g., 0.95)
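A minimal sketch of the arithmetic (σ = 1 known, prior mass 0.5 on the point null, smear taken as μ ~ N(0, τ²) with τ = 1—all illustrative assumptions): the same just-significant z = 1.96 yields a posterior on the null that grows with n.

```python
import numpy as np
from scipy import stats

def posterior_null(z, n, tau=1.0, prior_null=0.5):
    """Spike-and-smear posterior Pr(H0 | z): prior mass on H0: mu = 0,
    the rest smeared as mu ~ N(0, tau^2); sigma = 1 known."""
    m0 = stats.norm.pdf(z)                                  # marginal under H0
    m1 = stats.norm.pdf(z, scale=np.sqrt(1 + n * tau**2))   # marginal under H1
    return prior_null * m0 / (prior_null * m0 + (1 - prior_null) * m1)

z = 1.96                      # two-sided P-value of 0.05
for n in (10, 100, 1000, 10000):
    print(n, round(posterior_null(z, n), 2))
# Pr(H0|z) rises from about 0.37 to about 0.94 as n grows: the same
# 0.05-significant result becomes strong evidence *for* the point null
```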
49.
“Concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (Casella and R. Berger 1987, 111), whether in 1- or 2-sided tests
Yet “spike and smear” is the basis for “Redefine Statistical Significance” (Benjamin et al. 2018); the opposing megateam: “Justify Your Alpha” (Lakens et al. 2018)
• Whether tests should use a lower Type I error probability is a separate question; the problem is supposing there should be agreement between quantities measuring different things
50. Whether P-values exaggerate “depends on one’s philosophy of statistics” (Greenland et al. 2016)
“… based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis … Nonetheless, many other statisticians do not accept these quantities as gold standards …” (Greenland, Senn, Rothman, Carlin, Poole, Goodman, and Altman 2016, 342)
51. Recent example of a battle based on P-values disagreeing with posteriors
• Imagine randomly selecting a hypothesis from an urn of nulls, 90% of which are true
• Consider just two possibilities: H0: no effect vs. H1: meaningful effect, all else ignored
• Take the prevalence of 90% as Pr(H0) = 0.9, Pr(H1) = 0.1
• Reject H0 with a single (just) 0.05-significant result, with cherry-picking and selection effects
Then it can be shown that most “findings” are false
52. Diagnostic Screening (DS) model of tests
• Pr(H0 | Test T rejects H0) > 0.5
really: the prevalence of true nulls among those rejected at the 0.05 level is > 0.5. Call this the false finding rate (FFR)
• Pr(Test T rejects H0 | H0) = 0.05
Criticism: the N-P Type I error probability ≠ FFR (Ioannidis 2005, Colquhoun 2014)
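The arithmetic behind “most findings are false” is Bayes’ theorem applied to the urn of nulls; a minimal sketch (the power of 0.8 is an illustrative assumption, and the inflated α mimics the selection effects of slide 32):

```python
def false_finding_rate(prev_null=0.9, alpha=0.05, power=0.8):
    """Pr(H0 | Test T rejects H0) in the diagnostic-screening model."""
    false_pos = prev_null * alpha          # true nulls that get rejected
    true_pos = (1 - prev_null) * power     # true effects that get rejected
    return false_pos / (false_pos + true_pos)

print(false_finding_rate())             # 0.36 even at a clean alpha = 0.05
print(false_finding_rate(alpha=0.64))   # 0.88 once hunting makes alpha 0.64
```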
54. But there are major casualties
• Pr(H0 | Test T rejects H0) is not a Type I error probability: it transposes the conditional
• It combines crude performance with a probabilist assignment (and what’s Prev(H1)?)
• Fallacy of probabilistic instantiation: even if the prevalence of true effects in the urn is 0.1, it does not follow, for a frequentist, that a specific hypothesis gets a probability of 0.1 of being true
55. Some Bayesians reject probabilism (Gelman: falsificationist Bayesian; Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data.” (Gelman and Shalizi 2013, 10, 20)
Falsify (statistically) with small P-values
56. Can’t also claim:
“No p-value can reveal the plausibility, presence, [or] truth of an association or effect” (Wasserstein et al. 2019, 2)
I recommend “No p-value ‘by itself’ …”
Granted, P-values don’t give effect sizes
But who says we only report P-values and stop?
57. Frequentist family feud: the Confidence Interval (CI) “Crusade”
Leaders of the “New Statistics” (in psychology) often wage a proxy war for replacing tests with CIs.
58.
• The “CIs only” battlers encourage supplementing tests with CIs, which is good; but there have been casualties
• They promote the view that the only alternative to CIs is the abusive, cookbook variation on Fisherian significance tests
• But CIs are inversions of tests: a 95% CI contains the parameter values you can’t reject at the 5% level
• The same man who co-developed N-P tests developed CIs (1930): Neyman
59. Duality of tests and CIs (estimating μ in a Normal distribution)
μ > M0 – 1.96σ/√n (CI-lower)
μ < M0 + 1.96σ/√n (CI-upper)
M0: the observed sample mean
CI-lower: the value of μ that M0 is statistically significantly greater than at P = 0.025
CI-upper: the value of μ that M0 is statistically significantly lower than at P = 0.025
You could get a CI by asking for these values, and learn indicated effect sizes with tests
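A minimal sketch of the duality (σ, n, and the observed mean M0 are made-up numbers): the 95% CI bounds are exactly the μ values for which M0 is just significantly greater (or lower) at P = 0.025.

```python
import numpy as np
from scipy import stats

sigma, n, M0 = 1.0, 25, 0.4     # hypothetical known sigma, sample size, mean
se = sigma / np.sqrt(n)

lower, upper = M0 - 1.96 * se, M0 + 1.96 * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")

# Duality with tests: Pr(M >= M0; mu = CI-lower) = 0.025 and
#                     Pr(M <= M0; mu = CI-upper) = 0.025
print(1 - stats.norm.cdf(M0, loc=lower, scale=se))   # 0.025
print(stats.norm.cdf(M0, loc=upper, scale=se))       # 0.025
```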
60. CIs (as standardly used) inherit problems of N-P tests
• Too dichotomous: in/out
• Justified in terms of long-run coverage
• All members of the CI treated on a par
• Fixed confidence levels (need several benchmarks)
61. The severe tester reformulates tests with a discrepancy γ from H0
Severity function: SEV(Test T, data x, claim C)
Instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are or are not warranted
62. To avoid fallacies of rejection (e.g., magnitude errors)
If you very probably would have observed a more impressive (smaller) P-value if μ = μ1 (μ1 = μ0 + γ), the data are poor evidence that μ > μ1.
To connect with power: suppose M0 just reaches statistical significance at level P, say 0.025, and compute power in relation to this cut-off.
If the power against μ1 is high, then the data are poor evidence that μ > μ1.
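A minimal sketch of the computation, borrowing the setup in the notes to slide #63 (H0: μ ≤ 0, α = 0.025, n = 25, σ_X̄ = 0.2; the observed mean M0 = 0.4 just reaches the cut-off, z = 2): SEV(μ > μ1) is the probability of a less impressive result were μ = μ1.

```python
from scipy import stats

se = 0.2     # sigma/sqrt(n), from the slide #63 example
M0 = 0.4     # observed mean, just significant at 0.025 (z = 2)

def severity_greater(mu1, m_obs=M0, se=se):
    """SEV(mu > mu1) = Pr(M < m_obs; mu = mu1)."""
    return stats.norm.cdf((m_obs - mu1) / se)

for mu1 in (0.0, 0.2, 0.4):
    print(f"SEV(mu > {mu1}) = {severity_greater(mu1):.3f}")
# 0.977, 0.841, 0.500: the just-significant result warrants mu > 0
# severely, but is poor evidence for a discrepancy as large as 0.4
```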
64. Similarly, severity tells us:
• an α-significant difference indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one that doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the alarm that goes off with burnt toast]
65. What about fallacies of non-significant results?
• They don’t warrant a 0 discrepancy (there are discrepancies the test had low probability of detecting)
• Nor are they uninformative: we can find an upper bound μ1 such that SEV(μ < μ1) is high (see the sketch below)
• It’s with negative results (P-values not small) that severity goes in the same direction as power–provided model assumptions hold
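A minimal sketch for the negative-result case, reusing the slide #63 setup (σ_X̄ = 0.2) with a hypothetical non-significant mean M0 = 0.2: SEV(μ < μ1) = Pr(M > M0; μ = μ1), giving the upper bounds the data do warrant.

```python
from scipy import stats

se = 0.2     # sigma/sqrt(n), as in the slide #63 example
M0 = 0.2     # hypothetical observed mean, not significant (z = 1)

def severity_less(mu1, m_obs=M0, se=se):
    """SEV(mu < mu1) = Pr(M > m_obs; mu = mu1)."""
    return 1 - stats.norm.cdf((m_obs - mu1) / se)

for mu1 in (0.3, 0.4, 0.6):
    print(f"SEV(mu < {mu1}) = {severity_less(mu1):.3f}")
# 0.691, 0.841, 0.977: the negative result doesn't warrant "no effect",
# but it does rule out discrepancies as large as 0.6 with high severity
```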
66. Severity applies informally: to probe links from the statistical effect to H*
• A casualty of focusing on whether the replication gets low P-values:
• Much replication research ignores the larger question: are they even measuring the phenomenon they intend?
• A failed replication is often construed as: there’s a real effect, but it’s smaller
67. OSC Reproducibility Project: Psychology, 2011–15 (Science 2015) (led by Brian Nosek, U. VA)
• Crowd-sourced: replicators chose 100 articles from three journals (2008)
68.
• One of the non-replications: cleanliness and morality. Do cleanliness primes make you less judgmental?
“Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car.”
69.
“Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Bartlett, Chronicle of Higher Education)
Is the cleanliness prime responsible?
70. Nor is there discussion of the multiple testing in the original study
• Only 1 of the 6 dilemmas in the original study showed statistically significant differences in degree of wrongness–not the dog one
• No differences on 9 different emotions (relaxed, angry, happy, sad, afraid, depressed, disgusted, upset, and confused)
• Similar studies appear in experimental philosophy: philosophers of science need to critique them
71. The statistics wars: errors & casualties
• Mounting failures of replication … give a new urgency to critically appraising proposed statistical reforms.
• While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties.
• The philosophical presuppositions … remain largely hidden.
• Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science.
72. Efforts of replication researchers and open science advocates are diminished when
• attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance, no evidence against isn’t evidence for),
• erroneous understanding of basic statistical terms goes uncorrected, and
• bandwagon effects lead to popular reforms that downplay the importance of error probability control.
These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
74. NOTES ON SLIDES
SLIDE #57. The term “confidence interval crusade” is from Hurlbert and Lombardi (2009).
SLIDE #59. We’re really involved with two one-sided tests:
“More commonly,” Cox and Hinkley (1974, p. 79) note, “large and small values of [the test statistic] indicate quite different kinds of departure … . Then it is best to regard the tests in the two directions as two different tests … one to examine the possibility” μ > μ0, the other μ < μ0.
“… look at both [p+ and p−] and select the more significant of the two, i.e., the smaller. To preserve the physical meaning of the significance level, we make a correction for selection.” (ibid. p. 106)
In the continuous case, we double the separate p-values.
SLIDE #63. This figure alludes to a specific example, but the point is general. The example is H0: μ ≤ 0 vs. H1: μ > 0, α = 0.025, n = 25, σ_X̄ = 0.2.
SLIDE #65. Note: This is akin to power analysis (first used by Neyman and popularized by Jacob Cohen), except that SEV uses the actual outcome rather than the one that just misses rejection. Focusing on Fisherian tests, power considerations are absent. Note that severity is not the same as what is called post-hoc power, where the observed difference becomes the hypothesis for which power is computed. I call that “shpower”. Another serious casualty is to warn against “post hoc power” when “shpower” is meant.
75. Jimmy Savage on the LP
“According to Bayes’ theorem, … if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ …” (Savage 1962, p. 17)
76. FEV: Frequentist Principle of Evidence, Mayo and Cox (2006); SEV: Mayo (1991), Mayo and Spanos (2006)
FEV/SEV: A small P-value indicates a discrepancy γ from H0 if, and only if, there is a high probability the test would have resulted in a larger P-value were a discrepancy as large as γ absent.
FEV/SEV: A moderate P-value indicates the absence of a discrepancy γ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy γ present.
77. Researcher degrees of freedom in the Schnall study
After the priming task, participants rated six moral dilemmas: “Dog” (eating one’s dead dog), “Trolley” (switching the tracks of a trolley to kill one workman instead of five), “Wallet” (keeping money inside a found wallet), “Plane Crash” (killing a terminally ill plane crash survivor to avoid starvation), “Resume” (putting false information on a resume), and “Kitten” (using a kitten for sexual arousal). Participants rated how wrong each action was from 0 (perfectly OK) to 9 (extremely wrong).
78. References
• Amrhein, V., Greenland, S. and McShane, B. (2019). “Comment: Retire Statistical Significance”, Nature 567: 305–7. (Online, 20 March 2019.)
• Barnard, G. (1972). “The Logic of Statistical Inference (Review of ‘The Logic of Statistical Inference’ by Ian Hacking)”, British
Journal for the Philosophy of Science 23(2), 123–32.
• Bartlett, T. (2014). “Replication Crisis in Psychology Research Turns Ugly and Odd”, The Chronicle of Higher Education (online)
June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in
Testing Hypotheses." Journal of Mathematical Psychology 72: 90-103.
• Bem, D. J. (2011). “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”, Journal of Personality and Social Psychology 100(3), 407–425.
• Bem, D. J., Utts, J., and Johnson, W. (2011). “Must Psychologists Change the Way They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4), 716–719.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2018). “Redefine Statistical Significance”, Nature Human Behaviour 2, 6–10.
• Berger, J. O. (2003). ‘Could Fisher, Jeffreys and Neyman Have Agreed on Testing?’ and Rejoinder’, Statistical Science 18(1), 1–12;
28–32.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph Series. Hayward, CA:
Institute of Mathematical Statistics.
• Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033.
• Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem”, Journal of
the American Statistical Association 82(397), 106-11.
• Colquhoun, D. (2014). ‘An Investigation of the False Discovery Rate and the Misinterpretation of P-values’, Royal Society Open
Science 1(3), 140216.
• Cook, J., Fergusson, D., Ford, I., Gonen, M., Kimmelman, J., Korn, E., and Begg, C. (2019). “There Is Still a Place for Significance Testing in Clinical Trials”, Clinical Trials 16(3): 223–4.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” Error and Inference: Recent
Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Mayo and Spanos (eds.), 276–
304. CUP.
• Edwards, W., Lindman, H., and Savage, L. (1963). “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3), 193–242.
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and Boyd.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder’” British Journal of
Mathematical and Statistical Psychology 66(1): 8–38; 76-80.
• Goodman, S. N. (1999). “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor”, Annals of Internal Medicine 130: 1005–13.
• Greenland, S., Senn, S., Rothman, K., et al. (2016). ‘Statistical Tests, P values, Confidence Intervals, and Power: A Guide to
Misinterpretations’, European Journal of Epidemiology 31(4), 337–50.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.
• Hacking, I. (1980). ‘The Theory of Probable Inference: Neyman, Peirce and Braithwaite’, in Mellor, D. (ed.), Science, Belief and
Behavior: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.
• Hurlbert, S. and Lombardi, C. (2009). ‘Final Collapse of the Neyman-Pearson Decision Theoretic Framework and Rise of the
NeoFisherian’, Annales Zoologici Fennici 46, 311–49.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS Medicine 2(8), 0696–0701.
• Lakens, D., et al. (2018). ‘Justify your Alpha’, Nature Human Behavior 2, 168–71.
• Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed. New York: Springer.
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.), Foundations of Statistical
Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago:
University of Chicago Press.
• Mayo, D. (2016). ‘Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary on Wasserstein, R.
L. and Lazar, N. A. 2016, “The ASA’s Statement on p-Values: Context, Process, and Purpose”’, The American Statistician 70(2)
(supplemental materials).
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo, J. (ed.) The
Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of
Mathematical Statistics: 247-275.
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of
Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
• Mayo, D. G., and A. Spanos (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S.
Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands:
Elsevier.
• Morrison, D. E., and R. E. Henkel, (eds.), (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De
Gruyter.
• Musgrave, A. (1974). ‘Logical versus Historical Theories of Confirmation’, BJPS 25(1), 1–23.
• Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”, Science 349(6251),
943–51.
• Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J. Neyman &
E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press).
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL: Chapman and Hall, CRC press.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
• Schnall, S., Benton, J., and Harvey, S. (2008). ‘With a Clean Conscience: Cleanliness Reduces the Severity of
Moral Judgments’, Psychological Science 19(12), 1219–22.
• Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research”, in Morrison, D. and Henkel, R. (eds.), The Significance Test Controversy, Chicago: Aldine De Gruyter, 94–106.
• Simmons, J., Nelson, L., and Simonsohn, U. (2012). “A 21 Word Solution”, Dialogue 26(2), 4–7.
• Wagenmakers, E-J., (2007). “A Practical Solution to the Pervasive Problems of P values”, Psychonomic Bulletin &
Review 14(5): 779-804.
• Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, The
American Statistician 70(2), 129–33.
• Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American
Statistician 73(S1): 1-19.
Editor's Notes
They are at once ancient and up to the minute.
They reflect disagreements on one of the deepest, oldest, philosophical questions: How do humans learn about the world despite threats of error?
At the same time, they are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences.
Should probability enter to ensure we won’t reach mistaken interpretations of data too often in the long run of experience? Or to capture degrees of belief about claims? (performance or probabilism)
The field has been marked by disagreements between competing tribes of frequentists and Bayesians that have been so contentious that everyone wants to believe we are long past them.
We now enjoy unifications and reconciliations between rival schools, it will be said, and practitioners are eclectic, prepared to use whatever method “works.”
The truth is, long-standing battles still simmer below the surface of today’s debates about scientific integrity and irreproducibility.
Reluctance to reopen wounds from old battles has allowed them to fester.
The reconciliations and unifications have been revealed to have serious problems, and there’s little agreement on which to use or how to interpret them.
As for eclecticism, it’s often not clear what is even meant by “works.”
The presumption that all we need is an agreement on numbers–never mind if they’re measuring different things–leads to statistical schizophrenia.
What’s behind the constant drum beat today that science is in crisis?
The problem is that high powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed.
We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test.
In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data.
That’s what it means to view statistical inference as severe testing.
In saying we may view statistical inference as severe testing, I’m not saying statistical inference is always about formal statistical testing.
The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction.
Regardless of the type of claim, you don’t have evidence for it if nothing has been done to have found it flawed.
With Bayes factors, the aim is to make the null hypothesis comparatively more probable than a chosen alternative.