Slides from Deborah G. Mayo’s talk at the Minnesota Center for Philosophy of Science, University of Minnesota, on the ASA 2016 statement on P-values and error statistics
1. The ASA (2016) Statement on P-values:
How to Stop Refighting
the Statistics Wars
The CLA Quantitative Methods
Collaboration Committee &
Minnesota Center for Philosophy of Science
April 8, 2016
Deborah G. Mayo
2. Brad Efron
“By and large, Statistics is a prosperous and
happy country, but it is not a completely
peaceful one. Two contending philosophical
parties, the Bayesians and the frequentists,
have been vying for supremacy over the past
two-and-a-half centuries. …Unlike most
philosophical arguments, this one has
important practical consequences. The two
philosophies represent competing visions of
how science progresses….” (2013, p. 130)
3. Today’s Practice: Eclectic
O Use of eclectic tools, little handwringing over foundations
O Bayes-frequentist unifications
O Scratch a bit below the surface and foundational problems emerge….
O Not just 2: family feuds within (Fisherian, Neyman-Pearson; tribes of Bayesians, likelihoodists)
4. Why are the statistics wars more
serious today?
O Replication crises led to programs to
restore credibility: fraud busting,
reproducibility studies
O Taskforces, journalistic reforms, and
debunking treatises
O Proposed methodological reforms: many welcome (preregistration), some quite radical
5. I was a philosophical observer at the
ASA P-value “pow wow”
6. “Don’t throw out the error control baby
with the bad statistics bathwater”
The American Statistician
7. O Statistical significance tests are a small part of a rich set of “techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033)
O These I call error statistical methods (or sampling theory)
8. One Rock in a Shifting Scene
O Birnbaum calls it the “one rock in a shifting scene” in statistical practice
O Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn’t be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data
9. Error Statistics
O Statistics: Collection, modeling, drawing
inferences from data to claims about
aspects of processes
O The inference may be in error
O It’s qualified by a claim about the
method’s capabilities to control and alert
us to erroneous interpretations (error
probabilities)
10. “p-value. …to test the conformity of the
particular data under analysis with H0 in
some respect:
…we find a function t = t(y) of the data,
to be called the test statistic, such that
• the larger the value of t the more
inconsistent are the data with H0;
• The random variable T = t(Y) has a
(numerically) known probability
distribution when H0 is true.
…the p-value corresponding to any t as
p = p(t) = P(T ≥ t; H0)”
(Mayo and Cox 2006, p. 81)
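A minimal numerical sketch of this definition (my illustration, not from the slides), for a hypothetical one-sided z-test with H0: μ = 0 and σ known; numpy/scipy are assumed:

import numpy as np
from scipy import stats

# Hypothetical data; t(y) is the standardized sample mean, so larger
# values of t are more inconsistent with H0: mu = 0 (sigma known).
y = np.array([0.8, 1.1, -0.2, 0.9, 1.4, 0.3])
sigma = 1.0
t_obs = np.sqrt(len(y)) * y.mean() / sigma

# Under H0, T = t(Y) ~ N(0, 1), a (numerically) known distribution,
# so the p-value is p(t) = P(T >= t; H0):
p_value = stats.norm.sf(t_obs)
print(t_obs, p_value)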
11. O “Clearly, if even larger differences
than t occur fairly frequently under H0
(p-value is not small), there’s scarcely
evidence of incompatibility
O But a small p-value doesn’t warrant
inferring a genuine effect H, let alone a
scientific conclusion H*–as the ASA
document correctly warns (Principle 3)”
12. A Paradox for Significance Test Critics
Critic: It’s much too easy to get a small P-
value
You: Why do they find it so difficult to
replicate the small P-values others found?
Is it easy or is it hard?
13. O R.A. Fisher: it’s easy to lie with statistics
by selective reporting (he called it the
“political principle”)
O Sufficient finagling—cherry-picking, P-
hacking, significance seeking—may
practically guarantee a researcher’s
preferred claim C gets support, even if it’s
unwarranted by evidence
O Note: rejecting a null is taken as support for some non-null claim C
14. Severity Requirement:
O If data x0 agree with a claim C, but the
test procedure had little or no capability
of finding flaws with C (even if the claim
is incorrect), then x0 provide poor
evidence for C
O Such a test fails a minimal requirement
for a stringent or severe test
O My account: severe testing based on
error statistics
15. Two main views of the role of
probability in inference (not in ASA doc)
O Probabilism. To assign a degree of
probability, confirmation, support or belief
in a hypothesis, given data x0. (e.g.,
Bayesian, likelihoodist)—with regard for
inner coherency
O Performance. Ensure long-run reliability
of methods, coverage probabilities
(frequentist, behavioristic Neyman-
Pearson)
16. What happened to using probability to assess
the error probing capacity by the severity
criterion?
O Neither “probabilism” nor “performance”
directly captures it
O Good long-run performance is a
necessary, not a sufficient, condition for
severity
O That’s why frequentist methods can be
shown to have howlers
17. O Problems with selective reporting, cherry
picking, stopping when the data look
good, P-hacking, are not problems about
long-runs—
O It’s that we cannot say the case at hand
has done a good job of avoiding the
sources of misinterpreting data
18. A claim C is not warranted _______
O Probabilism: unless C is true or probable
(gets a probability boost, is made
comparatively firmer)
O Performance: unless it stems from a
method with low long-run error
O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
19. O If you assume probabilism is required for
inference, error probabilities are relevant
for inference only by misinterpretation
False!
O I claim, error probabilities play a crucial
role in appraising well-testedness
O It’s crucial to be able to say, C is highly
believable or plausible but poorly tested
O Probabilists can allow for the distinct task
of severe testing (you may not have to
entirely take sides in the stat wars)
20. The ASA doc gives no sense of
different tools for different jobs
O To use an eclectic toolbox in statistics, it’s important not to expect agreement on numbers from methods evaluating different things
O A p-value isn’t ‘invalid’ because it does
not supply “the probability of the null
hypothesis, given the finding” (the
posterior probability of H0) (Trafimow and
Marks*, 2015)
*Editors of a journal, Basic and Applied Social
Psychology
21. O ASA Principle 2 says a p-value ≠
posterior but one doesn’t get the sense of
its role in error probability control
O It’s not that I’m keen to defend many
common uses of significance tests
O The criticisms are often based on
misunderstandings; consequently so are
many “reforms”
22. Biasing selection effects:
O One function of severity is to identify
problematic selection effects (not all are)
O Biasing selection effects: when data or
hypotheses are selected or generated (or
a test criterion is specified), in such a way
that the minimal severity requirement is
violated, seriously altered or incapable of
being assessed
O Picking up on these alterations is
precisely what enables error statistics to
be self-correcting—
23. Nominal vs actual significance levels
The ASA correctly warns that “[c]onducting
multiple analyses of the data and reporting
only those with certain p-values” leads to
spurious p-values (Principle 4)
Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at
the 5 percent level.’ ….The actual level of
significance is not 5 percent, but 64
percent! (Selvin, 1970, p. 104)
(From Morrison and Henkel’s The Significance Test Controversy, 1970!)
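Selvin’s 64 percent is just the probability of at least one nominally significant result among twenty tests, assuming independent tests and all null hypotheses true:

Pr(at least one of 20 differences significant at the 0.05 level; all H0 true)
= 1 − (0.95)^20 ≈ 0.64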
24. O They were clear on the fallacy: blurring
the “computed” or “nominal” significance
level, and the “actual” level
O There are many more ways you can be
wrong with hunting (different sample
space)
O Here’s a case where a p-value report is
invalid
25. You report: Such results would be difficult
to achieve under the assumption of H0
When in fact such results are common
under the assumption of H0
(Formally):
O You report: Pr(P-value ≤ Pobs; H0) ≈ Pobs, i.e., small
O But in fact Pr(P-value ≤ Pobs; H0) is high
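A minimal simulation sketch of that contrast (my illustration: twenty independent one-sided z-tests per “study”, all nulls true, only the smallest p-value reported; numpy/scipy assumed):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_tests = 100_000, 20

# All nulls true: each test statistic ~ N(0, 1); the researcher hunts
# through 20 tests and reports only the smallest p-value.
t = rng.standard_normal((n_studies, n_tests))
p_min = stats.norm.sf(t).min(axis=1)

# Reported: Pr(P-value <= 0.05; H0) = 0.05.  Actual, after hunting:
print((p_min <= 0.05).mean())   # ~0.64, i.e., high, not small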
26. O Nowadays, we’re likely to see the tests
blamed
O My view: Tests don’t kill inference, people
do
O Even worse are those statistical accounts
where the abuse vanishes!
27. On some views, taking account of biasing
selection effects “defies scientific sense”
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, as
they are more commonly called, data dredging
and peeking at the data. The frequentist solution
to both problems involves adjusting the P-
value…But adjusting the measure of evidence
because of considerations that have nothing
to do with the data defies scientific sense,
belies the claim of ‘objectivity’ that is often made
for the P-value” (Goodman 1999, p. 1010)
(To his credit, he’s open about this; heads the
Meta-Research Innovation Center at Stanford)
28. Technical activism isn’t free of philosophy
Ben Goldacre (of Bad Science), in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new “technical activism”:
The editors at Annals of Internal
Medicine,… repeatedly (but confusedly)
argue that it is acceptable to identify
“prespecified outcomes” [from results]
produced after a trial began; ….they say
that their expertise allows them to
permit — and even solicit —
undeclared outcome-switching
29. His paper: “Make journals report clinical
trials properly”
O He shouldn’t close his eyes to the
possibility that some of the pushback he’s
seeing has a basis in statistical
philosophy!
30. Likelihood Principle (LP)
The vanishing act links to a pivotal
disagreement in the philosophy of statistics
battles
In probabilisms, the import of the data is via
the ratios of likelihoods of hypotheses
P(x0;H1)/P(x0;H0)
The data x0 are fixed, while the hypotheses
vary
31. Jimmy Savage on the LP:
O “According to Bayes' theorem,…. if y is
the datum of some other experiment, and
if it happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have
exactly the same thing to say about the
values of µ…” (Savage 1962, p. 17)
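A classic way to see what this implies (my illustration, the standard binomial versus negative binomial example, not from the slides; scipy assumed): 9 successes and 3 failures yield likelihoods proportional in θ whether n = 12 was fixed in advance or sampling stopped at the third failure, so by the LP they say the same thing about θ, yet the p-values for H0: θ = 0.5 differ because the sample spaces differ:

from scipy import stats

# 9 successes, 3 failures; H0: theta = 0.5.
# Fixed n = 12: X = # successes ~ Binomial(12, theta).
p_fixed_n = stats.binom.sf(8, 12, 0.5)    # P(X >= 9; H0) ~ 0.073

# Stop at the 3rd failure: Y = # successes before the 3rd failure,
# which is nbinom(3, 0.5) in scipy's parameterization.
p_stopping = stats.nbinom.sf(8, 3, 0.5)   # P(Y >= 9; H0) ~ 0.033

print(p_fixed_n, p_stopping)  # proportional likelihoods, different p-values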
32. All error probabilities violate the LP
(even without selection effects):
Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space (Lindley 1971, p. 436)
The LP implies…the irrelevance of
predesignation, of whether a hypothesis
was thought of before hand or was
introduced to explain known effects
(Rosenkrantz, 1977, p. 122)
33. Paradox of Optional Stopping:
Error probing capacities are altered not just
by cherry picking and data dredging, but
also via data dependent stopping rules:
Xi ~ N(μ, σ²), two-sided H0: μ = 0 vs. H1: μ ≠ 0.
Instead of fixing the sample size n in
advance, in some tests, n is determined by
a stopping rule:
34. “Trying and trying again”
O Keep sampling until H0 is rejected at the 0.05 level,
i.e., keep sampling until |M| ≥ 1.96 σ/√n
O Trying and trying again: having failed to rack up a 1.96 σ/√n difference after 10 trials, go to 20, 30 and so on until obtaining a 1.96 σ/√n difference
35. Nominal vs. Actual
significance levels again:
O With n fixed, the Type I error probability is 0.05
O With this stopping rule, the actual significance level differs from, and will be greater than, 0.05
O Violates Cox and Hinkley’s (1974) “weak repeated sampling principle”
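A minimal simulation sketch of the inflation (my illustration: σ = 1, a look every 10 observations up to n = 100, stopping as soon as |M| ≥ 1.96 σ/√n; numpy assumed):

import numpy as np

rng = np.random.default_rng(2)
n_sims, n_max, batch = 20_000, 100, 10
rejections = 0

for _ in range(n_sims):
    x = rng.standard_normal(n_max)           # H0 true: mu = 0, sigma = 1
    for n in range(batch, n_max + 1, batch):
        if abs(x[:n].mean()) >= 1.96 / np.sqrt(n):
            rejections += 1                  # "significant at the 0.05 level"
            break                            # stop and report

# Nominal level 0.05; the actual rejection rate is well above it,
# and it keeps climbing as n_max grows.
print(rejections / n_sims)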
36. 1959 Savage Forum
Jimmy Savage audaciously declared:
“optional stopping is no sin”
so the problem must be with significance
levels
Peter Armitage:
“thou shalt be misled”
if thou dost not know the person tried and
tried again (p. 72)
37. O The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)
O However, the same p-hacked hypothesis can occur in Bayes factors; optional stopping can be guaranteed to exclude true nulls from HPD intervals
38. With One Big Difference:
O The direct grounds to criticize inferences as flouting error statistical control is lost:
O they condition on the actual data,
O whereas error probabilities take into account other outcomes that could have occurred but did not (the sampling distribution)
39. Tension: Does Principle 4 Hold for
Other Approaches?
O “In view of the prevalent misuses of and
misconceptions concerning p-values, some
statisticians prefer to supplement or even
replace p-values with other approaches”
(They include Bayes factors, likelihood ratios,
as “alternative measures of evidence”)
O They appear to extend “full reporting and
transparency” (principle 4) to all methods.
O Some controversy: should it apply only to “p-values and related statistics”?
40. How might probabilists block intuitively
unwarranted inferences (without error
probabilities)?
A subjective Bayesian might say:
If our beliefs were mixed into the
interpretation of the evidence, we wouldn’t
declare there’s statistical evidence of some
unbelievable claim (distinguishing shades of
grey and being politically moderate,
ovulation and voting preferences)
41. Rescued by beliefs
O That could work in some cases (it still
wouldn’t show what researchers had done
wrong)—battle of beliefs
O Besides, researchers sincerely believe
their hypotheses
O So now you’ve got two sources of
flexibility, priors and biasing selection
effects
42. No help with our most important
problem
O How to distinguish the warrant for a
single hypothesis H with different methods
(e.g., one has biasing selection effects,
another, pre-registered results and
precautions)?
43. Most Bayesians are “conventional”
O Eliciting subjective priors too difficult,
scientists reluctant to allow subjective
beliefs to overshadow data
O Default, or reference, priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians; Berger 2006)
44. O A classic conundrum: no general non-
informative prior exists, so most are
conventional
O “The priors are not to be considered
expressions of uncertainty, ignorance, or
degree of belief. Conventional priors may
not even be probabilities…” (Cox and
Mayo 2010, p. 299)
O Prior probability: An undefined
mathematical construct for obtaining
posteriors (giving highest weight to data,
or satisfying invariance, or matching or….)
45. Conventional Bayesian Reforms are
touted as free of selection effects
O Jim Berger gives us “conditional error
probabilities” CEPs
O “[I]t is considerably less important to disabuse
students of the notion that a frequentist error
probability is the probability that the hypothesis
is true, given the data”, since under his new
definition “a CEP actually has that
interpretation”
O “CEPs do not depend on the stopping rule”
(“Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
2003)
46. By and large the ASA doc highlights
classic foibles
“In relation to the test of significance, we
may say that a phenomenon is
experimentally demonstrable when we
know how to conduct an experiment
which will rarely fail to give us a
statistically significant result”
(Fisher 1935, p. 14)
(“isolated” low P-value ≠> H: statistical
effect)
47. Statistical ≠> substantive (H ≠> H*)
“[A]ccording to Fisher, rejecting the null
hypothesis is not equivalent to accepting
the efficacy of the cause in question. The
latter...requires obtaining more significant
results when the experiment, or an
improvement of it, is repeated at other
laboratories or under other conditions”
(Gigerenzer et al. 1989, pp. 95-6)
48. O Flaws in alternative H* have not been probed
by the test,
O The inference from a statistically significant
result to H* fails to pass with severity
O “Merely refuting the null hypothesis is too
weak to corroborate” substantive H*, “we
have to have ‘Popperian risk’, ‘severe test’
[as in Mayo], or what philosopher Wesley
Salmon called ‘a highly improbable
coincidence’” (Meehl and Waller 2002, p. 184)
49. O Encouraged by something called NHSTs (null hypothesis significance tests) that supposedly allow moving from the statistical to the substantive
O If defined that way, they exist only as abuses of tests
O The ASA doc ignores Neyman-Pearson (N-P) tests
50. Neyman-Pearson (N-P) Tests:
Null and alternative hypotheses H0, H1 that exhaust the parameter space
O So the fallacy of rejection (H → H*) is impossible
O Rejecting the null only indicates statistical alternatives
51. P-values Don’t Report Effect Sizes
Principle 5
Who ever said to just report a P-value?
O “Tests should be accompanied by
interpretive tools that avoid the fallacies of
rejection and non-rejection. These
correctives can be articulated in either
Fisherian or Neyman-Pearson terms”
(Mayo and Cox 2006, Mayo and Spanos
2006)
52. To Avoid Inferring a Discrepancy Beyond What’s Warranted: the large n problem
O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
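To see why (hypothetical numbers: a two-sided 0.05 test of H0: μ = 0 with σ = 1), the cutoff for a just-significant sample mean M shrinks with n:

n1 = 10,000: reject when |M| ≥ 1.96/√10,000 ≈ 0.02
n2 = 100: reject when |M| ≥ 1.96/√100 ≈ 0.2

So a just-significant M with n1 is indicative of a roughly tenfold smaller discrepancy from 0 than the same α-significant result with n2.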
53. O What’s more indicative of a large effect
(fire), a fire alarm that goes off with burnt
toast or one so insensitive that it doesn’t
go off unless the house is fully ablaze?
[The larger sample size is like the one that
goes off with burnt toast]
54. What About the Fallacy of
Non-Significant Results?
O Non-Replication occurs with non-
significant results, but there’s much
confusion in interpreting them
O No point in running replication research if
you view negative results as uninformative
55. O They don’t warrant a 0 discrepancy
O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed: set upper bounds
O If you very probably would have observed a more impressive (smaller) p-value than you did, were μ > μ1 (where μ1 = μ0 + γ), then the data are good evidence that μ < μ1
O Akin to power analysis (Cohen, Neyman) but sensitive to x0
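A hypothetical worked case (my numbers: σ = 1, n = 100, so σ/√n = 0.1; observed mean M0 = 0.1, not statistically significant):

Pr(M > 0.1; μ = 0.228) = Pr(Z > -1.28) ≈ 0.9

So the inference μ < 0.228 (that is, μ < M0 + 1.28 σ/√n) passes with severity about 0.9: were μ ≥ 0.228, a larger difference than the one observed would very probably have occurred.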
56. Improves on Confidence Intervals
This is akin to confidence intervals (which are dual to tests), but we get around their shortcomings:
O We do not fix a single confidence level
O The evidential warrant for different points in any interval is distinguished
O Go beyond the “performance goal” to give an inferential construal
57. Simple Fisherian Tests Have Important
Uses
O Model validation:
George Box calls for ecumenism because, he argues, “diagnostic checks and tests of fit” “require frequentist theory significance tests for their formal justification” (Box 1983, p. 57)
58. “What we are advocating, … is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data, rather than entering into a contest with some alternative model” (Gelman and Shalizi 2013, p. 20)
O Fraudbusting and forensics: finding data too good to be true (Simonsohn 2013)
59. Concluding remarks: Reforms without
Philosophy of Statistics are Blind
O I end my commentary: “Failing to understand the correct (if limited) role of simple significance tests threatens to throw the error control baby out with the bad statistics bathwater”
O Avoid refighting the same wars, or banning methods based on cookbook abuses long lampooned
O It makes no sense to banish tools for testing assumptions that the other methods require and cannot themselves perform
60. O Don’t expect an agreement on numbers from methods evaluating different things
O Recognize different roles of probability: probabilism, long-run performance, probativism (severe testing)
O Probabilisms may enable rather than block illicit inferences due to biasing selection effects
O Main paradox of the “replication crisis”
61. Paradox of Replication
O Critic: It’s too easy to satisfy standard significance
thresholds
O You: Why do replicationists find it so hard to achieve
them with preregistered trials?
O Critic: Most likely the initial studies were guilty of p-
hacking, cherry-picking, significance seeking, QRPs
O You: So, replication researchers want methods that
pick up on and block these biasing selection effects
O Critic: Actually the “reforms” recommend methods
where selection effects make no difference
62. Either you care about error probabilities
or not
O If not, experimental design principles (e.g.,
RCTs) may well go by the board
O Not enough to have a principle: we must be
transparent about data-dependent selections
O Your statistical account needs a way to make
use of the information
O “Technical activists” are not free of conflicts of
interest and of philosophy
63. Granted, error statistical improvements
are needed
O An inferential construal of error probabilities wasn’t clearly given (Birnbaum); that is my goal
O It’s not long-run error control (performance), but severely probing flaws today
O I also grant that an error statistical account needs to say more about how it will use background information
64. Future ASA project
O Look at the “other approaches” (Bayes
factors, LRs, Bayesian updating)
O What is it for a replication to succeed or
fail on those approaches?
(can’t be just a matter of prior beliefs in
the hypotheses)
65. Finally, it should be recognized that
often better statistics cannot help
O Rather than search for more “idols”, do
better science, get better experiments and
theories
O One hypothesis must always be: our
results point to the inability of our study to
severely probe the phenomenon of
interest
66. Be ready to admit questionable science
O The scientific status of an inquiry is questionable if it is unable to distinguish a poorly run study from a poor hypothesis,
O or if it continually violates minimal requirements for severe testing
67. Non-replications often construed as
simply weaker effects
Two that didn’t replicate in psychology:
O Belief in free will and cheating
O Physical distance (of points plotted) and emotional closeness
69. The ASA’s Six Principles
O (1) P-values can indicate how incompatible the data are
with a specified statistical model
O (2) P-values do not measure the probability that the
studied hypothesis is true, or the probability that the data
were produced by random chance alone
O (3) Scientific conclusions and business or policy
decisions should not be based only on whether a p-value
passes a specific threshold
O (4) Proper inference requires full reporting and
transparency
O (5) A p-value, or statistical significance, does not
measure the size of an effect or the importance of a result
O (6) By itself, a p-value does not provide a good measure
of evidence regarding a model or hypothesis
70. Mayo and Cox (2010): Frequentist Principle of
Evidence (FEV); SEV: Mayo and Spanos (2006)
FEV/SEV: insignificant result: A moderate P-value
is evidence of the absence of a discrepancy δ
from H0, only if there is a high probability the
test would have given a worse fit with H0 (i.e.,
d(X) > d(x0) ) were a discrepancy δ to exist
FEV/SEV: a significant result d(x0) is evidence of a discrepancy δ from H0, if and only if there is a high probability the test would have given d(X) < d(x0) were a discrepancy as large as δ absent
71. Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, σ known
(FEV/SEV): If d(x) is not statistically significant, then
μ < M0 + kε σ/√n passes the test T+ with severity (1 – ε)
(FEV/SEV): If d(x) is statistically significant, then
μ > M0 + kε σ/√n passes the test T+ with severity (1 – ε),
where P(d(X) > kε) = ε and M0 is the observed sample mean
72. References
O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical
Inference: A Discussion, edited by L. J. Savage. London: Methuen.
O Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder.” Statistical Science 18(1): 1-12; 28-32.
O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis
1 (3): 385–402.
O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
O Box, G. 1983. “An Apology for Ecumenism in Statistics.” In Scientific Inference, Data Analysis, and Robustness, edited by G. E. P. Box, T. Leonard, and C.-F. J. Wu, 51-84. New York: Academic Press.
O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.
O Cox, D. R. and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276-304. Cambridge: Cambridge University Press.
O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap.” Bulletin of the American Mathematical Society 50(1): 126-46.
O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
73. O Gelman, A. and Shalizi, C. 2013. 'Philosophy and the Practice of Bayesian
Statistics' and 'Rejoinder', British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.
O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.
O Goldacre, B. 2016. “Make journals report clinical trials properly.” Nature 530(7588), online 2 February 2016.
O Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor.” Annals of Internal Medicine 130: 1005-1013.
O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of
Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto:
Holt, Rinehart and Winston.
O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and
Its Conceptual Foundation. Chicago: University of Chicago Press.
O Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive
Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos
eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The
Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-
Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
O Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of
Science 57 (2) (June 1): 323–357.
74. O Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook
of the Philosophy of Science. The Netherlands: Elsevier.
O Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New
Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7
(3): 283–300.
O Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
O Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical
Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First
published in Bul. Acad. Pol.Sci. 73-96.
O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London:
Methuen.
O Selvin, H. 1970. “A critique of tests of significance in survey research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
O Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data
Detected by Statistics Alone", Psychological Science, vol. 24, no. 10, pp. 1875-1888.
O Trafimow D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology
37(1): pp. 1-2.
O Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context, process, and purpose.” The American Statistician 70(2): 129-133.
75. Abstract
If a statistical methodology is to be adequate, it needs
to register how “questionable research practices”
(QRPs) alter a method’s error probing capacities. If
little has been done to rule out flaws in taking data as
evidence for a claim, then that claim has not passed a
stringent or severe test. The goal of severe testing is
the linchpin for (re)interpreting frequentist methods so
as to avoid long-standing fallacies at the heart of
today’s statistics wars. A contrasting philosophy views
statistical inference in terms of posterior probabilities
in hypotheses: probabilism. Presupposing probabilism,
critics mistakenly argue that significance and
confidence levels are misinterpreted, exaggerate
evidence, or are irrelevant for inference.
Recommended replacements—Bayesian updating,
Bayes factors, likelihood ratios—fail to control severity.