D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
Replication Crises and the Statistics Wars: Hidden Controversies
1. Replication Crises and the
Statistics Wars: Hidden
Controversies
Deborah G Mayo
June 14, 2018
Reproducibility and Replicability
in Psychology and Experimental Philosophy
2. • High-profile failures of replication have
brought forth reams of reforms
• How should the integrity of science be
restored?
• Experts do not agree.
• We need to pull back the curtain on why.
2
3. Hidden Controversies
1. The Statistics Wars: Age-old abuses, fallacies of
statistics (Bayesian-frequentist, Fisher-N-P debates)
simmer below today’s debates; reformulations of
frequentist methods are ignored
2. Replication Paradoxes: While a major source of
handwringing stems from biasing selection effects—
cherry picking, data dredging, trying and trying again,
some reforms and alternative methods conflict with
the needed error control
3. Underlying assumptions and their violations
• A) statistical model assumptions
• B) links from experiments, measurements &
statistical inferences to substantive questions
3
4. Significance Tests
• The most used methods are most criticized
• Statistical significance tests are a small part of a
rich set of:
“techniques for systematically appraising and
bounding the probabilities … of seriously
misleading interpretations of data [bounding
error probabilities]” (Birnbaum 1970, 1033)
• These I call error statistical methods (or sampling
theory).
4
5. “p-value. …to test the conformity of the
particular data under analysis with H0 in
some respect:
…we find a function T = t(y) of the data, to
be called the test statistic, such that
• the larger the value of T the more
inconsistent are the data with H0;
• The random variable T = t(Y) has a
(numerically) known probability
distribution when H0 is true.
…the p-value corresponding to any t_obs as
p = p(t) = Pr(T ≥ t_obs; H0)”
(Mayo and Cox 2006, 81)
5
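The Mayo–Cox definition above can be made concrete with a minimal sketch, assuming (purely for illustration) that the test statistic T is standard normal under H0:

```python
import math

def p_value(t_obs):
    """One-sided p-value Pr(T >= t_obs; H0), assuming (for illustration)
    that T has a standard normal distribution under H0."""
    # Survival function of N(0,1) via the complementary error function
    return 0.5 * math.erfc(t_obs / math.sqrt(2))

# The larger the value of T, the more inconsistent the data with H0,
# and the smaller the p-value:
print(round(p_value(0.5), 3))   # 0.309
print(round(p_value(1.96), 3))  # 0.025
```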
6. Testing Reasoning
• If even larger differences than t_obs occur fairly
frequently under H0 (i.e., P-value is not small),
there’s scarcely evidence of
incompatibility with H0
• Small P-value indicates some underlying
discrepancy from H0 because very probably you
would have seen a less impressive difference
than t_obs were H0 true.
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy H => H*
6
7. Neyman-Pearson (N-P) tests:
A null and an alternative hypothesis H0, H1
that are exhaustive
H0: μ ≤ 0 vs. H1: μ > 0
• So the fallacy of rejection (H1 => H*) is impossible
• Rejecting H0 only indicates the statistical alternative
H1 (how discrepant from the null)
As opposed to NHST
7
8. We must get beyond the N-P/Fisher
incompatibilist caricature
• Beyond “inconsistent hybrid” (Gigerenzer 2004, 590):
Fisher–inferential; N-P–long run performance
• Both N-P & F offer techniques to assess and control
error probabilities
• “Incompatibilism” leaves P-value users robbed of
features from N-P tests they need (power)
• Wrongly supposes N-P testers have only fixed error
probabilities & are not allowed to report P-values
8
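Power, the N-P feature the slide says P-value users need, is easy to exhibit for the one-sided test of slide 7 (H0: μ ≤ 0 vs. H1: μ > 0). A sketch, assuming a known-σ normal model (1.645 is the usual 0.05 z-benchmark):

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(mu1, n, sigma=1.0, z_alpha=1.645):
    """Power of the one-sided z-test of H0: mu <= 0 vs H1: mu > 0
    against the alternative mu1: Pr(test rejects; mu = mu1)."""
    return 1 - norm_cdf(z_alpha - math.sqrt(n) * mu1 / sigma)

# Power grows with the discrepancy mu1 and with the sample size n
print(round(power(0.0, 30), 3))  # at the null, power equals the size, 0.05
print(round(power(0.3, 30), 3))
print(round(power(0.3, 100), 3))
```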
9. Replication Paradox
(for critics of N-P & F Significance Tests)
Critic: It’s much too easy to get a small P-
value
You: Why do they find it so difficult to
replicate the small P-values others found?
Is it easy or is it hard?
9
10. Both Fisher and N-P methods: it’s easy to lie
with statistics by selective reporting
• Sufficient finagling—cherry-picking, P-
hacking, significance seeking, multiple
testing, look elsewhere—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
10
11. Severity Requirement:
Everyone agrees: If the test had little or no
capability of finding flaws with H (even if H is
incorrect), then agreement between data x0 and H
provides poor (or no) evidence for H
(“too cheap to be worth having” Popper 1983, 30)
• Such a test fails a minimal requirement for a
stringent or severe test
• Their new idea was using probability to reflect
this capability
• They may not have put it explicitly in these
terms but my account does (requires
reinterpreting tests)
11
12. This alters the role of probability: typically just two are recognized
• Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis,
given data x0
(e.g., Bayesian, likelihoodist, Fisher (at times))
• Performance. To ensure long-run reliability of
methods, coverage probabilities (frequentist,
behavioristic Neyman-Pearson, Fisher (at times))
12
13. What happened to using probability to assess
error probing capacity?
• Neither “probabilism” nor “performance” directly
captures it
• Good long-run performance is a necessary, not
a sufficient, condition for severity
• Need to explicitly reinterpret tests according to
how well or poorly discrepancies are indicated
13
14. Key to revising the role of error
probabilities
• Problems with selective reporting, cherry
picking, stopping when the data look
good, P-hacking, are not problems about
long-runs—
• It’s that we cannot say the case at hand
has done a good job of avoiding the
sources of misinterpreting data
14
15. A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets
a probability boost, is made comparatively
firmer)
• Performance: unless it stems from a method
with low long-run error
• Probativism (severe testing): unless
something (a fair amount) has been done to
probe ways we can be wrong about C
15
16. Biasing selection effects:
One function of severity is to identify problematic
selection effects (not all are)
• Biasing selection effects: when data or
hypotheses are selected or generated (or a test
criterion is specified), in such a way that the
minimal severity requirement is violated,
seriously altered or incapable of being
assessed
16
17. Nominal vs. actual significance levels
Suppose that twenty sets of differences have
been examined, that one difference seems large
enough to test and that this difference turns out
to be ‘significant at the 5 percent level.’ ….The
actual level of significance is not 5 percent,
but 64 percent! (Selvin 1970, 104)
From (Morrison & Henkel’s Significance Test
controversy 1970!)
17
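Selvin’s 64 percent falls out directly if the twenty examined differences are treated as twenty independent tests, each run at the 5 percent level (an idealizing assumption):

```python
# Probability that at least one of 20 independent tests reaches the
# nominal 5% level when every null hypothesis is true
actual_level = 1 - 0.95 ** 20
print(round(actual_level, 2))  # 0.64: the actual, not the nominal, level
```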
18. Spurious P-Value
You report: Such results would be difficult to
achieve under the assumption of H0
When in fact such results are common under the
assumption of H0
There are many more ways you can be wrong with
hunting (different sample space)
(Formally):
• You say Pr(P-value ≤ P_obs; H0) ≈ P_obs = small
• But in fact Pr(P-value ≤ P_obs; H0) = high
18
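The formal point can be checked by simulation: hunt through twenty null samples, report only the best p-value, and Pr(P-value ≤ 0.05; H0) comes out high, not 0.05. A sketch using one-sided z-tests on N(0,1) data (all names and sizes here are illustrative):

```python
import math
import random

random.seed(1)

def z_test_p(sample):
    """One-sided p-value for H0: mu = 0 with known sigma = 1 (illustrative)."""
    n = len(sample)
    z = math.sqrt(n) * sum(sample) / n
    return 0.5 * math.erfc(z / math.sqrt(2))

def hunted_p(n_tests=20, n=25):
    """Examine 20 null samples and report only the smallest p-value found."""
    return min(z_test_p([random.gauss(0, 1) for _ in range(n)])
               for _ in range(n_tests))

# With hunting, "p <= 0.05" occurs far more often than 5% under H0
rate = sum(hunted_p() <= 0.05 for _ in range(2000)) / 2000
print(round(rate, 2))  # close to 1 - 0.95**20, about 0.64
```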
19. Scapegoating
• Nowadays, we’re likely to see the tests
blamed
• My view: Tests don’t kill inferences,
people do
• Even worse are those statistical accounts
where the abuse vanishes!
19
20. Some say taking account of biasing selection
effects “defies scientific sense”
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, as they
are more commonly called, data dredging and
peeking at the data. The frequentist solution to both
problems involves adjusting the P-value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman 1999,
1010)
(To his credit, he’s open about this; heads the Meta-Research
Innovation Center at Stanford)
20
21. Likelihood Principle (LP)
The vanishing act links to a pivotal disagreement in
the philosophy of statistics battles
On probabilist accounts, the import of the data is via the
ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses vary
21
22. All error probabilities violate the LP
(even without selection effects):
Sampling distributions, significance levels, power, all
depend on something more [than the likelihood
function]–something that is irrelevant in Bayesian
inference–namely the sample space
(Lindley 1971, 436)
The LP implies…the irrelevance of predesignation,
of whether a hypothesis was thought of beforehand
or was introduced to explain known effects
(Rosenkrantz 1977, 122)
22
23. Optional Stopping:
Error probing capacities are altered not just
by cherry picking and data dredging, but
also via data dependent stopping rules:
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
Instead of fixing the sample size n in
advance, in some tests, n is determined by
a stopping rule:
23
24. • Keep sampling until H0 is rejected at 0.05
level
i.e., keep sampling until M > 1.96 s/√n
• Trying and trying again: having failed to
rack up a 1.96 s/√n difference after 10
trials, go to 20, 30 and so on until
obtaining a 1.96 s/√n difference
24
25. Nominal vs. Actual
significance levels again:
• With n fixed the Type 1 error probability is 0.05
• With this stopping rule the actual significance
level differs from, and will be greater than, 0.05
(with a proper stopping rule, one certain to stop, it reaches 1)
25
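The inflation is easy to verify by simulation. A sketch of the try-and-try-again rule: sample from H0, stop as soon as the running mean crosses 1.96/√n, capped at an illustrative maximum sample size:

```python
import math
import random

random.seed(2)

def rejects_under_optional_stopping(n_max=500):
    """Draw from N(0,1) -- so H0: mu = 0 is true -- and stop as soon as
    the running mean M satisfies |M| > 1.96/sqrt(n) (2-sided test)."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0, 1)
        if abs(total / n) > 1.96 / math.sqrt(n):
            return True  # "significant" even though H0 is true
    return False

trials = 2000
rate = sum(rejects_under_optional_stopping() for _ in range(trials)) / trials
print(round(rate, 2))  # well above the nominal 0.05
```

Raising `n_max` pushes the actual rejection rate higher still, toward 1.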
26. • Equivalently, you can ensure that 0 is
excluded from a confidence interval (or
credibility interval) even if true (Berger and
Wolpert 1988)
• Likewise the same p-hacked hypothesis can
occur in Bayes factors, credibility intervals,
likelihood ratios
26
27. With One Big Difference:
• The direct grounds to criticize inferences
as flouting error statistical control are lost
• Probabilists condition on the actual data;
• Error probabilities take into account other
outcomes that could have occurred but
did not (the sampling distribution)
27
28. What Counts as Cheating?
“[I]f the sampling plan is ignored, the researcher is
able to always reject the null hypothesis, even if it is
true. This example is sometimes used to argue that
any statistical framework should somehow take the
sampling plan into account. Some people feel that
‘optional stopping’ amounts to cheating…. This
feeling is, however, contradicted by a mathematical
analysis.” (Eric-Jan Wagenmakers, 2007, 785)
But the “proof” assumes the likelihood principle (LP)
by which error probabilities drop out. (Edwards,
Lindman, and Savage 1963)
29. • Replication researchers (re)discovered that data-
dependent hypotheses and stopping are a major
source of spurious significance levels.
“Authors must decide the rule for terminating data
collection before data collection begins and report
this rule in the articles” (Simmons, Nelson, and
Simonsohn 2011, 1362).
• Or report how their stopping plan alters relevant
error probabilities.
29
30. “This [optional stopping] example is cited as a
counter-example by both Bayesians and
frequentists! If we statisticians can't agree which
theory this example is counter to, what is a
clinician to make of this debate?”
(Little 2006, 215).
Underlies today’s disputes
31. • Critic: It’s too easy to satisfy standard significance
thresholds
• You: Why do replicationists find it so hard to achieve
significance thresholds (with preregistration)?
• Critic: Obviously the initial studies were guilty of P-
hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on, adjust, and block these biasing
selection effects.
• Critic: Actually “reforms” recommend methods where
the need to alter P-values due to data dredging
vanishes
31
32. How might probabilists block intuitively
unwarranted inferences
(without error probabilities)?
A subjective Bayesian might say:
If our beliefs were mixed into the interpretation of
the evidence, we wouldn’t declare there’s
statistical evidence of some unbelievable claim
(distinguishing shades of grey and being
politically moderate, ovulation and voting
preferences)
32
33. Rescued by beliefs
• That could work in some cases (it still
wouldn’t show what researchers had done
wrong)—battle of beliefs
• Besides, researchers sincerely believe their
hypotheses
• So now you’ve got two sources of flexibility,
priors and biasing selection effects
33
34. No help with our most important problem
• How to distinguish the warrant for a single
hypothesis H with different methods
(e.g., one has biasing selection effects, another,
pre-registered results and precautions)?
• Since there’s a single H, its prior would be the
same
34
35. Most Bayesians use “default” priors
• Eliciting subjective priors too difficult,
scientists reluctant to allow subjective beliefs
to overshadow data
• Default priors are supposed to prevent prior
beliefs from influencing the posteriors
(objective “O-Bayesians”; Berger 2006)
35
36. How should we interpret them?
• “The priors are not to be considered expressions
of uncertainty, ignorance, or degree of belief.
Default priors may not even be probabilities…”
(Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-
subjective priors
36
37. • Default Bayesian Reforms often touted as free of
selection effects
“…Bayes factors can be used in the complete
absence of a sampling plan…”
(Bayarri, Benjamin, Berger, Sellke 2016, 100)
37
38. • Whatever you think of them, there’s a big
difference in foundations (error statistical vs
Bayesian/probabilist)
• Knowing something about the statistics wars,
you’d get to the bottom of current proxy wars
38
39. Recent push to “redefine statistical significance” is
based on the odds ratios/ Bayes Factors (BF)
(Benjamin et al 2017)
39
40. Old & new debates: P-values exaggerate based
on Bayes Factors or Likelihood Ratios
• Testing of a point null hypothesis, a lump of prior
probability given to H0
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
• The criticism is the posterior probability on H0 can
be larger than the P-value.
• So if you interpret a P-value as a posterior
on H0 (a fallacy), you’d be saying the Pr(H0|x) is
low
40
41. Bayes-Fisher Disagreement or Jeffreys-Lindley
“Paradox”
• With a lump of prior given to the point null, and
the rest appropriately spread over the alternative,
an α significant result can correspond to
Pr(H0|x) = (1 − α)! (e.g., 0.95)
“spike and smear”
• To a Bayesian this shows P-values exaggerate
evidence against
41
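The spike-and-smear arithmetic can be sketched directly. The N(0, τ²) smear and the 50% lump are illustrative choices (the slide fixes neither); the observed mean is set exactly at the two-sided 0.05 cut-off:

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_H0(n, sigma=1.0, tau=1.0, lump=0.5):
    """Pr(H0 | x) under a spike-and-smear prior: probability `lump` on
    H0: mu = 0, the rest smeared as N(0, tau^2) over the alternative.
    The observed mean m sits exactly at the 2-sided 0.05 cut-off."""
    m = 1.96 * sigma / math.sqrt(n)                  # just-significant mean
    marg_H0 = normal_pdf(m, sigma**2 / n)            # marginal density under H0
    marg_H1 = normal_pdf(m, tau**2 + sigma**2 / n)   # marginal density under H1
    bf01 = marg_H0 / marg_H1                         # Bayes factor for H0
    return lump * bf01 / (lump * bf01 + 1 - lump)

# The same alpha = 0.05 result yields ever-higher posteriors on H0 as n grows
for n in (100, 10_000, 1_000_000):
    print(n, round(posterior_H0(n), 2))
```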
42. Do P-values Exaggerate Evidence?
• Significance testers balk at allowing highly
significant results to be interpreted as no
evidence against the null–or even evidence for it!
Bad Type II error
• The lump of prior on the null conflicts with the idea
that all nil nulls are false
• P-values can also equal the posterior! (without
the spiked prior)–even though they measure very
different things.
42
43. • “[T]he reason that Bayesians can regard P-values as
overstating the evidence against the null is simply a
reflection of the fact that Bayesians can
disagree sharply with each other” (Stephen Senn
2002, 2442)
• “In fact it is not the case that P-values are too
small, but rather that Bayes point null posterior
probabilities are much too big!….Our concern
should not be to analyze these misspecified
problems, but to educate the user so that the
hypotheses are properly formulated,” (Casella and
Berger 1987b, 334, 335).
• “concentrating mass on the point null hypothesis
is biasing the prior in favor of H0 as much as
possible” (ibid. 1987a, 111) whether in 1 or 2-sided
tests
43
44. • In one sense, lowering the probability of type I
errors is innocuous (if that reflects your balance
of Type I and II errors)
• To base the argument for lowering P-values on
BFs–if you don’t do inference by means of BFs–
is not innocuous
(because of the forfeiture of error control)
44
45. BF Advocates Forfeit Their Strongest Criticisms
• Daryl Bem (2011): subjects did better than
chance at predicting the picture to be shown in the
future; credit went to ESP
• Some locate the start of the Replication Crisis
with Bem’s “Feeling the Future” (2011);
Bem admits data dredging
• Keen to show we should trade significance tests
for BFs, critics resort to a default Bayesian prior to
make the null hypothesis comparatively more
probable than a chosen alternative
45
46. Repligate
• Replication research has pushback:
some call it methodological terrorism
(enforcing good science or bullying?)
• If you must appeal to your disbelief in my
effect to find I fail to replicate, I’m more
likely to see it as bullying
• I can also deflect your criticism
46
47. Bem’s Response
Whenever the null hypothesis is sharply defined but
the prior distribution on the alternative hypothesis is
diffused over a wide range of values, as it is [here] it
boosts the probability that any observed data will be
higher under the null hypothesis than under the
alternative. This is known as the Lindley-Jeffreys
paradox: A frequentist analysis that yields strong
evidence in support of the experimental hypothesis
can be contradicted by a misguided Bayesian analysis
that concludes that the same data are more likely
under the null. (Bem et al. 2011, 717)
47
49. Jeffreys-Lindley result points to the large-n
problem
• Fixing the P-value, as sample size n increases, the
cut-off (in terms of the sample mean) gets smaller
• Get to a point where x is closer to the null than to
various alternatives
• Many would lower the P-value requirement as n
increases–that’s fine, but one can always avoid inferring a
discrepancy beyond what’s warranted:
• Severity prevents interpreting a significant result as
indicating a larger discrepancy than warranted
(Mountains out of Molehills fallacy) 49
50. • Severity tells us: an α-significant difference indicates less of
a discrepancy from the null if it results from a larger
(n1) rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire): a fire
alarm that goes off with burnt toast, or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the alarm that goes off
with burnt toast] 50
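The burnt-toast point in numbers: the observed mean that just reaches α = 0.05 shrinks like 1/√n, so the same significance level corresponds to a smaller indicated discrepancy at a larger sample size (σ = 1 is an illustrative choice):

```python
import math

def just_significant_mean(n, sigma=1.0, z=1.96):
    """Smallest observed mean reaching the two-sided 0.05 cut-off."""
    return z * sigma / math.sqrt(n)

for n in (25, 100, 10_000):
    print(n, round(just_significant_mean(n), 3))
# The n = 10,000 "alarm" trips on a far smaller discrepancy than the n = 25 one
```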
51. What About Fallacies of
Non-Significant Results?
• They don’t warrant a 0 discrepancy
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in a larger
difference than observed–set upper bounds
• If you very probably would have observed a more
impressive (smaller) P-value than you did, if μ >
μ1 (μ1 = μ0 + γ), then the data are good evidence
that μ < μ1
• Akin to power analysis (Cohen, Neyman) but
sensitive to x0 51
52. Confidence Intervals also require
reinterpretation
Duality between tests and intervals: values within
the (1 − c) CI are non-rejectable at the c level
• Too dichotomous: in/out, plausible/not
plausible
• Justified in terms of long-run coverage
• All members of the CI treated on a par
• Fixed confidence levels (need several
benchmarks) 52
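The duality can be checked numerically for a known-σ normal mean: every μ0 inside the 95% interval has a two-sided p-value above 0.05, every μ0 outside has one below (the observed mean and n here are illustrative):

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ci95(m, n, sigma=1.0):
    """Two-sided 95% confidence interval for mu, known sigma."""
    half = 1.96 * sigma / math.sqrt(n)
    return m - half, m + half

def p_two_sided(m, mu0, n, sigma=1.0):
    """Two-sided p-value for H0: mu = mu0 given observed mean m."""
    z = abs(m - mu0) * math.sqrt(n) / sigma
    return 2 * (1 - norm_cdf(z))

m, n = 0.30, 100
lo, hi = ci95(m, n)
print(round(p_two_sided(m, lo + 0.01, n), 3))  # just inside the CI: p > 0.05
print(round(p_two_sided(m, lo - 0.01, n), 3))  # just outside the CI: p < 0.05
```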
53. Violated Assumptions Do More Damage Than
Setting P-values
We should worry more about violated
model assumptions
• Ironically, testing model assumptions
(and fraudbusting) performed by simple
significance tests (e.g., IID)
53
54. Some Bayesians Care About Model Testing
• “[C]rucial parts of Bayesian data analysis, such
as model checking, can be understood as ‘error
probes’ in Mayo’s sense”, which “might be seen
as using modern statistics to implement the
Popperian criteria of severe tests.”
(Andrew Gelman and Cosma Shalizi 2013, 10).
54
55. In dealing with non-replication we should
worry much more about
• Links from experiments to inferences of
interest
• I’ve seen plausible hypotheses poorly (or
questionably) tested
55
56. Is it Non-replication or Something About
the Design?
• Replications are to be superior: preregistered,
free of perverse incentives, with high power
• But might they be influenced by replicator’s
beliefs? (Kahneman: get approval from
original researcher)
• Subjects (often students) often told the
purpose of the experiment
56
57. • One of the non-replications: cleanliness and
morality: Does unscrambling soap words make
you less judgmental?
“Ms. Schnall had 40 undergraduates unscramble
some words. One group unscrambled words
that suggested cleanliness (pure, immaculate,
pristine), while the other group unscrambled
neutral words. They were then presented with
a number of moral dilemmas, like whether it’s
cool to eat your dog after it gets run over by a
car. …(Bartlett 2014)
57
58. …Turns out, it did. Subjects who had
unscrambled clean words weren’t as harsh on the
guy who chows down on his chow.”
(Bartlett 2014)
• Focusing on the P-values ignores larger
questions of measurement in psychology & the
leap from the statistical to the substantive.
H => H*
58
59. 2 that didn’t replicate in psychology:
• Belief in free will and cheating
• Physical distance (of points plotted) and
emotional closeness
I encourage meta-methodological scrutiny by
philosophers–especially those in Xphi
59
60. 2 that didn’t replicate
• Belief in free will and cheating
“It found that participants who read a passage
arguing that their behavior is predetermined were
more likely than those who had not read the
passage to cheat on a subsequent test.”
• Physical distance (of points plotted) and
emotional closeness
“Volunteers asked to plot two points that were far
apart on graph paper later reported weaker
emotional attachment to family members,
compared with subjects who had graphed points
close together.” (Carey, 2015) 60
61. Hidden Controversies
1. The Statistics Wars: Age-old abuses, fallacies of
statistics (Bayesian-frequentist, Fisher-N-P debates)
simmer below today’s debates; reformulations of
frequentist methods are ignored
2. Replication Paradoxes: While a major source of
handwringing stems from biasing selection effects—
cherry picking, data dredging, trying and trying again,
some reforms and preferred alternatives conflict with
the needed error control.
3. Underlying assumptions and their violations
• A) statistical model assumptions
• B) links from experiments, measurements &
statistical inferences to substantive questions
61
62. • Recognize different roles of probability: probabilism,
long run performance, probativism (severe testing)
• Move away from cookbook stat–long deplored–but
don’t expect agreement on numbers from evaluations
of different things (P-values, posteriors)
• Recognize criticisms & reforms are often based on
rival underlying philosophies of evidence
• The danger is that some reforms may enable rather
than directly reveal illicit inferences due to biasing
selection effects
• Worry more about model assumptions, and whether
our experiments are teaching us about the
phenomenon of interest 62
64. References
• Armitage, P. 1962. “Contribution to Discussion”, in The Foundations of Statistical
Inference: A Discussion, edited by L. J. Savage. London: Methuen.
• Bartlett, T. 2014. “Replication Crisis in Psychology Research Turns Ugly and Odd”,
Chronicle of Higher Education online, June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. 2016. “Rejection Odds and
Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”,
Journal of Mathematical Psychology 72: 90-103.
• Bem, D. J. 2011. “Feeling the Future: Experimental Evidence for Anomalous
Retroactive Influences on Cognition and Affect”, Journal of Personality and Social
Psychology 100(3), 407-425.
• Bem, D. J., Utts, J., and Johnson, W. 2011. “Must Psychologists Change the Way
They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4),
716-719.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). “Redefine Statistical
Significance”, Nature Human Behaviour 2, 6–10.
• Berger, J. O. 2003 “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
and “Rejoinder”, Statistical Science 18(1): 1-12; 28-32.
• Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis
1 (3): 385–402.
• Berger, J. O. and Wolpert, R. 1988. The Likelihood Principle, 2nd ed. Vol. 6 Lecture
Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics.
• Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
64
65. • Carey, B. (2015). “Many Psychology Findings Not as Strong as Claimed, Study
Says”, New York Times (August 27, 2015).
• Casella, G. and Berger, R. 1987a. “Reconciling Bayesian and Frequentist
Evidence in the One-sided testing Problem”, Journal of the American Statistical
Association 82(397): 106-111.
• Casella, G. and Berger, R. 1987b. “Comment on Testing Precise Hypotheses by J.
Berger and M. Delampady”, Statistical Science 2(3): 344-347.
• Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences 2nd ed.
Hillsdale NJ: Erlbaum.
• Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and
Hall.
• Cox, D. R., and Mayo, D.G. 2010. “Objectivity and Conditionality in Frequentist
Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability, and the Objectivity and Rationality of Science, edited by Deborah G.
Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University Press.
• Edwards, W., Lindman, H., and Savage, L. 1963. “Bayesian Statistical Inference
for Psychological Research”, Psychological Review 70(3): 193-242.
• Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
• Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17(1): 69–78.
• Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian
Statistics” and “Rejoinder”, British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
• Gigerenzer, G. 2004. “Mindless Statistics”, Journal of Socio-Economics 33(5), 587-
606. 65
• Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The
Empire of Chance. Cambridge: Cambridge University Press.
• Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor”,
Annals of Internal Medicine 130: 1005-1013.
• Kahneman, D. 2014. “A New Etiquette for Replication”, Social Psychology 45(4), 299-
311.
• Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical
Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart
and Winston.
• Little, R. 2006. “Calibrated Bayes: A Bayes/Frequentist Roadmap”, The American
Statistician 60(3): 213-223.
• Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its
Conceptual Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the
Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive
Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical
Statistics: 247-275.
• Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–
Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2)
(June 1): 323–357.
66
67. • Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198.
Handbook of the Philosophy of Science. The Netherlands: Elsevier.
• Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New
Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods
7 (3): 283–300.
• Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
• Neyman, J. (1962). 'Two Breakthroughs in the Theory of Statistical Decision
Making', Revue De l'Institut International De Statistique / Review of the
International Statistical Institute, 30(1), 11-27.
• Open Science Collaboration (2015). ‘Estimating the Reproducibility of
Psychological Science’, Science 349(6251), 943–51.
• Pearson, E. S. & Neyman, J. (1930). “On the problem of two samples”, Joint
Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif.
Press). First published in Bul. Acad. Pol.Sci. 73-96.
• Popper, K. (1983). Realism and the Aim of Science. Totowa, NJ: Rowman and
Littlefield.
• Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
67
68. • Savage, L. J. 1961. “The Foundations of Statistics Reconsidered”, Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability I, Berkeley:
University of California Press, 575-586.
• Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London:
Methuen.
• Schnall, S. Benton, J. and Harvey, S. 2008. “With a Clean Conscience: Cleanliness
Reduces the Severity of Moral Judgments”, Psychological Science 19(12): 1219-
1222.
• Selvin, H. 1970. “A critique of tests of significance in survey research. In The significance
test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De
Gruyter.
• Senn, S. 2002. “A Comment on Replication, P-values and Evidence S.N. Goodman
Statistics in Medicine 1992; 11: 875-879”, Statistics in Medicine 21(16), 2437-2444.
• Simmons, J., Nelson, L., and Simonsohn, U. 2011, “False-Positive Psychology:
Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as
Significant”, Psychological Science 22(11): 1359-1366.
• Wagenmakers, E-J., 2007. “A Practical Solution to the Pervasive Problems of P values”,
Psychonomic Bulletin & Review 14(5): 779-804.
• Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context, process,
and purpose” (and supplemental materials), The American Statistician 70(2): 129-33.
68
69. Severity for Test T+:
SEV(T+, d(x0), claim C)
Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, known σ;
discrepancy parameter γ; μ1 = μ0 + γ; d0 = d(x0), the observed
value of the test statistic d(X) = √n(M − μ0)/σ
SIR: (Severity Interpretation with low P-values)
• (a): (high): If there’s a very low probability that so large
a d0 would have resulted, if μ were no greater than μ1,
then d0 indicates μ > μ1: SEV(μ > μ1) is high.
• (b): (low): If there is a fairly high probability that d0 would
have been larger than it is, even if μ = μ1, then d0 is not
a good indication μ > μ1: SEV(μ > μ1) is low. 69
70. SIN: (Severity Interpretation for Negative
results)
• (a): (high) If there is a very high probability
that d0 would have been larger than it is, were
μ > μ1, then μ ≤ μ1 passes the test with high
severity: SEV(μ ≤ μ1) is high.
• (b): (low) If there is a low probability that d0
would have been larger than it is, even if μ >
μ1, then μ ≤ μ1 passes with low severity:
SEV(μ ≤ μ1) is low.
70
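SIR and SIN above reduce to normal-tail calculations for test T+ with known σ. A sketch (evaluating at μ1, the least favorable point of the region in question, is how the severity bound is computed here; the numbers are illustrative):

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def sev_greater(d0, gamma, n, sigma=1.0):
    """SEV(mu > mu1), mu1 = mu0 + gamma, for T+: H0: mu <= mu0 vs H1: mu > mu0
    with d(X) = sqrt(n)(M - mu0)/sigma. Computed as Pr(d(X) <= d0; mu = mu1),
    the worst case over mu <= mu1."""
    return norm_cdf(d0 - math.sqrt(n) * gamma / sigma)

def sev_leq(d0, gamma, n, sigma=1.0):
    """SEV(mu <= mu1): Pr(d(X) > d0; mu = mu1), the worst case over mu > mu1."""
    return 1 - sev_greater(d0, gamma, n, sigma)

# A significant d0 = 2.0 with n = 100 severely indicates mu > 0.05,
# but not mu > 0.2 (the mountains-out-of-molehills fallacy is blocked):
print(round(sev_greater(2.0, 0.05, 100), 3))  # 0.933
print(round(sev_greater(2.0, 0.20, 100), 3))  # 0.5
```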
71. Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is the
datum of some other experiment, and if it
happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have exactly
the same thing to say about the values of
µ…” (Savage 1962, p. 17)
71
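Savage’s proportionality point is standardly illustrated with binomial versus negative-binomial sampling (a textbook example, not one from the slides): 9 successes in 12 trials yield proportional likelihoods under either sampling plan, so the LP says the data carry the same import, yet the p-values for H0: θ = 0.5 differ because the sample spaces differ:

```python
from math import comb

n, x = 12, 9  # 9 successes observed in 12 Bernoulli trials

def lik_binomial(theta):
    """Likelihood when n was fixed in advance."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def lik_negbinomial(theta):
    """Likelihood when sampling continued until the 3rd failure."""
    return comb(n - 1, x) * theta**x * (1 - theta)**(n - x)

# The likelihoods are constant multiples of each other (220/55 = 4), so any
# LP-respecting account treats the two experiments identically...
print(round(lik_binomial(0.7) / lik_negbinomial(0.7), 6))  # 4.0

# ...but the one-sided p-values for H0: theta = 0.5 differ:
p_binom = sum(comb(n, k) for k in range(x, n + 1)) / 2**n  # Pr(X >= 9)
p_negb = sum(comb(11, j) for j in range(3)) / 2**11        # Pr(N >= 12)
print(round(p_binom, 3), round(p_negb, 3))  # 0.073 vs 0.033
```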