A talk given by Deborah G. Mayo
(Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods at the Dept of Psychology, Princeton University on
November 14, 2023
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
1. Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Deborah G. Mayo
Dept of Philosophy, Virginia Tech
Seminar in Advanced Research Methods
Dept of Psychology, Princeton University
November 14, 2023
3. Role of Probability: performance or
probabilism?
• Despite “unifications,” long-standing battles
simmer below the surface in today’s
“statistical crisis in science”
• What’s behind it?
4. Minimal principle of evidence
• We set sail with a minimal principle:
• We don’t have evidence for a claim C if little,
if anything, has been done that would have
found C flawed, even if it is flawed
5. Statistical inference as severe testing
• Probability is used to assess error-probing
capabilities
• Excavation tool for appraising reforms
6. Replication crisis leads to “reforms”
Several are welcome:
• preregistration of protocols, replication
checks, avoiding cookbook statistics
Others are radical
• and even lead to violating our minimal
requirement for evidence
7. • How to subject today’s “reforms” to your
own severe, critical examination?
8. Most often used tools are most
criticized
“Several methodologists have pointed out
that the high rate of nonreplication of
research discoveries is a consequence of
the convenient, yet ill-founded strategy of
claiming conclusive research findings solely
on the basis of a single study assessed by
formal statistical significance, typically for a
p-value less than 0.05. …” (Ioannidis 2005,
696)
9. R.A. Fisher
“[W]e need, not an isolated record, but a
reliable method of procedure. In relation to
the test of significance, we may say that a
phenomenon is experimentally
demonstrable when we know how to
conduct an experiment which will rarely fail
to give us a statistically significant result.”
(Fisher 1947, 14)
10. Simple significance tests (Fisher)
“to test the conformity of the particular data
under analysis with H0 in some respect:
…we find a function T = t(y) of the data, the
test statistic, such that
• the larger the value of T the more
inconsistent are the data with H0;
p = Pr(T ≥ t_obs; H0)”
(Mayo and Cox 2006, 81)
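To make the formula concrete, here is a minimal sketch (mine, not from the talk) of such a test for the mean of a normal distribution with known σ; the function name and numbers are illustrative:

```python
# Hedged sketch: a Fisherian simple significance test of H0: mu = 0
# with X_i ~ N(mu, sigma^2), sigma known. T = sqrt(n) * xbar / sigma
# is N(0, 1) under H0, and p = Pr(T >= t_obs; H0).
import math

def simple_significance_test(xbar, sigma, n):
    """Return (t_obs, p) for a one-sided test of H0: mu = 0."""
    t_obs = math.sqrt(n) * xbar / sigma
    # standard normal survival function via the complementary error function
    p = 0.5 * math.erfc(t_obs / math.sqrt(2))
    return t_obs, p

t_obs, p = simple_significance_test(xbar=0.4, sigma=1.0, n=25)
print(f"t_obs = {t_obs:.2f}, p = {p:.3f}")  # t_obs = 2.00, p = 0.023
```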
11. Testing Reasoning
• Small P-value indicates some underlying
discrepancy from H0 because very
probably you would have seen a less
impressive difference than t_obs were
H0 true.
• This still isn’t evidence of a genuine
statistical effect H1, let alone a scientific
conclusion H*
12. Fallacy of rejection
• H* makes claims that haven’t been
probed by the statistical test
• The moves from experimental
interventions to H* don’t get enough
attention–but statistical accounts
should block them
13. Neyman and Pearson tests (1933) put
Fisherian tests on firmer ground:
Introduce alternative hypotheses H0, H1
H0: μ = 0 vs. H1: μ > 0
• Trade-off between Type I errors and Type II errors
• Restricts the inference to statistical alternatives (in a model)
15. Contemporary casualties of Fisher-Neyman
(N-P) battles
• N-P & Fisher tests claimed to be an “inconsistent
hybrid” (Gigerenzer 2004, 590):
• Fisherians can’t use power; N-P testers can’t report
P-values, only fixed error probabilities (e.g., P < .05)
• In fact, Fisher & N-P recommended both pre-data
error probabilities and post-data P-values
16. They are mathematically
essentially identical
• They both fall under tools for “appraising
and bounding the probabilities of seriously
misleading interpretations of data”
(Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian
tests, resampling, randomization
17. Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking,
significance seeking, multiple testing,
post-data subgroups, trying and trying
again—may practically guarantee an
impressive-looking effect, even if it’s
unwarranted by evidence
• Violates severity
18. Severity Requirement
• We have evidence for a claim C only to
the extent C has been subjected to and
passes a test that would probably have
found C flawed, just if it is.
• This probability is the stringency or
severity with which it has passed the
test.
19. Requires a third role for probability
Probabilism. To assign a degree of confirmation,
support or belief in a hypothesis, given data x0 (absolute
or comparative)
(e.g., Bayesian, likelihoodist, Fisher at times)
Performance. Ensure long-run reliability of methods,
coverage probabilities (frequentist, behavioristic
Neyman-Pearson, Fisher)
Only probabilism is thought to be inferential or evidential
20. What happened to using probability to
assess error-probing capacity?
• Neither “probabilism” nor “performance” directly
captures error probing capacity
• Good long-run performance is a necessary, not a
sufficient, condition for severity
21. Key to solving a central problem
for frequentists
• Why is good performance relevant for
inference in the case at hand?
• What bothers you about selective reporting,
cherry-picking, stopping when the data look
good, P-hacking?
• Not problems about long runs—
22. We cannot say the case at hand has
done a good job of avoiding the
sources of misinterpreting data
23. A claim C is not warranted _______
• Probabilism: unless C is true or
probable (gets a probability boost, made
comparatively firmer)
• Performance: unless it stems from a
method with low long-run error
• Probativism (severe testing): unless
something (a fair amount) has been done
to probe ways we can be wrong about C
24. Fishing for significance alters
probative capacity
Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at the
5 percent level.’ …. The actual level of
significance is not 5 percent, but 64
percent!* (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy
1970!) [*Pr(no success in 20 tests) = (0.95)^20 ≈ 0.36]
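The asterisked arithmetic in one line of code (my sketch, assuming the twenty tests are independent):

```python
# Report only the best of 20 independent tests run at the 0.05 level and
# the chance of at least one "significant" result under the null is
# 1 - (0.95)**20, not 0.05.
alpha, k = 0.05, 20
print(f"Actual level over {k} tests: {1 - (1 - alpha) ** k:.2f}")  # 0.64
```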
25. • Frequentist (error statisticians) need to
adjust P-values to avoid being “fooled
by randomness”
26. • In a classic example in the volume,
adjustments for multiplicity led to
falsifying claimed infant training benefits
on personality (from the 1940s)
• Only 18 of 460 tests were found
statistically significant (11 in the
direction expected by Freudian theory
of the day)
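The back-of-the-envelope reasoning (my sketch, assuming independent tests at the 0.05 level):

```python
# With 460 tests at the 0.05 level, chance alone is expected to produce
# about 23 "significant" results, so finding only 18 undercuts rather
# than supports the claimed infant-training effects.
n_tests, alpha = 460, 0.05
print(f"Expected false positives under the null: {n_tests * alpha:.0f}")  # 23
```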
27. Today’s Meta-research is not free of
philosophy of statistics
From the Bayesian perspective:
“adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman
1999, 1010)
(Co-director of the Meta-Research Innovation Center
at Stanford)
28. Likelihood Principle (LP)
A pivotal disagreement about statistical
evidence
In probabilisms, the import of the data is via
the ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses
vary
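A minimal sketch of the textbook illustration (mine, not from the slides): 9 heads in 12 coin tosses gives a likelihood proportional to θ^9 (1 − θ)^3 whether n = 12 was fixed in advance or tossing continued until the third tail, so on the LP the two designs carry identical evidence, even though their error probabilities differ:

```python
# Likelihoods for theta given 9 heads in 12 tosses under two designs;
# they differ only by a constant, so every likelihood ratio agrees.
from math import comb

def binomial_lik(theta):      # n = 12 fixed in advance
    return comb(12, 9) * theta**9 * (1 - theta)**3

def neg_binomial_lik(theta):  # toss until the 3rd tail appears
    return comb(11, 9) * theta**9 * (1 - theta)**3

theta0, theta1 = 0.5, 0.75
print(binomial_lik(theta1) / binomial_lik(theta0))          # ~4.80
print(neg_binomial_lik(theta1) / neg_binomial_lik(theta0))  # same ratio
```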
29. Logic of Support
• Ian Hacking (1965) “Law of Likelihood”: data x
support hypothesis H0 less well than H1 if
Pr(x; H0) < Pr(x; H1)
(he rejected the account in 1980)
• “there always is such a rival hypothesis
viz., that things just had to turn out the
way they actually did” (Barnard 1972,
129).
31. On the LP, error probabilities
appeal to something irrelevant
“Sampling distributions, significance
levels, power, all depend on something
more [than the likelihood function]–
something that is irrelevant in Bayesian
inference–namely the sample space”
(Lindley 1971, 436)
32. Familiar “reforms” offered as alternatives
to significance tests follow the LP
• “The Bayes factor only depends on the actually
observed data, and not on whether they have been
collected from an experiment with fixed or variable
sample size.”
(van Dongen, Sprenger, and Wagenmakers 2022)
• “It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for
itself”.
(Berger and Wolpert 1988, 78)
33. In testing the mean of a standard normal distribution
34. Optional Stopping
• “if an experimenter uses this [optional
stopping] procedure, then with probability
1 he will eventually reject any sharp null
hypothesis, even though it be true”.
(Edwards, Lindman, and Savage in
Psychological Review 1963, 239)
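A small simulation (my sketch, not from the talk) of the effect they describe: draw from N(0, 1) so the sharp null μ = 0 is true, test after every observation at the nominal two-sided .05 level, and stop at the first “significant” z; the rejection rate climbs well past .05 as the allowed horizon grows:

```python
import math, random

def rejects_by(n_max, z_crit=1.96):
    """Optional stopping: test after each draw, stop at first rejection."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0.0, 1.0)  # H0 is true: mu = 0
        if abs(total) / math.sqrt(n) >= z_crit:
            return True
    return False

random.seed(1)
trials = 2000
for n_max in (10, 100, 1000):
    rate = sum(rejects_by(n_max) for _ in range(trials)) / trials
    print(f"horizon {n_max:4d}: rejection rate ~ {rate:.2f}")  # rises toward 1
```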
35. The Stopping Rule Principle
From their Bayesian standpoint the stopping
rule is irrelevant
• “[the] irrelevance of stopping rules to
statistical inference restores a simplicity
and freedom to experimental design that
had been lost by classical emphasis on
significance levels (in the sense of Neyman
and Pearson).” (Edwards, Lindman, and
Savage 1963, 239)
36. Contrast this with: The 21 Word
Solution
• Replication researchers (re)discovered that
data-dependent hypotheses and stopping are a
major source of spurious significance levels.
• Simmons, Nelson, and Simonsohn (2011) place
at the top of their list the need to block flexible
stopping
• “Authors must decide the rule for terminating
data collection before data collection begins and
report this rule in the articles” (ibid. 1362).
37. You might think replication researchers
disavow the stopping rule principle
“[I]f the sampling plan is ignored, the
researcher is able to always reject the null
hypothesis, even if it is true…. Some people
feel that ‘optional stopping’ amounts to
cheating…. This feeling is, however,
contradicted by a mathematical analysis.”
(Eric-Jan Wagenmakers 2007, 785)
The mathematical analysis assumes the
likelihood principle
38. Replication Paradox
• Significance test critic: It’s too easy to satisfy
standard statistical significance thresholds
• You: Why is it so hard to replicate significant
results with preregistered protocols?
• Significance test critic: Obviously the initial studies
were guilty of P-hacking, cherry-picking, data-
dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on these biasing selection effects.
• Significance test critic: Actually, “reforms”
recommend methods with no need to adjust P-values
due to multiplicity
39. But the value of preregistered reports
is error statistical
• Your appraisal is altered by considering the
probability that some hypotheses, stopping
point, subgroups, etc. could have led to a
false positive, even if only informally
• True, there are many ways to correct P-
values (Bonferroni, false discovery rates).
• The main thing is to have an alert that the
reported P-values are invalid
40. Probabilists can still block intuitively
unwarranted inferences
(without error probabilities)
• Supplement with subjective beliefs
• Likelihoods + prior probabilities
41. Problems
• Could work in some cases, but doesn’t show
what researchers had done wrong—battle of
beliefs
• The believability of data-dredged hypotheses
is what makes them so seductive
• An additional source of flexibility: priors as
well as biasing selection effects
42. Most Bayesians (last decade) use
“default” priors
• Default priors are supposed to prevent prior beliefs
from influencing the posteriors, letting the data dominate
43. How should we interpret them?
• “The priors are not to be considered expressions
of uncertainty, ignorance, or degree of belief.
Conventional priors may not even be
probabilities…” (Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-
subjective priors
(maximum entropy, invariance, maximizing the
missing information, coverage matching).
44. Criticisms of P-hackers lose force
• Wanting to promote an account that
downplays error probabilities, critics hand the
data dredger who deserves criticism a life-raft:
45. Bem’s “Feeling the future” 2011:
ESP?
• Daryl Bem (2011): subjects do better than
chance at predicting the (erotic) picture
shown in the future
• Bem admits data dredging, but Bayesian
critics resort to a default Bayesian prior on
a point null hypothesis
Wagenmakers et al. 2011 “Why psychologists must
change the way they analyze their data”
46. Bem’s response
“Whenever the null hypothesis is sharply defined but the
prior distribution on the alternative hypothesis is diffused
over a wide range of values, as it is in...
Wagenmakers et al. (2011), it boosts the probability that
any observed data will be higher under the null
hypothesis than under the alternative.
This is known as the Lindley-Jeffreys paradox*: ... strong
[frequentist] evidence in support of the experimental
hypothesis [can] be contradicted by a ... Bayesian analysis.”
(Bem et al. 2011, 717)
*Bayes-Fisher disagreement
47. Many of Today’s Significance Test
Debates Trace to Bayes/Fisher
Disagreement
• The posterior probability Pr(H0|x) can be
high while the P-value is low (2-sided test)
48. Bayes/Fisher Disagreement
With a lump of prior given to a point null, and
the rest appropriately spread over the
alternative [spike and smear], an α-significant
result can correspond to
Pr(H0 | x) = (1 − α)! (e.g., 0.95)
with large n
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
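A numerical sketch of the disagreement (my assumptions: equal prior odds, σ = 1 known, the smear taken as μ ~ N(0, 1) under H1): hold the result exactly at the two-sided .05 threshold (z = 1.96) and let n grow; the posterior on H0 approaches 1 while the P-value stays at .05:

```python
import math

def posterior_null(z, n, tau2=1.0):
    """Pr(H0 | xbar) for xbar = z/sqrt(n), spike-and-smear prior:
    Pr(H0) = 0.5 on mu = 0, and mu ~ N(0, tau2) under H1."""
    xbar = z / math.sqrt(n)
    # marginal densities of xbar (up to a common constant)
    m0 = math.sqrt(n) * math.exp(-0.5 * n * xbar**2)  # variance 1/n under H0
    v1 = tau2 + 1.0 / n                               # variance tau2 + 1/n under H1
    m1 = math.exp(-0.5 * xbar**2 / v1) / math.sqrt(v1)
    return m0 / (m0 + m1)

for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: Pr(H0 | z = 1.96) = {posterior_null(1.96, n):.3f}")
# ~0.37, 0.60, 0.94, 0.99: the alpha-significant result increasingly
# "supports" the null on this Bayesian analysis
```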
49. • To the Bayesian, the P-value exaggerates
the evidence against H0
• The significance tester balks at taking low p-
values as no evidence against, or even
evidence for, H0
50. “Concentrating mass on the point null
hypothesis is biasing the prior in favor
of H0 as much as possible” (Casella
and R. Berger 1987, 111)
51. “Redefine Statistical Significance”
“Spike and smear” is the basis for the move
to lower the P-value threshold to .005
(Benjamin et al. 2018)
Opposing megateam: Lakens et al. (2018)
52. • The problem isn’t lowering the probability
of type I errors
• The problem is assuming there should be
agreement between quantities measuring
different things
53. Whether P-values exaggerate
“depends on one’s philosophy of
statistics …
…based on directly comparing P values
against certain quantities (likelihood ratios
and Bayes factors) that play a central role as
evidence measures in Bayesian analysis …
other statisticians do not accept these
quantities as gold standards,”… (Greenland,
Senn, Rothman, Carlin, Poole, Goodman,
Altman 2016, 342)
54. • A silver lining to distinguishing highly
probable and highly probed: we can use
different methods for different contexts
55. “A Bayesian Perspective on Severity” van
Dongen, Sprenger, Wagenmakers [VSW]
(Psychonomic Bulletin and Review 2022):
“As Mayo emphasizes, the Bayes factor is insensitive
to variations in the sampling protocol that affect the
error rates, i.e., optional stopping” ...
They argue that Bayesians can satisfy severity
“regardless of whether the test has been conducted in
a severe or less severe fashion”. (VSW 2022)
56. What they mean is that data can be much more
probable on hypothesis H1 than on H0
But severity in their comparative subjective Bayesian
sense does not mean H1 was well probed (in the error
statistical sense)
58. The Bayes factor proponents and
severe testers are on the same
side against
• “declarations of ‘statistical significance’ be
abandoned” (Wasserstein, Schirm & Lazar
2019).
• “whether a p-value passes any arbitrary
threshold should not be considered at all” in
interpreting data (ibid., 2)
No significance/no threshold view
59. ASA (President’s) Task Force on
Statistical Significance and
Replicability (2019-2021)
The ASA executive director’s “no threshold”
view is not ASA policy:
“P-values and significance testing, properly
applied and interpreted, are important tools
that should not be abandoned.” (Benjamini et
al. 2021)
60. Severity reformulates tests
in terms of discrepancies (effect sizes) that
are and are not severely tested
SEV(Test T, data x, claim C)
• In a nutshell: one tests several
discrepancies from a test hypothesis and
infers those well or poorly warranted
Mayo 1981-2018; Mayo and Spanos (2006); Mayo and
Cox (2006); Mayo and Hand (2022)
63. In the same way, severity avoids
the “large n” problem
• Fixing the P-value, increasing sample
size n, the cut-off gets smaller
• Large n is the basis for the Jeffreys-
Lindley paradox
64. Severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the one that goes off
with burnt toast]
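In numbers (my sketch: σ = 1, one-sided test of μ = 0, SEV as in the reformulation above): fix the result just significant at the .025 level and ask what discrepancy passes at severity .84; the warranted discrepancy shrinks as n grows:

```python
import math

for n in (25, 100, 400, 2500):
    xbar = 1.96 / math.sqrt(n)       # just-significant sample mean
    mu1 = xbar - 1.0 / math.sqrt(n)  # SEV(mu > mu1) = Phi(1) ~ 0.84
    print(f"n = {n:>4}: xbar = {xbar:.3f}, mu > {mu1:.3f} passes with severity ~0.84")
```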
66. What About Fallacies of
Non-Significant Results?
• They don’t warrant 0 discrepancy
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in larger
differences than observed, to set upper bounds
67. Confidence intervals (CIs) are also
improved
Duality between tests and intervals: values within the
(1 − α) CI are non-rejectable at the α level.
• Differentiate the warrant for claims within the
interval
• Move away from fixed confidence levels (e.g., .95)
• Provide an inferential rationale
68. We get an inferential rationale
CI Estimate:
CI-lower < μ < CI-upper
Performance rationale: Because it came from a
procedure with good coverage probability
Severe Tester:
μ > CI-lower because with high probability (.975) we
would have observed a smaller x̄ if μ ≤ CI-lower
likewise for warranting μ < CI-upper
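A minimal sketch of the severe tester’s computation (the function and numbers are illustrative, following the form in Mayo and Spanos 2006): with X̄ ~ N(μ, σ²/n), SEV(μ > μ1) = Pr(X̄ ≤ x̄_obs; μ = μ1):

```python
import math

def sev_greater(xbar, mu1, sigma, n):
    """SEV(mu > mu1): probability of a smaller observed mean, were mu = mu1."""
    z = (xbar - mu1) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

# observed xbar = 0.2, sigma = 1, n = 100; the .95 CI is about (0.004, 0.396)
for mu1 in (0.004, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu1}) = {sev_greater(0.2, mu1, 1.0, 100):.3f}")
# ~0.975 at the CI lower bound, ~0.5 at mu1 = 0.2, ~0.16 at mu1 = 0.3:
# claims within the interval are differentially warranted
```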
69. • I begin with a simple tool: the minimal
requirement for evidence
• We have evidence for C only to the
extent C has been subjected to and
passes a test it probably would have
failed if false
70. • Biasing selection effects make it easy to
find impressive-looking effects
erroneously
• They alter a method’s error probing
capacities
• They do not alter evidence (in traditional
probabilisms): Likelihood Principle (LP)
• On the LP, error probabilities consider
“imaginary data” and “intentions”
71. • To the severe tester, probabilists are
robbed of a main way to block spurious
results
• Probabilists may block inferences without
appeal to error probabilities: a high prior on
H0 (no effect) can result in a high posterior
probability on H0
• Problems: Increased flexibility, puts blame
in the wrong place, unclear how to interpret
• Gives a life-raft to the P-hacker and cherry
picker
72. • The aim of severe testing directs the
reinterpretation of significance tests and other
methods
• Severe probing (formal or informal) must take
place at every level: from data to statistical
hypothesis; to substantive claims
• A silver lining to distinguishing highly probable
and highly probed–can use different methods
for different contexts
74. References
• Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of
Statistical Inference” by Ian Hacking). British Journal for the Philosophy of Science
23(2), 123–32.
• Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality and Social
Psychology 100(3), 407-425.
• Bem, D. J., Utts, J., and Johnson, W. (2011). Must psychologists change the way they
analyze their data? Journal of Personality and Social Psychology 101(4), 716-719.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2018). Redefine statistical
significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task
force statement on statistical significance and replicability. The Annals of Applied
Statistics. https://doi.org/10.1080/09332480.2021.2003631.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6
Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical
Statistics.
• Birnbaum, A. (1970). Statistical methods in scientific inference (letter to the
Editor). Nature, 225(5237) (March 14), 1033.
75. • Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence
in the one-sided testing problem. Journal of the American Statistical Association,
82(397), 106-11.
• Cox, D. R. (2006). Principles of statistical inference. Cambridge University Press.
https://doi.org/10.1017/CBO9780511813559
• Cox, D. R., and Mayo, D. G. (2010). Objectivity and conditionality in frequentist
inference. In D. Mayo & A. Spanos (Eds.), Error and Inference: Recent Exchanges
on Experimental Reasoning, Reliability, and the Objectivity and Rationality of
Science, pp. 276–304. Cambridge: Cambridge University Press.
• Edwards, W., Lindman, H., and Savage, L. (1963). Bayesian statistical inference for
psychological research. Psychological Review, 70(3), 193-242.
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and
Boyd.
• Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–
606.
• Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes
factor. Annals of Internal Medicine, 130, 1005–1013.
• Greenland, S., Senn, S.J., Rothman, K.J. et al. (2016). Statistical tests, P values,
confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 31,
337–350. https://doi.org/10.1007/s10654-016-0149-3
76. • Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University
Press.
• Hacking, I. (1980). The theory of probable inference: Neyman, Peirce and
Braithwaite. In D. Mellor (Ed.), Science, Belief and Behavior: Essays in Honour of
R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.
• Ioannidis, J. (2005). Why most published research findings are false. PLoS
Medicine 2(8), 0696–0701.
• Lakens, D., et al. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–71.
• Lindley, D. V. (1971). The estimation of many parameters. In V. Godambe & D.
Sprott, (Eds.), Foundations of Statistical Inference pp. 435–455. Toronto: Holt,
Rinehart and Winston.
• Mayo, D. (1981). In defense of the Neyman-Pearson theory of confidence
intervals. Philosophy of Science, 48(2), 269-280.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science
and Its Conceptual Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond
the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. (2020). Significance tests: Vitiated or vindicated by the replication
crisis in psychology? Review of Philosophy and Psychology 12, 101-120.
DOI https://doi.org/10.1007/s13164-020-00501-w
77.
• Mayo, D. G. (2020). P-values on trial: Selective reporting of (best practice guides
against) selective reporting. Harvard Data Science Review 2.1.
• Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest.
Conservation Biology : The Journal of the Society for Conservation Biology, 36(1),
13861. https://doi.org/10.1111/cobi.13861.
• Mayo, D. G. and Cox, D. R. (2006). Frequentist statistics as a theory of inductive
inference. In J. Rojo, (Ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, pp. 247-275. Lecture Notes-Monograph Series, Volume 49, Institute of
Mathematical Statistics.
• Mayo, D.G., Hand, D. (2022). Statistical significance and its critics: practicing
damaging science, or damaging scientific practice? Synthese, 200, 220.
• Mayo, D. G. and Kruse, M. (2001). Principles of inference and their consequences.
In D. Corfield & J. Williamson (Eds.) Foundations of Bayesianism, pp. 381-403.
Dordrecht: Kluwer Academic Publishers.
• Mayo, D. G., and A. Spanos. (2006). Severe testing as a basic concept in a
Neyman–Pearson philosophy of induction.” British Journal for the Philosophy of
Science 57(2) (June 1), 323–357.
• Mayo, D. G., and A. Spanos (2011). Error statistics. In P. Bandyopadhyay and M.
Forster (Eds.), Philosophy of Statistics, 7, pp. 152–198. Handbook of the Philosophy
of Science. The Netherlands: Elsevier.
78.
• Morrison, D. E., and R. E. Henkel, (Eds.), (1970). The Significance Test Controversy:
A Reader. Chicago: Aldine De Gruyter.
• Neyman, J. and Pearson, E. (1933). On the problem of the most efficient tests of
statistical hypotheses. Philosophical Transactions of the Royal Society of London
Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85.
• Neyman, J. & Pearson, E. (1967). Joint statistical papers of J. Neyman and E. S.
Pearson. University of California Press.
• Open Science Collaboration (2015). Estimating the reproducibility of psychological
science. Science 349(6251), 943-51.
• Pearson, E. S. & Neyman, J. (1967). On the problem of two samples. In J. Neyman &
E.S. Pearson (Eds.) Joint Statistical Papers, pp. 99-115 (Berkeley: U. of Calif.
Press). First published 1930 in Bul. Acad. Pol. Sci. 73-96.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London:
Methuen.
• Selvin, H. (1970). A critique of tests of significance in survey research. In D. Morrison
and R. Henkel (Eds.). The Significance Test Controversy, pp. 94-106. Chicago:
Aldine De Gruyter.
• Simmons, J., Nelson, L. and Simonsohn, U. (2011). A false-positive psychology:
Undisclosed flexibility in data collection and analysis allows presenting anything as
significant. Psychological Science, 22(11), 1359-66.
79.
• Simmons, J., et al. 2012. A 21 word solution. Dialogue: The Official Newsletter of the
Society for Personality and Social Psychology 26 (2), 4–7.
• van Dongen, N., Sprenger, J. & Wagenmakers, EJ. (2022). A Bayesian perspective
on severity: Risky predictions and specific hypotheses. Psychon Bull Rev 30, 516–
533. https://doi.org/10.3758/s13423-022-02069-1
• Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p
values. Psychonomic Bulletin & Review 14(5), 779-804.
• Wagenmakers et al. (2011). Why psychologists must change the way they analyze
their data: The case of psi: Comment on Bem (2011). Journal of Personality and
Social Psychology, 100(3): 426-32.
• Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05”
(Editorial). The American Statistician 73(S1), 1–19.
https://doi.org/10.1080/00031305.2019.1583913
80. Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is
the datum of some other experiment,
and if it happens that P(x|µ) and
P(y|µ) are proportional functions of
µ (that is, constant multiples of each
other), then each of the two data x
and y have exactly the same thing to
say about the values of µ…” (Savage
1962, 17)