The Statistical Replication Crisis:
Paradoxes and Scapegoats
Deborah G Mayo
Virginia Tech & LSE
10 May 2016
American Statistical Association
(ASA): Statement on P-values
“The statistical community has been deeply
concerned about issues of reproducibility and
replicability of scientific conclusions. …. much
confusion and even doubt about the validity of
science is arising. Such doubt can lead to
radical choices such as…to ban P-values…
Misunderstanding or misuse of statistical
inference is only one cause of the
“reproducibility crisis” (Peng, 2015), but to our
community, it is an important one.” (ASA 2016)
2
Reforms without philosophy of
statistics are blind
O Replication crises (in social science and
biology) led to programs to restore credibility:
fraud busting, reproducibility studies
O Taskforces, journalistic reforms, and debunking
treatises
O Proposed methodological reforms––many
welcome (preregistration)–some quite radical
O The situation gives a practical spin to debates
in philosophy of science and statistics
3
Replication crisis in social psychology
O Diederik Stapel, the social psychologist who
fabricated his data (2011)
O Investigating Stapel revealed a culture of
verification bias; selective reporting was so
common that such moves came to be called
questionable research practices (QRPs)
O A string of high profile cases followed, as did
proposed reforms, replication research
4
“I see a train-wreck looming,” warned Daniel
Kahneman, calling for a “daisy chain” of
replication (Sept. 2012)
OSC: Reproducibility Project: Psychology:
2011-15 (Science 2015): Crowd-sourced effort to
replicate 100 articles (led by Brian Nosek, U. VA)
5
I was a philosophical observer at the
ASA P-value “pow wow”
6
“Don’t throw out the error control baby
with the bad statistics bathwater”
The American Statistician
7
O Statistical significance tests are a small
part of a rich set of “techniques for
systematically appraising and bounding the
probabilities … of seriously misleading
interpretations of data” (Birnbaum 1970, p. 1033)
O These I call error statistical methods (or
sampling theory)
8
One rock in a shifting scene
O “Birnbaum calls it the ‘one rock in a
shifting scene’ in statistical practice”
O “Misinterpretations and abuses of tests,
warned against by the very founders of
the tools, shouldn’t be the basis for
supplanting them with methods unable or
less able to assess, control, and alert us
to erroneous interpretations of data”
9
Error Statistics
O Statistics: Collection, modeling, drawing
inferences from data to claims about
aspects of processes
O The inference may be in error
O It’s qualified by a claim about the
methods’ capabilities to control and alert
us to erroneous interpretations (error
probabilities)
10
“p-value. …to test the conformity of the
particular data under analysis with H0 in
some respect:
…we find a function t = t(y) of the data,
to be called the test statistic, such that
• the larger the value of t the more
inconsistent are the data with H0;
• The random variable T = t(Y) has a
(numerically) known probability
distribution when H0 is true.
…the p-value corresponding to any t as
p = p(t) = P(T ≥ t; H0)”
(Mayo and Cox 2006, p. 81)
11
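A minimal sketch of this definition for a one-sided Normal test (Python; the data, μ0, and σ are illustrative assumptions, not from the slides):

```python
# Sketch: p = P(T >= t; H0) for a one-sided Normal test,
# with test statistic t = sqrt(n)*(ybar - mu0)/sigma.
from math import erf, sqrt

def p_value(y, mu0=0.0, sigma=1.0):
    n = len(y)
    t = sqrt(n) * (sum(y) / n - mu0) / sigma
    # Under H0, T ~ N(0,1): the larger t, the worse the conformity with H0
    return 1 - 0.5 * (1 + erf(t / sqrt(2)))

print(p_value([0.8, 1.1, 0.3, 0.9]))  # ~0.06; larger t -> smaller p
```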
O “Clearly, if even larger differences
than t occur fairly frequently under H0
(p-value is not small), there’s scarcely
evidence of incompatibility
O But a small p-value doesn’t warrant
inferring a genuine effect H, let alone a
scientific conclusion H*–as the ASA
document correctly warns (Principle 3)”
12
A paradox for significance test critics
Critic 1: It’s much too easy to get small P-values.
You: Why do they find it so difficult to replicate
the small P-values others found?*
Is it easy or is it hard?
*(Only 36 of 100 psychology experiments
yielded small P-values in the Open Science
Collaboration’s replication project in psychology)
13
O R.A. Fisher: it’s easy to lie with statistics
by selective reporting (he called it the
“political principle”)
O Sufficient finagling (helped by Big Data)—
cherry-picking, P-hacking, significance
seeking—may practically guarantee a
preferred claim C gets support, even if it’s
unwarranted by evidence
O Note: rejecting a null is taken as support
for some non-null claim C
14
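A simulation sketch of such finagling (assuming 20 independent outcomes, all governed by a true null, with only the smallest p-value reported):

```python
# Sketch: hunt across 20 null outcomes and report only the best p-value.
# A nominally significant result is then practically guaranteed.
import random
from math import erf, sqrt

def one_sided_p(z):
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # P(Z >= z), Z ~ N(0,1)

random.seed(1)
trials, n_outcomes, hits = 10_000, 20, 0
for _ in range(trials):
    best_p = min(one_sided_p(random.gauss(0, 1)) for _ in range(n_outcomes))
    hits += best_p < 0.05
print(hits / trials)  # ~0.64, far above the nominal 0.05
```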
Severity Requirement:
“Mere supporting instances are as a rule too
cheap to be worth having” (Popper 1983, 130)
O If data x0 agree with a claim C, but the test
procedure had little or no capability of finding
flaws with C (even if the claim is incorrect),
then x0 provide poor evidence for C
O Such a test fails a minimal requirement for a
stringent or severe test
O My account: severe testing based on error
statistics
15
By and large the ASA doc highlights
classic foibles
“In relation to the test of significance, we may
say that a phenomenon is experimentally
demonstrable when we know how to conduct
an experiment which will rarely fail to give us a
statistically significant result”
(Fisher 1935, 14)
(“isolated” low P-value ≠> H: statistical effect)
16
Statistical ≠> substantive (H ≠> H*)
“[A]ccording to Fisher, rejecting the null
hypothesis is not equivalent to accepting
the efficacy of the cause in question. The
latter...requires obtaining more significant
results when the experiment, or an
improvement of it, is repeated at other
laboratories or under other conditions”
(Gigerenzer et al. 1989, 95-6)
17
O Flaws in alternative H* have not been
probed by the test,
O The inference from a statistically significant
result to H* fails to pass with severity
O “Merely refuting the null hypothesis is too
weak to corroborate” substantive H*, “we
have to have ‘Popperian risk’, ‘severe test’
[as in Mayo]’” (Meehl and Waller 2002, p. 184)
18
The ASA’s Six Principles
O (1) P-values can indicate how incompatible the data are
with a specified statistical model
O (2) P-values do not measure the probability that the
studied hypothesis is true, or the probability that the data
were produced by random chance alone
O (3) Scientific conclusions and business or policy
decisions should not be based only on whether a p-value
passes a specific threshold
O (4) Proper inference requires full reporting and
transparency
O (5) A p-value, or statistical significance, does not
measure the size of an effect or the importance of a result
O (6) By itself, a p-value does not provide a good measure
of evidence regarding a model or hypothesis
19
The ASA’s Six Principles
O (1) P-values can indicate how incompatible the data are with
a specified statistical model
O (2) P-values do NOT measure the probability that the
studied hypothesis is true, or the probability that the data
were produced by random chance alone
O (3) Scientific conclusions and business or policy decisions
should NOT be based only on whether a p-value passes a
specific threshold
O (4) Proper inference requires full reporting and transparency
O (5) A p-value, or statistical significance, does NOT measure
the size of an effect or the importance of a result
O (6) By itself, a p-value does NOT provide a good measure of
evidence regarding a model or hypothesis
20
Two main views of the role of
probability in inference (not in ASA doc)
O Probabilism. To assign a degree of
probability, confirmation, support or belief
in a hypothesis, given data x0 (e.g.,
Bayesian, likelihoodist), with regard for
inner coherency
O Performance. Ensure long-run reliability
of methods, coverage probabilities
(frequentist, behavioristic Neyman-
Pearson)
21
What happened to using probability to
assess error probing capacity and
severity?
O Neither “probabilism” nor “performance”
directly captures it
O Good long-run performance is a
necessary, not a sufficient, condition for
severity
O That’s why frequentist methods can be
shown to have howlers
22
O Problems with selective reporting, cherry
picking, stopping when the data look
good, P-hacking, are not problems about
long-runs—
O It’s that we cannot say about the case at
hand that it has done a good job of
avoiding the sources of misinterpreting
data
23
A claim C is not warranted _______
O Probabilism: unless C is true or probable
(gets a probability boost, is made
comparatively firmer)
O Performance: unless it stems from a
method with low long-run error
O Probativism (severe testing) something
(a fair amount) has been done to probe
ways we can be wrong about C
24
O If you assume probabilism is required for
inference, error probabilities are relevant
for inference only by misinterpretation
False!
O I claim, error probabilities play a crucial
role in appraising well-testedness
O It’s crucial to be able to say, C is highly
believable or plausible but poorly tested
O Probabilists can allow for the distinct task
of severe testing (you can have your cake
and eat it)
25
The ASA doc gives no sense of
different tools for different jobs
O To use an eclectic toolbox in statistics,
it’s important not to expect an agreement
on numbers from methods evaluating
different things
O A p-value isn’t “invalid” because it does
not supply “the probability of the null
hypothesis, given the finding” (the
posterior probability of H0) (Trafimow and
Marks,* 2015)
*Editors of the journal Basic and Applied Social
Psychology banned P-values: don’t ask, don’t tell
26
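The disagreement on numbers can be made concrete (a sketch assuming a Normal model, H0: μ = 0, and an illustrative N(0, 1) prior on μ under H1): holding the p-value fixed, the Bayes factor in favor of H0 grows with n, the Jeffreys-Lindley effect.

```python
# Sketch: the same z = 1.96 (one-sided p ~ 0.025) yields increasingly
# strong Bayes-factor support FOR the null as n grows.
from math import exp, pi, sqrt

def normal_pdf(x, var):
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

sigma, tau2, z = 1.0, 1.0, 1.96   # tau2: illustrative prior variance
for n in (10, 100, 10_000):
    s2 = sigma**2 / n              # variance of the sample mean
    xbar = z * sqrt(s2)            # data fixed at the significance boundary
    bf01 = normal_pdf(xbar, s2) / normal_pdf(xbar, s2 + tau2)
    print(n, round(bf01, 2))       # BF01: ~0.58, ~1.50, ~14.6
```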
O Principle 2 says a p-value ≠ posterior, but one
doesn’t get a sense of its role in error
probability control
O It’s not that I’m keen to defend many common
uses of significance tests
O The criticisms are often based on
misunderstandings; consequently so are
many “reforms”
[Aside: Principle 2 says “P-values do not
measure the probability that the studied
hypothesis is true, or the probability that the
data were produced by random chance alone.”]
27
Biasing selection effects:
O One function of severity is to identify
which selection effects are problematic
(not all are)
O Biasing selection effects: when data or
hypotheses are selected, generated,
interpreted (or a test criterion is specified),
such that the minimal severity requirement
is violated, seriously altered or incapable
of being assessed
28
Nominal vs actual significance levels
“The ASA correctly warns that “[c]onducting
multiple analyses of the data and reporting
only those with certain p-values” leads to
spurious p-values (Principle 4)”
“Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at
the 5 percent level.’ ….The actual level of
significance is not 5 percent, but 64
percent!” (Selvin, 1970, p. 104)
From Morrison and Henkel’s The Significance
Test Controversy (1970!)
29
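Selvin’s figure follows from assuming twenty independent tests, each at the 5 percent level (a sketch of the arithmetic):

```python
# Sketch: chance that at least one of 20 independent tests of true
# nulls reaches the nominal 5% level.
alpha, k = 0.05, 20
actual = 1 - (1 - alpha) ** k
print(round(actual, 3))  # 0.642 -- "not 5 percent, but 64 percent"
```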
O They were clear on the fallacy: blurring
the “computed” or “nominal” significance
level, and the “actual” level
O There are many more ways you can be
wrong when hunting (different sample
space)–OK for exploratory work
O Here’s a case where a p-value report is
invalid
30
You report: such results would be difficult
to achieve under the assumption of H0,
when in fact such results are common
under the assumption of H0.
(Formally):
O You say Pr(P-value < Pobs; H0) = Pobs
(small)
O But in fact Pr(P-value < Pobs; H0) is high
31
Scapegoating
O Nowadays, we’re likely to see the tests
blamed
O My view: Tests don’t kill inferences,
people do
O Even worse are those statistical accounts
where the abuse vanishes!
32
On some views, taking account of biasing
selection effects “defies scientific sense”
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, as
they are more commonly called, data dredging
and peeking at the data. The frequentist solution
to both problems involves adjusting the P-value…
But adjusting the measure of evidence
because of considerations that have nothing
to do with the data defies scientific sense,
belies the claim of ‘objectivity’ that is often made
for the P-value” (Goodman 1999, p. 1010)
(To his credit, he’s open about this; heads the
Meta-Research Innovation Center at Stanford)
33
Technical activism isn’t free of
philosophy
Ben Goldacre (of Bad Science), in a 2016 Nature
article, is puzzled that bad statistical practices
continue even in the face of the new “technical
activism”:
“The editors at Annals of Internal Medicine,…
repeatedly (but confusedly) argue that it is
acceptable to identify “prespecified outcomes”
[from results] produced after a trial
began; ….they say that their expertise allows
them to permit — and even solicit —
undeclared outcome-switching.” (2016, 7)
34
His paper: “Make journals report
clinical trials properly”
O He shouldn’t close his eyes to the
possibility that some of the pushback he’s
seeing has a basis in statistical
philosophy!
35
Likelihood Principle (LP)
The vanishing act links to a pivotal disagreement
in the philosophy of statistics battles
In probabilisms, the import of the data is via the
ratios of likelihoods of hypotheses:
P(x0;H1)/P(x0;H0), for x0 fixed
O They condition on the actual data;
O error probabilities take into account other
outcomes that could have occurred but did not
(sampling distribution)
36
All error probabilities violate the LP
(even without selection effects):
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space.” (Lindley 1971, p. 436)
“The LP implies…the irrelevance of
predesignation, of whether a hypothesis was
thought of beforehand or was introduced to
explain known effects.” (Rosenkrantz 1977,
p. 122)
37
Another paradox enters
O ASA report: “In view of the prevalent misuses of
and misconceptions concerning p-values, some
statisticians prefer to supplement or even replace
p-values with other approaches. These
include…confidence intervals; Bayesian
methods,…likelihood ratios or Bayes Factors…”
O However, the same p-hacked hypothesis can occur
in likelihood ratios, Bayes factors, and Bayesian
updating, with one big difference:
38
Does principle 4 hold for other
approaches?
O “The direct grounds to criticize inferences
as flouting error statistical control is lost”
O An 11th hour point of controversy: whether
to retain “full reporting and transparency”
(principle 4) for all methods
O Or should it apply only to “p-values and
related statistics”
39
How might probabilists block intuitively
unwarranted inferences (without error
probabilities)?
A subjective Bayesian might say:
If our beliefs were mixed into the interpretation of
the evidence, we wouldn’t declare there’s
statistical evidence of some unbelievable claim
(distinguishing shades of grey and being
politically moderate, ovulation and voting
preferences)
40
Rescued by beliefs
O That could work in some cases (it still
wouldn’t show what researchers had done
wrong)—battle of beliefs
O Besides, researchers sincerely believe
their hypotheses
O Now you’ve got two sources of flexibility,
priors and biasing selection effects
41
No help with our most important
problem
O How to distinguish the warrant for a
single hypothesis H with different methods
(e.g., one has biasing selection effects,
another, pre-registered results and
precautions)?
42
Most Bayesians are “conventional”
O Eliciting subjective priors too difficult,
scientists reluctant to allow subjective
beliefs to overshadow data
O Default, or reference priors are supposed
to prevent prior beliefs from influencing
the posteriors (O-Bayesians, 2006)
O A classic conundrum: no general non-
informative prior exists, so most are
conventional
43
O “The priors are not to be considered expressions
of uncertainty, ignorance, or degree of belief.
Conventional priors may not even be
probabilities…” (Cox and Mayo 2010, p. 299)
O Conventional Bayesian Reforms are touted as
free of selection effects
O “Bayes factors can be used in the complete
absence of a sampling plan” (Bayarri, Benjamin,
Berger, Sellke 2016)
44
Granted, some are prepared to abandon
the LP for model testing
In an attempted meeting of the minds Andrew
Gelman and Cosma Shalizi say:
O “[C]rucial parts of Bayesian data analysis, such
as model checking, can be understood as ‘error
probes’ in Mayo’s sense,” which might be seen
as using modern statistics to implement the
Popperian criteria of severe tests (2013, p. 10)
O Fisherian tests are fine here
45
The Problem is with so-called NHST
O NHSTs encourage the move from H to H*–
supposedly allowing the move from the
statistical to the substantive
O If defined that way, they exist only as
abuses of tests
O ASA doc ignores Neyman-Pearson (N-P)
tests
46
Neyman-Pearson (N-P) tests:
Null and alternative hypotheses
H0, H1 that exhaust the parameter
space
O So the fallacy of rejection (H → H*) is
impossible
O Rejecting the null only indicates statistical
alternatives
47
P-values don’t report effect sizes
Principle 5
Who ever said to just report a P-value?
O “Tests should be accompanied by
interpretive tools that avoid the fallacies of
rejection and non-rejection. These
correctives can be articulated in either
Fisherian or Neyman-Pearson terms”
(Mayo and Cox 2006, Mayo and Spanos
2006)
48
To avoid inferring a discrepancy
beyond what’s warranted:
Large n problem.
O Severity tells us: an α-significant
difference is indicative of less of a
discrepancy from the null if it results from
a larger (n1) rather than a smaller (n2)
sample size (n1 > n2)
49
O What’s more indicative of a large effect
(fire), a fire alarm that goes off with burnt
toast or one so insensitive that it doesn’t
go off unless the house is fully ablaze?
50
[The larger sample size is like the
one that goes off with burnt toast]
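A numerical sketch of the large-n point (illustrative numbers; test T+ with μ0 = 0 and σ = 1): fix a result just at the α = 0.05 cutoff and ask what discrepancy μ1 passes with severity 0.90.

```python
# Sketch: the same just-significant result licenses a smaller
# discrepancy from the null as n grows.
from math import sqrt

mu0, sigma = 0.0, 1.0
z_alpha, z_sev = 1.645, 1.2816    # Phi(1.2816) ~ 0.90
for n in (25, 100, 400):
    se = sigma / sqrt(n)
    m0 = mu0 + z_alpha * se        # observed mean right at the cutoff
    mu1 = m0 - z_sev * se          # largest mu1 with SEV(mu > mu1) = 0.90
    print(n, round(mu1, 3))        # 0.073, 0.036, 0.018: shrinks with n
```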
What about the fallacy of
non-significant results?
This gets to my experience with the Open Science
Collaboration’s replication project in psychology
O Non-Replication occurs with non-significant
results, but there’s much confusion in
interpreting them
O Given the focus on NHST, many take
nonsignificance as uninformative
51
O Insignificant results don’t warrant a zero
discrepancy
O Use the same severity reasoning to rule out
discrepancies that very probably would have
resulted in a larger difference than observed–
set upper bounds
O “If you very probably would have observed a
more impressive (smaller) p-value than you
did, if μ > μ1 (where μ1 = μ0 + γ), then the data
are good evidence that μ ≤ μ1”
O Akin to power analysis (Cohen, Neyman) but
sensitive to x0
52
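A sketch of this upper-bounding reasoning for test T+ (H0: μ ≤ 0, σ known; the data are illustrative assumptions): SEV(μ ≤ μ1) is the probability of a more impressive difference than the one observed, were μ as large as μ1.

```python
# Sketch: severity for upper bounds after a nonsignificant result.
# SEV(mu <= mu1) = P(Xbar > xbar_obs; mu = mu1).
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n, xbar_obs = 1.0, 100, 0.10   # illustrative nonsignificant data
se = sigma / sqrt(n)
for mu1 in (0.1, 0.2, 0.3, 0.4):
    sev = 1 - Phi((xbar_obs - mu1) / se)
    print(mu1, round(sev, 3))  # 0.5, 0.841, 0.977, 0.999
# mu <= 0.3 (or 0.4) passes with high severity; mu <= 0.1 does not.
# Evaluating many mu1 values also distinguishes the evidential warrant
# of different points in an interval (next slide).
```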
Improves on confidence intervals
“This is akin to confidence intervals (which
are dual to tests) but we get around their
shortcomings:
O We do not fix a single confidence level,
O The evidential warrant for different points
in any interval is distinguished”
O Go beyond the “performance goal” to give an
inferential construal
53
Fraudbusting
O Uri Simonsohn exposed suspicious data in the
work of social psychologists Dirk Smeesters,
Lawrence Sanna (2011-12); Jens Förster (2014-
15)
O Detecting something suspicious in particular
data
O “Fabricated data detected by statistics alone”
(Simonsohn 2013)
54
P-values can’t be trusted except when
used to argue that P-values can’t be
trusted
O Paradoxically, significance test critics agree
with fraudbusting using tests they purport
to reject
O “The methodology here is not new. It goes
back to Fisher (founder of modern statistics)
in the 30’s. … The tests of goodness of fit
were again and again, too good. …using a
trick invented by Fisher for the purpose, and
well known to all statisticians.” (Richard D.
Gill)
55
OSC: Reproducibility Project: Psychology: 2011-
15 (Science 2015) (led by Brian Nosek, U. VA)
O Crowd-sourced: replicators selected 100
articles from three journals (2008) to try to
replicate using the same method as the
initial research: direct replication
O Aimed to avoid fraudbusting in closed
committees
56
Repligate: Witch hunting or good science?
O Open replications also have pushback
O Daniel Gilbert (psychology at Harvard): “so-called
replicators” are “shameless little bullies” and
“second stringers” who engage in tactics “out of
Senator Joe McCarthy’s playbook”
O Daniel Kahneman: New Rule: consult with the
original authors for “hidden secrets” that only the
original authors would know
57
Work completed in Aug 2015
O Does a negative replication mean the original
was a false positive?
O Advantages of the Replications:
–Preregistered, designed to have high power
–Free of “perverse incentives” of usual
research: guaranteed to be published
– No file drawers
58
Vindication for the P-value?
“The findings also offered some support for the
oft-criticized statistical tool known as the P-
value …a low P value was fairly predictive of
which psychology studies could be
replicated”. …(Smithsonian, 2015)
O I deny all we need is to lower the required P-
value (as some reformers claim)
O “…replication attempts find it difficult to get
small p-values with preregistered results. This
shows the problem isn’t p-values but failing to
adjust them for cherry picking, multiple testing,
etc.”
59
Highly impressive effort…but
O Fidelity questions (replications differ from
originals)–new round in Science 2016
O My blogpost: Non-significance is the new
significance (a new perverse incentive?)
O My real problem: they stick to what might
be called “purely statistical” issues: can we get
a low P-value or not?
60
Non-replications construed as
simply weaker effects
O One of the non-replications: cleanliness and
morality: Does unscrambling soap words make
you less judgmental?
“Ms. Schnall had 40 undergraduates unscramble
some words. One group unscrambled words that
suggested cleanliness (pure, immaculate, pristine),
while the other group unscrambled neutral words.
They were then presented with a number of moral
dilemmas, like whether it’s cool to eat your dog
after it gets run over by a car. …
61
…Subjects who had unscrambled clean words
weren’t as harsh on the guy who chows down on
his chow.” (Chronicle of Higher Education)
(situated cognition effect?)
O By focusing on the P-values, they ignore the
larger question of the methodological adequacy
of the leap from the statistical to the substantive.
O Are they even measuring the phenomenon they
intend? Is the result due to the “treatment”?
62
Coming to a philosophy department
near you (experimental philosophy)
2 that didn’t replicate in psychology:
O Belief in free will and cheating
O Physical distance (of points plotted)
and emotional closeness
I encourage meta-methodological scrutiny
by philosophers
63
Concluding remarks
O I end my commentary: “Failing to understand the
correct (if limited) role of simple significance tests
threatens to throw the error control baby out with
the bad statistics bathwater
O Don’t scapegoat methods based on cookbook
uses long lampooned”
O Makes no sense to banish tools for testing
assumptions, fraudbusting, replication research
64
O Recognize different roles of probability:
probabilism, long run performance,
probativism (severe testing)
O Don’t expect an agreement on numbers
from methods evaluating different things
O Probabilisms may enable rather than
block illicit inferences due to biasing
selection effects
O Main paradox of the “replication crisis”
65
Future ASA project
O Look at the “other approaches” (Bayes
factors, LRs, Bayesian updating)
O What is it for a replication to succeed or
fail on those approaches?
(can’t be just a matter of prior beliefs in
the hypotheses)
66
Error statistical improvements are
needed
O An inferential construal of error probabilities
wasn’t clearly given (Birnbaum)–my goal
O It’s not long-run error control (performance),
but severely probing flaws today
O More needs to be said about how they’ll use
background information
67
Recognize that often better
statistics cannot help
O One hypothesis must always be: our
results point to the inability of our study to
severely probe the phenomenon of
interest
O Be ready to admit questionable science–
unable to distinguish a poorly run study
from a poor hypothesis
68
2 that didn’t replicate
O Belief in free will and cheating
“It found that participants who read a passage
arguing that their behavior is predetermined were
more likely than those who had not read the
passage to cheat on a subsequent test.”
O Physical distance (of points plotted) and
emotional closeness
“Volunteers asked to plot two points that were far
apart on graph paper later reported weaker
emotional attachment to family members,
compared with subjects who had graphed points
close together.” (NYT)
69
Mayo and Cox (2010): Frequentist Principle of
Evidence (FEV); SEV: Mayo and Spanos (2006)
FEV/SEV, insignificant result: a moderate P-value
is evidence of the absence of a discrepancy δ
from H0, only if there is a high probability the
test would have given a worse fit with H0 (i.e.,
d(X) > d(x0)) were a discrepancy δ to exist
FEV/SEV, significant result: d(X) > d(x0) is
evidence of a discrepancy δ from H0, if and only
if there is a high probability the test would have
d(X) < d(x0) were a discrepancy as large as δ absent
71
Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0,
σ known
(FEV/SEV): If d(x) is not statistically significant,
then
μ ≤ M0 + kεσ/√n passes the test T+ with
severity (1 – ε)
(FEV/SEV): If d(x) is statistically significant, then
μ > M0 – kεσ/√n passes the test T+ with
severity (1 – ε),
where P(d(X) > kε) = ε
72
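A sketch implementing these two bounds (numbers illustrative; the Normal quantile kε is found by bisection to keep the sketch dependency-free):

```python
# Sketch: T+ severity bounds. With P(Z > k_eps) = eps, a nonsignificant
# result warrants mu <= M0 + k_eps*sigma/sqrt(n), a significant one
# warrants mu > M0 - k_eps*sigma/sqrt(n), each with severity 1 - eps.
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def k(eps, lo=-10.0, hi=10.0):
    # solve Phi(k) = 1 - eps by bisection
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Phi(mid) < 1 - eps else (lo, mid)
    return (lo + hi) / 2

sigma, n, M0, eps = 1.0, 100, 0.10, 0.05   # illustrative values
margin = k(eps) * sigma / sqrt(n)          # k(0.05) ~ 1.645
print("nonsignificant: mu <=", round(M0 + margin, 3))  # severity 0.95
print("significant:    mu >", round(M0 - margin, 3))   # severity 0.95
```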
References
O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical
Inference: A Discussion, edited by L. J. Savage. London: Methuen.
O Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (forthcoming). “Rejection Odds and
Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses."
Journal of Mathematical Psychology. Invited paper for special issue on “Bayesian
hypothesis testing.”
O Berger, J. O. 2003 'Could Fisher, Jeffreys and Neyman Have Agreed on Testing?'
and 'Rejoinder,', Statistical Science 18(1): 1-12; 28-32.
O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis
1 (3): 385–402.
O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
O Efron, B. 2013. 'A 250-Year Argument: Belief, Behavior, and the Bootstrap', Bulletin
of the American Mathematical Society 50(1): 126-46.
O Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T.
and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and
Robustness. New York: Academic Press.
O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and
Hall.
O Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in
Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental
Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by
Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University
Press.
73
O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
O Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian
Statistics” and “Rejoinder’” British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The
Empire of Chance. Cambridge: Cambridge University Press.
O Gilbert, D. Twitter post: https://twitter.com/dantgilbert/status/470199929626193921
O Gill, R. D. 2014. Comment on “Suspicion of Scientific Misconduct by Jens Förster” by
Neuroskeptic, May 6, 2014, Discover Magazine blog:
http://blogs.discovermagazine.com/neuroskeptic/2014/05/06/suspicion-
misconduct-forster/#.Vynr3j-scQ0.
O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.
O Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature 530(7588);
7; online 04Feb2016.
O Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes
factor,” Annals of Internal Medicine 130: 1005-1013.
O Handwerk, B. 2015. “Scientists Replicated 100 Psychology Studies, and Fewer
than Half Got the Same Results.” Smithsonian Magazine (August 27, 2015)
http://www.smithsonianmag.com/science-nature/scientists-replicated-100-
psychology-studies-and-fewer-half-got-same-results-180956426/?no-ist
O Hasselman, F. and Mayo, D. 2015, April 17. “seveRity” (R-program). Retrieved
from osf.io/k6w3h
74
O Levelt Committee, Noort Committee, Drenth Committee. 2012. 'Flawed science: The
fraudulent research practices of social psychologist Diederik Stapel', Stapel
Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by
Mr. Stapel. (https://www.commissielevelt.nl/)
O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical
Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart
and Winston.
O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its
Conceptual Foundation. Chicago: University of Chicago Press.
O Mayo, D. G. 2016. 'Don't Throw Out the Error Control Baby with the Bad Statistics
Bathwater: A Commentary', The American Statistician, online March 7, 2016.
http://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108.
O Mayo, D. G. Error Statistics Philosophy Blog: errorstatistics.com
O Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive
Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.),
Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second
Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series,
Volume 49, Institute of Mathematical Statistics, pp. 247-275.
O Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–
Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2)
(June 1): 323–357.
O Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook
of the Philosophy of Science. The Netherlands: Elsevier.
75
O Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New Statistical
Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–300.
O Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A Reader.
Chicago: Aldine De Gruyter.
O Open Science Collaboration (Nosek, B. et al.). 2015. “Estimating the Reproducibility of
Psychological Science.” Science 349(6251).
O Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical Papers
by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul.
Acad. Pol.Sci. 73-96.
O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of
Science. Dordrecht, The Netherlands: D. Reidel.
O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London:
Methuen.
O Selvin, H. 1970. “A critique of tests of significance in survey research. In The significance
test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
O Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data
Detected by Statistics Alone", Psychological Science, vol. 24, no. 10, pp. 1875-1888.
O Smithsonian Magazine (See Handwerk)
O Trafimow D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology 37(1): pp.
1-2.
O Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context, process,
and purpose”, The American Statistician 70(2): 129-133.
O Link to ASA statement & Commentaries (under supplemental):
http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108
76
Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is the
datum of some other experiment, and if it
happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have exactly
the same thing to say about the values of
µ…” (Savage 1962, p. 17)
77
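Savage’s point can be checked with the classic pair of sampling plans (numbers assumed for illustration): 9 successes and 3 failures, arising either from n = 12 fixed trials (binomial) or from sampling until the 3rd failure (negative binomial). The likelihoods are proportional in θ, so on the LP they carry the same import, even though the sampling distributions (hence error probabilities) differ.

```python
# Sketch: proportional likelihoods from two different sampling plans.
from math import comb

def lik_binomial(theta, x=9, n=12):
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def lik_negbinomial(theta, x=9, r=3):
    # stop at the r-th failure; choose x successes among first x+r-1 trials
    return comb(x + r - 1, x) * theta**x * (1 - theta)**r

for theta in (0.3, 0.5, 0.7, 0.9):
    print(theta, lik_binomial(theta) / lik_negbinomial(theta))  # always 4.0
```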
O Abstract: Mounting failures of replication in the social and
biological sciences give a practical spin to statistical foundations
in the form of the question: How can we attain reliability when Big
Data methods make illicit cherry-picking and significance seeking
so easy? Researchers, professional societies, and journals are
increasingly getting serious about methodological reforms to
restore scientific integrity – some are quite welcome (e.g., pre-
registration), while others are quite radical. The American
Statistical Association convened members from differing tribes of
frequentists, Bayesians, and likelihoodists to codify misuses of P-
values. Largely overlooked are the philosophical presuppositions
of both criticisms and proposed reforms. Paradoxically,
alternative replacement methods may enable rather than reveal illicit
inferences due to cherry-picking, multiple testing, and other
biasing selection effects. Crowd-sourced reproducibility research
in psychology is helping to change the reward structure but has
its own shortcomings. Focusing on purely statistical
considerations, it tends to overlook problems with artificial
experiments. Without a better understanding of the philosophical
issues, we can expect the latest reforms to fail.
78
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsManeerUddin
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 

Recently uploaded (20)

Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture hons
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 

"The Statistical Replication Crisis: Paradoxes and Scapegoats”

  • 1. The Statistical Replication Crisis: Paradoxes and Scapegoats Deborah G Mayo Virginia Tech & LSE 10 May 2016
  • 2. American Statistical Society (ASA):Statement on P-values “The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as…to ban P-values… Misunderstanding or misuse of statistical inference is only one cause of the “reproducibility crisis” (Peng, 2015), but to our community, it is an important one.” (ASA 2016) 2
  • 3. Reforms without philosophy of statistics are blind O Replication crises (in social science and biology) led to programs to restore credibility: fraud busting, reproducibility studies O Taskforces, journalistic reforms, and debunking treatises O Proposed methodological reforms––many welcome (preregistration)–some quite radical O The situation gives a practical spin to debates in philosophy of science and statistics 3
  • 4. Replication crisis in social psychology O Diederik Stapel, the social psychologist who fabricated his data (2011) O Investigating Stapel revealed a culture of verification bias; selective reporting so common; they began calling them questionable research practices (QRPs) O A string of high profile cases followed, as did proposed reforms, replication research 4
  • 5. “I see a train-wreck looming,” Daniel Kahneman, calls for a “daisy chain” of replication in Sept. 2012 OSC: Reproducibility Project: Psychology: 2011-15 (Science 2015): Crowd-sourced effort to replicate 100 articles (Led by Brian Nozek, U. VA) 5
  • 6. I was a philosophical observer at the ASA P-value “pow wow” 6
  • 7. “Don’t throw out the error control baby with the bad statistics bathwater” The American Statistician 7
  • 8. O “Statistical significance tests are a small part of a rich set of: “techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033) O These I call error statistical methods (or sampling theory)”. 8
  • 9. One rock in a shifting scene O “Birnbaum calls it the ‘one rock in a shifting scene’ in statistical practice” O “Misinterpretations and abuses of tests, warned against by the very founders of the tools, shouldn’t be the basis for supplanting them with methods unable or less able to assess, control, and alert us to erroneous interpretations of data” 9
  • 10. Error Statistics O Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes O The inference may be in error O It’s qualified by a claim about the methods’ capabilities to control and alert us to erroneous interpretations (error probabilities) 10
  • 11. “p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function t = t(y) of the data, to be called the test statistic, such that • the larger the value of t the more inconsistent are the data with H0; • The random variable T = t(Y) has a (numerically) known probability distribution when H0 is true. …the p-value corresponding to any t as p = p(t) = P(T ≥ t; H0)” (Mayo and Cox 2006, p. 81) 11
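A minimal sketch in Python of the recipe on this slide, p = p(t) = P(T ≥ t; H0), for a one-sided Normal (z) test; the data, μ0 = 0, σ = 1, and n = 50 are my own illustrative assumptions, not from the slides.

```python
# Sketch of p = p(t) = P(T >= t; H0) for a one-sided z-test of H0: mu = mu0,
# sigma known. Illustrative assumptions: mu0 = 0, sigma = 1, n = 50.
import numpy as np
from scipy import stats

def z_test_p_value(y, mu0=0.0, sigma=1.0):
    """Return the test statistic t(y) and its p-value under H0."""
    n = len(y)
    t = (np.mean(y) - mu0) / (sigma / np.sqrt(n))  # larger t = more inconsistent with H0
    return t, stats.norm.sf(t)                     # P(T >= t; H0); T ~ N(0,1) under H0

rng = np.random.default_rng(1)
t_obs, p = z_test_p_value(rng.normal(0.2, 1.0, size=50))
print(f"t = {t_obs:.2f}, p = {p:.3f}")
```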
  • 12. O “Clearly, if even larger differences than t occur fairly frequently under H0 (p-value is not small), there’s scarcely evidence of incompatibility O But a small p-value doesn’t warrant inferring a genuine effect H, let alone a scientific conclusion H*–as the ASA document correctly warns (Principle 3)” 12
  • 13. A paradox for significance test critics Critic 1: It’s much too easy to get small P-values. You: Why do they find it so difficult to replicate the small P-values others found?* Is it easy or is it hard? *(Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration replication project in psychology) 13
  • 14. O R.A. Fisher: it’s easy to lie with statistics by selective reporting (he called it the “political principle”) O Sufficient finagling (helped by Big Data)—cherry-picking, P-hacking, significance seeking—may practically guarantee a preferred claim C gets support, even if it’s unwarranted by evidence O Note: rejecting a null is taken as support for some non-null claim C 14
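To see how cheaply such "support" comes, here is a hedged simulation; the setup (20 independent outcome measures, all true nulls, report only the best one) and all numbers are my own illustration, not the deck's.

```python
# Cherry-picking simulation: 20 independent outcome measures, every null true;
# report only the smallest P-value. Illustrative numbers, not from the slides.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k, alpha, trials = 30, 20, 0.05, 10_000

hits = 0
for _ in range(trials):
    zs = rng.normal(0.0, 1.0, size=(k, n)).mean(axis=1) * np.sqrt(n)  # k null z-statistics
    hits += stats.norm.sf(zs).min() < alpha                           # report only the best p

print(f"P(some outcome reaches nominal 0.05 | all nulls true) ≈ {hits / trials:.2f}")
```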
  • 15. Severity Requirement: “Mere supporting instances are as a rule too cheap to be worth having” (Popper 1983, 130) O If data x0 agree with a claim C, but the test procedure had little or no capability of finding flaws with C (even if the claim is incorrect), then x0 provide poor evidence for C O Such a test fails a minimal requirement for a stringent or severe test O My account: severe testing based on error statistics 15
  • 16. By and large the ASA doc highlights classic foibles “In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (Fisher 1935, 14) (“isolated” low P-value ≠> H: statistical effect) 16
  • 17. Statistical ≠> substantive (H ≠> H*) “[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions” (Gigerenzer et al. 1989, 95-6) 17
  • 18. O Flaws in alternative H* have not been probed by the test, O The inference from a statistically significant result to H* fails to pass with severity O “Merely refuting the null hypothesis is too weak to corroborate” substantive H*, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo]” (Meehl and Waller 2002, 184) 18
  • 19. The ASA’s Six Principles O (1) P-values can indicate how incompatible the data are with a specified statistical model O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold O (4) Proper inference requires full reporting and transparency O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis 19
  • 20. The ASA’s Six Principles O (1) P-values can indicate how incompatible the data are with a specified statistical model O (2) P-values do NOT measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone O (3) Scientific conclusions and business or policy decisions should NOT be based only on whether a p-value passes a specific threshold O (4) Proper inference requires full reporting and transparency O (5) A p-value, or statistical significance, does NOT measure the size of an effect or the importance of a result O (6) By itself, a p-value does NOT provide a good measure of evidence regarding a model or hypothesis 20
  • 21. Two main views of the role of probability in inference (not in ASA doc) O Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0. (e.g., Bayesian, likelihoodist)—with regard for inner coherency O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson) 21
  • 22. What happened to using probability to assess error probing capacity and severity? O Neither “probabilism” nor “performance” directly captures it O Good long-run performance is a necessary, not a sufficient, condition for severity O That’s why frequentist methods can be shown to have howlers 22
  • 23. O Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs— O It’s that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data 23
  • 24. A claim C is not warranted _______ O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer) O Performance: unless it stems from a method with low long-run error O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C 24
  • 25. O If you assume probabilism is required for inference, error probabilities are relevant for inference only by misinterpretation. False! O I claim, error probabilities play a crucial role in appraising well-testedness O It’s crucial to be able to say, C is highly believable or plausible but poorly tested O Probabilists can allow for the distinct task of severe testing (you can have your cake and eat it) 25
  • 26. The ASA doc gives no sense of different tools for different jobs O “To use an eclectic toolbox in statistics, it’s important not to expect an agreement on numbers from methods evaluating different things O A p-value isn’t ‘invalid’ because it does not supply “the probability of the null hypothesis, given the finding” (the posterior probability of H0) (Trafimow and Marks,* 2015) *Editors of the journal Basic and Applied Social Psychology, which banned P-values: don’t ask, don’t tell 26
  • 27. O Principle 2 says a p-value ≠ posterior but one doesn’t get the sense of its role in error probability control O It’s not that I’m keen to defend many common uses of significance tests O The criticisms are often based on misunderstandings; consequently so are many “reforms” [Aside: Principle 2 says “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”] 27
  • 28. Biasing selection effects: O One function of severity is to identify which selection effects are problematic (not all are) O Biasing selection effects: when data or hypotheses are selected, generated, interpreted (or a test criterion is specified), such that the minimal severity requirement is violated, seriously altered or incapable of being assessed 28
  • 29. Nominal vs actual significance levels “The ASA correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4)” “Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent!” (Selvin, 1970, p. 104) From Morrison and Henkel’s The Significance Test Controversy (1970)! 29
  • 30. O They were clear on the fallacy: blurring the “computed” or “nominal” significance level, and the “actual” level O There are many more ways you can be wrong with hunting (different sample space)–ok for exploratory O Here’s a case where a p-value report is invalid 30
  • 31. You report: Such results would be difficult to achieve under the assumption of H0 When in fact such results are common under the assumption of H0 (Formally): O You say Pr(P-value < Pobs; H0) = Pobs (small) O But in fact Pr(P-value < Pobs; H0) = high 31
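Selvin's 64% is just the multiplicity arithmetic. A one-line check, assuming (my simplification) 20 independent tests each at the nominal 5% level:

```python
# Actual vs nominal significance when hunting through 20 independent tests.
nominal, k = 0.05, 20
actual = 1 - (1 - nominal) ** k          # P(at least one nominal "hit"; H0 true)
print(f"nominal: {nominal:.0%}, actual: {actual:.0%}")   # 5% vs ~64%
```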
  • 32. Scapegoating O Nowadays, we’re likely to see the tests blamed O My view: Tests don’t kill inferences, people do O Even worse are those statistical accounts where the abuse vanishes! 32
  • 33. On some views, taking account of biasing selection effects “defies scientific sense” “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010) (To his credit, he’s open about this; he heads the Meta-Research Innovation Center at Stanford) 33
  • 34. Technical activism isn’t free of philosophy Ben Goldacre (of Bad Science) in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new "technical activism”: “The editors at Annals of Internal Medicine,… repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” [from results] produced after a trial began; ….they say that their expertise allows them to permit — and even solicit — undeclared outcome-switching.” (2016, 7) 34
  • 35. His paper: “Make journals report clinical trials properly” O He shouldn’t close his eyes to the possibility that some of the pushback he’s seeing has a basis in statistical philosophy! 35
  • 36. Likelihood Principle (LP) The vanishing act links to a pivotal disagreement in the philosophy of statistics battles In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses P(x0;H1)/P(x0;H0) for x0 fixed O “They condition on the actual data, O error probabilities take into account other outcomes that could have occurred but did not (sampling distribution)” 36
  • 37. All error probabilities violate the LP (even without selection effects): “Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436) “The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of before hand or was introduced to explain known effects.” (Rosenkrantz, 1977, p. 122) 37
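A standard illustration of the sample-space point (my example, not the deck's): 3 successes in 12 Bernoulli trials gives the same likelihood function whether n = 12 was fixed in advance (binomial) or sampling stopped at the 3rd success (negative binomial), yet the p-values for H0: θ = 0.5 vs θ < 0.5 differ.

```python
# Two experiments, proportional likelihoods (both ∝ theta^3 (1-theta)^9),
# different sampling distributions, different p-values.
from scipy import stats

p_binom = stats.binom.cdf(3, 12, 0.5)    # n fixed at 12: P(X <= 3; H0) ≈ 0.073
p_negbin = stats.nbinom.sf(8, 3, 0.5)    # stop at 3rd success: P(failures >= 9; H0) ≈ 0.033

print(f"binomial p = {p_binom:.3f}, negative binomial p = {p_negbin:.3f}")
# The LP says the evidence about theta is identical; error probabilities disagree.
```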
  • 38. Another paradox enters O ASA report: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. These include…confidence intervals; Bayesian methods,…likelihood ratios or Bayes Factors…” O However, the same p-hacked hypothesis can occur in “likelihood ratios, Bayes factors and Bayesian updating with one big difference: 38
  • 39. Does principle 4 hold for other approaches? O “The direct grounds to criticize inferences as flouting error statistical control is lost” O An 11th hour point of controversy: whether to retain “full reporting and transparency” (principle 4) for all methods O Or should it apply only to “p-values and related statistics” 39
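A hedged illustration of how the grounds "vanish" (the hunting setup is mine): pick the largest of 20 null z-scores, then report the likelihood ratio for the best-supported alternative; nothing in the ratio registers the search.

```python
# Likelihood ratios are blind to hunting: cherry-pick the largest of 20 null
# z-scores, then compare H1: mu = z_best against H0: mu = 0 for that one score.
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(0.0, 1.0, size=20)     # 20 outcomes; every null is true
z_best = float(np.abs(z).max())       # the cherry-picked "finding"

lr = np.exp(z_best ** 2 / 2)          # LR for x ~ N(mu, 1): L(z_best) / L(0)
print(f"picked |z| = {z_best:.2f}, LR(H1:H0) = {lr:.0f}")
# The sampling distribution of max|z| registers the search; the LR does not.
```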
  • 40. How might probabilists block intuitively unwarranted inferences (without error probabilities)? A subjective Bayesian might say: If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences) 40
  • 41. Rescued by beliefs O That could work in some cases (it still wouldn’t show what researchers had done wrong)—battle of beliefs O Besides, researchers sincerely believe their hypotheses O Now you’ve got two sources of flexibility, priors and biasing selection effects 41
  • 42. No help with our most important problem O How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, pre-registered results and precautions)? 42
  • 43. Most Bayesians are “conventional” O Eliciting subjective priors too difficult, scientists reluctant to allow subjective beliefs to overshadow data O Default, or reference priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006) O A classic conundrum: no general non-informative prior exists, so most are conventional 43
  • 44. O “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, p. 299) O Conventional Bayesian Reforms are touted as free of selection effects O “Bayes factors can be used in the complete absence of a sampling plan” (Bayarri, Benjamin, Berger, Sellke 2016) 44
  • 45. Granted, some are prepared to abandon the LP for model testing In an attempted meeting of the minds Andrew Gelman and Cosma Shalizi say: O “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”, which might be seen as using modern statistics to implement the Popperian criteria of severe tests (2013, 10). O Fisherian tests are fine here 45
  • 46. The Problem is with so-called NHST O NHSTs encourage the move from H to H*–they supposedly allow moving from the statistical to the substantive O If defined that way, they exist only as abuses of tests O ASA doc ignores Neyman-Pearson (N-P) tests 46
  • 47. Neyman-Pearson (N-P) tests: Null and alternative hypotheses H0, H1 that exhaust the parameter space O So the fallacy of rejection H → H* is impossible O Rejecting the null only indicates statistical alternatives 47
  • 48. P-values don’t report effect sizes Principle 5 Who ever said to just report a P-value? O “Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms” (Mayo and Cox 2006, Mayo and Spanos 2006) 48
  • 49. To avoid inferring a discrepancy beyond what’s warranted: Large n problem. O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from larger (n1) rather than a smaller (n2) sample size (n1 > n2) 49
  • 50. O What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one so insensitive that it doesn’t go off unless the house is fully ablaze? 50 [The larger sample size is like the one that goes off with burnt toast]
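In code, the burnt-toast point (a sketch under my assumptions: test T+ of H0: μ ≤ 0 with σ = 1 known, result just significant at α = 0.05): the discrepancy warranted with severity 0.9 shrinks as 1/√n.

```python
# Severity for a just-significant result: SEV(mu > mu1) = P(Xbar < xbar_obs; mu1).
# Assumptions (mine): T+ with H0: mu <= 0, sigma = 1 known, alpha = 0.05.
from scipy import stats

z_alpha = stats.norm.ppf(0.95)        # ≈ 1.645, the just-significant cutoff
z_sev = stats.norm.ppf(0.90)          # ≈ 1.282, for severity 0.9

for n in (100, 10_000):
    se = 1 / n ** 0.5
    xbar = z_alpha * se               # observed mean exactly at the cutoff
    mu1 = xbar - z_sev * se           # largest mu1 with SEV(mu > mu1) = 0.9
    print(f"n = {n:>6}: mu > {mu1:.4f} warranted with severity 0.9")
# n = 10,000 warrants a ~10x smaller discrepancy: the burnt-toast alarm.
```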
  • 51. What about the fallacy of non-significant results? This gets to my experience with the open source replication project in psychology O Non-Replication occurs with non-significant results, but there’s much confusion in interpreting them O Given the focus on NHST – many take nonsignificance as uninformative 51
  • 52. O Insignificant results don’t warrant 0 discrepancy O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed: set upper bounds O “If you very probably would have observed a more impressive (smaller) p-value than you did, if μ > μ1 (where μ1 = μ0 + γ), then the data are good evidence that μ ≤ μ1” O Akin to power analysis (Cohen, Neyman) but sensitive to x0 52
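A sketch of the upper-bound rule; the setup and numbers are mine (H0: μ ≤ 0, σ = 1 known, n = 100, observed mean 0.10, so p ≈ 0.16 and the result is not significant):

```python
# Severity for a non-significant result: SEV(mu <= mu1) = P(Xbar > xbar_obs; mu1),
# the probability of a "more impressive" difference had mu exceeded mu1.
from scipy import stats

n, sigma, xbar = 100, 1.0, 0.10       # z = 1.0, p ≈ 0.16: not significant
se = sigma / n ** 0.5

for mu1 in (0.15, 0.25, 0.35):
    sev = stats.norm.cdf((mu1 - xbar) / se)
    print(f"SEV(mu <= {mu1:.2f}) = {sev:.2f}")
# 0.69, 0.93, 0.99: the data warrant mu <= 0.35 quite well, mu <= 0.15 poorly.
```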
  • 53. Improves on confidence intervals “This is akin to confidence intervals (which are dual to tests) but we get around their shortcomings: O We do not fix a single confidence level, O The evidential warrant for different points in any interval is distinguished” O Go beyond the “performance goal” to give an inferential construal 53
  • 54. Fraudbusting O Uri Simonsohn exposed suspicious data in the work of social psychologists Dirk Smeesters, Lawrence Sanna (2011-12); Jens Förster (2014-15) O Detecting something suspicious in particular data O “Fabricated data detected by statistics alone” (Simonsohn 2013) 54
  • 55. P-values can’t be trusted except when used to argue that P-values can’t be trusted O Paradoxically, significance test critics agree with fraudbusting using tests they purport to reject O “The methodology here is not new. It goes back to Fisher (founder of modern statistics) in the 30’s. … The tests of goodness of fit were again and again, too good. …using a trick invented by Fisher for the purpose, and well known to all statisticians.” (Richard D. Gill) 55
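One version of the Fisherian "trick" the Gill quote alludes to (my reconstruction under standard assumptions, not Gill's actual analysis): honestly generated goodness-of-fit p-values are uniform on (0,1), so fits that are repeatedly too good are themselves testably improbable.

```python
# Too-good-to-be-true check: under honest sampling, -2 * sum(log(1 - p_i))
# ~ chi-square with 2k df, so p-values piled up near 1 yield a tiny tail area.
import numpy as np
from scipy import stats

reported_p = np.array([0.92, 0.88, 0.95, 0.97, 0.90, 0.93])  # suspiciously good fits
stat = -2 * np.log1p(-reported_p).sum()
p_too_good = stats.chi2.sf(stat, df=2 * len(reported_p))
print(f"P(fits this good by chance) = {p_too_good:.5f}")
```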
  • 56. OSC: Reproducibility Project: Psychology: 2011-15 (Science 2015) (led by Brian Nosek, U. VA) O Crowd-sourced: Replicators select 100 articles from three journals (2008) to try to replicate using the same method as the initial research: direct replication O Aimed to avoid fraudbusting in closed committees 56
  • 57. Repligate: Witch hunting or good science? O Open replications also have pushback O Daniel Gilbert (psychology, Harvard): “so-called replicators” are “shameless little bullies” and “second stringers” who engage in tactics “out of Senator Joe McCarthy’s playbook” O Daniel Kahneman: New Rule: consult with the original authors for “hidden secrets” that only the original authors would know 57
  • 58. Work completed in Aug 2015 O Does a negative replication mean the original was a false positive? O Advantages of the Replications: – Preregistered, designed to have high power – Free of “perverse incentives” of usual research: guaranteed to be published – No file drawers 58
  • 59. Vindication for the P-value? “The findings also offered some support for the oft-criticized statistical tool known as the P-value …a low P value was fairly predictive of which psychology studies could be replicated”. …(Smithsonian, 2015) O I deny all we need is to lower the required P-value (as some reformers claim) O “…replication attempts find it difficult to get small p-values with preregistered results. This shows the problem isn’t p-values but failing to adjust them for cherry picking, multiple testing, etc.” 59
  • 60. Highly impressive effort…but O Fidelity questions (replications differ from originals) – new round in Science 2016 O My blogpost: Non-significance is the new significance (a new perverse incentive?) O My real problem…. They stick to what might be called “purely statistical” issues: can we get a low P-value or not? 60
  • 61. Non-replications construed as simply weaker effects O One of the non-replications: cleanliness and morality: Does unscrambling soap words make you less judgmental? “Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. … 61
  • 62. …Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Chronicle of Higher Education) (situated cognition effect?) O By focusing on the P-values, they ignore the larger question of the methodological adequacy of the leap from the statistical to the substantive. O Are they even measuring the phenomenon they intend? Is the result due to the “treatment”? 62
  • 63. Coming to a philosophy department near you (experimental philosophy) 2 that didn’t replicate in psychology: O Belief in free will and cheating O Physical distance (of points plotted) and emotional closeness I encourage meta-methodological scrutiny by philosophers 63
  • 64. Concluding remarks O I end my commentary: “Failing to understand the correct (if limited) role of simple significance tests threatens to throw the error control baby out with the bad statistics bathwater O Don’t scapegoat methods based on cookbook uses long lampooned” O Makes no sense to banish tools for testing assumptions, fraudbusting, replication research 64
  • 65. O Recognize different roles of probability: probabilism, long run performance, probativism (severe testing) O Don’t expect an agreement on numbers from methods evaluating different things O Probabilisms may enable rather than block illicit inferences due to biasing selection effects O Main paradox of the “replication crisis” 65
  • 66. Future ASA project O Look at the “other approaches” (Bayes factors, LRs, Bayesian updating) O What is it for a replication to succeed or fail on those approaches? (can’t be just a matter of prior beliefs in the hypotheses) 66
  • 67. Error statistical improvements are needed O An inferential construal of error probabilities wasn’t clearly given (Birnbaum) – my goal O It’s not long-run error control (performance), but severely probing flaws today O Needs to say more about how they’ll use background information 67
  • 68. Recognize that often better statistics cannot help O One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest O Be ready to admit questionable science – unable to distinguish a poorly run study from a poor hypothesis 68
  • 69. 2 that didn’t replicate O Belief in free will and cheating “It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.” O Physical distance (of points plotted) and emotional closeness “Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.” (NYT) 69
  • 70.
  • 71. Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006) FEV/SEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist FEV/SEV (significant result): d(X) > d(x0) is evidence of discrepancy δ from H0, if and only if there is a high probability the test would have d(X) < d(x0) were a discrepancy as large as δ absent 71
  • 72. Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, σ known (FEV/SEV): If d(x) is not statistically significant, then μ < M0 + kεσ/√n passes the test T+ with severity (1 – ε) (FEV/SEV): If d(x) is statistically significant, then μ > M0 – kεσ/√n passes the test T+ with severity (1 – ε), where M0 is the observed sample mean and P(d(X) > kε) = ε 72
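The references list an R program for this ("seveRity", Hasselman and Mayo); as a rough Python analogue (mine, not that code), slide 72's two bounds are:

```python
# Severity bounds for test T+ (H0: mu <= mu0, sigma known), per slide 72:
# significant d(x):     mu > M0 - k_eps * sigma/sqrt(n) passes with severity 1 - eps
# non-significant d(x): mu < M0 + k_eps * sigma/sqrt(n) passes with severity 1 - eps
from scipy import stats

def severity_bound(M0, sigma, n, eps, significant):
    """M0 is the observed sample mean; k_eps satisfies P(d(X) > k_eps; H0) = eps."""
    k_eps = stats.norm.ppf(1 - eps)
    se = sigma / n ** 0.5
    return (M0 - k_eps * se, ">") if significant else (M0 + k_eps * se, "<")

bound, sign = severity_bound(M0=0.2, sigma=1.0, n=100, eps=0.1, significant=True)
print(f"mu {sign} {bound:.3f} passes T+ with severity {1 - 0.1}")
```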
  • 73. References O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen. O Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (forthcoming). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology. Invited paper for special issue on “Bayesian hypothesis testing.” O Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder.” Statistical Science 18(1): 1-12; 28-32. O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385–402. O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225(5237) (March 14): 1033. O Box, G. 1983. “An Apology for Ecumenism in Statistics.” In Scientific Inference, Data Analysis, and Robustness, edited by G. E. P. Box, T. Leonard, and D. F. J. Wu, 51-84. New York: Academic Press. O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall. O Cox, D. R., and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276–304. Cambridge: Cambridge University Press. O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap.” Bulletin of the American Mathematical Society 50(1): 126-46. 73
  • 74. O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd. O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17(1) (January 1): 69–78. O Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder.” British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80. O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press. O Gilbert, D. Twitter post: https://twitter.com/dantgilbert/status/470199929626193921 O Gill, R. D. 2014. Comment on “Suspicion of Scientific Misconduct by Jens Förster” by Neuroskeptic, May 6, 2014, Discover Magazine blog: http://blogs.discovermagazine.com/neuroskeptic/2014/05/06/suspicion-misconduct-forster/#.Vynr3j-scQ0. O Goldacre, B. 2008. Bad Science. HarperCollins Publishers. O Goldacre, B. 2016. “Make Journals Report Clinical Trials Properly.” Nature 530(7588): 7; online 04 Feb 2016. O Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor.” Annals of Internal Medicine 130: 1005–1013. O Handwerk, B. 2015. “Scientists Replicated 100 Psychology Studies, and Fewer than Half Got the Same Results.” Smithsonian Magazine (August 27, 2015) http://www.smithsonianmag.com/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/?no-ist O Hasselman, F. and Mayo, D. 2015, April 17. “seveRity” (R program). Retrieved from osf.io/k6w3h 74
  • 75. O Levelt Committee, Noort Committee, Drenth Committee. 2012. “Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel.” Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. (https://www.commissielevelt.nl/) O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston. O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press. O Mayo, D. G. 2016. “Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary.” The American Statistician, online March 7, 2016. http://www.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108. O Mayo, D. G. Error Statistics Philosophy Blog: errorstatistics.com O Mayo, D. G. and Cox, D. R. 2010. “Frequentist Statistics as a Theory of Inductive Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. First appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275. O Mayo, D. G., and Spanos, A. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57(2) (June 1): 323–357. O Mayo, D. G., and Spanos, A. 2011. “Error Statistics.” In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier. 75
  • 76. O Meehl, P. E., and Waller, N. G. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7(3): 283–300. O Morrison, D. E., and Henkel, R. E., eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. O Open Science Collaboration (Nosek, B. A., et al.). 2015. “Estimating the Reproducibility of Psychological Science.” Science 349(6251). O Pearson, E. S. and Neyman, J. 1930. “On the Problem of Two Samples.” In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99-115. Berkeley: University of California Press. First published in Bul. Acad. Pol. Sci., 73-96. O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel. O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen. O Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter. O Simonsohn, U. 2013. “Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone.” Psychological Science 24(10): 1875-1888. O Smithsonian Magazine (see Handwerk). O Trafimow, D. and Marks, M. 2015. “Editorial.” Basic and Applied Social Psychology 37(1): 1-2. O Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process, and Purpose.” The American Statistician. Link to ASA statement and commentaries (under supplemental): http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108 76
  • 77. Jimmy Savage on the LP: “According to Bayes' theorem,…. if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…” (Savage 1962, p. 17) 77
  • 78. O Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail. 78