Replication Crises and the
Statistics Wars: Hidden
Controversies
Deborah G. Mayo
June 14, 2018
Reproducibility and Replicability
in Psychology and Experimental Philosophy
• High-profile failures of replication have
brought forth reams of reforms
• How should the integrity of science be
restored?
• Experts do not agree.
• We need to pull back the curtain on why.
2
Hidden Controversies
1. The Statistics Wars: Age-old abuses, fallacies of
statistics (Bayesian-frequentist, Fisher-N-P debates)
simmer below today's debates; reformulations of
frequentist methods are ignored
2. Replication Paradoxes: While a major source of
handwringing stems from biasing selection effects
(cherry picking, data dredging, trying and trying again),
some reforms and alternative methods conflict with
the needed error control
3. Underlying assumptions and their violations
• A) statistical model assumptions
• B) links from experiments, measurements &
statistical inferences to substantive questions
3
Significance Tests
• The most used methods are the most criticized
• Statistical significance tests are a small part of a
rich set of:
“techniques for systematically appraising and
bounding the probabilities … of seriously
misleading interpretations of data [bounding
error probabilities]” (Birnbaum 1970, 1033)
• These I call error statistical methods (or sampling
theory).
4
“p-value. …to test the conformity of the
particular data under analysis with H0 in
some respect:
…we find a function T = t(y) of the data, to
be called the test statistic, such that
• the larger the value of T the more
inconsistent are the data with H0;
• The random variable T = t(Y) has a
(numerically) known probability
distribution when H0 is true.
…the p-value corresponding to any t_obs as
p = p(t_obs) = Pr(T ≥ t_obs; H0)”
(Mayo and Cox 2006, 81)
5
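A minimal sketch of the definition just quoted, for a one-sided z-test with known σ (the values of μ0, σ, n, and the simulated data below are hypothetical illustrations, not from the talk):

```python
# Computing p = Pr(T >= t_obs; H0) for the one-sided z-test
# T = sqrt(n)(M - mu0)/sigma, which is N(0,1) under H0.
import numpy as np
from scipy.stats import norm

mu0, sigma, n = 0.0, 1.0, 100                       # illustrative values
y = np.random.default_rng(1).normal(0.2, sigma, n)  # hypothetical data

t_obs = np.sqrt(n) * (y.mean() - mu0) / sigma       # test statistic t(y)
p = norm.sf(t_obs)                                  # Pr(T >= t_obs; H0)
print(f"t_obs = {t_obs:.2f}, p = {p:.4f}")
```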
Testing Reasoning
• If even larger differences than t_obs occur fairly
frequently under H0 (i.e., P-value is not small),
there’s scarcely evidence of
incompatibility with H0
• Small P-value indicates some underlying
discrepancy from H0 because very probably you
would have seen a less impressive difference
than t_obs were H0 true.
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy: H => H*
6
Neyman-Pearson (N-P) tests:
A null and an alternative hypothesis H0, H1
that are exhaustive
H0: μ ≤ 0 vs. H1: μ > 0
• So this fallacy of rejection (H1 => H*) is impossible
• Rejecting H0 only indicates statistical alternatives
H1 (how discrepant from null)
As opposed to NHST
7
We must get beyond the N-P/Fisher
incompatibilist caricature
• Beyond “inconsistent hybrid” (Gigerenzer 2004, 590):
Fisher: inferential; N-P: long-run performance
• Both N-P & F offer techniques to assess and control
error probabilities
• “Incompatibilism” robs P-value users of
features from N-P tests that they need (e.g., power)
• Wrongly supposes N-P testers have only fixed error
probabilities & are not allowed to report P-values
8
Replication Paradox
(for critics of N-P & F Significance Tests)
Critic: It’s much too easy to get a small P-
value
You: Why do they find it so difficult to
replicate the small P-values others found?
Is it easy or is it hard?
9
Both Fisher and N-P methods: it’s easy to lie
with statistics by selective reporting
• Sufficient finagling—cherry-picking, P-
hacking, significance seeking, multiple
testing, the look-elsewhere effect—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
10
Severity Requirement:
Everyone agrees: If the test had little or no
capability of finding flaws with H (even if H is
incorrect), then agreement between data x0 and H
provides poor (or no) evidence for H
(“too cheap to be worth having” Popper 1983, 30)
• Such a test fails a minimal requirement for a
stringent or severe test
• Their new idea was using probability to reflect
this capability
• They may not have put it explicitly in these
terms but my account does (requires
reinterpreting tests)
11
This alters the role of probability: typically just two roles are recognized
• Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis,
given data x0
(e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of
methods, coverage probabilities (frequentist,
behavioristic Neyman-Pearson, Fisher (at times))
12
What happened to using probability to assess
error probing capacity?
• Neither “probabilism” nor “performance” directly
captures it
• Good long-run performance is a necessary, not
a sufficient, condition for severity
• Need to explicitly reinterpret tests according to
how well or poorly discrepancies are indicated
13
Key to revising the role of error
probabilities
• Problems with selective reporting, cherry
picking, stopping when the data look
good, P-hacking, are not problems about
long-runs—
• It’s that we cannot say the case at hand
has done a good job of avoiding the
sources of misinterpreting data
14
A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets
a probability boost, is made comparatively
firmer)
• Performance: unless it stems from a method
with low long-run error
• Probativism (severe testing): unless
something (a fair amount) has been done to
probe ways we can be wrong about C
15
Biasing selection effects:
One function of severity is to identify problematic
selection effects (not all selection effects are problematic)
• Biasing selection effects: when data or
hypotheses are selected or generated (or a test
criterion is specified), in such a way that the
minimal severity requirement is violated,
seriously altered or incapable of being
assessed
16
Nominal vs. actual significance levels
Suppose that twenty sets of differences have
been examined, that one difference seems large
enough to test and that this difference turns out
to be ‘significant at the 5 percent level.’ … The
actual level of significance is not 5 percent,
but 64 percent! (Selvin 1970, 104)
From Morrison and Henkel’s The Significance Test
Controversy (1970!)
17
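Selvin's 64 percent can be reproduced directly, assuming (for illustration) that the twenty differences are independent and each is tested at the 0.05 level:

```python
# Probability that at least one of k independent tests, each run at
# level alpha, rejects when every null is true: 1 - (1 - alpha)^k.
alpha, k = 0.05, 20
actual = 1 - (1 - alpha) ** k
print(f"nominal level: {alpha:.2f}, actual level: {actual:.2f}")  # 0.05 vs ~0.64
```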
Spurious P-Value
You report: Such results would be difficult to
achieve under the assumption of H0
When in fact such results are common under the
assumption of H0
There are many more ways you can be wrong with
hunting (different sample space)
(Formally):
• You say Pr(P-value ≤ p_obs; H0) ≈ p_obs = small
• But in fact Pr(P-value ≤ p_obs; H0) = high
18
Scapegoating
• Nowadays, we’re likely to see the tests
blamed
• My view: Tests don’t kill inferences,
people do
• Even worse are those statistical accounts
where the abuse vanishes!
19
Some say taking account of biasing selection
effects “defies scientific sense”
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, as they
are more commonly called, data dredging and
peeking at the data. The frequentist solution to both
problems involves adjusting the P-value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman 1999,
1010)
(To his credit, he’s open about this; heads the Meta-Research
Innovation Center at Stanford)
20
Likelihood Principle (LP)
The vanishing act links to a pivotal disagreement in
the philosophy of statistics battles
In probabilisms, the import of the data is via the
ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses vary
21
All error probabilities violate the LP
(even without selection effects):
Sampling distributions, significance levels, power, all
depend on something more [than the likelihood
function]–something that is irrelevant in Bayesian
inference–namely the sample space
(Lindley 1971, 436)
The LP implies…the irrelevance of predesignation,
of whether a hypothesis was thought of beforehand
or was introduced to explain known effects
(Rosenkrantz 1977, 122)
22
Optional Stopping:
Error probing capacities are altered not just
by cherry picking and data dredging, but
also via data dependent stopping rules:
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
Instead of fixing the sample size n in
advance, in some tests, n is determined by
a stopping rule:
23
• Keep sampling until H0 is rejected at the 0.05
level
i.e., keep sampling until |M| ≥ 1.96 s/√n
• Trying and trying again: having failed to
rack up a 1.96 s/√n difference after 10
trials, go to 20, 30 and so on until
obtaining a 1.96 s/√n difference
24
Nominal vs. Actual
significance levels again:
• With n fixed the Type I error probability is 0.05
• With this stopping rule the actual significance
level differs from, and will be greater than, 0.05
(with this “proper” stopping rule, sampling is guaranteed
to terminate, so trying long enough drives the actual level toward 1)
25
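A simulation sketch of this stopping rule (the cap on n, the replication count, and the seed are arbitrary choices for illustration; with no cap the rejection probability goes to 1):

```python
# Optional stopping under a true null: test after every new observation
# and stop as soon as |z| > 1.96. The nominal level is 0.05, but the
# realized Type I error rate is far higher, and grows with the cap n_max.
import numpy as np

rng = np.random.default_rng(7)
sigma, n_max, reps = 1.0, 1000, 2000
rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, n_max)        # H0: mu = 0 is true
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / (sigma * np.sqrt(n))  # z after each new observation
    rejections += np.any(np.abs(z) > 1.96)   # ever "significant"?
print(f"actual Type I error rate: {rejections / reps:.2f}  (nominal: 0.05)")
```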
• Equivalently, you can ensure that 0 is
excluded from a confidence interval (or
credibility interval) even if true (Berger and
Wolpert 1988)
• Likewise the same p-hacked hypothesis can
occur in Bayes factors, credibility intervals,
likelihood ratios
26
With One Big Difference:
• The direct grounds to criticize inferences
as flouting error statistical control are lost
• Probabilists condition on the actual data;
• Error probabilities take into account other
outcomes that could have occurred but
did not (sampling distribution)
27
What Counts as Cheating?
“[I]f the sampling plan is ignored, the researcher is
able to always reject the null hypothesis, even if it is
true. This example is sometimes used to argue that
any statistical framework should somehow take the
sampling plan into account. Some people feel that
‘optional stopping’ amounts to cheating…. This
feeling is, however, contradicted by a mathematical
analysis.” (Eric-Jan Wagenmakers 2007, 785)
But the “proof” assumes the likelihood principle (LP),
by which error probabilities drop out (Edwards,
Lindman, and Savage 1963)
28
• Replication researchers (re)discovered that data-
dependent hypotheses and stopping are a major
source of spurious significance levels.
“Authors must decide the rule for terminating data
collection before data collection begins and report
this rule in the articles” (Simmons, Nelson, and
Simonsohn 2011, 1362).
• Or report how their stopping plan alters relevant
error probabilities.
29
“This [optional stopping] example is cited as a
counter-example by both Bayesians and
frequentists! If we statisticians can't agree which
theory this example is counter to, what is a
clinician to make of this debate?”
(Little 2006, 215).
30
Underlies today’s disputes
• Critic: It’s too easy to satisfy standard significance
thresholds
• You: Why do replicationists find it so hard to achieve
significance thresholds (with preregistration)?
• Critic: Obviously the initial studies were guilty of P-
hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on, adjust, and block these biasing
selection effects.
• Critic: Actually “reforms” recommend methods where
the need to alter P-values due to data dredging
vanishes
31
How might probabilists block intuitively
unwarranted inferences
(without error probabilities)?
A subjective Bayesian might say:
If our beliefs were mixed into the interpretation of
the evidence, we wouldn’t declare there’s
statistical evidence of some unbelievable claim
(e.g., distinguishing shades of grey and being
politically moderate; ovulation and voting
preferences)
32
Rescued by beliefs
• That could work in some cases (it still
wouldn’t show what researchers had done
wrong)—battle of beliefs
• Besides, researchers sincerely believe their
hypotheses
• So now you’ve got two sources of flexibility,
priors and biasing selection effects
33
No help with our most important problem
• How to distinguish the warrant for a single
hypothesis H arrived at by different methods
(e.g., one has biasing selection effects, another,
pre-registered results and precautions)?
• Since there’s a single H, its prior would be the
same
34
Most Bayesians use “default” priors
• Eliciting subjective priors too difficult,
scientists reluctant to allow subjective beliefs
to overshadow data
• Default priors are supposed to prevent prior
beliefs from influencing the posteriors (“O-
Bayesians”; Berger 2006)
35
How should we interpret them?
• “The priors are not to be considered expressions
of uncertainty, ignorance, or degree of belief.
Default priors may not even be probabilities…”
(Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-
subjective priors
36
• Default Bayesian Reforms often touted as free of
selection effects
“…Bayes factors can be used in the complete
absence of a sampling plan…”
(Bayarri, Benjamin, Berger, Sellke 2016, 100)
37
• Whatever you think of them, there’s a big
difference in foundations (error statistical vs
Bayesian/probabilist)
• Knowing something about the statistics wars,
you’d get to the bottom of current proxy wars
38
Recent push to “redefine statistical significance” is
based on odds ratios/Bayes Factors (BF)
(Benjamin et al. 2017)
39
Old & new debates: P-values exaggerate based
on Bayes Factors or Likelihood Ratios
• Testing of a point null hypothesis, a lump of prior
probability given to H0
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
• The criticism is the posterior probability on H0 can
be larger than the P-value.
• So if you interpret a P-value as a posterior
on H0 (a fallacy), you’d be saying Pr(H0|x) is
low
40
Bayes-Fisher Disagreement or Jeffreys-Lindley
“Paradox”
• With a lump of prior given to the point null, and
the rest appropriately spread over the alternative,
an α-significant result can correspond to
Pr(H0|x) = (1 − α)! (e.g., 0.95)
“spike and smear”
• To a Bayesian this shows P-values exaggerate
the evidence against the null
41
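A sketch of the spike-and-smear computation (equal prior odds on H0; under H1, μ ~ N(0, τ²); τ = 1 and the sample sizes are illustrative choices):

```python
# Jeffreys-Lindley: hold the result "just significant" at the two-sided
# 0.05 level while n grows; the posterior probability of the point null
# climbs toward 1, even though the P-value stays fixed at 0.05.
import numpy as np
from scipy.stats import norm

sigma, tau = 1.0, 1.0
for n in (10, 100, 10_000, 1_000_000):
    se = sigma / np.sqrt(n)
    m = 1.96 * se                                      # just-significant mean
    like_h0 = norm.pdf(m, 0, se)                       # spike: mu = 0
    like_h1 = norm.pdf(m, 0, np.sqrt(tau**2 + se**2))  # smear: mu ~ N(0, tau^2)
    post_h0 = like_h0 / (like_h0 + like_h1)            # equal prior odds
    print(f"n = {n:>9}: Pr(H0|x) = {post_h0:.3f}")
```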
Do P-values Exaggerate Evidence?
• Significance testers balk at allowing highly
significant results to be interpreted as no
evidence against the null–or even evidence for it!
(a bad Type II error)
• The lump of prior on the null conflicts with the idea
that all nil nulls are false
• P-values can also equal the posterior! (without
the spiked prior)–even though they measure very
different things.
42
• “[T]he reason that Bayesians can regard P-values as
overstating the evidence against the null is simply a
reflection of the fact that Bayesians can
disagree sharply with each other” (Stephen Senn
2002, 2442)
• “In fact it is not the case that P-values are too
small, but rather that Bayes point null posterior
probabilities are much too big!….Our concern
should not be to analyze these misspecified
problems, but to educate the user so that the
hypotheses are properly formulated,” (Casella and
Berger 1987b, 334, 335).
• “concentrating mass on the point null hypothesis
is biasing the prior in favor of H0 as much as
possible” (ibid. 1987a, 111), whether in one- or
two-sided tests
43
• In one sense, lowering the probability of Type I
errors is innocuous (if that reflects your balance
of Type I and Type II errors)
• To base the argument for lowering P-values on
BFs–if you don’t do inference by means of BFs–
is not innocuous
(because of the forfeit of error control)
44
BF Advocates Forfeit Their Strongest Criticisms
• Daryl Bem (2011): subjects did better than
chance at predicting the picture to be shown in the
future; he credited ESP
• Some locate the start of the Replication Crisis
with Bem’s “Feeling the Future” (2011)
• Bem admits to data dredging
• Keen to show we should trade significance tests
for BFs, critics resort to a default Bayesian prior to
make the null hypothesis comparatively more
probable than a chosen alternative
45
Repligate
• Replication research has pushback:
some call it methodological terrorism
(enforcing good science or bullying?)
• If you must appeal to your disbelief in my
effect to find I fail to replicate, I’m more
likely to see it as bullying
• I can also deflect your criticism
46
Bem’s Response
Whenever the null hypothesis is sharply defined but
the prior distribution on the alternative hypothesis is
diffused over a wide range of values, as it is [here] it
boosts the probability that any observed data will be
higher under the null hypothesis than under the
alternative. This is known as the Lindley-Jeffreys
paradox: A frequentist analysis that yields strong
evidence in support of the experimental hypothesis
can be contradicted by a misguided Bayesian analysis
that concludes that the same data are more likely
under the null. (Bem, Utts, and Johnson 2011, 717)
47
Admittedly, significance tests require supplements to
avoid overestimating (or underestimating) effect
sizes
48
Jeffreys-Lindley result points to large n
problem
• Fixing the P-value, increasing sample size n, the
cut-off gets smaller
• Get to a point where x is closer to the null than
various alternatives
• Many would lower the P-value requirement as n
increases; that’s fine, but one can always avoid inferring a
discrepancy beyond what’s warranted:
• Severity prevents interpreting a significant result as
indicating a larger discrepancy than warranted (the
Mountains out of Molehills fallacy)
49
• Severity tells us: an α-significant difference indicates less of
a discrepancy from the null if it results from a larger
(n1) rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire): a fire
alarm that goes off with burnt toast, or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the alarm that goes off
with burnt toast; see the numeric sketch below]
50
What About Fallacies of
Non-Significant Results?
• They don’t warrant 0 discrepancy
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in a larger
difference than observed; set upper bounds
• If you very probably would have observed a more
impressive (smaller) P-value than you did, if μ >
μ1 (μ1 = μ0 + γ), then the data are good evidence
that μ ≤ μ1
• Akin to power analysis (Cohen, Neyman) but
sensitive to x0
51
Confidence Intervals also require
reinterpretation
Duality between tests and intervals: values within
the (1 − c) CI are non-rejectable at the c level
• Too dichotomous: in/out, plausible/not
plausible
• Justified in terms of long-run coverage
• All members of the CI treated on par
• Fixed confidence levels (need several
benchmarks)
52
Violated Assumptions Do More Damage Than
Setting P-values
We should worry more about violated
model assumptions
• Ironically, testing model assumptions
(and fraudbusting) is performed by simple
significance tests (e.g., tests of IID)
53
Some Bayesians Care About Model Testing
• “[C]rucial parts of Bayesian data analysis, such
as model checking, can be understood as ‘error
probes’ in Mayo’s sense…which might be seen
as using modern statistics to implement the
Popperian criteria of severe tests.”
(Andrew Gelman and Cosma Shalizi 2013, 10)
54
In dealing with non-replication we should
worry much more about
• Links from experiments to inferences of
interest
• I’ve seen plausible hypotheses poorly (or
questionably) tested
55
Is it Non-replication or Something About
the Design?
• Replications are to be superior:
preregistered, free of perverse incentives,
highly powered
• But might they be influenced by replicator’s
beliefs? (Kahneman: get approval from
original researcher)
• Subjects (often students) often told the
purpose of the experiment
56
• One of the non-replications: cleanliness and
morality: Does unscrambling soap words make
you less judgmental?
“Ms. Schnall had 40 undergraduates unscramble
some words. One group unscrambled words
that suggested cleanliness (pure, immaculate,
pristine), while the other group unscrambled
neutral words. They were then presented with
a number of moral dilemmas, like whether it’s
cool to eat your dog after it gets run over by a
car. …(Bartlett 2014)
57
…Turns out, it did. Subjects who had
unscrambled clean words weren’t as harsh on the
guy who chows down on his chow.”
(Bartlett 2014)
• Focusing on the P-values ignores larger
questions of measurement in psychology & the
leap from the statistical to the substantive
(H => H*)
58
2 that didn’t replicate in psychology:
• Belief in free will and cheating
• Physical distance (of points plotted) and
emotional closeness
I encourage meta-methodological scrutiny by
philosophers–especially those in X-phi
59
2 that didn’t replicate
• Belief in free will and cheating
“It found that participants who read a passage
arguing that their behavior is predetermined were
more likely than those who had not read the
passage to cheat on a subsequent test.”
• Physical distance (of points plotted) and
emotional closeness
“Volunteers asked to plot two points that were far
apart on graph paper later reported weaker
emotional attachment to family members,
compared with subjects who had graphed points
close together.” (Carey 2015)
60
Hidden Controversies
1. The Statistics Wars: Age-old abuses, fallacies of
statistics (Bayesian-frequentist, Fisher-N-P debates)
simmer below today's debates; reformulations of
frequentist methods are ignored
2. Replication Paradoxes: While a major source of
handwringing stems from biasing selection effects
(cherry picking, data dredging, trying and trying again),
some reforms and preferred alternatives conflict with
the needed error control.
3. Underlying assumptions and their violations
• A) statistical model assumptions
• B) links from experiments, measurements &
statistical inferences to substantive questions
61
• Recognize different roles of probability: probabilism,
long run performance, probativism (severe testing)
• Move away from cookbook statistics–long deplored–but
don’t expect agreement on numbers from evaluations
of different things (P-values, posteriors)
• Recognize criticisms & reforms are often based on
rival underlying philosophies of evidence
• The danger is that some reforms may enable rather
than directly reveal illicit inferences due to biasing
selection effects
• Worry more about model assumptions, and whether
our experiments are teaching us about the
phenomenon of interest
62
63
References
• Armitage, P. 1962. “Contribution to Discussion”, in The Foundations of Statistical
Inference: A Discussion, edited by L. J. Savage. London: Methuen.
• Bartlett, T. 2014. “Replication Crisis in Psychology Research Turns Ugly and Odd”,
Chronicle of Higher Education online, June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. 2016. “Rejection Odds and
Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”,
Journal of Mathematical Psychology 72: 90-103.
• Bem, D. 2011. “Feeling the Future: Experimental Evidence for Anomalous
Retroactive Influences on Cognition and Affect”, Journal of Personality and Social
Psychology 100(3), 407-425.
• Bem, D., Utts, J., and Johnson, W. 2011. “Must Psychologists Change the Way
They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4),
716-719.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). “Redefine Statistical
Significance”, Nature Human Behaviour 2, 6–10.
• Berger, J. O. 2003 “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
and “Rejoinder”, Statistical Science 18(1): 1-12; 28-32.
• Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis
1 (3): 385–402.
• Berger, J. O. and Wolpert, R. 1988. The Likelihood Principle, 2nd ed. Vol. 6 Lecture
Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics.
• Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
64
• Carey, B. (2015). “Many Psychology Findings Not as Strong as Claimed, Study
Says”, New York Times (August 27, 2015).
• Casella, G. and Berger, R. 1987a. “Reconciling Bayesian and Frequentist
Evidence in the One-sided testing Problem”, Journal of the American Statistical
Association 82(397): 106-111.
• Casella, G. and Berger, R. 1987b. “Comment on Testing Precise Hypotheses by J.
Berger and M. Delampady”, Statistical Science 2(3): 344-347.
• Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences 2nd ed.
Hillsdale NJ: Erlbaum.
• Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and
Hall.
• Cox, D. R., and Mayo, D.G. 2010. “Objectivity and Conditionality in Frequentist
Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability, and the Objectivity and Rationality of Science, edited by Deborah G.
Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University Press.
• Edwards, W., Lindman, H., and Savage, L. 1963. “Bayesian Statistical Inference
for Psychological Research”, Psychological Review 70(3): 193-242.
• Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
• Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17(1) (January 1): 69–78.
• Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian
Statistics” and “Rejoinder”, British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
• Gigerenzer, G. 2004. “Mindless Statistics”, Journal of Socio-Economics 33(5): 587-
606.
65
• Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989.
The Empire of Chance. Cambridge: Cambridge University Press.
• Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 2: The Bayes
Factor”, Annals of Internal Medicine 130: 1005-1013.
• Kahneman, D. 2014. “A New Etiquette for Replication”, Social Psychology 45(4), 299-
311.
• Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical
Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart
and Winston.
• Little, R. 2006. “Calibrated Bayes: A Bayes/Frequentist Roadmap”, The American
Statistician 60(3): 213-223.
• Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its
Conceptual Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the
Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive
Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical
Statistics: 247-275.
• Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–
Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2)
(June 1): 323–357.
66
• Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198.
Handbook of the Philosophy of Science. The Netherlands: Elsevier.
• Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New
Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods
7 (3): 283–300.
• Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
• Neyman, J. (1962). 'Two Breakthroughs in the Theory of Statistical Decision
Making', Revue De l'Institut International De Statistique / Review of the
International Statistical Institute, 30(1), 11-27.
• Open Science Collaboration (2015). ‘Estimating the Reproducibility of
Psychological Science’, Science 349(6251), 943–51.
• Pearson, E. S. & Neyman, J. (1930). “On the problem of two samples”, Joint
Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif.
Press). First published in Bul. Acad. Pol.Sci. 73-96.
• Popper, K. (1983). Realism and the Aim of Science. Totowa, NJ: Rowman and
Littlefield.
• Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
67
• Savage, L. J. 1961. “The Foundations of Statistics Reconsidered”, Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability I, Berkeley:
University of California Press, 575-586.
• Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London:
Methuen.
• Schnall, S. Benton, J. and Harvey, S. 2008. “With a Clean Conscience: Cleanliness
Reduces the Severity of Moral Judgments”, Psychological Science 19(12): 1219-
1222.
• Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research”, in The
Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago:
Aldine De Gruyter.
• Senn, S. 2002. “A Comment on Replication, P-values and Evidence S.N. Goodman
Statistics in Medicine 1992; 11: 875-879”, Statistics in Medicine 21(16), 2437-2444.
• Simmons, J., Nelson, L., and Simonsohn, U. 2011. “False-Positive Psychology:
Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as
Significant”, Psychological Science 22(11): 1359-1366.
• Wagenmakers, E-J., 2007. “A Practical Solution to the Pervasive Problems of P values”,
Psychonomic Bulletin & Review 14(5): 779-804.
• Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on p-Values: Context,
Process, and Purpose” (and supplemental materials), The American Statistician 70(2):
129-133.
68
Severity for Test T+:
SEV(T+, d(x0), claim C)
Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, known σ;
discrepancy parameter γ; μ1 = μ0 + γ; d0 = d(x0), the observed
value of the test statistic d(X) = √n(M − μ0)/σ
SIR: (Severity Interpretation with low P-values)
• (a): (high): If there’s a very low probability that so large
a d0 would have resulted, if μ were no greater than μ1,
then d0 indicates μ > μ1: SEV(μ > μ1) is high.
• (b): (low) If there is a fairly high probability that d0 would
have been larger than it is, even if μ = μ1, then d0 is not
a good indication μ > μ1: SEV(μ > μ1) is low.
69
SIN: (Severity Interpretation for Negative
results)
• (a): (high) If there is a very high probability
that d0 would have been larger than it is, were
μ > μ1, then μ ≤ μ1 passes the test with high
severity: SEV(μ ≤ μ1) is high.
• (b): (low) If there is a low probability that d0
would have been larger than it is, even if μ >
μ1, then μ ≤ μ1 passes with low severity:
SEV(μ ≤ μ1) is low.
70
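A sketch of how SIR and SIN can be computed for test T+ under the normal model above (μ0, σ, n, and the observed mean are hypothetical numbers for illustration):

```python
# Severity for T+ (H0: mu <= mu0 vs. H1: mu > mu0, sigma known):
#   SEV(mu > mu1)  = Pr(d(X) <= d0; mu = mu1)   [SIR]
#   SEV(mu <= mu1) = Pr(d(X) >  d0; mu = mu1)   [SIN]
import numpy as np
from scipy.stats import norm

mu0, sigma, n = 0.0, 1.0, 100
m_obs = 0.2                                  # hypothetical observed mean
d0 = np.sqrt(n) * (m_obs - mu0) / sigma      # observed test statistic (= 2.0)

def sev_greater(mu1):
    """SEV(mu > mu1): Pr(d(X) <= d0) evaluated at mu = mu1."""
    delta = np.sqrt(n) * (mu1 - mu0) / sigma
    return norm.cdf(d0 - delta)

for mu1 in (0.0, 0.1, 0.2, 0.3):
    print(f"mu1 = {mu1}: SEV(mu > mu1) = {sev_greater(mu1):.3f}, "
          f"SEV(mu <= mu1) = {1 - sev_greater(mu1):.3f}")
```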
Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is the
datum of some other experiment, and if it
happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have exactly
the same thing to say about the values of
µ…” (Savage 1962, p. 17)
71
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 

Replication Crises and the Statistics Wars: Hidden Controversies

  • 1. Replication Crises and the Statistics Wars: Hidden Controversies Deborah G Mayo June 14, 2018 Reproducibility and Replicability in Psychology and Experimental Philosophy
  • 2. • High-profile failures of replication have brought forth reams of reforms • How should the integrity of science be restored? • Experts do not agree. • We need to pull back the curtain on why. 2
  • 3. Hidden Controversies 1. The Statistics Wars: Age-old abuses, fallacies of statistics (Bayesian-frequentist, Fisher-N-P debates) simmer below today's debates; reformulations of frequentist methods are ignored 2. Replication Paradoxes: While a major source of handwringing stems from biasing selection effects— cherry picking, data dredging, trying and trying again, some reforms and alternative methods conflict with the needed error control 3. Underlying assumptions and their violations • A) statistical model assumptions • B) links from experiments, measurements & statistical inferences to substantive questions 3
  • 4. Significance Tests • The most used methods are most criticized • Statistical significance tests are a small part of a rich set of: “techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data [bounding error probabilities]” (Birnbaum 1970, 1033) • These I call error statistical methods (or sampling theory). 4
  • 5. “p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function T = t(y) of the data, to be called the test statistic, such that • the larger the value of T the more inconsistent are the data with H0; • The random variable T = t(Y) has a (numerically) known probability distribution when H0 is true. …the p-value corresponding to any tobs as p = p(t) = Pr(T ≥ tobs; H0)” (Mayo and Cox 2006, 81) 5
  • 6. Testing Reasoning • If even larger differences than tobs occur fairly frequently under H0 (i.e., P-value is not small), there’s scarcely evidence of incompatibility with H0 • Small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than tobs were H0 true. • This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H* Stat-Sub fallacy H => H* 6
  • 7. Neyman-Pearson (N-P) tests: Null and alternative hypotheses H0, H1 that are exhaustive H0: μ ≤ 0 vs. H1: μ > 0 • So this fallacy of rejection H1 => H* is impossible • Rejecting H0 only indicates statistical alternatives H1 (how discrepant from null) As opposed to NHST 7
  • 8. We must get beyond the N-P/Fisher incompatibilist caricature • Beyond “inconsistent hybrid” (Gigerenzer 2004, 590): Fisher–inferential; N-P–long run performance • Both N-P & F offer techniques to assess and control error probabilities • “Incompatibilism” leaves P-value users robbed of features from N-P tests they need (power) • Wrongly supposes N-P testers have only fixed error probabilities & are not allowed to report P-values 8
  • 9. Replication Paradox (for critics of N-P & F Significance Tests) Critic: It’s much too easy to get a small P-value You: Why do they find it so difficult to replicate the small P-values others found? Is it easy or is it hard? 9
  • 10. Both Fisher and N-P methods: it’s easy to lie with statistics by selective reporting • Sufficient finagling—cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere—may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence 10
  • 11. Severity Requirement: Everyone agrees: If the test had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H (“too cheap to be worth having” Popper 1983, 30) • Such a test fails a minimal requirement for a stringent or severe test • Their new idea was using probability to reflect this capability • They may not have put it explicitly in these terms but my account does (requires reinterpreting tests) 11
  • 12. This alters the role of probability: typically just two roles are recognized • Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (e.g., Bayesian, likelihoodist, Fisher (at times)) • Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher (at times)) 12
  • 13. What happened to using probability to assess error probing capacity? • Neither “probabilism” nor “performance” directly captures it • Good long-run performance is a necessary, not a sufficient, condition for severity • Need to explicitly reinterpret tests according to how well or poorly discrepancies are indicated 13
  • 14. Key to revising the role of error probabilities • Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs— • It’s that we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data 14
  • 15. A claim C is not warranted _______ • Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer) • Performance: unless it stems from a method with low long-run error • Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C 15
  • 16. Biasing selection effects: One function of severity is to identify problematic selection effects (not all are) • Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified), in such a way that the minimal severity requirement is violated, seriously altered or incapable of being assessed 16
  • 17. Nominal vs. actual significance levels Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104) From Morrison & Henkel’s The Significance Test Controversy (1970)! 17
  • 18. Spurious P-Value You report: Such results would be difficult to achieve under the assumption of H0 When in fact such results are common under the assumption of H0 There are many more ways you can be wrong with hunting (different sample space) (Formally): • You say Pr(P-value ≤ Pobs; H0) ~ Pobs = small • But in fact Pr(P-value ≤ Pobs; H0) = high 18
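Selvin's arithmetic is easy to check. Below is a minimal Python sketch, assuming 20 independent two-sample comparisons with 30 observations per group (illustrative choices, not from the slides): with all 20 nulls true, hunting for one "significant" difference succeeds about 64 percent of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, k = 0.05, 20

# Analytic version of Selvin's point: with 20 true nulls examined,
# the chance at least one reaches nominal 0.05 significance is ~0.64.
print(1 - (1 - alpha) ** k)            # 0.6415...

# Simulation: each "study" hunts through 20 two-sample t-tests, all nulls true.
reps, hits = 2000, 0
for _ in range(reps):
    x = rng.normal(size=(k, 30))       # 20 comparisons, 30 per group
    y = rng.normal(size=(k, 30))
    p = stats.ttest_ind(x, y, axis=1).pvalue
    hits += p.min() <= alpha           # report the best-looking difference
print(hits / reps)                     # again ~0.64: the stated 0.05 is nominal only
```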
  • 19. Scapegoating • Nowadays, we’re likely to see the tests blamed • My view: Tests don’t kill inferences, people do • Even worse are those statistical accounts where the abuse vanishes! 19
  • 20. Some say taking account of biasing selection effects “defies scientific sense” “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010) (To his credit, he’s open about this; heads the Meta-Research Innovation Center at Stanford) 20
  • 21. Likelihood Principle (LP) The vanishing act links to a pivotal disagreement in the philosophy of statistics battles In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses Pr(x0;H0)/Pr(x0;H1) The data x0 are fixed, while the hypotheses vary 21
  • 22. All error probabilities violate the LP (even without selection effects): Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space (Lindley 1971, 436) The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of before hand or was introduced to explain known effects (Rosenkrantz 1977, 122) 22
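A classic illustration of this sample-space dependence (an assumed textbook example, not from the slides): 9 successes and 3 failures give likelihoods proportional to p^9(1 - p)^3 under both a fixed-n binomial plan and a sample-until-3-failures negative binomial plan, so the LP deems the evidence identical, yet the P-values differ.

```python
from scipy import stats

# Observed: 9 successes, 3 failures; test H0: p = 0.5 against p > 0.5.

# Plan 1: n = 12 trials fixed in advance (binomial sample space).
p_binomial = stats.binom.sf(8, 12, 0.5)     # Pr(X >= 9; p = 0.5) ~ 0.073

# Plan 2: sample until the 3rd failure (negative binomial sample space).
p_negbinom = stats.nbinom.sf(8, 3, 0.5)     # Pr(successes >= 9; p = 0.5) ~ 0.033

print(p_binomial, p_negbinom)
# Both likelihoods are proportional to p**9 * (1 - p)**3, so the LP says the
# evidence is identical; the sampling distributions (hence the P-values)
# disagree, straddling the conventional 0.05 line.
```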
  • 23. Optional Stopping: Error probing capacities are altered not just by cherry picking and data dredging, but also via data dependent stopping rules: Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0. Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule: 23
  • 24. • Keep sampling until H0 is rejected at the 0.05 level, i.e., keep sampling until |M| > 1.96 s/√n • Trying and trying again: Having failed to rack up a 1.96 s/√n difference after 10 trials, go to 20, 30 and so on until obtaining a 1.96 s/√n difference 24
  • 25. Nominal vs. Actual significance levels again: • With n fixed the Type 1 error probability is 0.05 • With this stopping rule the actual significance level differs from, and will be greater than, 0.05; indeed, since this is a proper stopping rule (sampling stops with probability 1 even when H0 is true), with unlimited trying the actual level approaches 1 25
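A short simulation makes the inflation concrete. This is a minimal sketch assuming a look after every 10th observation up to a cap of 1,000 (batch size and cap are illustrative choices; with no cap, stopping is guaranteed and the actual level goes to 1).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, z = 1.0, 1.96
reps, n_max, batch = 5000, 1000, 10
rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, n_max)        # H0: mu = 0 is true
    n = np.arange(1, n_max + 1)
    m = np.cumsum(x) / n                     # running mean after each observation
    looks = n % batch == 0                   # test only at every 10th observation
    if np.any(np.abs(m[looks]) > z * sigma / np.sqrt(n[looks])):
        rejections += 1                      # "significant": stop and report
print(rejections / reps)  # well above the nominal 0.05 (around 0.3-0.4 here)
```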
  • 26. • Equivalently, you can ensure that 0 is excluded from a confidence interval (or credibility interval) even if true (Berger and Wolpert 1988) • Likewise the same p-hacked hypothesis can occur in Bayes factors, credibility intervals, likelihood ratios 26
  • 27. With One Big Difference: • The direct grounds to criticize inferences as flouting error statistical control are lost • They condition on the actual data • Error probabilities take into account other outcomes that could have occurred but did not (sampling distribution) 27
  • 28. 28 What Counts as Cheating? “[I]f the sampling plan is ignored, the researcher is able to always reject the null hypothesis, even if it is true. This example is sometimes used to argue that any statistical framework should somehow take the sampling plan into account. Some people feel that ‘optional stopping’ amounts to cheating…. This feeling is, however, contradicted by a mathematical analysis.” (Eric-Jan Wagenmakers 2007, 785) But the “proof” assumes the likelihood principle (LP) by which error probabilities drop out. (Edwards, Lindman, and Savage 1963)
  • 29. • Replication researchers (re)discovered that data-dependent hypotheses and stopping are a major source of spurious significance levels. “Authors must decide the rule for terminating data collection before data collection begins and report this rule in the articles” (Simmons, Nelson, and Simonsohn 2011, 1362). • Or report how their stopping plan alters relevant error probabilities. 29
  • 30. 30 “This [optional stopping] example is cited as a counter-example by both Bayesians and frequentists! If we statisticians can't agree which theory this example is counter to, what is a clinician to make of this debate?” (Little 2006, 215). Underlies today’s disputes
  • 31. • Critic: It’s too easy to satisfy standard significance thresholds • You: Why do replicationists find it so hard to achieve significance thresholds (with preregistration)? • Critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs) • You: So, the replication researchers want methods that pick up on, adjust, and block these biasing selection effects. • Critic: Actually “reforms” recommend methods where the need to alter P-values due to data dredging vanishes 31
  • 32. How might probabilists block intuitively unwarranted inferences (without error probabilities)? A subjective Bayesian might say: If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences) 32
  • 33. Rescued by beliefs • That could work in some cases (it still wouldn’t show what researchers had done wrong)—battle of beliefs • Besides, researchers sincerely believe their hypotheses • So now you’ve got two sources of flexibility, priors and biasing selection effects 33
  • 34. No help with our most important problem • How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, pre-registered results and precautions)? • Since there’s a single H, its prior would be the same 34
  • 35. Most Bayesians use “default” priors • Eliciting subjective priors too difficult, scientists reluctant to allow subjective beliefs to overshadow data • Default priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians; Berger 2006) 35
  • 36. How should we interpret them? • “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Default priors may not even be probabilities…” (Cox and Mayo 2010, 299) • No agreement on rival systems for default/non-subjective priors 36
  • 37. • Default Bayesian Reforms often touted as free of selection effects “…Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, 100) 37
  • 38. • Whatever you think of them, there’s a big difference in foundations (error statistical vs Bayesian/probabilist) • Knowing something about the statistics wars, you’d get to the bottom of current proxy wars 38
  • 39. Recent push to “redefine statistical significance” is based on odds ratios/Bayes Factors (BF) (Benjamin et al. 2017) 39
  • 40. Old & new debates: P-values exaggerate evidence, as judged by Bayes Factors or Likelihood Ratios • Testing of a point null hypothesis, a lump of prior probability given to H0 Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0. • The criticism is the posterior probability on H0 can be larger than the P-value. • So if you interpret a P-value as a posterior on H0 (a fallacy), you’d be saying the Pr(H0|x) is low 40
  • 41. Bayes-Fisher Disagreement or Jeffreys-Lindley “Paradox” • With a lump of prior given to the point null, and the rest appropriately spread over the alternative, an α-significant result can correspond to Pr(H0|x) = (1 − α)! (e.g., 0.95) “spike and smear” • To a Bayesian this shows P-values exaggerate evidence against the null 41
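The spike-and-smear computation behind this result is short. The sketch below assumes Pr(H0) = 0.5, a N(0, τ²) "smear" over the alternative with τ = σ = 1, and an observed mean sitting exactly at the two-sided 0.05 cutoff; these are conventional illustrative choices, not values from the slides.

```python
import numpy as np
from scipy import stats

sigma, tau = 1.0, 1.0
z = 1.96                                  # just significant, 2-sided alpha = 0.05
for n in (10, 100, 10_000, 1_000_000):
    se = sigma / np.sqrt(n)
    xbar = z * se                         # observed mean right at the cutoff
    like_h0 = stats.norm.pdf(xbar, 0.0, se)                        # spike at mu = 0
    like_h1 = stats.norm.pdf(xbar, 0.0, np.sqrt(tau**2 + se**2))   # smear marginal
    bf01 = like_h0 / like_h1
    post_h0 = bf01 / (1.0 + bf01)         # Pr(H0 | x) with Pr(H0) = 0.5
    print(f"n = {n:>9}: Pr(H0 | x) = {post_h0:.3f}")
# Output climbs from ~0.37 toward 0.99: the 2-sided P-value sits at 0.05
# throughout, while the spike-and-smear posterior turns the same
# "significant" result into evidence FOR the point null.
```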
  • 42. Do P-values Exaggerate Evidence? • Significance testers balk at allowing highly significant results to be interpreted as no evidence against the null–or even evidence for it! Bad Type II error • The lump of prior on the null conflicts with the idea that all nil nulls are false • P-values can also equal the posterior! (without the spiked prior)–even though they measure very different things. 42
  • 43. • “[T]he reason that Bayesians can regard P-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other” (Stephen Senn 2002, 2442) • “In fact it is not the case that P-values are too small, but rather that Bayes point null posterior probabilities are much too big!… Our concern should not be to analyze these misspecified problems, but to educate the user so that the hypotheses are properly formulated” (Casella and Berger 1987b, 334, 335). • “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (ibid. 1987a, 111) whether in 1 or 2-sided tests 43
  • 44. • In one sense, lowering the probability of Type I errors is innocuous (if that reflects your balance of Type I and II errors) • To base the argument for lowering P-values on BFs–if you don’t do inference by means of BFs–is not innocuous (because of the forfeit of error control) 44
  • 45. BF Advocates Forfeit Their Strongest Criticisms • Daryl Bem (2011): subjects did better than chance at predicting the picture to be shown in the future, credited to ESP • Some locate the start of the Replication Crisis with Bem’s “Feeling the Future” (2011); Bem admits data dredging • Keen to show we should trade significance tests for BFs, critics resort to a default Bayesian prior to make the null hypothesis comparatively more probable than a chosen alternative 45
  • 46. Repligate • Replication research has pushback: some call it methodological terrorism (enforcing good science or bullying?) • If you must appeal to your disbelief in my effect to find I fail to replicate, I’m more likely to see it as bullying • I can also deflect your criticism 46
  • 47. Bem’s Response Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is [here] it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative. This is known as the Lindley-Jeffreys paradox: A frequentist analysis that yields strong evidence in support of the experimental hypothesis can be contradicted by a misguided Bayesian analysis that concludes that the same data are more likely under the null. (Bem et al., 717) 47
  • 48. Admittedly, significance tests require supplements to avoid overestimating (or underestimating) effect sizes 48
  • 49. Jeffreys-Lindley result points to the large n problem • Fixing the P-value, increasing sample size n, the cut-off gets smaller • Get to a point where x is closer to the null than various alternatives • Many would lower the P-value requirement as n increases; that’s fine, but one can always avoid inferring a discrepancy beyond what’s warranted: • Severity prevents interpreting a significant result as indicating a larger discrepancy than warranted (Mountains out of Molehills fallacy) 49
  • 50. • Severity tells us: an α-significant difference indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2) • What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one that doesn’t go off unless the house is fully ablaze? • [The larger sample size is like the one that goes off with burnt toast] 50
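A severity-style calculation makes the alarm analogy numerical. In this sketch (σ = 1 and the two sample sizes are assumed for illustration), the observed mean is just significant in both cases, yet the discrepancy indicated with severity 0.9 shrinks by a factor of ten at the larger n.

```python
import numpy as np
from scipy import stats

sigma, z, sev = 1.0, 1.96, 0.90
for n in (100, 10_000):
    se = sigma / np.sqrt(n)
    xbar = z * se                       # just significant at the 1-sided 0.025 level
    # Largest mu1 with SEV(mu > mu1) = Pr(Xbar <= xbar; mu = mu1) >= 0.90:
    mu1 = xbar - stats.norm.ppf(sev) * se
    print(f"n = {n:>6}: indicated with severity 0.9: mu > {mu1:.4f}")
# n = 100    -> mu > 0.0678
# n = 10000  -> mu > 0.0068: the burnt-toast alarm indicates far less fire
```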
  • 51. What About Fallacies of Non-Significant Results? • They don’t warrant a 0 discrepancy • Using severity reasoning: rule out discrepancies that very probably would have resulted in a larger difference than observed, i.e., set upper bounds • If you very probably would have observed a more impressive (smaller) P-value than you did, if μ > μ1 (μ1 = μ0 + γ), then the data are good evidence that μ ≤ μ1 • Akin to power analysis (Cohen, Neyman) but sensitive to x0 51
  • 52. Confidence Intervals also require reinterpretation Duality between tests and intervals: values within the (1 − c) CI are non-rejectable at the c level • Too dichotomous: in/out, plausible/not plausible • Justified in terms of long-run coverage • All members of the CI treated on par • Fixed confidence levels (need several benchmarks) 52
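The duality is easy to verify numerically. A minimal sketch (σ, n, and the observed mean are assumed for illustration): the endpoints of the 95% interval are exactly the μ0 values with two-sided P-value 0.05, and interior points all come out "non-rejectable".

```python
import numpy as np
from scipy import stats

sigma, n, xbar = 1.0, 25, 0.5                  # assumed illustrative values
se = sigma / np.sqrt(n)
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se    # 95% CI: (0.108, 0.892)
for mu0 in (lo, xbar, hi):
    zstat = (xbar - mu0) / se
    pval = 2 * stats.norm.sf(abs(zstat))
    print(f"mu0 = {mu0:.3f}: 2-sided P = {pval:.3f}")
# Endpoints give P = 0.05 exactly; interior points give larger P-values.
# Severity goes further, discriminating among the points inside the interval
# rather than treating them all on a par.
```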
  • 53. Violated Assumptions Do More Damage Than Setting P-values We should worry more about violated model assumptions • Ironically, testing model assumptions (and fraudbusting) is performed by simple significance tests (e.g., of IID) 53
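As one illustration of checking assumptions with simple significance tests (the AR(1) data and the lag-1 autocorrelation statistic are my assumed example, not the slides'): under IID the lag-1 sample autocorrelation is approximately N(0, 1/n), so a large standardized value is a small P-value against independence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Generate positively dependent (AR(1)) data, violating IID.
e = rng.normal(size=500)
x = np.empty_like(e)
x[0] = e[0]
for t in range(1, len(e)):
    x[t] = 0.5 * x[t - 1] + e[t]

# Simple significance test of independence via the lag-1 autocorrelation.
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
zstat = r1 * np.sqrt(len(x))            # ~ N(0, 1) under independence, large n
pval = 2 * stats.norm.sf(abs(zstat))
print(r1, pval)                         # r1 ~ 0.5, tiny P-value: IID rejected
```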
  • 54. Some Bayesians Care About Model Testing • “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”, which “might be seen as using modern statistics to implement the Popperian criteria of severe tests.” (Andrew Gelman and Cosma Shalizi 2013, 10) 54
  • 55. In dealing with non-replication we should worry much more about • Links from experiments to inferences of interest • I’ve seen plausible hypotheses poorly (or questionably) tested 55
  • 56. Is it Non-replication or Something About the Design? • Replications are to be superior: preregistered, free of perverse incentives, with high power • But might they be influenced by the replicator’s beliefs? (Kahneman: get approval from the original researcher) • Subjects (often students) are often told the purpose of the experiment 56
  • 57. • One of the non-replications: cleanliness and morality: Does unscrambling soap words make you less judgmental? “Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. …(Bartlett 2014) 57
  • 58. …Turns out, it did. Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Bartlett 2014) • Focusing on the P-values ignores larger questions of measurement in psychology & the leap from the statistical to the substantive. H => H* 58
  • 59. 2 that didn’t replicate in psychology: • Belief in free will and cheating • Physical distance (of points plotted) and emotional closeness I encourage meta-methodological scrutiny by philosophers–especially those in Xphi 59
  • 60. 2 that didn’t replicate • Belief in free will and cheating “It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.” • Physical distance (of points plotted) and emotional closeness “Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.” (Carey, 2015) 60
  • 61. Hidden Controversies 1. The Statistics Wars: Age-old abuses, fallacies of statistics (Bayesian-frequentist, Fisher-N-P debates) simmer below today’s debates; reformulations of frequentist methods are ignored 2. Replication Paradoxes: While a major source of handwringing stems from biasing selection effects— cherry picking, data dredging, trying and trying again, some reforms and preferred alternatives conflict with the needed error control. 3. Underlying assumptions and their violations • A) statistical model assumptions • B) links from experiments, measurements & statistical inferences to substantive questions 61
  • 62. • Recognize different roles of probability: probabilism, long run performance, probativism (severe testing) • Move away from cookbook statistics–long deplored–but don’t expect agreement on numbers from evaluations of different things (P-values, posteriors) • Recognize criticisms & reforms are often based on rival underlying philosophies of evidence • The danger is that some reforms may enable rather than directly reveal illicit inferences due to biasing selection effects • Worry more about model assumptions, and whether our experiments are teaching us about the phenomenon of interest 62
  • 63. [Image-only slide] 63
  • 64. References • Armitage, P. 1962. “Contribution to Discussion”, in The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen. • Bartlett, T. 2014. “Replication Crisis in Psychology Research Turns Ugly and Odd”, Chronicle of Higher Education online, June 23, 2014. • Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. 2016. “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology 72: 90-103. • Bem, D. J. 2011. “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”, Journal of Personality and Social Psychology 100(3): 407-425. • Bem, D. J., Utts, J., and Johnson, W. 2011. “Must Psychologists Change the Way They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4): 716-719. • Benjamin, D., Berger, J., Johannesson, M., et al. 2017. “Redefine Statistical Significance”, Nature Human Behaviour 2: 6-10. • Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder”, Statistical Science 18(1): 1-12; 28-32. • Berger, J. O. 2006. “The Case for Objective Bayesian Analysis”, Bayesian Analysis 1(3): 385-402. • Berger, J. O. and Wolpert, R. 1988. The Likelihood Principle, 2nd ed. Vol. 6, Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics. • Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor)”, Nature 225(5237): 1033. 64
  • 65. • Carey, B. 2015. “Many Psychology Findings Not as Strong as Claimed, Study Says”, New York Times (August 27, 2015). • Casella, G. and Berger, R. 1987a. “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem”, Journal of the American Statistical Association 82(397): 106-111. • Casella, G. and Berger, R. 1987b. “Comment on Testing Precise Hypotheses by J. Berger and M. Delampady”, Statistical Science 2(3): 344-347. • Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum. • Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall. • Cox, D. R. and Mayo, D. G. 2010. “Objectivity and Conditionality in Frequentist Inference”. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press. • Edwards, W., Lindman, H., and Savage, L. 1963. “Bayesian Statistical Inference for Psychological Research”, Psychological Review 70(3): 193-242. • Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd. • Fisher, R. A. 1955. “Statistical Methods and Scientific Induction”, Journal of the Royal Statistical Society, Series B (Methodological) 17(1): 69-78. • Gelman, A. and Shalizi, C. 2013. “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder”, British Journal of Mathematical and Statistical Psychology 66(1): 8-38; 76-80. • Gigerenzer, G. 2004. “Mindless Statistics”, Journal of Socio-Economics 33(5): 587-606. 65
  • 66. • Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press. • Goodman, S. N. 1999. “Toward Evidence-based Medical Statistics. 2: The Bayes Factor”, Annals of Internal Medicine 130: 1005-1013. • Kahneman, D. 2014. “A New Etiquette for Replication”, Social Psychology 45(4): 299-311. • Lindley, D. V. 1971. “The Estimation of Many Parameters”. In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston. • Little, R. 2006. “Calibrated Bayes: A Bayes/Frequentist Roadmap”, The American Statistician 60(3): 213-223. • Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press. • Mayo, D. G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press. • Mayo, D. G. and Cox, D. R. 2006. “Frequentist Statistics as a Theory of Inductive Inference”. In The Second Erich L. Lehmann Symposium: Optimality, edited by J. Rojo, 247-275. Lecture Notes-Monograph Series, Vol. 49. Institute of Mathematical Statistics. • Mayo, D. G. and Spanos, A. 2006. “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, British Journal for the Philosophy of Science 57(2): 323-357. 66
  • 67. • Mayo, D. G. and Spanos, A. 2011. “Error Statistics”. In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier. • Meehl, P. E. and Waller, N. G. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude”, Psychological Methods 7(3): 283-300. • Morrison, D. E. and Henkel, R. E., eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. • Neyman, J. 1962. “Two Breakthroughs in the Theory of Statistical Decision Making”, Revue De l’Institut International De Statistique / Review of the International Statistical Institute 30(1): 11-27. • Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science”, Science 349(6251): 943-951. • Pearson, E. S. and Neyman, J. 1930. “On the Problem of Two Samples”, in Joint Statistical Papers by J. Neyman & E. S. Pearson, 99-115. Berkeley: University of California Press. First published in Bul. Acad. Pol.Sci., 73-96. • Popper, K. 1983. Realism and the Aim of Science. Totowa, NJ: Rowman and Littlefield. • Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel. 67
  • 68. • Savage, L. J. 1961. “The Foundations of Statistics Reconsidered”, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability I, 575-586. Berkeley: University of California Press. • Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen. • Schnall, S., Benton, J., and Harvey, S. 2008. “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments”, Psychological Science 19(12): 1219-1222. • Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research”. In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter. • Senn, S. 2002. “A Comment on Replication, P-values and Evidence, S.N. Goodman, Statistics in Medicine 1992; 11: 875-879”, Statistics in Medicine 21(16): 2437-2444. • Simmons, J., Nelson, L., and Simonsohn, U. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, Psychological Science 22(11): 1359-1366. • Wagenmakers, E-J. 2007. “A Practical Solution to the Pervasive Problems of P values”, Psychonomic Bulletin & Review 14(5): 779-804. • Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on P-values: Context, Process, and Purpose” (and supplemental materials), The American Statistician 70(2): 129-133. 68
  • 69. Severity for Test T+: SEV(T+, d(x0), claim C) Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, with known σ; discrepancy parameter γ; μ1 = μ0 + γ; test statistic d(X) = √n(M − μ0)/σ, with observed value d0 = d(x0) SIR: (Severity Interpretation with low P-values) • (a): (high): If there’s a very low probability that so large a d0 would have resulted, if μ were no greater than μ1, then d0 indicates μ > μ1: SEV(μ > μ1) is high. • (b): (low) If there is a fairly high probability that d0 would have been larger than it is, even if μ = μ1, then d0 is not a good indication μ > μ1: SEV(μ > μ1) is low. 69
  • 70. SIN: (Severity Interpretation for Negative results) • (a): (high) If there is a very high probability that d0 would have been larger than it is, were μ > μ1, then μ ≤ μ1 passes the test with high severity: SEV(μ ≤ μ1) is high. • (b): (low) If there is a low probability that d0 would have been larger than it is, even if μ > μ1, then μ ≤ μ1 passes with low severity: SEV(μ ≤ μ1) is low. 70
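Putting SIR and SIN together, severity for test T+ reduces to normal tail-area calculations. A minimal sketch of how I read the definitions above (μ0 = 0, σ = 1, n = 100, and the observed means are assumed for illustration):

```python
import numpy as np
from scipy import stats

def sev_greater(xbar, mu1, mu0=0.0, sigma=1.0, n=100):
    """SEV(mu > mu1) = Pr(d(X) <= d0; mu = mu1), with d(X) = sqrt(n)(M - mu0)/sigma."""
    d0 = np.sqrt(n) * (xbar - mu0) / sigma
    d1 = np.sqrt(n) * (mu1 - mu0) / sigma
    return stats.norm.cdf(d0 - d1)

def sev_leq(xbar, mu1, mu0=0.0, sigma=1.0, n=100):
    """SEV(mu <= mu1) = Pr(d(X) > d0; mu = mu1)."""
    return 1.0 - sev_greater(xbar, mu1, mu0, sigma, n)

# Significant result, xbar = 0.2 (d0 = 2.0):
print(sev_greater(0.2, 0.0))   # ~0.977: good indication that mu > 0      (SIR (a))
print(sev_greater(0.2, 0.2))   # 0.5: poor indication that mu > 0.2       (SIR (b))
# Insignificant result, xbar = 0.05 (d0 = 0.5):
print(sev_leq(0.05, 0.2))      # ~0.933: good evidence that mu <= 0.2     (SIN (a))
```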
  • 71. Jimmy Savage on the LP: “According to Bayes' theorem,…. if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…” (Savage 1962, p. 17) 71