Philosophy of Science and
Philosophy of Statistics
Deborah G Mayo
Dept of Philosophy, Virginia Tech
APA Central: Epistemology Meets Philosophy of
Statistics
February 27, 2020
What is the Philosophy of
Statistics (PhilStat)?
At one level, statisticians and philosophers of
science ask many of the same questions:
What should be observed and what may justifiably
be inferred from the resulting data?
What is a good test?
How can spurious relationships be distinguished
from genuine regularities? from causal regularities?
• These very general questions are entwined with
long standing debates in philosophy of science
• No wonder the field of statistics tends to cross
over so often into philosophical territory.
Statistics → Philosophy
(1) Model Scientific Inference—capture the actual or
rational ways to arrive at evidence and inference
(2) Solve Philosophical Problems about scientific
inference, observation, experiment;
(3) Metamethodology: Analyze intuitive rules
(e.g., novelty, simplicity)
Formal Epistemology?
Could be
• Phil Stat
• Analytic epistemology with probabilities
“Bayesian statistics is one thing… Bayesian
epistemology is something else. The idea of putting
probabilities over hypotheses delivered to
philosophy a godsend, an entire package of
superficiality.” (Glymour 2010, 334)
• His worry: starting with an intuitive principle,
epistemologists reconstruct it with a probabilistic
confirmation logic.
• You haven’t shown, for example, that beliefs ought to
go up with varied evidence; you have only represented it
probabilistically
• I don't knock rational reconstruction using
probability, and analogs of some puzzles arise in
statistics (tacking paradox, old evidence)
Philosophy → Statistics
• Central job: minister to scientists’ conceptual,
logical and methodological discomforts
• Despite technical sophistication, basic concepts of
statistical inference and modeling are more
unsettled than ever.
My Interest: Philosophy in
Statistics Wars
Statistical Crisis in Science
• In many fields, latitude in collecting and
interpreting data makes it too easy to dredge up
impressive looking findings even when spurious.
• We set sail with a simple tool: a minimal
requirement for evidence
• Sufficiently general to apply to any methods now in
use
Statistical reforms
• Several are welcome: preregistration, avoidance
of cookbook statistics, calls for more replication
research
• Others are quite radical, and even violate our
minimal principle of evidence
• Combating paradoxical, self-defeating “reforms”
requires a mix of statistics, philosophy, history
Most often used tools are most
criticized
“Several methodologists have pointed out that the
high rate of nonreplication of research discoveries is
a consequence of the convenient, yet ill-founded
strategy of claiming conclusive research findings
solely on the basis of a single study assessed by
formal statistical significance, typically for a p-value
less than 0.05. …” (Ioannidis 2005, 696)
Do researchers do that?
R.A. Fisher
“[W]e need, not an isolated record, but a reliable
method of procedure. In relation to the test of
significance, we may say that a phenomenon is
experimentally demonstrable when we know how
to conduct an experiment which will rarely fail to
give us a statistically significant result.”
(Fisher 1947, 14)
Simple significance tests (Fisher)
to test the conformity of the particular data under
analysis with H0 in some respect:
…we find a function d(X) of the data, the test statistic,
such that
• the larger the value of d(X) the more inconsistent
are the data with H0;
• d(X) has a known probability distribution
when H0 is true.
…the p-value corresponding to any d(x) (or dobs)
p = p(t) = Pr(d(X) ≥ d(x); H0)
(Mayo and Cox 2006, 81; d for t, x for y)
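As a minimal sketch of the p-value formula above, assume a one-sided test of the mean of a Normal distribution with known σ, taking d(X) = √n(M − μ0)/σ with M the sample mean. The function name and the numbers are illustrative assumptions, not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(sample_mean, mu0, sigma, n):
    """Return p = Pr(d(X) >= d(x); H0) for d(X) = sqrt(n) * (M - mu0) / sigma."""
    d_obs = sqrt(n) * (sample_mean - mu0) / sigma
    # At the boundary of H0 (mu = mu0), d(X) follows a standard Normal distribution.
    return 1.0 - NormalDist().cdf(d_obs)


# Hypothetical data: observed mean 0.4 from n = 25 draws with sigma = 1.
p = one_sided_p_value(sample_mean=0.4, mu0=0.0, sigma=1.0, n=25)
print(f"p = {p:.4f}")
```

Here d(x) = 2, so the p-value is the upper tail area of the standard Normal beyond 2.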
Testing reasoning
• If even larger differences than dobs occur fairly
frequently under H0 (i.e., P-value not small),
there’s no evidence of incompatibility with H0
• Small P-value indicates some underlying
discrepancy from H0 because very probably
you would have seen a less impressive
difference than dobs were H0 true.
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy H => H*
Neyman-Pearson (N-P) tests (1933):
A test (null) and alternative hypotheses
H0, H1 that are exhaustive
H0: μ ≤ 0 vs. H1: μ > 0
Philosophers should adopt the
language of statistics, e.g., Xi ~ N(μ, σ²)
Neyman-Pearson (N-P) tests (1933):
• This fallacy of rejection H1 → H* is impossible
• Rejecting H0 only indicates statistical alternatives
H1 (how discrepant from null)
• We get the type II error, and power
Error Statistics
• Fisher and N-P both fall under tools for “appraising
and bounding the probabilities of seriously
misleading interpretations of data” (Birnbaum 1970,
1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests,
resampling, randomization
Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking, significance
seeking, multiple testing, post-data subgroups,
trying and trying again—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
• Violates minimal requirement for evidence.
Severity Requirement:
• If the test had little or no capability of finding flaws
with H (even if H is incorrect), then agreement
between data x0 and H provides poor (or no)
evidence for H
(Popper: “too cheap to be worth having”)
• A claim passes with severity only to the extent that
it is subjected to, and passes, a test that it
probably would have failed, if false.
• This probability is how severely it has passed
(degree of “corroboration”)
Requires a third role for probability
Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis,
given data x0 (absolute or comparative)
(e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of
methods, coverage probabilities (frequentist,
behavioristic Neyman-Pearson, Fisher (at times))
Only probabilism is thought to be inferential or
evidential
What happened to using probability
to assess error-probing capacity?
• Neither “probabilism” nor “performance” directly
captures assessing error probing capacity
• Good long-run performance is a necessary, not a
sufficient, condition for severity
A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets
a probability boost, made comparatively firmer)
• Performance: unless it stems from a method
with low long-run error
• Probativism (severe testing): unless
something (a fair amount) has been done to
probe ways we can be wrong about C
A severe test: My weight
Informal example: To test if I’ve gained weight
between the time I left for England and my return,
I use a series of well-calibrated and stable scales,
both before leaving and upon my return.
All show a gain of over 4 lb; none shows a
difference in weighing EGEK. I infer:
H: I’ve gained at least 4 pounds
• Properties of the scales are akin to the
properties of statistical tests (performance).
• No one claims the justification is merely long
run, and can say nothing about my weight.
• We infer something about the source of the
readings from the high capability to reveal if
any scales were wrong
The severe tester is assumed to be in a
context of wanting to find things out
• I could insist all the scales are wrong—they work
fine with weighing known objects—but this would
prevent correctly finding out about weight…
• What sort of extraordinary circumstance could
cause them all to go astray just when we don’t
know the weight of the test object?
• Argument from coincidence: goes beyond being
highly improbable
Popper is to Carnap as Frequentist is to Bayes
“According to modern logical empiricist orthodoxy,
in deciding whether hypothesis h is confirmed by
evidence e, . . . we must consider only the
statements h and e, and the logical relations
[C(h,e)] between them.
It is quite irrelevant whether e was known first and
h proposed to explain it, or whether e resulted from
testing predictions drawn from h”. (Alan Musgrave
1974, 2)
Battles about roles of probability
trace to philosophies of inference
Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as
I’m using the term) the import of the data is via the
ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses vary
Comparative logic of support
• Ian Hacking (1965) “Law of Likelihood”:
x supports hypothesis H0 less well than H1 if
Pr(x;H0) < Pr(x;H1)
(Hacking rejects this in 1980)
• Any hypothesis that perfectly fits the data is
maximally likely (even if data-dredged)
• “there always is such a rival hypothesis viz., that
things just had to turn out the way they actually
did” (Barnard 1972, 129).
Error probabilities
• Pr(H0 is less well supported than H1;H0 ) is high
for some H1 or other
Hunting for significance
(nominal vs. actual)
Suppose that twenty sets of differences have
been examined, that one difference seems large
enough to test and that this difference turns out
to be ‘significant at the 5 percent level.’ … The
actual level of significance is not 5 percent,
but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy
1970!)
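Selvin's 64 percent can be reproduced under the simplifying assumption that the twenty tests are independent, each run at the nominal 5 percent level. The function name below is hypothetical.

```python
def actual_significance_level(nominal_alpha, num_tests):
    """Probability that at least one of num_tests independent tests
    reaches nominal significance when every null hypothesis is true."""
    return 1.0 - (1.0 - nominal_alpha) ** num_tests


actual = actual_significance_level(nominal_alpha=0.05, num_tests=20)
print(f"nominal 0.05, actual {actual:.2f}")  # nominal 0.05, actual 0.64
```

Reporting only the one "significant" difference keeps the nominal level at 0.05 while the actual probability of finding at least one such difference is about 0.64.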
Some accounts of evidence object:
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or…data
dredging and peeking at the data. The frequentist
solution to both problems involves adjusting the P-value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman
1999, 1010)
(Co-director, with Ioannidis, of the Meta-Research Innovation
Center at Stanford)
On the LP, error probabilities
appeal to something irrelevant
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space” (Lindley 1971, 436)
Probabilisms condition on the actual data
At odds with key way to advance
replication: 21 Word Solution
“We report how we determined our sample size,
and data exclusions (if any), all manipulations, and
all measures in the study” (Simmons, Nelson, and
Simonsohn 2012, 4).
• Replication researchers find flexibility with data-dredging
and stopping rules a major source of failed replication
(the “forking paths”, Gelman and Loken 2014)
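Why stopping rules matter to error probabilities can be illustrated with a small simulation. This is a sketch under assumptions of my own choosing: Normal data with known σ = 1, a true null μ = 0, a two-sided 0.05 test, and a researcher who peeks after every 10 observations and stops at the first nominally significant result. All names and settings are hypothetical.

```python
import random
from math import sqrt


def rejects_with_peeking(rng, looks, batch, z_crit=1.96):
    """Simulate one study under a true null, testing after each batch."""
    total, n = 0.0, 0
    for _ in range(looks):
        total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
        n += batch
        z = total / sqrt(n)  # z-statistic for H0: mu = 0 with sigma = 1
        if abs(z) > z_crit:
            return True      # stop at the first nominally significant look
    return False


def type_one_error_rate(looks, batch=10, trials=4000, seed=0):
    """Estimate the actual Type I error rate over many simulated studies."""
    rng = random.Random(seed)
    hits = sum(rejects_with_peeking(rng, looks, batch) for _ in range(trials))
    return hits / trials


single_look = type_one_error_rate(looks=1)
ten_looks = type_one_error_rate(looks=10)
print(f"one look: {single_look:.3f}, ten looks: {ten_looks:.3f}")
```

With a single look the estimated rate stays near the nominal 0.05; with ten looks it is several times larger, which is the kind of information a preregistered sampling plan makes visible.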
Many “reforms” offered as alternative
to significance tests follow the LP
• “Bayes factors [likelihood ratios] can be used in the
complete absence of a sampling plan…” (Bayarri,
Benjamin, Berger, Sellke 2016, 100)
• “It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given…. Data should be able to speak for itself.”
(Berger and Wolpert 1988, 78; authors of The
Likelihood Principle)
• No wonder reformers talk past each other
Replication Paradox
• Test Critic: It’s too easy to satisfy standard
significance thresholds
• You: Why do replicationists find it so hard to achieve
significance thresholds (with preregistration)?
• Test Critic: Obviously the initial studies were guilty
of P-hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on, adjust, and block these biasing
selection effects.
• Test Critic: Actually “reforms” recommend methods
where the need to alter P-values due to data
dredging vanishes
Probabilists can still block intuitively
unwarranted inferences
• Supplement with subjective beliefs: What do I
believe? As opposed to What is the evidence?
(Royall 1997; 2004)
• Likelihoods + prior probabilities
Problems
• Additional source of flexibility, priors as well as
biasing selection effects
• Doesn’t show what researchers had done
wrong—battle of beliefs
• The believability of data-dredged hypotheses is
what makes them so seductive
Contrast with philosophy: Bayesian
statisticians use “default” priors
“[V]irtually never would different experts give prior
distributions that even overlapped” (J. Berger 2006)
• Default priors are to be data dominant in some sense
• “The priors are not to be considered expressions of
uncertainty, ignorance, or degree of belief. [they] may
not even be probabilities…” (Cox and Mayo 2010,
299)
• No agreement on rival systems for default/non-subjective
priors (no “uninformative” priors)
Many of today’s statistics wars:
P-values vs posteriors
• The posterior probability Pr(H0|x) can be large
while the P-value is small
• To a Bayesian this shows P-values exaggerate
evidence against
• Significance testers object to highly significant
results being interpreted as no evidence against
the null, or even evidence for it! High Type 2
error
Bayes (Jeffreys)/Fisher disagreement
(“spike and smear”)
• The “P-values exaggerate” charges refer to
testing a point null hypothesis, a lump of prior
probability given to H0 (or a tiny region around 0).
Xi ~ N(μ, σ²)
H0: μ = 0 vs. H1: μ ≠ 0.
• With the rest appropriately spread over the alternative,
an α-significant result can correspond to
Pr(H0|x) = (1 – α)! (e.g., 0.95)
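The conflict can be sketched numerically. The assumptions here are mine, not the slides': a lump of prior 0.5 on H0: μ = 0, with the remaining 0.5 spread as a N(0, τ²) prior over the alternative with τ = σ. The Bayes factor in favor of H0 is then √(1 + r)·exp(−(z²/2)·r/(1 + r)) with r = nτ²/σ², where z is the observed standardized difference.

```python
from math import exp, sqrt


def posterior_of_point_null(z, n, prior_h0=0.5, tau_over_sigma=1.0):
    """Pr(H0 | x) for H0: mu = 0 under a spike-and-smear prior.

    Assumes prior_h0 mass on the point null and a N(0, tau^2) prior on mu
    under H1 (an illustrative choice); z is sqrt(n) * M / sigma.
    """
    r = n * tau_over_sigma ** 2
    # Ratio of the marginal densities of the sample mean under H0 and H1.
    bayes_factor_01 = sqrt(1.0 + r) * exp(-0.5 * z * z * r / (1.0 + r))
    posterior_odds = (prior_h0 / (1.0 - prior_h0)) * bayes_factor_01
    return posterior_odds / (1.0 + posterior_odds)


# Hold the result fixed at z = 1.96 (two-sided p of about 0.05) and vary n.
for n in (10, 100, 1000, 20000):
    print(n, round(posterior_of_point_null(z=1.96, n=n), 3))
```

Under these assumptions the posterior probability of H0 grows with n for the same just-significant z, passing 0.95 when n is around 20,000.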
“Concentrating mass on the point null hypothesis
is biasing the prior in favor of H0 as much as
possible” (Casella and R. Berger 1987, 111)
whether in 1 or 2-sided tests
Yet “spike and smear” is the basis for: “Redefine
Statistical Significance” (Benjamin et al., 2017)
Opposing megateam: Lakens et al. (2018)
• Whether tests should use a lower Type 1 error
probability is separate; the problem is
supposing there should be agreement
between quantities measuring different things
Recent Example of a battle based on
P-values disagreeing with posteriors
• If we imagine randomly selecting a hypothesis
from an urn of nulls 90% of which are true
• Consider just 2 possibilities: H0: no effect
H1: meaningful effect, all else ignored,
• Take the prevalence of 90% as
Pr(H0) = 0.9, Pr(H1)= 0.1
• Reject H0 with a single (just) 0.05 significant result,
with cherry-picking, selection effects
Then it can be shown most “findings” are false
Diagnostic Screening (DS) Model
• Pr(H0|Test T rejects H0 ) > 0.5
really: prevalence of true nulls among those
rejected at the 0.05 level > 0.5.
Call this: False Finding rate FFR
• Pr(Test T rejects H0 | H0 ) = 0.05
Criticism: N-P Type I error probability ≠ FFR
(Ioannidis 2005, Colquhoun 2014)
DS testers see this as a major
criticism of tests
• But there are major confusions
• Pr(H0|Test T rejects H0 ) is not a Type I error
probability.
• Transposes conditional
• Combines crude performance with a probabilist
assignment (true to neither Bayesians nor error
statisticians)
• OK in certain screening contexts (genomics)
FFR: False Finding Rate
With Prev(H0) = .9, α = 0.05 and (1 – β) = .8: FFR = 0.36, the PPV = .64
PPV
• Complement of FFR: the positive predictive value
PPV
Pr(H1|Test T rejects H0)
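The FFR and PPV figures quoted above follow from Bayes' theorem applied to the urn of hypotheses. A minimal sketch, with a hypothetical function name:

```python
def false_finding_rate(prev_h0, alpha, power):
    """Pr(H0 | test rejects H0) in the diagnostic screening model."""
    false_rejections = prev_h0 * alpha          # true nulls that get rejected
    true_rejections = (1.0 - prev_h0) * power   # real effects that get detected
    return false_rejections / (false_rejections + true_rejections)


ffr = false_finding_rate(prev_h0=0.9, alpha=0.05, power=0.8)
ppv = 1.0 - ffr
print(f"FFR = {ffr:.2f}, PPV = {ppv:.2f}")  # FFR = 0.36, PPV = 0.64
```

Note that the inputs are a prevalence and two error probabilities of the test; the output is a rate over the urn, not an error probability of the test applied to a particular hypothesis.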
What’s Pr(H1) (i.e., Prev(H1))?
“Proportion of experiments we do over a lifetime in
which there is a real effect” (Colquhoun 2014, 9)
Proportion of true relationships among those
tested in a field (Ioannidis 2005, 0696)
Hypotheses can be individuated in many ways
Probabilistic Instantiation Fallacy
• Pr(the randomly selected null hypothesis is true) = .9
• The randomly selected null hypothesis is H51
• Pr(H51 is true) = .9
Each Hi either is true or not!
(It could have a genuine frequentist prior but it
wouldn’t equal .9)
Is the PPV (complement of the FFR)
relevant to what’s wanted?
Crud Factor. In many fields of social science it’s
thought nearly everything is related to everything:
“all nulls false”.
It also promotes the “stay safe” idea.
Some Bayesians reject probabilism
(Gelman: Falsificationist Bayesian;
Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as ‘error probes’
in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call ‘pure significance testing’, in
which certain of the model’s implications are
compared directly to the data.” (Gelman and Shalizi
2013, 10, 20).
• Can’t also jump on the “abandon significance/
don’t use P-value thresholds” bandwagon
• If there’s no threshold, there’s no falsification,
and no tests
• Granted P-values don’t give effect sizes
The severe tester reformulates
tests with a discrepancy γ from H0
• Severity function: SEV(Test T, data x, claim C)
• Instead of a binary cut-off (significant or not)
the particular outcome is used to infer
discrepancies that are or are not warranted
To avoid Fallacies of Rejection
(e.g., magnitude error)
Testing the mean of a Normal distribution:
H0: μ ≤ 0 vs. H1: μ > 0
• If you very probably would have observed a more
impressive (smaller) P-value if μ = μ1 (μ1 = μ0 + γ);
the data are poor evidence that
μ > μ1.
Relation to a Test’s Power:
Let M be the sample mean (a random variable) and
M0 its observed value
• Say M just reaches statistical significance at level
P, say 0.025, and compute power in relation to
this cut-off
• If the power against μ1 is high then the data are
poor evidence that
μ > μ1.
Power vs Severity for μ > μ1
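The contrast can be sketched for the one-sided Normal test H0: μ ≤ 0 vs. H1: μ > 0 with known σ, taking SEV(μ > μ1) = Pr(M ≤ M0; μ = μ1) and POW(μ1) = Pr(M ≥ cut-off; μ = μ1). The values σ = 1, n = 100 and the grid of μ1 values are illustrative assumptions, as are the function names.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def power(mu1, mu0, sigma, n, alpha):
    """POW(mu1): probability the test rejects H0 when mu = mu1."""
    se = sigma / sqrt(n)
    cutoff = mu0 + Z.inv_cdf(1.0 - alpha) * se
    return 1.0 - Z.cdf((cutoff - mu1) / se)


def severity_greater(m_obs, mu1, sigma, n):
    """SEV(mu > mu1): probability of a less impressive result were mu = mu1."""
    return Z.cdf((m_obs - mu1) * sqrt(n) / sigma)


mu0, sigma, n, alpha = 0.0, 1.0, 100, 0.025
# The observed mean just reaches significance, so it equals the cut-off.
m_obs = mu0 + Z.inv_cdf(1.0 - alpha) * sigma / sqrt(n)

rows = [(mu1, power(mu1, mu0, sigma, n, alpha), severity_greater(m_obs, mu1, sigma, n))
        for mu1 in (0.0, 0.1, 0.2, 0.3)]
for mu1, pow_, sev in rows:
    print(f"mu1 = {mu1:.1f}  POW = {pow_:.3f}  SEV(mu > mu1) = {sev:.3f}")
```

For a just-significant result the two columns sum to 1: the higher the power against μ1, the lower the severity for inferring μ > μ1.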
Similarly, severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from a larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the one that goes off
with burnt toast]
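The sample-size point can be checked with the same one-sided Normal test (H0: μ ≤ 0, known σ). For a result that just reaches significance at level α, the observed mean is z_α·σ/√n, so SEV(μ > μ1) = Φ(z_α − μ1√n/σ). The choices μ1 = 0.1, σ = 1 and the sample sizes are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def severity_just_significant(mu1, n, sigma=1.0, alpha=0.025):
    """SEV(mu > mu1) when the sample mean just reaches significance at alpha."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    # Observed mean is z_alpha * sigma / sqrt(n); standardize it against mu1.
    return NormalDist().cdf(z_alpha - mu1 * sqrt(n) / sigma)


sample_sizes = (25, 100, 400, 1600)
sev_by_n = [severity_just_significant(mu1=0.1, n=n) for n in sample_sizes]
for n, sev in zip(sample_sizes, sev_by_n):
    print(f"n = {n:5d}  SEV(mu > 0.1) = {sev:.3f}")
```

The same nominal significance level warrants the discrepancy 0.1 with high severity at n = 25 and with very low severity at n = 1600.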
• Anyone who equates severity and power has it
backwards
• The one time they could be equal is if M just
misses statistical significance and we want to
assess claims of the form
μ < μ1, μ < μ2, μ < μ3, …, μ < μk, …
• Then SEV(μ < μk) = POW(μk)
We avoid fallacies of
non-significant results?
• They don’t warrant 0 discrepancy
• Not uninformative; can find upper bounds μ1 for which
SEV(μ < μ1) is high
• It’s with negative results (P-values not small)
that severity goes in the same direction as
power
–provided model assumptions hold
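A sketch of setting upper bounds from a non-significant result, again for the one-sided Normal test with known σ, using SEV(μ < μ1) = Pr(M > M0; μ = μ1). The observed mean 0.1 with σ = 1 and n = 100 (one standard error above 0, so not significant) is a hypothetical example.

```python
from math import sqrt
from statistics import NormalDist


def severity_less(m_obs, mu1, sigma, n):
    """SEV(mu < mu1): probability of a larger observed mean were mu = mu1."""
    return 1.0 - NormalDist().cdf((m_obs - mu1) * sqrt(n) / sigma)


sigma, n, m_obs = 1.0, 100, 0.1
bounds = {mu1: severity_less(m_obs, mu1, sigma, n) for mu1 in (0.0, 0.1, 0.2, 0.3)}
for mu1, sev in bounds.items():
    print(f"SEV(mu < {mu1:.1f}) = {sev:.3f}")
```

The claim μ < 0.3 passes with high severity, while μ < 0 does not: the negative result rules out large discrepancies without warranting a zero discrepancy.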
FEV: Frequentist Principle of Evidence; Mayo
and Cox (2006); SEV: Mayo 1991, Mayo and
Spanos (2006)
FEV/SEV A small P-value indicates discrepancy γ
from H0, only if there is a high probability the test
would have resulted in a larger P-value were a
discrepancy as large as γ absent.
FEV/SEV A moderate P-value indicates the absence
of a discrepancy γ from H0, only if there is a high
probability the test would have given a worse fit with
H0 (i.e., a smaller P-value) were a discrepancy γ
present.
The 2019: Don’t say ‘significance’,
don’t use P-value thresholds
• Editors of the March 2019 issue of TAS, "A World
Beyond p < 0.05"—Wasserstein, Schirm, Lazar—
say "declarations of ‘statistical significance’ be
abandoned" (p. 2).
• On their view: Prespecified P-value thresholds
should never be used in interpreting results.
• It is not just a word ban but a gatekeeper ban
“Retiring statistical significance
would give bias a free pass”.
John Ioannidis (2019)
“...potential for falsification is a prerequisite for
science. Fields that obstinately resist refutation
can hide behind the abolition of statistical
significance but risk becoming self-ostracized
from the remit of science”.
I agree, and in Mayo (2019) I show why.
• Complying with the “no threshold” view precludes
the FDA's long-established drug review
procedures, as Wasserstein et al. (2019) recognize
• They think by removing P-value thresholds,
researchers lose an incentive to data dredge, and
otherwise exploit researcher flexibility
• Even if true, it's a bad argument.
(Decriminalizing robbery results in fewer robbery arrests)
• But it's not true.
• Even without the word significance, eager
researchers still can’t take the large (non-significant)
P-value to indicate a genuine effect
• It would be to say: Even though larger differences
would frequently occur by chance variability alone,
my data provide evidence they are not due to
chance variability
• In short, he would still need to report a reasonably
small P-value
• The eager investigator will need to "spin" his
results, ransack, data dredge
• In a world without predesignated thresholds, it
would be hard to hold him accountable for
reporting a nominally small P-value:
• “whether a p-value passes any arbitrary threshold
should not be considered at all" in interpreting
data (Wasserstein et al. 2019, 2)
No tests, no falsification
• The “no thresholds” view also blocks common uses
of confidence intervals and Bayes factor standards
• If you cannot say about any results, ahead of time,
they will not be allowed to count in favor of a claim,
then you do not have a test of it
• Don’t confuse having a threshold for a terrible test
with using a fixed P-value across all studies in an
unthinking manner
• We should reject the latter
New ASA Task Force on
Significance Tests and Replication
• To “prepare a …piece reflecting “good statistical
practice,” without leaving the impression that p-values
and hypothesis tests…have no
role.” (Karen Kafadar 2019, 4)
• I hope that philosophers (of science and of
knowledge) get involved!
References
• Barnard, G. (1972). “The Logic of Statistical Inference (Review of ‘The Logic of Statistical Inference’ by
Ian Hacking)”, British Journal for the Philosophy of Science 23(2), 123–32.
• Bartlett, T. (2014). “Replication Crisis in Psychology Research Turns Ugly and Odd”, The Chronicle of
Higher Education (online) June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A
Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology 72: 90–103.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). “Redefine Statistical Significance”, Nature
Human Behaviour 2, 6–10.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph
Series. Hayward, CA: Institute of Mathematical Statistics.
• Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237)
(March 14): 1033.
• Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided
Testing Problem”, Journal of the American Statistical Association 82(397), 106-11.
• Colquhoun, D. (2014). “An Investigation of the False Discovery Rate and the Misinterpretation of P-values”,
Royal Society Open Science 1(3), 140216.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” Error and
Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality
of Science. Mayo and Spanos (eds.), pp. 276–304. CUP.
• FDA (U. S. Food and Drug Administration) (2017). “Multiple Endpoints in Clinical Trials: Guidance for
Industry (DRAFT GUIDANCE).” Retrieved from https://www.fda.gov/media/102657/download
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and Boyd.
• Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”, American Scientist 2, 460-65.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder”
British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80.
• Glymour, C. (2010). "Explanation and Truth". In Mayo, D. and Spanos, A. (eds.) Error and Inference:
Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science,
pp. 331–50. CUP.
• Goodman, S. N. (1999). “Toward evidence-based medical statistics. 2: The Bayes factor,” Annals of
Internal Medicine 130: 1005–1013.
• Hacking, I. (1965). Logic of Statistical Inference. CUP.
• Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, in Mellor, D.
(ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, pp. 141–60. CUP.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS Medicine 2(8), 0696–0701.
• Ioannidis, J. (2019). “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not
Abandon Significance.” JAMA 321(21): 2067–2068. doi:10.1001/jama.2019.4582
• Kafadar, K. (2019). “The Year in Review … And More to Come,” President's Corner, AmStat News, (Issue
510), (Dec. 2019).
• Lakens, D., et al. (2018). “Justify your Alpha”, Nature Human Behavior 2, 168–71.
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.),
Foundations of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1991). “Novel Evidence and Severe Tests,” Philosophy of Science, 58 (4): 523-552.
Reprinted (1991) in The Philosopher’s Annual XIV: 203-232.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual
Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,
Cambridge: Cambridge University Press.
• Mayo, D. (2019). “P-value thresholds: Forfeit at your peril”. European Journal of Clinical Investigation,
49, e13170. https://doi.org/10.1111/eci.13170
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo,
J. (ed.) The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series,
49, pp. 247-275. Institute of Mathematical Statistics.
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson
Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
• Meehl, P. (1990). "Why Summaries of Research on Psychological Theories Are Often Uninterpretable",
Psychological Reports 66(1): 195–244.
• Meehl, P. and Waller, N. (2002). "The Path Analysis Controversy: A New Statistical Approach to Strong
Appraisal of Verisimilitude", Psychological Methods 7(3): 283–300.
• Morrison, D. E., and R. E. Henkel, (eds.), (1970). The Significance Test Controversy: A Reader.
Chicago: Aldine De Gruyter.
• Musgrave, A. (1974). “Logical versus Historical Theories of Confirmation”, BJPS 25(1), 1–23.
• Neyman, J. and Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical
Hypotheses." Philosophical Transactions of the Royal Society of London Series A 231, 289–337.
Reprinted in Joint Statistical Papers, 140–85.
• Neyman, J. and Pearson, E.S. (1967). Joint Statistical Papers of J. Neyman and E. S. Pearson.
Cambridge: Cambridge University Press.
• Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”,
Science 349(6251), 943–51.
• Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J.
Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press).
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL: Chapman and Hall,
CRC press.
• Royall, R. (2004). “The Likelihood Paradigm for Statistical Evidence” and “Rejoinder”. In Taper, M.
and Lele, S. (eds.) The Nature of Scientific Evidence, pp. 119–138; 145–151. Chicago: University of
Chicago Press.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
• Selvin, H. (1970). “A critique of tests of significance in survey research”. In The significance test
controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
• Simmons, J. Nelson, L. and Simonsohn, U. (2012) “A 21 word solution”, Dialogue: 26(2), 4–7.
• Wasserstein, R., Schirm, A. & Lazar, N. (2019). Moving to a World Beyond “p < 0.05” [Editorial]. The
American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913

Meeting #1 Slides Phil 6334/Econ 6614 SP2019Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
jemille6
 
The Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and CasualtiesThe Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and Casualties
jemille6
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Wars
jemille6
 
Mayod@psa 21(na)
Mayod@psa 21(na)Mayod@psa 21(na)
Mayod@psa 21(na)
DeborahMayo4
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
jemille6
 
D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1
jemille6
 
D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500
jemille6
 
“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”
jemille6
 
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
jemille6
 
Mayo &amp; parker spsp 2016 june 16
Mayo &amp; parker   spsp 2016 june 16Mayo &amp; parker   spsp 2016 june 16
Mayo &amp; parker spsp 2016 june 16
jemille6
 
Replication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden ControversiesReplication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden Controversies
jemille6
 
Severe Testing: The Key to Error Correction
Severe Testing: The Key to Error CorrectionSevere Testing: The Key to Error Correction
Severe Testing: The Key to Error Correction
jemille6
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
jemille6
 
D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learning
jemille6
 
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
jemille6
 
Phil 6334 Mayo slides Day 1
Phil 6334 Mayo slides Day 1Phil 6334 Mayo slides Day 1
Phil 6334 Mayo slides Day 1
jemille6
 
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and FalsificationP-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
jemille6
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversy
jemille6
 
D. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severelyD. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severely
jemille6
 
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo: Replication Research Under an Error Statistical Philosophy D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo: Replication Research Under an Error Statistical Philosophy
jemille6
 

Similar to Philosophy of Science and Philosophy of Statistics (20)

Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
 
The Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and CasualtiesThe Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and Casualties
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Wars
 
Mayod@psa 21(na)
Mayod@psa 21(na)Mayod@psa 21(na)
Mayod@psa 21(na)
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
 
D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1
 
D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500
 
“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”
 
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
 
Mayo &amp; parker spsp 2016 june 16
Mayo &amp; parker   spsp 2016 june 16Mayo &amp; parker   spsp 2016 june 16
Mayo &amp; parker spsp 2016 june 16
 
Replication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden ControversiesReplication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden Controversies
 
Severe Testing: The Key to Error Correction
Severe Testing: The Key to Error CorrectionSevere Testing: The Key to Error Correction
Severe Testing: The Key to Error Correction
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
 
D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learning
 
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
 
Phil 6334 Mayo slides Day 1
Phil 6334 Mayo slides Day 1Phil 6334 Mayo slides Day 1
Phil 6334 Mayo slides Day 1
 
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and FalsificationP-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversy
 
D. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severelyD. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severely
 
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo: Replication Research Under an Error Statistical Philosophy D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo: Replication Research Under an Error Statistical Philosophy
 

More from jemille6

D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdf
jemille6
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdf
jemille6
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
jemille6
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inference
jemille6
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?
jemille6
 
What's the question?
What's the question? What's the question?
What's the question?
jemille6
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metascience
jemille6
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
jemille6
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Two
jemille6
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
jemille6
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testing
jemille6
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredging
jemille6
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
jemille6
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
jemille6
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)
jemille6
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)
jemille6
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
jemille6
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
jemille6
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
jemille6
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
jemille6
 

More from jemille6 (20)

D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdf
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdf
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inference
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?
 
What's the question?
What's the question? What's the question?
What's the question?
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metascience
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Two
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testing
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredging
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
 

Recently uploaded

How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
TechSoup
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
goswamiyash170123
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
JEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questionsJEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questions
ShivajiThube2
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
Israel Genealogy Research Association
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 

Recently uploaded (20)

How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
JEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questionsJEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questions
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 

Philosophy of Science and Philosophy of Statistics

  • 1. Philosophy of Science and Philosophy of Statistics. Deborah G Mayo, Dept of Philosophy, Virginia Tech. APA Central: Epistemology Meets Philosophy of Statistics, February 27, 2020
  • 2. What is the Philosophy of Statistics (PhilStat)? At one level, statisticians and philosophers of science ask many of the same questions: What should be observed and what may justifiably be inferred from the resulting data? What is a good test? How can spurious relationships be distinguished from genuine regularities? From causal regularities?
  • 3. • These very general questions are entwined with long-standing debates in philosophy of science • No wonder the field of statistics tends to cross over so often into philosophical territory.
  • 4. Statistics → Philosophy (1) Model Scientific Inference: capture the actual or rational ways to arrive at evidence and inference (2) Solve Philosophical Problems about scientific inference, observation, experiment (3) Metamethodology: Analyze intuitive rules (e.g., novelty, simplicity)
  • 5. Formal Epistemology? Could be • Phil Stat • Analytic epistemology with probabilities “Bayesian statistics is one thing.. Bayesian epistemology is something else. The idea of putting probabilities over hypotheses delivered to philosophy a godsend, an entire package of superficiality.” (Glymour 2010, 334)
  • 6. • His worry: starting with an intuitive principle, epistemologists reconstruct it with a probabilistic confirmation logic. • You haven’t shown, for example, that beliefs ought to go up with varied evidence; you have only represented it probabilistically • I don't knock rational reconstruction using probability, and analogs of some puzzles arise in statistics (tacking paradox, old evidence)
  • 7. Philosophy → Statistics • Central job: minister to scientists’ conceptual, logical and methodological discomforts • Despite technical sophistication, basic concepts of statistical inference and modeling are more unsettled than ever.
  • 8. My Interest: Philosophy in Statistics Wars
  • 9. Statistical Crisis in Science • In many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive-looking findings even when spurious. • We set sail with a simple tool: a minimal requirement for evidence • Sufficiently general to apply to any methods now in use
  • 10. Statistical reforms • Several are welcome: preregistration, avoidance of cookbook statistics, calls for more replication research • Others are quite radical, and even violate our minimal principle of evidence • Combating paradoxical, self-defeating “reforms” requires a mix of statistics, philosophy, and history
  • 11. Most often used tools are most criticized “Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. …” (Ioannidis 2005, 696) Do researchers do that?
  • 12. R.A. Fisher “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14)
  • 13. Simple significance tests (Fisher) “to test the conformity of the particular data under analysis with H0 in some respect: … we find a function d(X) of the data, the test statistic, such that • the larger the value of d(X) the more inconsistent are the data with H0; • d(X) has a known probability distribution when H0 is true. … the p-value corresponding to any d(x) (or d_obs): p = p(d) = Pr(d(X) ≥ d(x); H0)” (Mayo and Cox 2006, 81; d for t, x for y)
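The p-value definition on this slide can be made concrete with a small sketch. It assumes, purely for illustration, a one-sided test of a Normal mean with known σ, so that d(X) = √n (M − μ0)/σ is standard Normal under H0; the function name `p_value` and the numbers are hypothetical and not taken from the slides.

```python
from statistics import NormalDist


def p_value(sample_mean, mu0, sigma, n):
    """Pr(d(X) >= d_obs; H0) for a one-sided test of a Normal mean, sigma known."""
    # Test statistic: standardized distance of the sample mean from mu0.
    d_obs = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Under H0, d(X) ~ N(0, 1), so the p-value is the upper tail area.
    return 1.0 - NormalDist().cdf(d_obs)


# Illustrative numbers (assumed): mean 0.4 from n = 25, sigma = 1, so d_obs = 2.
p = p_value(sample_mean=0.4, mu0=0.0, sigma=1.0, n=25)
print(f"p-value = {p:.4f}")
```

A larger d_obs gives a smaller p-value, matching the requirement that larger values of d(X) indicate greater inconsistency with H0.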
  • 14. Testing reasoning • If even larger differences than d_obs occur fairly frequently under H0 (i.e., P-value not small), there’s no evidence of incompatibility with H0 • A small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than d_obs were H0 true. • This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H* (Stat-Sub fallacy: H → H*)
  • 15. Neyman-Pearson (N-P) tests (1933): A test (null) and alternative hypotheses H0, H1 that are exhaustive: H0: μ ≤ 0 vs. H1: μ > 0. Philosophers should adopt the language of statistics, e.g., Xi ~ N(μ, σ²)
  • 16. Neyman-Pearson (N-P) tests (1933): • This fallacy of rejection H1 → H* is impossible • Rejecting H0 only indicates statistical alternatives H1 (how discrepant from null) • We get the type II error, and power
  • 17. Error Statistics • Fisher and N-P both fall under tools for “appraising and bounding the probabilities of seriously misleading interpretations of data” (Birnbaum 1970, 1033): error probabilities • I place all under the rubric of error statistics • Confidence intervals, N-P and Fisherian tests, resampling, randomization
  • 18. Both Fisher & N-P: it’s easy to lie with biasing selection effects • Sufficient finagling (cherry-picking, significance seeking, multiple testing, post-data subgroups, trying and trying again) may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence • Violates minimal requirement for evidence.
  • 19. Severity Requirement: • If the test had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H (Popper: “too cheap to be worth having”) • A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false. • This probability is how severely it has passed (degree of “corroboration”)
  • 20. Requires a third role for probability. Probabilism: To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (absolute or comparative) (e.g., Bayesian, likelihoodist, Fisher (at times)). Performance: Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher (at times)). Only probabilism is thought to be inferential or evidential
  • 21. What happened to using probability to assess error-probing capacity? • Neither “probabilism” nor “performance” directly captures assessing error-probing capacity • Good long-run performance is a necessary, not a sufficient, condition for severity
  • 22. A claim C is not warranted _______ • Probabilism: unless C is true or probable (gets a probability boost, made comparatively firmer) • Performance: unless it stems from a method with low long-run error • Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
  • 23. A severe test: My weight. Informal example: To test if I’ve gained weight between the time I left for England and my return, I use a series of well-calibrated and stable scales, both before leaving and upon my return. All show an over 4 lb gain; none shows a difference in weighing EGEK. I infer: H: I’ve gained at least 4 pounds
  • 24. • Properties of the scales are akin to the properties of statistical tests (performance). • No one claims the justification is merely long run, and can say nothing about my weight. • We infer something about the source of the readings from the high capability to reveal if any scales were wrong
  • 25. The severe tester is assumed to be in a context of wanting to find things out • I could insist all the scales are wrong (they work fine with weighing known objects) but this would prevent correctly finding out about weight… • What sort of extraordinary circumstance could cause them all to go astray just when we don’t know the weight of the test object? • Argument from coincidence: goes beyond being highly improbable
  • 26. Popper : Carnap as Frequentist : Bayes “According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h”. (Alan Musgrave 1974, 2) Battles about roles of probability trace to philosophies of inference
  • 27. Likelihood Principle (LP) In logics of induction, like probabilist accounts (as I’m using the term), the import of the data is via the ratios of likelihoods of hypotheses Pr(x0;H0)/Pr(x0;H1). The data x0 are fixed, while the hypotheses vary
  • 28. Comparative logic of support • Ian Hacking (1965) “Law of Likelihood”: x supports hypothesis H0 less well than H1 if Pr(x;H0) < Pr(x;H1) (he rejects it in 1980) • Any hypothesis that perfectly fits the data is maximally likely (even if data-dredged) • “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129).
  • 29. Error probabilities • Pr(H0 is less well supported than H1; H0) is high for some H1 or other
  • 30. Hunting for significance (nominal vs. actual) “Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ … The actual level of significance is not 5 percent, but 64 percent!” (Selvin 1970, 104) (Morrison & Henkel’s Significance Test Controversy 1970!)
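Selvin's 64 percent figure can be checked with a short calculation. The sketch below assumes the twenty tests are independent and each is run at the nominal 5 percent level (independence is an assumption made here for illustration); the function name `actual_level` is hypothetical.

```python
def actual_level(alpha, k):
    """Probability that at least one of k independent tests at level alpha
    is nominally significant when every null hypothesis is true."""
    # Chance that all k tests stay non-significant is (1 - alpha) ** k.
    return 1.0 - (1.0 - alpha) ** k


nominal = 0.05
for k in (1, 5, 20):
    print(f"k = {k:2d}: actual level = {actual_level(nominal, k):.2f}")
```

With k = 20 the actual level is roughly 0.64, which is why reporting only the one selected difference at its nominal level misstates the error probability.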
  • 31. Some accounts of evidence object: “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or…data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010) (Co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford)
  • 32. On the LP, error probabilities appeal to something irrelevant “Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley 1971, 436) Probabilisms condition on the actual data
  • 33. At odds with key way to advance replication: 21 Word Solution “We report how we determined our sample size, and data exclusions (if any), all manipulations, and all measures in the study” (Simmons, Nelson, and Simonsohn 2012, 4). • Replication researchers find flexibility with data-dredging and stopping rules a major source of failed replication (the “forking paths”, Gelman and Loken 2014)
  • 34. Many “reforms” offered as alternative to significance tests follow the LP • “Bayes factors [likelihood ratios] can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, 100) • “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself.” (Berger and Wolpert 1988, 78; authors of The Likelihood Principle) • No wonder reformers talk past each other
  • 35. Replication Paradox • Test Critic: It’s too easy to satisfy standard significance thresholds • You: Why do replicationists find it so hard to achieve significance thresholds (with preregistration)? • Test Critic: Obviously the initial studies were guilty of P-hacking, cherry-picking, data-dredging (QRPs) • You: So, the replication researchers want methods that pick up on, adjust, and block these biasing selection effects. • Test Critic: Actually “reforms” recommend methods where the need to alter P-values due to data dredging vanishes
  • 36. Probabilists can still block intuitively unwarranted inferences • Supplement with subjective beliefs: What do I believe? As opposed to What is the evidence? (Royall 1997; 2004) • Likelihoods + prior probabilities
  • 37. Problems • Additional source of flexibility, priors as well as biasing selection effects • Doesn’t show what researchers had done wrong: it becomes a battle of beliefs • The believability of data-dredged hypotheses is what makes them so seductive
  • 38. Contrast with philosophy: Bayesian statisticians use “default” priors “[V]irtually never would different experts give prior distributions that even overlapped” (J. Berger 2006) • Default priors are to be data dominant in some sense • “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. [They] may not even be probabilities…” (Cox and Mayo 2010, 299) • No agreement on rival systems for default/non-subjective priors (no “uninformative” priors)
  • 39. Many of today’s statistics wars: P-values vs posteriors • The posterior probability Pr(H0|x) can be large while the P-value is small • To a Bayesian this shows P-values exaggerate evidence against H0 • Significance testers object to highly significant results being interpreted as no evidence against the null, or even evidence for it! High Type II error
  • 40. Bayes (Jeffreys)/Fisher disagreement (“spike and smear”) • The “P-values exaggerate” charges refer to testing a point null hypothesis, with a lump of prior probability given to H0 (or a tiny region around 0). Xi ~ N(μ, σ²), H0: μ = 0 vs. H1: μ ≠ 0. • With the rest appropriately spread over the alternative, an α-significant result can correspond to Pr(H0|x) = (1 – α)! (e.g., 0.95)
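The spike-and-smear disagreement can be illustrated numerically. The sketch below is one common textbook setup, assumed here for illustration rather than taken from the slides: prior mass 0.5 on the point null μ = 0, with the remaining 0.5 spread as a N(0, τ²) prior over the alternative. The Bayes factor in favour of H0 then has a closed form in the standardized statistic z = √n M/σ. The function name `posterior_h0` and the choice τ = σ = 1 are hypothetical.

```python
from math import exp, sqrt


def posterior_h0(z, n, tau=1.0, sigma=1.0, prior_h0=0.5):
    """Pr(H0 | x) for H0: mu = 0 with a 'spike' prior_h0 on the null and a
    N(0, tau**2) 'smear' on the alternative (assumed setup, sigma known)."""
    r = n * tau ** 2 / sigma ** 2
    # Ratio of marginal densities of the sample mean under H0 and under H1.
    bf01 = sqrt(1.0 + r) * exp(-0.5 * z ** 2 * r / (1.0 + r))
    posterior_odds = bf01 * prior_h0 / (1.0 - prior_h0)
    return posterior_odds / (1.0 + posterior_odds)


# Hold the result fixed at z = 1.96 (two-sided P-value 0.05) and vary n.
for n in (10, 100, 1000, 20000):
    print(f"n = {n:6d}: Pr(H0 | x) = {posterior_h0(1.96, n):.2f}")
```

Under these assumptions the posterior probability of H0 grows with n even though the P-value stays at 0.05, and for large enough n it exceeds 0.95. The result depends on the spike and on the chosen spread τ, which is the point at issue between the two camps.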
  • 41. “Concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (Casella and R. Berger 1987, 111) whether in 1 or 2-sided tests. Yet “spike and smear” is the basis for: “Redefine Statistical Significance” (Benjamin et al., 2017)
  • 42. Opposing megateam: Lakens et al. (2018) • Whether tests should use a lower Type I error probability is a separate issue; the problem is supposing there should be agreement between quantities measuring different things
  • 43. Recent example of a battle based on P-values disagreeing with posteriors • Imagine randomly selecting a hypothesis from an urn of nulls, 90% of which are true • Consider just 2 possibilities: H0: no effect, H1: meaningful effect, all else ignored • Take the prevalence of 90% as Pr(H0) = 0.9, Pr(H1) = 0.1 • Reject H0 with a single (just) 0.05 significant result, with cherry-picking, selection effects. Then it can be shown most “findings” are false
  • 44. Diagnostic Screening (DS) Model • Pr(H0|Test T rejects H0) > 0.5 really: prevalence of true nulls among those rejected at the 0.05 level > 0.5. Call this the False Finding Rate (FFR) • Pr(Test T rejects H0 | H0) = 0.05 Criticism: N-P Type I error probability ≠ FFR (Ioannidis 2005, Colquhoun 2014)
  • 45. DS testers see this as a major criticism of tests • But there are major confusions • Pr(H0|Test T rejects H0) is not a Type I error probability. • Transposes conditional • Combines crude performance with a probabilist assignment (true to neither Bayesians nor error statisticians) • OK in certain screening contexts (genomics)
  • 46. FFR: False Finding Rate: with Prev(H0) = .9, α = 0.05 and (1 – β) = .8, FFR = 0.36, the PPV = .64
  • 47. PPV • Complement of FFR: the positive predictive value PPV = Pr(H1|Test T rejects H0)
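The FFR and PPV figures on the two slides above follow from Bayes' theorem applied to the urn of hypotheses. A minimal sketch, using the slide's numbers (prevalence of true nulls 0.9, α = 0.05, power 0.8); the function name `ffr_and_ppv` is hypothetical.

```python
def ffr_and_ppv(prev_h0, alpha, power):
    """False Finding Rate Pr(H0 | reject) and its complement, the PPV,
    under the diagnostic screening model."""
    false_rejections = prev_h0 * alpha          # true nulls that get rejected
    true_rejections = (1.0 - prev_h0) * power   # real effects that get detected
    ffr = false_rejections / (false_rejections + true_rejections)
    return ffr, 1.0 - ffr


ffr, ppv = ffr_and_ppv(prev_h0=0.9, alpha=0.05, power=0.8)
print(f"FFR = {ffr:.2f}, PPV = {ppv:.2f}")
```

Note that the calculation treats α and power as fixed properties of every test in the urn and takes the prevalence as a prior for each hypothesis, which are exactly the moves the surrounding slides call into question.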
  • 48. What’s Pr(H1) (i.e., Prev(H1))? “Proportion of experiments we do over a lifetime in which there is a real effect” (Colquhoun 2014, 9) Proportion of true relationships among those tested in a field (Ioannidis 2005, 0696) Hypotheses can be individuated in many ways
  • 49. Probabilistic Instantiation Fallacy • Pr(the randomly selected null hypothesis is true) = .9 • The randomly selected null hypothesis is H51 • Pr(H51 is true) = .9 Each Hi is either true or not! (It could have a genuine frequentist prior but it wouldn’t equal .9)
  • 50. Is the PPV (complement of the FFR) relevant to what’s wanted? Crud Factor. In many fields of social science it’s thought nearly everything is related to everything: “all nulls false”. It also promotes the “stay safe” idea.
  • 51. Some Bayesians reject probabilism (Gelman: Falsificationist Bayesian; Shalizi: error statistician) “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense” “[W]hat we are advocating, then, is what Cox and Hinkley (1974) call ‘pure significance testing’, in which certain of the model’s implications are compared directly to the data.” (Gelman and Shalizi 2013, 10, 20).
  • 52. • Can’t also jump on the “abandon significance/don’t use P-value thresholds” bandwagon • If there’s no threshold, there’s no falsification, and no tests • Granted P-values don’t give effect sizes
  • 53. The severe tester reformulates tests with a discrepancy γ from H0 • Severity function: SEV(Test T, data x, claim C) • Instead of a binary cut-off (significant or not) the particular outcome is used to infer discrepancies that are or are not warranted
  • 54. To avoid Fallacies of Rejection (e.g., magnitude error) Testing the mean of a Normal distribution: H0: μ ≤ 0 vs. H1: μ > 0 • If you very probably would have observed a more impressive (smaller) P-value if μ = μ1 (μ1 = μ0 + γ), the data are poor evidence that μ > μ1.
  • 55. Relation to a Test’s Power: Let M be the sample mean (a random variable) and M0 its observed value • Say M just reaches statistical significance at level P, say 0.025; and compute power in relation to this cut-off • If the power against μ1 is high then the data are poor evidence that μ > μ1.
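The relation between severity and power for a just-significant result can be sketched numerically. The code assumes the one-sided Normal test H0: μ ≤ 0 vs. H1: μ > 0 with known σ, and takes SEV(μ > μ1) = Pr(M ≤ M0; μ = μ1), the definition used in Mayo's severity account; the function names and the choice σ = 1, n = 100 are illustrative assumptions.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def severity_greater(m0, mu1, sigma, n):
    """SEV(mu > mu1) = Pr(M <= m0; mu = mu1) for observed sample mean m0."""
    se = sigma / n ** 0.5
    return Z.cdf((m0 - mu1) / se)


def power(mu1, cutoff, sigma, n):
    """POW(mu1) = Pr(M >= cutoff; mu = mu1)."""
    se = sigma / n ** 0.5
    return 1.0 - Z.cdf((cutoff - mu1) / se)


sigma, n = 1.0, 100                  # standard error 0.1 (assumed values)
cutoff = 1.96 * sigma / n ** 0.5     # M just reaches significance at 0.025
for mu1 in (0.0, 0.1, 0.2, 0.3, 0.4):
    sev = severity_greater(cutoff, mu1, sigma, n)
    pw = power(mu1, cutoff, sigma, n)
    print(f"mu1 = {mu1:.1f}: SEV(mu > mu1) = {sev:.3f}, POW(mu1) = {pw:.3f}")
```

When the observed mean sits exactly at the cut-off, the two columns sum to one: the higher the power against μ1, the lower the severity for inferring μ > μ1.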
  • 56. Power vs Severity for μ > μ1 [figure not reproduced]
  • 57. Similarly, severity tells us: • an α-significant difference indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2) • What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one that doesn’t go off unless the house is fully ablaze? • [The larger sample size is like the one that goes off with burnt toast]
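The sample-size point can be checked with the same severity calculation. The sketch assumes H0: μ ≤ 0 vs. H1: μ > 0 with σ = 1, holds the observed result fixed at z = 1.96 (just significant at 0.025), and asks how well the claim μ > 0.1 is warranted for two sample sizes; the function name and all numbers are illustrative assumptions.

```python
from statistics import NormalDist


def severity_greater(z_obs, mu1, sigma, n, mu0=0.0):
    """SEV(mu > mu1) = Pr(M <= m0; mu = mu1), where the observed mean m0
    lies z_obs standard errors above mu0."""
    se = sigma / n ** 0.5
    m0 = mu0 + z_obs * se
    return NormalDist().cdf((m0 - mu1) / se)


# Same significance level, different sample sizes (assumed values).
sev_small_n = severity_greater(1.96, mu1=0.1, sigma=1.0, n=100)
sev_large_n = severity_greater(1.96, mu1=0.1, sigma=1.0, n=10_000)
print(f"n = 100:    SEV(mu > 0.1) = {sev_small_n:.3f}")
print(f"n = 10000:  SEV(mu > 0.1) = {sev_large_n:.3g}")
```

With the larger sample the same P-value corresponds to a much smaller observed mean, so the claim μ > 0.1 passes with almost no severity: the sensitive alarm going off says little about the size of the fire.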
  • 58. • Anyone who equates severity and power has it backwards • The one case where they could be equal is when M just misses statistical significance and we want to assess claims of the form μ < μ1, μ < μ2, μ < μ3, … μ < μk, … • Then SEV(μ < μk) = POW(μk)
  • 59. How we avoid fallacies of non-significant results • They don’t warrant 0 discrepancy • Not uninformative; we can find upper bounds μ1 such that SEV(μ < μ1) is high • It’s with negative results (P-values not small) that severity goes in the same direction as power, provided model assumptions hold
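For non-significant results the severity assessment runs in the other direction. The sketch assumes the same one-sided Normal test with known σ and takes SEV(μ < μ1) = Pr(M > M0; μ = μ1); the observed mean of 0.1 with σ = 1, n = 100 and the function names are illustrative assumptions.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal


def severity_less(m0, mu1, sigma, n):
    """SEV(mu < mu1) = Pr(M > m0; mu = mu1) for observed sample mean m0."""
    se = sigma / n ** 0.5
    return 1.0 - Z.cdf((m0 - mu1) / se)


def power(mu1, cutoff, sigma, n):
    """POW(mu1) = Pr(M >= cutoff; mu = mu1)."""
    se = sigma / n ** 0.5
    return 1.0 - Z.cdf((cutoff - mu1) / se)


sigma, n = 1.0, 100
m0 = 0.1  # one standard error above 0: a non-significant result
bounds = {mu1: severity_less(m0, mu1, sigma, n) for mu1 in (0.1, 0.2, 0.3)}
for mu1, sev in bounds.items():
    print(f"SEV(mu < {mu1:.1f}) = {sev:.3f}")

# If M only just misses the cut-off, severity for mu < mu_k matches power.
cutoff = 1.96 * sigma / n ** 0.5
print(severity_less(cutoff, 0.3, sigma, n), power(0.3, cutoff, sigma, n))
```

The negative result does not warrant μ = 0, but it does warrant upper bounds: in this example μ < 0.3 passes with high severity while μ < 0.1 does not.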
  • 60. FEV: Frequentist Principle of Evidence, Mayo and Cox (2006); SEV: Mayo (1991), Mayo and Spanos (2006). FEV/SEV: A small P-value indicates discrepancy γ from H0 only if there is a high probability the test would have resulted in a larger P-value were a discrepancy as large as γ absent. FEV/SEV: A moderate P-value indicates the absence of a discrepancy γ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy γ present.
  • 62. 2019: Don’t say ‘significance’, don’t use P-value thresholds • Editors of the March 2019 issue of TAS, "A World Beyond p < 0.05" (Wasserstein, Schirm, Lazar), say "declarations of ‘statistical significance’ be abandoned" (p. 2). • On their view: Prespecified P-value thresholds should never be used in interpreting results. • It is not just a word ban but a gatekeeper ban
  • 63. “Retiring statistical significance would give bias a free pass”. John Ioannidis (2019) “...potential for falsification is a prerequisite for science. Fields that obstinately resist refutation can hide behind the abolition of statistical significance but risk becoming self-ostracized from the remit of science”. I agree, and in Mayo (2019) I show why.
  • 64. • Complying with the “no threshold” view precludes the FDA's long-established drug review procedures, as Wasserstein et al. (2019) recognize • They think by removing P-value thresholds, researchers lose an incentive to data dredge, and otherwise exploit researcher flexibility • Even if true, it's a bad argument. (Decriminalizing robbery results in fewer robbery arrests) • But it's not true.
  • 65. • Even without the word significance, eager researchers still can’t take the large (non-significant) P-value to indicate a genuine effect • It would be to say: Even though larger differences would frequently occur by chance variability alone, my data provide evidence they are not due to chance variability • In short, the researcher would still need to report a reasonably small P-value • The eager investigator will need to "spin" the results, ransack, data dredge
  • 66. • In a world without predesignated thresholds, it would be hard to hold the investigator accountable for reporting a nominally small P-value: • "whether a p-value passes any arbitrary threshold should not be considered at all" in interpreting data (Wasserstein et al. 2019, 2)
  • 67. No tests, no falsification • The “no thresholds” view also blocks common uses of confidence intervals and Bayes factor standards • If you cannot say about any results, ahead of time, they will not be allowed to count in favor of a claim, then you do not have a test of it • Don’t confuse having a threshold for a terrible test with using a fixed P-value across all studies in an unthinking manner • We should reject the latter
  • 68. New ASA Task Force on Significance Tests and Replication • to “prepare a …piece reflecting ‘good statistical practice,’ without leaving the impression that p-values and hypothesis tests…have no role.” (Karen Kafadar 2019, 4) • I hope that philosophers (of science and of knowledge) get involved!
  • 69. References • Barnard, G. (1972). “The Logic of Statistical Inference (Review of ‘The Logic of Statistical Inference’ by Ian Hacking)”, British Journal for the Philosophy of Science 23(2), 123–32. • Bartlett, T. (2014). “Replication Crisis in Psychology Research Turns Ugly and Odd”, The Chronicle of Higher Education (online) June 23, 2014. • Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology 72: 90–103. • Benjamin, D., Berger, J., Johannesson, M., et al. (2017). “Redefine Statistical Significance”, Nature Human Behaviour 2, 6–10. • Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402. • Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics. • Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033. • Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem”, Journal of the American Statistical Association 82(397), 106–11. • Colquhoun, D. (2014). “An Investigation of the False Discovery Rate and the Misinterpretation of P-values”, Royal Society Open Science 1(3), 140216. • Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Mayo and Spanos (eds.), pp. 276–304. CUP.
  • 70. • FDA (U. S. Food and Drug Administration) (2017). “Multiple Endpoints in Clinical Trials: Guidance for Industry (DRAFT GUIDANCE).” Retrieved from https://www.fda.gov/media/102657/download • Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and Boyd. • Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”, American Scientist 2, 460–65. • Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder”, British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76–80. • Glymour, C. (2010). “Explanation and Truth”. In Mayo, D. and Spanos, A. (eds.) Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, pp. 331–50. CUP. • Goodman, S. N. (1999). “Toward evidence-based medical statistics. 2: The Bayes factor,” Annals of Internal Medicine 130: 1005–1013. • Hacking, I. (1965). Logic of Statistical Inference. CUP. • Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, in Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, pp. 141–60. CUP. • Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS Medicine 2(8), 0696–0701. • Ioannidis, J. (2019). “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not Abandon Significance.” JAMA 321(21): 2067–2068. doi:10.1001/jama.2019.4582 • Kafadar, K. (2019). “The Year in Review … And More to Come,” President's Corner, AmStat News (Issue 510), (Dec. 2019). • Lakens, D., et al. (2018). “Justify your Alpha”, Nature Human Behavior 2, 168–71.
  • 71. • Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.), Foundations of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston. • Mayo, D. (1991). “Novel Evidence and Severe Tests,” Philosophy of Science, 58 (4): 523–552. Reprinted (1991) in The Philosopher’s Annual XIV: 203–232. • Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press. • Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge University Press. • Mayo, D. (2019). “P-value thresholds: Forfeit at your peril”. European Journal of Clinical Investigation, 49, e13170. https://doi.org/10.1111/eci.13170 • Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, 49, pp. 247–275. Institute of Mathematical Statistics. • Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357. • Meehl, P. (1990). “Why Summaries of Research on Psychological Theories Are Often Uninterpretable”, Psychological Reports 66(1): 195–244. • Meehl, P. and Waller, N. (2002). “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude”, Psychological Methods 7(3): 283–300. • Morrison, D. E., and R. E. Henkel, (eds.), (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. • Musgrave, A. (1974). “Logical versus Historical Theories of Confirmation”, BJPS 25(1), 1–23.
  • 72. • Neyman, J. and Pearson, E.S. (1933). “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society of London Series A 231, 289–337. Reprinted in Joint Statistical Papers, 140–85. • Neyman, J. and Pearson, E.S. (1967). Joint Statistical Papers of J. Neyman and E. S. Pearson. Cambridge: Cambridge University Press. • Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”, Science 349(6251), 943–51. • Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J. Neyman & E.S. Pearson, 99–115 (Berkeley: U. of Calif. Press). • Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL: Chapman and Hall, CRC press. • Royall, R. (2004). “The Likelihood Paradigm for Statistical Evidence” and “Rejoinder”. In Taper, M. and Lele, S. (eds.) The Nature of Scientific Evidence, pp. 119–138; 145–151. Chicago: University of Chicago Press. • Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen. • Selvin, H. (1970). “A critique of tests of significance in survey research”. In The significance test controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine De Gruyter. • Simmons, J., Nelson, L. and Simonsohn, U. (2012). “A 21 word solution”, Dialogue 26(2), 4–7. • Wasserstein, R., Schirm, A. & Lazar, N. (2019). Moving to a World Beyond “p < 0.05” [Editorial]. The American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913