A brief introduction to the Philosophy of Science for information scientists and technologists. This is also Chapter 1 of my course on Qualitative Research.
An overview of the History and Philosophy of Science, dissecting the terms History and Philosophy and their focal point, science; correlating the history of science with the philosophy of science; and taking up other essentials such as the scientific method, paradigms, and the role of History and Philosophy of Science in the science classroom. It aims to help teachers, and teachers-to-be, integrate what they learn in this subject into their science teaching.
The Statistics Wars: Errors and Casualties
ABSTRACT: Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties. The philosophical presuppositions behind the meta-research battles remain largely hidden. Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science. Efforts of replication researchers and open science advocates are diminished when so much attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance, no evidence against isn’t evidence for), when erroneous understanding of basic statistical terms goes uncorrected, and when bandwagon effects lead to popular reforms that downplay the importance of error probability control. These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
D. Mayo: Philosophical Interventions in the Statistics Wars
ABSTRACT: While statistics has a long history of passionate philosophical controversy, the last decade especially cries out for philosophical illumination. Misuses of statistics, Big Data dredging, and P-hacking make it easy to find statistically significant, but spurious, effects. This obstructs a test's ability to control the probability of erroneously inferring effects–i.e., to control error probabilities. Disagreements about statistical reforms reflect philosophical disagreements about the nature of statistical inference–including whether error probability control even matters! I describe my interventions in statistics in relation to three events. (1) In 2016 the American Statistical Association (ASA) met to craft principles for avoiding misinterpreting P-values. (2) In 2017, a "megateam" (including philosophers of science) proposed "redefining statistical significance," replacing the common threshold of P ≤ .05 with P ≤ .005. (3) In 2019, an editorial in the main ASA journal called for abandoning all P-value thresholds, and even the words "significant/significance".
A word on each. (1) Invited to be a "philosophical observer" at their meeting, I found the major issues were conceptual. P-values measure how incompatible data are with what is expected under a hypothesis that there is no genuine effect: the smaller the P-value, the more indication of incompatibility. The ASA list of familiar misinterpretations–P-values are not posterior probabilities, statistical significance is not substantive importance, no evidence against a hypothesis need not be evidence for it–I argue, should not be the basis for replacing tests with methods less able to assess and control erroneous interpretations of data (Mayo 2016, 2019). (2) The "redefine statistical significance" movement appraises P-values from the perspective of a very different quantity: a comparative Bayes Factor. Failing to recognize how contrasting approaches measure different things, disputants often talk past each other (Mayo 2018). (3) To ban P-value thresholds, even to distinguish terrible from warranted evidence, I say, is a mistake (2019). It will not eradicate P-hacking, but it will make it harder to hold P-hackers accountable. A 2020 ASA Task Force on significance testing has just been announced. (I would like to think my blog errorstatistics.com helped.)
To enter the fray between rival statistical approaches, it helps to have a principle applicable to all accounts. There's poor evidence for a claim if little, if anything, has been done to find it flawed, even if it is. This forms a basic requirement for evidence I call the severity requirement. A claim passes with severity only if it is subjected to and passes a test that probably would have found it flawed, if it were. It stems from Popper, though he never adequately cashed it out. A variant is the frequentist principle of evidence developed with Sir David Cox (Mayo and Cox 2006).
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo discusses various disputes, notably the replication crisis in science, in the context of her just-released book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa,” presented (via Zoom) at the conference IS PHILOSOPHY USEFUL FOR SCIENCE, AND/OR VICE VERSA?, January 30 - February 2, 2024, at Chapman University, Schmid College of Science and Technology.
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
The constructive role of the replication crisis teaches a lot about: 1) non-fallacious uses of statistical tests, 2) the rationale for the role of probability in tests, and 3) how to reformulate tests.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Replication Crises and the Statistics Wars: Hidden Controversies
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
Severe Testing: The Key to Error Correction
D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods, Dept of Psychology, Princeton University, on November 14, 2023
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
Controversy Over the Significance Test Controversy
Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) in PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
D. G. Mayo: Your data-driven claims must still be probed severely
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science" at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
Similar to Philosophy of Science and Philosophy of Statistics
D. Mayo (Dept of Philosophy, VT)
Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Nancy Reid, Dept. of Statistics, University of Toronto. Inaugural recipient of the "David R. Cox Foundations of Statistics Award".
Slides from invited presentation at 2023 JSM: "The Importance of Foundations in Statistical Science"
Ronald Wasserstein, Chair (American Statistical Association)
ABSTRACT: David Cox wrote “A healthy interplay between theory and application is crucial for statistics… This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods.” These foundations distinguish statistical science from the many fields of research in which statistical thinking is a key intellectual component. In this talk I will emphasize the ongoing importance and relevance of theoretical advances and theoretical thinking through some illustrative examples.
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role to which scientists look to statistics. In order to combat the dangers of unthinking, bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed "reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Causal inference is not statistical inference
Jon Williamson (University of Kent)
ABSTRACT: Many methods for testing causal claims are couched as statistical methods: e.g., randomised controlled trials, various kinds of observational study, meta-analysis, and model-based approaches such as structural equation modelling and graphical causal modelling. I argue that this is a mistake: causal inference is not a purely statistical problem. When we look at causal inference from a general point of view, we see that methods for causal inference fit into the framework of Evidential Pluralism: causal inference is properly understood as requiring mechanistic inference in addition to statistical inference.
Evidential Pluralism also offers a new perspective on the replication crisis. That observed associations are not replicated by subsequent studies is a part of normal science. A problem only arises when those associations are taken to establish causal claims: a science whose established causal claims are constantly overturned is indeed in crisis. However, if we understand causal inference as involving mechanistic inference alongside statistical inference, as Evidential Pluralism suggests, we avoid fallacious inferences from association to causation. Thus, Evidential Pluralism offers the means to prevent the drama of science from turning into a crisis.
Stephan Guttinger (Lecturer in Philosophy of Data/Data Ethics, University of Exeter, UK)
ABSTRACT: The idea of “questionable research practices” (QRPs) is central to the narrative of a replication crisis in the experimental sciences. According to this narrative the low replicability of scientific findings is not simply due to fraud or incompetence, but in large part to the widespread use of QRPs, such as “p-hacking” or the lack of adequate experimental controls. The claim is that such flawed practices generate flawed output. The reduction – or even elimination – of QRPs is therefore one of the main strategies proposed by policymakers and scientists to tackle the replication crisis.
What counts as a QRP, however, is not clear. As I will discuss in the first part of this paper, there is no consensus on how to define the term, and ascriptions of the qualifier "questionable" often vary across disciplines, time, and even within single laboratories. This lack of clarity matters, as it creates the risk of introducing methodological constraints that might create more harm than good. Practices labelled as 'QRPs' can be both beneficial and problematic for research practice, and targeting them without a sound understanding of their dynamic and context-dependent nature risks creating unnecessary casualties in the fight for a more reliable scientific practice.
To start developing a more situated and dynamic picture of QRPs I will then turn my attention to a specific example of a dynamic QRP in the experimental life sciences, namely, the so-called “Far Western Blot” (FWB). The FWB is an experimental system that can be used to study protein-protein interactions but which for most of its existence has not seen a wide uptake in the community because it was seen as a QRP. This was mainly due to its (alleged) propensity to generate high levels of false positives and negatives. Interestingly, however, it seems that over the last few years the FWB slowly moved into the space of acceptable research practices. Analysing this shift and the reasons underlying it, I will argue a) that suppressing this practice deprived the research community of a powerful experimental tool and b) that the original judgment of the FWB was based on a simplistic and non-empirical assessment of its error-generating potential. Ultimately, it seems like the key QRP at work in the FWB case was the way in which the label “questionable” was assigned in the first place. I will argue that findings from this case can be extended to other QRPs in the experimental life sciences and that they point to a larger issue with how researchers judge the error-potential of new research practices.
David Hand (Professor Emeritus and Senior Research Investigator, Department of Mathematics, Faculty of Natural Sciences, Imperial College London)
ABSTRACT: Science progresses through an iterative process of formulating theories and comparing them with empirical real-world data. Different camps of scientists will favour different theories, until accumulating evidence renders one or more untenable. Not unnaturally, people become attached to theories. Perhaps they invented a theory, and kudos arises from being the originator of a generally accepted theory. A theory might represent a life's work, so that being found wanting might be interpreted as failure. Perhaps researchers were trained in a particular school, and acknowledging its shortcomings is difficult. Because of this, tensions can arise between proponents of different theories.
The discipline of statistics is susceptible to precisely the same tensions. Here, however, the tensions are not between different theories of "what is", but between different strategies for shedding light on the real world from limited empirical data. This can be in the form of how one measures discrepancy between the theory's predictions and observations. It can be in the form of different ways of looking at empirical results. It can be, at a higher level, because of differences between what is regarded as important in a particular context. Or it can be for other reasons.
Perhaps the most familiar example of this tension within statistics is between different approaches to inference. However, there are many other examples of such tensions. This paper illustrates with several examples. We argue that the tension generally arises as a consequence of inadequate care being taken in question formulation. That is, insufficient thought is given to deciding exactly what one wants to know - to determining "What is the question?".
The ideas and disagreements are illustrated with several examples.
The neglected importance of complexity in statistics and Metascience
Daniele Fanelli (London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science)
ABSTRACT: Statistics is at war, and Metascience is ailing. This is partially due, the talk will argue, to a paradigmatic blind-spot: the assumption that one can draw general conclusions about empirical findings without considering the role played by context, conditions, assumptions, and the complexity of methods and theories. Whilst ideally these particularities should be unimportant in science, in practice they cannot be neglected in most research fields, let alone in research-on-research.
This neglected importance of complexity is supported by theoretical arguments and empirical findings (or the lack thereof) in the recent meta-analytical and metascientific literature. The talk will overview this background and suggest how the complexity of theories and methodologies may be explicitly factored into particular methodologies of statistics and Metaresearch. The talk will then give examples of how this approach may usefully complement existing paradigms, by translating results, methods and theories into quantities of information that are evaluated using an information-compression logic.
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Uri Simonsohn (Professor, Department of Operations, Innovation and Data Sciences at Esade)
ABSTRACT: The statistical tools listed in the title share the feature that a mathematically elegant solution has become the consensus advice of statisticians, methodologists, and some mathematically sophisticated researchers writing tutorials and textbooks, and yet they lead research workers to meaningless answers that are often also statistically invalid. Part of the problem is that advice givers take the mathematical abstractions of the tools they advocate for literally, instead of taking the actual behavior of researchers seriously.
On Severity, the Weight of Evidence, and the Relationship Between the Two
Margherita Harris (Visiting fellow in the Department of Philosophy, Logic and Scientific Method at the London School of Economics and Political Science)
ABSTRACT: According to the severe tester, one is justified in declaring to have evidence in support of a hypothesis just in case the hypothesis in question has passed a severe test, one that it would be very unlikely to pass so well if the hypothesis were false. Deborah Mayo (2018) calls this the strong severity principle. The Bayesian, however, can declare to have evidence for a hypothesis despite not having done anything to test it severely. The core reason for this has to do with the (infamous) likelihood principle, whose violation is not an option for anyone who subscribes to the Bayesian paradigm. Although the Bayesian is largely unmoved by the incompatibility between the strong severity principle and the likelihood principle, I will argue that the Bayesian's never-ending quest to account for yet another notion, one that is often attributed to Keynes (1921) and that is usually referred to as the weight of evidence, betrays the Bayesian's confidence in the likelihood principle after all. Indeed, I will argue that the weight of evidence and severity may be thought of as two (very different) sides of the same coin: they are two unrelated notions, but what brings them together is the fact that they both make trouble for the likelihood principle, a principle at the core of Bayesian inference. I will relate this conclusion to current debates on how to best conceptualise uncertainty, by the IPCC in particular. I will argue that failure to fully grasp the limitations of an epistemology that envisions the role of probability to be that of quantifying the degree of belief to assign to a hypothesis given the available evidence can be (and has been) detrimental to an adequate communication of uncertainty.
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Aris Spanos (Wilson Schmidt Professor of Economics, Virginia Tech)
ABSTRACT: The discussion places the two cultures, the model-driven statistical modeling and the algorithm-driven modeling associated with Machine Learning (ML) and Statistical Learning Theory (SLT), in a broader context of paradigm shifts in 20th-century statistics, which includes Fisher's model-based induction of the 1920s and variations/extensions thereof, the Data Science (ML, SLT, etc.) and the Graphical Causal modeling in the 1990s. The primary objective is to compare and contrast the effectiveness of different approaches to statistics in learning from data about phenomena of interest and relate that to the current discussions pertaining to the statistics wars and their potential casualties.
Comparing Frequentist and Bayesian Control of Multiple Testing
James Berger
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20x100,000=2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 x 10^-7. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 x 10^-7. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. ... There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
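To give a rough sense of the scale of the multiplicity problem described above, here is a minimal simulation sketch in Python (the numbers are invented and far smaller than a real GWAS): every null hypothesis is true, yet testing each at the 0.05 level yields thousands of spurious "findings," while a Bonferroni-style adjustment of the threshold removes essentially all of them.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

m = 100_000      # number of null hypotheses tested (all of them true here)
n = 50           # observations per test
alpha = 0.05

# Simulate m independent one-sample z-tests in which every null is true.
x = rng.normal(loc=0.0, scale=1.0, size=(m, n))
z = x.mean(axis=1) * np.sqrt(n)           # z-statistics under H0
p = 2 * stats.norm.sf(np.abs(z))          # two-sided p-values

print("significant at 0.05:          ", (p < alpha).sum())      # roughly 5,000 false positives
print("significant after Bonferroni: ", (p < alpha / m).sum())  # almost always 0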
Clark Glymour
ABSTRACT: "Data dredging"--searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. ... These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
The Duality of Parameters and the Duality of Probability
Suzanne Thornton
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)--the behavior of a procedure under hypothetical repetition--bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited from the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to cover the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. ... In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter; rather, it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
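A minimal Python sketch of the repeated-sampling guarantee the abstract invokes for the Normal N(θ, 1) example (θ, the sample size, and the number of repetitions are made-up numbers; σ = 1 is treated as known):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

theta = 2.3        # the fixed parameter (unknown in practice)
n = 25             # sample size per hypothetical study
reps = 100_000     # hypothetical repetitions
z = stats.norm.ppf(0.975)

samples = rng.normal(loc=theta, scale=1.0, size=(reps, n))
means = samples.mean(axis=1)
half_width = z / np.sqrt(n)               # known sigma = 1
covered = (means - half_width <= theta) & (theta <= means + half_width)

print("empirical coverage:", covered.mean())   # close to 0.95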
Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without assigning a prior probability to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast (e.g., DNA matching), the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
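A minimal sketch of the post-data severity assessment described above, for the simple known-σ Normal case (the observed mean, σ, and n are invented for illustration): SEV(μ > μ′) is computed as the probability of observing a smaller mean than we did, were μ equal to μ′; at the lower 0.975 estimation limit a it comes out to about .975, as in the abstract.

import numpy as np
from scipy import stats

m_obs, sigma, n = 0.40, 1.0, 100     # illustrative numbers only
se = sigma / np.sqrt(n)

def severity_mu_greater_than(mu_prime):
    # SEV(mu > mu_prime) = Pr(M < m_obs; mu = mu_prime) for the Normal mean with known sigma
    return stats.norm.cdf((m_obs - mu_prime) / se)

a = m_obs - 1.96 * se                # lower 0.975 estimation limit
for mu_prime in [a, 0.0, 0.2, 0.3, 0.4]:
    print(f"SEV(mu > {mu_prime:.3f}) = {severity_mu_greater_than(mu_prime):.3f}")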
The Statistics Wars and Their Casualties (w/refs)
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
On the interpretation of the mathematical characteristics of statistical test...
Statistical hypothesis tests are often misused and misinterpreted. Here I focus on one source of such misinterpretation, namely an inappropriate notion regarding what the mathematical theory of tests implies, and does not imply, when it comes to the application of tests in practice. The view taken here is that it is helpful and instructive to be consciously aware of the essential difference between mathematical model and reality, and to appreciate the mathematical model and its implications as a tool for thinking rather than something that has a truth value regarding reality. Insights are presented regarding the role of model assumptions, unbiasedness and the alternative hypothesis, Neyman-Pearson optimality, multiple and data dependent testing.
The role of background assumptions in severity appraisal
In the past decade discussions around the reproducibility of scientific findings have led to a re-appreciation of the importance of guaranteeing claims are severely tested. The inflation of Type 1 error rates due to flexibility in the data analysis is widely considered one of the underlying causes of low replicability rates. Solutions, such as study preregistration, are becoming increasingly popular to combat this problem. Preregistration only allows researchers to evaluate the severity of a test, but not all preregistered studies provide a severe test of a claim. The appraisal of the severity of a test depends on background information, such as assumptions about the data generating process, and auxiliary hypotheses that influence the final choice for the design of the test. In this article, I will discuss the difference between subjective and inter-subjectively testable assumptions underlying scientific claims, and the importance of separating the two. I will stress the role of justifications in statistical inferences, the conditional nature of scientific conclusions following these justifications, and highlight how severe tests could lead to inter-subjective agreement, based on a philosophical approach grounded in methodological falsificationism. Appreciating the role of background assumptions in the appraisal of severity should shed light on current discussions about the role of preregistration, interpreting the results of replication studies, and proposals to reform statistical inferences.
The two statistical cornerstones of replicability: addressing selective infer...
Tukey’s last published work in 2020 was an obscure entry on multiple comparisons in the Encyclopedia of Behavioral Sciences, addressing the two topics in the title. Replicability was not mentioned at all, nor was any other connection made between the two topics. I shall demonstrate how these two topics critically affect replicability using recently completed studies. I shall review how these have been addressed in the past. I shall review in more detail the available ways to address selective inference. My conclusion is that conducting many small replicability studies without strict standardization is the way to assure replicability of results in science, and we should introduce policies to make this happen.
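As one concrete, standard way of addressing selective inference (a sketch for illustration, not necessarily the procedure the talk reviews), here is the Benjamini-Hochberg step-up rule in Python, applied to simulated p-values from a made-up mixture of true and false nulls and compared with Bonferroni:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Illustrative mixture: 900 true nulls and 100 genuine effects, one z-test each.
m, m_alt, effect = 1000, 100, 3.0
z = rng.normal(size=m)
z[:m_alt] += effect
p = 2 * stats.norm.sf(np.abs(z))

def benjamini_hochberg(pvals, q=0.05):
    # Boolean mask of rejections under the BH step-up procedure at FDR level q.
    pvals = np.asarray(pvals)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        reject[order[: np.max(np.nonzero(below)[0]) + 1]] = True
    return reject

print("Bonferroni rejections:   ", (p < 0.05 / m).sum())
print("BH (FDR 0.05) rejections:", benjamini_hochberg(p).sum())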
The replication crisis: are P-values the problem and are Bayes factors the so...
Today’s posterior is tomorrow’s prior. Dennis Lindley
It has been claimed that science is undergoing a replication crisis and that when looking for culprits, the cult of significance is the chief suspect. It has also been claimed that Bayes factors might provide a solution.
In my opinion, these claims are misleading, and part of the problem is our understanding of the purpose and nature of replication, which has only recently been subject to formal analysis.
What we are or should be interested in is truth. Replication is a coherence, not a correspondence, requirement, and one that has a strong dependence on the size of the replication study.
Consideration of Bayes factors raises a puzzling question. Should the Bayes factor for a replication study be calculated as if it were the initial study? If the answer is yes, the approach is not fully Bayesian and furthermore the Bayes factors will be subject to exactly the same replication ‘paradox’ as P-values. If the answer is no, then in what sense can an initially found Bayes factor be replicated, and what are the implications for how we should view replication of P-values?
A further issue is that little attention has been paid to false negatives and, by extension, to true negatives. Yet, as is well known from the theory of diagnostic tests, it is meaningless to consider the performance of a test in terms of false positives alone.
I shall argue that we are in danger of confusing evidence with the conclusions we draw, and that any reforms of scientific practice should concentrate on producing evidence that is as reliable as it can be qua evidence. There are many basic scientific practices in need of reform. Pseudoreplication, for example, and the routine destruction of information through dichotomisation are far more serious problems than many matters of inferential framing that seem to have excited statisticians.
Philosophy of Science and Philosophy of Statistics
1. Philosophy of Science and Philosophy of Statistics
Deborah G Mayo
Dept of Philosophy, Virginia Tech
APA Central: Epistemology Meets Philosophy of Statistics
February 27, 2020
2. What is the Philosophy of Statistics (PhilStat)?
At one level, statisticians and philosophers of science ask many of the same questions:
What should be observed and what may justifiably be inferred from the resulting data?
What is a good test?
How can spurious relationships be distinguished from genuine regularities? from causal regularities?
3. • These very general questions are entwined with long-standing debates in philosophy of science
• No wonder the field of statistics tends to cross over so often into philosophical territory.
4. Statistics → Philosophy
(1) Model Scientific Inference—capture the actual or rational ways to arrive at evidence and inference
(2) Solve Philosophical Problems about scientific inference, observation, experiment;
(3) Metamethodology: Analyze intuitive rules (e.g., novelty, simplicity)
5. Formal Epistemology?
Could be
• Phil Stat
• Analytic epistemology with probabilities
“Bayesian statistics is one thing. Bayesian epistemology is something else. The idea of putting probabilities over hypotheses delivered to philosophy a godsend, an entire package of superficiality.” (Glymour 2010, 334)
6. • His worry: starting with an intuitive principle, epistemologists reconstruct it with a probabilistic confirmation logic.
• You haven’t shown, for example, that beliefs ought to go up with varied evidence; you represent it probabilistically
• I don't knock rational reconstruction using probability, and analogs of some puzzles arise in statistics (tacking paradox, old evidence)
7. Philosophy → Statistics
• Central job: minister to scientists’ conceptual, logical and methodological discomforts
• Despite technical sophistication, basic concepts of statistical inference and modeling are more unsettled than ever.
9. Statistical Crisis in Science
• In many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive-looking findings even when spurious.
• We set sail with a simple tool: a minimal requirement for evidence
• Sufficiently general to apply to any methods now in use
10. Statistical reforms
• Several are welcome: preregistration, avoidance of cookbook statistics, calls for more replication research
• Others are quite radical, and even violate our minimal principle of evidence
• Combating paradoxical, self-defeating “reforms” requires a mix of statistics, philosophy, and history
11. Most often used tools are most criticized
“Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. …” (Ioannidis 2005, 696)
Do researchers do that?
12. R.A. Fisher
“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14)
13. Simple significance tests (Fisher)
to test the conformity of the particular data under analysis with H0 in some respect:
…we find a function d(X) of the data, the test statistic, such that
• the larger the value of d(X) the more inconsistent are the data with H0;
• d(X) has a known probability distribution when H0 is true.
…the p-value corresponding to any d(x) (or d_obs)
p = p(x) = Pr(d(X) ≥ d(x); H0)
(Mayo and Cox 2006, 81; d for t, x for y)
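As a minimal, illustrative sketch of the P-value computation just defined (assumed setup: a one-sample z-test of H0: μ = 0 with known σ = 1 and d(X) = √n·mean(X)/σ; the data are simulated, not from any study discussed here):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma, n = 1.0, 30
x = rng.normal(loc=0.3, scale=sigma, size=n)   # simulated data with a small discrepancy from H0

d_obs = np.sqrt(n) * x.mean() / sigma          # observed value of the test statistic d(x)
p_value = stats.norm.sf(d_obs)                 # Pr(d(X) >= d_obs; H0)

print(f"d_obs = {d_obs:.2f}, P-value = {p_value:.3f}")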
14. Testing reasoning
• If even larger differences than d_obs occur fairly frequently under H0 (i.e., P-value not small), there’s no evidence of incompatibility with H0
• Small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than d_obs were H0 true.
• This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy H => H*
15. Neyman-Pearson (N-P) tests (1933):
A test (null) and alternative hypotheses H0, H1 that are exhaustive
H0: μ ≤ 0 vs. H1: μ > 0
Philosophers should adopt the language of statistics, e.g., Xi ~ N(μ, σ²)
16. Neyman-Pearson (N-P) tests (1933):
• This fallacy of rejection H1 → H* is impossible
• Rejecting H0 only indicates statistical alternatives H1 (how discrepant from null)
• We get the type II error, and power
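A minimal sketch of the Type II error and power calculation mentioned on slide 16 (assumed setup: the one-sided test H0: μ ≤ 0 vs. H1: μ > 0 with known σ = 1, rejecting when √n·x̄/σ exceeds the α-level cutoff; the numbers are invented):

import numpy as np
from scipy import stats

sigma, n, alpha = 1.0, 25, 0.05
z_alpha = stats.norm.ppf(1 - alpha)            # cutoff for rejecting H0: mu <= 0

def power(mu1):
    # Pr(reject H0; mu = mu1); the Type II error at mu1 is 1 - power(mu1).
    return stats.norm.sf(z_alpha - np.sqrt(n) * mu1 / sigma)

for mu1 in [0.0, 0.2, 0.4, 0.6]:
    print(f"mu1 = {mu1:.1f}: power = {power(mu1):.3f}, Type II error = {1 - power(mu1):.3f}")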
17. Error Statistics
• Fisher and N-P both fall under tools for “appraising and bounding the probabilities of seriously misleading interpretations of data” (Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests, resampling, randomization
18. Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking, significance
seeking, multiple testing, post-data subgroups,
trying and trying again—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
• Violates minimal requirement for evidence.
18
19. Severity Requirement:
• If the test had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H
(Popper: “too cheap to be worth having”)
• A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false.
• This probability is how severely it has passed (degree of “corroboration”)
20. Requires a third role for probability
Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (absolute or comparative) (e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher (at times))
Only probabilism is thought to be inferential or evidential
21. What happened to using probability to assess error-probing capacity?
• Neither “probabilism” nor “performance” directly captures assessing error-probing capacity
• Good long-run performance is a necessary, not a sufficient, condition for severity
22. A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets a probability boost, made comparatively firmer)
• Performance: unless it stems from a method with low long-run error
• Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
23. A severe test: My weight
Informal example: To test if I’ve gained weight between the time I left for England and my return, I use a series of well-calibrated and stable scales, both before leaving and upon my return.
All show an over 4 lb gain; none shows a difference in weighing EGEK (a known, stable object). I infer:
H: I’ve gained at least 4 pounds
24. • Properties of the scales are akin to the properties of statistical tests (performance).
• No one claims the justification is merely long run, and can say nothing about my weight.
• We infer something about the source of the readings from the high capability to reveal if any scales were wrong
25. The severe tester is assumed to be in a context of wanting to find things out
• I could insist all the scales are wrong—they work fine with weighing known objects—but this would prevent correctly finding out about weight…
• What sort of extraordinary circumstance could cause them all to go astray just when we don’t know the weight of the test object?
• Argument from coincidence: goes beyond being highly improbable
26. Popper : Carnap as Frequentist : Bayes
“According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h”. (Alan Musgrave 1974, 2)
Battles about roles of probability trace to philosophies of inference
27. Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as I’m using the term) the import of the data is via the ratios of likelihoods of hypotheses
Pr(x0; H0)/Pr(x0; H1)
The data x0 are fixed, while the hypotheses vary
28. Comparative logic of support
• Ian Hacking (1965) “Law of Likelihood”: x supports hypothesis H0 less well than H1 if Pr(x;H0) < Pr(x;H1) (rejects in 1980)
• Any hypothesis that perfectly fits the data is maximally likely (even if data-dredged)
• “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129).
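A minimal sketch of the point about data-dredged hypotheses (assumed setup: Normal data with known σ = 1, simulated under μ = 0; nothing here comes from the slides themselves): the hypothesis fitted to the data after seeing them is always at least as likely as the true null, so a purely comparative likelihood assessment cannot, by itself, penalize the dredging.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 20
x = rng.normal(loc=0.0, scale=1.0, size=n)      # data actually generated under mu = 0

def loglik(mu):
    # Log-likelihood of the sample under N(mu, 1).
    return stats.norm.logpdf(x, loc=mu, scale=1.0).sum()

mu_dredged = x.mean()                            # the "rival hypothesis" chosen to fit the data

print("log-likelihood under H0 (mu = 0):        ", round(loglik(0.0), 2))
print("log-likelihood under dredged (mu = xbar):", round(loglik(mu_dredged), 2))
# The dredged hypothesis always wins the comparative likelihood assessment.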
30. Hunting for significance (nominal vs. actual)
Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ …. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy 1970!)
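The arithmetic behind Selvin's 64 percent, sketched under the simplifying assumption of 20 independent tests of true null hypotheses, each at the 5 percent level:

# Chance of at least one "significant" result among k independent tests of true nulls.
k, alpha = 20, 0.05
actual_level = 1 - (1 - alpha) ** k
print(f"nominal level: {alpha:.2f}, actual level over {k} tests: {actual_level:.2f}")   # about 0.64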
31. Some accounts of evidence object:
“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or…data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010)
(Co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford)
32. On the LP, error probabilities appeal to something irrelevant
“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley 1971, 436)
Probabilisms condition on the actual data
33. At odds with key way to advance
replication: 21 Word Solution
“We report how we determined our sample size,
and data exclusions (if any), all manipulations, and
all measures in the study” (Simmons, Nelson, and
Simonsohn 2012, 4).
• Replication researchers find flexibility with data
dredging and stopping rules to be a major source of
failed replication (the “forking paths”, Gelman and
Loken 2014)
33
34. Many “reforms” offered as alternative
to significance tests follow the LP
• “Bayes factors [likelihood ratios] can be used in the
complete absence of a sampling plan…” (Bayarri,
Benjamin, Berger, Sellke 2016, 100)
• It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for itself.
(Berger and Wolpert 1988, 78; authors of the
Likelihood Principle)
• No wonder reformers talk past each other
34
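A small simulation sketch (the sample sizes and number of trials are my assumptions) of the “multiple looks” problem the stopping-rule debate is about: testing after every observation until nominal significance is reached inflates the Type I error probability, even though accounts obeying the LP register no difference.

```python
# Optional stopping simulation: test at the nominal two-sided 0.05 level after
# each new observation from a true null N(0,1), stopping at "significance".
import numpy as np
rng = np.random.default_rng(0)

n_max, z_crit, trials = 100, 1.96, 5_000
false_rejections = 0
for _ in range(trials):
    x = rng.standard_normal(n_max)        # H0 is true: the mean really is 0
    for n in range(1, n_max + 1):
        z = x[:n].mean() * np.sqrt(n)     # z-statistic at the n-th look
        if abs(z) > z_crit:               # peek, and stop on significance
            false_rejections += 1
            break

print(f"Type I error rate with peeking: {false_rejections / trials:.2f}")
# Roughly 0.3-0.4 with up to 100 looks -- far above the nominal 0.05.
```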
35. Replication Paradox
• Test Critic: It’s too easy to satisfy standard
significance thresholds
• You: Why do replicationists find it so hard to achieve
significance thresholds (with preregistration)?
• Test Critic: Obviously the initial studies were guilty
of P-hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on, adjust, and block these biasing
selection effects.
• Test Critic: Actually “reforms” recommend methods
where the need to alter P-values due to data
dredging vanishes 35
36. Probabilists can still block intuitively
unwarranted inferences
• Supplement with subjective beliefs: What do I
believe? As opposed to What is the evidence?
(Royall 1997; 2004)
• Likelihoods + prior probabilities
36
37. Problems
• Additional source of flexibility, priors as well as
biasing selection effects
• Doesn’t show what researchers had done
wrong—battle of beliefs
• The believability of data-dredged hypotheses is
what makes them so seductive
37
38. Contrast with philosophy: Bayesian
statisticians use “default” priors
“[V]irtually never would different experts give prior
distributions that even overlapped” (J. Berger 2006)
• Default priors are to be data dominant in some sense
• “The priors are not to be considered expressions of
uncertainty, ignorance, or degree of belief. [they] may
not even be probabilities…” (Cox and Mayo 2010,
299)
• No agreement on rival systems for default/non-
subjective priors (no “uninformative” priors) 38
39. Many of today’s statistics wars: P-values vs. posteriors
• The posterior probability Pr(H0|x) can be large
while the P-value is small
• To a Bayesian this shows P-values exaggerate
evidence against
• Significance testers object to highly significant
results being interpreted as no evidence against
the null, or even as evidence for it! (A high Type II
error probability)
40. Bayes (Jeffreys)/Fisher disagreement
(“spike and smear”)
• The “P-values exaggerate” charges refer to
testing a point null hypothesis, a lump of prior
probability given to H0 (or a tiny region around 0).
Xi ~ N(μ, σ²)
H0: μ = 0 vs. H1: μ ≠ 0.
• With the rest of the prior appropriately spread over the
alternative, an α-significant result can correspond to
Pr(H0|x) = (1 – α)! (e.g., 0.95)
40
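A sketch of the spike-and-smear calculation (the prior weights, the smear scale τ = 1, and the sample sizes are illustrative assumptions): hold the result at exactly the two-sided 0.05 level and watch the posterior on the point null grow with n.

```python
# Jeffreys-type "spike and smear" posterior for a just-significant result.
import numpy as np
from scipy.stats import norm

pi0, tau, sigma = 0.5, 1.0, 1.0               # spike Pr(H0) = 0.5, smear N(0, tau^2)
for n in (10, 100, 1_000, 10_000):
    xbar = 1.96 * sigma / np.sqrt(n)          # sample mean just significant at 0.05
    m0 = norm.pdf(xbar, 0, sigma / np.sqrt(n))               # marginal under H0
    m1 = norm.pdf(xbar, 0, np.sqrt(tau**2 + sigma**2 / n))   # marginal under H1
    post_H0 = pi0 * m0 / (pi0 * m0 + (1 - pi0) * m1)
    print(f"n = {n:>6}:  P-value = 0.05,  Pr(H0|x) = {post_H0:.2f}")
# Pr(H0|x) climbs toward 1 as n grows while the P-value stays at 0.05 --
# the arithmetic behind the "P-values exaggerate the evidence" charge.
```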
41. “Concentrating mass on the point null hypothesis
is biasing the prior in favor of H0 as much as
possible” (Casella and R. Berger 1987, 111)
whether in one- or two-sided tests
Yet “spike and smear” is the basis for “Redefine
Statistical Significance” (Benjamin et al. 2017)
41
42. Opposing megateam: Lakens et al. (2018)
• Whether tests should use a lower Type 1 error
probability is separate; the problem is
supposing there should be agreement
between quantities measuring different things
42
43. Recent example of a battle based on
P-values disagreeing with posteriors
• If we imagine randomly selecting a hypothesis
from an urn of nulls 90% of which are true
• Consider just 2 possibilities: H0: no effect
H1: meaningful effect, all else ignored,
• Take the prevalence of 90% as
Pr(H0) = 0.9, Pr(H1)= 0.1
• Reject H0 with a single (just) 0.05-significant result,
allowing cherry-picking and selection effects.
Then it can be shown that most “findings” are false 43
44. Diagnostic Screening (DS) Model
• Pr(H0 | Test T rejects H0) > 0.5
really: the prevalence of true nulls among those
rejected at the 0.05 level exceeds 0.5.
Call this the false finding rate (FFR)
• Pr(Test T rejects H0 | H0 ) = 0.05
Criticism: N-P Type I error probability ≠ FFR
(Ioannidis 2005, Colquhoun 2014)
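The screening arithmetic behind this criticism, using the slide-43 prevalence Pr(H0) = 0.9 and α = 0.05; the power values are assumptions added here to show how much the answer turns on them.

```python
# Diagnostic-screening (DS) model: proportion of true nulls among rejections.
def false_finding_rate(prev_H0, alpha, power):
    """Pr(H0 | test rejects H0) under the urn-of-nulls picture."""
    from_true_nulls = prev_H0 * alpha            # rejections when H0 is true
    from_real_effects = (1 - prev_H0) * power    # rejections when H1 is true
    return from_true_nulls / (from_true_nulls + from_real_effects)

for power in (0.8, 0.2):
    ffr = false_finding_rate(prev_H0=0.9, alpha=0.05, power=power)
    print(f"power = {power}:  FFR = {ffr:.2f},  PPV = {1 - ffr:.2f}")
# With low power, or with alpha inflated by cherry-picking, the FFR exceeds 0.5:
# "most findings false" -- given the assumed prevalence and power.
```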
45. DS testers see this as a major criticism of tests
• But there are major confusions
• Pr(H0 | Test T rejects H0) is not a Type I error
probability.
• It transposes the conditional
• Combines crude performance with a probabilist
assignment (true to neither Bayesians nor error
statisticians)
• OK in certain screening contexts (genomics)
47. PPV
• Complement of FFR: the positive predictive value
PPV
Pr(H1|Test T rejects H0)
47
48. What’s Pr(H1) (i.e., Prev(H1))?
“Proportion of experiments we do over a lifetime in
which there is a real effect” (Colquhoun 2014, 9)
Proportion of true relationships among those
tested in a field. Ioannidis (2005, 0696)
Hypotheses can be individuated in many ways
49. Probabilistic Instantiation Fallacy
• Pr(the randomly selected null hypothesis is true) = .9
• The randomly selected null hypothesis is H51
• Pr(H51 is true) = .9
Each Hi is either true or not!
(It could have a genuine frequentist prior but it
wouldn’t equal .9)
49
50. Is the PPV (complement of the FFR) relevant to what’s wanted?
Crud Factor. In many fields of social science it’s
thought nearly everything is related to everything:
“all nulls false”.
It also promotes the “stay safe” idea.
51. Some Bayesians reject probabilism
(Gelman: Falsificationist Bayesian;
Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as ‘error probes’
in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call ‘pure significance testing’, in
which certain of the model’s implications are
compared directly to the data.” (Gelman and Shalizi
2013, 10, 20). 51
52. • They can’t also jump on the “abandon significance /
don’t use P-value thresholds” bandwagon
• If there’s no threshold, there’s no falsification,
and no tests
• Granted P-values don’t give effect sizes
52
53. The severe tester reformulates tests with a discrepancy γ from H0
• Severity function: SEV(Test T, data x, claim C)
• Instead of a binary cut-off (significant or not)
the particular outcome is used to infer
discrepancies that are or are not warranted
54. To avoid fallacies of rejection (e.g., magnitude error)
Testing the mean of a Normal distribution:
H0: μ ≤ 0 vs. H1: μ > 0
• If you very probably would have observed a more
impressive (smaller) P-value if μ = μ1 (μ1 = μ0 + γ),
then the data are poor evidence that
μ > μ1.
55. Relation to a test’s power
Let M be the sample mean (a random variable) and
M0 its observed value
• Say M just reaches statistical significance at level
P, say 0.025; and compute power in relation to
this cut-off
• If the power against μ1 is high then the data are
poor evidence that
μ > μ1.
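A sketch of the computation behind slides 54–55 for the test H0: μ ≤ 0 vs. H1: μ > 0, with σ, n, and a just-significant observed mean assumed for illustration (the construction follows Mayo and Spanos 2006):

```python
# Severity vs. power for a just-significant result in a one-sided Normal test.
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25
se = sigma / np.sqrt(n)               # standard error of the sample mean M
m_obs = 1.96 * se                     # observed mean, just significant at 0.025
cutoff = 1.96 * se                    # rejection cutoff at the 0.025 level

for mu1 in (0.1, 0.2, 0.4, 0.6):
    # SEV(mu > mu1): probability of a less impressive result (M <= m_obs) if mu = mu1
    sev = norm.cdf((m_obs - mu1) / se)
    # Power of the test against mu = mu1, computed at the cutoff
    power = 1 - norm.cdf((cutoff - mu1) / se)
    print(f"mu1 = {mu1}:  SEV(mu > mu1) = {sev:.2f},  POW(mu1) = {power:.2f}")
# Where the power against mu1 is high, SEV(mu > mu1) is low: the just-significant
# result is poor evidence of a discrepancy that large (here SEV = 1 - POW,
# because m_obs sits exactly at the cutoff).
```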
57. Similarly, severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from a larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the one that goes off
with burnt toast] 57
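The fire-alarm point in numbers (the sample sizes and discrepancy of interest are assumed): the same just-significant result licenses a smaller discrepancy when n is large.

```python
# Same attained significance level, different sample sizes, different warranted discrepancies.
import numpy as np
from scipy.stats import norm

sigma, mu1 = 1.0, 0.2
for n in (25, 2_500):
    se = sigma / np.sqrt(n)
    m_obs = 1.96 * se                      # just significant at the 0.025 level
    sev = norm.cdf((m_obs - mu1) / se)     # severity for the claim mu > 0.2
    print(f"n = {n:>5}:  SEV(mu > 0.2) = {sev:.2f}")
# n = 25 warrants mu > 0.2 fairly well; n = 2,500 (the alarm that goes off at
# burnt toast) does not.
```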
58. • Anyone who equates severity and power has it
backwards
• The only time they could be equal is when M just
misses statistical significance and we want to
assess claims of the form
μ < μ1, μ < μ2, μ < μ3, … μ < μk, …
• Then SEV(μ < μk) = POW(μk)
58
59. We avoid fallacies of
non-significant results?
• They don’t warrant 0 discrepancy
• Not uninformative; we can find an upper bound μ1
such that SEV(μ < μ1) is high
• It’s with negative results (P-values not small)
that severity goes in the same direction as
power
–provided model assumptions hold
59
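A sketch of the upper-bound reading of a non-significant result (the numbers are assumptions): find discrepancies μ1 that the test had a high probability of detecting, so that SEV(μ < μ1) is high.

```python
# Severity-based upper bounds from a non-significant result.
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25
se = sigma / np.sqrt(n)
m_obs = 1.0 * se                 # observed mean, not significant (one-sided P about 0.16)

for mu1 in (0.2, 0.4, 0.6):
    # SEV(mu < mu1): probability of a more significant result (M > m_obs) if mu = mu1
    sev = 1 - norm.cdf((m_obs - mu1) / se)
    print(f"SEV(mu < {mu1}) = {sev:.2f}")
# The result doesn't warrant "zero discrepancy", but it does warrant mu < 0.6
# with high severity -- an informative upper bound, in the direction of power.
```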
60. FEV: Frequentist Principle of Evidence; Mayo
and Cox (2006); SEV: Mayo 1991, Mayo and
Spanos (2006)
FEV/SEV A small P-value indicates a discrepancy γ
from H0 only if there is a high probability the test
would have resulted in a larger P-value were a
discrepancy as large as γ absent.
FEV/SEV A moderate P-value indicates the absence
of a discrepancy γ from H0, only if there is a high
probability the test would have given a worse fit with
H0 (i.e., a smaller P-value) were a discrepancy γ
present.
60
62. 2019: Don’t say ‘significance’,
don’t use P-value thresholds
• The editors of the March 2019 TAS special issue “A World
Beyond p < 0.05” (Wasserstein, Schirm, and Lazar)
say that “declarations of ‘statistical significance’ be
abandoned” (p. 2).
• On their view: Prespecified P-value thresholds
should never be used in interpreting results.
• It is not just a word ban but a gatekeeper ban
62
63. “Retiring statistical significance
would give bias a free pass".
John Ioannidis (2019)
"...potential for falsification is a prerequisite for
science. Fields that obstinately resist refutation
can hide behind the abolition of statistical
significance but risk becoming self-ostracized
from the remit of science”.
I agree, and in Mayo (2019) I show why.
63
64. • Complying with the “no threshold” view precludes
the FDA's long-established drug review
procedures, as Wasserstein et al. (2019) recognize
• They think by removing P-value thresholds,
researchers lose an incentive to data dredge, and
otherwise exploit researcher flexibility
• Even if true, it's a bad argument.
(Decriminalizing robbery results in fewer robbery arrests)
• But it's not true.
64
65. • Even without the word ‘significance’, eager
researchers still can’t take a large (non-
significant) P-value to indicate a genuine effect
• It would be to say: Even though larger differences
would frequently occur by chance variability alone,
my data provide evidence they are not due to
chance variability
• In short, he would still need to report a reasonably
small P-value
• The eager investigator will need to "spin" his
results, ransack, data dredge
65
66. • In a world without predesignated thresholds, it
would be hard to hold him accountable for
reporting a nominally small P-value:
• “whether a p-value passes any arbitrary threshold
should not be considered at all" in interpreting
data (Wasserstein et al. 2019, 2)
66
67. No tests, no falsification
• The “no thresholds” view also blocks common uses
of confidence intervals and Bayes factor standards
• If you cannot say about any results, ahead of time,
they will not be allowed to count in favor of a claim,
then you do not have a test of it
• Don’t confuse having a threshold with the terrible
practice of using a fixed P-value across all studies
in an unthinking manner
• We should reject the latter
67
68. New ASA Task Force on
Significance Tests and Replication
• to “prepare a …piece reflecting “good statistical
practice,” without leaving the impression that p-
values and hypothesis tests…have no
role.” (Karen Kafadar 2019, 4)
• I hope that philosophers (of science and of
knowledge) get involved!
68
69. References
• Barnard, G. (1972). “The Logic of Statistical Inference (Review of ‘The Logic of Statistical Inference’ by
Ian Hacking)”, British Journal for the Philosophy of Science 23(2), 123–32.
• Bartlett, T. (2014). “Replication Crisis in Psychology Research Turns Ugly and Odd”, The Chronicle of
Higher Education (online) June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A
Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology 72: 90-
103.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). “Redefine Statistical Significance”, Nature
Human Behaviour 2, 6–10.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Lecture Notes-Monograph
Series, Vol. 6. Hayward, CA: Institute of Mathematical Statistics.
• Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237)
(March 14): 1033.
• Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided
Testing Problem”, Journal of the American Statistical Association 82(397), 106-11.
• Colquhoun, D. (2014). ‘An Investigation of the False Discovery Rate and the Misinterpretation of P-
values’, Royal Society Open Science 1(3), 140216.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” Error and
Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality
of Science. Mayo and Spanos (eds.), pp. 276–304. CUP.
69
70. • FDA (U. S. Food and Drug Administration) (2017). “Multiple Endpoints in Clinical Trials: Guidance for
Industry (DRAFT GUIDANCE).” Retrieved from https://www.fda.gov/media/102657/download
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and Boyd.
• Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”, American Scientist 102(6), 460–65.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder’”
British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80.
• Glymour, C. (2010). "Explanation and Truth". In Mayo, D. and Spanos, A. (eds.) Error and Inference:
Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science,
pp. 331–50. CUP.
• Goodman, S. N. (1999). “Toward Evidence-based Medical Statistics. 2: The Bayes Factor”, Annals of
Internal Medicine 130, 1005–13.
• Hacking, I. (1965). Logic of Statistical Inference. CUP.
• Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, in Mellor, D.
(ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, pp. 141–60. CUP.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS Medicine 2(8), 0696–
0701.
• Ioannidis, J. (2019). “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not
Abandon Significance”, JAMA 321(21), 2067–68. doi:10.1001/jama.2019.4582
• Kafadar, K. (2019). “The Year in Review … And More to Come,” President's Corner, AmStat News, (Issue
510), (Dec. 2019).
• Lakens, D., et al. (2018). “Justify your Alpha”, Nature Human Behaviour 2, 168–71.
70
71.
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.),
Foundations of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1991). “Novel Evidence and Severe Tests,” Philosophy of Science, 58 (4): 523-552.
Reprinted (1991) in The Philosopher’s Annual XIV: 203-232.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual
Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,
Cambridge: Cambridge University Press.
• Mayo, D. (2019). ”P-value thresholds: Forfeit at your peril”. European Journal of Clinical Investigation,
49, e13170. https://doi.org/10.1111/eci.13170
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo,
J. (ed.) The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series,
49, pp. 247-275. Institute of Mathematical Statistics.
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson
Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
• Meehl, P. (1990). "Why Summaries of Research on Psychological Theories Are Often Uninterpretable",
Psychological Reports 66(1): 195–244.
• Meehl, P. and Waller, N. (2002). "The Path Analysis Controversy: A New Statistical Approach to Strong
Appraisal of Verisimilitude", Psychological Methods 7(3): 283–300.
• Morrison, D. E., and R. E. Henkel, (eds.), (1970). The Significance Test Controversy: A Reader.
Chicago: Aldine De Gruyter.
• Musgrave, A. (1974). “Logical versus Historical Theories of Confirmation”, BJPS 25(1), 1–23.
72.
• Neyman, J. and Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical
Hypotheses." Philosophical Transactions of the Royal Society of London Series A 231, 289–337.
Reprinted in Joint Statistical Papers, 140–85.
• Neyman, J. and Pearson, E.S. (1967). Joint Statistical Papers of J. Neyman and E. S. Pearson.
Cambridge: Cambridge University Press.
• Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”,
Science 349(6251), 943–51.
• Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J.
Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press).
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL: Chapman and Hall,
CRC press.
• Royall, R. (2004). “The Likelihood Paradigm for Statistical Evidence” and “Rejoinder”. In Taper, M.
and Lele, S. (eds.) The Nature of Scientific Evidence, , pp. 119–138; 145–151. Chicago: University of
Chicago Press.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
• Selvin, H. (1970). “A critique of tests of significance in survey research”. In The significance test
controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
• Simmons, J., Nelson, L., and Simonsohn, U. (2012). “A 21 word solution”, Dialogue 26(2), 4–7.
• Wasserstein, R., Schirm, A. & Lazar, N. (2019). Moving to a World Beyond “p < 0.05” [Editorial]. The
American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913