1
Error Control and Severity
Deborah G Mayo
Dept of Philosophy, Virginia Tech
November 13, 2022
Philosophy of Science Association
Multiplicity, Data-Dredging, and Error Control
2
Multiplicity and data-dredging:
Biggest source of handwringing.
It is easy to data dredge impressive-looking
effects that are spurious
(minimal) Severity Requirement:
If the test procedure had little or no capability of
finding flaws with C (even if present), then
agreement between data x0 and C provides poor
(or no) evidence for C
(“too cheap to be worth having” Popper 1983)
3
4
Data Dredging (Torturing): Hunting
for Subgroups in RCTs
The case of the Drug CEO:
• No statistically significant benefit on the primary
endpoint (improved lung function)
• Nor on any 10 secondary endpoints
• Ransacks the unblinded data for a subgroup
where those on the drug did better.
• Reports it as a statistically significant result from a
double-blind study
The method has a high probability of reporting drug
benefit (in some subgroup or other), even if none
exists—illicit P-value.
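A minimal simulation sketch of that claim, with made-up numbers (20 post hoc subgroups, 50 patients per arm, no true drug effect), showing how often at least one subgroup comes out nominally significant:

```python
# Sketch: probability of finding a "significant" subgroup when the drug does nothing.
# Assumed illustrative numbers: 20 post hoc subgroups, 50 patients per arm, alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_subgroups, n_per_arm, alpha = 5000, 20, 50, 0.05

hits = 0
for _ in range(n_trials):
    # Outcomes under the null: treatment and control drawn from the same distribution.
    treat = rng.normal(0, 1, (n_subgroups, n_per_arm))
    ctrl = rng.normal(0, 1, (n_subgroups, n_per_arm))
    p = stats.ttest_ind(treat, ctrl, axis=1).pvalue
    hits += np.any(p < alpha)

print(f"P(at least one 'significant' subgroup | no effect) ~ {hits / n_trials:.2f}")
# With 20 independent subgroups this is roughly 1 - 0.95**20 ~ 0.64 -- the illicit P-value.
```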
5
But some cases of multiplicity & data
dredging satisfy severity
Searching a full database for a DNA match with a
criminal’s DNA:
• The probability of a mismatch with person i is high, if
i were not the criminal;
• So, the match is good evidence that i is the criminal.
• A non-match virtually excludes the person, thereby
strengthening the inference.
It’s the severity or lack of it that distinguishes whether
a data dredged claim is warranted
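A back-of-the-envelope sketch of why the database search still probes severely, using assumed illustrative numbers for the random-match probability and database size:

```python
# Sketch: why a database-wide DNA search can still pass severely.
# Assumed illustrative numbers: per-person random match probability 1e-9,
# database of 1 million people who are not the source.
p_match_innocent = 1e-9      # Pr(match | person i is not the criminal)
n_database = 1_000_000

# Probability of a match with a specific non-source person i is tiny,
# so a match with i is strong evidence that i is the source.
print(f"Pr(match with non-source person i) = {p_match_innocent:.1e}")

# Even the probability of ANY false match across the whole database stays small:
p_any_false = 1 - (1 - p_match_innocent) ** n_database
print(f"Pr(any false match in database) ~ {p_any_false:.1e}")   # ~ 1e-3
```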
6
The data dredging at issue involves
some element of double-counting
When data-driven discoveries are tested on new
data, it’s not data dredging
• The FDA gave the drug CEO funds to test his
‘exploratory’ hypothesis
• So it’s not a matter of whether it might be worth
studying
(Highly motivated CEOs can make such hypotheses
appear worth studying.)
7
• When the new study was stopped for futility
(2009), FDA said: you’re going to jail (for the
misleading press report)!
(reached Supreme Court, 2013, Mayo 2020)
• Even if the follow-up had succeeded, the initial
data poorly tested the dredged claim
8
Ruling out chance vs explaining
a known effect
The dredged hypotheses need not be
prespecified to be kosher
• The same data were used to arrive at and
test the source of a set of blurred 1919
eclipse data (mirror distortion by the sun’s
heat)
Nor is it a problem that the same data are used
to test multiple different claims (as Fisher
recommended)
9
My main point
The problem is when the results (or
hypotheses) are related in such a way that the
tester ensures the only ones to emerge or be
reported are in sync with the claim C at issue
(even if false).
The successes are due to the biasing selection
effects, not C’s truth
Biasing Selection Effects:
When data or hypotheses are selected,
generated or interpreted in such a way as to
fail the severity requirement
(includes inability to assess severity even
approximately)
10
The Severity Requirement with
Data Dredging and Multiplicity
11
• Severity has to be assessed according to the type
of error that C claims is well ruled out by
the data x.
12
In some cases we can compensate
with P-value adjustments
(huge literature)
• Even where such adjustments don’t give a
quantitative severity assessment, they can
reveal reasonably high or terrible severity*
• If you got 10 hits (out of 400 tests) and chance
alone expects 12, say, you haven’t distinguished
your effect from random variation.
• The main thing is: cases with selection effects
shouldn’t be treated the same as if selection
didn’t occur
I prefer to report the nominal P-value, and try to adjust post-data.
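A rough check of the 10-hits-out-of-400 example, assuming independent tests with the per-test chance rate implied by “chance expects 12”:

```python
# Sketch: 10 "hits" out of 400 tests when chance alone expects about 12.
import numpy as np
from scipy import stats

n_tests = 400
p_chance = 12 / 400          # per-test hit probability implied by "chance expects 12"
observed_hits = 10

# Probability of at least 10 hits by chance alone -- not small at all.
p_at_least = stats.binom.sf(observed_hits - 1, n_tests, p_chance)
print(f"Pr(>= {observed_hits} hits | chance alone) ~ {p_at_least:.2f}")  # ~ 0.77
# Nothing here distinguishes the effect from random variation.
```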
13
Post-data model selection is a
field of its own
• AI/ML prediction models may compensate for
using the “same” data by cross validation and
data splitting, at least with IID data.
• Not free from reproducibility and replicability
crises
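A minimal data-splitting sketch (pure NumPy, toy IID data invented for illustration): select on one half, assess the data-driven choice on the held-out half.

```python
# Sketch: data splitting -- "dredge" on a training half, assess on a held-out half.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = X[:, 0] * 0.5 + rng.normal(size=n)   # toy data: only feature 0 matters (assumed)

idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

# Data-driven selection on the training half: pick the feature most correlated with y.
corrs = [abs(np.corrcoef(X[train, j], y[train])[0, 1]) for j in range(X.shape[1])]
best = int(np.argmax(corrs))

# Assess that choice on fresh data -- the held-out half.
slope = np.polyfit(X[train, best], y[train], 1)[0]
pred = slope * X[test, best]
print(f"selected feature: {best}, held-out correlation: "
      f"{np.corrcoef(pred, y[test])[0, 1]:.2f}")
```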
Recap so far
I) Multiplicity and data dredging can alter
capabilities to probe errors (error probabilities)
a. True, but appropriate data-dredging can
satisfy relevant error probabilities
Next part:
b. True, but rivals to error statistics hold
principles of evidence where error
probabilities don’t matter
14
15
Non-error statistical principles
of evidence
All the evidence is via likelihood ratios (LR) of
hypotheses
Pr(x0;H1)/Pr(x0;H0)
The data x0 are fixed, the hypotheses vary
• Error probabilities drop out; the analysis
“conditions” on the observed x
• The drug CEO’s data-dredged claim is
“supported” by x
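A toy sketch (normal model with known σ = 1, simulated data, H1 chosen after seeing the data) of how the likelihood ratio “supports” the dredged claim by construction:

```python
# Sketch: likelihood ratio Pr(x0; H1)/Pr(x0; H0) for a normal mean, sigma known = 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 25
x = rng.normal(loc=0.0, scale=1.0, size=n)   # data actually generated under H0: mu = 0
xbar = x.mean()

# H1 chosen AFTER seeing the data: set mu1 equal to the observed mean (maximally likely).
loglik = lambda mu: stats.norm.logpdf(x, loc=mu, scale=1.0).sum()
lr = np.exp(loglik(xbar) - loglik(0.0))
print(f"xbar = {xbar:.3f}, LR in favor of the dredged H1 = {lr:.2f}")
# The LR always "supports" the dredged H1 (it is >= 1 by construction), even though
# the procedure had no chance of favoring H0 -- error probabilities flag this.
```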
16
All error probabilities violate the Likelihood Principle (LP)
• “Sampling distributions, significance levels,
power, all depend on something more [than the
likelihood function]–something that is irrelevant
in Bayesian inference–namely the sample
space.” (Lindley 1971, 436)
17
[Diagram: Inference by Bayes’ Theorem → The Likelihood Principle]
18
[Diagram: Inference by Bayes’ Theorem → The Likelihood Principle → no error probabilities]
19
Another kind of dredging: Optional Stopping
(in 2-sided testing of the mean of a standard normal distribution)
Stopping rules are irrelevant for Bayesians: the Stopping Rule Principle
“This principle is automatically
satisfied by Bayesian analysis, but is
viewed as crazy by many
frequentists.” (Bayarri et al. 2004, 77)
20
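A simulation sketch of the optional-stopping worry (assumed setup: 2-sided z test of a standard normal mean, nominal α = .05, maximum n of 1000): keep sampling until nominal significance is reached or the budget runs out.

```python
# Sketch: try-and-try-again optional stopping inflates the type I error.
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_max, z_crit = 2000, 1000, 1.96   # nominal alpha = 0.05, 2-sided

rejections = 0
for _ in range(n_trials):
    x = rng.normal(0.0, 1.0, n_max)          # H0 true: mu = 0
    cum_mean = np.cumsum(x) / np.arange(1, n_max + 1)
    z = cum_mean * np.sqrt(np.arange(1, n_max + 1))
    if np.any(np.abs(z) > z_crit):           # stop as soon as nominally significant
        rejections += 1

print(f"Actual type I error with optional stopping ~ {rejections / n_trials:.2f}")
# Far above the nominal 0.05; with no limit on n it approaches 1.
```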
Not just an example in phil stat, it
leads to quandaries in real trials
“The [regulatory] requirement of type I error control
for Bayesian adaptive designs causes them to lose
many of their philosophical advantages, e.g.,
compliance with the likelihood principle”
(Ryan et al. 2020, radiation oncology)
They admit “the type I error was inflated in the
Bayesian adaptive designs … [but] adjustments to
posterior probabilities, are not required”
21
• Default, non-subjective, O-Bayesians admit to
“technical violations” of the likelihood principle
(the prior can depend on the model)* (Ghosh et
al. 2010)
• We don’t see them embracing error statistical
principles—yet (it would be welcome)
(*That’s aside from violations of the LP in testing
model assumptions.)
22
Bayesians may block (or accept)
claims from multiplicity/dredging
with prior probabilities
(without error probs)
Problems:
• Increases flexibility (selection effects + priors
which can also be data dependent)
• Doesn’t show what’s gone wrong—it’s the
multiplicity
Criticisms of P-hackers lose force
• Statistically significant results that fail to replicate
are often reanalyzed Bayesianly
• Rather than point to (sometimes blatant) biased
selections, Bayesians show that, with a high prior on
H0, the data favor H0 over H1
• A P-value can be small while Pr(H0|x) non-small or
even large.
24
25
Bayes/Fisher Disagreement:
Spike and Smear
• A point null hypothesis, a lump of prior probability
on H0 or a tiny area around it [Xi ~ N(μ, σ²)]
H0: μ = 0 vs. H1: μ ≠ 0.
• Depending on how you spike and how you smear,
an α significant result can even correspond to
Pr(H0|x) = (1 – α)! (e.g., 0.95)
• But Pr(H0|x) can also agree with the small α
(though they’re measuring different things)
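A sketch of spike and smear under one common assumed setup (spike Pr(H0) = .5, alternative smeared as μ ~ N(0, 1), known σ = 1): hold the result just at the α = .05 cutoff and watch the posterior on H0 grow with n.

```python
# Sketch: spike-and-smear posterior for a result just significant at alpha = .05 (2-sided).
import numpy as np
from scipy import stats

tau, prior_H0 = 1.0, 0.5                    # assumed smear width and spike
for n in (100, 10_000, 1_000_000):
    xbar = 1.96 / np.sqrt(n)                # observed mean right at the cutoff
    m0 = stats.norm.pdf(xbar, 0, np.sqrt(1 / n))            # marginal under H0: mu = 0
    m1 = stats.norm.pdf(xbar, 0, np.sqrt(tau**2 + 1 / n))   # marginal under H1: mu ~ N(0, tau^2)
    post_H0 = m0 * prior_H0 / (m0 * prior_H0 + m1 * (1 - prior_H0))
    print(f"n = {n:>9}: P = .05 result, yet Pr(H0 | x) ~ {post_H0:.2f}")
# The posterior on H0 climbs with n (past .95), while the P-value stays at .05.
```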
A life raft to data dredgers:
“Bayesians can easily discount my statistically
significant result this way”.
• Even if it’s correct to reject the data dredged claim,
it can be right for the wrong reason:
• Put the blame where it belongs.
26
More stringent P-value thresholds?
• The high prior on H0 leads to the popular
movement to require more stringent P-values
(“redefine P-values” Benjamin et al. 2017).
• Lowering P-value thresholds may compensate for
multiple testing
(e.g., in high energy particle physics (look elsewhere
effect) and for error control in Big Data “diagnostic
screening” (e.g., interesting/not interesting))
27
• But the goal of the redefiners is to get the P-value
more in line with a Bayesian posterior on H0,
assuming H0 is given the high prior.
• Even advocates say it “does not address multiple
hypothesis testing, P-hacking” (Benjamin et al.
2017)
28
We should worry about biases in
favor of “no effect”
• Severe testers compute the probability this
Bayesian analysis leads to erroneously failing
to find various discrepancies (type 2 errors)
• If it’s high for discrepancies of interest, she
denies claims of no evidence against, let alone
evidence for, H0
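One way to cash this out is a sketch reusing the spike-and-smear setup above, with an assumed true discrepancy μ = 0.2 and n = 100: how often does that analysis end up favoring H0?

```python
# Sketch: probability the spike-and-smear analysis favors H0 when a discrepancy is real.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, mu_true, tau, prior_H0 = 100, 0.2, 1.0, 0.5   # assumed discrepancy and prior
n_sims, favors_H0 = 5000, 0

for _ in range(n_sims):
    xbar = rng.normal(mu_true, 1 / np.sqrt(n))
    m0 = stats.norm.pdf(xbar, 0, np.sqrt(1 / n))
    m1 = stats.norm.pdf(xbar, 0, np.sqrt(tau**2 + 1 / n))
    post_H0 = m0 * prior_H0 / (m0 * prior_H0 + m1 * (1 - prior_H0))
    favors_H0 += post_H0 > 0.5

print(f"Pr(analysis favors H0 | mu = {mu_true}) ~ {favors_H0 / n_sims:.2f}")
# If this "type 2" error probability is high, failing to find evidence against H0
# is poor grounds for claiming evidence of no effect.
```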
29
30
Last piece: Implications for using
error probabilities inferentially
• The problem of double counting/novel evidence
led me to the severity criterion, and to a distinct
use of error probabilities (probativism)
31
The severe tester reformulates tests
with a discrepancy γ from H0
• Admittedly, small P-values can be misinterpreted
as indicating larger discrepancies than warranted
• We infer discrepancies (population effect sizes)
that are and are not warranted (and how well)
Mayo (1991, 2018); Mayo and Spanos (2006, 2010);
Mayo and Cox (2006, 2010); Mayo and Hand
(2022)
32
Avoid misinterpreting a 2SE
significant result
33
Akin to forming the lower confidence
interval (CI) bound
(estimating μ, SE is the standard error, SE = 1)
μ > X̄ − 2SE
The 98% lower confidence interval estimator: 98% of
the time, it would correctly estimate μ (performance)
Once the sample mean is observed,
μ > x̄ − 2SE is the CI-lower estimate* (μ > 0);
it cannot be assigned .98 probability.
*fiducial 2% value (Fisher 1936): μ < CI-lower estimator 2% of the
time. These values of μ are not rejectable at the .02 level (Neyman)
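A quick simulation sketch of the performance claim (the true μ is an arbitrary assumed value; SE = 1 as on the slide):

```python
# Sketch: the ~98% performance of the lower bound mu > Xbar - 2SE (SE = 1, as on the slide).
import numpy as np

rng = np.random.default_rng(5)
mu_true, SE, n_sims = 3.0, 1.0, 100_000      # mu_true is an arbitrary assumed value

xbar = rng.normal(mu_true, SE, n_sims)       # repeated sampling of the estimator
covered = np.mean(mu_true > xbar - 2 * SE)
print(f"Proportion of samples with mu > Xbar - 2SE: {covered:.3f}")   # ~ .977
# This ~.98 is a property of the procedure; a single realized bound (mu > xbar - 2SE)
# either holds or not and cannot itself be assigned .98 probability.
```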
34
SEV gives an inferential view of CIs
Severe tester:
μ > CI-lower (e.g., μ > 0) is warranted because with
high probability (.98) we would have observed a
smaller sample mean if μ ≤ CI-lower
We report several confidence levels, a confidence
distribution (Cox 1958), or a severity distribution.
(Thornton will give her take)
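A sketch of the severity reading (assumed observed x̄ = 2, SE = 1), reported over several bounds rather than a single confidence level:

```python
# Sketch: severity for mu > mu_0 given the observed xbar (SE = 1), over several bounds.
import numpy as np
from scipy import stats

xbar_obs, SE = 2.0, 1.0                      # assumed observed mean for illustration
for mu_0 in (0.0, 0.5, 1.0, 1.5):
    sev = stats.norm.cdf((xbar_obs - mu_0) / SE)   # Pr(Xbar < xbar_obs; mu = mu_0)
    print(f"SEV(mu > {mu_0:3.1f}) = {sev:.3f}")
# mu > 0 (the CI-lower bound xbar - 2SE) gets SEV ~ .977; larger claimed discrepancies
# are less well warranted -- the severity analogue of a confidence distribution.
```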
35
Fallacies of non-significant results?
Say the cut-off for significance requires the test statistic to exceed 2
• It’s not evidence of no discrepancy from 0, but not
uninformative (even in simple Fisherian tests)
• Can infer (with severity) the absence of
discrepancies that probably would have led to a
larger sample mean (smaller P-value)*
*Less coarse than power analysis
36
Severity here is 1 minus the relevant type 2 error probability, in sync with “attained” power
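A sketch (assumed observed x̄ = 1.5, below the 2SE cutoff; SE = 1) of which claims of no discrepancy beyond γ are and are not warranted:

```python
# Sketch: what a non-significant result can (and cannot) rule out, using severity.
import numpy as np
from scipy import stats

xbar_obs, SE = 1.5, 1.0        # assumed: observed mean below the 2SE cutoff (non-significant)
for gamma in (1.0, 2.0, 3.0, 4.0):
    # SEV(mu <= gamma) = Pr(Xbar > xbar_obs; mu = gamma): 1 minus the relevant type 2
    # error probability, i.e. power attained at the observed result, not at the cutoff.
    sev = 1 - stats.norm.cdf((xbar_obs - gamma) / SE)
    print(f"SEV(mu <= {gamma:.0f}) = {sev:.3f}")
# mu <= 1 is poorly warranted (~.31), while mu <= 4 is well warranted (~.99): discrepancies
# as large as 4 probably would have produced a larger sample mean (smaller P-value).
```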
37
Recap
I. Multiplicity and data dredging can alter
capabilities to probe errors (error probabilities)
a) True, but appropriate data-dredging can satisfy
relevant error probabilities
b) True, but rivals to error statistics hold principles
of evidence where error probabilities don’t
matter.
II. A severity formulation allows error probabilities of
methods to be construed evidentially.
38
39
References
• Bayarri, M. and Berger, J. (2004). ‘The Interplay between Bayesian and Frequentist
Analysis’, Statistical Science 19, 58–80.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). ‘Redefine Statistical
Significance’, Nature Human Behaviour 2, 6–10.
• Berger, J. (2006). ‘The Case for Objective Bayesian Analysis’ and ‘Rejoinder’, Bayesian
Analysis 1(3), 385–402; 457–64.
• Cox, D. (1958). ‘Some Problems Connected with Statistical Inference’, Annals of
Mathematical Statistics 29(2), 357–72.
• Casella, G. and Berger, R. (1987a). ‘Reconciling Bayesian and Frequentist Evidence in
the One-sided Testing Problem’, Journal of the American Statistical Association 82
(397), 106–11.
• Fisher, R. A. (1936), ‘Uncertain Inference’, Proceedings of the American Academy of
Arts and Sciences 71, 248–58.
• Ghosh, J., Delampady, M., and Samanta, T. (2010). An Introduction to Bayesian
Analysis: Theory and Methods. New York: Springer.
• Godambe, V. and Sprott, D. (eds.) (1971). Foundations of Statistical Inference. Toronto:
Holt, Rinehart and Winston of Canada.
• Lindley, D. (1971). ‘The Estimation of Many Parameters’, in Godambe, V. and Sprott, D.
(eds.), pp. 435–55.
• Mayo, D. G. (1991). “Novel Evidence and Severe Tests,” Philosophy of Science, 58 (4):
523-552. Reprinted (1991) in The Philosopher’s Annual XIV: 203-232.
40
References (cont.)
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge, Chicago:
University of Chicago Press. (1998 Lakatos Prize)
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the
Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. (2020). “P-Values on Trial: Selective Reporting of (Best Practice Guides
Against) Selective Reporting” Harvard Data Science Review 2.1.
• Mayo, D. G. and Hand, D. (2022). “Statistical Significance and Its Critics: Practicing
Damaging Science, or Damaging Scientific Practice?”, Synthese 200, 220.
https://doi.org/10.1007/s11229-022-03692-0
• Mayo, D.G. and Cox, D. R. (2006) “Frequentist Statistics as a Theory of Inductive
Inference,” Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo),
Lecture Notes-Monograph series, IMS, Vol. 49: 77-97.
• Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-
Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-
357.
• Popper, K. (1983). Realism and the Aim of Science. Totowa, NJ: Rowman and
Littlefield.
41