1) The document discusses philosophical interventions in statistical debates and proposed reforms. It outlines three types of interventions: illuminating debates within statistics, reformulating frequentist tools through a severity perspective, and scrutinizing proposed reforms arising from the replication crisis.
2) A key idea is that evidence for a claim comes only when the claim has undergone severe testing, meaning the test would probably have found it flawed if it were false. The document argues for keeping the best aspects of Fisherian and Neyman-Pearson testing through a severity interpretation.
3) In scrutinizing reforms, it questions proposals to abandon statistical significance and P-value thresholds, arguing that doing so could exacerbate selective reporting rather than reduce it. The document advocates reformulating tests along severe-testing lines instead of abandoning them.
D. Mayo: Putting the brakes on the breakthrough: An informal look at the argu...
“Putting the Brakes on the Breakthrough, or ‘How I used simple logic to uncover a flaw in a controversial 50-year-old ‘theorem’ in statistical foundations taken as a ‘breakthrough’ in favor of Bayesian vs frequentist error statistics’”
D. Mayo: Philosophical Interventions in the Statistics Wars
ABSTRACT: While statistics has a long history of passionate philosophical controversy, the last decade especially cries out for philosophical illumination. Misuses of statistics, Big Data dredging, and P-hacking make it easy to find statistically significant, but spurious, effects. This obstructs a test's ability to control the probability of erroneously inferring effects–i.e., to control error probabilities. Disagreements about statistical reforms reflect philosophical disagreements about the nature of statistical inference–including whether error probability control even matters! I describe my interventions in statistics in relation to three events. (1) In 2016 the American Statistical Association (ASA) met to craft principles for avoiding misinterpreting P-values. (2) In 2017, a "megateam" (including philosophers of science) proposed "redefining statistical significance," replacing the common threshold of P ≤ .05 with P ≤ .005. (3) In 2019, an editorial in the main ASA journal called for abandoning all P-value thresholds, and even the words "significant/significance".
A word on each. (1) Invited to be a "philosophical observer" at their meeting, I found the major issues were conceptual. P-values measure how incompatible data are with what is expected under a hypothesis that there is no genuine effect: the smaller the P-value, the greater the indication of incompatibility. The ASA list of familiar misinterpretations–P-values are not posterior probabilities, statistical significance is not substantive importance, absence of evidence against a hypothesis is not evidence for it–should not, I argue, be the basis for replacing tests with methods less able to assess and control erroneous interpretations of data (Mayo 2016, 2019). (2) The "redefine statistical significance" movement appraises P-values from the perspective of a very different quantity: a comparative Bayes factor. Failing to recognize that contrasting approaches measure different things, disputants often talk past each other (Mayo 2018). (3) To ban P-value thresholds, even for distinguishing terrible from warranted evidence, is, I say, a mistake (2019). It will not eradicate P-hacking, but it will make it harder to hold P-hackers accountable. A 2020 ASA Task Force on significance testing has just been announced. (I would like to think my blog errorstatistics.com helped.)
To enter the fray between rival statistical approaches, it helps to have a principle applicable to all accounts. There is poor evidence for a claim if little, if anything, has been done to find it flawed, even if it is. This forms a basic requirement for evidence that I call the severity requirement. A claim passes with severity only if it is subjected to, and passes, a test that probably would have found it flawed, if it were. It stems from Popper, though he never adequately cashed it out. A variant is the frequentist principle of evidence developed with Sir David Cox (Mayo and Cox 2006).
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundaries
Do philosophers of science frequently contribute to science, and if so, how? Bibliometrics helps assess how surprisingly large the corpus of papers authored or co-authored by philosophers and published in science is. Indeed, several hundred philosophers have published in scientific journals. It is also possible to assess how influential this work has been in terms of citations, compared to the average number of citations in the same journals in the same year. Unsurprisingly, many of these papers are poorly cited while a handful are widely cited. However, the most interesting result is that there is a significant corpus of papers authored by philosophers (published both in science journals and in philosophy journals) that is significantly cited in science. It is more difficult, albeit crucial, to identify the most contributive philosophical papers, namely those which have penetrated science not only through publication or citation in science journals, but also through discussion or endorsement by some scientists.
Based on the identification of this often neglected corpus, which we propose to call "philosophy in science" (PinS), it becomes possible to describe the most central features of this particular way of doing philosophy of science. The first feature is bibliographic: philosophers in science tend to cite little philosophy and a lot of (up-to-date) science. Second, they address a scientific rather than a philosophical question. Third, in doing so, they use traditional tools of philosophy of science, typically conceptual analysis, explication of implicit claims, examination of the consistency of claims, and assessment of the relevance of methods or models. More rarely, but very interestingly, they also make positive and original contributions by bridging domains of science or suggesting hypotheses.
This different context – in particular, the specific requirements for publication in a peer-reviewed science journal – transforms philosophy of science. Is it still philosophy? How does it differ from approaches such as "philosophy of science in practice", "complementary science", "scientific philosophy", "theory of science", and naturalism? PinS faces a double "impostor syndrome": not entirely philosophical for philosophers, and not entirely scientific for scientists. In conclusion, we explore how PinS can respond to this double challenge.
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)
D. G. Mayo, April 28, 2021, presentation to the CUNY Graduate Center Philosophy Colloquium: "Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)"
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
The constructive role of replication crises teaches a lot about (1) non-fallacious uses of statistical tests, (2) the rationale for the role of probability in tests, and (3) how to reformulate tests.
Replication Crises and the Statistics Wars: Hidden Controversies
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance
Slides from a Rutgers seminar talk by Deborah G. Mayo, December 3, 2014, Rutgers Department of Statistics and Biostatistics
Abstract: Getting beyond today’s most pressing controversies revolving around statistical methods, I argue, requires scrutinizing their underlying statistical philosophies. Two main philosophies about the roles of probability in statistical inference are probabilism and performance (in the long run). The first assumes that we need a method of assigning probabilities to hypotheses; the second assumes that the main function of statistical method is to control long-run performance. I offer a third goal: controlling and evaluating the probativeness of methods. An inductive inference, in this conception, takes the form of inferring hypotheses to the extent that they have been well or severely tested. A report of poorly tested claims must also be part of an adequate inference. I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. I then show how the “severe testing” philosophy clarifies and avoids familiar criticisms and abuses of significance tests and cognate methods (e.g., confidence intervals). Severity may be threatened in three main ways: fallacies of statistical tests, unwarranted links between statistical and substantive claims, and violations of model assumptions.
D. G. Mayo: Your data-driven claims must still be probed severely
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science" at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics
Today we’ll try to cover a number of things:
1. Learning philosophy/philosophy of statistics
2. Situating the broad issues within philosophy of science
3. Little bit of logic
4. Probability and random variables
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
I will explore the extent to which concerns about ‘scientism’ – an unwarranted obeisance to scientific over other methods of inquiry – are intertwined with issues in the foundations of the statistical data analyses on which (social, behavioral, medical and physical) science increasingly depends. The rise of big data, machine learning, and high-powered computer programs has extended statistical methods and modeling across the landscape of science, law and evidence-based policy, but this has been accompanied by enormous hand-wringing as to the reliability, replicability, and valid use of statistics. Legitimate criticisms of scientism often stem from insufficiently self-critical uses of statistical methodology, broadly construed – i.e., from what might be called “statisticism” – particularly when those methods are applied to matters of controversy.
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo discusses various disputes – notably the replication crisis in science – in the context of her just-released book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
Severe Testing: The Key to Error Correction
D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Gerd Gigerenzer (Director of the Max Planck Institute for Human Development, Berlin, Germany) in the PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
Controversy Over the Significance Test Controversy
Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) in the PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
A. Gelman "50 shades of gray: A research story," presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
Fusion Confusion? Comments on Nancy Reid: "BFF Four – Are we Converging?"
D. Mayo's comments on Nancy Reid's "BFF Four – Are we Converging?", given May 2, 2017 at the Fourth Bayesian, Fiducial and Frequentist Workshop held at Harvard University.
D. G. Mayo (Virginia Tech) "Error Statistical Control: Forfeit at your Peril" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Statistical skepticism: How to use significance tests effectively
Prof. D. Mayo, presentation Oct. 12, 2017 at the ASA Symposium on Statistical Inference: “A World Beyond p < .05”, in the session “What are the best uses for P-values?”
Deborah G. Mayo: Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?
Presentation slides for "Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge" at the Boston Colloquium for Philosophy of Science (Feb 21, 2014).
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa”, presented (via Zoom) at the conference Is Philosophy Useful for Science, and/or Vice Versa?, January 30 – February 2, 2024, at Chapman University, Schmid College of Science and Technology.
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G. Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods at the Dept of Psychology, Princeton University, on November 14, 2023.
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
The Statistics Wars: Errors and Casualties
ABSTRACT: Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties. The philosophical presuppositions behind the meta-research battles remain largely hidden. Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science. Efforts of replication researchers and open science advocates are diminished when so much attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance, no evidence against isn’t evidence for), when erroneous understanding of basic statistical terms goes uncorrected, and when bandwagon effects lead to popular reforms that downplay the importance of error probability control. These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: if little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
The Statistics Wars and Their Casualties (refs)
The Statistics Wars and Their Casualties (w/refs)
"The Statistical Replication Crisis: Paradoxes and Scapegoats”jemille6
D. G. Mayo LSE Popper talk, May 10, 2016.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., preregistration), while others are quite radical. Recently, the American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Popular appeals to “diagnostic testing” that aim to improve replication rates may (unintentionally) permit the howlers and cookbook statistics we are at pains to root out. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Slides for Deborah G. Mayo's talk at the Minnesota Center for Philosophy of Science, University of Minnesota, on the ASA 2016 statement on P-values and error statistics.
Mark Rubin: Does preregistration improve the interpretability and credibility of research findings?
Rubin, M. (2022, March). Does preregistration improve the interpretability and credibility of research findings? In Research transparency: From preregistration to open access. Erasmus Research Institute of Management Research Transparency Campaign, Erasmus University Rotterdam. [Video recording: https://www.youtube.com/watch?v=xsEoLhQrKNQ&t=1s]
D. Mayo (Dept of Philosophy, VT): Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws.

The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability for μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence that μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed.

I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast (e.g., DNA matching), the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
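The severity construal of the lower confidence limit can be made concrete with a small numeric sketch. It is illustrative only: it assumes a Normal model with known standard deviation, and the values of σ, n, and the observed mean are hypothetical stand-ins, not numbers from the paper.

```python
# Severity construal of a lower confidence bound (illustrative sketch).
# Assumes a Normal model with known sigma; all numbers are hypothetical.
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 10.0, 100          # assumed known SD and sample size
se = sigma / sqrt(n)          # standard error of the sample mean M
M_obs = 2.0                   # hypothetical observed sample mean

# Lower limit of the two-sided 95% interval: a = M_obs - 1.96*se
a = M_obs - 1.96 * se

def severity_mu_greater(mu1):
    """SEV(mu > mu1): Pr(M < M_obs; mu = mu1), i.e., the probability of
    observing a smaller mean than we did, were mu as low as mu1."""
    return Phi((M_obs - mu1) / se)

print(f"lower limit a = {a:.2f}")
print(f"SEV(mu > a)  = {severity_mu_greater(a):.3f}")   # ~ .975, one-sided
for mu1 in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"SEV(mu > {mu1:.1f}) = {severity_mu_greater(mu1):.3f}")
```

On these numbers SEV(μ > a) ≈ .975, matching the one-sided construal in the abstract, and the severity for μ > μ’ falls toward .5 as μ’ approaches the observed mean.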
1. Philosophical Interventions in the Statistics Wars
Deborah G. Mayo, Virginia Tech
Philosophy in Science: Can Philosophers of Science Contribute to Science?
PSA 2021, November 13, 2-4 pm
2. “A Statistical Scientist Meets a Philosopher of Science”
Sir David Cox: “Deborah, in some fields foundations do not seem very important, but we both think foundations of statistical inference are important; why do you think that is?”
Mayo: “…in statistics…we invariably cross into philosophical questions about empirical knowledge and inductive inference.” (Cox and Mayo 2011)
Some call statistics “applied philosophy of science” (Kempthorne 1976)
3. Statistics → Philosophy of Science
Most of my interactions with statistics have been drawing out insights from stat:
(1) To solve philosophical problems about inductive inference, evidence, experiment;
(2) To answer knotty metamethodological questions: When (if ever) is it legitimate to use the ‘same’ data to construct and test a hypothesis?
4. Philosophy of Science → Statistics
• In the last decade I’m more likely to be intervening in stat—in the sense of this session: PinS
• A central job for philosophers of science: to minister to conceptual and logical problems of the sciences
• Especially when widely used methods (e.g., statistical significance tests) are said to be causing a crisis (and should be “abandoned” or “retired”)
5. Long-standing philosophical controversy on probability
Frequentists (error statisticians): to control and assess the relative frequency of misinterpretations of data—error probabilities (e.g., P-values, confidence intervals, randomization, resampling)
Bayesians (and other probabilists): to assign comparative degrees of belief or support to claims (e.g., Bayes factors, Bayesian posterior probabilities)
6.
• Wars between frequentists and Bayesians have been contentious; everyone wants to believe we are long past them.
• Long-standing battles still simmer below the surface.
7. My first type of intervention:
• Illuminate the debates, within and between rival stat tribes, in relation to today’s problems
• What’s behind the drumbeat that there’s a statistical crisis in science?
8.
• High-powered methods enable arriving at well-fitting models and impressive-looking effects even if they’re not warranted.
• I set sail with a simple tool: if little or nothing has been done to rule out flaws in inferring a claim, we do not have evidence for it.
9. A claim is warranted to the extent it passes severely
• We have evidence for a claim only to the extent that it has been subjected to, and passes, a test that would probably have found it flawed or specifiably false, just if it is
• This probability is the stringency or severity with which it has passed the test
10. Second type of intervention: statistical inference as severe testing
• Reformulate frequentist error statistical tools
• Probability arises (in scientific inference) to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data (Probativism)
• An excavation tool: it holds for any kind of inference; you needn’t accept this philosophy to use it to get beyond today’s statistics wars and scrutinize reforms
11. Third type of intervention: scrutinize proposed reforms growing out of the “replication crisis”
• Several proposed reforms are welcome: preregistration, avoidance of cookbook statistics, calls for more replication research
• Others are quite radical, and even obstruct practices known to improve replication.
12. Consider statistical significance tests (frequentist)
Significance tests (R.A. Fisher) are a small part of an error statistical methodology:
“…to test the conformity of the particular data under analysis with H0 in some respect….”
…the P-value: the probability of getting an even larger value than the observed test statistic t_obs, assuming only background variability or noise (Mayo and Cox 2006, 81)
13. Testing reasoning, as I see it
• If even larger differences than t_obs occur fairly frequently under H0 (i.e., the P-value is not small), there’s scarcely evidence of incompatibility with H0
• A small P-value indicates some underlying discrepancy from H0, because very probably (1 – P) you would have seen a smaller difference than t_obs were H0 true
• Even if the small P-value is valid, it isn’t evidence of a substantive scientific conclusion H*
Stat-Sub fallacy: H1 ⇒ H*
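To make the P-value reasoning concrete, here is a minimal sketch for a one-sided Normal test of H0: μ ≤ 0 vs H1: μ > 0 with σ known. The sample size and observed mean are hypothetical, chosen only for illustration.

```python
# P-value sketch for a one-sided Normal test, H0: mu <= 0 vs H1: mu > 0,
# with sigma known. All numbers are hypothetical.
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 1.0, 25
se = sigma / sqrt(n)
x_bar = 0.4                    # hypothetical observed sample mean
t_obs = x_bar / se             # standardized statistic under H0: mu = 0

# P-value: Pr(T >= t_obs; H0) -- how often background variability alone
# would yield a difference at least as large as the one observed.
p = 1 - Phi(t_obs)
print(f"t_obs = {t_obs:.2f}, P-value = {p:.4f}")
# With probability (1 - P) we would have seen a smaller difference than
# t_obs were H0 true -- the sense in which a small P indicates discrepancy.
```

Here t_obs = 2 and P ≈ .023: differences this large arise only about 2% of the time under background noise alone.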
14. Neyman-Pearson (N-P) put Fisherian tests on firmer footing (1933):
Introduces an explicit alternative hypothesis, e.g., H0: μ ≤ 0 vs. H1: μ > 0
• Constrains tests by requiring control of both the Type I error (erroneously rejecting H0) and the Type II error (erroneously failing to reject H0), and by considering power
(Neyman also developed confidence interval estimation at the same time)
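The error probabilities just listed can be computed directly for the one-sided Normal test above. The sketch below uses assumed, illustrative values for σ, n, and the cutoff; it is not from the slides.

```python
# N-P error probabilities for H0: mu <= 0 vs H1: mu > 0, sigma known.
# Illustrative values only.
from math import erf, sqrt

def Phi(z):
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 1.0, 25
se = sigma / sqrt(n)
z_cut = 1.96                 # reject H0 when x_bar/se > z_cut (alpha = .025)

type1 = 1 - Phi(z_cut)       # Type I error: Pr(reject; mu = 0)

def power(mu1):
    """Pr(reject; mu = mu1); the Type II error at mu1 is 1 - power(mu1)."""
    return 1 - Phi(z_cut - mu1 / se)

print(f"Type I error = {type1:.3f}")
for mu1 in (0.2, 0.4, 0.6):
    print(f"power at mu = {mu1}: {power(mu1):.3f}"
          f"  (Type II = {1 - power(mu1):.3f})")
```

Fixing α and then seeking the test with maximum power against alternatives is the N-P optimality idea referred to on the next slide.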
15. N-P tests as tools for optimal performance:
• Their success in optimal control of error probabilities gave a new paradigm for statistics
• It also encouraged viewing tests as “accept/reject” rules more apt for industrial quality control, or high-throughput screening, than for science
• Fisher, later in life, criticized N-P for turning “his” tests into acceptance-sampling tools—I learned later it was mostly in-fighting
16.
• Can we keep the best from Fisherian and N-P tests without an “inconsistent hybrid” (Gigerenzer)?
• This fueled my second intervention (Mayo 1991, 1996), later developed with econometrician Aris Spanos in 2000 and statistician David Cox in 2003
• “Our goal is to identify a key principle of evidence by which hypothetical error probabilities may be used for inductive inference.” (Mayo and Cox 2006)
• Mathematically, Fisher and N-P tests are nearly identical—it is an interpretation or philosophy that is needed
17. Both Fisher & N-P: it’s easy to lie with biasing selection effects
• Sufficient finagling—cherry-picking, significance seeking, multiple testing, post-data subgroups, trying and trying again—may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence
• Such a test fails a minimal requirement for a stringent or severe test (the P-value is invalidated), as the sketch below illustrates
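A minimal simulation of such finagling, under hypothetical settings: 20 independent tests of true null hypotheses, with only the best-looking result reported. The nominal .05 no longer describes the actual error probability of the reporting procedure.

```python
# Multiple testing / cherry-picking sketch: search k null tests and
# report any 'significant' one. Hypothetical settings.
import random
random.seed(1)

k, alpha, trials = 20, 0.05, 10_000
hits = 0
for _ in range(trials):
    # k studies with no real effect: each P-value is Uniform(0, 1)
    p_values = [random.random() for _ in range(k)]
    if min(p_values) <= alpha:      # report only the best result
        hits += 1

# Roughly 1 - 0.95**20 ~ 0.64, far above the nominal 0.05
print(f"Pr(at least one P <= {alpha} among {k} null tests)"
      f" ~ {hits / trials:.2f}")
```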
18. Key to solving a central problem
• Why is reliable performance relevant for a specific inference?
• Ask yourself: what bothers you about selective reporting, cherry-picking, stopping when the data look good, P-hacking?
19.
• It’s not a problem about long-run performance—
• It’s that we can’t say the test did its job in the case at hand: to give “a first line of defense against being fooled by randomness” (Benjamini 2016)
20. Inferential construal of error probabilities
• Use error probabilities to assess the capabilities of tools to probe various flaws (Probativism)
• They are what Popper calls “methodological probabilities”
• “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction” (Mayo and Spanos 2006)
• “Frequentist Statistics as a Theory of Inductive Inference” (Mayo and Cox 2006)
21. Popper vs. logics of induction/confirmation
Severity was Popper’s term, and the debate between Popperian falsificationism and inductive logics of confirmation/support parallels the debates in statistics.
Popper: claim C is “corroborated” to the extent C passes a severe test (one that probably would have detected C’s falsity, if false).
22. Comparative logic of support
• Ian Hacking (1965) “Law of Likelihood”:
data x support hypothesis H0 less well than H1 if
Pr(x; H0) < Pr(x; H1)
A problem is:
• Any hypothesis that perfectly fits the data is
maximally likely, even if data-dredged (see the sketch below)
• “there always is such a rival hypothesis viz., that
things just had to turn out the way they actually
did” (Barnard 1972, 129)
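A tiny sketch of Barnard’s point (my own illustration, with invented coin-flip data):

```python
# A "just so" hypothesis rigged to the data is maximally likely,
# however unwarranted. Coin-flip data invented for illustration.
x = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # observed flips (1 = heads)

lik_fair = 0.5 ** len(x)             # Pr(x; H0: fair coin) ~ 0.001
lik_just_so = 1.0                    # Pr(x; H*: "things just had to turn
                                     # out the way they actually did") = 1
print(lik_fair, lik_just_so)         # H* wins any likelihood comparison
```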
23. Error probabilities are
“one level above” a fit measure:
• Pr(H0 is less well supported than H1; H0) is high
for some H1 or other (see the simulation below)
“to fix a limit between ‘small’ and ‘large’ values of
[the likelihood ratio] we must know how often such
values appear when we deal with a true
hypothesis.” (Pearson and Neyman 1967, 106)
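A sketch of that calibration (mine, with an invented normal example): how often does the likelihood ratio favor H1 when H0 is in fact true?

```python
# Normal example: H0: mu = 0 vs H1: mu = 1, one observation, sigma = 1.
# We ask how often x better supports H1 when H0 is true.
import random
random.seed(2)

def log_lr(x, mu0=0.0, mu1=1.0):
    # log[Pr(x; H1)/Pr(x; H0)] for a unit-variance normal
    return (x - mu0) ** 2 / 2 - (x - mu1) ** 2 / 2

trials = 100_000
favors_H1 = sum(log_lr(random.gauss(0, 1)) > 0 for _ in range(trials))
print(f"Pr(x better supports H1 than H0; H0 true) ~ {favors_H1/trials:.2f}")
# ~0.31: "better support" for H1 is fairly common even under a true H0
```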
24. “There is No Such Thing as a Logic
of Statistical Inference”
• Hacking retracts his Law of Likelihood (LL) (1972,
1980)
• And retracts his earlier denial that Neyman–
Pearson statistics is inferential.
“I now believe that Neyman, Peirce, and
Braithwaite were on the right lines to follow in the
analysis of inductive arguments”
(Hacking 1980, 141)
25. Likelihood Principle: what counts
as evidence?
A pervasive view is that all the evidence is
contained in the ratio of likelihoods:
Pr(x; H0)/Pr(x; H1) (the likelihood principle, LP)
On the LP (followed by strict Bayesians):
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space” (Lindley 1971, 436)
26. Bayesians Howson and Urbach
• They say a significance test is precluded from
giving judgments about empirical support
• “[it] depends not only on the outcome that a trial
produced, but also on the outcomes that it could
have produced but did not. …determined by certain
private intentions of the experimenters, embodying
their stopping rule.” (Howson and Urbach 1993, 212)
• Whether error probabilities matter turns on your
methodology being able to pick up on them.
28. • So the frequentist needs to know the stopping rule (simulated in the sketch below)
For a (strict) Bayesian:
“It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for itself.”
(Berger and Wolpert, The Likelihood Principle 1988,
78)
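A simulation of why the stopping rule matters to the frequentist (my own sketch; the monitoring scheme and numbers are invented):

```python
# Sampling from a true null N(0,1), but testing after every observation
# (up to n = 100) and stopping at the first |z| >= 1.96. This "try and
# try again" rule inflates the Type I error far above the nominal .05,
# though the likelihood function is blind to the stopping rule.
import random
random.seed(3)

def try_and_try_again(max_n=100):
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0, 1)
        z = total / n ** 0.5            # z statistic at sample size n
        if abs(z) >= 1.96:
            return True                 # stop and declare "significance"
    return False

trials = 10_000
hits = sum(try_and_try_again() for _ in range(trials))
print(f"Type I error with optional stopping ~ {hits/trials:.2f}")  # far above .05
```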
29. Radiation oncologists look to phil
science: “Why do we disagree about
clinical trials?” (ASTRO 2021)
In a case we considered, Bayesian researchers wrote:
“The [regulatory] requirement of type I error control for
Bayesian adaptive designs causes them to lose many
of their philosophical advantages, such as compliance
with the likelihood principle [which does not require
adjusting]” (Ryan et al. 2020).
They admit “the type I error was inflated in the [trials]
…without adjustments to account for multiplicity”.
• No wonder they disagree, and it turns partly on the
likelihood principle (LP).
30. Bayesians may block implausible
inferences
• With a low prior degree of belief in H (e.g., that there is a
real effect), the Bayesian can block inferring H
• Can work in some cases (see the sketch below)
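A one-line sketch of the blocking move (my own illustration; the prior and likelihood ratio are invented):

```python
# With Pr(H) = .05 and a likelihood ratio of 10 favoring H, the posterior
# stays below one half, so a Bayesian can decline to infer H.
def posterior(prior, lr):
    # Odds form of Bayes's theorem: posterior odds = prior odds * LR
    odds = (prior / (1 - prior)) * lr
    return odds / (1 + odds)

print(f"Pr(H | x) = {posterior(prior=0.05, lr=10):.2f}")  # ~0.34
```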
31. Concerns
• Priors are an additional source of flexibility, on top of
biasing selection effects
• A low prior doesn’t show what researchers had done wrong;
the culprit is the multiple testing and data-dredging
• The believability of data-dredged hypotheses is
what makes them so seductive
• Claims can be highly probable (in any sense) while
poorly probed
32. Family feuds within the Bayesian
school: default, objective priors:
• Most Bayesian practitioners (over the last decade) look for
non-subjective prior probabilities
• “Default” priors are supposed to prevent prior beliefs
from influencing the posteriors, letting the data dominate
33. How should we interpret them?
“By definition, ‘non-subjective’ prior distributions are
not intended to describe personal beliefs, and in
most cases, they are not even proper probability
distributions….” (Bernardo 1997, 159–60)
• No agreement on rival systems for default/non-
subjective priors
(invariance, maximum entropy, maximizing missing
information, matching (Kass and Wasserman 1996))
34. There may be ways to combine Bayesian
and error statistical accounts
(Gelman: Falsificationist Bayesian; Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, … can be
understood as ‘error probes’ in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and Hinkley
(1974) call ‘pure significance testing’, in which certain of
the model’s implications are compared directly to the
data.” (Gelman and Shalizi 2013, 10, 20).
• Gelman was at a session on significance testing controversies at the 2016
PSA with Gigerenzer and Glymour
• But one can’t also champion “abandoning statistical significance”
35. Now we get to scrutinizing proposed
reforms
36. No Threshold view: Don’t say
‘significance’, don’t use P-value
thresholds
• In 2019, the executive director of the American Statistical
Association (ASA), Ron Wasserstein, and two co-authors
announce that “declarations of ‘statistical
significance’ be abandoned”
• Don’t say “significance”; don’t use P-value thresholds
(e.g., .05, .01, .005)
• John Ioannidis invited me and Andrew Gelman to write
opposing editorials on the “no threshold view”
(European Journal of Clinical Investigation);
mine was “P-value thresholds: forfeit at your peril”
37. • To be fair, many who signed on to the “no threshold view”
think that by removing P-value thresholds, researchers lose an
incentive to data-dredge, multiple-test, and otherwise
exploit researcher flexibility
• I argue that banning the use of P-value thresholds in
interpreting data does not diminish but rather exacerbates
data-dredging
38. • In a world without predesignated thresholds, it would
be hard to hold data dredgers accountable for reporting
a nominally small P-value obtained by ransacking,
data dredging, and trying and trying again.
• What distinguishes genuine P-values from invalid
ones is that they meet a prespecified error probability.
• No thresholds, no tests.
• We agree the actual P-value should be reported (as
all the founders of tests recommended)
39.
Problems are avoided by reformulating
tests in terms of a discrepancy γ from H0
Instead of a binary cut-off (significant or not), the
particular outcome is used to infer which discrepancies
are or are not warranted
In a nutshell: one tests several discrepancies from a
test hypothesis and infers those that are well or poorly
warranted
E.g., with non-significant results, we set an upper
bound: any discrepancy from H0 is less than γ
(see the sketch below)
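A minimal numerical sketch of the upper-bound idea (my own illustration; the data, σ, and n are invented):

```python
# Using a non-significant result to set an upper bound on the discrepancy
# from H0: mu = 0. The bound mu <= gamma is warranted only if, were mu as
# large as gamma, we would very probably have seen a larger sample mean.
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_upper(xbar, gamma, sigma, n):
    # Pr(observing a sample mean > xbar; mu = gamma)
    return 1 - Phi((xbar - gamma) / (sigma / sqrt(n)))

xbar, sigma, n = 0.2, 1.0, 25        # non-significant: z_obs = 1.0
for gamma in (0.3, 0.5, 0.8):
    print(f"SEV(mu <= {gamma}) = {sev_upper(xbar, gamma, sigma, n):.2f}")
# 0.69, 0.93, 1.00: mu <= 0.8 is well warranted, mu <= 0.3 is not
```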
40. Final Remarks: to intervene in
statistics battles you need to ask:
• How do they use probability?
(probabilism, performance, probativism (severe testing))
• What’s their notion of evidence?
(error probability principle, likelihood principle)
41. Intervening in today’s stat policy
reforms requires chutzpah
• Things have gotten so political that sometimes
outsider status can help in acrimonious battles
with thought leaders in statistics.
• To give an update: a Task Force of 14 statisticians
was appointed by the ASA President in 2019 “to
address concerns that [the no threshold view] might
be mistakenly interpreted as official ASA policy”
(Benjamini et al. 2021)
42. • “the use of P-values and significance testing, properly
applied and interpreted, are important tools that should not
be abandoned” (Benjamini et al. 2021)
• Instead, we need to confront the fact that basic stat
concepts are more confused than ever (in medicine,
economics, law, psychology, climate science, social
science etc.)
• I was glad to see the morning's session* organized by
members of the 2019 Summer Seminar in Phil Stat
(which Aris Spanos and I ran)
• I hope more philosophers of science enter the 2-way street between Phil Sci and Stat Sci
*Current Debates on Statistical Modeling and Inference
45. (FEV) Frequentist Principle of
Evidence: Mayo and Cox (2006)
(SEV) Severity: Mayo (1991, 1996, 2018); Mayo
and Spanos (2006)
FEV/SEV (small P-value): it indicates a discrepancy γ from
H0 only if there is a high probability the test would
have resulted in a larger P-value were a discrepancy
as large as γ absent.
FEV/SEV (moderate or large P-value): it indicates the
absence of a discrepancy γ from H0 only if there is
a high probability the test would have given a
worse fit with H0 (i.e., a smaller P-value) were a
discrepancy γ present.
(Both clauses are rendered in the sketch below.)
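Here is a sketch (mine, not code from the cited papers) translating the two clauses into simple functions for a one-sided normal test with known σ; the helper names and numbers are assumptions for illustration:

```python
# Sketch of the two FEV/SEV clauses for a one-sided test of H0: mu = 0.
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_indicates_discrepancy(xbar, gamma, sigma, n):
    # Small-P clause: mu > gamma is indicated only if, were mu only gamma,
    # we would very probably have seen a smaller mean than we did:
    # Pr(sample mean < xbar; mu = gamma)
    return Phi((xbar - gamma) / (sigma / sqrt(n)))

def sev_rules_out_discrepancy(xbar, gamma, sigma, n):
    # Large-P clause: absence of a discrepancy gamma is indicated only if
    # a worse fit with H0 (larger mean, smaller P-value) would very
    # probably have occurred were mu = gamma
    return 1 - Phi((xbar - gamma) / (sigma / sqrt(n)))

sigma, n = 1.0, 25
print(sev_indicates_discrepancy(0.6, 0.2, sigma, n))   # ~0.98: mu > 0.2 warranted
print(sev_rules_out_discrepancy(0.2, 0.6, sigma, n))   # ~0.98: mu <= 0.6 warranted
```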
46. References
• Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of Statistical Inference” by Ian
Hacking). British Journal for the Philosophy of Science 23(2), 123–32.
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force statement on
statistical significance and replicability. The Annals of Applied Statistics. (Online June 20, 2021.)
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph
Series. Hayward, CA: Institute of Mathematical Statistics.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” In Error and
Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality
of Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University
Press.
• Cox, D. and Mayo, D. (2011). “A Statistical Scientist Meets a Philosopher of Science: A Conversation
between Sir David Cox and Deborah Mayo”, in Rationality, Markets and Morals (RMM) 2, 103–14.
• Fisher, R. A. (1935a). The Design of Experiments. Edinburgh: Oliver and Boyd.
• Gelman, A. and Shalizi, C. (2013). Philosophy and the Practice of Bayesian Statistics and Rejoinder,
British Journal of Mathematical and Statistical Psychology 66(1), 8–38; 76–80.
• Giere, R. (1976). Empirical probability, objective statistical methods, and scientific inquiry. In Foundations
of Probability Theory, Statistical Inference and Statistical Theories of Science, vol. 2, edited by W. L. Harper
and C. A. Hooker, 63–101. Dordrecht, The Netherlands: D. Reidel.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.
• Hacking, I. (1972). Likelihood. British Journal for the Philosophy of Science 23, 132–37.
• Hacking, I. (1980). The theory of probable inference: Neyman, Peirce and Braithwaite. In Mellor, D.
(ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge
University Press, pp. 141–60.
47. • Harper, W. L., and C. A. Hooker (eds.) (1976). Foundations of Probability Theory, Statistical Inference and
Statistical Theories of Science. Vol. 2. Dordrecht, The Netherlands: D. Reidel.
• Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. LaSalle, IL: Open Court.
• Kass, R. & Wasserman, L. (1996). The Selection of Prior Distributions by Formal Rules. Journal of the
American Statistical Association 91, 1343–70.
• Kempthorne, O. (1976). Statistics and the Philosophers, in Harper, W. and Hooker, C. (eds.), Foundations of
Probability Theory, Statistical Inference and Statistical Theories of Science, Volume II. 273–314. Boston, MA:
D. Reidel.
• Lindley, D. V. (1971). The Estimation of Many Parameters in Godambe, V. and Sprott, D. (eds.), Foundations
of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1991). Novel Evidence and Severe Tests. Philosophy of Science 58(4), 523–52.
• Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
• Mayo, D. (2014). On the Birnbaum Argument for the Strong Likelihood Principle (with discussion), Statistical
Science 29(2), 227–39; 261–6.
• Mayo, D. (2016). Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary
on Wasserstein, R. L. and Lazar, N. A. 2016, “The ASA’s Statement on p-Values: Context, Process, and
Purpose”. The American Statistician 70(2) (supplemental materials).
• Mayo, D. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge:
Cambridge University Press.
• Mayo, D. (forthcoming). The Statistics Wars and Intellectual Conflicts of Interest (editorial). Conservation
Biology.
48. • Mayo, D. & Cox, D. (2006). Frequentist statistics as a theory of inductive inference. In Rojo, J. (ed.),
Optimality: The Second Erich L. Lehmann Symposium, Lecture Notes-Monograph series, Institute of
Mathematical Statistics (IMS), 49, pp. 77–97. (Reprinted 2010 in Mayo, D. and Spanos, A. (eds.), pp. 247–
75.)
• Mayo, D. & Hand, D. (under review). Statistical significance tests: Practicing damaging science, or damaging
scientific practice? In Kao, M., Shech, E., & Mayo, D. (eds.), Synthese (Special Issue: Recent Issues in Philosophy
of Statistics: Evidence, Testing, and Applications).
• Mayo, D. & Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of
induction. British Journal for the Philosophy of Science 57(2), 323–57.
• Mayo, D. G., and A. Spanos (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S.
Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The
Netherlands: Elsevier.
• Musgrave, A. (1974). ‘Logical versus Historical Theories of Confirmation’, The British Journal for the
Philosophy of Science 25(1), 1–23.
• Neyman, J. & Pearson, E. (1967). On the problem of the most efficient tests of statistical hypotheses. In Joint
Statistical Papers, 140–85. Berkeley: University of California Press. First published in Philosophical
Transactions of the Royal Society A 231 (1933), 289–337.
• Popper, K. (1959). The Logic of Scientific Discovery. London, New York: Routledge.
• Simmons, J., Nelson, L., & Simonsohn, U. (2012). A 21 word solution. Dialogue: The Official Newsletter of
the Society for Personality and Social Psychology 26(2), 4–7.
• Wasserstein, R. & Lazar, N. (2016). The ASA’s statement on p-values: Context, process and purpose (and
supplemental materials). The American Statistician, 70(2), 129-133.
• Wasserstein, R., Schirm, A,. & Lazar, N. (2019). Moving to a world beyond “p < 0.05” (Editorial). The
American Statistician 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913