D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
A. Gelman "50 shades of gray: A research story," presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
D. G. Mayo (Virginia Tech) "Error Statistical Control: Forfeit at your Peril" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
"The Statistical Replication Crisis: Paradoxes and Scapegoats”jemille6
D. G. Mayo LSE Popper talk, May 10, 2016.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., preregistration), while others are quite radical. Recently, the American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Popular appeals to “diagnostic testing” that aim to improve replication rates may (unintentionally) permit the howlers and cookbook statistics we are at pains to root out. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Stephen Senn slides: "'Repligate': reproducibility in statistical studies. What does it mean and in what sense does it matter?" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference" at the 2015 APS Annual Convention in NYC.
Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance
Slides from Rutgers Seminar talk by Deborah G Mayo
December 3, 2014
Rutgers, Department of Statistics and Biostatistics
Abstract: Getting beyond today’s most pressing controversies revolving around statistical methods, I argue, requires scrutinizing their underlying statistical philosophies. Two main philosophies about the roles of probability in statistical inference are probabilism and performance (in the long-run). The first assumes that we need a method of assigning probabilities to hypotheses; the second assumes that the main function of statistical method is to control long-run performance. I offer a third goal: controlling and evaluating the probativeness of methods. An inductive inference, in this conception, takes the form of inferring hypotheses to the extent that they have been well or severely tested. A report of poorly tested claims must also be part of an adequate inference. I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. I then show how the “severe testing” philosophy clarifies and avoids familiar criticisms and abuses of significance tests and cognate methods (e.g., confidence intervals). Severity may be threatened in three main ways: fallacies of statistical tests, unwarranted links between statistical and substantive claims, and violations of model assumptions.
Severe Testing: The Key to Error Correction
D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
Deborah G. Mayo: Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?
Presentation slides for: Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge, at the Boston Colloquium for Philosophy of Science (Feb 21, 2014).
Statistical skepticism: How to use significance tests effectively
Prof. D. Mayo, presentation Oct. 12, 2017 at the ASA Symposium on Statistical Inference: “A World Beyond p < .05”, in the session “What are the best uses for P-values?”
A. Gelman "50 shades of gray: A research story," presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
D. G. Mayo (Virginia Tech) "Error Statistical Control: Forfeit at your Peril" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
"The Statistical Replication Crisis: Paradoxes and Scapegoats”jemille6
D. G. Mayo LSE Popper talk, May 10, 2016.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., preregistration), while others are quite radical. Recently, the American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Popular appeals to “diagnostic testing” that aim to improve replication rates may (unintentionally) permit the howlers and cookbook statistics we are at pains to root out. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Stephen Senn slides:"‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference"," at the 2015 APS Annual Convention in NYC
Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performancejemille6
Slides from Rutgers Seminar talk by Deborah G Mayo
December 3, 2014
Rutgers, Department of Statistics and Biostatistics
Abstract: Getting beyond today’s most pressing controversies revolving around statistical methods, I argue, requires scrutinizing their underlying statistical philosophies.Two main philosophies about the roles of probability in statistical inference are probabilism and performance (in the long-run). The first assumes that we need a method of assigning probabilities to hypotheses; the second assumes that the main function of statistical method is to control long-run performance. I offer a third goal: controlling and evaluating the probativeness of methods. An inductive inference, in this conception, takes the form of inferring hypotheses to the extent that they have been well or severely tested. A report of poorly tested claims must also be part of an adequate inference. I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. I then show how the “severe testing” philosophy clarifies and avoids familiar criticisms and abuses of significance tests and cognate methods (e.g., confidence intervals). Severity may be threatened in three main ways: fallacies of statistical tests, unwarranted links between statistical and substantive claims, and violations of model assumptions.
Severe Testing: The Key to Error Correctionjemille6
D. G. Mayo's slides for her presentation given March 17, 2017 at Boston Colloquium for Philosophy of Science, Alfred I.Taub forum: "Understanding Reproducibility & Error Correction in Science"
Deborah G. Mayo: Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?
Presentation slides for: Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge[*] at the Boston Colloquium for Philosophy of Science (Feb 21, 2014).
Statistical skepticism: How to use significance tests effectively jemille6
Prof. D. Mayo, presentation Oct. 12, 2017 at the ASA Symposium on Statistical Inference : “A World Beyond p < .05” in the session: “What are the best uses for P-values?“
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Gerd Gigerenzer (Director of Max Planck Institute for Human Development, Berlin, Germany) in the PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
Replication Crises and the Statistics Wars: Hidden Controversies
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
Controversy Over the Significance Test Controversy
Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) in PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
D. G. Mayo: Your data-driven claims must still be probed severely
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics
Exploratory Research is More Reliable Than Confirmatory Research
PSA 2016 Symposium:
Philosophy of Statistics in the Age of Big Data and Replication Crises
Presenter: Clark Glymour (Alumni University Professor in Philosophy, Carnegie Mellon University, Pittsburgh, Pennsylvania)
ABSTRACT: Ioannidis (2005) argued that most published research is false, and that “exploratory” research in which many hypotheses are assessed automatically is especially likely to produce false positive relations. Colquhoun (2014), using simulations, estimates that 30 to 40% of positive results using the conventional .05 cutoff for rejection of a null hypothesis are false. Their explanation is that true relationships in a domain are rare and the selection of hypotheses to test is roughly independent of their truth, so most relationships tested will in fact be false. Conventional use of hypothesis tests, in other words, suffers from a base rate fallacy. I will show that the reverse is true for modern search methods for causal relations because: (a) each hypothesis is tested or assessed multiple times; (b) the methods are biased against positive results; (c) systems in which true relationships are rare are an advantage for these methods. I will substantiate the claim with both empirical data and with simulations of data from systems with a thousand to a million variables that result in fewer than 5% false positive relationships and in which 90% or more of the true relationships are recovered.
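An illustrative sketch (not from the talk) of the base-rate arithmetic behind Ioannidis/Colquhoun-style estimates; the prevalence, alpha, and power values below are hypothetical choices, not figures from either paper:

# Illustrative sketch: base-rate arithmetic behind false-discovery estimates.
# Assumed inputs (prevalence of true effects, alpha, power) are hypothetical.

def false_discovery_proportion(prevalence, alpha=0.05, power=0.80):
    """Expected share of 'positive' results that are false, given the
    prior prevalence of true effects among hypotheses tested."""
    false_positives = (1.0 - prevalence) * alpha   # true nulls wrongly rejected
    true_positives = prevalence * power            # real effects detected
    return false_positives / (false_positives + true_positives)

if __name__ == "__main__":
    for prev in (0.5, 0.1, 0.01):
        fdp = false_discovery_proportion(prev)
        print(f"prevalence of true effects {prev:>5.2f}: "
              f"false discovery proportion ~ {fdp:.0%}")
    # With 10% of tested hypotheses true, alpha = .05 and 80% power,
    # roughly 36% of 'significant' findings are false positives.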
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
I will explore the extent to which concerns about ‘scientism’ – an unwarranted obeisance to scientific over other methods of inquiry – are intertwined with issues in the foundations of the statistical data analyses on which (social, behavioral, medical and physical) science increasingly depends. The rise of big data, machine learning, and high-powered computer programs has extended statistical methods and modeling across the landscape of science, law and evidence-based policy, but this has been accompanied by enormous hand wringing as to the reliability, replicability, and valid use of statistics. Legitimate criticisms of scientism often stem from insufficiently self-critical uses of statistical methodology, broadly construed – i.e., from what might be called “statisticism” – particularly when those methods are applied to matters of controversy.
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
The constructive role of replication crises teaches a lot about: (1) non-fallacious uses of statistical tests, (2) the rationale for the role of probability in tests, and (3) how to reformulate tests.
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
D. G. Mayo April 28, 2021 presentation to the CUNY Graduate Center Philosophy Colloquium "Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)"
D. Mayo: Philosophical Interventions in the Statistics Wars
ABSTRACT: While statistics has a long history of passionate philosophical controversy, the last decade especially cries out for philosophical illumination. Misuses of statistics, Big Data dredging, and P-hacking make it easy to find statistically significant, but spurious, effects. This obstructs a test's ability to control the probability of erroneously inferring effects–i.e., to control error probabilities. Disagreements about statistical reforms reflect philosophical disagreements about the nature of statistical inference–including whether error probability control even matters! I describe my interventions in statistics in relation to three events. (1) In 2016 the American Statistical Association (ASA) met to craft principles for avoiding misinterpreting P-values. (2) In 2017, a "megateam" (including philosophers of science) proposed "redefining statistical significance," replacing the common threshold of P ≤ .05 with P ≤ .005. (3) In 2019, an editorial in the main ASA journal called for abandoning all P-value thresholds, and even the words "significant/significance".
A word on each. (1) Invited to be a "philosophical observer" at their meeting, I found the major issues were conceptual. P-values measure how incompatible data are with what is expected under a hypothesis that there is no genuine effect: the smaller the P-value, the more indication of incompatibility. The ASA list of familiar misinterpretations–P-values are not posterior probabilities, statistical significance is not substantive importance, lack of evidence against a hypothesis need not be evidence for it–should not, I argue, be the basis for replacing tests with methods less able to assess and control erroneous interpretations of data (Mayo 2016, 2019). (2) The "redefine statistical significance" movement appraises P-values from the perspective of a very different quantity: a comparative Bayes Factor. Failing to recognize how contrasting approaches measure different things, disputants often talk past each other (Mayo 2018). (3) To ban P-value thresholds, even to distinguish terrible from warranted evidence, is, I say, a mistake (2019). It will not eradicate P-hacking, but it will make it harder to hold P-hackers accountable. A 2020 ASA Task Force on significance testing has just been announced. (I would like to think my blog errorstatistics.com helped.)
To enter the fray between rival statistical approaches, it helps to have a principle applicable to all accounts. There is poor evidence for a claim if little if anything has been done to find flaws in it, even if the claim is flawed. This forms a basic requirement for evidence I call the severity requirement. A claim passes with severity only if it is subjected to and passes a test that probably would have found it flawed, if it were. It stems from Popper, though he never adequately cashed it out. A variant is the frequentist principle of evidence developed with Sir David Cox (Mayo and Cox 20
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo discusses various disputes, notably the replication crisis in science, in the context of her just-released book Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
Today we’ll try to cover a number of things:
1. Learning philosophy/philosophy of statistics
2. Situating the broad issues within philosophy of science
3. Little bit of logic
4. Probability and random variables
Byrd: Statistical considerations of the histomorphometric test protocol
"Statistical considerations of the histomorphometric test protocol"
John E. Byrd, Ph.D. D-ABFA
Maria-Teresa Tersigni-Tarrant, Ph.D.
Central Identification Laboratory
JPAC
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G. Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods at the Dept of Psychology, Princeton University, on November 14, 2023.
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
Slides from Deborah G. Mayo's talk at the Minnesota Center for Philosophy of Science, University of Minnesota, on the ASA 2016 statement on P-values and Error Statistics.
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa” presented (via Zoom) at the conference IS PHILOSOPHY USEFUL FOR SCIENCE, AND/OR VICE VERSA?, January 30 - February 2, 2024, at Chapman University, Schmid College of Science and Technology.
What is the reproducibility crisis in science and what can we do about it? Dorothy Bishop
Talk given to the Rhodes Biomedical Association, 4th May 2016.
For references see: http://www.slideshare.net/deevybishop/references-on-reproducibility-crisis-in-science-by-dvm-bishop
D. Mayo (Dept of Philosophy, VT)
Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Nancy Reid, Dept. of Statistics, University of Toronto. Inaugural recipient of the "David R. Cox Foundations of Statistics Award".
Slides from her invited presentation at the 2023 JSM: “The Importance of Foundations in Statistical Science”
Ronald Wasserstein, Chair (American Statistical Association)
ABSTRACT: David Cox wrote “A healthy interplay between theory and application is crucial for statistics… This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods.” These foundations distinguish statistical science from the many fields of research in which statistical thinking is a key intellectual component. In this talk I will emphasize the ongoing importance and relevance of theoretical advances and theoretical thinking through some illustrative examples.
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role for which scientists look to statistics. In order to combat the dangers of unthinking bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed “reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Causal inference is not statistical inference
Jon Williamson (University of Kent)
ABSTRACT: Many methods for testing causal claims are couched as statistical methods: e.g., randomised controlled trials, various kinds of observational study, meta-analysis, and model-based approaches such as structural equation modelling and graphical causal modelling. I argue that this is a mistake: causal inference is not a purely statistical problem. When we look at causal inference from a general point of view, we see that methods for causal inference fit into the framework of Evidential Pluralism: causal inference is properly understood as requiring mechanistic inference in addition to statistical inference.
Evidential Pluralism also offers a new perspective on the replication crisis. That observed associations are not replicated by subsequent studies is a part of normal science. A problem only arises when those associations are taken to establish causal claims: a science whose established causal claims are constantly overturned is indeed in crisis. However, if we understand causal inference as involving mechanistic inference alongside statistical inference, as Evidential Pluralism suggests, we avoid fallacious inferences from association to causation. Thus, Evidential Pluralism offers the means to prevent the drama of science from turning into a crisis.
Stephan Guttinger (Lecturer in Philosophy of Data/Data Ethics, University of Exeter, UK)
ABSTRACT: The idea of “questionable research practices” (QRPs) is central to the narrative of a replication crisis in the experimental sciences. According to this narrative the low replicability of scientific findings is not simply due to fraud or incompetence, but in large part to the widespread use of QRPs, such as “p-hacking” or the lack of adequate experimental controls. The claim is that such flawed practices generate flawed output. The reduction – or even elimination – of QRPs is therefore one of the main strategies proposed by policymakers and scientists to tackle the replication crisis.
What counts as a QRP, however, is not clear. As I will discuss in the first part of this paper, there is no consensus on how to define the term, and ascriptions of the qualifier “questionable” often vary across disciplines, time, and even within single laboratories. This lack of clarity matters as it creates the risk of introducing methodological constraints that might create more harm than good. Practices labelled as ‘QRPs’ can be both beneficial and problematic for research practice, and targeting them without a sound understanding of their dynamic and context-dependent nature risks creating unnecessary casualties in the fight for a more reliable scientific practice.
To start developing a more situated and dynamic picture of QRPs I will then turn my attention to a specific example of a dynamic QRP in the experimental life sciences, namely, the so-called “Far Western Blot” (FWB). The FWB is an experimental system that can be used to study protein-protein interactions but which for most of its existence has not seen a wide uptake in the community because it was seen as a QRP. This was mainly due to its (alleged) propensity to generate high levels of false positives and negatives. Interestingly, however, it seems that over the last few years the FWB slowly moved into the space of acceptable research practices. Analysing this shift and the reasons underlying it, I will argue a) that suppressing this practice deprived the research community of a powerful experimental tool and b) that the original judgment of the FWB was based on a simplistic and non-empirical assessment of its error-generating potential. Ultimately, it seems like the key QRP at work in the FWB case was the way in which the label “questionable” was assigned in the first place. I will argue that findings from this case can be extended to other QRPs in the experimental life sciences and that they point to a larger issue with how researchers judge the error-potential of new research practices.
David Hand (Professor Emeritus and Senior Research Investigator, Department of Mathematics, Faculty of Natural Sciences, Imperial College London)
ABSTRACT: Science progresses through an iterative process of formulating theories and comparing them with empirical real-world data. Different camps of scientists will favour different theories, until accumulating evidence renders one or more untenable. Not unnaturally, people become attached to theories. Perhaps they invented a theory, and kudos arises from being the originator of a generally accepted theory. A theory might represent a life's work, so that being found wanting might be interpreted as failure. Perhaps researchers were trained in a particular school, and acknowledging its shortcomings is difficult. Because of this, tensions can arise between proponents of different theories.
The discipline of statistics is susceptible to precisely the same tensions. Here, however, the tensions are not between different theories of "what is", but between different strategies for shedding light on the real world from limited empirical data. This can be in the form of how one measures discrepancy between the theory's predictions and observations. It can be in the form of different ways of looking at empirical results. It can be, at a higher level, because of differences between what is regarded as important in a particular context. Or it can be for other reasons.
Perhaps the most familiar example of this tension within statistics is between different approaches to inference. However, there are many other examples of such tensions. This paper illustrates with several examples. We argue that the tension generally arises as a consequence of inadequate care being taken in question formulation. That is, insufficient thought is given to deciding exactly what one wants to know - to determining "What is the question?". The ideas and disagreements are illustrated with several examples.
The neglected importance of complexity in statistics and Metascience
Daniele Fanelli (London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science)
ABSTRACT: Statistics is at war, and Metascience is ailing. This is partially due, the talk will argue, to a paradigmatic blind-spot: the assumption that one can draw general conclusions about empirical findings without considering the role played by context, conditions, assumptions, and the complexity of methods and theories. Whilst ideally these particularities should be unimportant in science, in practice they cannot be neglected in most research fields, let alone in research-on-research.
This neglected importance of complexity is supported by theoretical arguments and empirical findings (or the lack thereof) in the recent meta-analytical and metascientific literature. The talk will overview this background and suggest how the complexity of theories and methodologies may be explicitly factored into particular methodologies of statistics and Metaresearch. The talk will then give examples of how this approach may usefully complement existing paradigms, by translating results, methods and theories into quantities of information that are evaluated using an information-compression logic.
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Uri Simonsohn (Professor, Department of Operations, Innovation and Data Sciences at Esade)
ABSTRACT: The statistical tools listed in the title share that a mathematically elegant solution has become the consensus advice of statisticians, methodologists and some mathematically sophisticated researchers writing tutorials and textbooks, and yet they lead research workers to meaningless answers that are often also statistically invalid. Part of the problem is that advice givers take the mathematical abstractions of the tools they advocate for literally, instead of taking the actual behavior of researchers seriously.
On Severity, the Weight of Evidence, and the Relationship Between the Two
Margherita Harris (Visiting Fellow in the Department of Philosophy, Logic and Scientific Method at the London School of Economics and Political Science)
ABSTRACT: According to the severe tester, one is justified in declaring to have evidence in support of a hypothesis just in case the hypothesis in question has passed a severe test, one that it would be very unlikely to pass so well if the hypothesis were false. Deborah Mayo (2018) calls this the strong severity principle. The Bayesian, however, can declare to have evidence for a hypothesis despite not having done anything to test it severely. The core reason for this has to do with the (infamous) likelihood principle, whose violation is not an option for anyone who subscribes to the Bayesian paradigm. Although the Bayesian is largely unmoved by the incompatibility between the strong severity principle and the likelihood principle, I will argue that the Bayesian’s never-ending quest to account for yet another notion, one that is often attributed to Keynes (1921) and that is usually referred to as the weight of evidence, betrays the Bayesian’s confidence in the likelihood principle after all. Indeed, I will argue that the weight of evidence and severity may be thought of as two (very different) sides of the same coin: they are two unrelated notions, but what brings them together is the fact that they both make trouble for the likelihood principle, a principle at the core of Bayesian inference. I will relate this conclusion to current debates on how best to conceptualise uncertainty, by the IPCC in particular. I will argue that failure to fully grasp the limitations of an epistemology that envisions the role of probability to be that of quantifying the degree of belief to assign to a hypothesis given the available evidence can be (and has been) detrimental to an adequate communication of uncertainty.
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Aris Spanos (Wilson Schmidt Professor of Economics, Virginia Tech)
ABSTRACT: The discussion places the two cultures, the model-driven statistical modeling and the algorithm-driven modeling associated with Machine Learning (ML) and Statistical Learning Theory (SLT), in a broader context of paradigm shifts in 20th-century statistics, which includes Fisher’s model-based induction of the 1920s and variations/extensions thereof, the Data Science (ML, SLT, etc.) and the Graphical Causal modeling in the 1990s. The primary objective is to compare and contrast the effectiveness of different approaches to statistics in learning from data about phenomena of interest and relate that to the current discussions pertaining to the statistics wars and their potential casualties.
Comparing Frequentist and Bayesian Control of Multiple Testing
James Berger
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20 × 100,000 = 2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 × 10⁻⁷. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 × 10⁻⁷. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. ... There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
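A minimal sketch of the multiplicity arithmetic the abstract points to, using Bonferroni only as a familiar frequentist benchmark (the talk itself argues for Bayesian multiplicity control); the 2,000,000-test figure is taken from the abstract, everything else is an assumption for illustration:

# Minimal sketch of the multiplicity arithmetic behind the GWAS example above.
# Bonferroni is shown purely as a familiar frequentist benchmark.

def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test threshold that caps the familywise error rate at alpha."""
    return alpha / m

if __name__ == "__main__":
    m = 20 * 100_000          # 2,000,000 disease/gene tests, as in the abstract
    print(f"FWER at per-test alpha .05: {familywise_error_rate(0.05, m):.6f}")
    print(f"Bonferroni per-test threshold: {bonferroni_threshold(0.05, m):.1e}")
    # At a per-test .05 cutoff, spurious associations are essentially guaranteed;
    # a Bonferroni-style correction pushes the working threshold to ~2.5e-8,
    # illustrating why genome-wide standards are so stringent.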
Clark Glymour
ABSTRACT: "Data dredging"--searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. ... These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
The Duality of Parameters and the Duality of Probability
Suzanne Thornton
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)--the behavior of a procedure under hypothetical repetition--bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited by the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to cover the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. ... In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter; rather, it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
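A small sketch (not from the paper) of what a confidence distribution looks like for the Normal-mean example in the abstract: a data-dependent distribution function on the parameter space from which intervals at any confidence level can be read off; the sample data, sample size, and function names below are hypothetical:

# Small sketch of a confidence distribution (CD) for the mean of a
# Normal(theta, 1) sample, the example used in the abstract above.
import numpy as np
from scipy.stats import norm

def normal_mean_cd(xbar, n, sigma=1.0):
    """Return the CD for theta: H_n(theta) = Phi((theta - xbar) / (sigma/sqrt(n)))."""
    se = sigma / np.sqrt(n)
    return lambda theta: norm.cdf((theta - xbar) / se)

def cd_interval(xbar, n, level, sigma=1.0):
    """Read a central confidence interval of any level off the same CD."""
    se = sigma / np.sqrt(n)
    half = norm.ppf(0.5 + level / 2) * se
    return xbar - half, xbar + half

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_true = 1.7                      # unknown in practice; fixed here to simulate
    x = rng.normal(theta_true, 1.0, size=25)
    xbar, n = x.mean(), x.size
    cd = normal_mean_cd(xbar, n)
    print(f"CD evaluated at theta = 1.0: {cd(1.0):.3f}")
    for lvl in (0.50, 0.90, 0.95):
        lo, hi = cd_interval(xbar, n, lvl)
        print(f"{lvl:.0%} interval from the CD: ({lo:.2f}, {hi:.2f})")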
Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability assigned to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting), when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast (e.g., DNA matching), the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
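An illustrative numerical sketch (not from the paper) of the post-data severity calculation the abstract describes for the Normal mean, SEV(μ > μ′) = P(M ≤ m_obs; μ = μ′), shown against a one-sided 95% lower bound; the observed mean, sample size, and helper name are hypothetical:

# Sketch of the post-data severity assessment described in the abstract, for
# inferring mu > mu' from the mean M of a Normal sample with known sigma.
import numpy as np
from scipy.stats import norm

def severity_mu_greater(mu_prime, m_obs, n, sigma=1.0):
    """Severity with which 'mu > mu_prime' passes, given observed mean m_obs:
    the probability of a result according less well with the claim (M <= m_obs)
    if mu were only mu_prime."""
    se = sigma / np.sqrt(n)
    return norm.cdf((m_obs - mu_prime) / se)

if __name__ == "__main__":
    m_obs, n, sigma = 0.4, 25, 1.0        # hypothetical observed mean and sample size
    se = sigma / np.sqrt(n)
    lower_95 = m_obs - norm.ppf(0.95) * se
    print(f"one-sided 95% lower bound a = {lower_95:.3f}")
    print(f"SEV(mu > a)    = {severity_mu_greater(lower_95, m_obs, n):.3f}")  # ~0.95
    for mu_p in (0.0, 0.2, 0.4):
        print(f"SEV(mu > {mu_p:.1f}) = {severity_mu_greater(mu_p, m_obs, n):.3f}")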
The Statistics Wars and Their Casualties (refs)
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
The Statistics Wars and Their Casualties (w/refs)
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
On the interpretation of the mathematical characteristics of statistical test...
Statistical hypothesis tests are often misused and misinterpreted. Here I focus on one source of such misinterpretation, namely an inappropriate notion regarding what the mathematical theory of tests implies, and does not imply, when it comes to the application of tests in practice. The view taken here is that it is helpful and instructive to be consciously aware of the essential difference between mathematical model and reality, and to appreciate the mathematical model and its implications as a tool for thinking rather than something that has a truth value regarding reality. Insights are presented regarding the role of model assumptions, unbiasedness and the alternative hypothesis, Neyman-Pearson optimality, and multiple and data-dependent testing.
The role of background assumptions in severity appraisaljemille6
In the past decade discussions around the reproducibility of scientific findings have led to a re-appreciation of the importance of guaranteeing claims are severely tested. The inflation of Type 1 error rates due to flexibility in the data analysis is widely considered
one of the underlying causes of low replicability rates. Solutions, such as study preregistration, are becoming increasingly popular to combat this problem. Preregistration only allows researchers to evaluate the severity of a test, but not all
preregistered studies provide a severe test of a claim. The appraisal of the severity of a
test depends on background information, such as assumptions about the data generating process, and auxiliary hypotheses that influence the final choice for the
design of the test. In this article, I will discuss the difference between subjective and
inter-subjectively testable assumptions underlying scientific claims, and the importance
of separating the two. I will stress the role of justifications in statistical inferences, the
conditional nature of scientific conclusions following these justifications, and highlight
how severe tests could lead to inter-subjective agreement, based on a philosophical approach grounded in methodological falsificationism. Appreciating the role of background assumptions in the appraisal of severity should shed light on current discussions about the role of preregistration, interpreting the results of replication studies, and proposals to reform statistical inferences.
The two statistical cornerstones of replicability: addressing selective infer...jemille6
Tukey’s last published work in 2020 was an obscure entry on multiple comparisons in the
Encyclopedia of Behavioral Sciences, addressing the two topics in the title. Replicability
was not mentioned at all, nor was any other connection made between the two topics. I shall demonstrate how these two topics critically affect replicability using recently completed studies. I shall review how these have been addressed in the past. I shall
review in more detail the available ways to address selective inference. My conclusion is that conducting many small replicability studies without strict standardization is the way to assure replicability of results in science, and we should introduce policies to make this happen.
The replication crisis: are P-values the problem and are Bayes factors the so...jemille6
Today’s posterior is tomorrow’s prior. Dennis Lindley
It has been claimed that science is undergoing a replication crisis and that when looking for culprits, the cult of significance is the chief suspect. It has also been claimed that Bayes factors might provide a solution.
In my opinion, these claims are misleading and part of the problem is our understanding
of the purpose and nature of replication, which has only recently been subject to formal
analysis.
What we are or should be interested in is truth. Replication is a coherence, not a correspondence, requirement and one that has a strong dependence on the size of the replication study.
Consideration of Bayes factors raises a puzzling question. Should the Bayes factor for a replication study be calculated as if it were the initial study? If the answer is yes, the approach is not fully Bayesian and furthermore the Bayes factors will be subject to
exactly the same replication ‘paradox’ as P-values. If the answer is no, then in what
sense can an initially found Bayes factor be replicated and what are the implications for how we should view replication of P-values?
A further issue is that little attention has been paid to false negatives and, by extension, to true negatives. Yet, as is well known from the theory of diagnostic tests, it is meaningless to consider the performance of a test in terms of false positives alone.
I shall argue that we are in danger of confusing evidence with the conclusions we draw and that any reforms of scientific practice should concentrate on producing evidence
that is as reliable as it can be qua evidence. There are many basic scientific practices in
need of reform. Pseudoreplication, for example, and the routine destruction of
information through dichotomisation are far more serious problems than many matters of inferential framing that seem to have excited statisticians.
D. Mayo: Replication Research Under an Error Statistical Philosophy
1. Replication Research Under an Error Statistical Philosophy
Deborah Mayo
Around a year ago on my blog:
“There are some ironic twists in the way psychology is
dealing with its replication crisis that may well threaten even
the most sincere efforts to put the field on firmer scientific
footing”
Philosopher’s talk: I see a rich source of problems that cry out
for ministrations of philosophers of science and of statistics
2. Three main philosophical tasks:
#1 Clarify concepts and presuppositions
#2 Reveal inconsistencies, puzzles, tensions (“ironies”)
#3 Solve problems, improve on methodology
• Philosophers usually stop with the first two, but I think
going on to solve problems is important.
This presentation is ‘programmatic’: what might replication
research under an error statistical philosophy be?
My interest grew thanks to Caitlin Parker whose MA thesis was
on the topic
3. Example of a conceptual clarification (#1)
Editors of a journal, Basic and Applied Social Psychology,
announced they are banning statistical hypothesis testing
because it is “invalid”
It’s invalid because it does not supply “the probability of the
null hypothesis, given the finding” (the posterior probability of
H0) (Trafimow and Marks 2015)
• Since the methodology of testing explicitly rejects the mode
of inference they don’t supply, it would be incorrect to claim
the methods were invalid.
• Simple conceptual job that philosophers are good at
4. Example of revealing inconsistencies and tensions (#2)
Critic: It’s too easy to satisfy standard significance thresholds
You: Why do replicationists find it so hard to achieve
significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking,
cherry-picking, significance seeking, QRPs
You: So, the replication researchers want methods that pick up
on and block these biasing selection effects.
Critic: Actually the “reforms” recommend methods where
selection effects and data dredging make no difference
5. Whether this can be resolved or not is separate.
• We are constantly hearing of how the “reward structure”
leads to taking advantage of researcher flexibility
• As philosophers, we can at least show how to hold their
feet to the fire, and warn of the perils of accounts that bury
the finagling
The philosopher is the curmudgeon (takes chutzpah!)
I’ll give examples of
#1 clarifying terms
#2 inconsistencies
#3 proposed solutions (though I won’t always number them).
6. Demarcation: Bad Methodology/Bad Statistics
• A lot of the recent attention grew out of the case of Diederik Stapel, the social psychologist who fabricated his data.
• Kahneman in 2012: “I see a train-wreck looming,” setting up a “daisy chain” of replication.
• The Stapel investigators (2012 Tilburg Report, “Flawed Science”) do a good job of characterizing pseudoscience.
• Philosophers tend to have cold feet when it comes to saying anything general about science versus pseudoscience.
7. Items in their list of “dirty laundry” include:
“An experiment fails to yield the expected statistically
significant results. The experimenters try and try again
until they find something (multiple testing, multiple
modeling, post-data search of endpoint or subgroups),
and the only experiment subsequently reported is the
one that did yield the expected results.”
… continuing an experiment until it works as desired, or
excluding unwelcome experimental subjects or results,
inevitably tends to confirm the researcher’s research
hypotheses, and essentially render the hypotheses
immune to the facts”. (Report, 48)
--they walked into a “culture of verification bias”
8. Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis
H, but the test procedure had little or no capability, i.e., little
or no probability of finding flaws with H (even if H is
incorrect), then x0 provide poor evidence for H.
Such a test we would say fails a minimal requirement for a
stringent or severe test.
• This seems utterly uncontroversial.
9. • Methods that scrutinize a test’s capabilities, according to
their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance
levels) may but need not provide severity assessments.
• New name: frequentist, sampling theory, Fisherian,
Neyman-Pearsonian—are too associated with hard line
views and personality conflicts (“It’s the methods, stupid”)
(example of new solutions #3)
10. Are philosophies about science relevant?
One of the final recommendations in the Report is this:
In the training program for PhD students, the relevant
basic principles of philosophy of science, methodology,
ethics and statistics that enable the responsible practice
of science must be covered. (p. 57)
11. A critic might protest:
“There’s nothing philosophical about my criticism of
significance tests: a small p-value is invariably, and
erroneously, interpreted as giving a small probability to the null
hypothesis that the observed difference is mere chance.”
Really? P-values are not intended to be used this way;
presupposing they should be stems from a conception of the role
of probability in statistical inference—this conception is
philosophical.
(of course criticizing them because they might be misinterpreted
is just silly)
12. Two main views of the role of probability in inference
Probabilism. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0.
Performance. To ensure long-run reliability of methods, coverage probabilities, control of the relative frequency of erroneous inferences in a long-run series of trials.
What happened to the goal of scrutinizing bad science by the severity criterion?
13. • Neither “probabilism” nor “performance” directly captures
it.
• Good long-run performance is a necessary not a sufficient
condition for avoiding insevere tests.
• The problems with selective reporting, multiple testing,
stopping when the data look good are not problems about
long-runs—
• It’s that we cannot say about the case at hand that it has
done a good job of avoiding the sources of
misinterpretation.
14. • Probabilism says H is not justified unless it’s true or probable (made firmer).
• Error statistics (probativism) says H is not justified unless something (a good job) has been done to probe ways we can be wrong about H.
• If it’s assumed probabilism is required for inference, error probabilities could be relevant only by misinterpretation. False!
• Error probabilities have a crucial role in appraising well-testedness (new philosophy for probability #3)
• Both H and not-H can be poorly tested, so a severe testing assessment violates probability
15. Understanding the Replication Crisis Requires Understanding How it Intermingles with PhilStat Controversies
• It’s not that I’m keen to defend many common uses of
significance tests
• It’s just that the criticisms (in psychology and elsewhere)
are based on serious misunderstandings of the nature and
role of these methods; consequently so are many “reforms”
• How can you be clear the reforms are better if you might be
mistaken about existing methods?
16. Criticisms concern a kind of Fisherian Significance Test
(i) Sample space: Let the sample be X = (X1, …, Xn), n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ
(ii) A null hypothesis H0: µ = 0 (Δ: µT − µC = 0)
(iii) Test statistic: A function of the sample, d(X), reflecting the difference between the data x0 = (x1, …, xn) and H0: the larger d(x0), the further the outcome from what’s expected under H0, with respect to the particular question.
(iv) Sampling distribution of the test statistic d(X)
17. The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:
p(x0) = Pr(d(X) > d(x0); H0).
If p(x0) is sufficiently small, there’s an indication of discrepancy from the null.
(Even Fisher had implicit alternatives, by the way)
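A minimal sketch of this computation (my own illustration, with hypothetical numbers: σ known and d(X) = √n(M − µ0)/σ):

```python
# One-sided p-value for test T+, H0: mu <= mu0, with sigma known.
from scipy.stats import norm
import numpy as np

def p_value(mean_obs, mu0=0.0, sigma=1.0, n=25):
    """Pr(d(X) > d(x0); H0): probability of a larger difference under H0."""
    d_obs = np.sqrt(n) * (mean_obs - mu0) / sigma
    return norm.sf(d_obs)          # survival function = 1 - CDF

# Hypothetical data: observed mean 0.4, n = 25, sigma = 1
print(p_value(0.4))                # ~0.023: a small p indicates some discrepancy from H0
```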
18. P-value reasoning: from high capacity to curb enthusiasm
If the hypothesis H0 is correct then, with high probability, 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 indicates a discrepancy from H0.
That merely indicates some discrepancy!
19. A genuine experimental effect is needed
“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1935, 14)
(low P-value ≠> H: statistical effect)
“[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions.” (Gigerenzer 1989, 95-6)
(H ≠> H*)
20. Still, simple Fisherian Tests have Important Uses
• Testing assumptions
• Fraudbusting and forensics: finding data too good to be true (Simonsohn)
• Finding if data are consistent with a model
Gelman and Shalizi (meeting of minds between a Bayesian and
an error statistician)
“What we are advocating, then, is what Cox and Hinkley (1974)
call ‘pure significance testing’, in which certain of the model’s
implications are compared directly to the data, rather than
entering into a contest with some alternative model.” (p.20)
21. Fallacy of Rejection: H –> H*: Erroneously take statistical significance as evidence of research hypothesis H*
The fallacy is explicated by severity: flaws in alternative H* have not been probed by the test, the inference from a statistically significant result to H* fails to pass with severity
Merely refuting the null hypothesis is too weak to corroborate substantive H*, “we have to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called ‘a highly improbable coincidence.’” (Meehl and Waller 2002, 184)
(Meehl was wrong to blame Fisher)
22. NHST are pseudostatistical: Why do psychologists speak of NHSTs, tests that supposedly allow moving from statistical to substantive?
So defined, they exist only as abuses of tests: they exist as something you’re never supposed to do
Psychologists tend to ignore Neyman-Pearson (N-P) tests: N-P supplemented Fisher’s tests with explicit alternatives
23. Neyman-Pearson (N-P) Tests: null and alternative hypotheses H0, H1 that exhaust the parameter space
So the fallacy of rejection H –> H* is impossible (rejecting the null only indicates statistical alternatives)
Scotches criticisms that P-values are only under the null
Example: Test T+: sampling distribution of d(X) under null and alternatives.
H0: µ ≤ µ0 vs. H1: µ > µ0
If d(x0) > cα, “reject” H0; if d(x0) < cα, “do not reject” or “accept” H0 (e.g. cα = 1.96 for α = .025)
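The rule can be sketched numerically (hypothetical numbers; σ known, d(X) = √n(M − µ0)/σ):

```python
# N-P decision rule for test T+ at alpha = .025.
from scipy.stats import norm
import numpy as np

alpha = 0.025
c_alpha = norm.ppf(1 - alpha)          # ~1.96, the rejection cutoff

def np_test(mean_obs, mu0=0.0, sigma=1.0, n=100):
    d_obs = np.sqrt(n) * (mean_obs - mu0) / sigma
    return "reject H0" if d_obs > c_alpha else "do not reject H0"

print(np_test(0.25))   # d = 2.5 > 1.96 -> reject H0
print(np_test(0.10))   # d = 1.0 < 1.96 -> do not reject H0
```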
24. The sampling distribution yields Error Probabilities
Probability of a Type I error = P(d(X) > cα; H0) ≤ α.
Probability of a Type II error = P(d(X) < cα; µ1) = β(µ1), for any µ1 > µ0.
The complement of the Type II error probability = power against µ1:
POW(µ1) = P(d(X) > cα; µ1)
Even without “best” tests, there are “good” tests
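A short sketch of the power computation under the same assumed setup (σ known, hypothetical numbers):

```python
# POW(mu1) = P(d(X) > c_alpha; mu1) for test T+ with d(X) = sqrt(n)*(M - mu0)/sigma.
from scipy.stats import norm
import numpy as np

def power(mu1, mu0=0.0, sigma=1.0, n=100, alpha=0.025):
    c_alpha = norm.ppf(1 - alpha)
    shift = np.sqrt(n) * (mu1 - mu0) / sigma   # under mu1, d(X) ~ Normal(shift, 1)
    return norm.sf(c_alpha - shift)

print(power(0.2))   # ~0.52: the test detects mu1 = 0.2 about half the time
print(power(0.4))   # ~0.98: high capacity against mu1 = 0.4
```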
25. N-P test in terms of the P-value: reject H0 iff P-value < .025
• Even N-P report the attained significance level or P-value (Lehmann)
• “reject/do not reject” are uninterpreted parts of the mathematical apparatus
Reject could be: “Declare statistically significant at the p-level”
• “The tests… must be used with discretion and understanding” (N-P, 1928, p. 58) (“it’s the methods, stupid”)
26. Why Inductive behavior?
N-P justify tests (and confidence intervals) by performance: control of long-run error and coverage probabilities
They called this inductive behavior, why?
• They were reaching conclusions beyond the data (inductive)
• If inductive inference is probabilist, then they needed a new term.
In Popperian spirit, they (mostly Neyman) called it inductive behavior: adjust how we’d act rather than our beliefs
(I’m not knocking performance, but error probabilities also serve for particular inferences: evidential)
27. N-P tests can still commit a type of fallacy of rejection: infer a discrepancy beyond what’s warranted, especially with n sufficiently large: the large n problem.
• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one so insensitive that it doesn’t go off unless the house is fully ablaze? (The larger sample size is like the one that goes off with burnt toast.)
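One way to see the point numerically (my own sketch, using the severity computation spelled out on the later slides, with hypothetical numbers):

```python
# The "large n problem": the same just-significant result (d = 1.96) corresponds to a
# smaller observed difference, and warrants a smaller discrepancy, as n grows.
from scipy.stats import norm
import numpy as np

def severity_mu_greater(gamma, mean_obs, mu0=0.0, sigma=1.0, n=100):
    """SEV(mu > mu0 + gamma) = Pr(a smaller observed mean; mu = mu0 + gamma)."""
    return norm.cdf(np.sqrt(n) * (mean_obs - (mu0 + gamma)) / sigma)

for n in (100, 10_000):
    m_cutoff = 1.96 / np.sqrt(n)      # observed mean just at the alpha = .025 cutoff
    sev = severity_mu_greater(0.1, m_cutoff, n=n)
    print(f"n={n}: just-significant mean={m_cutoff:.4f}, SEV(mu > 0.1)={sev:.3f}")
# n=100:    mean=0.196,  SEV(mu > 0.1) ~ 0.83  (some warrant for a 0.1 discrepancy)
# n=10000:  mean=0.0196, SEV(mu > 0.1) ~ 0.00  (no warrant: the indicated effect is tiny)
```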
28. Fallacy of Non-Significant results: Insensitive tests
• Negative results may not warrant 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed
Similar to Cohen’s power analysis but sensitive to the outcome (P-value distribution) (#3)
• I hear some replicationists say negative results are uninformative: not so (#2 ironies)
No point in running replication research if your account views negative results as uninformative
29. Error statistics gives an evidential interpretation to tests (#3)
Use results to infer discrepancies from a null that are well ruled out, and those which are not
I’d never just report a P-value
Mayo (1996); Mayo and Cox (2010): Frequentist Principle of Evidence: FEV
Mayo and Spanos (2006): SEV
30. One-sided Test T+: H0: µ ≤ µ0 vs. H1: µ > µ0
d(x) is statistically significant (set lower bounds):
(i) If the test had high capacity to warn us (by producing a less significant result) if µ ≤ µ0 + γ, then d(x) is a good indication of µ > µ0 + γ.
(ii) If the test had little (or even moderate) capacity (e.g. < .5) to produce a less significant result even if µ ≤ µ0 + γ, then d(x) is a poor indication of µ > µ0 + γ.
(If an even more impressive result is probable, due to guppies, it’s not a good indication of a great whale)
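A numerical sketch of (i)-(ii) for a significant result (hypothetical numbers; σ known):

```python
# Severity for inferring mu > mu0 + gamma after a significant d(x):
# SEV(mu > mu0 + gamma) = Pr(a less significant result; mu = mu0 + gamma).
from scipy.stats import norm
import numpy as np

mu0, sigma, n = 0.0, 1.0, 100
mean_obs = 0.3                      # hypothetical observed mean, d = 3.0 (significant)

for gamma in (0.0, 0.1, 0.2, 0.3, 0.4):
    sev = norm.cdf(np.sqrt(n) * (mean_obs - (mu0 + gamma)) / sigma)
    print(f"SEV(mu > {gamma:.1f}) = {sev:.3f}")
# gamma = 0.1 -> ~0.98 (well warranted); gamma = 0.3 -> 0.50; gamma = 0.4 -> ~0.16 (poorly warranted)
```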
31. d(x) is not statistically significant (set upper bounds):
(i) If the test had a high probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x) is a good indication that µ ≤ µ0 + γ.
(ii) If the test had a low probability of a more statistically significant difference if µ > µ0 + γ, then d(x) is poor indication that µ ≤ µ0 + γ. (too insensitive to rule out discrepancy γ)
If you set an overly stringent significance level in order to block rejecting a null, we can determine the discrepancies you can’t detect (e.g., risks of concern)
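And the mirror-image sketch for a non-significant result (again hypothetical numbers):

```python
# Severity for inferring mu <= mu0 + gamma from a non-significant result:
# the probability of a more significant difference than observed were mu as large as mu0 + gamma.
from scipy.stats import norm
import numpy as np

mu0, sigma, n = 0.0, 1.0, 100
mean_obs = 0.05                     # hypothetical non-significant result, d = 0.5

for gamma in (0.1, 0.2, 0.3):
    sev = norm.sf(np.sqrt(n) * (mean_obs - (mu0 + gamma)) / sigma)
    print(f"SEV(mu <= {gamma:.1f}) = {sev:.3f}")
# The data rule out a discrepancy of 0.3 quite well (~0.99) but not one of 0.1 (~0.69)
```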
32. Confidence Intervals also require supplementing
Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level
• Still too dichotomous: in/out, plausible/not plausible (permit fallacies of rejection/non-rejection).
• Justified in terms of long-run coverage (performance).
• All members of the CI treated on par.
• Fixed confidence level (SEV needs several benchmarks).
• Estimation is important but we need tests for distinguishing real and spurious effects, and checking assumptions of statistical models.
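A small sketch of the contrast (my own illustration; hypothetical numbers): a single 95% interval versus severity assessed at several benchmarks:

```python
from scipy.stats import norm
import numpy as np

mu0, sigma, n = 0.0, 1.0, 100
mean_obs = 0.2                                     # hypothetical observed mean

half = norm.ppf(0.975) * sigma / np.sqrt(n)
print(f"95% CI: ({mean_obs - half:.3f}, {mean_obs + half:.3f})")   # all values inside treated alike

for benchmark in (0.05, 0.10, 0.15, 0.20):         # assess mu > benchmark, point by point
    sev = norm.cdf(np.sqrt(n) * (mean_obs - benchmark) / sigma)
    print(f"SEV(mu > {benchmark:.2f}) = {sev:.3f}")
```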
33. The evidential interpretation is crucial, but error probabilities can be violated by selection effects (also by violated model assumptions)
One function of severity is to identify which selection effects are problematic (not all are) (#3).
Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.
34. Nominal vs actual significance levels
“Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ …The actual level of significance is not 5 percent, but 64 percent!” (Selvin, 1970, p. 104)
• They were clear on the fallacy: blurring the “computed” or “nominal” significance level and the “actual” level
• There are many more ways you can be wrong with hunting (different sample space)
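The 64 percent can be reproduced directly, assuming the twenty differences are independent:

```python
# If each of 20 independent null differences has probability .05 of reaching nominal
# significance, the probability that at least one does is 1 - (1 - .05)**20.
actual_level = 1 - (1 - 0.05) ** 20
print(f"{actual_level:.2f}")   # 0.64
```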
35. This is a genuine example of an invalid or unsound method
You report: such results would be difficult to achieve under the assumption of H0
When in fact such results are common under the assumption of H0
(formally): You say Pr(P-value < Pobs; H0) ~ α (small)
but in fact Pr(P-value < Pobs; H0) = high, if not guaranteed
• Nowadays, we’re likely to see the tests blamed for permitting such misuses (instead of the testers).
• Worse are those accounts where the abuse vanishes!
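A quick simulation of the formal point (my own sketch: 20 independent null effects, only the smallest P-value reported):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, n_tests, n_obs = 5_000, 20, 25
# sample means for 20 independent null effects in each simulated study
means = rng.normal(0.0, 1.0, size=(n_sims, n_tests, n_obs)).mean(axis=2)
pvals = norm.sf(np.sqrt(n_obs) * means)        # one-sided p-values, uniform under H0
print((pvals.min(axis=1) < 0.05).mean())       # ~0.64, not the nominal 0.05
```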
36. What defies scientific sense?
On some views, biasing selection effects are irrelevant….
Stephen Goodman (epidemiologist): “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value.” (1999, p. 1010)
37. Likelihood Principle (LP)
The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves:
In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:
P(x0; H1)/P(x0; H0)
Different forms: posterior probabilities, Bayes factors (inference is comparative, data favors this over that; is that even inference?)
38. All error probabilities violate the LP (even without selection effects):
“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space.” (Lindley 1971, p. 436)
The information is just a matter of our “intentions”
“The LP implies… the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects.” (Rosenkrantz, 1977, 122)
39. Many current Reforms are Probabilist
Probabilist reforms to replace tests (and CIs) with likelihood
ratios, Bayes factors, HPD intervals, or just lower the P-value
(so that the maximal likely alternative gets .95 posterior)
while ignoring biasing selection effects, will fail.
The same p-hacked hypothesis can occur in Bayes factors;
optional stopping can exclude true nulls from HPD intervals.
With one big difference: Your direct basis for criticism and
possible adjustments has just vanished.
(lots of #2 inconsistencies)
40. How might probabilists block intuitively unwarranted inferences? (Consider first the subjective Bayesian)
When we hear there’s statistical evidence of some unbelievable
claim (distinguishing shades of grey and being politically
moderate, ovulation and voting preferences), some probabilists
claim—you see, if our beliefs were mixed into the interpretation
of the evidence, we wouldn’t be fooled
We know these things are unbelievable, a subjective Bayesian
might say
That could work in some cases (though it still wouldn’t show
what researchers had done wrong)—battle of beliefs.
41. It wouldn’t help with our most important problem:
• How to distinguish the warrant for a single hypothesis H
with different methods (e.g., one has biasing selection
effects, another, registered results and precautions)?
So now you’ve got two sources of flexibility, priors and biasing
selection effects (which can no longer be criticized).
Besides, researchers really do believe their hypotheses.
42. Diederik Stapel says he always read the research literature
extensively to generate his hypotheses.
“So that it was believable and could be argued that this
was the only logical thing you would find.” (E.g., eating
meat causes aggression.)
(In “The Mind of a Con Man,” NY Times, April 26,
2013)
43. Conventional Bayesians
The most popular probabilisms these days are “non-subjective” (reference, default) or conventional, designed to prevent prior beliefs from influencing the posteriors:
“The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities….” (Cox and Mayo 2010, p. 299)
How might they avoid too-easy rejections of a null?
44. Cult of the Holy Spike
Give a spike prior of .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).
This “spiked concentration of belief in the null” is at odds with the prevailing view “we know all nulls are false” (#2)
Bottom line: By convenient choices of priors and alternatives, statistically significant differences can be evidence for the null
The conflict often considers the two-sided test H0: µ = 0 versus H1: µ ≠ 0
45. Posterior Probabilities in H0 (n = sample size)

p      z       n=50    n=100   n=1000
.10    1.645   .65     .72     .89
.05    1.960   .52     .60     .82
.01    2.576   .22     .27     .53
.001   3.291   .034    .045    .124

If n = 1000, a result statistically significant at the .05 level leads to a posterior probability on the null of .82!
From Berger and Sellke (1987), based on a Jeffreys prior
46. • With a z = 1.96 difference, the 95% CI (2-sided) or the .975 one-sided CI excludes the null (0) from the interval
• Severity reasoning: Were H0 true, the probability of getting d(x) < dobs is high (~.975), so SEV(µ > 0) ∼ .975
• But they give P(H0 | z = 1.96) = .82
• Error statistical critique: there’s a high probability that they give posterior probability of .82 to H0: µ = 0 erroneously
• The onus is on probabilists to show a high posterior for H constitutes having passed a good test.
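A sketch of where such numbers can come from, and the contrast with severity (my own illustration, using an assumed N(0, σ²) prior on µ under the alternative, one conventional choice rather than Berger and Sellke’s exact prior):

```python
from scipy.stats import norm
import numpy as np

def posterior_H0(z, n, pi0=0.5):
    # z = sqrt(n) * M / sigma. Under H0 (mu = 0), z ~ N(0, 1).
    # Under the assumed prior mu ~ N(0, sigma^2) for the alternative, z ~ N(0, n + 1).
    like_H0 = norm.pdf(z)
    like_H1 = norm.pdf(z / np.sqrt(n + 1)) / np.sqrt(n + 1)
    return pi0 * like_H0 / (pi0 * like_H0 + (1 - pi0) * like_H1)

z = 1.96
print(posterior_H0(z, n=1000))   # ~0.82: the spiked prior turns a significant z into support for H0
print(norm.cdf(z))               # ~0.975: SEV(mu > 0), probability of a smaller d(x) were mu = 0
```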
47. Informal and Quasi-Formal Severity: H -> H*
• Error statisticians avoid the fallacy of going directly from statistical to research hypothesis H*
• Can we say nothing about this link?
• I think we can and must, and informal severity assessments are relevant (#3)
I will not discuss straw man studies (“chump effects”).
This is believable: Men react more negatively to success of their partners than to their failures (compared to women)?
Studies have shown:
H: partner’s success lowers self-esteem in men
48. Macho Men
H*: partner’s success lowers self-esteem in men
I have no doubts that certain types of men feel threatened by the success of their female partners, wives or girlfriends. I’ve even known a few.
Can this be studied in the lab? Ratliff and Oishi (2013) did:
H*: “men’s implicit self-esteem is lower when a partner succeeds than when a partner fails.” Not so for women.
Their example does a good job, given the standards in place.
49. Treatments: Subjects are randomly assigned to five “treatments”: think and write about a time your partner succeeded, failed, succeeded when you failed (partner beats me), failed when you succeeded (I beat partner), or a typical day (control).
Effects: a measure of “self-esteem”
Explicit: “How do you feel about yourself?”
Implicit: a test of word associations with “me” versus “other”.
None showed statistical significance in explicit self-esteem, so consider just implicit measures
50. Some null hypotheses: The average self-esteem score is no different (these are statistical hypotheses)
a) when partner succeeds (rather than failing)
b) when partner beats (surpasses) me or I beat her
c) control: when she succeeds, fails, or it’s a regular day
There are at least double this, given self-esteem could be “explicit” or “implicit” (others too, e.g., the area of success)
Only null (a) was rejected statistically!
Should they have taken the research hypothesis as disconfirmed by negative cases? Or as casting doubt on their test?
51. Or should they just focus on the null hypotheses that were rejected, in particular null (a), for implicit self-esteem?
They opt for the third.
It’s not that they should have regarded their research
hypothesis H* as disconfirmed much less falsified.
This is precisely the nub of the problem! I’m saying the
hypothesis that the study isn’t well-run needs to be considered
• Is the artificial writing assignment sufficiently relevant to
the phenomenon of interest? (look at proxy variables)
• Is the measure of implicit self esteem (word associations) a
valid measure of the effect? (measurements of effects)
52. Take null hypothesis b): The average self-esteem score is no different when partner beats (surpasses) me or I beat her
Clearly they expected “she beat me in X” to have a greater negative impact on self-esteem than “she succeeded at X”.
Still, they could view it as lending “some support to the idea that men interpret ‘my partner is successful’ as ‘my partner is more successful than me’” (p. 698), … as do the authors.
That is, any success of hers is always construed by Macho man as: she beat me.
53. Bending over Backwards
For the stringent self-critic, this skirts too close to viewing the data through the theory, a kind of “self-sealing fallacy”.
I want to be clear that this is not a criticism of them given existing standards.
“I'm talking about a specific, extra type of integrity... bending over backwards to show how you're maybe wrong, that you ought to have when acting as a scientist.” (R. Feynman 1974)
I’m describing what’s needed to show “sincerely trying to find flaws” under the austere account I recommend.
The most interesting information was never reported! Perhaps it was never even looked at: what they wrote about.
54. Conclusion: Replication Research in Psychology Under an Error Statistical Philosophy
Replication problems can’t be solved without correctly understanding their sources.
Biggest sources of problems in replication crises:
(a) Stat H -> research H* and (b) biasing selection effects
Reasons for (a): focus on P-values and Fisherian tests ignoring N-P tests (and the illicit NHST that goes directly H –> H*)
55. Another reason, a false dilemma: probabilism or long-run performance, plus assuming that N-P can only give the latter.
I argue for a third use of probability: Rather than report on believability, researchers need to report the properties of the methods they used:
What was their capacity to have identified, avoided, or admitted bias?
What’s wanted is not a high posterior probability in H (however construed) but a high probability that the procedure would have unearthed flaws in H (reinterpretation of N-P methods)
56. What’s replicable? Discrepancies that are severely warranted
Reasons for (b) [embracing accounts that formally ignore selection effects]: accepting probabilisms that embrace the likelihood principle (LP)
There’s no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects.
57. Informal assessments of probativeness are needed to scrutinize statistical inferences in relation to research hypotheses H –> H*
One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest (problems with proxy variables, measurements, etc.)
The scientific status of an inquiry is questionable if it cannot or
will not distinguish the correctness of inferences from problems
stemming from a poorly run study
If ordinary research reports adopted the Feynman “bending over
backwards” scrutiny, the interpretation of replication efforts
would be more informative (or perhaps not needed)
58. REFERENCES
Baggerly, K. A., Coombes, K. R. & Neeley, E. S. (2008). "Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer." Journal of Clinical Oncology 26(7): 1186-1187.
Bartlett, T. (2012). "Daniel Kahneman Sees 'Train-Wreck Looming' for Social Psychology". Chronicle of Higher Education Blog (Oct. 4, 2012), article with links to the email D. Kahneman sent to several social psychologists. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338.
Berger, J. O. (2006). "The Case for Objective Bayesian Analysis." Bayesian Analysis 1(3): 385-402.
Berger, J. O. & Sellke, T. (1987). "Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion)." Journal of the American Statistical Association 82(397): 112-122.
Bhattacharjee, Y. (2013). "The Mind of a Con Man". The New York Times Magazine (4/28/2013), p. 44.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum.
Coombes, K. R., Wang, J. & Baggerly, K. A. (2007). "Microarrays: retracing steps." Nature Medicine 13(11): 1276-7.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
Cox, D. R. & Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press.
Diaconis, P. (1978). "Statistical Problems in ESP Research". Science 201(4351): 131-136. (Letters in response can be found in the Dec. 15, 1978 issue, pp. 1145-6.)
Dienes, Z. (2011). "Bayesian versus Orthodox Statistics: Which Side Are You On?" Perspectives on Psychological Science 6(3): 274-290.
Feynman, R. (1974). "Cargo Cult Science." Caltech Commencement Speech.
Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.
Gelman, A. (2011). "Induction and Deduction in Bayesian Data Analysis." Edited by Deborah G. Mayo, Aris Spanos, and Kent W. Staley. Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science): 67-78.
Gelman, A. & Shalizi, C. (2013). "Philosophy and the Practice of Bayesian Statistics." British Journal of Mathematical and Statistical Psychology 66(1): 8-38.
Gigerenzer, G. (2000). "The Superego, the Ego, and the Id in Statistical Reasoning." In Adaptive Thinking: Rationality in the Real World. Oxford: Oxford University Press.
Goodman, S. N. (1999). "Toward evidence-based medical statistics. 2: The Bayes factor." Annals of Internal Medicine 130: 1005-1013.
Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. 2nd ed. La Salle, IL: Open Court.
Johansson, T. (2010). "Hail the impossible: p-values, evidence, and likelihood." Scandinavian Journal of Psychology 52: 113-125.
Kruschke, J. K. (2010). "What to believe: Bayesian methods for data analysis". Trends in Cognitive Science 14(7): 297-300.
Lehmann, E. L. (1993). "The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?" Journal of the American Statistical Association 88(424): 1242-1249.
Levelt Committee, Noort Committee, Drenth Committee. (2012). "Flawed science: The fraudulent research practices of social psychologist Diederik Stapel". Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/
Lindley, D. V. (1971). "The Estimation of Many Parameters." In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.
Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
Mayo, D. G. & Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
Mayo, D. G. & Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction." British Journal for the Philosophy of Science 57(2): 323-357.
Mayo, D. G. & Spanos, A. (2011). "Error Statistics." In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.
Meehl, P. E. & Waller, N. G. (2002). "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude." Psychological Methods 7(3): 283-300.
Morrison, D. E. & Henkel, R. E. (eds). (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
Micheel, C. M., Nass, S. J. & Omenn, G. S. (eds), Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. National Academies Press.
Neyman, J. (1957). "'Inductive Behavior' as a Basic Concept of Science." Revue de l'Institut International de Statistique / Review of the International Statistical Institute 25(1/3): 7-22.
Neyman, J. & Pearson, E. S. (1928). "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I," Biometrika 20A: 175-240. (Reprinted in Joint Statistical Papers, University of California Press, Berkeley, 1967, pp. 1-66.)
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole, D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J. & Nevins, J. R. (2006). "Genomic signatures to guide the use of chemotherapeutics." Nature Medicine 12(11): 1294-300. Epub 2006 Oct 22.
Potti, A. & Nevins, J. R. (2007). "Reply to Coombes, Wang & Baggerly." Nature Medicine 13(11): 1277-8.
Ratliff, K. A. & Oishi, S. (2013). "Gender Differences in Implicit Self-Esteem Following a Romantic Partner's Success or Failure". Journal of Personality and Social Psychology 105(4): 688-702.
Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
Savage, L. J. (1964). "The Foundations of Statistics Reconsidered." In Studies in Subjective Probability, H. Kyburg & H. Smokler (eds.), 173-188. New York: John Wiley & Sons.
Selvin, H. (1970). "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
Trafimow, D. & Marks, M. (2015). "Editorial". Basic and Applied Social Psychology 37(1): 1-2.
Wagenmakers, E.-J. (2007). "A Practical Solution to the Pervasive Problems of P Values". Psychonomic Bulletin & Review 14(5): 779-804.