Slides for Deborah G. Mayo's talk at the Minnesota Center for Philosophy of Science, University of Minnesota, on the 2016 ASA statement on P-values and error statistics
Severe Testing: The Key to Error Correction
D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
Deborah G. Mayo: Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?
Presentation slides for: Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge[*] at the Boston Colloquium for Philosophy of Science (Feb 21, 2014).
"The Statistical Replication Crisis: Paradoxes and Scapegoats”jemille6
D. G. Mayo LSE Popper talk, May 10, 2016.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., preregistration), while others are quite radical. Recently, the American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Popular appeals to “diagnostic testing” that aim to improve replication rates may (unintentionally) permit the howlers and cookbook statistics we are at pains to root out. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Statistical skepticism: How to use significance tests effectively
Prof. D. Mayo, presentation Oct. 12, 2017 at the ASA Symposium on Statistical Inference: “A World Beyond p < .05” in the session: “What are the best uses for P-values?”
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
I will explore the extent to which concerns about ‘scientism’ (an unwarranted obeisance to scientific over other methods of inquiry) are intertwined with issues in the foundations of the statistical data analyses on which (social, behavioral, medical and physical) science increasingly depends. The rise of big data, machine learning, and high-powered computer programs has extended statistical methods and modeling across the landscape of science, law and evidence-based policy, but this has been accompanied by enormous hand-wringing as to the reliability, replicability, and valid use of statistics. Legitimate criticisms of scientism often stem from insufficiently self-critical uses of statistical methodology, broadly construed (what might be called “statisticism”), particularly when those methods are applied to matters of controversy.
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Gerd Gigerenzer (Director of the Max Planck Institute for Human Development, Berlin, Germany) in the PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
Stephen Senn slides: "‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," at the 2015 APS Annual Convention in NYC
Replication Crises and the Statistics Wars: Hidden Controversies
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
D. G. Mayo (Virginia Tech) "Error Statistical Control: Forfeit at your Peril" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
D. G. Mayo: Your data-driven claims must still be probed severely
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science" at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
A. Gelman "50 shades of gray: A research story," presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
Byrd: Statistical considerations of the histomorphometric test protocol
"Statistical considerations of the histomorphometric test protocol"
John E. Byrd, Ph.D. D-ABFA
Maria-Teresa Tersigni-Tarrant, Ph.D.
Central Identification Laboratory
JPAC
Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performancejemille6
Slides from a Rutgers seminar talk by Deborah G. Mayo
December 3, 2014
Rutgers, Department of Statistics and Biostatistics
Abstract: Getting beyond today’s most pressing controversies revolving around statistical methods, I argue, requires scrutinizing their underlying statistical philosophies. Two main philosophies about the roles of probability in statistical inference are probabilism and performance (in the long-run). The first assumes that we need a method of assigning probabilities to hypotheses; the second assumes that the main function of statistical method is to control long-run performance. I offer a third goal: controlling and evaluating the probativeness of methods. An inductive inference, in this conception, takes the form of inferring hypotheses to the extent that they have been well or severely tested. A report of poorly tested claims must also be part of an adequate inference. I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. I then show how the “severe testing” philosophy clarifies and avoids familiar criticisms and abuses of significance tests and cognate methods (e.g., confidence intervals). Severity may be threatened in three main ways: fallacies of statistical tests, unwarranted links between statistical and substantive claims, and violations of model assumptions.
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
D. G. Mayo April 28, 2021 presentation to the CUNY Graduate Center Philosophy Colloquium "Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)"
Controversy Over the Significance Test Controversy
Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) in PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
The constructive role of replication crises teaches a lot about: (1) non-fallacious uses of statistical tests, (2) the rationale for the role of probability in tests, and (3) how to reformulate tests.
Today we’ll try to cover a number of things:
1. Learning philosophy/philosophy of statistics
2. Situating the broad issues within philosophy of science
3. Little bit of logic
4. Probability and random variables
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo discusses various disputes, notably the replication crisis in science, in the context of her just-released book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
These slides were presented on November 22, 2016 during the Annual Julius Symposium, organised by the Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht.
Only a few months ago, the American Statistical Association issued an official statement on significance and p-values (American Statistician, 2016, 70:2, 129-133), claiming that the p-value is "commonly misused and misinterpreted."
In this presentation I focus on the principles of the ASA statement.
Paper given at the PSA 2022 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws.

The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct, at least not without a prior probability assigned to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence that μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ' in the interval, we assess how severely the claim μ > μ' has been probed.

I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast (e.g., DNA matching), the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
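The post-data severity construal above is easy to compute. Below is a minimal sketch in Python, assuming the Normal model with known sigma described in the abstract; the numbers (m_obs, sigma, n) are illustrative stand-ins, not values from the talk. It reproduces the .975 severity attached to the lower limit of a two-sided 95% interval, and traces SEV(μ > μ') across points in the interval.

```python
# Severity for inferring mu > mu' after observing sample mean m_obs,
# under a Normal model with known sigma (illustrative sketch).
import numpy as np
from scipy.stats import norm

def severity(mu_prime, m_obs, sigma, n):
    """SEV(mu > mu'): the probability of observing a sample mean as small
    as or smaller than m_obs, were mu only mu'. High severity means the
    claim mu > mu' is well probed by the result actually observed."""
    se = sigma / np.sqrt(n)
    return norm.cdf((m_obs - mu_prime) / se)

m_obs, sigma, n = 0.4, 1.0, 100                # illustrative numbers
a = m_obs - 1.96 * sigma / np.sqrt(n)          # lower two-sided 95% limit
print(f"SEV(mu > {a:.3f}) = {severity(a, m_obs, sigma, n):.3f}")  # 0.975
for mu_prime in (0.0, 0.2, 0.3, 0.4):
    print(f"SEV(mu > {mu_prime}) = {severity(mu_prime, m_obs, sigma, n):.3f}")
```

Each μ' gets its own severity assessment (here falling from near 1.0 down to .5 at μ' = m_obs), giving the non-binary report of well and poorly warranted discrepancies that the abstract describes.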
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa” was presented (via Zoom) at the conference Is Philosophy Useful for Science, and/or Vice Versa?, January 30 - February 2, 2024, at Chapman University, Schmid College of Science and Technology.
D. Mayo: Philosophical Interventions in the Statistics Wars
ABSTRACT: While statistics has a long history of passionate philosophical controversy, the last decade especially cries out for philosophical illumination. Misuses of statistics, Big Data dredging, and P-hacking make it easy to find statistically significant, but spurious, effects. This obstructs a test's ability to control the probability of erroneously inferring effects, i.e., to control error probabilities. Disagreements about statistical reforms reflect philosophical disagreements about the nature of statistical inference, including whether error probability control even matters! I describe my interventions in statistics in relation to three events. (1) In 2016 the American Statistical Association (ASA) met to craft principles for avoiding misinterpreting P-values. (2) In 2017, a "megateam" (including philosophers of science) proposed "redefining statistical significance," replacing the common threshold of P ≤ .05 with P ≤ .005. (3) In 2019, an editorial in the main ASA journal called for abandoning all P-value thresholds, and even the words "significant/significance".
A word on each. (1) Invited to be a "philosophical observer" at their meeting, I found the major issues were conceptual. P-values measure how incompatible data are with what is expected under a hypothesis that there is no genuine effect: the smaller the P-value, the more indication of incompatibility. The ASA's list of familiar misinterpretations (P-values are not posterior probabilities, statistical significance is not substantive importance, no evidence against a hypothesis need not be evidence for it) should not, I argue, be the basis for replacing tests with methods less able to assess and control erroneous interpretations of data (Mayo 2016, 2019). (2) The "redefine statistical significance" movement appraises P-values from the perspective of a very different quantity: a comparative Bayes Factor. Failing to recognize how contrasting approaches measure different things, disputants often talk past each other (Mayo 2018). (3) To ban P-value thresholds, even to distinguish terrible from warranted evidence, I say, is a mistake (2019). It will not eradicate P-hacking, but it will make it harder to hold P-hackers accountable. A 2020 ASA Task Force on significance testing has just been announced. (I would like to think my blog errorstatistics.com helped.)
To enter the fray between rival statistical approaches, it helps to have a principle applicable to all accounts. There is poor evidence for a claim if little or nothing has been done to find it flawed, even if it is. This forms a basic requirement for evidence that I call the severity requirement: a claim passes with severity only if it is subjected to, and passes, a test that probably would have found it flawed, if it were. It stems from Popper, though he never adequately cashed it out. A variant is the frequentist principle of evidence developed with Sir David Cox (Mayo and Cox 2006).
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G. Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods at the Dept of Psychology, Princeton University, on November 14, 2023
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
The Statistics Wars and Their Casualties (refs)
The Statistics Wars and Their Casualties (w/refs)
D. Mayo (Dept of Philosophy, VT)
Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Nancy Reid, Dept. of Statistics, University of Toronto. Inaugural recipient of the "David R. Cox Foundations of Statistics Award".
Slides from invited presentation at the 2023 JSM: “The Importance of Foundations in Statistical Science”
Ronald Wasserstein, Chair (American Statistical Association)
ABSTRACT: David Cox wrote “A healthy interplay between theory and application is crucial for statistics… This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods.” These foundations distinguish statistical science from the many fields of research in which statistical thinking is a key intellectual component. In this talk I will emphasize the ongoing importance and relevance of theoretical advances and theoretical thinking through some illustrative examples.
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role for which scientists look to statistics. In order to combat the dangers of unthinking bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed “reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Causal inference is not statistical inference
Jon Williamson (University of Kent)
ABSTRACT: Many methods for testing causal claims are couched as statistical methods: e.g., randomised controlled trials, various kinds of observational study, meta-analysis, and model-based approaches such as structural equation modelling and graphical causal modelling. I argue that this is a mistake: causal inference is not a purely statistical problem. When we look at causal inference from a general point of view, we see that methods for causal inference fit into the framework of Evidential Pluralism: causal inference is properly understood as requiring mechanistic inference in addition to statistical inference.

Evidential Pluralism also offers a new perspective on the replication crisis. That observed associations are not replicated by subsequent studies is a part of normal science. A problem only arises when those associations are taken to establish causal claims: a science whose established causal claims are constantly overturned is indeed in crisis. However, if we understand causal inference as involving mechanistic inference alongside statistical inference, as Evidential Pluralism suggests, we avoid fallacious inferences from association to causation. Thus, Evidential Pluralism offers the means to prevent the drama of science from turning into a crisis.
Stephan Guttinger (Lecturer in Philosophy of Data/Data Ethics, University of Exeter, UK)
ABSTRACT: The idea of “questionable research practices” (QRPs) is central to the narrative of a replication crisis in the experimental sciences. According to this narrative, the low replicability of scientific findings is not simply due to fraud or incompetence, but in large part to the widespread use of QRPs, such as “p-hacking” or the lack of adequate experimental controls. The claim is that such flawed practices generate flawed output. The reduction, or even elimination, of QRPs is therefore one of the main strategies proposed by policymakers and scientists to tackle the replication crisis.
What counts as a QRP, however, is not clear. As I will discuss in the first part of this paper, there is no consensus on how to define the term, and ascriptions of the qualifier “questionable” often vary across disciplines, time, and even within single laboratories. This lack of clarity matters, as it creates the risk of introducing methodological constraints that might do more harm than good. Practices labelled as ‘QRPs’ can be both beneficial and problematic for research practice, and targeting them without a sound understanding of their dynamic and context-dependent nature risks creating unnecessary casualties in the fight for a more reliable scientific practice.
To start developing a more situated and dynamic picture of QRPs, I will then turn my attention to a specific example of a dynamic QRP in the experimental life sciences, namely the so-called “Far Western Blot” (FWB). The FWB is an experimental system that can be used to study protein-protein interactions, but which for most of its existence did not see wide uptake in the community because it was seen as a QRP, mainly due to its (alleged) propensity to generate high levels of false positives and negatives. Interestingly, however, it seems that over the last few years the FWB has slowly moved into the space of acceptable research practices. Analysing this shift and the reasons underlying it, I will argue a) that suppressing this practice deprived the research community of a powerful experimental tool, and b) that the original judgment of the FWB was based on a simplistic and non-empirical assessment of its error-generating potential. Ultimately, it seems that the key QRP at work in the FWB case was the way in which the label “questionable” was assigned in the first place. I will argue that findings from this case can be extended to other QRPs in the experimental life sciences, and that they point to a larger issue with how researchers judge the error-potential of new research practices.
David Hand (Professor Emeritus and Senior Research Investigator, Department of Mathematics, Faculty of Natural Sciences, Imperial College London)
ABSTRACT: Science progresses through an iterative process of formulating theories and comparing them with empirical real-world data. Different camps of scientists will favour different theories, until accumulating evidence renders one or more untenable. Not unnaturally, people become attached to theories. Perhaps they invented a theory, and kudos arises from being the originator of a generally accepted theory. A theory might represent a life's work, so that being found wanting might be interpreted as failure. Perhaps researchers were trained in a particular school, and acknowledging its shortcomings is difficult. Because of this, tensions can arise between proponents of different theories.

The discipline of statistics is susceptible to precisely the same tensions. Here, however, the tensions are not between different theories of "what is", but between different strategies for shedding light on the real world from limited empirical data. This can be in the form of how one measures discrepancy between the theory's predictions and observations. It can be in the form of different ways of looking at empirical results. It can be, at a higher level, because of differences between what is regarded as important in a particular context. Or it can be for other reasons.

Perhaps the most familiar example of this tension within statistics is between different approaches to inference. However, there are many other examples of such tensions. We argue that the tension generally arises as a consequence of inadequate care being taken in question formulation. That is, insufficient thought is given to deciding exactly what one wants to know, to determining "What is the question?". The ideas and disagreements are illustrated with several examples.
The neglected importance of complexity in statistics and Metascience
Daniele Fanelli (London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science)
ABSTRACT: Statistics is at war, and Metascience is ailing. This is partially due, the talk will argue, to a paradigmatic blind spot: the assumption that one can draw general conclusions about empirical findings without considering the role played by context, conditions, assumptions, and the complexity of methods and theories. Whilst ideally these particularities should be unimportant in science, in practice they cannot be neglected in most research fields, let alone in research-on-research.

This neglected importance of complexity is supported by theoretical arguments and empirical findings (or the lack thereof) in the recent meta-analytical and metascientific literature. The talk will overview this background and suggest how the complexity of theories and methodologies may be explicitly factored into particular methodologies of statistics and Metaresearch. The talk will then give examples of how this approach may usefully complement existing paradigms, by translating results, methods and theories into quantities of information that are evaluated using an information-compression logic.
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Uri Simonsohn (Professor, Department of Operations, Innovation and Data Sciences at Esade)
ABSTRACT: The statistical tools listed in the title share that a mathematically elegant solution has become the consensus advice of statisticians, methodologists and some mathematically sophisticated researchers writing tutorials and textbooks, and yet they lead research workers to meaningless answers that are often also statistically invalid. Part of the problem is that advice givers take literally the mathematical abstractions of the tools they advocate, instead of taking seriously the actual behavior of researchers.
On Severity, the Weight of Evidence, and the Relationship Between the Two
Margherita Harris (Visiting fellow in the Department of Philosophy, Logic and Scientific Method at the London School of Economics and Political Science)
ABSTRACT: According to the severe tester, one is justified in declaring that one has evidence in support of a hypothesis just in case the hypothesis in question has passed a severe test, one that it would be very unlikely to pass so well if the hypothesis were false. Deborah Mayo (2018) calls this the strong severity principle. The Bayesian, however, can declare that they have evidence for a hypothesis despite not having done anything to test it severely. The core reason for this has to do with the (infamous) likelihood principle, whose violation is not an option for anyone who subscribes to the Bayesian paradigm. Although the Bayesian is largely unmoved by the incompatibility between the strong severity principle and the likelihood principle, I will argue that the Bayesian’s never-ending quest to account for yet another notion, one that is often attributed to Keynes (1921) and is usually referred to as the weight of evidence, betrays the Bayesian’s confidence in the likelihood principle after all. Indeed, I will argue that the weight of evidence and severity may be thought of as two (very different) sides of the same coin: they are two unrelated notions, but what brings them together is the fact that they both make trouble for the likelihood principle, a principle at the core of Bayesian inference. I will relate this conclusion to current debates on how best to conceptualise uncertainty, by the IPCC in particular. I will argue that failure to fully grasp the limitations of an epistemology that envisions the role of probability to be that of quantifying the degree of belief to assign to a hypothesis given the available evidence can be (and has been) detrimental to an adequate communication of uncertainty.
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Aris Spanos (Wilson Schmidt Professor of Economics, Virginia Tech)
ABSTRACT: The discussion places the two cultures, the model-driven statistical modeling and the algorithm-driven modeling associated with Machine Learning (ML) and Statistical Learning Theory (SLT), in a broader context of paradigm shifts in 20th-century statistics, which includes Fisher’s model-based induction of the 1920s and variations/extensions thereof, and the Data Science (ML, SLT, etc.) and Graphical Causal modeling of the 1990s. The primary objective is to compare and contrast the effectiveness of different approaches to statistics in learning from data about phenomena of interest, and to relate that to the current discussions pertaining to the statistics wars and their potential casualties.
Comparing Frequentist and Bayesian Control of Multiple Testing
James Berger
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20 × 100,000 = 2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 × 10⁻⁷. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 × 10⁻⁷. This was because they had to account for a huge number of multiplicities in their analyses.

Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; ….

I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. ... There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
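The optional-stopping point above (testing repeatedly as data accumulate makes a p-value below .05 inevitable in the limit, even under a true null) is easy to check by simulation. A sketch with arbitrary illustrative settings, not code from the talk:

```python
# Simulating optional stopping under a true null hypothesis: compute a
# two-sided z-test p-value after every new observation, and record
# whether p < .05 is ever reached within n_max observations.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_max, reps, alpha = 1000, 2000, 0.05
ever_sig = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n_max)        # H0 is true: mean is exactly 0
    n = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(n)          # z statistic after each observation
    p = 2 * norm.sf(np.abs(z))             # two-sided p-value at each n
    ever_sig += np.any(p < alpha)
# Far above .05, and it keeps growing as n_max grows (toward 1 in the limit).
print(f"P(p < .05 at some n <= {n_max}) ~ {ever_sig / reps:.2f}")
```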
Clark Glymour
ABSTRACT: "Data dredging"--searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. ... These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
The Duality of Parameters and the Duality of Probability
Suzanne Thornton
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308), the behavior of a procedure under hypothetical repetition, bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters.

Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited from the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to cover the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. ... In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter; rather, it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
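A minimal sketch of the Normal-mean confidence distribution described above (illustrative code, not Thornton's): the single data-dependent function CD(θ) = Φ(√n·(θ − x̄)) returns a central interval at any requested confidence level.

```python
# Confidence distribution (CD) for the mean theta of a N(theta, 1) sample.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta_true, n = 1.0, 50                      # illustrative setup
xbar = rng.normal(theta_true, 1.0, n).mean()

def cd(theta):
    """CD evaluated at theta: Phi(sqrt(n) * (theta - xbar))."""
    return norm.cdf(np.sqrt(n) * (theta - xbar))

def cd_interval(level):
    """Central interval at any confidence level, read off the same CD."""
    tail = (1 - level) / 2
    return xbar + norm.ppf([tail, 1 - tail]) / np.sqrt(n)

for level in (0.50, 0.90, 0.95):
    lo, hi = cd_interval(level)
    print(f"{level:.0%} interval: ({lo:.3f}, {hi:.3f}), CD mass = {cd(hi) - cd(lo):.2f}")
```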
On the interpretation of the mathematical characteristics of statistical test...
Statistical hypothesis tests are often misused and misinterpreted. Here I focus on one source of such misinterpretation, namely an inappropriate notion regarding what the mathematical theory of tests implies, and does not imply, when it comes to the application of tests in practice. The view taken here is that it is helpful and instructive to be consciously aware of the essential difference between mathematical model and reality, and to appreciate the mathematical model and its implications as a tool for thinking rather than something that has a truth value regarding reality. Insights are presented regarding the role of model assumptions, unbiasedness and the alternative hypothesis, Neyman-Pearson optimality, and multiple and data-dependent testing.
The role of background assumptions in severity appraisal
In the past decade, discussions around the reproducibility of scientific findings have led to a re-appreciation of the importance of guaranteeing that claims are severely tested. The inflation of Type 1 error rates due to flexibility in the data analysis is widely considered one of the underlying causes of low replicability rates. Solutions, such as study preregistration, are becoming increasingly popular to combat this problem. Preregistration, however, only allows researchers to evaluate the severity of a test; not all preregistered studies provide a severe test of a claim. The appraisal of the severity of a test depends on background information, such as assumptions about the data generating process, and auxiliary hypotheses that influence the final choice of the design of the test. In this article, I will discuss the difference between subjective and inter-subjectively testable assumptions underlying scientific claims, and the importance of separating the two. I will stress the role of justifications in statistical inferences and the conditional nature of scientific conclusions following these justifications, and highlight how severe tests could lead to inter-subjective agreement, based on a philosophical approach grounded in methodological falsificationism. Appreciating the role of background assumptions in the appraisal of severity should shed light on current discussions about the role of preregistration, the interpretation of the results of replication studies, and proposals to reform statistical inferences.
The two statistical cornerstones of replicability: addressing selective infer...jemille6
Tukey’s last published work, in 2000, was an obscure entry on multiple comparisons in the
Encyclopedia of Behavioral Sciences, addressing the two topics in the title. Replicability
was not mentioned at all, nor was any other connection made between the two topics. I shall demonstrate how these two topics critically affect replicability using recently completed studies. I shall review how these have been addressed in the past. I shall
review in more detail the available ways to address selective inference. My conclusion is that conducting many small replicability studies without strict standardization is the way to assure replicability of results in science, and we should introduce policies to make this happen.
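Since the piece turns on addressing selective inference, a minimal Python sketch of the Benjamini-Hochberg step-up procedure (the standard false-discovery-rate adjustment associated with Benjamini; this implementation is illustrative, not taken from the talk) may help fix ideas:

    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        # Reject the k smallest p-values for the largest k with p_(k) <= k*q/m;
        # this controls the false discovery rate at level q for independent tests.
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        below = p[order] <= np.arange(1, m + 1) * q / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])
            reject[order[:k + 1]] = True
        return reject

    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
    # [ True  True False False False]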
The replication crisis: are P-values the problem and are Bayes factors the so...jemille6
Today’s posterior is tomorrow’s prior. Dennis Lindley
It has been claimed that science is undergoing a replication crisis and that when looking for culprits, the cult of significance is the chief suspect. It has also been claimed that Bayes factors might provide a solution.
In my opinion, these claims are misleading and part of the problem is our understanding
of the purpose and nature of replication, which has only recently been subject to formal
analysis.
What we are or should be interested in is truth. Replication is a coherence, not a correspondence, requirement, and one that has a strong dependence on the size of the replication study.
Consideration of Bayes factors raises a puzzling question. Should the Bayes factor for a replication study be calculated as if it were the initial study? If the answer is yes, the approach is not fully Bayesian and furthermore the Bayes factors will be subject to
exactly the same replication ‘paradox’ as P-values. If the answer is no, then in what
sense can an initially found Bayes factor be replicated and what are the implications for how we should view replication of P-values?
A further issue is that little attention has been paid to false negatives and, by extension, to true negatives. Yet, as is well known from the theory of diagnostic tests, it is
meaningless to consider the performance of a test in terms of false positives alone.
I shall argue that we are in danger of confusing evidence with the conclusions we draw and that any reforms of scientific practice should concentrate on producing evidence that is as reliable as it can be qua evidence. There are many basic scientific practices in
need of reform. Pseudoreplication, for example, and the routine destruction of
information through dichotomisation are far more serious problems than many matters of inferential framing that seem to have excited statisticians.
The ASA President’s Task Force Statement on Statistical Significance and Replic...jemille6
Yoav Benjamini's slides “The ASA President’s Task Force Statement on Statistical Significance and Replicability” for the Special Session of the (remote) Phil Stat Forum: “Statistical Significance Test Anxiety” on 11 January 2022
D. Mayo's slides “The Statistics Wars and Intellectual Conflicts of Interest” for the Special Session of the (remote) Phil Stat Forum: “Statistical Significance Test Anxiety” on 11 January 2022
T. Pradeu & M. Lemoine: Philosophy in Science: Definition and Boundariesjemille6
Do philosophers of science frequently contribute to science, and if so, how? Bibliometrics helps assess how surprisingly large the corpus of papers authored or co-authored by philosophers and published in science journals is. Indeed, several hundred philosophers have published in scientific journals. It is also possible to assess how influential this work has been in terms of citations, as compared to the average number of citations in the same journals in the same year. Unsurprisingly, many of these papers authored or co-authored by philosophers and published in scientific journals are poorly cited, while a handful of them are widely cited. However, the most interesting result is that there is a significant corpus of papers authored by philosophers (published both in science journals and in philosophy journals) and significantly cited in science. It is more difficult, albeit crucial, to identify the most contributive philosophical papers, namely those which have penetrated science not only through publication or citation in science journals, but also through discussion or endorsement by some scientists.
Based on the identification of this often neglected corpus, which we propose to call "philosophy in science" (PinS), it becomes possible to describe the most central features of this particular way of doing philosophy of science. The first feature is bibliographic: philosophers in science tend to cite little philosophy and a lot of (up-to-date) science. Second, they also address a scientific question rather than a philosophical question. Third, in doing so, they use traditional tools of philosophy of science, typically and mostly, conceptual analysis, explication of implicit claims, examination of the consistency of claims, assessment of the relevance of methods or models. More rarely, but very interestingly, they also make positive and original contributions by bridging domains of science or suggesting hypotheses.
This different context – in particular, the specific requirements for a publication in a peer-reviewed science journal – transforms philosophy of science. Is it still philosophy? What is the difference with approaches such as "philosophy of science in practice", "complementary science", "scientific philosophy", "theory of science", and naturalism? PinS faces a double "impostor syndrome": not entirely philosophical for philosophers, and not entirely scientific for scientists. In conclusion, we will explore how PinS can respond to this double challenge.
1. The ASA (2016) Statement on P-values:
How to Stop Refighting
the Statistics Wars
The CLA Quantitative Methods
Collaboration Committee &
Minnesota Center for Philosophy of Science
April 8, 2016
Deborah G. Mayo
2. Brad Efron
“By and large, Statistics is a prosperous and
happy country, but it is not a completely
peaceful one. Two contending philosophical
parties, the Bayesians and the frequentists,
have been vying for supremacy over the past
two-and-a-half centuries. …Unlike most
philosophical arguments, this one has
important practical consequences. The two
philosophies represent competing visions of
how science progresses….” (2013, p. 130)
3. Today’s Practice: Eclectic
O Use of eclectic tools, little handwringing
over foundations
O Bayes-frequentist unifications
O Scratch a bit below the surface and
foundational problems emerge….
O Not just 2: family feuds within (Fisherian,
Neyman-Pearson; tribes of Bayesians,
likelihoodists)
4. Why are the statistics wars more
serious today?
O Replication crises led to programs to
restore credibility: fraud busting,
reproducibility studies
O Taskforces, journalistic reforms, and
debunking treatises
O Proposed methodological reforms––many
welcome (preregistration)–some quite
radical
5. I was a philosophical observer at the
ASA P-value “pow wow”
6. “Don’t throw out the error control baby
with the bad statistics bathwater”
The American Statistician
7. O “Statistical significance tests are a small
part of a rich set of:
“techniques for systematically appraising
and bounding the probabilities … of
seriously misleading interpretations of
data” (Birnbaum 1970, p. 1033)
O These I call error statistical methods (or
sampling theory).”
8. One Rock in a Shifting Scene
O “Birnbaum calls it the ‘one rock in a
shifting scene’ in statistical practice
O “Misinterpretations and abuses of tests,
warned against by the very founders of
the tools, shouldn’t be the basis for
supplanting them with methods unable or
less able to assess, control, and alert us
to erroneous interpretations of data”
9. Error Statistics
O Statistics: Collection, modeling, drawing
inferences from data to claims about
aspects of processes
O The inference may be in error
O It’s qualified by a claim about the
method’s capabilities to control and alert
us to erroneous interpretations (error
probabilities)
10. “p-value. …to test the conformity of the
particular data under analysis with H0 in
some respect:
…we find a function t = t(y) of the data,
to be called the test statistic, such that
• the larger the value of t the more
inconsistent are the data with H0;
• The random variable T = t(Y) has a
(numerically) known probability
distribution when H0 is true.
…the p-value corresponding to any t as
p = p(t) = P(T ≥ t; H0)”
(Mayo and Cox 2006, p. 81)
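As a concrete instance of the definition above, here is a minimal Python sketch (SciPy assumed; the data and the hypotheses H0: μ = 0 with σ = 1 are invented for illustration), using test statistic t(y) = √n·ȳ, which is N(0, 1) under H0:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y = rng.normal(0.3, 1.0, size=30)   # hypothetical data
    t = np.sqrt(len(y)) * y.mean()      # larger t, more inconsistent with H0
    p = stats.norm.sf(t)                # p = P(T >= t; H0), since T ~ N(0, 1) under H0
    print(round(t, 2), round(p, 3))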
11. O “Clearly, if even larger differences
than t occur fairly frequently under H0
(p-value is not small), there’s scarcely
evidence of incompatibility
O But a small p-value doesn’t warrant
inferring a genuine effect H, let alone a
scientific conclusion H*–as the ASA
document correctly warns (Principle 3)”
12. A Paradox for Significance Test Critics
Critic: It’s much too easy to get a small P-
value
You: Why do they find it so difficult to
replicate the small P-values others found?
Is it easy or is it hard?
13. O R.A. Fisher: it’s easy to lie with statistics
by selective reporting (he called it the
“political principle”)
O Sufficient finagling—cherry-picking, P-
hacking, significance seeking—may
practically guarantee a researcher’s
preferred claim C gets support, even if it’s
unwarranted by evidence
O Note: Rejecting a null taken as support for
some non-null claim C
14. Severity Requirement:
O If data x0 agree with a claim C, but the
test procedure had little or no capability
of finding flaws with C (even if the claim
is incorrect), then x0 provide poor
evidence for C
O Such a test fails a minimal requirement
for a stringent or severe test
O My account: severe testing based on
error statistics
15. Two main views of the role of
probability in inference (not in ASA doc)
O Probabilism. To assign a degree of
probability, confirmation, support or belief
in a hypothesis, given data x0. (e.g.,
Bayesian, likelihoodist)—with regard for
inner coherency
O Performance. Ensure long-run reliability
of methods, coverage probabilities
(frequentist, behavioristic Neyman-
Pearson)
16. What happened to using probability to assess
the error probing capacity by the severity
criterion?
O Neither “probabilism” nor “performance”
directly captures it
O Good long-run performance is a
necessary, not a sufficient, condition for
severity
O That’s why frequentist methods can be
shown to have howlers
17. O Problems with selective reporting, cherry
picking, stopping when the data look
good, P-hacking, are not problems about
long-runs—
O It’s that we cannot say the case at hand
has done a good job of avoiding the
sources of misinterpreting data
18. A claim C is not warranted _______
O Probabilism: unless C is true or probable
(gets a probability boost, is made
comparatively firmer)
O Performance: unless it stems from a
method with low long-run error
O Probativism (severe testing): unless
something (a fair amount) has been done
to probe ways we can be wrong about C
19. O If you assume probabilism is required for
inference, error probabilities are relevant
for inference only by misinterpretation
False!
O I claim, error probabilities play a crucial
role in appraising well-testedness
O It’s crucial to be able to say, C is highly
believable or plausible but poorly tested
O Probabilists can allow for the distinct task
of severe testing (you may not have to
entirely take sides in the stat wars)
20. The ASA doc gives no sense of
different tools for different jobs
O “To use an eclectic toolbox in statistics,
it’s important not to expect an agreement
on numbers from methods evaluating
different things
O A p-value isn’t ‘invalid’ because it does
not supply “the probability of the null
hypothesis, given the finding” (the
posterior probability of H0) (Trafimow and
Marks*, 2015)
*Editors of a journal, Basic and Applied Social
Psychology
21. O ASA Principle 2 says a p-value ≠
posterior but one doesn’t get the sense of
its role in error probability control
O It’s not that I’m keen to defend many
common uses of significance tests
O The criticisms are often based on
misunderstandings; consequently so are
many “reforms”
22. Biasing selection effects:
O One function of severity is to identify
problematic selection effects (not all are)
O Biasing selection effects: when data or
hypotheses are selected or generated (or
a test criterion is specified), in such a way
that the minimal severity requirement is
violated, seriously altered or incapable of
being assessed
O Picking up on these alterations is
precisely what enables error statistics to
be self-correcting—
23. Nominal vs actual significance levels
The ASA correctly warns that “[c]onducting
multiple analyses of the data and reporting
only those with certain p-values” leads to
spurious p-values (Principle 4)
Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at
the 5 percent level.’ ….The actual level of
significance is not 5 percent, but 64
percent! (Selvin, 1970, p. 104)
(From Morrison & Henkel’s The Significance Test
Controversy, 1970!)
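Selvin's 64 percent is just the familiar multiplicity calculation, easily checked in Python (NumPy assumed; under H0 each p-value is uniform on (0, 1)):

    import numpy as np

    print(1 - 0.95**20)                   # ≈ 0.64: chance the best of 20 null tests is "significant"

    rng = np.random.default_rng(0)
    p = rng.uniform(size=(100_000, 20))   # 20 independent tests per study, all nulls true
    print((p.min(axis=1) < 0.05).mean())  # ≈ 0.64 again, by simulation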
24. O They were clear on the fallacy: blurring
the “computed” or “nominal” significance
level, and the “actual” level
O There are many more ways you can be
wrong with hunting (different sample
space)
O Here’s a case where a p-value report is
invalid
25. You report: Such results would be difficult
to achieve under the assumption of H0
When in fact such results are common
under the assumption of H0
(Formally):
O You say Pr(P-value < Pobs; H0) ≈ Pobs,
which is small
O But in fact Pr(P-value < Pobs; H0) is high
26. O Nowadays, we’re likely to see the tests
blamed
O My view: Tests don’t kill inference, people
do
O Even worse are those statistical accounts
where the abuse vanishes!
27. On some views, taking account of biasing
selection effects “defies scientific sense”
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, as
they are more commonly called, data dredging
and peeking at the data. The frequentist solution
to both problems involves adjusting the P-
value…But adjusting the measure of evidence
because of considerations that have nothing
to do with the data defies scientific sense,
belies the claim of ‘objectivity’ that is often made
for the P-value” (Goodman 1999, p. 1010)
(To his credit, he’s open about this; heads the
Meta-Research Innovation Center at Stanford)
28. Technical activism isn’t free of philosophy
Ben Goldacre (of Bad Science) in a 2016
Nature article, is puzzled that bad statistical
practices continue even in the face of the
new “technical activism”:
The editors at Annals of Internal
Medicine,… repeatedly (but confusedly)
argue that it is acceptable to identify
“prespecified outcomes” [from results]
produced after a trial began; ….they say
that their expertise allows them to
permit — and even solicit —
undeclared outcome-switching
29. His paper: “Make journals report clinical
trials properly”
O He shouldn’t close his eyes to the
possibility that some of the pushback he’s
seeing has a basis in statistical
philosophy!
30. Likelihood Principle (LP)
The vanishing act links to a pivotal
disagreement in the philosophy of statistics
battles
In probabilisms, the import of the data is via
the ratios of likelihoods of hypotheses
P(x0;H1)/P(x0;H0)
The data x0 are fixed, while the hypotheses
vary
31. Jimmy Savage on the LP:
O “According to Bayes' theorem,…. if y is
the datum of some other experiment, and
if it happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have
exactly the same thing to say about the
values of µ…” (Savage 1962, p. 17)
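The classic binomial/negative-binomial example shows what the LP asserts, and what error statistics resists. A sketch in Python (SciPy assumed; the numbers are the stock textbook illustration, not from the slides): observing 3 successes in 12 trials gives a likelihood proportional to θ³(1−θ)⁹ under both sampling plans, yet the p-values for H0: θ = 0.5 against θ < 0.5 differ:

    from scipy import stats

    # Plan A: n = 12 trials fixed in advance, 3 successes observed.
    p_binom = stats.binom.cdf(3, 12, 0.5)   # P(X <= 3; H0) ≈ 0.073
    # Plan B: sample until the 3rd success, which took 12 trials.
    # P(needing >= 12 trials; H0) = P(at most 2 successes in the first 11 trials)
    p_negbin = stats.binom.cdf(2, 11, 0.5)  # ≈ 0.033
    print(p_binom, p_negbin)                # same likelihood function, different error probabilities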
32. All error probabilities violate the LP
(even without selection effects):
Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space (Lindley 1971, p. 436)
The LP implies…the irrelevance of
predesignation, of whether a hypothesis
was thought of beforehand or was
introduced to explain known effects
(Rosenkrantz, 1977, p. 122)
33. Paradox of Optional Stopping:
Error probing capacities are altered not just
by cherry picking and data dredging, but
also via data dependent stopping rules:
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
Instead of fixing the sample size n in
advance, in some tests, n is determined by
a stopping rule:
34. “Trying and trying again”
O Keep sampling until H0 is rejected at
0.05 level
i.e., keep sampling until |M| ≥ 1.96 σ/√n
O Trying and trying again: Having failed
to rack up a 1.96 σ difference after 10
trials, go to 20, 30 and so on until
obtaining a 1.96 σ difference
35. Nominal vs. Actual
significance levels again:
O With n fixed the Type 1 error probability is
0.05
O With this stopping rule the actual
significance level differs from, and will be
greater than 0.05
O Violates Cox and Hinkley’s (1974) “weak
repeated sampling principle”
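A minimal Python simulation (NumPy assumed; the threshold and n_max are illustrative) of "trying and trying again", peeking after every observation, shows the actual Type 1 error rate:

    import numpy as np

    rng = np.random.default_rng(1)

    def ever_rejects(n_max=1000):
        x = rng.standard_normal(n_max)    # data generated under H0: μ = 0, σ = 1
        n = np.arange(1, n_max + 1)
        z = np.cumsum(x) / np.sqrt(n)     # running z-statistic √n·M after each observation
        return np.any(np.abs(z) >= 1.96)  # stop and "reject" the first time it crosses

    print(np.mean([ever_rejects() for _ in range(2000)]))  # far above 0.05; tends to 1 as n_max grows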
36. 1959 Savage Forum
Jimmy Savage audaciously declared:
“optional stopping is no sin”
so the problem must be with significance
levels
Peter Armitage:
“thou shalt be misled”
if thou dost not know the person tried and
tried again (p. 72)
37. O “The ASA correctly warns that
“[c]onducting multiple analyses of the
data and reporting only those with
certain p-values” leads to spurious p-
values (Principle 4)
O However, the same p-hacked hypothesis
can occur in Bayes factors; optional
stopping can be guaranteed to exclude
true nulls from HPD intervals
38. With One Big Difference:
O “The direct grounds to criticize inferences
as flouting error statistical control are lost
O They condition on the actual data,
O whereas error probabilities take into account
other outcomes that could have occurred
but did not (sampling distribution)”
39. Tension: Does Principle 4 Hold for
Other Approaches?
O “In view of the prevalent misuses of and
misconceptions concerning p-values, some
statisticians prefer to supplement or even
replace p-values with other approaches”
(They include Bayes factors, likelihood ratios,
as “alternative measures of evidence”)
O They appear to extend “full reporting and
transparency” (principle 4) to all methods.
O Some controversy: should it apply only to “p-
values and related statistics”?
40. How might probabilists block intuitively
unwarranted inferences (without error
probabilities)?
A subjective Bayesian might say:
If our beliefs were mixed into the
interpretation of the evidence, we wouldn’t
declare there’s statistical evidence of some
unbelievable claim (distinguishing shades of
grey and being politically moderate,
ovulation and voting preferences)
41. Rescued by beliefs
O That could work in some cases (it still
wouldn’t show what researchers had done
wrong)—battle of beliefs
O Besides, researchers sincerely believe
their hypotheses
O So now you’ve got two sources of
flexibility, priors and biasing selection
effects
42. No help with our most important
problem
O How to distinguish the warrant for a
single hypothesis H with different methods
(e.g., one has biasing selection effects,
another, pre-registered results and
precautions)?
43. Most Bayesians are “conventional”
O Eliciting subjective priors too difficult,
scientists reluctant to allow subjective
beliefs to overshadow data
O Default, or reference, priors are supposed
to prevent prior beliefs from influencing
the posteriors (O-Bayesians; Berger 2006)
44. O A classic conundrum: no general non-
informative prior exists, so most are
conventional
O “The priors are not to be considered
expressions of uncertainty, ignorance, or
degree of belief. Conventional priors may
not even be probabilities…” (Cox and
Mayo 2010, p. 299)
O Prior probability: An undefined
mathematical construct for obtaining
posteriors (giving highest weight to data,
or satisfying invariance, or matching or….)
45. Conventional Bayesian Reforms are
touted as free of selection effects
O Jim Berger gives us “conditional error
probabilities” CEPs
O “[I]t is considerably less important to disabuse
students of the notion that a frequentist error
probability is the probability that the hypothesis
is true, given the data”, since under his new
definition “a CEP actually has that
interpretation”
O “CEPs do not depend on the stopping rule”
(“Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
2003)
46. By and large the ASA doc highlights
classic foibles
“In relation to the test of significance, we
may say that a phenomenon is
experimentally demonstrable when we
know how to conduct an experiment
which will rarely fail to give us a
statistically significant result”
(Fisher 1935, p. 14)
(“isolated” low P-value ≠> H: statistical
effect)
47. Statistical ≠> substantive (H ≠> H*)
“[A]ccording to Fisher, rejecting the null
hypothesis is not equivalent to accepting
the efficacy of the cause in question. The
latter...requires obtaining more significant
results when the experiment, or an
improvement of it, is repeated at other
laboratories or under other conditions”
(Gigerenzer et al. 1989, pp. 95-6)
48. O Flaws in alternative H* have not been probed
by the test,
O The inference from a statistically significant
result to H* fails to pass with severity
O “Merely refuting the null hypothesis is too
weak to corroborate” substantive H*, “we
have to have ‘Popperian risk’, ‘severe test’
[as in Mayo], or what philosopher Wesley
Salmon called ‘a highly improbable
coincidence’” (Meehl and Waller 2002,
p. 184)
49. O Encouraged by something called NHSTs
–that supposedly allow moving from
statistical to substantive
O If defined that way, they exist only as
abuses of tests
O ASA doc ignores Neyman-Pearson (N-P)
tests
50. Neyman-Pearson (N-P) Tests:
Null and alternative hypotheses
H0, H1 that exhaust the parameter
space
O So the fallacy of rejection (H → H*) is
impossible
O Rejecting the null only indicates statistical
alternatives
51. P-values Don’t Report Effect Sizes
Principle 5
Who ever said to just report a P-value?
O “Tests should be accompanied by
interpretive tools that avoid the fallacies of
rejection and non-rejection. These
correctives can be articulated in either
Fisherian or Neyman-Pearson terms”
(Mayo and Cox 2006, Mayo and Spanos
2006)
52. To Avoid Inferring a Discrepancy
Beyond What’s Warranted:
The large n problem:
O Severity tells us: an α-significant
difference is indicative of less of a
discrepancy from the null if it results from
a larger (n1) rather than a smaller (n2)
sample size (n1 > n2)
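A back-of-the-envelope Python sketch of this point (SciPy assumed; the function name and defaults are illustrative): fixing a just-significant d(x0) = 1.96, the discrepancy μ1 that passes "μ > μ1" with severity 0.9 shrinks as n grows:

    from math import sqrt
    from scipy import stats

    def discrepancy_with_severity(d0, n, sev=0.9, mu0=0.0, sigma=1.0):
        # Solve SEV(μ > μ1) = Φ(d0 - δ1) = sev for μ1, with δ1 = √n(μ1 - μ0)/σ.
        delta1 = d0 - stats.norm.ppf(sev)
        return mu0 + delta1 * sigma / sqrt(n)

    print(discrepancy_with_severity(1.96, 100))     # ≈ 0.068
    print(discrepancy_with_severity(1.96, 10_000))  # ≈ 0.0068: same d0, far smaller indicated discrepancy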
53. O What’s more indicative of a large effect
(fire), a fire alarm that goes off with burnt
toast or one so insensitive that it doesn’t
go off unless the house is fully ablaze?
[The larger sample size is like the one that
goes off with burnt toast]
54. What About the Fallacy of
Non-Significant Results?
O Non-Replication occurs with non-
significant results, but there’s much
confusion in interpreting them
O No point in running replication research if
you view negative results as uninformative
55. O They don’t warrant a 0 discrepancy
O Use the same severity reasoning to rule
out discrepancies that very probably
would have resulted in a larger difference
than observed: set upper bounds
O If you very probably would have observed
a more impressive (smaller) p-value than
you did, if μ > μ1 (where μ1 = μ0 + γ), then
the data are good evidence that μ < μ1
O Akin to power analysis (Cohen, Neyman)
but sensitive to x0
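For the non-significant case, the same reasoning yields an upper bound. A Python sketch (SciPy assumed; names and numbers are illustrative) for test T+ with σ known:

    from math import sqrt
    from scipy import stats

    def severity_upper_bound(xbar, mu0, sigma, n, sev=0.95):
        # μ1 with SEV(μ <= μ1) = Φ(δ1 - d0) = sev, i.e. δ1 = d0 + z_sev;
        # this simplifies to xbar + z_sev * σ/√n.
        d0 = sqrt(n) * (xbar - mu0) / sigma
        return mu0 + (d0 + stats.norm.ppf(sev)) * sigma / sqrt(n)

    # x̄ = 0.1, μ0 = 0, σ = 1, n = 100: d0 = 1.0, not statistically significant
    print(severity_upper_bound(0.1, 0.0, 1.0, 100))  # ≈ 0.26: discrepancies ≥ 0.26 are well ruled out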
56. Improves on Confidence Intervals
“This is akin to confidence intervals (which
are dual to tests) but we get around their
shortcomings:
O We do not fix a single confidence level,
O The evidential warrant for different points
in any interval is distinguished”
O Go beyond “performance goal” to give
inferential construal
57. Simple Fisherian Tests Have Important
Uses
O Model validation:
George Box calls for ecumenism because
“diagnostic checks and tests of fit” he
argues “require frequentist theory
significance tests for their formal
justification” (Box 1983, p. 57)
58. “What we are advocating, ..is what Cox
and Hinkley (1974) call ‘pure significance
testing’, in which certain of the model’s
implications are compared directly to the
data, rather than entering into a contest
with some alternative model” (Gelman &
Shalizi 2013, p. 20)
O Fraudbusting and forensics: Finding
data too good to be true (Simonsohn 2013)
59. Concluding remarks: Reforms without
Philosophy of Statistics are Blind
O I end my commentary: “Failing to understand the
correct (if limited) role of simple significance tests
threatens to throw the error control baby out with the
bad statistics bathwater”
O Avoid refighting the same wars, or banning methods
based on cookbook methods long lampooned
O Makes no sense to banish tools for testing
assumptions the other methods require and cannot
perform
60. O Don’t expect an agreement on numbers
from methods evaluating different things
O Recognize different roles of probability:
probabilism, long-run performance,
probativism (severe testing)
O Probabilisms may enable rather than
block illicit inferences due to biasing
selection effects
O Main paradox of the “replication crisis”
61. Paradox of Replication
O Critic: It’s too easy to satisfy standard significance
thresholds
O You: Why do replicationists find it so hard to achieve
them with preregistered trials?
O Critic: Most likely the initial studies were guilty of p-
hacking, cherry-picking, significance seeking, QRPs
O You: So, replication researchers want methods that
pick up on and block these biasing selection effects
O Critic: Actually the “reforms” recommend methods
where selection effects make no difference
62. Either you care about error probabilities
or not
O If not, experimental design principles (e.g.,
RCTs) may well go by the board
O Not enough to have a principle: we must be
transparent about data-dependent selections
O Your statistical account needs a way to make
use of the information
O “Technical activists” are not free of conflicts of
interest and of philosophy
63. Granted, error statistical improvements
are needed
O An inferential construal of error probabilities
wasn’t clearly given (Birnbaum)-my goal
O It’s not long-run error control (performance),
but severely probing flaws today
O I also grant that an error statistical account
needs to say more about how it will use
background information
64. Future ASA project
O Look at the “other approaches” (Bayes
factors, LRs, Bayesian updating)
O What is it for a replication to succeed or
fail on those approaches?
(can’t be just a matter of prior beliefs in
the hypotheses)
65. Finally, it should be recognized that
often better statistics cannot help
O Rather than search for more “idols”, do
better science, get better experiments and
theories
O One hypothesis must always be: our
results point to the inability of our study to
severely probe the phenomenon of
interest
66. Be ready to admit questionable science
O The scientific status of an inquiry is
questionable if it is unable to distinguish a
poorly run study from a poor hypothesis
O or if it continually violates minimal
requirements for severe testing
67. Non-replications often construed as
simply weaker effects
2 that didn’t replicate in psychology:
O Belief in free will and cheating
O Physical distance (of points plotted) and
emotional closeness
69. The ASA’s Six Principles
O (1) P-values can indicate how incompatible the data are
with a specified statistical model
O (2) P-values do not measure the probability that the
studied hypothesis is true, or the probability that the data
were produced by random chance alone
O (3) Scientific conclusions and business or policy
decisions should not be based only on whether a p-value
passes a specific threshold
O (4) Proper inference requires full reporting and
transparency
O (5) A p-value, or statistical significance, does not
measure the size of an effect or the importance of a result
O (6) By itself, a p-value does not provide a good measure
of evidence regarding a model or hypothesis
70. Mayo and Cox (2010): Frequentist Principle of
Evidence (FEV); SEV: Mayo and Spanos (2006)
FEV/SEV: insignificant result: A moderate P-value
is evidence of the absence of a discrepancy δ
from H0, only if there is a high probability the
test would have given a worse fit with H0 (i.e.,
d(X) > d(x0) ) were a discrepancy δ to exist
FEV/SEV significant result: a statistically significant d(x0) is
evidence of a discrepancy δ from H0, if and only if
there is a high probability the test would have yielded d(X)
< d(x0) were a discrepancy as large as δ absent
71. Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0,
σ known
(FEV/SEV): If d(x0) is not statistically significant, then
μ ≤ M0 + kε σ/√n passes the test T+ with
severity (1 – ε)
(FEV/SEV): If d(x0) is statistically significant, then
μ > M0 + kε σ/√n passes the test T+ with
severity (1 – ε),
where P(d(X) > kε) = ε
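Both FEV/SEV clauses reduce to one Normal computation. A minimal Python sketch (SciPy assumed; the function is an illustration of the formulas above, not code from the slides):

    from math import sqrt
    from scipy import stats

    def severity(xbar, mu0, mu1, sigma, n, significant):
        # Test T+ (H0: μ <= μ0 vs H1: μ > μ0), σ known; d0 = √n(x̄ - μ0)/σ.
        d0 = sqrt(n) * (xbar - mu0) / sigma
        delta1 = sqrt(n) * (mu1 - mu0) / sigma   # shift of d(X) under μ = μ1
        if significant:
            return stats.norm.cdf(d0 - delta1)   # SEV(μ > μ1) = P(d(X) <= d0; μ1)
        return 1 - stats.norm.cdf(d0 - delta1)   # SEV(μ <= μ1) = P(d(X) > d0; μ1)

    print(severity(0.39, 0.0, 0.2, 1.0, 25, significant=True))  # ≈ 0.83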
72. References
O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical
Inference: A Discussion, edited by L. J. Savage. London: Methuen.
O Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?”
and “Rejoinder,” Statistical Science 18(1): 1-12; 28-32.
O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis
1 (3): 385–402.
O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
O Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G. E. P., Leonard, T.
and Wu, D. F. J. (eds.), Scientific Inference, Data Analysis, and Robustness, pp. 51-84.
New York: Academic Press.
O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap,” Bulletin
of the American Mathematical Society 50(1): 126-46.
O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and
Hall.
O Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in
Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental
Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by
Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University
Press.
O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
73. O Gelman, A. and Shalizi, C. 2013. 'Philosophy and the Practice of Bayesian
Statistics' and 'Rejoinder', British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The
Empire of Chance. Cambridge: Cambridge University Press.
O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.
O Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature
530(7588);online 02Feb2016.
O Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 2: The Bayes
Factor,” Annals of Internal Medicine 130: 1005-1013.
O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of
Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto:
Holt, Rinehart and Winston.
O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and
Its Conceptual Foundation. Chicago: University of Chicago Press.
O Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive
Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos
eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The
Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-
Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
O Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of
Science 57 (2) (June 1): 323–357.
74. O Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook
of the Philosophy of Science. The Netherlands: Elsevier.
O Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New
Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7
(3): 283–300.
O Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
O Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical
Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First
published in Bul. Acad. Pol.Sci. 73-96.
O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London:
Methuen.
O Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The
Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago:
Aldine De Gruyter.
O Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data
Detected by Statistics Alone", Psychological Science, vol. 24, no. 10, pp. 1875-1888.
O Trafimow D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology
37(1): pp. 1-2.
O Wasserstein, R. and Lazar, N. 2016. “The ASA’s Statement on p-Values: Context,
Process, and Purpose,” The American Statistician 70(2): 129-133.
75. Abstract
If a statistical methodology is to be adequate, it needs
to register how “questionable research practices”
(QRPs) alter a method’s error probing capacities. If
little has been done to rule out flaws in taking data as
evidence for a claim, then that claim has not passed a
stringent or severe test. The goal of severe testing is
the linchpin for (re)interpreting frequentist methods so
as to avoid long-standing fallacies at the heart of
today’s statistics wars. A contrasting philosophy views
statistical inference in terms of posterior probabilities
in hypotheses: probabilism. Presupposing probabilism,
critics mistakenly argue that significance and
confidence levels are misinterpreted, exaggerate
evidence, or are irrelevant for inference.
Recommended replacements—Bayesian updating,
Bayes factors, likelihood ratios—fail to control severity.