I will explore the extent to which concerns about ‘scientism’ – an unwarranted obeisance to scientific over other methods of inquiry – are intertwined with issues in the foundations of the statistical data analyses on which (social, behavioral, medical and physical) science increasingly depends. The rise of big data, machine learning, and high-powered computer programs has extended statistical methods and modeling across the landscape of science, law and evidence-based policy, but this extension has been accompanied by enormous hand-wringing over the reliability, replicability, and valid use of statistics. Legitimate criticisms of scientism often stem from insufficiently self-critical uses of statistical methodology, broadly construed – i.e., from what might be called “statisticism” – particularly when those methods are applied to matters of controversy.
Severe Testing: The Key to Error Correction
D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
Controversy Over the Significance Test Controversy
Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) in PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Gerd Gigerenzer (Director of Max Planck Institute for Human Development, Berlin, Germany) in the PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
Replication Crises and the Statistics Wars: Hidden Controversies
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
D. G. Mayo April 28, 2021 presentation to the CUNY Graduate Center Philosophy Colloquium "Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)"
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
The constructive role of replication crises teaches a lot about: (1) non-fallacious uses of statistical tests, (2) the rationale for the role of probability in tests, and (3) how to reformulate tests.
D. G. Mayo (Virginia Tech) "Error Statistical Control: Forfeit at your Peril" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
Deborah G. Mayo: Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?
Presentation slides for: Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge[*] at the Boston Colloquium for Philosophy of Science (Feb 21, 2014).
D. G. Mayo: Your data-driven claims must still be probed severely
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science" at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics
Today we’ll try to cover a number of things:
1. Learning philosophy/philosophy of statistics
2. Situating the broad issues within philosophy of science
3. Little bit of logic
4. Probability and random variables
Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance
Slides from Rutgers Seminar talk by Deborah G Mayo
December 3, 2014
Rutgers, Department of Statistics and Biostatistics
Abstract: Getting beyond today’s most pressing controversies revolving around statistical methods, I argue, requires scrutinizing their underlying statistical philosophies. Two main philosophies about the roles of probability in statistical inference are probabilism and performance (in the long-run). The first assumes that we need a method of assigning probabilities to hypotheses; the second assumes that the main function of statistical method is to control long-run performance. I offer a third goal: controlling and evaluating the probativeness of methods. An inductive inference, in this conception, takes the form of inferring hypotheses to the extent that they have been well or severely tested. A report of poorly tested claims must also be part of an adequate inference. I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. I then show how the “severe testing” philosophy clarifies and avoids familiar criticisms and abuses of significance tests and cognate methods (e.g., confidence intervals). Severity may be threatened in three main ways: fallacies of statistical tests, unwarranted links between statistical and substantive claims, and violations of model assumptions.
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo discusses various disputes – notably the replication crisis in science – in the context of her just-released book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
D. Mayo: Philosophical Interventions in the Statistics Wars
ABSTRACT: While statistics has a long history of passionate philosophical controversy, the last decade especially cries out for philosophical illumination. Misuses of statistics, Big Data dredging, and P-hacking make it easy to find statistically significant, but spurious, effects. This obstructs a test's ability to control the probability of erroneously inferring effects–i.e., to control error probabilities. Disagreements about statistical reforms reflect philosophical disagreements about the nature of statistical inference–including whether error probability control even matters! I describe my interventions in statistics in relation to three events. (1) In 2016 the American Statistical Association (ASA) met to craft principles for avoiding misinterpreting P-values. (2) In 2017, a "megateam" (including philosophers of science) proposed "redefining statistical significance," replacing the common threshold of P ≤ .05 with P ≤ .005. (3) In 2019, an editorial in the main ASA journal called for abandoning all P-value thresholds, and even the words "significant/significance".
A word on each. (1) Invited to be a "philosophical observer" at their meeting, I found the major issues were conceptual. P-values measure how incompatible data are with what is expected under a hypothesis that there is no genuine effect: the smaller the P-value, the greater the indication of incompatibility. The ASA list of familiar misinterpretations–P-values are not posterior probabilities, statistical significance is not substantive importance, a lack of evidence against a hypothesis need not be evidence for it–should not, I argue, be the basis for replacing tests with methods less able to assess and control erroneous interpretations of data (Mayo 2016, 2019). (2) The "redefine statistical significance" movement appraises P-values from the perspective of a very different quantity: a comparative Bayes Factor. Failing to recognize how contrasting approaches measure different things, disputants often talk past each other (Mayo 2018). (3) To ban P-value thresholds, even to distinguish terrible from warranted evidence, I say, is a mistake (2019). It will not eradicate P-hacking, but it will make it harder to hold P-hackers accountable. A 2020 ASA Task Force on significance testing has just been announced. (I would like to think my blog errorstatistics.com helped.)
To enter the fray between rival statistical approaches, it helps to have a principle applicable to all accounts. There is poor evidence for a claim if little if anything has been done to find it flawed, even if the claim is flawed. This forms a basic requirement for evidence I call the severity requirement. A claim passes with severity only if it is subjected to and passes a test that probably would have found it flawed, if it were. It stems from Popper, though he never adequately cashed it out. A variant is the frequentist principle of evidence developed with Sir David Cox (Mayo and Cox 2006).
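To make the "incompatibility" reading of P-values above concrete, here is a minimal Python sketch; the Normal model, known σ, sample size, and effect size are illustrative assumptions of mine, not material from the talk.

```python
import numpy as np
from scipy import stats

# Toy data: n measurements, H0: mu = 0 ("no genuine effect"), known sigma = 1.
# All numbers here are illustrative assumptions, not from the talk.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=25)

n, sigma, mu0 = len(x), 1.0, 0.0
z = (x.mean() - mu0) / (sigma / np.sqrt(n))   # standardized distance from what H0 expects
p_value = 1 - stats.norm.cdf(z)               # one-sided P-value: P(Z >= z_observed | H0)

print(f"mean = {x.mean():.3f}, z = {z:.2f}, one-sided P = {p_value:.4f}")
# A small P-value reports an improbably large incompatibility with H0;
# it is not the posterior probability that H0 is true.
```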
"The Statistical Replication Crisis: Paradoxes and Scapegoats”
D. G. Mayo LSE Popper talk, May 10, 2016.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., preregistration), while others are quite radical. Recently, the American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Popular appeals to “diagnostic testing” that aim to improve replication rates may (unintentionally) permit the howlers and cookbook statistics we are at pains to root out. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Exploratory Research is More Reliable Than Confirmatory Research
PSA 2016 Symposium:
Philosophy of Statistics in the Age of Big Data and Replication Crises
Presenter: Clark Glymour (Alumni University Professor in Philosophy, Carnegie Mellon University, Pittsburgh, Pennsylvania)
ABSTRACT: Ioannidis (2005) argued that most published research is false, and that “exploratory” research in which many hypotheses are assessed automatically is especially likely to produce false positive relations. Using simulations, Colquhoun (2014) estimates that 30 to 40% of positive results obtained with the conventional .05 cutoff for rejection of a null hypothesis are false. Their explanation is that true relationships in a domain are rare and the selection of hypotheses to test is roughly independent of their truth, so most relationships tested will in fact be false. Conventional use of hypothesis tests, in other words, suffers from a base rate fallacy. I will show that the reverse is true for modern search methods for causal relations because: a. each hypothesis is tested or assessed multiple times; b. the methods are biased against positive results; c. the rarity of true relationships in a system is an advantage for these methods. I will substantiate the claim with both empirical data and with simulations of data from systems with a thousand to a million variables that result in fewer than 5% false positive relationships and in which 90% or more of the true relationships are recovered.
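The base-rate reasoning attributed to Ioannidis and Colquhoun can be reproduced with a few lines of arithmetic. The prevalence, power, and α below are illustrative assumptions of mine, not figures from either paper:

```python
# Illustrative false-discovery arithmetic (assumed numbers, not from Ioannidis or Colquhoun):
prior_true = 0.10   # fraction of tested relationships that are actually real
alpha      = 0.05   # conventional significance cutoff
power      = 0.80   # probability a real relationship is detected

true_pos   = prior_true * power          # 0.08 of all tests
false_pos  = (1 - prior_true) * alpha    # 0.045 of all tests
share_false = false_pos / (true_pos + false_pos)
print(f"share of 'positive' results that are false: {share_false:.0%}")  # about 36%
```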
Stephen Senn slides: "‘Repligate’: reproducibility in statistical studies. What does it mean and in what sense does it matter?" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," at the 2015 APS Annual Convention in NYC
A. Gelman "50 shades of gray: A research story," presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
Statistical skepticism: How to use significance tests effectively
Prof. D. Mayo, presentation Oct. 12, 2017 at the ASA Symposium on Statistical Inference: “A World Beyond p < .05” in the session: “What are the best uses for P-values?”
D. Mayo: Putting the brakes on the breakthrough: An informal look at the argu...
“Putting the Brakes on the Breakthrough, or ‘How I used simple logic to uncover a flaw in a controversial 50-year-old ‘theorem’ in statistical foundations taken as a ‘breakthrough’ in favor of Bayesian vs frequentist error statistics’”
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
D. Mayo's comments on Nancy Reid's "BFF Four-Are we Converging?" given May 2, 2017 at The Fourth Bayesian, Fiducial and Frequentists Workshop held at Harvard University.
Ethically Litigating Forensic Science Cases: Daubert, DNA and Beyond (Adam Tebrugge)
What are the shared responsibilities of the analyst, prosecutor, defense attorney and judge when dealing with forensic science cases? This lecture also covers DNA evidence and focuses on discovery and litigation issues.
Science v Pseudoscience: What’s the Difference? - Kevin Korb
Science has a certain common core, especially a reliance on empirical methods of assessing hypotheses. Pseudosciences have little in common but their negation: they are not science.
They reject meaningful empirical assessment in some way or another. Popper proposed a clear demarcation criterion for Science v Rubbish: Falsifiability. However, his criterion has not stood the test of time. There are no definitive arguments against any pseudoscience, any more than against extreme skepticism in general, but there are clear indicators of phoniness.
Post: http://www.scifuture.org/science-vs-pseudoscience
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa” presented (zoom) at the conference: IS PHILOSOPHY USEFUL FOR SCIENCE, AND/OR VICE VERSA?, January 30 - February 2, 2024, at Chapman University, Schmid College of Science and Technology.
Presentation to CRC Mental Health Early Career Researcher Workshop, Melbourne 29.11.17 for @andsdata.
Workshop title: A by-product of scientific training: We're all a little bit biased.
1. TEN MYTHS OF SCIENCE: REEXAMINING WHAT WE THINK WE KNOW...
W. McComas 1996
This article addresses and attempts to refute several of the most widespread and enduring misconceptions held by students regarding the enterprise of science. The ten myths discussed include the common notions that theories become laws, that hypotheses are best characterized as educated guesses, and that there is a commonly-applied scientific method. In addition, the article includes discussion of other incorrect ideas such as the view that evidence leads to sure knowledge, that science and its methods provide absolute proof, and that science is not a creative endeavor. Finally, the myths that scientists are objective, that experiments are the sole route to scientific knowledge and that scientific conclusions are continually reviewed conclude this presentation. The paper ends with a plea that instruction in and opportunities to experience the nature of science are vital in preservice and inservice teacher education programs to help unseat the myths of science.
Myths are typically defined as traditional views, fables, legends or stories. As such, myths can be entertaining and even educational since they help people make sense of the world. In fact, the explanatory role of myths most likely accounts for their development, spread and persistence. However, when fact and fiction blur, myths lose their entertainment value and serve only to block full understanding. Such is the case with the myths of science.
Scholar Joseph Campbell (1968) has proposed that the similarity among many folk myths worldwide is due to a subconscious link between all peoples, but no such link can explain the myths of science. Misconceptions about science are most likely due to the lack of philosophy of science content in teacher education programs, the failure of such programs to provide and require authentic science experiences for preservice teachers and the generally shallow treatment of the nature of science in the precollege textbooks to which teachers might turn for guidance.
As Stephen Jay Gould points out in The Case of the Creeping Fox Terrier Clone (1988), science textbook writers are among the most egregious purveyors of myth and inaccuracy. The fox terrier mentioned in the title refers to the classic comparison used to express the size of the dawn horse, the tiny precursor to the modern horse. This comparison is unfortunate for two reasons. Not only was this horse ancestor much bigger than a fox terrier, but the fox terrier breed of dog is virtually unknown to American students. The major criticism leveled by Gould is that once this comparison took hold, no one bothered to check its validity or utility. Through time, one author after another simply repeated the inept comparison and continued a tradition that has made many science texts virtual clones of each other on this and countless other points.
In an attempt to provide a more realistic view of science and point out issues o.
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods at the Dept of Psychology, Princeton University on November 14, 2023
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
D. Mayo (Dept of Philosophy, VT)
Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Nancy Reid, Dept. of Statistics, University of Toronto. Inaugural recipient of the "David R. Cox Foundations of Statistics Award".
Slides from Invited presentation at 2023 JSM: “The Importance of Foundations in Statistical Science“
Ronald Wasserstein, Chair (American Statistical Association)
ABSTRACT: David Cox wrote “A healthy interplay between theory and application is crucial for statistics… This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods.” These foundations distinguish statistical science from the many fields of research in which statistical thinking is a key intellectual component. In this talk I will emphasize the ongoing importance and relevance of theoretical advances and theoretical thinking through some illustrative examples.
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role that scientists look to statistics to play. In order to combat the dangers of unthinking bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed "reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Causal inference is not statistical inference
Jon Williamson (University of Kent)
ABSTRACT: Many methods for testing causal claims are couched as statistical methods: e.g., randomised controlled trials, various kinds of observational study, meta-analysis, and model-based approaches such as structural equation modelling and graphical causal modelling. I argue that this is a mistake: causal inference is not a purely statistical problem. When we look at causal inference from a general point of view, we see that methods for causal inference fit into the framework of Evidential Pluralism: causal inference is properly understood as requiring mechanistic inference in addition to statistical inference.
Evidential Pluralism also offers a new perspective on the replication crisis. That observed associations are not replicated by subsequent studies is a part of normal science. A problem only arises when those associations are taken to establish causal claims: a science whose established causal claims are constantly overturned is indeed in crisis. However, if we understand causal inference as involving mechanistic inference alongside statistical inference, as Evidential Pluralism suggests, we avoid fallacious inferences from association to causation. Thus, Evidential Pluralism offers the means to prevent the drama of science from turning into a crisis.
Stephan Guttinger (Lecturer in Philosophy of Data/Data Ethics, University of Exeter, UK)
ABSTRACT: The idea of “questionable research practices” (QRPs) is central to the narrative of a replication crisis in the experimental sciences. According to this narrative the low replicability of scientific findings is not simply due to fraud or incompetence, but in large part to the widespread use of QRPs, such as “p-hacking” or the lack of adequate experimental controls. The claim is that such flawed practices generate flawed output. The reduction – or even elimination – of QRPs is therefore one of the main strategies proposed by policymakers and scientists to tackle the replication crisis.
What counts as a QRP, however, is not clear. As I will discuss in the first part of this paper, there is no consensus on how to define the term, and ascriptions of the qualifier “questionable” often vary across disciplines, time, and even within single laboratories. This lack of clarity matters as it creates the risk of introducing methodological constraints that might create more harm than good. Practices labelled as ‘QRPs’ can be both beneficial and problematic for research practice and targeting them without a sound understanding of their dynamic and context-dependent nature risks creating unnecessary casualties in the fight for a more reliable scientific practice.
To start developing a more situated and dynamic picture of QRPs I will then turn my attention to a specific example of a dynamic QRP in the experimental life sciences, namely, the so-called “Far Western Blot” (FWB). The FWB is an experimental system that can be used to study protein-protein interactions but which for most of its existence has not seen a wide uptake in the community because it was seen as a QRP. This was mainly due to its (alleged) propensity to generate high levels of false positives and negatives. Interestingly, however, it seems that over the last few years the FWB slowly moved into the space of acceptable research practices. Analysing this shift and the reasons underlying it, I will argue a) that suppressing this practice deprived the research community of a powerful experimental tool and b) that the original judgment of the FWB was based on a simplistic and non-empirical assessment of its error-generating potential. Ultimately, it seems like the key QRP at work in the FWB case was the way in which the label “questionable” was assigned in the first place. I will argue that findings from this case can be extended to other QRPs in the experimental life sciences and that they point to a larger issue with how researchers judge the error-potential of new research practices.
David Hand (Professor Emeritus and Senior Research Investigator, Department of Mathematics, Faculty of Natural Sciences, Imperial College London)
ABSTRACT: Science progresses through an iterative process of formulating theories and comparing them with empirical real-world data. Different camps of scientists will favour different theories, until accumulating evidence renders one or more untenable. Not unnaturally, people become attached to theories. Perhaps they invented a theory, and kudos arises from being the originator of a generally accepted theory. A theory might represent a life's work, so that being found wanting might be interpreted as failure. Perhaps researchers were trained in a particular school, and acknowledging its shortcomings is difficult. Because of this, tensions can arise between proponents of different theories.
The discipline of statistics is susceptible to precisely the same tensions. Here, however, the tensions are not between different theories of "what is", but between different strategies for shedding light on the real world from limited empirical data. This can be in the form of how one measures discrepancy between the theory's predictions and observations. It can be in the form of different ways of looking at empirical results. It can be, at a higher level, because of differences between what is regarded as important in a particular context. Or it can be for other reasons.
Perhaps the most familiar example of this tension within statistics is between different approaches to inference. However, there are many other examples of such tensions. This paper illustrates with several examples. We argue that the tension generally arises as a consequence of inadequate care being taken in question formulation. That is, insufficient thought is given to deciding exactly what one wants to know - to determining "What is the question?".
The ideas and disagreements are illustrated with several examples.
The neglected importance of complexity in statistics and Metascience
Daniele Fanelli
London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science.
ABSTRACT: Statistics is at war, and Metascience is ailing. This is partially due, the talk will argue, to a paradigmatic blind-spot: the assumption that one can draw general conclusions about empirical findings without considering the role played by context, conditions, assumptions, and the complexity of methods and theories. Whilst ideally these particularities should be unimportant in science, in practice they cannot be neglected in most research fields, let alone in research-on-research.
This neglected importance of complexity is supported by theoretical arguments and empirical findings (or the lack thereof) in the recent meta-analytical and metascientific literature. The talk will overview this background and suggest how the complexity of theories and methodologies may be explicitly factored into particular methodologies of statistics and Metaresearch. The talk will then give examples of how this approach may usefully complement existing paradigms, by translating results, methods and theories into quantities of information that are evaluated using an information-compression logic.
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Uri Simonsohn (Professor, Department of Operations, Innovation and Data Sciences at Esade)
ABSTRACT: The statistical tools listed in the title share the feature that a mathematically elegant solution has become the consensus advice of statisticians, methodologists and some mathematically sophisticated researchers writing tutorials and textbooks, and yet they lead research workers to meaningless answers that are often also statistically invalid. Part of the problem is that advice givers take literally the mathematical abstractions of the tools they advocate for, instead of taking seriously the actual behavior of researchers.
On Severity, the Weight of Evidence, and the Relationship Between the Two
Margherita Harris
Visiting fellow in the Department of Philosophy, Logic and Scientific Method at the London School of Economics and Political Science.
ABSTRACT: According to the severe tester, one is justified in declaring to have evidence in support of a hypothesis just in case the hypothesis in question has passed a severe test, one that it would be very unlikely to pass so well if the hypothesis were false. Deborah Mayo (2018) calls this the strong severity principle. The Bayesian, however, can declare to have evidence for a hypothesis despite not having done anything to test it severely. The core reason for this has to do with the (infamous) likelihood principle, whose violation is not an option for anyone who subscribes to the Bayesian paradigm. Although the Bayesian is largely unmoved by the incompatibility between the strong severity principle and the likelihood principle, I will argue that the Bayesian’s never-ending quest to account for yet another notion, one that is often attributed to Keynes (1921) and that is usually referred to as the weight of evidence, betrays the Bayesian’s confidence in the likelihood principle after all. Indeed, I will argue that the weight of evidence and severity may be thought of as two (very different) sides of the same coin: they are two unrelated notions, but what brings them together is the fact that they both make trouble for the likelihood principle, a principle at the core of Bayesian inference. I will relate this conclusion to current debates on how to best conceptualise uncertainty by the IPCC in particular. I will argue that failure to fully grasp the limitations of an epistemology that envisions the role of probability to be that of quantifying the degree of belief to assign to a hypothesis given the available evidence can be (and has been) detrimental to an adequate communication of uncertainty.
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Aris Spanos (Wilson Schmidt Professor of Economics, Virginia Tech)
ABSTRACT: The discussion places the two cultures, the model-driven statistical modeling and the algorithm-driven modeling associated with Machine Learning (ML) and Statistical Learning Theory (SLT), in a broader context of paradigm shifts in 20th-century statistics, which includes Fisher’s model-based induction of the 1920s and variations/extensions thereof, Data Science (ML, SLT, etc.) and Graphical Causal modeling in the 1990s. The primary objective is to compare and contrast the effectiveness of different approaches to statistics in learning from data about phenomena of interest and relate that to the current discussions pertaining to the statistics wars and their potential casualties.
Comparing Frequentist and Bayesian Control of Multiple Testing
James Berger
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20 × 100,000 = 2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 × 10⁻⁷. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 × 10⁻⁷. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. ... There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
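Berger's remark that continual peeking guarantees an eventual p < .05 under a true null is easy to check by simulation. The sketch below (my own illustration, with arbitrary settings for the maximum sample size and number of replications) recomputes a z-test after each new observation and records whether any interim look crosses the threshold:

```python
import numpy as np
from scipy import stats

# Optional stopping under a true null: peek after every new observation and
# count experiments that ever reach p < .05. Settings are illustrative only.
rng = np.random.default_rng(1)
n_max, n_sims, alpha = 500, 2000, 0.05
ever_significant = 0

for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_max)             # data generated with H0 true (mu = 0)
    ns = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(ns)               # running z-statistic after each observation
    p = 2 * (1 - stats.norm.cdf(np.abs(z)))      # two-sided p-value at every interim look
    if np.any(p[9:] < alpha):                    # start peeking from the 10th observation
        ever_significant += 1

print(f"fraction of null experiments reaching p < .05 at some point: "
      f"{ever_significant / n_sims:.2f}")
# Well above the nominal 0.05, and it keeps climbing as the maximum sample size grows.
```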
Clark Glymour
ABSTRACT: "Data dredging"--searching non-experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. ... These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizzaro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
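As a toy version of the regression-style data dredging criticized here (my own simulation, not an example from the talk), one can screen many irrelevant candidate predictors one at a time and report whichever happen to cross the .05 line:

```python
import numpy as np
from scipy import stats

# Dredging simulation: the outcome is unrelated to every candidate predictor,
# yet screening 50 of them at alpha = .05 typically yields a few 'findings'.
rng = np.random.default_rng(2)
n_obs, n_candidates = 100, 50
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_candidates))

spurious = [j for j in range(n_candidates)
            if stats.linregress(X[:, j], y).pvalue < 0.05]
print(f"'significant' slopes among {n_candidates} irrelevant predictors: {len(spurious)}")
# Expected count is alpha * 50 = 2.5; reporting only these, as if each had been
# tested in isolation, is exactly the practice that fails the severity requirement.
```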
The Duality of Parameters and the Duality of Probability
Suzanne Thornton
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)--the behavior of a procedure under hypothetical repetition--bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited by the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to yield an interval covering the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. ... In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter, rather it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
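The repeated-sampling claim in the N(θ, 1) example is easy to verify numerically. This sketch fixes an arbitrary θ (an illustrative choice of mine) and counts how often the standard 95% interval covers it:

```python
import numpy as np
from scipy import stats

# Coverage of the textbook 95% interval for the mean of N(theta, 1), known sigma = 1.
# theta, n, and the number of repetitions are illustrative choices.
rng = np.random.default_rng(3)
theta, n, reps = 2.0, 30, 10_000
z975 = stats.norm.ppf(0.975)          # about 1.96
half_width = z975 / np.sqrt(n)

covered = 0
for _ in range(reps):
    m = rng.normal(theta, 1.0, n).mean()
    if m - half_width <= theta <= m + half_width:
        covered += 1

print(f"empirical coverage over {reps} samples: {covered / reps:.3f}")  # near 0.95
# The 95% describes the procedure's performance across repetitions; any single
# realized interval either covers the fixed theta or it does not.
```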
Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast, (e.g., DNA matching) the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging
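The post-data severity assessment described in this abstract can be computed directly for the Normal-mean example. The observed mean, σ, and n below are hypothetical numbers used only to display the calculation:

```python
import numpy as np
from scipy import stats

def severity_mu_greater(mu_prime, m_obs, sigma, n):
    """Severity for the claim mu > mu_prime given observed mean m_obs, assuming
    M ~ Normal(mu, sigma^2/n) with known sigma: SEV = P(M < m_obs; mu = mu_prime),
    i.e., how probably a smaller mean would have occurred were the claim false."""
    return stats.norm.cdf((m_obs - mu_prime) / (sigma / np.sqrt(n)))

# Hypothetical numbers for illustration: observed mean 0.4, sigma = 1, n = 100.
m_obs, sigma, n = 0.4, 1.0, 100
for mu_prime in (0.0, 0.2, 0.3, 0.4):
    sev = severity_mu_greater(mu_prime, m_obs, sigma, n)
    print(f"SEV(mu > {mu_prime:.1f}) = {sev:.3f}")
# The claim mu > 0.0 passes with severity ~1.000, while mu > 0.4 reaches only 0.5:
# the same data warrant the smaller discrepancy well and the larger one poorly.
```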
The Statistics Wars and Their Casualties (w/refs)jemille6
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
On the interpretation of the mathematical characteristics of statistical test...jemille6
Statistical hypothesis tests are often misused and misinterpreted. Here I focus on one
source of such misinterpretation, namely an inappropriate notion regarding what the
mathematical theory of tests implies, and does not imply, when it comes to the
application of tests in practice. The view taken here is that it is helpful and instructive to be consciously aware of the essential difference between mathematical model and
reality, and to appreciate the mathematical model and its implications as a tool for
thinking rather than something that has a truth value regarding reality. Insights are presented regarding the role of model assumptions, unbiasedness and the alternative hypothesis, Neyman-Pearson optimality, multiple and data dependent testing.
The role of background assumptions in severity appraisaljemille6
In the past decade discussions around the reproducibility of scientific findings have led to a re-appreciation of the importance of guaranteeing claims are severely tested. The inflation of Type 1 error rates due to flexibility in the data analysis is widely considered
one of the underlying causes of low replicability rates. Solutions, such as study preregistration, are becoming increasingly popular to combat this problem. Preregistration allows researchers to evaluate the severity of a test, but not all
preregistered studies provide a severe test of a claim. The appraisal of the severity of a
test depends on background information, such as assumptions about the data generating process, and auxiliary hypotheses that influence the final choice for the
design of the test. In this article, I will discuss the difference between subjective and
inter-subjectively testable assumptions underlying scientific claims, and the importance
of separating the two. I will stress the role of justifications in statistical inferences, the
conditional nature of scientific conclusions following these justifications, and highlight
how severe tests could lead to inter-subjective agreement, based on a philosophical approach grounded in methodological falsificationism. Appreciating the role of background assumptions in the appraisal of severity should shed light on current discussions about the role of preregistration, interpreting the results of replication studies, and proposals to reform statistical inferences.
The two statistical cornerstones of replicability: addressing selective infer...jemille6
Tukey’s last published work in 2020 was an obscure entry on multiple comparisons in the
Encyclopedia of Behavioral Sciences, addressing the two topics in the title. Replicability
was not mentioned at all, nor was any other connection made between the two topics. I shall demonstrate how these two topics critically affect replicability using recently completed studies. I shall review how these have been addressed in the past. I shall
review in more detail the available ways to address selective inference. My conclusion is that conducting many small replicability studies without strict standardization is the way to assure replicability of results in science, and we should introduce policies to make this happen.
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statistics, and the philosophers
Deborah Mayo
• In thinking about scientism for this conference—a topic on
which I’ve never written—a puzzle arises: How can we
worry about science being held in too high a regard when
we are daily confronted with articles shouting that “most
scientific findings are false?”
• Too deferential to scientific methodology? In the fields I’m most closely involved with, scarcely a day goes by when we’re not reading articles on “bad science”, “trouble in the lab”, and “science fails to self-correct.”
• Not puzzling: I suggest that legitimate criticisms of scientism often stem from abuses of statistical methodology—i.e., from what might be called “statisticism”—“lies, damned lies, and statistics.”
• The rise of big data and high-powered computer programs extends statistical methods across the sciences, law and evidence-based policy—and beyond (culturomics, philosophometrics)—but often with methodological-philosophical loopholes.
• It’s the false veneer of science, statistics as window
dressing, that bothers us.
Are philosophies about science relevant here?
• I say yes: “Getting philosophical” here would be to provide tools to avoid obfuscating, philosophically tinged notions about inference and testing, while offering a critical illumination of flaws and foibles linking technical statistical concepts to substantive claims.
That is the goal of the different examples I will consider.
• Provocative articles give useful exposés of classic fallacies:
o p-values are not posterior probabilities,
o statistical significance is not substantive significance,
o association is not causation.
They often lack a depth of understanding of underlying
philosophical, statistical, and historical issues.
Demarcation: Bad Methodology/Bad Statistics
• Investigators of Diederik Stapel, the social psychologist
who fabricated his data, walked into a culture of
“verification bias” (2012 Tilburg Report, “Flawed
Science”).
• They were shocked when people they interviewed
“defended the serious and less serious violations of proper
scientific method saying: that is what I have learned in
practice; everyone in my research environment does the
same, and so does everyone we talk to…” (48).
• Philosophers tend to have cold feet when it comes to saying
anything general about science versus pseudoscience.
• Debunkers need to have a position on bad, very bad, not so
bad methodology.
• The Tilburg Report does a pretty good job:
“One of the most fundamental rules of scientific research is
that an investigation must be designed in such a way that
facts that might refute the research hypotheses are given at
least an equal chance of emerging as do facts that confirm
the research hypotheses. Violations of this rule, continuing
an experiment until it works as desired, or excluding
unwelcome experimental subjects or results, inevitably
tends to confirm the researcher’s research hypotheses, and
essentially render the hypotheses immune to the facts”.
Items in their list of “dirty laundry” include:
“An experiment fails to yield the expected statistically
significant results. The experimenters try and try again
until they find something (multiple testing, multiple
modeling, post-data search of endpoint or subgroups), and the only experiment subsequently reported is the one that did yield the expected results.” (Report, 48)
In fields like medicine, these gambits are deemed bad statistics
if not criminal behavior.
(A recent case, the Scott Harkonen case, went all the way to the Supreme Court: post-data searching for statistically significant endpoints does not qualify as free speech.)
Popper had the right idea:
“Observations or experiments can be accepted as
supporting a theory (or a hypothesis, or a scientific
assertion) only if these observations or experiments are
severe tests of the theory” (Popper 1994, p. 89).
Unfortunately Popper never arrived at an adequate notion of a
severe test.
(In a letter, Popper said he regretted not having sufficiently
learned statistics.)
Philosophers have their own “statisticisms”—logicism, mathematicism: the search for logics of evidential relationship.
Assumes: for any data x and hypothesis H, there is a (context-free) evidential relationship (x assumed given).
Hacking (1965): the “Law of Likelihood”: x supports hypothesis H1 more than H2 if P(x;H1) > P(x;H2).
Such a maximally likely alternative H2 can always be constructed: H1 may always be found less well supported, even if H1 is true—no error control.
Hacking rejected the likelihood approach (1977) on such grounds.
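To illustrate the “no error control” worry numerically, here is a minimal sketch; the Bernoulli data, sample size, and use of scipy are my own illustrative assumptions, not Hacking’s example. An alternative rigged to match the observed data is always at least as well “supported” on the likelihood criterion, even when H1 is true.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 20
    x = rng.binomial(1, 0.5, size=n)   # suppose H1 (p = 0.5) is in fact true
    k = int(x.sum())

    lik_H1 = stats.binom.pmf(k, n, 0.5)      # P(x; H1)
    p_hat = k / n                             # H2 constructed after seeing the data
    lik_H2 = stats.binom.pmf(k, n, p_hat)     # P(x; H2) is maximal by construction

    print(lik_H2 >= lik_H1)   # always True: the rigged H2 is never less "supported"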
Lakatos was correct that there’s a tension between logics of
evidence and the intuition against ad hoc hypotheses; he
described it as an appeal to history, to how the hypothesis was
formulated, selected for testing, modified, etc.
Now we’d call them “selection effects” and “cherry picking”.
The problems with selective reporting, or stopping when the data look good, are not problems about long runs….
It’s that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpretation.
That makes it a questionable inference.
Role for philosophers? One of the final recommendations in
the Report is this:
In the training program for PhD students, the relevant
basic principles of philosophy of science, methodology,
ethics and statistics that enable the responsible practice
of science must be covered.
A philosophy department could well create an entire core
specialization that revolved around these themes.
Statistics Wars: Was the Discovery of the Higgs Particle
“Bad Science”?
One of the biggest science events of 2012-13 was undoubtedly
the announcement on July 4, 2012 of evidence for the discovery
of a Higgs-like particle based on a “5 sigma observed effect”.
Because the 5 sigma report refers to frequentist statistical tests,
the discovery is imbued with some controversial themes from
philosophy of statistics
Subjective Bayesian Dennis Lindley (of the Jeffreys-Lindley
paradox) sent around a letter to the ISBA (through O’Hagan):
1. Why such an extreme evidence requirement? We
know from a Bayesian perspective that this only makes
sense if (a) the existence of the Higgs boson has
extremely small prior probability and/or (b) the
consequences of erroneously announcing its discovery
are dire in the extreme. …
2. Are the particle physics community completely
wedded to frequentist analysis? If so, has anyone tried
to explain what bad science that is?
Not bad science at all.
Practitioners of HEP are very sophisticated with their
statistical methodology and modeling: they’d seen too many
bumps disappear.
They want to ensure that, before announcing the hypothesis H*: “a SM Higgs boson has been discovered”, H* has been given a severe run for its money.
Within a general model for the detector, H0: μ = 0 is the background-only hypothesis, where μ is the “global signal strength” parameter; μ = 1 measures the SM Higgs boson signal in addition to the background (SM: Standard Model).
They want to ensure that, with extremely high probability, H0 would have survived a cluster of tests T, fortified with much cross-checking, were μ = 0.
Note what’s being given a high probability:
Pr(test T would produce less than 5 sigma; H0) > .9999997.
With probability .9999997, the bumps would disappear (in either ATLAS or CMS) under the assumption data are due to background H0: this is an error probability.
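As a rough numerical check on the figure above (a minimal sketch using the standard Normal tail, with no look-elsewhere adjustment, which is my own simplifying assumption):

    from scipy import stats

    p_5sigma = stats.norm.sf(5)   # one-sided tail area beyond 5 sigma under H0
    print(p_5sigma)               # about 2.9e-07
    print(1 - p_5sigma)           # about 0.9999997 = Pr(test yields < 5 sigma; H0)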
P-value police
Science writers rushed in to examine if the .99999 was fallaciously being assigned to H* itself—a posterior probability in H*.
P-value police graded sentences from each news article.
Physicists did not assign a high probability to H*: A Standard Model (SM) Higgs exists (…whatever it might mean).
Most believed in a Higgs particle before the collider, but most also believe in beyond the standard model physics (BSM).
Once H* passes with severity, they quantify various properties
of the particle discovered (inferring ranges of magnitudes).
Statistics Wars: Bayesian vs Frequentist
The traditional frequentist-Bayesian wars are still alive.
In an oversimple nutshell:
• A Bayesian account uses probability for updating beliefs in
claims using Bayes’ theorem.
• Frequentist accounts use probability to control long-run error
rates of procedures (e.g., 95% coverage probability)
(Note: anyone who uses conditional probability employs Bayes’ theorem, be it Bayes’ nets or ordinary probability—that doesn’t make it Bayesian.)
Probabilism vs Performance
I advocate a third “p”: probativeness
Current state of play? (save for discussion)
• Bayesian methods useful but the traditional subjective
Bayesian philosophy (largely) rejected.
• Since the 1990s: “Insisting we should be doing a subjective
analysis falls on deaf ears; they come to statistics to avoid
subjectivity.” (Berger); elicitation given up on.
• Reconciliations and unifications: non-subjective (default or
conventional) Bayesianism: the prior is automatically chosen
so as to maximize the contribution of the data (rather than the
prior). Many different rival systems.
• Priors aren’t considered degrees of belief, and may not even be probabilities (improper).
• Reject Dutch Book, Likelihood Principle; rarely is the final
form a posterior probability, or even a Bayes ratio.
• Gelman and Shalizi (2013)–a Bayesian at Columbia and a CMU error statistician: “There have been technical advances, now we need an advance in philosophy…”
“Implicit in the best Bayesian practice is a stance
that has much in common with [my] error-statistical
approach…Indeed crucial parts of Bayesian data
analysis, such as model checking, can be understood
as ‘error probes’ in Mayo’s sense” (p. 10).
Big Data: Statistics vs. Data Science (Informatics, Machine
learning, data analytics, CS): “data revolution”
2013 was the “International Year of Celebrating Statistics.”
The label was to help prevent Statistical Science being eclipsed
by the fashionable “Big Data” crowd.
Larry Wasserman: Talk of “Data Science” and “Big Data” fills
me with:
Optimism––it means statistics is finally a sexy field.
Dread––statistics is being left on the sidelines.
Data Science: The End of Statistics?
Vapnik, of the Vapnik/Chervonenkis (VC) theory, is known for
his seminal work in machine learning.
They distinguish classical and modern work in philosophy as
well as statistics.
In philosophy:
The classical conception is objective, rational, a naïve realism.
The modern “data driven” empirical view, illustrated by
machine learning, is enlightened.
In statistics:
Classical view seeks statistical regularities modeled with
parametric distributions, seeks to estimate and test parameters in
a model intended to describe a real data generating process.
Modern “data driven” view: aims for good predictions with
wholly uninterpretable “black boxes”; views models as mental
constructs and exhorts scientists to restrict themselves to
problems deemed “well posed” by machine-learning criteria.
Black Box science
How would the Higgs Boson fit? (It wouldn’t.)
“So the Instrumentalist view follows directly from a sound
scientific theory, and not from the philosophical argument.
So realism is not possible, and instrumentalism is an
appropriate (technically sound) philosophical position”.
Down with models: They claim to avoid assumptions about
parametric distributions—but iid is a big assumption.
“Machine-learning inductions, based on training samples
work only so long as stationarity is sufficient to ensure that
the new data are adequately similar to the training data”.
You don’t have to be a naïve realist to think that science is more than the binary classification problem (predicting whether you will buy X’s book, teaching a machine to disambiguate a handwritten 5 from an 8 in postal addresses, improving Google searches, …).
All very impressive, but limited to that realm.
The success of other outgrowths, such as “culturomics” (statistics on the frequency of word use), is unclear.
If making something more scientific means treating it as data-mining “associations”, then it may be less scientific (a less good methodology for given aims).
Not everyone who works in these areas agrees with this philosophy, but these are the founders.
Broadly analogous moves occur in philosophy: all science and
inquiry should be restricted to problems deemed “well posed”
by their favorite science,
(neuroscience, physics, evolutionary psychology….)
• The problem, of course, is that they are question begging.
• Uncritical about the methodological rigor underlying research purporting to show it’s a good way to solve problems outside their particular subset of inquiry.
“Aren’t We Data Science?” Marie Davidian, president of the
ASA, asks.
She argues that data scientists have “little appreciation for the
power of design of experiments”.
Reports are now trickling in about the consequences of ignoring principles of DOE.
Microarray Big Data Analytics: Screening for genetic associations
Stanley Young (Nat. Inst. of Stat.): There is a relatively unknown problem with microarray experiments, in addition to the multiple testing problems. Until relatively recently, the microarray samples were not sent through assay equipment in random order. Essentially all the microarray data pre-2010 is unreliable.
“Stop Ignoring Experimental Design (or my head will explode)” (Lambert, of a bioinformatics software Co.)
Statisticians “tell me how they are never asked to help with design before the experiment begins, only asked to clean up the mess after millions have been spent.”
• Fisher: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination…[to] say what the experiment died of.”
• Different research programs now appeal to gene and other
theories to get more reliable results than black box
bioinformatics.
• Maybe black boxes aren’t enough after all….
• Let’s go back to the International Year of Celebrating
Statistics
The Analytics Rock Star: Nate Silver
The Presidential Address at the ASA (usually by a famous
statistician) was given by pollster Nate Silver.
He’s not in statistics, but he did combine numerous polling
results to predict the Obama win in 2012.
Nate Silver “hit a home run with the crowd” in his reply to the question “What do you think of data science vs. statistics?” (Questions were twittered.)
Nate’s reply: “data scientist” was just a “sexed up” term for statistician.
Audience members cried out with joy.
In the talk itself, Silver listed his advice to data journalists:
The reason he favors the Bayesian philosophy is that people
should be explicit about disclosing their biases and
preconceptions.
• If people are so inclined to see the world through their
tunnel vision, why suppose they are able/willing to be
explicit about their biases?
• If priors are to represent biases, shouldn’t they be kept
separate from the data rather than combined with them?
At odds with the idea of data driven journalism.
Data-driven journalism
Silver’s 538 blog is one of the new attempts at “Big Data” journalism: “to use statistical analysis — hard numbers — to tell compelling stories.”
• They don’t announce priors (so far as I can tell).
• My antennae go up for other reasons: reports on observable
statistical associations, running this or that regression may
allow shaky claims under the guise of hard-nosed, “just the
facts” journalism.
(One of the biggest sources of “sciency” approaches.)
• Maybe announcing the biases would be better.
• I’d want an entirely distinct account of warranted inference
from data.
Plausibility differs from Well-Testedness
When we hear there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences), some argue: you see, if our beliefs were mixed into the interpretation of the evidence, we wouldn’t be fooled.
We know these things are unbelievable.
That could work in some cases (though it still wouldn’t show
what they’d done wrong).
It wouldn’t help with our most important problem:
How to distinguish tests of one and the same hypothesis
with different methods used (e.g., one with searching, post
data subgroups, etc., another without)?
Moreover, committees investigating questionable research
practices (QRPs) find:
“People are not deliberately cheating: they honestly
believe in their theories and believe the data is
supporting them and are just doing the best to make this
as clear as possible to everyone”. Richard Gill (forensic
statistician).
We are back to the Tilburg report (and now Jens Forster).
Diederik Stapel says he always read the research literature
extensively to generate his hypotheses.
“So that it was believable and could be argued that this
was the only logical thing you would find.” (E.g., eating
meat causes aggression.)
(In “The Mind of a Con Man,” NY Times, April 26,
2013)
(He really doesn’t think he did anything that bad.)
Demarcating Methodologies for Finding Things Out
§ Rather than report on believability, researchers need to report the properties of the methods they used: What was their capacity to have identified, avoided, or admitted bias? Probability enters to quantify well-testedness, and discrepancies well or poorly detected.
§ A methodology (for finding things out) is questionable if it
cannot or will not distinguish the correctness or plausibility
of inferences from problems stemming from a poorly run
study.
An inference to H* is questionable if it stems from a method with little ability to have found flaws if they existed.
Area of pseudoinquiry: a research area that regularly fails to be able to vouchsafe the capability of discerning/reporting mistakes at the levels of data, statistical model, substantive inference.
Need to be able to say: H is plausible, but this is a bad test.
Here’s a believable hypothesis: men react more negatively to the success of their partners than to their failures.
Studies have shown:
H: partner’s success lowers self-esteem in men
It’s believable, but the statistical experiments are a sham:
[Subjects are randomly assigned either to think about a time their partner succeeded, or a time they failed. They purport to find a statistically significant difference in self-esteem, as measured on an Official Psychological Self-Esteem measure (based on positive word associations with “me” versus “other”).]
Randomly assigning “treatments” does not protect against data-
mining, flexibilities in interpreting results (problems with the
statistics, the self-esteem measure).
The New Science of Replication:
• They do not question the methodology of the original study.
• It’s another statistical analysis to mimic everything and see
if it is found in an appropriately powered test.
The problem with failing to replicate one of these social scientific studies is that we cannot say we’ve refuted the original study, because there is too much latitude for finding and not finding the effect (aside from the formal capacities).
(I’m on one such committee; they need more philosophers of
methodology.)
Distinguish this from fraud busting: statistical fraud busting is essential (a few days ago, the Jens Forster case, using R.A. Fisher’s “too good to be true” F-test).
Need a “philosophical-methodological” assessment
(I’m calling it this because philosophers do not always question the methodology; e.g., “experimental philosophers” use results from this type of study to inform philosophical questions.)
I began with a puzzle: How can we worry about science being held in too high a regard when we are daily confronted with articles shouting that “most scientific findings are false”, or that “there is a crisis of replication”?
There is a connection: methodological and philosophical problems with the use and interpretation of statistical methods.
Statistics as holy water; hiding selection effects; misinterpreting methods (based on assumed philosophies of statistics); ignoring DOEs (“we have so much data we don’t need them”), ….
One more (underlying the claim that “most scientific findings are false”): based on using measures of exploratory screening to assess “science-wise error rates.” (I’ll save this for discussion.)
“Science-wise error rates” (FDRs):
A: finding a statistically significant result at the .05 level
If we:
• imagine two point hypotheses H0 and H1 – H1 identified with some “meaningful” effect, all else ignored,
• assume P(H1) is very small (.1),
• permit a dichotomous “thumbs up-down” pronouncement, from a single (just) .05 significant result (ignoring magnitudes),
• allow the ratio of type 1 error probability to the power
against H1 to supply a “likelihood ratio”.
The unsurprising result is that most “positive results” are false.
Not based on data, but an analytic exercise (Ioannidis 2005):
Their computations might at best hold for crude screening
exercises (e.g., for associations between genes and disease).
It risks entrenching just about every fallacy in the books.
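A minimal sketch of the arithmetic behind such screening computations (the power value is my own illustrative choice; only the .05 level and the .1 prevalence come from the slides): with a low prevalence of real effects and modest power, most just-significant results come out “false” on this dichotomous accounting.

    alpha, power = 0.05, 0.2      # .05 level from the slides; power is illustrative
    prior_H1 = 0.1                # assumed prevalence of real effects (the slides' .1)

    true_pos = power * prior_H1           # significant results arising under H1
    false_pos = alpha * (1 - prior_H1)    # significant results arising under H0
    ppv = true_pos / (true_pos + false_pos)
    print(round(ppv, 2), round(1 - ppv, 2))   # ~0.31 "true", ~0.69 "false" positives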
Conclusion
• Legitimate criticisms of scientism often stem from insufficiently self-critical methodology, often statistical; i.e., from what might be called “statisticism.”
• Understanding and resolving these issues calls for philosophical scrutiny of the methodological sort (jointly with statistical practitioners and science journalists).
• Not only would this help to make progress in the debates—
the science wars and the statistics wars—it would promote
philosophies of science genuinely relevant for practice.