A brief introduction to the Philosophy of Science for information scientists and technologists. This is also Chapter 1 of my course on Qualitative Research.
An overview of the History and Philosophy of Science, dissecting the terms History and Philosophy and their focal point, science; correlating the history of science with the philosophy of science; and taking up other essentials such as the scientific method, paradigms, and the role of History and Philosophy of Science in the science classroom. It aims to help teachers, and teachers-to-be, integrate what they learn in this subject into their science teaching.
The Statistics Wars: Errors and Casualties
ABSTRACT: Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties. The philosophical presuppositions behind the meta-research battles remain largely hidden. Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science. Efforts of replication researchers and open science advocates are diminished when so much attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance, no evidence against isn’t evidence for), when erroneous understanding of basic statistical terms goes uncorrected, and when bandwagon effects lead to popular reforms that downplay the importance of error probability control. These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
D. Mayo: Philosophical Interventions in the Statistics Wars
ABSTRACT: While statistics has a long history of passionate philosophical controversy, the last decade especially cries out for philosophical illumination. Misuses of statistics, Big Data dredging, and P-hacking make it easy to find statistically significant, but spurious, effects. This obstructs a test's ability to control the probability of erroneously inferring effects–i.e., to control error probabilities. Disagreements about statistical reforms reflect philosophical disagreements about the nature of statistical inference–including whether error probability control even matters! I describe my interventions in statistics in relation to three events. (1) In 2016 the American Statistical Association (ASA) met to craft principles for avoiding misinterpreting P-values. (2) In 2017, a "megateam" (including philosophers of science) proposed "redefining statistical significance," replacing the common threshold of P ≤ .05 with P ≤ .005. (3) In 2019, an editorial in the main ASA journal called for abandoning all P-value thresholds, and even the words "significant/significance".
A word on each. (1) Invited to be a "philosophical observer" at their meeting, I found the major issues were conceptual. P-values measure how incompatible data are with what is expected under a hypothesis that there is no genuine effect: the smaller the P-value, the more indication of incompatibility. The ASA list of familiar misinterpretations–P-values are not posterior probabilities, statistical significance is not substantive importance, no evidence against a hypothesis need not be evidence for it–I argue, should not be the basis for replacing tests with methods less able to assess and control erroneous interpretations of data (Mayo 2016, 2019). (2) The "redefine statistical significance" movement appraises P-values from the perspective of a very different quantity: a comparative Bayes Factor. Failing to recognize how contrasting approaches measure different things, disputants often talk past each other (Mayo 2018). (3) To ban P-value thresholds, even to distinguish terrible from warranted evidence, I say, is a mistake (2019). It will not eradicate P-hacking, but it will make it harder to hold P-hackers accountable. A 2020 ASA Task Force on significance testing has just been announced. (I would like to think my blog errorstatistics.com helped.)
To enter the fray between rival statistical approaches, it helps to have a principle applicable to all accounts. There's poor evidence for a claim if little, if anything, has been done to find it flawed, even if it is. This forms a basic requirement for evidence I call the severity requirement. A claim passes with severity only if it is subjected to and passes a test that probably would have found it flawed, if it were. It stems from Popper, though he never adequately cashed it out. A variant is the frequentist principle of evidence developed with Sir David Cox (Mayo and Cox 2006).
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo discusses various disputes, notably the replication crisis in science, in the context of her just-released book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa,” presented (via Zoom) at the conference IS PHILOSOPHY USEFUL FOR SCIENCE, AND/OR VICE VERSA?, January 30 - February 2, 2024, at Chapman University, Schmid College of Science and Technology.
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
The constructive role of the replication crisis teaches a lot about: 1) non-fallacious uses of statistical tests, 2) the rationale for the role of probability in tests, and 3) how to reformulate tests.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Replication Crises and the Statistics Wars: Hidden Controversies
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
Severe Testing: The Key to Error Correction
D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods, Dept of Psychology, Princeton University, on November 14, 2023
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
Controversy Over the Significance Test Controversy
Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) in PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
D. G. Mayo: Your data-driven claims must still be probed severely
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science" at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
Similar to Philosophy of Science and Philosophy of Statistics
D. Mayo (Dept of Philosophy, VT)
Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Nancy Reid, Dept. of Statistics, University of Toronto. Inaugural recipient of the "David R. Cox Foundations of Statistics Award".
Slides from invited presentation at 2023 JSM: "The Importance of Foundations in Statistical Science"
Ronald Wasserstein, Chair (American Statistical Association)
ABSTRACT: David Cox wrote “A healthy interplay between theory and application is crucial for statistics… This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods.” These foundations distinguish statistical science from the many fields of research in which statistical thinking is a key intellectual component. In this talk I will emphasize the ongoing importance and relevance of theoretical advances and theoretical thinking through some illustrative examples.
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role to which scientists look to statistics. In order to combat the dangers of unthinking, bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed "reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Causal inference is not statistical inference
Jon Williamson (University of Kent)
ABSTRACT: Many methods for testing causal claims are couched as statistical methods: e.g., randomised controlled trials, various kinds of observational study, meta-analysis, and model-based approaches such as structural equation modelling and graphical causal modelling. I argue that this is a mistake: causal inference is not a purely statistical problem. When we look at causal inference from a general point of view, we see that methods for causal inference fit into the framework of Evidential Pluralism: causal inference is properly understood as requiring mechanistic inference in addition to statistical inference.
Evidential Pluralism also offers a new perspective on the replication crisis. That observed associations are not replicated by subsequent studies is a part of normal science. A problem only arises when those associations are taken to establish causal claims: a science whose established causal claims are constantly overturned is indeed in crisis. However, if we understand causal inference as involving mechanistic inference alongside statistical inference, as Evidential Pluralism suggests, we avoid fallacious inferences from association to causation. Thus, Evidential Pluralism offers the means to prevent the drama of science from turning into a crisis.
Stephan Guttinger (Lecturer in Philosophy of Data/Data Ethics, University of Exeter, UK)
ABSTRACT: The idea of “questionable research practices” (QRPs) is central to the narrative of a replication crisis in the experimental sciences. According to this narrative the low replicability of scientific findings is not simply due to fraud or incompetence, but in large part to the widespread use of QRPs, such as “p-hacking” or the lack of adequate experimental controls. The claim is that such flawed practices generate flawed output. The reduction – or even elimination – of QRPs is therefore one of the main strategies proposed by policymakers and scientists to tackle the replication crisis.
What counts as a QRP, however, is not clear. As I will discuss in the first part of this paper, there is no consensus on how to define the term, and ascriptions of the qualifier "questionable" often vary across disciplines, time, and even within single laboratories. This lack of clarity matters, as it creates the risk of introducing methodological constraints that might create more harm than good. Practices labelled as 'QRPs' can be both beneficial and problematic for research practice, and targeting them without a sound understanding of their dynamic and context-dependent nature risks creating unnecessary casualties in the fight for a more reliable scientific practice.
To start developing a more situated and dynamic picture of QRPs I will then turn my attention to a specific example of a dynamic QRP in the experimental life sciences, namely, the so-called “Far Western Blot” (FWB). The FWB is an experimental system that can be used to study protein-protein interactions but which for most of its existence has not seen a wide uptake in the community because it was seen as a QRP. This was mainly due to its (alleged) propensity to generate high levels of false positives and negatives. Interestingly, however, it seems that over the last few years the FWB slowly moved into the space of acceptable research practices. Analysing this shift and the reasons underlying it, I will argue a) that suppressing this practice deprived the research community of a powerful experimental tool and b) that the original judgment of the FWB was based on a simplistic and non-empirical assessment of its error-generating potential. Ultimately, it seems like the key QRP at work in the FWB case was the way in which the label “questionable” was assigned in the first place. I will argue that findings from this case can be extended to other QRPs in the experimental life sciences and that they point to a larger issue with how researchers judge the error-potential of new research practices.
David Hand (Professor Emeritus and Senior Research Investigator, Department of Mathematics, Faculty of Natural Sciences, Imperial College London)
ABSTRACT: Science progresses through an iterative process of formulating theories and comparing them with empirical real-world data. Different camps of scientists will favour different theories, until accumulating evidence renders one or more untenable. Not unnaturally, people become attached to theories. Perhaps they invented a theory, and kudos arises from being the originator of a generally accepted theory. A theory might represent a life's work, so that being found wanting might be interpreted as failure. Perhaps researchers were trained in a particular school, and acknowledging its shortcomings is difficult. Because of this, tensions can arise between proponents of different theories.
The discipline of statistics is susceptible to precisely the same tensions. Here, however, the tensions are not between different theories of "what is", but between different strategies for shedding light on the real world from limited empirical data. This can be in the form of how one measures discrepancy between the theory's predictions and observations. It can be in the form of different ways of looking at empirical results. It can be, at a higher level, because of differences between what is regarded as important in a particular context. Or it can be for other reasons.
Perhaps the most familiar example of this tension within statistics is between different approaches to inference. However, there are many other examples of such tensions. This paper illustrates with several examples. We argue that the tension generally arises as a consequence of inadequate care being taken in question formulation. That is, insufficient thought is given to deciding exactly what one wants to know - to determining "What is the question?".
The ideas and disagreements are illustrated with several examples.
The neglected importance of complexity in statistics and Metascience
Daniele Fanelli (London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science)
ABSTRACT: Statistics is at war, and Metascience is ailing. This is partially due, the talk will argue, to a paradigmatic blind-spot: the assumption that one can draw general conclusions about empirical findings without considering the role played by context, conditions, assumptions, and the complexity of methods and theories. Whilst ideally these particularities should be unimportant in science, in practice they cannot be neglected in most research fields, let alone in research-on-research.
This neglected importance of complexity is supported by theoretical arguments and empirical findings (or the lack thereof) in the recent meta-analytical and metascientific literature. The talk will overview this background and suggest how the complexity of theories and methodologies may be explicitly factored into particular methodologies of statistics and Metaresearch. The talk will then give examples of how this approach may usefully complement existing paradigms, by translating results, methods and theories into quantities of information that are evaluated using an information-compression logic.
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Uri Simonsohn (Professor, Department of Operations, Innovation and Data Sciences at Esade)
ABSTRACT: The statistical tools listed in the title share the feature that a mathematically elegant solution has become the consensus advice of statisticians, methodologists, and some mathematically sophisticated researchers writing tutorials and textbooks, and yet they lead research workers to meaningless answers that are often also statistically invalid. Part of the problem is that advice givers take the mathematical abstractions of the tools they advocate for literally, instead of taking the actual behavior of researchers seriously.
On Severity, the Weight of Evidence, and the Relationship Between the Two
Margherita Harris (Visiting fellow in the Department of Philosophy, Logic and Scientific Method at the London School of Economics and Political Science)
ABSTRACT: According to the severe tester, one is justified in declaring to have evidence in support of a hypothesis just in case the hypothesis in question has passed a severe test, one that it would be very unlikely to pass so well if the hypothesis were false. Deborah Mayo (2018) calls this the strong severity principle. The Bayesian, however, can declare to have evidence for a hypothesis despite not having done anything to test it severely. The core reason for this has to do with the (infamous) likelihood principle, whose violation is not an option for anyone who subscribes to the Bayesian paradigm. Although the Bayesian is largely unmoved by the incompatibility between the strong severity principle and the likelihood principle, I will argue that the Bayesian's never-ending quest to account for yet another notion, one that is often attributed to Keynes (1921) and that is usually referred to as the weight of evidence, betrays the Bayesian's confidence in the likelihood principle after all. Indeed, I will argue that the weight of evidence and severity may be thought of as two (very different) sides of the same coin: they are two unrelated notions, but what brings them together is the fact that they both make trouble for the likelihood principle, a principle at the core of Bayesian inference. I will relate this conclusion to current debates on how to best conceptualise uncertainty, by the IPCC in particular. I will argue that failure to fully grasp the limitations of an epistemology that envisions the role of probability to be that of quantifying the degree of belief to assign to a hypothesis given the available evidence can be (and has been) detrimental to an adequate communication of uncertainty.
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Aris Spanos (Wilson Schmidt Professor of Economics, Virginia Tech)
ABSTRACT: The discussion places the two cultures, the model-driven statistical modeling and the algorithm-driven modeling associated with Machine Learning (ML) and Statistical Learning Theory (SLT), in a broader context of paradigm shifts in 20th-century statistics, which includes Fisher's model-based induction of the 1920s and variations/extensions thereof, the Data Science (ML, SLT, etc.) and the Graphical Causal modeling in the 1990s. The primary objective is to compare and contrast the effectiveness of different approaches to statistics in learning from data about phenomena of interest and relate that to the current discussions pertaining to the statistics wars and their potential casualties.
Comparing Frequentist and Bayesian Control of Multiple Testing
James Berger
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20x100,000=2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 x 10^-7. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 x 10^-7. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. ... There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
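To give a rough sense of the scale of the multiplicity problem described above, here is a minimal simulation sketch in Python (the numbers are invented and far smaller than a real GWAS): every null hypothesis is true, yet testing each at the 0.05 level yields thousands of spurious "findings," while a Bonferroni-style adjustment of the threshold removes essentially all of them.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

m = 100_000      # number of null hypotheses tested (all of them true here)
n = 50           # observations per test
alpha = 0.05

# Simulate m independent one-sample z-tests in which every null is true.
x = rng.normal(loc=0.0, scale=1.0, size=(m, n))
z = x.mean(axis=1) * np.sqrt(n)           # z-statistics under H0
p = 2 * stats.norm.sf(np.abs(z))          # two-sided p-values

print("significant at 0.05:          ", (p < alpha).sum())      # roughly 5,000 false positives
print("significant after Bonferroni: ", (p < alpha / m).sum())  # almost always 0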
Clark Glymour
ABSTRACT: "Data dredging"--searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. ... These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
The Duality of Parameters and the Duality of Probability
Suzanne Thornton
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)--the behavior of a procedure under hypothetical repetition--bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited from the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to cover the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. ... In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter; rather, it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
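A minimal Python sketch of the repeated-sampling guarantee the abstract invokes for the Normal N(θ, 1) example (θ, the sample size, and the number of repetitions are made-up numbers; σ = 1 is treated as known):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

theta = 2.3        # the fixed parameter (unknown in practice)
n = 25             # sample size per hypothetical study
reps = 100_000     # hypothetical repetitions
z = stats.norm.ppf(0.975)

samples = rng.normal(loc=theta, scale=1.0, size=(reps, n))
means = samples.mean(axis=1)
half_width = z / np.sqrt(n)               # known sigma = 1
covered = (means - half_width <= theta) & (theta <= means + half_width)

print("empirical coverage:", covered.mean())   # close to 0.95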
Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without assigning a prior probability to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast (e.g., DNA matching), the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging.
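A minimal sketch of the post-data severity assessment described above, for the simple known-σ Normal case (the observed mean, σ, and n are invented for illustration): SEV(μ > μ′) is computed as the probability of observing a smaller mean than we did, were μ equal to μ′; at the lower 0.975 estimation limit a it comes out to about .975, as in the abstract.

import numpy as np
from scipy import stats

m_obs, sigma, n = 0.40, 1.0, 100     # illustrative numbers only
se = sigma / np.sqrt(n)

def severity_mu_greater_than(mu_prime):
    # SEV(mu > mu_prime) = Pr(M < m_obs; mu = mu_prime) for the Normal mean with known sigma
    return stats.norm.cdf((m_obs - mu_prime) / se)

a = m_obs - 1.96 * se                # lower 0.975 estimation limit
for mu_prime in [a, 0.0, 0.2, 0.3, 0.4]:
    print(f"SEV(mu > {mu_prime:.3f}) = {severity_mu_greater_than(mu_prime):.3f}")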
The Statistics Wars and Their Casualties (w/refs)
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
On the interpretation of the mathematical characteristics of statistical test...
Statistical hypothesis tests are often misused and misinterpreted. Here I focus on one source of such misinterpretation, namely an inappropriate notion regarding what the mathematical theory of tests implies, and does not imply, when it comes to the application of tests in practice. The view taken here is that it is helpful and instructive to be consciously aware of the essential difference between mathematical model and reality, and to appreciate the mathematical model and its implications as a tool for thinking rather than something that has a truth value regarding reality. Insights are presented regarding the role of model assumptions, unbiasedness and the alternative hypothesis, Neyman-Pearson optimality, multiple and data dependent testing.
The role of background assumptions in severity appraisal
In the past decade discussions around the reproducibility of scientific findings have led to a re-appreciation of the importance of guaranteeing claims are severely tested. The inflation of Type 1 error rates due to flexibility in the data analysis is widely considered one of the underlying causes of low replicability rates. Solutions, such as study preregistration, are becoming increasingly popular to combat this problem. Preregistration only allows researchers to evaluate the severity of a test, but not all preregistered studies provide a severe test of a claim. The appraisal of the severity of a test depends on background information, such as assumptions about the data generating process, and auxiliary hypotheses that influence the final choice for the design of the test. In this article, I will discuss the difference between subjective and inter-subjectively testable assumptions underlying scientific claims, and the importance of separating the two. I will stress the role of justifications in statistical inferences, the conditional nature of scientific conclusions following these justifications, and highlight how severe tests could lead to inter-subjective agreement, based on a philosophical approach grounded in methodological falsificationism. Appreciating the role of background assumptions in the appraisal of severity should shed light on current discussions about the role of preregistration, interpreting the results of replication studies, and proposals to reform statistical inferences.
The two statistical cornerstones of replicability: addressing selective infer...
Tukey’s last published work in 2020 was an obscure entry on multiple comparisons in the Encyclopedia of Behavioral Sciences, addressing the two topics in the title. Replicability was not mentioned at all, nor was any other connection made between the two topics. I shall demonstrate how these two topics critically affect replicability using recently completed studies. I shall review how these have been addressed in the past. I shall review in more detail the available ways to address selective inference. My conclusion is that conducting many small replicability studies without strict standardization is the way to assure replicability of results in science, and we should introduce policies to make this happen.
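As one concrete, standard way of addressing selective inference (a sketch for illustration, not necessarily the procedure the talk reviews), here is the Benjamini-Hochberg step-up rule in Python, applied to simulated p-values from a made-up mixture of true and false nulls and compared with Bonferroni:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Illustrative mixture: 900 true nulls and 100 genuine effects, one z-test each.
m, m_alt, effect = 1000, 100, 3.0
z = rng.normal(size=m)
z[:m_alt] += effect
p = 2 * stats.norm.sf(np.abs(z))

def benjamini_hochberg(pvals, q=0.05):
    # Boolean mask of rejections under the BH step-up procedure at FDR level q.
    pvals = np.asarray(pvals)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        reject[order[: np.max(np.nonzero(below)[0]) + 1]] = True
    return reject

print("Bonferroni rejections:   ", (p < 0.05 / m).sum())
print("BH (FDR 0.05) rejections:", benjamini_hochberg(p).sum())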
The replication crisis: are P-values the problem and are Bayes factors the so...
Today’s posterior is tomorrow’s prior. Dennis Lindley
It has been claimed that science is undergoing a replication crisis and that when looking for culprits, the cult of significance is the chief suspect. It has also been claimed that Bayes factors might provide a solution.
In my opinion, these claims are misleading, and part of the problem is our understanding of the purpose and nature of replication, which has only recently been subject to formal analysis.
What we are or should be interested in is truth. Replication is a coherence, not a correspondence, requirement, and one that has a strong dependence on the size of the replication study.
Consideration of Bayes factors raises a puzzling question. Should the Bayes factor for a replication study be calculated as if it were the initial study? If the answer is yes, the approach is not fully Bayesian and furthermore the Bayes factors will be subject to exactly the same replication ‘paradox’ as P-values. If the answer is no, then in what sense can an initially found Bayes factor be replicated, and what are the implications for how we should view replication of P-values?
A further issue is that little attention has been paid to false negatives and, by extension, to true negatives. Yet, as is well known from the theory of diagnostic tests, it is meaningless to consider the performance of a test in terms of false positives alone.
I shall argue that we are in danger of confusing evidence with the conclusions we draw, and that any reforms of scientific practice should concentrate on producing evidence that is as reliable as it can be qua evidence. There are many basic scientific practices in need of reform. Pseudoreplication, for example, and the routine destruction of information through dichotomisation are far more serious problems than many matters of inferential framing that seem to have excited statisticians.
Philosophy of Science and Philosophy of Statistics
1. Philosophy of Science and Philosophy of Statistics
Deborah G Mayo
Dept of Philosophy, Virginia Tech
APA Central: Epistemology Meets Philosophy of Statistics
February 27, 2020
2. What is the Philosophy of Statistics (PhilStat)?
At one level, statisticians and philosophers of science ask many of the same questions:
What should be observed and what may justifiably be inferred from the resulting data?
What is a good test?
How can spurious relationships be distinguished from genuine regularities? from causal regularities?
3. • These very general questions are entwined with long-standing debates in philosophy of science
• No wonder the field of statistics tends to cross over so often into philosophical territory.
4. Statistics → Philosophy
(1) Model Scientific Inference—capture the actual or rational ways to arrive at evidence and inference
(2) Solve Philosophical Problems about scientific inference, observation, experiment;
(3) Metamethodology: Analyze intuitive rules (e.g., novelty, simplicity)
5. Formal Epistemology?
Could be
• Phil Stat
• Analytic epistemology with probabilities
“Bayesian statistics is one thing. Bayesian epistemology is something else. The idea of putting probabilities over hypotheses delivered to philosophy a godsend, an entire package of superficiality.” (Glymour 2010, 334)
6. • His worry: starting with an intuitive principle, epistemologists reconstruct it with a probabilistic confirmation logic.
• You haven’t shown, for example, that beliefs ought to go up with varied evidence; you represent it probabilistically
• I don't knock rational reconstruction using probability, and analogs of some puzzles arise in statistics (tacking paradox, old evidence)
7. Philosophy → Statistics
• Central job: minister to scientists’ conceptual, logical and methodological discomforts
• Despite technical sophistication, basic concepts of statistical inference and modeling are more unsettled than ever.
9. Statistical Crisis in Science
• In many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive-looking findings even when spurious.
• We set sail with a simple tool: a minimal requirement for evidence
• Sufficiently general to apply to any methods now in use
10. Statistical reforms
• Several are welcome: preregistration, avoidance of cookbook statistics, calls for more replication research
• Others are quite radical, and even violate our minimal principle of evidence
• Combating paradoxical, self-defeating “reforms” requires a mix of statistics, philosophy, and history
11. Most often used tools are most criticized
“Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. …” (Ioannidis 2005, 696)
Do researchers do that?
12. R.A. Fisher
“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14)
13. Simple significance tests (Fisher)
to test the conformity of the particular data under analysis with H0 in some respect:
…we find a function d(X) of the data, the test statistic, such that
• the larger the value of d(X) the more inconsistent are the data with H0;
• d(X) has a known probability distribution when H0 is true.
…the p-value corresponding to any d(x) (or d_obs)
p = p(x) = Pr(d(X) ≥ d(x); H0)
(Mayo and Cox 2006, 81; d for t, x for y)
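As a minimal, illustrative sketch of the P-value computation just defined (assumed setup: a one-sample z-test of H0: μ = 0 with known σ = 1 and d(X) = √n·mean(X)/σ; the data are simulated, not from any study discussed here):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma, n = 1.0, 30
x = rng.normal(loc=0.3, scale=sigma, size=n)   # simulated data with a small discrepancy from H0

d_obs = np.sqrt(n) * x.mean() / sigma          # observed value of the test statistic d(x)
p_value = stats.norm.sf(d_obs)                 # Pr(d(X) >= d_obs; H0)

print(f"d_obs = {d_obs:.2f}, P-value = {p_value:.3f}")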
14. Testing reasoning
• If even larger differences than d_obs occur fairly frequently under H0 (i.e., P-value not small), there’s no evidence of incompatibility with H0
• Small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than d_obs were H0 true.
• This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy H => H*
15. Neyman-Pearson (N-P) tests (1933):
A test (null) and alternative hypotheses H0, H1 that are exhaustive
H0: μ ≤ 0 vs. H1: μ > 0
Philosophers should adopt the language of statistics, e.g., Xi ~ N(μ, σ²)
16. Neyman-Pearson (N-P) tests (1933):
• This fallacy of rejection H1 → H* is impossible
• Rejecting H0 only indicates statistical alternatives H1 (how discrepant from null)
• We get the type II error, and power
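A minimal sketch of the Type II error and power calculation mentioned on slide 16 (assumed setup: the one-sided test H0: μ ≤ 0 vs. H1: μ > 0 with known σ = 1, rejecting when √n·x̄/σ exceeds the α-level cutoff; the numbers are invented):

import numpy as np
from scipy import stats

sigma, n, alpha = 1.0, 25, 0.05
z_alpha = stats.norm.ppf(1 - alpha)            # cutoff for rejecting H0: mu <= 0

def power(mu1):
    # Pr(reject H0; mu = mu1); the Type II error at mu1 is 1 - power(mu1).
    return stats.norm.sf(z_alpha - np.sqrt(n) * mu1 / sigma)

for mu1 in [0.0, 0.2, 0.4, 0.6]:
    print(f"mu1 = {mu1:.1f}: power = {power(mu1):.3f}, Type II error = {1 - power(mu1):.3f}")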
17. Error Statistics
• Fisher and N-P both fall under tools for “appraising and bounding the probabilities of seriously misleading interpretations of data” (Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests, resampling, randomization
18. Both Fisher & N-P: it’s easy to lie
with biasing selection effects
• Sufficient finagling—cherry-picking, significance
seeking, multiple testing, post-data subgroups,
trying and trying again—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
• Violates minimal requirement for evidence.
18
19. Severity Requirement:
• If the test had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H
(Popper: “too cheap to be worth having”)
• A claim passes with severity only to the extent that it is subjected to, and passes, a test that it probably would have failed, if false.
• This probability is how severely it has passed (degree of “corroboration”)
20. Requires a third role for probability
Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (absolute or comparative) (e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher (at times))
Only probabilism is thought to be inferential or evidential
21. What happened to using probability to assess error-probing capacity?
• Neither “probabilism” nor “performance” directly captures assessing error-probing capacity
• Good long-run performance is a necessary, not a sufficient, condition for severity
22. A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets a probability boost, made comparatively firmer)
• Performance: unless it stems from a method with low long-run error
• Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
23. A severe test: My weight
Informal example: To test if I’ve gained weight between the time I left for England and my return, I use a series of well-calibrated and stable scales, both before leaving and upon my return.
All show an over 4 lb gain; none shows a difference in weighing EGEK (a known, stable object). I infer:
H: I’ve gained at least 4 pounds
24. • Properties of the scales are akin to the properties of statistical tests (performance).
• No one claims the justification is merely long run, and can say nothing about my weight.
• We infer something about the source of the readings from the high capability to reveal if any scales were wrong
25. The severe tester is assumed to be in a context of wanting to find things out
• I could insist all the scales are wrong—they work fine with weighing known objects—but this would prevent correctly finding out about weight…
• What sort of extraordinary circumstance could cause them all to go astray just when we don’t know the weight of the test object?
• Argument from coincidence: goes beyond being highly improbable
26. Popper : Carnap as Frequentist : Bayes
“According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h”. (Alan Musgrave 1974, 2)
Battles about roles of probability trace to philosophies of inference
27. Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as I’m using the term) the import of the data is via the ratios of likelihoods of hypotheses
Pr(x0; H0)/Pr(x0; H1)
The data x0 are fixed, while the hypotheses vary
28. Comparative logic of support
• Ian Hacking (1965) “Law of Likelihood”: x supports hypothesis H0 less well than H1 if Pr(x;H0) < Pr(x;H1) (rejects in 1980)
• Any hypothesis that perfectly fits the data is maximally likely (even if data-dredged)
• “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129).
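A minimal sketch of the point about data-dredged hypotheses (assumed setup: Normal data with known σ = 1, simulated under μ = 0; nothing here comes from the slides themselves): the hypothesis fitted to the data after seeing them is always at least as likely as the true null, so a purely comparative likelihood assessment cannot, by itself, penalize the dredging.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 20
x = rng.normal(loc=0.0, scale=1.0, size=n)      # data actually generated under mu = 0

def loglik(mu):
    # Log-likelihood of the sample under N(mu, 1).
    return stats.norm.logpdf(x, loc=mu, scale=1.0).sum()

mu_dredged = x.mean()                            # the "rival hypothesis" chosen to fit the data

print("log-likelihood under H0 (mu = 0):        ", round(loglik(0.0), 2))
print("log-likelihood under dredged (mu = xbar):", round(loglik(mu_dredged), 2))
# The dredged hypothesis always wins the comparative likelihood assessment.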
30. Hunting for significance (nominal vs. actual)
Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ …. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy 1970!)
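The arithmetic behind Selvin's 64 percent, sketched under the simplifying assumption of 20 independent tests of true null hypotheses, each at the 5 percent level:

# Chance of at least one "significant" result among k independent tests of true nulls.
k, alpha = 20, 0.05
actual_level = 1 - (1 - alpha) ** k
print(f"nominal level: {alpha:.2f}, actual level over {k} tests: {actual_level:.2f}")   # about 0.64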
31. Some accounts of evidence object:
“Two problems that plague frequentist inference: multiple comparisons and multiple looks, or…data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010)
(Co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford)
32. On the LP, error probabilities appeal to something irrelevant
“Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space” (Lindley 1971, 436)
Probabilisms condition on the actual data
33. At odds with key way to advance
replication: 21 Word Solution
“We report how we determined our sample size,
and data exclusions (if any), all manipulations, and
all measures in the study” (Simmons, Nelson, and
Simonsohn 2012, 4).
• Replication researchers find flexibility with data
dredging and stopping rules to be a major source of
failed replication (the “forking paths”, Gelman and
Loken 2014)
33
34. Many “reforms” offered as alternative
to significance tests follow the LP
• “Bayes factors [likelihood ratios] can be used in the
complete absence of a sampling plan…” (Bayarri,
Benjamin, Berger, Sellke 2016, 100)
• It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for itself.
(Berger and Wolpert 1988, 78; authors of the
Likelihood Principle)
• No wonder reformers talk past each other
34
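A small simulation sketch (the sample sizes and number of trials are my assumptions) of the “multiple looks” problem the stopping-rule debate is about: testing after every observation until nominal significance is reached inflates the Type I error probability, even though accounts obeying the LP register no difference.

```python
# Optional stopping simulation: test at the nominal two-sided 0.05 level after
# each new observation from a true null N(0,1), stopping at "significance".
import numpy as np
rng = np.random.default_rng(0)

n_max, z_crit, trials = 100, 1.96, 5_000
false_rejections = 0
for _ in range(trials):
    x = rng.standard_normal(n_max)        # H0 is true: the mean really is 0
    for n in range(1, n_max + 1):
        z = x[:n].mean() * np.sqrt(n)     # z-statistic at the n-th look
        if abs(z) > z_crit:               # peek, and stop on significance
            false_rejections += 1
            break

print(f"Type I error rate with peeking: {false_rejections / trials:.2f}")
# Roughly 0.3-0.4 with up to 100 looks -- far above the nominal 0.05.
```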
35. Replication Paradox
• Test Critic: It’s too easy to satisfy standard
significance thresholds
• You: Why do replicationists find it so hard to achieve
significance thresholds (with preregistration)?
• Test Critic: Obviously the initial studies were guilty
of P-hacking, cherry-picking, data-dredging (QRPs)
• You: So, the replication researchers want methods
that pick up on, adjust, and block these biasing
selection effects.
• Test Critic: Actually “reforms” recommend methods
where the need to alter P-values due to data
dredging vanishes 35
36. Probabilists can still block intuitively
unwarranted inferences
• Supplement with subjective beliefs: What do I
believe? As opposed to What is the evidence?
(Royall 1997; 2004)
• Likelihoods + prior probabilities
36
37. Problems
• Additional source of flexibility, priors as well as
biasing selection effects
• Doesn’t show what researchers had done
wrong—battle of beliefs
• The believability of data-dredged hypotheses is
what makes them so seductive
37
38. Contrast with philosophy: Bayesian
statisticians use “default” priors
“[V]irtually never would different experts give prior
distributions that even overlapped” (J. Berger 2006)
• Default priors are to be data dominant in some sense
• “The priors are not to be considered expressions of
uncertainty, ignorance, or degree of belief. [they] may
not even be probabilities…” (Cox and Mayo 2010,
299)
• No agreement on rival systems for default/non-
subjective priors (no “uninformative” priors) 38
39. Many of today’s statistics wars: P-values vs. posteriors
• The posterior probability Pr(H0|x) can be large
while the P-value is small
• To a Bayesian this shows P-values exaggerate
evidence against
• Significance testers object to highly significant
results being interpreted as no evidence against
the null, or even as evidence for it! (A high Type II
error probability)
40. Bayes (Jeffreys)/Fisher disagreement
(“spike and smear”)
• The “P-values exaggerate” charges refer to
testing a point null hypothesis, a lump of prior
probability given to H0 (or a tiny region around 0).
Xi ~ N(μ, σ²)
H0: μ = 0 vs. H1: μ ≠ 0.
• With the rest of the prior appropriately spread over the
alternative, an α-significant result can correspond to
Pr(H0|x) = (1 – α)! (e.g., 0.95)
40
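A sketch of the spike-and-smear calculation (the prior weights, the smear scale τ = 1, and the sample sizes are illustrative assumptions): hold the result at exactly the two-sided 0.05 level and watch the posterior on the point null grow with n.

```python
# Jeffreys-type "spike and smear" posterior for a just-significant result.
import numpy as np
from scipy.stats import norm

pi0, tau, sigma = 0.5, 1.0, 1.0               # spike Pr(H0) = 0.5, smear N(0, tau^2)
for n in (10, 100, 1_000, 10_000):
    xbar = 1.96 * sigma / np.sqrt(n)          # sample mean just significant at 0.05
    m0 = norm.pdf(xbar, 0, sigma / np.sqrt(n))               # marginal under H0
    m1 = norm.pdf(xbar, 0, np.sqrt(tau**2 + sigma**2 / n))   # marginal under H1
    post_H0 = pi0 * m0 / (pi0 * m0 + (1 - pi0) * m1)
    print(f"n = {n:>6}:  P-value = 0.05,  Pr(H0|x) = {post_H0:.2f}")
# Pr(H0|x) climbs toward 1 as n grows while the P-value stays at 0.05 --
# the arithmetic behind the "P-values exaggerate the evidence" charge.
```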
41. “Concentrating mass on the point null hypothesis
is biasing the prior in favor of H0 as much as
possible” (Casella and R. Berger 1987, 111)
whether in one- or two-sided tests
Yet “spike and smear” is the basis for “Redefine
Statistical Significance” (Benjamin et al. 2017)
41
42. Opposing megateam: Lakens et al. (2018)
• Whether tests should use a lower Type 1 error
probability is separate; the problem is
supposing there should be agreement
between quantities measuring different things
42
43. Recent example of a battle based on
P-values disagreeing with posteriors
• If we imagine randomly selecting a hypothesis
from an urn of nulls 90% of which are true
• Consider just 2 possibilities: H0: no effect
H1: meaningful effect, all else ignored,
• Take the prevalence of 90% as
Pr(H0) = 0.9, Pr(H1)= 0.1
• Reject H0 with a single (just) 0.05-significant result,
allowing cherry-picking and selection effects.
Then it can be shown that most “findings” are false 43
44. Diagnostic Screening (DS) Model
• Pr(H0 | Test T rejects H0) > 0.5
really: the prevalence of true nulls among those
rejected at the 0.05 level exceeds 0.5.
Call this the false finding rate (FFR)
• Pr(Test T rejects H0 | H0 ) = 0.05
Criticism: N-P Type I error probability ≠ FFR
(Ioannidis 2005, Colquhoun 2014)
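The screening arithmetic behind this criticism, using the slide-43 prevalence Pr(H0) = 0.9 and α = 0.05; the power values are assumptions added here to show how much the answer turns on them.

```python
# Diagnostic-screening (DS) model: proportion of true nulls among rejections.
def false_finding_rate(prev_H0, alpha, power):
    """Pr(H0 | test rejects H0) under the urn-of-nulls picture."""
    from_true_nulls = prev_H0 * alpha            # rejections when H0 is true
    from_real_effects = (1 - prev_H0) * power    # rejections when H1 is true
    return from_true_nulls / (from_true_nulls + from_real_effects)

for power in (0.8, 0.2):
    ffr = false_finding_rate(prev_H0=0.9, alpha=0.05, power=power)
    print(f"power = {power}:  FFR = {ffr:.2f},  PPV = {1 - ffr:.2f}")
# With low power, or with alpha inflated by cherry-picking, the FFR exceeds 0.5:
# "most findings false" -- given the assumed prevalence and power.
```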
45. DS testers see this as a major criticism of tests
• But there are major confusions
• Pr(H0 | Test T rejects H0) is not a Type I error
probability.
• It transposes the conditional
• Combines crude performance with a probabilist
assignment (true to neither Bayesians nor error
statisticians)
• OK in certain screening contexts (genomics)
47. PPV
• Complement of FFR: the positive predictive value
PPV
Pr(H1|Test T rejects H0)
47
48. What’s Pr(H1) (i.e., Prev(H1))?
“Proportion of experiments we do over a lifetime in
which there is a real effect” (Colquhoun 2014, 9)
Proportion of true relationships among those
tested in a field. Ioannidis (2005, 0696)
Hypotheses can be individuated in many ways
49. Probabilistic Instantiation Fallacy
• Pr(the randomly selected null hypothesis is true) = .9
• The randomly selected null hypothesis is H51
• Pr(H51 is true) = .9
Each Hi is either true or not!
(It could have a genuine frequentist prior but it
wouldn’t equal .9)
49
50. Is the PPV (complement of the FFR) relevant to what’s wanted?
Crud Factor. In many fields of social science it’s
thought nearly everything is related to everything:
“all nulls false”.
It also promotes the “stay safe” idea.
51. Some Bayesians reject probabilism
(Gelman: Falsificationist Bayesian;
Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as ‘error probes’
in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and
Hinkley (1974) call ‘pure significance testing’, in
which certain of the model’s implications are
compared directly to the data.” (Gelman and Shalizi
2013, 10, 20). 51
52. • They can’t also jump on the “abandon significance /
don’t use P-value thresholds” bandwagon
• If there’s no threshold, there’s no falsification,
and no tests
• Granted P-values don’t give effect sizes
52
53. The severe tester reformulates tests with a discrepancy γ from H0
• Severity function: SEV(Test T, data x, claim C)
• Instead of a binary cut-off (significant or not)
the particular outcome is used to infer
discrepancies that are or are not warranted
54. To avoid fallacies of rejection (e.g., magnitude error)
Testing the mean of a Normal distribution:
H0: μ ≤ 0 vs. H1: μ > 0
• If you very probably would have observed a more
impressive (smaller) P-value if μ = μ1 (μ1 = μ0 + γ),
then the data are poor evidence that
μ > μ1.
55. Relation to a test’s power
Let M be the sample mean (a random variable) and
M0 its observed value
• Say M just reaches statistical significance at level
P, say 0.025; and compute power in relation to
this cut-off
• If the power against μ1 is high then the data are
poor evidence that
μ > μ1.
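A sketch of the computation behind slides 54–55 for the test H0: μ ≤ 0 vs. H1: μ > 0, with σ, n, and a just-significant observed mean assumed for illustration (the construction follows Mayo and Spanos 2006):

```python
# Severity vs. power for a just-significant result in a one-sided Normal test.
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25
se = sigma / np.sqrt(n)               # standard error of the sample mean M
m_obs = 1.96 * se                     # observed mean, just significant at 0.025
cutoff = 1.96 * se                    # rejection cutoff at the 0.025 level

for mu1 in (0.1, 0.2, 0.4, 0.6):
    # SEV(mu > mu1): probability of a less impressive result (M <= m_obs) if mu = mu1
    sev = norm.cdf((m_obs - mu1) / se)
    # Power of the test against mu = mu1, computed at the cutoff
    power = 1 - norm.cdf((cutoff - mu1) / se)
    print(f"mu1 = {mu1}:  SEV(mu > mu1) = {sev:.2f},  POW(mu1) = {power:.2f}")
# Where the power against mu1 is high, SEV(mu > mu1) is low: the just-significant
# result is poor evidence of a discrepancy that large (here SEV = 1 - POW,
# because m_obs sits exactly at the cutoff).
```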
57. Similarly, severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from a larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire), a fire
alarm that goes off with burnt toast or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the one that goes off
with burnt toast] 57
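The fire-alarm point in numbers (the sample sizes and discrepancy of interest are assumed): the same just-significant result licenses a smaller discrepancy when n is large.

```python
# Same attained significance level, different sample sizes, different warranted discrepancies.
import numpy as np
from scipy.stats import norm

sigma, mu1 = 1.0, 0.2
for n in (25, 2_500):
    se = sigma / np.sqrt(n)
    m_obs = 1.96 * se                      # just significant at the 0.025 level
    sev = norm.cdf((m_obs - mu1) / se)     # severity for the claim mu > 0.2
    print(f"n = {n:>5}:  SEV(mu > 0.2) = {sev:.2f}")
# n = 25 warrants mu > 0.2 fairly well; n = 2,500 (the alarm that goes off at
# burnt toast) does not.
```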
58. • Anyone who equates severity and power has it
backwards
• The only time they could be equal is when M just
misses statistical significance and we want to
assess claims of the form
μ < μ1, μ < μ2, μ < μ3, … μ < μk, …
• Then SEV(μ < μk) = POW(μk)
58
59. We avoid fallacies of
non-significant results?
• They don’t warrant 0 discrepancy
• Not uninformative; we can find an upper bound μ1
such that SEV(μ < μ1) is high
• It’s with negative results (P-values not small)
that severity goes in the same direction as
power
–provided model assumptions hold
59
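A sketch of the upper-bound reading of a non-significant result (the numbers are assumptions): find discrepancies μ1 that the test had a high probability of detecting, so that SEV(μ < μ1) is high.

```python
# Severity-based upper bounds from a non-significant result.
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25
se = sigma / np.sqrt(n)
m_obs = 1.0 * se                 # observed mean, not significant (one-sided P about 0.16)

for mu1 in (0.2, 0.4, 0.6):
    # SEV(mu < mu1): probability of a more significant result (M > m_obs) if mu = mu1
    sev = 1 - norm.cdf((m_obs - mu1) / se)
    print(f"SEV(mu < {mu1}) = {sev:.2f}")
# The result doesn't warrant "zero discrepancy", but it does warrant mu < 0.6
# with high severity -- an informative upper bound, in the direction of power.
```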
60. FEV: Frequentist Principle of Evidence; Mayo
and Cox (2006); SEV: Mayo 1991, Mayo and
Spanos (2006)
FEV/SEV A small P-value indicates a discrepancy γ
from H0 only if there is a high probability the test
would have resulted in a larger P-value were a
discrepancy as large as γ absent.
FEV/SEV A moderate P-value indicates the absence
of a discrepancy γ from H0, only if there is a high
probability the test would have given a worse fit with
H0 (i.e., a smaller P-value) were a discrepancy γ
present.
60
62. 2019: Don’t say ‘significance’,
don’t use P-value thresholds
• The editors of the March 2019 TAS special issue “A World
Beyond p < 0.05” (Wasserstein, Schirm, and Lazar)
say that “declarations of ‘statistical significance’ be
abandoned” (p. 2).
• On their view: Prespecified P-value thresholds
should never be used in interpreting results.
• It is not just a word ban but a gatekeeper ban
62
63. “Retiring statistical significance
would give bias a free pass".
John Ioannidis (2019)
"...potential for falsification is a prerequisite for
science. Fields that obstinately resist refutation
can hide behind the abolition of statistical
significance but risk becoming self-ostracized
from the remit of science”.
I agree, and in Mayo (2019) I show why.
63
64. • Complying with the “no threshold” view precludes
the FDA's long-established drug review
procedures, as Wasserstein et al. (2019) recognize
• They think by removing P-value thresholds,
researchers lose an incentive to data dredge, and
otherwise exploit researcher flexibility
• Even if true, it's a bad argument.
(Decriminalizing robbery results in fewer robbery arrests)
• But it's not true.
64
65. • Even without the word ‘significance’, eager
researchers still can’t take a large (non-
significant) P-value to indicate a genuine effect
• It would be to say: Even though larger differences
would frequently occur by chance variability alone,
my data provide evidence they are not due to
chance variability
• In short, he would still need to report a reasonably
small P-value
• The eager investigator will need to "spin" his
results, ransack, data dredge
65
66. • In a world without predesignated thresholds, it
would be hard to hold him accountable for
reporting a nominally small P-value:
• “whether a p-value passes any arbitrary threshold
should not be considered at all" in interpreting
data (Wasserstein et al. 2019, 2)
66
67. No tests, no falsification
• The “no thresholds” view also blocks common uses
of confidence intervals and Bayes factor standards
• If you cannot say about any results, ahead of time,
they will not be allowed to count in favor of a claim,
then you do not have a test of it
• Don’t confuse having a threshold with the terrible
practice of using a fixed P-value across all studies
in an unthinking manner
• We should reject the latter
67
68. New ASA Task Force on
Significance Tests and Replication
• to “prepare a …piece reflecting “good statistical
practice,” without leaving the impression that p-
values and hypothesis tests…have no
role.” (Karen Kafadar 2019, 4)
• I hope that philosophers (of science and of
knowledge) get involved!
68
69. References
• Barnard, G. (1972). “The Logic of Statistical Inference (Review of ‘The Logic of Statistical Inference’ by
Ian Hacking)”, British Journal for the Philosophy of Science 23(2), 123–32.
• Bartlett, T. (2014). “Replication Crisis in Psychology Research Turns Ugly and Odd”, The Chronicle of
Higher Education (online) June 23, 2014.
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A
Proposal for Statistical Practice in Testing Hypotheses.” Journal of Mathematical Psychology 72: 90-
103.
• Benjamin, D., Berger, J., Johannesson, M., et al. (2017). “Redefine Statistical Significance”, Nature
Human Behaviour 2, 6–10.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Lecture Notes-Monograph
Series, Vol. 6. Hayward, CA: Institute of Mathematical Statistics.
• Birnbaum, A. (1970). “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237)
(March 14): 1033.
• Casella, G. and Berger, R. (1987). “Reconciling Bayesian and Frequentist Evidence in the One-sided
Testing Problem”, Journal of the American Statistical Association 82(397), 106-11.
• Colquhoun, D. (2014). ‘An Investigation of the False Discovery Rate and the Misinterpretation of P-
values’, Royal Society Open Science 1(3), 140216.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” Error and
Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality
of Science. Mayo and Spanos (eds.), pp. 276–304. CUP.
69
70. • FDA (U. S. Food and Drug Administration) (2017). “Multiple Endpoints in Clinical Trials: Guidance for
Industry (DRAFT GUIDANCE).” Retrieved from https://www.fda.gov/media/102657/download
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and Boyd.
• Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”, American Scientist 102(6), 460–65.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder’”
British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80.
• Glymour, C. (2010). "Explanation and Truth". In Mayo, D. and Spanos, A. (eds.) Error and Inference:
Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science,
pp. 331–50. CUP.
• Goodman, S. N. (1999). “Toward Evidence-based Medical Statistics. 2: The Bayes Factor”, Annals of
Internal Medicine 130, 1005–13.
• Hacking, I. (1965). Logic of Statistical Inference. CUP.
• Hacking, I. (1980). “The Theory of Probable Inference: Neyman, Peirce and Braithwaite”, in Mellor, D.
(ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, pp. 141–60. CUP.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS Medicine 2(8), 0696–
0701.
• Ioannidis, J. (2019). “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not
Abandon Significance”, JAMA 321(21), 2067–68. doi:10.1001/jama.2019.4582
• Kafadar, K. (2019). “The Year in Review … And More to Come,” President's Corner, AmStat News, (Issue
510), (Dec. 2019).
• Lakens, D., et al. (2018). “Justify your Alpha”, Nature Human Behaviour 2, 168–71.
70
71.
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.),
Foundations of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1991). “Novel Evidence and Severe Tests,” Philosophy of Science, 58 (4): 523-552.
Reprinted (1991) in The Philosopher’s Annual XIV: 203-232.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual
Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars,
Cambridge: Cambridge University Press.
• Mayo, D. (2019). ”P-value thresholds: Forfeit at your peril”. European Journal of Clinical Investigation,
49, e13170. https://doi.org/10.1111/eci.13170
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo,
J. (ed.) The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series,
49, pp. 247-275. Institute of Mathematical Statistics.
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson
Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
• Meehl, P. (1990). "Why Summaries of Research on Psychological Theories Are Often Uninterpretable",
Psychological Reports 66(1): 195–244.
• Meehl, P. and Waller, N. (2002). "The Path Analysis Controversy: A New Statistical Approach to Strong
Appraisal of Verisimilitude", Psychological Methods 7(3): 283–300.
• Morrison, D. E., and R. E. Henkel, (eds.), (1970). The Significance Test Controversy: A Reader.
Chicago: Aldine De Gruyter.
• Musgrave, A. (1974). “Logical versus Historical Theories of Confirmation”, BJPS 25(1), 1–23.
72.
• Neyman, J. and Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical
Hypotheses." Philosophical Transactions of the Royal Society of London Series A 231, 289–337.
Reprinted in Joint Statistical Papers, 140–85.
• Neyman, J. and Pearson, E.S. (1967). Joint Statistical Papers of J. Neyman and E. S. Pearson.
Cambridge: Cambridge University Press.
• Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science”,
Science 349(6251), 943–51.
• Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J.
Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press).
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL: Chapman and Hall,
CRC press.
• Royall, R. (2004). “The Likelihood Paradigm for Statistical Evidence” and “Rejoinder”. In Taper, M.
and Lele, S. (eds.) The Nature of Scientific Evidence, , pp. 119–138; 145–151. Chicago: University of
Chicago Press.
• Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
• Selvin, H. (1970). “A critique of tests of significance in survey research”. In The significance test
controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
• Simmons, J., Nelson, L., and Simonsohn, U. (2012). “A 21 word solution”, Dialogue 26(2), 4–7.
• Wasserstein, R., Schirm, A. & Lazar, N. (2019). Moving to a World Beyond “p < 0.05” [Editorial]. The
American Statistician, 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913