This document discusses criticisms of P-values and reforms proposed on the basis of Bayesian statistics. It reviews the Bayes/Fisher disagreement over whether P-values exaggerate the evidence against the null hypothesis under certain priors: when a lump prior of 0.5 is placed on a point null and the remaining 0.5 is spread over the alternative, then as the sample size grows a statistically significant result can correspond to a posterior probability for the null that exceeds the 0.5 prior. It then examines proposals to redefine statistical significance using likelihood ratios and Bayes factors so as to match Bayesian standards of evidence, and assesses these reforms against the minimal severity requirement.
Mayo Slides: Part I Meeting #2 (Phil 6334/Econ 6614)
Slides for Meeting #2 of Phil 6334/Econ 6614: Current Debates on Statistical Inference and Modeling (D. Mayo and A. Spanos).
Part I: Bernoulli trials: Plane Jane Version
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa” presented (via Zoom) at the conference: IS PHILOSOPHY USEFUL FOR SCIENCE, AND/OR VICE VERSA? January 30 - February 2, 2024 at Chapman University, Schmid College of Science and Technology.
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G. Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods at the Dept of Psychology, Princeton University, on November 14, 2023.
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
D. Mayo (Dept of Philosophy, VT)
Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Nancy Reid, Dept. of Statistics, University of Toronto. Inaugural recipient of the "David R. Cox Foundations of Statistics Award".
Slides from Invited presentation at 2023 JSM: “The Importance of Foundations in Statistical Science“
Ronald Wasserstein, Chair (American Statistical Association)
ABSTRACT: David Cox wrote “A healthy interplay between theory and application is crucial for statistics… This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods.” These foundations distinguish statistical science from the many fields of research in which statistical thinking is a key intellectual component. In this talk I will emphasize the ongoing importance and relevance of theoretical advances and theoretical thinking through some illustrative examples.
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role to which scientists look to statistics. In order to combat the dangers of unthinking, bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed "reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Causal inference is not statistical inference
Jon Williamson (University of Kent)
ABSTRACT: Many methods for testing causal claims are couched as statistical methods: e.g.,
randomised controlled trials, various kinds of observational study, meta-analysis, and
model-based approaches such as structural equation modelling and graphical causal
modelling. I argue that this is a mistake: causal inference is not a purely statistical
problem. When we look at causal inference from a general point of view, we see that
methods for causal inference fit into the framework of Evidential Pluralism: causal
inference is properly understood as requiring mechanistic inference in addition to
statistical inference.
Evidential Pluralism also offers a new perspective on the replication crisis. That
observed associations are not replicated by subsequent studies is a part of normal
science. A problem only arises when those associations are taken to establish causal
claims: a science whose established causal claims are constantly overturned is indeed
in crisis. However, if we understand causal inference as involving mechanistic inference
alongside statistical inference, as Evidential Pluralism suggests, we avoid fallacious
inferences from association to causation. Thus, Evidential Pluralism offers the means to
prevent the drama of science from turning into a crisis.
Stephan Guttinger (Lecturer in Philosophy of Data/Data Ethics, University of Exeter, UK)
ABSTRACT: The idea of “questionable research practices” (QRPs) is central to the narrative of a replication crisis in the experimental sciences. According to this narrative the low replicability of scientific findings is not simply due to fraud or incompetence, but in large part to the widespread use of QRPs, such as “p-hacking” or the lack of adequate experimental controls. The claim is that such flawed practices generate flawed output. The reduction – or even elimination – of QRPs is therefore one of the main strategies proposed by policymakers and scientists to tackle the replication crisis.
What counts as a QRP, however, is not clear. As I will discuss in the first part of this paper, there is no consensus on how to define the term, and ascriptions of the qualifier “questionable” often vary across disciplines, time, and even within single laboratories. This lack of clarity matters as it creates the risk of introducing methodological constraints that might create more harm than good. Practices labelled as ‘QRPs’ can be both beneficial and problematic for research practice and targeting them without a sound understanding of their dynamic and context-dependent nature risks creating unnecessary casualties in the fight for a more reliable scientific practice.
To start developing a more situated and dynamic picture of QRPs I will then turn my attention to a specific example of a dynamic QRP in the experimental life sciences, namely, the so-called “Far Western Blot” (FWB). The FWB is an experimental system that can be used to study protein-protein interactions but which for most of its existence has not seen a wide uptake in the community because it was seen as a QRP. This was mainly due to its (alleged) propensity to generate high levels of false positives and negatives. Interestingly, however, it seems that over the last few years the FWB slowly moved into the space of acceptable research practices. Analysing this shift and the reasons underlying it, I will argue a) that suppressing this practice deprived the research community of a powerful experimental tool and b) that the original judgment of the FWB was based on a simplistic and non-empirical assessment of its error-generating potential. Ultimately, it seems like the key QRP at work in the FWB case was the way in which the label “questionable” was assigned in the first place. I will argue that findings from this case can be extended to other QRPs in the experimental life sciences and that they point to a larger issue with how researchers judge the error-potential of new research practices.
David Hand (Professor Emeritus and Senior Research Investigator, Department of Mathematics,
Faculty of Natural Sciences, Imperial College London.)
ABSTRACT: Science progresses through an iterative process of formulating theories and comparing
them with empirical real-world data. Different camps of scientists will favour different
theories, until accumulating evidence renders one or more untenable. Not unnaturally,
people become attached to theories. Perhaps they invented a theory, and kudos arises
from being the originator of a generally accepted theory. A theory might represent a
life's work, so that being found wanting might be interpreted as failure. Perhaps
researchers were trained in a particular school, and acknowledging its shortcomings is
difficult. Because of this, tensions can arise between proponents of different theories.
The discipline of statistics is susceptible to precisely the same tensions. Here, however,
the tensions are not between different theories of "what is", but between different
strategies for shedding light on the real world from limited empirical data. This can be in
the form of how one measures discrepancy between the theory's predictions and
observations. It can be in the form of different ways of looking at empirical results. It can
be, at a higher level, because of differences between what is regarded as important in a
particular context. Or it can be for other reasons.
Perhaps the most familiar example of this tension within statistics is between different
approaches to inference. However, there are many other examples of such tensions.
This paper illustrates with several examples. We argue that the tension generally arises
as a consequence of inadequate care being taken in question formulation. That is,
insufficient thought is given to deciding exactly what one wants to know - to determining
"What is the question?".
The ideas and disagreements are illustrated with several examples.
The neglected importance of complexity in statistics and Metascience
Daniele Fanelli
London School of Economics Fellow in Quantitative Methodology, Department of
Methodology, London School of Economics and Political Science.
ABSTRACT: Statistics is at war, and Metascience is ailing. This is partially due, the talk will argue, to
a paradigmatic blind-spot: the assumption that one can draw general conclusions about
empirical findings without considering the role played by context, conditions,
assumptions, and the complexity of methods and theories. Whilst ideally these
particularities should be unimportant in science, in practice they cannot be neglected in
most research fields, let alone in research-on-research.
This neglected importance of complexity is supported by theoretical arguments and
empirical findings (or the lack thereof) in the recent meta-analytical and metascientific
literature. The talk will overview this background and suggest how the complexity of
theories and methodologies may be explicitly factored into particular methodologies of
statistics and Metaresearch. The talk will then give examples of how this approach may
usefully complement existing paradigms, by translating results, methods and theories
into quantities of information that are evaluated using an information-compression logic.
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Uri Simonsohn (Professor, Department of Operations, Innovation and Data Sciences at Esade)
ABSTRACT: The statistical tools listed in the title share that a mathematically elegant solution has
become the consensus advice of statisticians, methodologists and some
mathematically sophisticated researchers writing tutorials and textbooks, and yet,
they lead research workers to meaningless answers, that are often also statistically
invalid. Part of the problem is that advice givers take the mathematical abstractions
of the tools they advocate for literally, instead of taking the actual behavior of
researchers seriously.
On Severity, the Weight of Evidence, and the Relationship Between the Two
Margherita Harris
Visiting fellow in the Department of Philosophy, Logic and Scientific Method at the London
School of Economics and Political Science.
ABSTRACT: According to the severe tester, one is justified in declaring to have evidence in support of a
hypothesis just in case the hypothesis in question has passed a severe test, one that it would be very
unlikely to pass so well if the hypothesis were false. Deborah Mayo (2018) calls this the strong
severity principle. The Bayesian, however, can declare to have evidence for a hypothesis despite not
having done anything to test it severely. The core reason for this has to do with the
(infamous) likelihood principle, whose violation is not an option for anyone who subscribes to the
Bayesian paradigm. Although the Bayesian is largely unmoved by the incompatibility between
the strong severity principle and the likelihood principle, I will argue that the Bayesian’s never-ending
quest to account for yet another notion, one that is often attributed to Keynes (1921) and that is
usually referred to as the weight of evidence, betrays the Bayesian’s confidence in the likelihood
principle after all. Indeed, I will argue that the weight of evidence and severity may be thought of as
two (very different) sides of the same coin: they are two unrelated notions, but what brings them
together is the fact that they both make trouble for the likelihood principle, a principle at the core of
Bayesian inference. I will relate this conclusion to current debates on how to best conceptualise
uncertainty by the IPCC in particular. I will argue that failure to fully grasp the limitations of an
epistemology that envisions the role of probability to be that of quantifying the degree of belief to
assign to a hypothesis given the available evidence can be (and has been) detrimental to an
adequate communication of uncertainty.
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Aris Spanos (Wilson Schmidt Professor of Economics, Virginia Tech)
ABSTRACT: The discussion places the two cultures, the model-driven statistical modeling and the
algorithm-driven modeling associated with Machine Learning (ML) and Statistical
Learning Theory (SLT) in a broader context of paradigm shifts in 20th-century statistics,
which includes Fisher’s model-based induction of the 1920s and variations/extensions
thereof, the Data Science (ML, STL, etc.) and the Graphical Causal modeling in the
1990s. The primary objective is to compare and contrast the effectiveness of different
approaches to statistics in learning from data about phenomena of interest and relate
that to the current discussions pertaining to the statistics wars and their potential
casualties.
Comparing Frequentist and Bayesian Control of Multiple Testing
James Berger
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20x100,000=2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 x 10^-7. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 x 10^-7. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. ... There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
Clark Glymour
ABSTRACT: "Data dredging"--searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. ... These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
The Duality of Parameters and the Duality of Probability
Suzanne Thornton
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)--the behavior of a procedure under hypothetical repetition--bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited by the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to yield an interval covering the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. ... In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter, rather it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast, (e.g., DNA matching) the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging
The Statistics Wars and Their Casualties (w/refs)
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether one accepts this philosophy of evidence, I argue, that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
On the interpretation of the mathematical characteristics of statistical test...
Statistical hypothesis tests are often misused and misinterpreted. Here I focus on one
source of such misinterpretation, namely an inappropriate notion regarding what the
mathematical theory of tests implies, and does not imply, when it comes to the
application of tests in practice. The view taken here is that it is helpful and instructive to be consciously aware of the essential difference between mathematical model and
reality, and to appreciate the mathematical model and its implications as a tool for
thinking rather than something that has a truth value regarding reality. Insights are presented regarding the role of model assumptions, unbiasedness and the alternative hypothesis, Neyman-Pearson optimality, multiple and data dependent testing.
The role of background assumptions in severity appraisal
In the past decade discussions around the reproducibility of scientific findings have led to a re-appreciation of the importance of guaranteeing claims are severely tested. The inflation of Type 1 error rates due to flexibility in the data analysis is widely considered
one of the underlying causes of low replicability rates. Solutions, such as study preregistration, are becoming increasingly popular to combat this problem. Preregistration only allows researchers to evaluate the severity of a test, but not all
preregistered studies provide a severe test of a claim. The appraisal of the severity of a
test depends on background information, such as assumptions about the data generating process, and auxiliary hypotheses that influence the final choice for the
design of the test. In this article, I will discuss the difference between subjective and
inter-subjectively testable assumptions underlying scientific claims, and the importance
of separating the two. I will stress the role of justifications in statistical inferences, the
conditional nature of scientific conclusions following these justifications, and highlight
how severe tests could lead to inter-subjective agreement, based on a philosophical approach grounded in methodological falsificationism. Appreciating the role of background assumptions in the appraisal of severity should shed light on current discussions about the role of preregistration, interpreting the results of replication studies, and proposals to reform statistical inferences.
6. I. J. Berger and Sellke, and Casella and R. Berger
II. Jeffreys-Lindley Paradox; Bayes/Fisher Disagreement
III. Redefine Statistical Significance
7. Common criticism: “Significance levels (or P-values) exaggerate the evidence against the null hypothesis.”
What do you mean by exaggerating the evidence against H0?
Answer: The P-value is too small. For example: “What I mean is that when I put a lump of prior weight π0 of 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.” (SIST p. 246)
8. “P-values exaggerate”: if inference is appraised via one of the probabilisms–Bayesian posteriors, Bayes factors, or likelihood ratios–the evidence against the null isn’t as big as 1 – P.
• On the other hand, the probability H0 would have survived is 1 – P.
• Difference in the role for probabilities.
9. You might react by observing that:
• P-values are not intended as posteriors in H0 (or Bayes factors, likelihood ratios).
• Why suppose a P-value should match numbers computed in very different accounts?
10. When the criticism is in the form of a posterior:
…[S]ome Bayesians in criticizing P-values seem to think that it is appropriate to use a threshold for significance of 0.95 of the probability of the alternative hypothesis being true. This makes no more sense than, in moving from a minimum height standard (say) for recruiting police officers to a minimum weight standard, declaring that since it was previously 6 foot it must now be 6 stone (Senn 2001, p. 202).
11. Getting Beyond “I’m Rubber and You’re Glue” (SIST p. 247)
The danger in critiquing statistical method X from the standpoint of a distinct school Y is that of falling into begging the question.
Whatever you say about me bounces off and sticks to you. This is a genuine worry, but it’s not fatal.
12. The minimal thesis about “bad evidence, no test (BENT)” enables scrutiny of any statistical inference account–at least on the meta-level.
Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t.
But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.
13. This is a “how to” book
• We do not depict critics as committing a gross blunder (confusing a P-value with a posterior probability in a null).
• Nor do we just deny we care about their measure of support: I say we should look at exactly what the critics are on about.
14. Bayes Factor
• A likelihood ratio, but not limited to point hypotheses.
• The parameter is viewed as a random variable with a distribution.
15. J. Berger and Sellke, and Casella and R. Berger
Berger and Sellke (1987) make out the conflict between P-values and Bayesian posteriors using the two-sided test of the Normal mean, H0: μ = 0 versus H1: μ ≠ 0.
“Suppose that X = (X1, …, Xn), where the Xi are IID N(μ, σ²), σ² known” (p. 112).
Then the test statistic is d(X) = √n|X̄ – μ0|/σ, and the P-value will be twice the P-value of the corresponding one-sided test.
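A quick numerical check of this setup (a minimal sketch in Python, not from the paper; μ0 = 0 and σ = 1 are assumed for concreteness):

```python
# Two-sided P-value for the Normal-mean test, using d(X) = sqrt(n) * |x_bar - mu0| / sigma.
import numpy as np
from scipy.stats import norm

def two_sided_p(x_bar, n, mu0=0.0, sigma=1.0):
    d = np.sqrt(n) * abs(x_bar - mu0) / sigma
    return 2 * norm.sf(d)  # twice the one-sided tail area

print(round(two_sided_p(x_bar=0.277, n=50), 3))  # d is about 1.96, so P is about .05
```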
16. By titling their paper “The irreconcilability of P-values and evidence,” Berger and Sellke imply that if P-values disagree with posterior assessments, they can’t be measures of evidence at all.
Casella and R. Berger (1987) retort that “reconciling” is at hand, if you move away from the lump prior.
17. First, Casella and Berger: Spike and Smear
Starting with a lump of prior, 0.5, on H0, they find the posterior probability in H0 is larger than the P-value for a variety of different priors assigned to the alternative.
The result depends entirely on how the remaining .5 is smeared over the alternative.
18. Using a Jeffreys-type prior, the .5 is spread out over the alternative parameter values as if the parameter is itself distributed N(μ0, σ²).
Actually, Jeffreys recommends the lump prior only when a special value of a parameter is deemed plausible.*
The rationale is to enable it to receive a reasonable posterior probability, and to avoid a 0 prior on H0.
“P-values are reasonable measures of evidence when there is no a priori concentration of belief about H0” (Berger and Delampady).
19. Table 4.1, SIST p. 249 (from J. Berger and T. Sellke 1987).
20. With n = 50, “one can classically ‘reject H0 at significance level p = .05,’ although Pr(H0|x) = .52 (which would actually indicate that the evidence favors H0)” (Berger and Sellke, p. 113).
If n = 1000, a result statistically significant at the .05 level has the posterior probability of μ = 0 go up from .5 (the lump prior) to .82!
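These numbers can be reproduced with a short calculation (a minimal sketch, assuming the symmetric Berger and Sellke setup in which the 0.5 given to the alternative is spread as μ ~ N(μ0, σ²); σ cancels, so the posterior depends only on z and n):

```python
# Posterior probability of H0: mu = 0 under a spike-and-smear prior:
# Pr(H0) = 0.5, and under H1 the prior is mu ~ N(0, sigma^2) (the symmetric case).
import numpy as np

def posterior_H0(z, n, pi0=0.5):
    """Posterior for H0 given a z-statistic from the mean of n IID N(mu, sigma^2) observations."""
    # The marginal of x_bar is N(0, sigma^2/n) under H0 and N(0, sigma^2 + sigma^2/n) under H1,
    # so the Bayes factor in favor of H0 reduces to:
    B01 = np.sqrt(n + 1) * np.exp(-0.5 * z**2 * n / (n + 1))
    return pi0 * B01 / (pi0 * B01 + (1 - pi0))

for n in (1, 50, 100, 1000):
    print(n, round(posterior_H0(1.96, n), 2))
# n = 50 gives ~.52 and n = 1000 gives ~.82: a 'just significant' .05 result that
# leaves the posterior on H0 at or well above the .5 lump prior.
```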
21. From their Bayesian perspective, this seems to show P-values are exaggerating evidence against H0.
From an error statistical perspective, this allows statistically significant results to be interpreted as no evidence against H0–or even evidence for it! (The posterior on H0 is higher than the prior: a B-boost for H0.)
• After all, 0 is excluded from the 2-sided confidence interval at level .95.
• The probability of declaring evidence for the null even if false is high.
22. • Why assign the lump of ½ as prior to the point null?
“The choice of π0 = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective’” (Berger and Sellke 1987, p. 115).
• But is it?
• One starts by making H0 and H1 equally probable, then the .5 accorded to H1 is spread out over all the values in H1:
24. But it’s scarcely an obvious justification for a lump of prior on the null H0 that it ensures, if they do reject H0, there will be a meaningful drop in its probability.
25. Casella and R. Berger (1987) charge that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111), whether in 1- or 2-sided tests.
According to them: “The testing of a point null hypothesis is one of the most misused statistical procedures. In particular, in the location parameter problem, the point null hypothesis is more the mathematical convenience than the statistical method of choice” (ibid. p. 106).
Most of the time “there is a direction of interest in many experiments, and saddling an experimenter with a two-sided test would not be appropriate” (ibid.).
26. Jeffreys-Lindley “Paradox” or Bayes/Fisher Disagreement (SIST p. 250)
The disagreement between the P-value and the posterior can be dramatic.
With a lump given to the point null, and the rest appropriately spread over the alternative, an n can be found such that an α-significant result corresponds to Pr(H0|x) = (1 – α)!
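To see how large an n this takes, here is a minimal sketch (same assumed spike-and-smear setup as above, with z fixed at 1.96, i.e. a two-sided α of .05):

```python
# Search for a sample size at which a just-significant z = 1.96 result gives Pr(H0|x) = .95.
import numpy as np

def posterior_H0(z, n, pi0=0.5):
    B01 = np.sqrt(n + 1) * np.exp(-0.5 * z**2 * n / (n + 1))  # Bayes factor in favor of H0
    return pi0 * B01 / (pi0 * B01 + (1 - pi0))

n = 1
while posterior_H0(1.96, n) < 0.95:
    n += 1
print(n)  # roughly n ~ 17,000: the .05-significant result now gives H0 a posterior of about .95
```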
27. Contrasting Bayes Factors (SIST p. 254)
They arise in prominent criticisms and/or reforms of significance tests.
1. Jeffreys-type prior with the “spike and slab” in a two-sided test. Here, with large enough n, a statistically significant result becomes evidence for the null; the posterior on H0 exceeds the lump prior.
2. Likelihood ratio most generous to the alternative. Here there’s a spike on a point null, to be compared to the point alternative that’s maximally likely, θmax.
3. Matching. Instead of a spike prior on the null, it uses a smooth diffuse prior. Here, the P-value “is an approximation to the posterior probability that θ < 0” (Pratt 1965, p. 182).
28. Why Blame Us Because You Can’t Agree on Your Posterior?
Stephen Senn argues, “…the reason that Bayesians can regard P-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other” (Senn 2002, p. 2442).
Senn riffs on the well-known joke of Jeffreys that we heard in 3.4 (1961, p. 385):
It would require that a procedure is dismissed [by significance testers] because, when combined with information which it doesn’t require and which may not exist, it disagrees with a [Bayesian] procedure that disagrees with itself. (Senn, ibid. p. 195)
29. Exhibit (vii). Jeffreys-Lindley ‘paradox’
A large number (n = 527,135) of independent collisions, each either of type A or type B, is used to test whether the proportion of type A collisions is exactly .2, as opposed to any other value: n Bernoulli trials, testing H0: θ = .2 vs. H1: θ ≠ .2.
The observed proportion of type A collisions is scarcely greater than the point null of .2: x̄ = k/n = .20165233, where n = 527,135 and k = 106,298.
Example from Aris Spanos (2013) (from Stone 1997).
30. The significance level against H0 is small
• The result x̄ is highly significant, even though it’s scarcely different from the point null.
The Bayes factor in favor of H0 is high
• H0 is given the spiked prior of .5, and the remaining .5 is spread equally among the values in H1.
The Bayes factor B01 = Pr(k|H0)/Pr(k|H1) = .000015394/.000001897 = 8.115.
While the likelihood of H0 in the numerator is tiny, the likelihood of H1 is even tinier.
31. Clearly, θ = .2 is more likely, and we have an example of the Jeffreys-Fisher disagreement. (SIST p. 255)
32. Compare it with the second kind of prior:
Here the Bayes factor B01 = 0.01; B10 = Lik(θmax)/Lik(.2) = 89.
Why should a result 89 times more likely under the alternative θmax than under θ = .2 be taken as strong evidence for θ = .2?
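The numbers in this exhibit can be checked directly (a minimal sketch, assuming the remaining .5 is spread uniformly over θ under H1, so that Pr(k|H1) = 1/(n+1)):

```python
# Jeffreys-Lindley Bernoulli example (Spanos 2013, after Stone 1997).
from scipy.stats import binom, norm

n, k, theta0 = 527_135, 106_298, 0.2

lik0 = binom.pmf(k, n, theta0)                # Pr(k | H0: theta = .2), about 1.5e-5
lik1 = 1.0 / (n + 1)                          # Pr(k | H1) under a uniform prior on theta, about 1.9e-6
B01 = lik0 / lik1                             # about 8.1: Bayes factor 'in favor of' H0

theta_max = k / n                             # maximally likely alternative, about .2017
B10_max = binom.pmf(k, n, theta_max) / lik0   # about 89: data far more likely under theta_max

z = (theta_max - theta0) / (theta0 * (1 - theta0) / n) ** 0.5
p_value = 2 * norm.sf(abs(z))                 # two-sided P-value against H0, about .003
print(round(B01, 1), round(B10_max), round(z, 2), round(p_value, 4))
```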
33. Contrasting Bayes Factors (SIST p. 254)
1. Jeffreys-type prior with the “spike and slab” in a two-sided test. Here, with large enough n, a statistically significant result becomes evidence for the null; the posterior on H0 exceeds the lump prior.
2. Likelihood ratio most generous to the alternative. Here there’s a spike on a point null, to be compared to the point alternative that’s maximally likely, θmax.
3. Matching. Instead of a spike prior on the null, it uses a smooth diffuse prior. Here, the P-value “is an approximation to the posterior probability that θ < 0” (Pratt 1965, p. 182).
34. Bayesian Family Feuds
It shouldn’t, according to some, including Lindley’s own student, default Bayesian José Bernardo (2010). (SIST p. 256, Note 7)
Yet it’s at the heart of recommended reforms.
First, look at SIST p. 256 on matching priors.
36. 4.5 Reforms (Redefine Significance) Based on Bayes Factor Standards
“Redefine Significance” is recent, but, like other reforms, is based on old results:
Imagine all the density under the alternative hypothesis concentrated at x, the place most favored by the data. …Even the utmost generosity to the alternative hypothesis cannot make the evidence in favor of it as strong as classical significance levels might suggest (Edwards, Lindman, and Savage 1963, p. 228).
Normal testing case of Berger and Sellke, but as a one-tailed test of H0: μ = 0 vs. H1: μ = μ1 = θmax.
We abbreviate H1 by Hmax.
37. Here the likelihood ratio Lik(θmax)/Lik(θ0) = exp[z²/2]; the inverse, Lik(θ0)/Lik(θmax), is exp[-z²/2].
What is θmax? It’s the observed mean x̄ (whatever it is), and we’re to consider x̄ = the result that is just statistically significant at the indicated P-value.
SIST p. 260 (see note #9)
39. To ensure Hmax: μ = μmax is 28 times as likely as H0: θ = θ0, you’d need to use a P-value of ~.005, i.e., a z value of 2.58.
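A quick check of these figures (a sketch, assuming the one-sided Normal test with σ known):

```python
# Likelihood ratio of the data-maximal alternative over the null: Lik(theta_max)/Lik(theta_0) = exp(z^2 / 2).
import numpy as np
from scipy.stats import norm

for z in (1.645, 1.96, 2.58):
    lr_max = np.exp(z**2 / 2)     # the comparison most generous to the alternative
    p_one_sided = norm.sf(z)
    print(f"z = {z}: one-sided P = {p_one_sided:.4f}, Lik(theta_max)/Lik(theta_0) = {lr_max:.1f}")
# z = 1.96 gives a ratio of about 7; to make H_max 28 times as likely as H0 you need z ~ 2.58 (P ~ .005).
```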
40. Valen Johnson (2013a, b): a way to bring the likelihood ratio more into line with what counts as strong evidence according to a Bayes factor.
“The posterior odds between two hypotheses H1 and H0 can be expressed as” Pr(H1|x)/Pr(H0|x) = BF10 × [Pr(H1)/Pr(H0)].
“In a Bayesian test, the null hypothesis is rejected if the posterior probability of H1 exceeds a certain threshold….” (Johnson 2013b, p. 1721)
41. • and “the alternative hypothesis is accepted if BF10 > k”
• Johnson views his method as showing how to specify an alternative hypothesis–he calls it the “implicit alternative”
• It will be Hmax
• Unlike N-P, the test does not exhaust the parameter space; it’s just two points.
42. Johnson offers an illuminating way to relate Bayes factors and standard cut-offs for rejection in UMP tests (SIST p. 262). Setting k as the Bayes factor you want, you get the corresponding cut-off for rejection by computing √(2 log k): this matches the zα corresponding to an N-P, UMP one-sided test.
The UMP test (with μ > μ0) is of the form: Reject H0 iff X̄ > x̄α, where x̄α = μ0 + zα σ/√n, which is zα σ/√n for the case μ0 = 0.
Table 4.3 (SIST p. 262); computations in note #10, p. 264.
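A minimal sketch of this correspondence (taking log as the natural log; the α values shown are the implied one-sided significance levels, not taken from the book's table):

```python
# Johnson's correspondence: a Bayes factor threshold k maps to the rejection cut-off z_alpha = sqrt(2 * ln k).
import numpy as np
from scipy.stats import norm

for k in (3, 10, 25, 50):
    z_alpha = np.sqrt(2 * np.log(k))
    alpha = norm.sf(z_alpha)            # implied one-sided significance level
    print(f"k = {k}: z_alpha = {z_alpha:.2f}, alpha = {alpha:.4f}")
# Larger k thresholds push alpha down toward .005 or lower, in line with the levels Johnson recommends.
```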
45. His approach is intended to “provide a new form of default, non-subjective Bayesian tests” (2013b, p. 1719).
It has the same rejection region as a UMP error statistical test, but to bring them into line with the BF you need a smaller α level.
Johnson recommends levels more like .01 or .005.
True, if you reach a smaller significance level, say .01 rather than .025, you may infer a larger discrepancy.
But more will fail to make it over the hurdle: the Type II error probability increases.
46. So, you get a Bayes factor and a default posterior probability. What’s not to like?
We perform our two-part criticism, based on the minimal severity requirement. (SIST p. 263)
(S-1) holds*, but (S-2) fails; the SEV is .5.
*Hmax: μ = x̄α accords with x̄α: they’re equal.
Next slide: SIST p. 263
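A minimal sketch of why the severity comes out at .5 (assuming SEV for the claim μ ≥ x̄α is computed as Pr(X̄ ≤ x̄observed; μ = x̄α) in the Normal test with σ known):

```python
# When the observed mean just equals the cut-off x_bar_alpha, the inferred 'implicit alternative'
# H_max: mu = x_bar_alpha is probed with severity only .5.
from scipy.stats import norm

mu0, sigma, n, z_alpha = 0.0, 1.0, 100, 1.96
se = sigma / n**0.5
x_alpha = mu0 + z_alpha * se      # just-significant cut-off for the observed mean
x_obs = x_alpha                   # the observed mean sits exactly at the cut-off
sev = norm.cdf((x_obs - x_alpha) / se)
print(sev)  # 0.5: half the sampling distribution under mu = x_bar_alpha exceeds what was observed
```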
50. Souvenir (R)
In Tour II you have visited the tribes who lament that P-values are sensitive to sample size (4.3), and that they exaggerate the evidence against a null hypothesis (4.4, 4.5).
Stephen Senn says “reformers” should stop deforming P-values to turn them into second-class Bayesian posterior probabilities (Senn 2015a). I agree.
51. There is an urgency here. Not only do the replacements run afoul of the minimal severity requirement; to suppose all is fixed by lowering P-values ignores the biasing selection effects at the bottom of nonreplicability.
“[I]t is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.” (Johnson 2013a, p. 19316)