This document presents four waves in the philosophy of statistics from the 1930s to the present. The first wave (~1930–1955/60) centered on debates among Fisher, Neyman, Pearson, and others over hypothesis testing and the interpretation of p-values. The second wave (~1955/60–1980) involved criticisms of Neyman-Pearson methods regarding their applicability before and after the data are obtained. The third wave (~1980–2005) explored likelihoodism and debates over the likelihood principle. The fourth wave, ongoing since 2005, continues these foundational debates.
Phil 6334 Mayo slides Day 1
1. "4 Waves in Philosophy of Statistics"
What is the Philosophy of Statistics?
At one level of analysis at least, statisticians and philosophers of science ask many of the same questions:
What should be observed, and what may justifiably be inferred from the resulting data?
How well do data confirm or fit a model?
What is a good test?
Must predictions be "novel" in some sense? (selection effects, double counting, data mining)
How can spurious relationships be distinguished from genuine regularities? From causal regularities?
How can we infer more accurate and reliable observations from less accurate ones?
When does a fitted model account for regularities in the data?
2. That these very general questions are entwined with long-standing debates in philosophy of science helps to explain why the field of statistics tends to cross over so often into philosophical territory.
That statistics is a kind of "applied philosophy of science" is not too far off the mark (Kempthorne, 1976).
3. Statistics → philosophy: 3 ways statistical accounts are used in philosophy of science
(1) Model Scientific Inference: to capture either the actual or rational ways to arrive at evidence and inference.
(2) Resolve Philosophical Problems about scientific inference, observation, experiment (problem of induction, objectivity of observation, reliable evidence, Duhem's problem, underdetermination).
(3) Perform a Metamethodological Critique: scrutinize methodological rules, e.g., accord special weight to "novel" facts, avoid ad hoc hypotheses, avoid "data mining", require randomization.
Philosophy → statistics: its central job is to help resolve the conceptual, logical, and methodological discomforts of scientists as to how to make reliable inferences despite uncertainties and errors.
In tackling the problems around which the statistics wars have been fought, I claim, one also arrives at a general account of inductive inference that solves or makes progress on the philosopher's problems of induction, objective evidence, and underdetermination.
4. History and philosophy of statistics is a huge territory marked by 70 years of debates widely known for reaching unusual heights both of passion and of technical complexity.
To get a handle on the movements and cycles without too much distortion, I propose to identify four main "battle waves":
Wave I: ~1930–1955/60
Wave II: ~1955/60–1980
Wave III: ~1980–2005
Wave IV: ~2005– (ongoing)
5. Confirmation Theory: The Search for Measures of Degree of Evidential-Relationship (E-R)
Philosophy of science in the mid-20th century, notably the 1960s and 70s, saw a resurgence of interest in solving the traditional Humean problem of induction.
Conceding that all attempts to solve the problem of induction fail, philosophers of induction turned to constructing logics of induction or confirmation theories (e.g., Carnap 1962).
The thinking was/is:
Deductive logic: rules to compute whether a conclusion is true, given the truth of a set of premises. (True)
Inductive logic or confirmation theory: would provide rules to compute the probability of a conclusion, given the truth of certain evidence statements. (?)
Having conceded loss in the battle for justifying induction, philosophers appeal to logic to capture scientific method.
6. Inductive Logics

"Confirmation Theory" (Carnap):
Rules to assign degrees of probability or confirmation to hypotheses given evidence e: C(H,e).
Inductive Logicians: we can build and try to justify "inductive logics"; the straight rule: assign degrees of confirmation/credibility.
Statistical affinity: Bayesian (and likelihoodist) accounts.

Logic of falsification / Methodological falsification (Popper):
Rules to decide when to "prefer" or accept hypotheses.
Deductive Testers: we can reject induction and uphold the "rationality" of preferring or accepting H if it is "well tested".
Statistical affinity: Fisherian, Neyman-Pearson methods: probability enters to ensure the reliability and severity of tests.
7. The goal of an inductive logic: supply means to compute the degree of evidential relationship between given evidence statements, e, and a hypothesis H, e.g., look to conditional probability or Bayes's Theorem:
P(H|e) = P(e|H)P(H)/P(e)
where P(e) = P(e|H)P(H) + P(e|not-H)P(not-H).
Computing P(H|e), the posterior probability, requires a probability assignment to all of the members of "not-H".
A major source of difficulty: how to obtain and interpret these prior probabilities.
a. If analytic and a priori, their relevance for predicting and learning about empirical phenomena is problematic.
b. If they measure subjective degrees of belief, their relevance for giving objective guarantees of reliable inference is unclear.
In statistics, (a) is analogous to "objective" Bayesianism (e.g., Jeffreys); (b) to subjective Bayesianism.
The Bayesian-frequentist controversy is one of the big topics we'll explore in this course.
7
8. A core question: What is the nature and role of probabilistic concepts, methods, and
models in making inferences in the face of limited data, uncertainty and error?
Three Roles For Probability:
degrees of confirmation, long-run error rates, degrees of well-testedness
a. To provide a post-data assignment of degree of probability, confirmation, support or
belief in a hypothesis (probabilism);
b. To ensure long-run reliability of methods (performance)
c. To determine the warrant of hypotheses by assessing how stringently or severely probed
they are (probativeness)
These three contrasting philosophies of the role of probability in statistical inference are very
much at the heart of the central points of controversy in the “four waves” of philosophy of
statistics…
8
9. I. Philosophy of Statistics: “The First Battle Wave”
WAVE I: circa 1930–1955/60:
Fisher, Neyman, Pearson, Savage, and Jeffreys.
Statistical inference tools use data x to probe aspects of the data generating source:
In statistical testing, these aspects are in terms of statistical hypotheses about parameters
governing a statistical distribution
H tells us the “probability of x under H”, written P(x;H)
(probabilistic assignments under a model)
P(H,H,T,H,T,T,T,H,H,T; fair coin) = (.5)^10
We will explain how this differs from conditional probabilities in Bayes’s rule or theorem, P(x|H).
9
10. Modern Statistics Begins with Fisher:
“Simple” Significance Tests
Fisher strongly objected to Bayesian inference, in particular to the use of prior distributions (relevant, he held, for psychology, not science).
He looked to develop ways to express the uncertainty of inferences without deviating from frequentist probabilities.
Example. Let the sample X = (X1, …, Xn) be n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ = 1.
1. A null hypothesis H0: μ = 0
e.g., 0 mean concentration of lead, no difference in mean survival in a given group, in mean risk, mean deflection of
light.
2. A function of the sample, d(X), the test statistic: which reflects the difference between the data x0 = (x1, …,xn), and H0;
The larger d(x0) the further the outcome is from what is expected under H0, with respect to the particular question being asked.
3. the p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:
p(x0)=P(d(X) > d(x0); H0)
10
11. Mini-recipe for p-value calculation:
The observed significance level (p-value): p(x0) = P(d(X) > d(x0); H0), with observed sample mean x̄ = .1.
The relevant test statistic d(X) is:
d(X) = (X̄ − μ0)/σx̄ = [Observed − Expected (under H0)]/σx̄,
where X̄ is the sample mean with standard deviation σx̄ = σ/√n.
Let n = 25. Since σx̄ = σ/√n = 1/5 = .2, d(X) = (.1 − 0) in units of σx̄ yields
d(x0) = .1/.2 = .5
Under the null, d(X) is distributed as standard Normal, denoted by d(X) ~ N(0,1).
p(x0) = P(d(X) > .5; H0) ≈ .3 (the area to the right of .5), i.e., not very significant.
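The recipe can be checked in a few lines of Python (a sketch using the slide’s numbers; scipy is assumed available):

```python
from scipy.stats import norm

# Sketch of the mini-recipe above, using the slide's numbers.
sigma, n = 1.0, 25
mu0 = 0.0                      # null hypothesis value
xbar = 0.1                     # observed sample mean
se = sigma / n ** 0.5          # sigma_xbar = 1/5 = .2

d_x0 = (xbar - mu0) / se       # test statistic: .5
p_value = norm.sf(d_x0)        # area to the right of .5 under N(0,1)
print(round(d_x0, 2), round(p_value, 2))   # 0.5 0.31
```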
11
12. Logic of Simple Significance Tests: Statistical Modus Tollens
“Every experiment may be said to exist only in order to give the facts a chance of
disproving the null hypothesis” (Fisher, 1956, p.160).
Statistical analogy to the deductively valid pattern modus tollens:
If the hypothesis H0 is correct then, with high probability, 1− p, the data would not
be statistically significant at level p.
x0 is statistically significant at level p.
____________________________________________
Thus, x0 is evidence against H0, or x0 indicates the falsity of H0.
12
13. The Alternative or “Non-Null” Hypothesis
Evidence against H0 seems to indicate evidence for some alternative.
Fisherian significance tests strictly consider only H0.
Neyman and Pearson (N-P) tests introduce an alternative H1 (even if only to serve as a direction of departure).
Example. X = (X1, …, Xn), iid Normal with σ = 1,
H0: μ = 0 vs. H1: μ > 0
Despite the bitter disputes with Fisher that were to erupt soon after ~1935, Neyman and
Pearson, at first, saw their work as merely placing Fisherian tests on firmer logical footing.
Much of Fisher’s hostility toward N-P methods reflects professional and personality conflicts
more than philosophical differences.
13
14. Neyman-Pearson (N-P) Tests
N-P hypothesis test: maps each outcome x = (x1, …,xn) into either the null hypothesis H0, or
an alternative hypothesis H1 (where the two exhaust the parameter space) to ensure the
probabilities of erroneous rejections (type I errors) and erroneous acceptances (type II errors)
are controlled at prespecified values, e.g., 0.05 or 0.01, the significance level of the test. It
also requires a sensible distance measure d(x0).
Test T+: X = (X1, …, Xn), iid Normal with σ = 1
H0: μ = μ0 vs. H1: μ > μ0
if d(x0) > c, "reject" H0 (or declare the result statistically significant at the α level)
if d(x0) ≤ c, "do not reject” or “accept" H0
e.g., c = 1.96 for α = .025
“Accept”/“reject” are uninterpreted parts of the mathematical apparatus.
14
15. Testing Errors and Error Probabilities
Type I error: Reject H0 even though H0 is true.
Type II error: Fail to reject H0 even though H0 is false.
Probability of a Type I error = P(d(X) > c; H0) ≤ α
Probability of a Type II error:
P(Test T+ does not reject H0; μ = μ1) = P(d(X) ≤ c; μ = μ1) = β(μ1), for any μ1 > μ0.
The "best" test at level α at the same time minimizes the value of β(μ1) for all μ1 > μ0, or
equivalently, maximizes the power:
POW(μ1) = P(d(X) > c; μ = μ1)
T+ is a Uniformly Most Powerful (UMP) level α test
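A short sketch of how the two error probabilities and the power are computed, assuming the earlier example’s setup (σ = 1, n = 25, so σx̄ = .2, with c = 1.96 for α = .025); the μ1 values are illustrative:

```python
from scipy.stats import norm

# Error probabilities for test T+ under an assumed setup: sigma = 1, n = 25.
se = 1.0 / 25 ** 0.5           # sigma_xbar = .2
c = 1.96                       # cut-off for alpha = .025

alpha = norm.sf(c)             # Type I error: P(d(X) > c; H0) ~ .025

def power(mu1, mu0=0.0):
    # Under mu = mu1, d(X) ~ N((mu1 - mu0)/se, 1), so
    # POW(mu1) = P(d(X) > c; mu1) = P(Z > c - (mu1 - mu0)/se)
    return norm.sf(c - (mu1 - mu0) / se)

for mu1 in (0.1, 0.2, 0.5):
    print(mu1, round(power(mu1), 3), round(1 - power(mu1), 3))  # power, beta
# e.g., POW(.2) ~ .169, so beta(.2) ~ .831
```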
15
16. Inductive Behavior Philosophy
Philosophical issues and debates arise once one begins to consider the interpretations of the formal
apparatus
‘Accept/Reject’ are identified with deciding to take specific actions, e.g., publishing a result,
announcing a new effect.
The justification for optimal tests is that:
“it may often be proved that if we behave according to such a rule ... we shall reject H when
it is true not more, say, than once in a hundred times, and in addition we may have evidence
that we shall reject H sufficiently often when it is false.”
Neyman: Tests are not rules of inductive inference but rules of behavior:
The goal is not to adjust our beliefs but rather to “adjust our behavior” to limited amounts of data
Is he just drawing a stark contrast between N-P tests and Fisherian as well as Bayesian methods?
Or is the behavioral interpretation essential to the tests?
16
17. “Inductive behavior” vs. “Inductive inference” battle
Commingles philosophical, statistical and personality clashes.
Fisher (1955) denounced the way that Neyman and Pearson transformed ‘his’ significance
tests into ‘acceptance procedures’
They’ve turned my tests into mechanical rules or ‘recipes’ for ‘deciding’ to accept or
reject statistical hypothesis H0.
The concern has more to do with speeding up production or making money than with
learning about phenomena.
N-P followers are like:
“Russians (who) are made familiar with the ideal that research in pure science can and
should be geared to technological performance, in the comprehensive organized effort of a
five-year plan for the nation.” (1955, 70)
17
18. Pearson distanced himself from Neyman’s “inductive behavior” jargon, calling it “Professor
Neyman’s field rather than mine.”
But the most impressive mathematical results were in the decision-theoretic framework of
Neyman-Pearson-Wald.
Many of the qualifications by Neyman and Pearson in the first wave are overlooked in the
philosophy of statistics literature.
Admittedly, these “evidential” practices were not made explicit *. (Had they been, the
subsequent waves of philosophy of statistics might have looked very different).
*Mayo’s goal as a graduate student.
18
19. The Second Wave: ~1955/60 -1980
“Post-data criticisms of N-P methods”:
Ian Hacking (1965) framed the main lines of criticism by philosophers: Neyman-Pearson tests are
“suitable for before-trial betting, but not for after-trial evaluation” (p. 99).
Battles: “initial precision” vs. “final precision”,
“before-data” vs. “after-data”
After the data, he claimed, the relevant measure of support is the (relative) likelihood
Two data sets x and y may afford the same "support" to H, yet warrant different
inferences [on significance test reasoning] because x and y arose from tests with
different error probabilities.
o This is just what error statisticians want!
o But (at least early on) Hacking (1965) held to the
“Law of Likelihood”: x supports hypothesis H1 more than H2 if
P(x;H1) > P(x;H2).
19
20. Yet, as Barnard notes, “there always is such a rival hypothesis: That things just had to turn
out the way they actually did”.
(H,H,T,H) is made most probable by the hypothesis that makes P(H) = 1 on trials 1, 2, and 4
(0 on trial 3).
“Best explanation”? Since such a maximally likely alternative H2 can always be
constructed, H1 may always be found less well supported, even if H1 is true—no error
control.
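A two-line check of Barnard’s point for the sequence above (all numbers follow the slide):

```python
# Likelihoods for the coin-toss example above.
# x = (H, H, T, H); under the fair coin H1, P(x; H1) = (.5)^4.
p_fair = 0.5 ** 4                # 0.0625

# The rigged alternative H2 sets P(H) = 1 on trials 1, 2, 4 and P(H) = 0 on
# trial 3, so it makes the observed sequence maximally probable.
p_rigged = 1 * 1 * 1 * 1         # P(x; H2) = 1

# By the Law of Likelihood, x "supports" H2 over H1 by a factor of:
print(p_rigged / p_fair)         # 16.0 -- even if the coin really is fair
```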
Hacking soon rejected the likelihood approach on such grounds, but likelihoodist accounts
are advocated by others—most especially philosophers (e.g., formal epistemologists).
So we will want to consider some of the problems that beset such accounts (in philosophy
and in statistics).
To begin with we’ll need to be clear on what a likelihood function is.
20
21. Perhaps THE key issue of controversy in the philosophy of statistics battles:
The (strong) likelihood principle: likelihoods suffice to convey “all that the data have to say”:
According to Bayes’s theorem, P(x|µ) ... constitutes the entire evidence of the experiment,
that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the
datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional
functions of µ (that is, constant multiples of each other), then each of the two data x and y
have exactly the same thing to say about the values of µ… (Savage 1962, p. 17.)
—the error probability statistician needs to consider, in addition, the sampling distribution of
the likelihoods.
—significance levels and other error probabilities all violate the likelihood principle (Savage
1962).
Breakthrough update: A long-held “proof” of the likelihood principle by Allan Birnbaum is
the subject of some recent work of mine: I will give a colloquium talk on this in the Philosophy
Department, May 2.
21
22. Paradox of Optional Stopping
Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule:
In Normal testing, 2-sided: H0: μ = 0 vs. H1: μ ≠ 0
Keep sampling until H0 is rejected at the .05 level
(i.e., keep sampling until |X̄| ≥ 1.96 σ/√n).
Nominal vs. Actual significance levels:
With n fixed the type 1 error probability is .05,
With this stopping rule the actual significance level differs from, and will be greater than .05.
By contrast, since likelihoods are unaffected by the stopping rule, the LP follower denies there
really is an evidential difference between the two cases (i.e., n fixed and n determined by the
stopping rule).
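The inflation of the actual significance level is easy to exhibit by simulation. A sketch, under a true null (μ = 0, σ = 1); the stopping rule and the 1.96 cut-off follow the slide, while the maximum sample size and the number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Keep sampling until |sample mean| >= 1.96/sqrt(n), up to max_n observations;
# count how often H0 gets rejected even though it is true.
def rejects(max_n=1000):
    total, n = 0.0, 0
    while n < max_n:
        total += rng.normal(0.0, 1.0)
        n += 1
        if abs(total / n) >= 1.96 / n ** 0.5:
            return True
    return False

trials = 2000
rate = sum(rejects() for _ in range(trials)) / trials
print(rate)   # well above the nominal .05, and it grows as max_n increases
```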
22
23. Intuitively: Should it matter if I decided to toss the coin 100 times and happened to get 60% heads,
or if I decided to keep tossing until I could reject at the .05 level (2-sided) and this happened to
occur on trial 100?
Should it matter if I kept going until I found statistical significance?
Error statistical principles: Yes!—penalty for perseverance!
The LP says NO!
Savage Forum 1959: Savage audaciously declares that the lesson to draw from the optional
stopping effect is that “optional stopping is no sin” so the problem must lie with the use of
significance levels. But why accept the likelihood principle (LP)? (simplicity and freedom?)
The likelihood principle emphasized in Bayesian statistics implies, … that the rules governing
when data collection stops are irrelevant to data interpretation. It is entirely appropriate to
collect data until a point has been proved or disproved (p. 193)…This irrelevance of stopping
rules to statistical inference restores a simplicity and freedom to experimental design that had
been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson)
(Edwards, Lindman, Savage 1963, p. 239).
23
24. For frequentists this only underscores the point raised years before by Pearson and Neyman:
A likelihood ratio (LR) may be a criterion of relative fit but it “is still necessary to determine its
sampling distribution in order to control the error involved in rejecting a true hypothesis,
because a knowledge of [LR] alone is not adequate to insure control of this error (Pearson and
Neyman, 1930, p. 106).
The key difference: likelihood fixes the actual outcome, i.e., just d(x0), while error statistics
considers outcomes other than the one observed in order to assess the error properties of the method.
The LP entails the irrelevance of, and no control over, error probabilities.
("why you cannot be just a little bit Bayesian" EGEK 1996)
EGEK: Error and the Growth of Experimental Knowledge (Mayo 1996)
24
25. The Statistical Significance Test Controversy
(Morrison and Henkel, 1970) – contributors chastise social scientists for slavish use of
significance tests
o Focus on simple Fisherian significance tests
o Philosophers direct criticisms mostly to N-P tests.
Fallacies of Rejection: Statistical vs. Substantive Significance
(i) take statistical significance as evidence of a substantive theory that explains the effect
(ii) infer a discrepancy from the null beyond what the test warrants
(i) Paul Meehl: It is fallacious to go from a statistically significant result, e.g., at the .001
level, to infer that “one’s substantive theory T, which entails the [statistical] alternative H1, has
received .. quantitative support of magnitude around .999”
A statistically significant difference (e.g., in child rearing) is not automatically evidence for a
Freudian theory.
Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have
to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon
called a highly improbable coincidence” (Meehl and Waller 2002, 184) (“damn coincidence”)
25
26. Fallacies of rejection:
(i) Take statistical significance as evidence of substantive theory that explains the effect
(ii) Infer a discrepancy from the null beyond what the test warrants
Finding a statistically significant effect, d(x0)> c (cut-off for rejection) need not be
indicative of large or meaningful effect sizes — test too sensitive
Large n Problem: an α-significant rejection of H0 can be very probable, even with a
substantively trivial discrepancy from H0.
This is often taken as a criticism because it is assumed that statistical significance at a given
level is more evidence against the null the larger the sample size (n)—fallacy!
"The thesis implicit in the [NP] approach [is] that a hypothesis may be rejected with increasing
confidence or reasonableness as the power of the test increases” (Howson and Urbach 1989 and
later editions)
In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller
sample size.
This also comes in the form of the “Jeffreys-Good-Lindley” paradox:
Even a highly statistically significant result can, with n sufficiently large, correspond to a high
posterior probability on the null hypothesis.
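A quick way to see the large-n problem: holding the cut-off fixed, compute the sample mean just needed to reach significance at various n (a sketch assuming σ = 1; the n values are illustrative):

```python
# Sketch of the large-n problem: the sample mean needed to just reach
# statistical significance at a fixed level shrinks as n grows (sigma = 1).
c = 1.96   # e.g., the two-sided .05 cut-off
for n in (25, 100, 400, 10000):
    xbar_cutoff = c * 1.0 / n ** 0.5
    print(n, round(xbar_cutoff, 4))
# 25 -> .392, 100 -> .196, 400 -> .098, 10000 -> .0196:
# the same "significant" verdict corresponds to ever smaller discrepancies.
```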
26
27. Fallacy of Non-Statistically Significant Results
Test T fails to reject the null, when the test statistic fails to reach the cut-off point for
rejection, i.e., d(x0) ≤ c.
A classic fallacy is to construe such a “negative” result as evidence for the correctness of the null
hypothesis (common in risk assessment contexts).
“No evidence against” is not “evidence for”
Merely surviving the statistical test is too easy, occurs too frequently, even when the null is false.
—results from tests lacking sufficient sensitivity or power.
The Power Analytic Movement of the 60’s in psychology
Jacob Cohen: By considering ahead of time the Power of the test, select a test capable of detecting
discrepancies of interest.
(Power is a feature of N-P tests, but apparently the prevalence of Fisherian tests in the social
sciences, coupled, perhaps, with the difficulty in calculating power, resulted in ignoring power)
A multitude of tables were supplied (Cohen, 1988), but until his death he bemoaned their all-too-rare use.
27
28. Post-data use of power to avoid fallacies of insensitive tests
If there is a low probability of a statistically significant result even when a non-trivial discrepancy
is present (low power against non-trivial discrepancies), then a non-significant difference is not good
evidence that a non-trivial discrepancy is absent.
This still retains an unacceptable coarseness: power is always calculated relative to the cut-off
point c for rejecting H0.
We will introduce you to a way of retaining the main logic but in a data-dependent use of power.
Rather than calculating
(1) P(d(X) > c; μ = .2) [power]
one should calculate
(2) P(d(X) > d(x0); μ = .2) [observed power (severity)]
Even if (1) is low, (2) may be high. We return to this in the developments of Wave III.
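A numerical sketch of the contrast, carrying over the earlier example’s numbers as an assumption for illustration (σ = 1, n = 25 so σx̄ = .2, observed d(x0) = .5, and μ = .2 as on the slide):

```python
from scipy.stats import norm

# (1) power at the cut-off vs. (2) power at the observed result.
se, c, d_obs, mu1 = 0.2, 1.96, 0.5, 0.2
shift = mu1 / se                         # (mu1 - 0)/sigma_xbar = 1

pow_at_cutoff = norm.sf(c - shift)       # (1) P(d(X) > c; mu = .2) ~ .17
attained = norm.sf(d_obs - shift)        # (2) P(d(X) > d(x0); mu = .2) ~ .69
print(round(pow_at_cutoff, 2), round(attained, 2))
# (1) is low while (2) is high: on reading (2), the non-significant result is
# good grounds for inferring mu <= .2, which the coarse calculation (1) misses.
```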
28
29. III. The Third Wave: Relativism, Reformulations, Reconciliations ~1980-2005+
Rational Reconstruction and Relativism in Philosophy of Science
With Kuhnian battles over the very idea of a unified method of scientific inference, statistical inference became less
prominent in philosophy
— largely used in rational reconstructions of scientific episodes,
— in appraising methodological rules,
— in classic philosophical problems e.g., Duhem’s problem—reconstruct a given assignment of blame so as to
be “warranted” by Bayesian probability assignments.
The problem with such reconstructions: their normative force.
Given the recognition that science involves subjective judgments and values, reconstructions often appeal to a
subjective Bayesian account (Salmon’s “Tom Kuhn Meets Tom Bayes”).
(Kuhn thought this was confused: no reason to suppose an algorithm remains through theory change)
Naturalisms, HPS —immersed in biology, psychology, etc., philosophers of science recoil from
unified inferential accounts.
Achinstein (2001): “scientists do not and should not take such philosophical accounts of evidence
seriously” (p. 9).
They are a priori while they should be empirical; but being empirical is not enough ….
29
30. Wave III in Scientific Practice: Still operative
— Statisticians turn to eclecticism.
— Non-statistician practitioners (e.g., in psychology, ecology, medicine) bemoan “unholy
hybrids” (the New Hybridists):
A mixture of ideas from N-P methods, Fisherian tests, and Bayesian accounts that is
“inconsistent from both perspectives and burdened with conceptual confusion” (Gigerenzer, 1993,
p. 323).
Faced with foundational questions, non-statistician practitioners raise anew the questions
from the first and second waves.
Finding the automaticity and fallacies still rampant, many call on an outright “ban” on
significance tests in research, or at least insist on reforms and reformulations of statistical
tests.
Task Force to consider a Test Ban in Psychology: 1990s
(They didn’t ban them, but the episode has continued to give fodder for reforms, e.g., confidence
interval estimation. Fine, but the same fallacies are often committed using confidence intervals.)
30
31. Reforms and Reinterpretations Within Error Probability Statistics
Any adequate reformulation must:
(i) show how to avoid classic fallacies (of acceptance and of rejection)—on principled
grounds,
(ii) show that it provides an account of inductive inference.
31
32. Avoiding Fallacies
We will discuss attempts to avoid fallacies of acceptance and rejection (e.g., using confidence
interval estimates).
Move away from coarse accept/reject rule; use specific result (significant or insignificant) to infer
those discrepancies from the null that are well ruled-out, and those which are not.
e.g., Interpretation of Non-Significant results:
If d(x0) is not statistically significant, and the test had a very high probability of a
more statistically significant difference if µ > µ0 + γ, then d(x0) is good grounds for
inferring µ ≤ µ0 + γ.
Use the specific outcome to infer an upper bound:
µ ≤ µ* (values beyond µ* are ruled out with the given severity).
32
33. Takes us back to the post-data version of power:
Rather than construe “a miss as good as a mile”, parity of logic suggests that the post-data
power assessment should replace the usual calculation of power against µ1:
POW(µ1) = P(d(X) > c; µ = µ1),
with what might be called the power actually attained or, to have a distinct term, the severity
(SEV):
SEV(µ < µ1) = P(d(X) > d(x0); µ = µ1),
where d(x0) is the observed (non-statistically significant) result.
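A sketch of the severity computation, again assuming the earlier example’s setup (σ = 1, n = 25 so σx̄ = .2, µ0 = 0, observed x̄ = .1; the µ1 grid is illustrative):

```python
from scipy.stats import norm

# SEV(mu < mu1) = P(d(X) > d(x0); mu = mu1), under the assumed setup above.
se, xbar = 0.2, 0.1
d_obs = xbar / se                        # observed d(x0) = .5

def sev_upper(mu1):
    # Under mu = mu1, d(X) ~ N(mu1/se, 1)
    return norm.sf(d_obs - mu1 / se)

for mu1 in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(mu1, round(sev_upper(mu1), 2))
# SEV(mu < .2) ~ .69; SEV(mu < .43) ~ .95, i.e., values of mu beyond
# xbar + 1.645*se ~ .43 are ruled out with severity .95.
```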
33
34. Fallacies of Rejection: The Large n-Problem
While with a nonsignificant result, the concern is erroneously inferring that a discrepancy from µ0
is absent;
With a significant result x0, the concern is erroneously inferring that it is present.
Utilizing the severity assessment: an α-significant difference with sample size n1 passes µ > µ1 less
severely than with n2, where n1 > n2.
(What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or
one so insensitive that it doesn’t go off unless the house is fully ablaze. The larger sample size is
like the one that goes off with burnt toast.)
In this way we solve the problems of tests too sensitive or not sensitive enough, but there’s one
more thing ... showing how it supplies an account of inductive inference.
Many argue in Wave III that error statistical methods cannot supply an account of inductive
inference because error probabilities conflict with posterior probabilities.
34
35. P-values vs Bayesian Posteriors
A statistically significant difference from H0 can correspond to large posteriors in H0
From the Bayesian perspective, it follows that p-values come up short as a measure of
inductive evidence,
the significance testers balk at the fact that the recommended priors result in highly
significant results being construed as no evidence against the null — or even evidence for
it!
The conflict often considers the two-sided test:
H0: µ = µ0 versus H1: µ ≠ µ0.
(The difference between p-values and posteriors is far less marked with one-sided tests.)
“Assuming a prior of .5 to H0, with n = 50 one can classically ‘reject H0 at significance level p =
.05,’ although P(H0|x) = .52 (which would actually indicate that the evidence favors H0).”
This is taken as a criticism of p-values only because it is assumed the .52 posterior is the
appropriate measure of belief-worthiness.
As the sample size increases, the conflict becomes more noteworthy.
35
36. If n = 1000, a result statistically significant at the .05 level leads to a posterior to the null of .82!
SEV (H1) = .95 while the corresponding posterior has gone
from .5 to .82. What warrants such a prior?
Posterior P(H0|x) as a function of sample size n (prior P(H0) = .5):

p       t       n=10    n=20    n=50    n=100   n=1000
.10     1.645   .47     .56     .65     .72     .89
.05     1.960   .37     .42     .52     .60     .82
.01     2.576   .14     .16     .22     .27     .53
.001    3.291   .024    .026    .034    .045    .124
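The table’s entries can be generated (up to small rounding differences) under one standard assumed setup: a point prior P(H0) = .5, with the remaining .5 spread as a Normal prior centered at the null under the alternative, as in Jeffreys-style tests. A sketch of that calculation, not necessarily the exact computation behind the slide:

```python
from math import exp, sqrt

# Assumed Jeffreys-style setup: P(H0) = .5, N(mu0, sigma^2) prior under H1.
def posterior_null(t, n):
    # Bayes factor in favor of H0 for a two-sided Normal test
    b01 = sqrt(1 + n) * exp(-t ** 2 * n / (2 * (n + 1)))
    return b01 / (1 + b01)       # posterior, with prior P(H0) = .5

for p, t in [(.10, 1.645), (.05, 1.960), (.01, 2.576), (.001, 3.291)]:
    row = [round(posterior_null(t, n), 2) for n in (10, 20, 50, 100, 1000)]
    print(p, row)
# e.g., p = .05, n = 50 gives .52; n = 1000 gives .82, as in the table.
```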
(1) Some claim the prior of .5 is a warranted frequentist assignment:
H0 was randomly selected from an urn in which 50% are true
(*) Therefore P(H0) = .5
H0 may be 0 change in extinction rates, 0 lead concentration, etc.
What should go in the urn of hypotheses?
For the frequentist: either H0 is true or false, the probability in (*) is fallacious and results from an
unsound instantiation.
We are very interested in how false it might be, which is what we can assess by means of a
severity assessment.
36
37. (2) Subjective degree of belief assignments will not ensure the error probability, and thus the
severity, assessments we need.
(3) Some suggest an “impartial” or “uninformative” Bayesian prior gives .5 to H0, the remaining .5
probability being spread out over the alternative parameter space, e.g., Jeffreys.
This “spiked concentration of belief in the null” is at odds with the prevailing view “we know all
nulls are false”.
37
38. Wave IV (~2005– ): Recapitulation of previous waves + new challenges to the reliability of science
A. Contemporary “Impersonal” Bayesianism: In the Bayes vs frequentist wars, the impersonal
Bayesian tries to have frequentist guarantees
Because of the difficulty of eliciting subjective priors, and because of the reluctance among
scientists to allow subjective beliefs to be conflated with the information provided by data,
much current Bayesian work in practice favors conventional “default”, “uninformative,” or
“reference” priors.
We may call them “conventional” Bayesians
The conventional Bayesians abandon coherence, the LP, and strive to match frequentist error
probabilities!
38
39. Some questions for “reference” Bayesians
1. What do reference posteriors measure?
A classic conundrum: there is no unique “noninformative” prior. (Supposing there is
one leads to inconsistencies in calculating posterior marginal probabilities).
Any representation of ignorance or lack of information that succeeds for one
parameterization will, under a different parameterization, entail having knowledge.
The conventional prior is said to be simply something that allows computing the
posterior; otherwise undefined, the priors are weights of some sort.
Not to be considered expressions of uncertainty, ignorance, or degree of belief.
May not even be probabilities; flat priors may not integrate to one (improper priors). If priors
are not probabilities, what then is the interpretation of a posterior?
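A quick illustration of the reparameterization conundrum; the uniform-prior-on-a-probability example is my own choice, picked for simplicity:

```python
import numpy as np

# A "flat" (uniform) prior on a probability p is far from flat on a
# transformed parameter, here the odds p/(1-p).
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=100_000)   # "noninformative" flat prior on p
odds = p / (1 - p)

# Under the induced prior, the odds are heavily skewed toward small values:
print(np.quantile(odds, [0.25, 0.5, 0.75]))   # roughly [0.33, 1.0, 3.0]
# "Ignorance" about p amounts to definite (skewed) opinions about the odds.
```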
39
2. Priors for the same hypothesis change according to which experiment is to be done!
Bayesianly incoherent!
If the prior is to represent information, why should it be influenced by the sample space of a
contemplated experiment?
Violates the likelihood principle — the cornerstone of Bayesian coherency
Conventional Bayesians: it is “the price” of objectivity.
Seems to wreak havoc with basic Bayesian foundations, but without the payoff of an
objective, interpretable output—even subjective Bayesians object.
3. Reference posteriors with good frequentist properties
Reference priors are touted as having some good frequentist properties, at least in one-dimensional problems.
They are deliberately designed to match frequentist error probabilities.
If you want error probabilities, why not use techniques that provide them directly?
By the way, using conditional probability—which is part and parcel of probability theory (as in
“Bayes nets”, etc.)—in no way makes one a Bayesian: no priors to hypotheses….
40
B. Brand new sets of crises (so new we have barely started writing on them):
Research implicating statistical methods with pseudoscience, fraud, unreplicable results.
Origins?
Controversies about models in economics, climate change, medicine?
Economic downturn (do open-source journals demand sexy results?); big data
makes it easy to “cherry pick” and data mine to get ad hoc models and unreliable
results?
Use of Bayesian statistics?
Use of frequentist statistics?
Computerized data analysis?
Regardless of the source, it’s resulted in one of the hottest topics in science (that
philosophers should be involved in).
41
42. Different forms:
(i) Science-wise false discovery rates.
Given type 1 and type 2 error probabilities, and an assumption of the proportion
of false hypotheses studied, it is argued that most statistically significant
“discoveries” are false (a minimal version of the calculation is sketched after this list).
Stems from large-scale screening in bioinformatics.
Not based on real data, but on conjecture and simulation.
(ii) Journal practices: attention-getting articles with eye-catching, but inadequately
scrutinized, conjectures. (Stapel in social psychology, who collected no data.)
(iii) Unthinking uses of statistics (previous waves I-III)
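The science-wise false discovery rate argument in (i) reduces to simple arithmetic; a sketch with illustrative inputs (the α, power, and proportion-of-true-nulls values are assumptions, not data):

```python
# Minimal science-wise false discovery rate calculation, with assumed inputs.
alpha, power, prop_null = 0.05, 0.8, 0.9   # illustrative values only

false_discoveries = alpha * prop_null          # true nulls that get rejected
true_discoveries = power * (1 - prop_null)     # false nulls that get rejected
fdr = false_discoveries / (false_discoveries + true_discoveries)
print(round(fdr, 2))   # 0.36: over a third of "discoveries" false, on these inputs
```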
42
We can’t possibly cover the tremendous number of important issues, let alone
readings; we were sorely tempted to include many “greats” that we have had to
omit to avoid overwhelming you. But you will, by the end of the course, have a
basic methodological framework within which current methodological problems
may be understood and addressed.
43