Byrd: Statistical considerations of the histomorphometric test protocol
"Statistical considerations of the histomorphometric test protocol"
John E. Byrd, Ph.D. D-ABFA
Maria-Teresa Tersigni-Tarrant, Ph.D.
Central Identification Laboratory
JPAC
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
D. Mayo's comments on Nancy Reid's "BFF Four-Are we Converging?" given May 2, 2017 at The Fourth Bayesian, Fiducial and Frequentist Workshop held at Harvard University.
Today we’ll try to cover a number of things:
1. Learning philosophy/philosophy of statistics
2. Situating the broad issues within philosophy of science
3. Little bit of logic
4. Probability and random variables
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo discusses various disputes, notably the replication crisis in science, in the context of her just-released book: Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars.
Replication Crises and the Statistics Wars: Hidden Controversies
D. Mayo presentation at the X-Phil conference on "Reproducibility and Replicability in Psychology and Experimental Philosophy", University College London (June 14, 2018)
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
The constructive role of replication crises teaches a lot about: 1) non-fallacious uses of statistical tests, 2) the rationale for the role of probability in tests, and 3) how to reformulate tests.
D. G. Mayo: Your data-driven claims must still be probed severely
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science" at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics.
Deborah G. Mayo: Is the Philosophy of Probabilism an Obstacle to Statistical Fraud Busting?
Presentation slides for: Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge[*] at the Boston Colloquium for Philosophy of Science (Feb 21, 2014).
Severe Testing: The Key to Error Correction
D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
D. Mayo: Philosophical Interventions in the Statistics Wars
ABSTRACT: While statistics has a long history of passionate philosophical controversy, the last decade especially cries out for philosophical illumination. Misuses of statistics, Big Data dredging, and P-hacking make it easy to find statistically significant, but spurious, effects. This obstructs a test's ability to control the probability of erroneously inferring effects–i.e., to control error probabilities. Disagreements about statistical reforms reflect philosophical disagreements about the nature of statistical inference–including whether error probability control even matters! I describe my interventions in statistics in relation to three events. (1) In 2016 the American Statistical Association (ASA) met to craft principles for avoiding misinterpreting P-values. (2) In 2017, a "megateam" (including philosophers of science) proposed "redefining statistical significance," replacing the common threshold of P ≤ .05 with P ≤ .005. (3) In 2019, an editorial in the main ASA journal called for abandoning all P-value thresholds, and even the words "significant/significance".
A word on each. (1) Invited to be a "philosophical observer" at their meeting, I found the major issues were conceptual. P-values measure how incompatible data are with what is expected under a hypothesis that there is no genuine effect: the smaller the P-value, the more indication of incompatibility. The ASA list of familiar misinterpretations–P-values are not posterior probabilities, statistical significance is not substantive importance, no evidence against a hypothesis need not be evidence for it–should not, I argue, be the basis for replacing tests with methods less able to assess and control erroneous interpretations of data (Mayo 2016, 2019). (2) The "redefine statistical significance" movement appraises P-values from the perspective of a very different quantity: a comparative Bayes Factor. Failing to recognize how contrasting approaches measure different things, disputants often talk past each other (Mayo 2018). (3) To ban P-value thresholds, even to distinguish terrible from warranted evidence, I say, is a mistake (2019). It will not eradicate P-hacking, but it will make it harder to hold P-hackers accountable. A 2020 ASA Task Force on significance testing has just been announced. (I would like to think my blog errorstatistics.com helped.)
To enter the fray between rival statistical approaches, it helps to have a principle applicable to all accounts. There's poor evidence for a claim if little if anything has been done to find it flawed, even if it is. This forms a basic requirement for evidence I call the severity requirement. A claim passes with severity only if it is subjected to, and passes, a test that probably would have found it flawed, if it were. It stems from Popper, though he never adequately cashed it out. A variant is the frequentist principle of evidence developed with Sir David Cox (Mayo and Cox 20
Controversy Over the Significance Test Controversy
Deborah Mayo (Professor of Philosophy, Virginia Tech, Blacksburg, Virginia) in PSA 2016 Symposium: Philosophy of Statistics in the Age of Big Data and Replication Crises
Slides from Deborah G. Mayo's talk at the Minnesota Center for Philosophy of Science, University of Minnesota, on the ASA 2016 statement on P-values and Error Statistics.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., pre-registration), while others are quite radical. The American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Crowd-sourced reproducibility research in psychology is helping to change the reward structure but has its own shortcomings. Focusing on purely statistical considerations, it tends to overlook problems with artificial experiments. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
“The importance of philosophy of science for statistical science and vice versa”
My paper “The importance of philosophy of science for statistical science and vice versa” presented (zoom) at the conference: IS PHILOSOPHY USEFUL FOR SCIENCE, AND/OR VICE VERSA? January 30 - February 2, 2024 at Chapman University, Schmid College of Science and Technology.
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo (Virginia Tech) slides from her talk June 3 at the "Preconference Workshop on Replication in the Sciences" at the 2015 Society for Philosophy and Psychology meeting.
The Statistics Wars: Errors and Casualties
ABSTRACT: Mounting failures of replication in the social and biological sciences give a new urgency to critically appraising proposed statistical reforms. While many reforms are welcome (preregistration of experiments, replication, discouraging cookbook uses of statistics), there have been casualties. The philosophical presuppositions behind the meta-research battles remain largely hidden. Too often the statistics wars have become proxy wars between competing tribe leaders, each keen to advance one or another tool or school, rather than build on efforts to do better science. Efforts of replication researchers and open science advocates are diminished when so much attention is centered on repeating hackneyed howlers of statistical significance tests (statistical significance isn’t substantive significance, no evidence against isn’t evidence for), when erroneous understanding of basic statistical terms goes uncorrected, and when bandwagon effects lead to popular reforms that downplay the importance of error probability control. These casualties threaten our ability to hold accountable the “experts,” the agencies, and all the data handlers increasingly exerting power over our lives.
“The Statistical Replication Crisis: Paradoxes and Scapegoats”
D. G. Mayo LSE Popper talk, May 10, 2016.
Abstract: Mounting failures of replication in the social and biological sciences give a practical spin to statistical foundations in the form of the question: How can we attain reliability when Big Data methods make illicit cherry-picking and significance seeking so easy? Researchers, professional societies, and journals are increasingly getting serious about methodological reforms to restore scientific integrity – some are quite welcome (e.g., preregistration), while others are quite radical. Recently, the American Statistical Association convened members from differing tribes of frequentists, Bayesians, and likelihoodists to codify misuses of P-values. Largely overlooked are the philosophical presuppositions of both criticisms and proposed reforms. Paradoxically, alternative replacement methods may enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and other biasing selection effects. Popular appeals to “diagnostic testing” that aim to improve replication rates may (unintentionally) permit the howlers and cookbook statistics we are at pains to root out. Without a better understanding of the philosophical issues, we can expect the latest reforms to fail.
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
D. G. Mayo April 28, 2021 presentation to the CUNY Graduate Center Philosophy Colloquium "Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)"
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
A talk given by Deborah G. Mayo (Dept of Philosophy, Virginia Tech) to the Seminar in Advanced Research Methods at the Dept of Psychology, Princeton University, on November 14, 2023.
TITLE: Statistical Inference as Severe Testing: Beyond Probabilism and Performance
ABSTRACT: I develop a statistical philosophy in which error probabilities of methods may be used to evaluate and control the stringency or severity of tests. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. The severe-testing requirement leads to reformulating statistical significance tests to avoid familiar criticisms and abuses. While high-profile failures of replication in the social and biological sciences stem from biasing selection effects—data dredging, multiple testing, optional stopping—some reforms and proposed alternatives to statistical significance tests conflict with the error control that is required to satisfy severity. I discuss recent arguments to redefine, abandon, or replace statistical significance.
D. Mayo (Dept of Philosophy, VT)
Sir David Cox’s Statistical Philosophy and Its Relevance to Today’s Statistical Controversies
ABSTRACT: This talk will explain Sir David Cox's views of the nature and importance of statistical foundations and their relevance to today's controversies about statistical inference, particularly in using statistical significance testing and confidence intervals. Two key themes of Cox's statistical philosophy are: first, the importance of calibrating methods by considering their behavior in (actual or hypothetical) repeated sampling, and second, ensuring the calibration is relevant to the specific data and inquiry. A question that arises is: How can the frequentist calibration provide a genuinely epistemic assessment of what is learned from data? Building on our jointly written papers, Mayo and Cox (2006) and Cox and Mayo (2010), I will argue that relevant error probabilities may serve to assess how well-corroborated or severely tested statistical claims are.
Nancy Reid, Dept. of Statistics, University of Toronto. Inaugural recipient of the "David R. Cox Foundations of Statistics Award".
Slides from Invited presentation at 2023 JSM: “The Importance of Foundations in Statistical Science“
Ronald Wasserstein, Chair (American Statistical Association)
ABSTRACT: David Cox wrote “A healthy interplay between theory and application is crucial for statistics… This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods.” These foundations distinguish statistical science from the many fields of research in which statistical thinking is a key intellectual component. In this talk I will emphasize the ongoing importance and relevance of theoretical advances and theoretical thinking through some illustrative examples.
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role that scientists look to statistics to play. In order to combat the dangers of unthinking bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed “reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Causal inference is not statistical inference
Jon Williamson (University of Kent)
ABSTRACT: Many methods for testing causal claims are couched as statistical methods: e.g., randomised controlled trials, various kinds of observational study, meta-analysis, and model-based approaches such as structural equation modelling and graphical causal modelling. I argue that this is a mistake: causal inference is not a purely statistical problem. When we look at causal inference from a general point of view, we see that methods for causal inference fit into the framework of Evidential Pluralism: causal inference is properly understood as requiring mechanistic inference in addition to statistical inference.
Evidential Pluralism also offers a new perspective on the replication crisis. That observed associations are not replicated by subsequent studies is a part of normal science. A problem only arises when those associations are taken to establish causal claims: a science whose established causal claims are constantly overturned is indeed in crisis. However, if we understand causal inference as involving mechanistic inference alongside statistical inference, as Evidential Pluralism suggests, we avoid fallacious inferences from association to causation. Thus, Evidential Pluralism offers the means to prevent the drama of science from turning into a crisis.
Stephan Guttinger (Lecturer in Philosophy of Data/Data Ethics, University of Exeter, UK)
ABSTRACT: The idea of “questionable research practices” (QRPs) is central to the narrative of a replication crisis in the experimental sciences. According to this narrative the low replicability of scientific findings is not simply due to fraud or incompetence, but in large part to the widespread use of QRPs, such as “p-hacking” or the lack of adequate experimental controls. The claim is that such flawed practices generate flawed output. The reduction – or even elimination – of QRPs is therefore one of the main strategies proposed by policymakers and scientists to tackle the replication crisis.
What counts as a QRP, however, is not clear. As I will discuss in the first part of this paper, there is no consensus on how to define the term, and ascriptions of the qualifier “questionable” often vary across disciplines, time, and even within single laboratories. This lack of clarity matters as it creates the risk of introducing methodological constraints that might create more harm than good. Practices labelled as ‘QRPs’ can be both beneficial and problematic for research practice and targeting them without a sound understanding of their dynamic and context-dependent nature risks creating unnecessary casualties in the fight for a more reliable scientific practice.
To start developing a more situated and dynamic picture of QRPs I will then turn my attention to a specific example of a dynamic QRP in the experimental life sciences, namely, the so-called “Far Western Blot” (FWB). The FWB is an experimental system that can be used to study protein-protein interactions but which for most of its existence has not seen a wide uptake in the community because it was seen as a QRP. This was mainly due to its (alleged) propensity to generate high levels of false positives and negatives. Interestingly, however, it seems that over the last few years the FWB slowly moved into the space of acceptable research practices. Analysing this shift and the reasons underlying it, I will argue a) that suppressing this practice deprived the research community of a powerful experimental tool and b) that the original judgment of the FWB was based on a simplistic and non-empirical assessment of its error-generating potential. Ultimately, it seems like the key QRP at work in the FWB case was the way in which the label “questionable” was assigned in the first place. I will argue that findings from this case can be extended to other QRPs in the experimental life sciences and that they point to a larger issue with how researchers judge the error-potential of new research practices.
David Hand (Professor Emeritus and Senior Research Investigator, Department of Mathematics, Faculty of Natural Sciences, Imperial College London)
ABSTRACT: Science progresses through an iterative process of formulating theories and comparing them with empirical real-world data. Different camps of scientists will favour different theories, until accumulating evidence renders one or more untenable. Not unnaturally, people become attached to theories. Perhaps they invented a theory, and kudos arises from being the originator of a generally accepted theory. A theory might represent a life's work, so that being found wanting might be interpreted as failure. Perhaps researchers were trained in a particular school, and acknowledging its shortcomings is difficult. Because of this, tensions can arise between proponents of different theories.
The discipline of statistics is susceptible to precisely the same tensions. Here, however, the tensions are not between different theories of "what is", but between different strategies for shedding light on the real world from limited empirical data. This can be in the form of how one measures discrepancy between the theory's predictions and observations. It can be in the form of different ways of looking at empirical results. It can be, at a higher level, because of differences between what is regarded as important in a particular context. Or it can be for other reasons.
Perhaps the most familiar example of this tension within statistics is between different approaches to inference. However, there are many other examples of such tensions. This paper illustrates with several examples. We argue that the tension generally arises as a consequence of inadequate care being taken in question formulation. That is, insufficient thought is given to deciding exactly what one wants to know - to determining "What is the question?".
The ideas and disagreements are illustrated with several examples.
The neglected importance of complexity in statistics and Metascience
Daniele Fanelli (London School of Economics Fellow in Quantitative Methodology, Department of Methodology, London School of Economics and Political Science)
ABSTRACT: Statistics is at war, and Metascience is ailing. This is partially due, the talk will argue, to a paradigmatic blind-spot: the assumption that one can draw general conclusions about empirical findings without considering the role played by context, conditions, assumptions, and the complexity of methods and theories. Whilst ideally these particularities should be unimportant in science, in practice they cannot be neglected in most research fields, let alone in research-on-research.
This neglected importance of complexity is supported by theoretical arguments and empirical findings (or the lack thereof) in the recent meta-analytical and metascientific literature. The talk will overview this background and suggest how the complexity of theories and methodologies may be explicitly factored into particular methodologies of statistics and Metaresearch. The talk will then give examples of how this approach may usefully complement existing paradigms, by translating results, methods and theories into quantities of information that are evaluated using an information-compression logic.
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Uri Simonsohn (Professor, Department of Operations, Innovation and Data Sciences at Esade)
ABSTRACT: The statistical tools listed in the title share that a mathematically elegant solution has become the consensus advice of statisticians, methodologists and some mathematically sophisticated researchers writing tutorials and textbooks, and yet they lead research workers to meaningless answers that are often also statistically invalid. Part of the problem is that advice givers take the mathematical abstractions of the tools they advocate for literally, instead of taking the actual behavior of researchers seriously.
On Severity, the Weight of Evidence, and the Relationship Between the Two
Margherita Harris
Visiting fellow in the Department of Philosophy, Logic and Scientific Method at the London School of Economics and Political Science.
ABSTRACT: According to the severe tester, one is justified in declaring to have evidence in support of a hypothesis just in case the hypothesis in question has passed a severe test, one that it would be very unlikely to pass so well if the hypothesis were false. Deborah Mayo (2018) calls this the strong severity principle. The Bayesian, however, can declare to have evidence for a hypothesis despite not having done anything to test it severely. The core reason for this has to do with the (infamous) likelihood principle, whose violation is not an option for anyone who subscribes to the Bayesian paradigm. Although the Bayesian is largely unmoved by the incompatibility between the strong severity principle and the likelihood principle, I will argue that the Bayesian’s never-ending quest to account for yet another notion, one that is often attributed to Keynes (1921) and that is usually referred to as the weight of evidence, betrays the Bayesian’s confidence in the likelihood principle after all. Indeed, I will argue that the weight of evidence and severity may be thought of as two (very different) sides of the same coin: they are two unrelated notions, but what brings them together is the fact that they both make trouble for the likelihood principle, a principle at the core of Bayesian inference. I will relate this conclusion to current debates on how best to conceptualise uncertainty, by the IPCC in particular. I will argue that failure to fully grasp the limitations of an epistemology that envisions the role of probability to be that of quantifying the degree of belief to assign to a hypothesis given the available evidence can be (and has been) detrimental to an adequate communication of uncertainty.
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Aris Spanos (Wilson Schmidt Professor of Economics, Virginia Tech)
ABSTRACT: The discussion places the two cultures, the model-driven statistical modeling and the algorithm-driven modeling associated with Machine Learning (ML) and Statistical Learning Theory (SLT), in a broader context of paradigm shifts in 20th-century statistics, which includes Fisher’s model-based induction of the 1920s and variations/extensions thereof, the Data Science (ML, SLT, etc.) and the Graphical Causal modeling in the 1990s. The primary objective is to compare and contrast the effectiveness of different approaches to statistics in learning from data about phenomena of interest and relate that to the current discussions pertaining to the statistics wars and their potential casualties.
Comparing Frequentist and Bayesian Control of Multiple Testing
James Berger
ABSTRACT: A problem that is common to many sciences is that of having to deal with a multiplicity of statistical inferences. For instance, in GWAS (Genome Wide Association Studies), an experiment might consider 20 diseases and 100,000 genes, and conduct statistical tests of the 20 × 100,000 = 2,000,000 null hypotheses that a specific disease is associated with a specific gene. The issue is that selective reporting of only the ‘highly significant’ results could lead to many claimed disease/gene associations that turn out to be false, simply because of statistical randomness. In 2007, the seriousness of this problem was recognized in GWAS and extremely stringent standards were employed to resolve it. Indeed, it was recommended that tests for association should be conducted at an error probability of 5 × 10⁻⁷. Particle physicists similarly learned that a discovery would be reliably replicated only if the p-value of the relevant test was less than 5.7 × 10⁻⁷. This was because they had to account for a huge number of multiplicities in their analyses. Other sciences have continuing issues with multiplicity. In the Social Sciences, p-hacking and data dredging are common, which involve multiple analyses of data. Stopping rules in social sciences are often ignored, even though it has been known since 1933 that, if one keeps collecting data and computing the p-value, one is guaranteed to obtain a p-value less than 0.05 (or, indeed, any specified value), even if the null hypothesis is true. In medical studies that occur with strong oversight (e.g., by the FDA), control for multiplicity is mandated. There is also typically a large amount of replication, resulting in meta-analysis. But there are many situations where multiplicity is not handled well, such as subgroup analysis: one first tests for an overall treatment effect in the population; failing to find that, one tests for an effect among men or among women; failing to find that, one tests for an effect among old men or young men, or among old women or young women; …. I will argue that there is a single method that can address any such problems of multiplicity: Bayesian analysis, with the multiplicity being addressed through choice of prior probabilities of hypotheses. ... There are, of course, also frequentist error approaches (such as Bonferroni and FDR) for handling multiplicity of statistical inferences; indeed, these are much more familiar than the Bayesian approach. These are, however, targeted solutions for specific classes of problems and are not easily generalizable to new problems.
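The optional-stopping point mentioned in the abstract is easy to see in simulation. The sketch below is illustrative rather than anything from the talk (the Normal(0, 1) data, the 0.05 threshold, the run counts, and the maximum sample sizes are assumptions chosen for the example): it tests after every new observation under a true null and reports how often the running p-value ever dips below 0.05.

```python
# Illustrative sketch: "optional stopping" under a true null hypothesis.
# Keep adding Normal(0, 1) observations (so H0: mean = 0 is true) and recompute
# a two-sided z-test after every new observation; record whether the p-value
# ever falls below 0.05 within the allowed maximum sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ever_significant(max_n: int, alpha: float = 0.05) -> bool:
    """True if the running two-sided z-test of H0: mean = 0 ever rejects by max_n."""
    x = rng.normal(0.0, 1.0, size=max_n)     # data generated with H0 true
    for n in range(2, max_n + 1):
        z = x[:n].mean() * np.sqrt(n)        # known sigma = 1
        p = 2 * (1 - stats.norm.cdf(abs(z)))
        if p < alpha:
            return True
    return False

for max_n in (10, 100, 1000):
    rate = np.mean([ever_significant(max_n) for _ in range(2000)])
    print(f"max n = {max_n:5d}: proportion of runs ever 'significant' ≈ {rate:.2f}")
```

As the maximum sample size grows, the proportion keeps climbing; with unbounded sampling it approaches 1 (the 1933 result alluded to in the abstract), which is why error probabilities computed as if the sample size were fixed are invalidated by data-dependent stopping.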
Clark Glymour
ABSTRACT: "Data dredging"--searching non experimental data for causal and other relationships and taking that same data to be evidence for those relationships--was historically common in the natural sciences--the works of Kepler, Cannizzaro and Mendeleev are examples. Nowadays, "data dredging"--using data to bring hypotheses into consideration and regarding that same data as evidence bearing on their truth or falsity--is widely denounced by both philosophical and statistical methodologists. Notwithstanding, "data dredging" is routinely practiced in the human sciences using "traditional" methods--various forms of regression for example. The main thesis of my talk is that, in the spirit and letter of Mayo's and Spanos’ notion of severe testing, modern computational algorithms that search data for causal relations severely test their resulting models in the process of "constructing" them. My claim is that in many investigations, principled computerized search is invaluable for reliable, generalizable, informative, scientific inquiry. The possible failures of traditional search methods for causal relations, multiple regression for example, are easily demonstrated by simulation in cases where even the earliest consistent graphical model search algorithms succeed. ... These and other examples raise a number of issues about using multiple hypothesis tests in strategies for severe testing, notably, the interpretation of standard errors and confidence levels as error probabilities when the structures assumed in parameter estimation are uncertain. Commonly used regression methods, I will argue, are bad data dredging methods that do not severely, or appropriately, test their results. I argue that various traditional and proposed methodological norms, including pre-specification of experimental outcomes and error probabilities for regression estimates of causal effects, are unnecessary or illusory in application. Statistics wants a number, or at least an interval, to express a normative virtue, the value of data as evidence for a hypothesis, how well the data pushes us toward the true or away from the false. Good when you can get it, but there are many circumstances where you have evidence but there is no number or interval to express it other than phony numbers with no logical connection with truth guidance. Kepler, Darwin, Cannizarro, Mendeleev had no such numbers, but they severely tested their claims by combining data dredging with severe testing.
The Duality of Parameters and the Duality of Probability
Suzanne Thornton
ABSTRACT: Under any inferential paradigm, statistical inference is connected to the logic of probability. Well-known debates among these various paradigms emerge from conflicting views on the notion of probability. One dominant view understands the logic of probability as a representation of variability (frequentism), and another prominent view understands probability as a measurement of belief (Bayesianism). The first camp generally describes model parameters as fixed values, whereas the second camp views parameters as random. Just as calibration (Reid and Cox 2015, “On Some Principles of Statistical Inference,” International Statistical Review 83(2), 293-308)--the behavior of a procedure under hypothetical repetition--bypasses the need for different versions of probability, I propose that an inferential approach based on confidence distributions (CD), which I will explain, bypasses the analogous conflicting perspectives on parameters. Frequentist inference is connected to the logic of probability through the notion of empirical randomness. Sample estimates are useful only insofar as one has a sense of the extent to which the estimator may vary from one random sample to another. The bounds of a confidence interval are thus particular observations of a random variable, where the randomness is inherited by the random sampling of the data. For example, 95% confidence intervals for parameter θ can be calculated for any random sample from a Normal N(θ, 1) distribution. With repeated sampling, approximately 95% of these intervals are guaranteed to yield an interval covering the fixed value of θ. Bayesian inference produces a probability distribution for the different values of a particular parameter. However, the quality of this distribution is difficult to assess without invoking an appeal to the notion of repeated performance. ... In contrast to a posterior distribution, a CD is not a probabilistic statement about the parameter, rather it is a data-dependent estimate for a fixed parameter for which a particular behavioral property holds. The Normal distribution itself, centered around the observed average of the data (e.g. average recovery times), can be a CD for θ. It can give any level of confidence. Such estimators can be derived through Bayesian or frequentist inductive procedures, and any CD, regardless of how it is obtained, guarantees performance of the estimator under replication for a fixed target, while simultaneously producing a random estimate for the possible values of θ.
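The repeated-sampling coverage claim in the abstract can be checked directly by simulation. The sketch below is illustrative rather than anything from the talk (the true θ = 2, n = 25, known σ = 1, and 10,000 repetitions are assumptions for the example): it estimates the coverage of the usual 95% interval for the mean of a Normal(θ, 1) distribution.

```python
# Illustrative sketch: repeated-sampling coverage of a 95% confidence interval
# for theta when the data are drawn from a Normal(theta, 1) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 25, 10_000
z = stats.norm.ppf(0.975)                  # ~1.96 for a 95% interval

covered = 0
for _ in range(reps):
    x = rng.normal(theta, 1.0, size=n)
    half_width = z / np.sqrt(n)            # known sigma = 1
    lo, hi = x.mean() - half_width, x.mean() + half_width
    covered += (lo <= theta <= hi)

print(f"empirical coverage ≈ {covered / reps:.3f}")   # should be close to 0.95
```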
Paper given at PSA 22 Symposium: Multiplicity, Data-Dredging and Error Control
MAYO ABSTRACT: I put forward a general principle for evidence: an error-prone claim C is warranted to the extent it has been subjected to, and passes, an analysis that very probably would have found evidence of flaws in C just if they are present. This probability is the severity with which C has passed the test. When a test’s error probabilities quantify the capacity of tests to probe errors in C, I argue, they can be used to assess what has been learned from the data about C. A claim can be probable or even known to be true, yet poorly probed by the data and model at hand. The severe testing account leads to a reformulation of statistical significance tests: Moving away from a binary interpretation, we test several discrepancies from any reference hypothesis and report those well or poorly warranted. A probative test will generally involve combining several subsidiary tests, deliberately designed to unearth different flaws. The approach relates to confidence interval estimation, but, like confidence distributions (CD) (Thornton), a series of different confidence levels is considered. A 95% confidence interval method, say using the mean M of a random sample to estimate the population mean μ of a Normal distribution, will cover the true, but unknown, value of μ 95% of the time in a hypothetical series of applications. However, we cannot take .95 as the probability that a particular interval estimate (a ≤ μ ≤ b) is correct—at least not without a prior probability to μ. In the severity interpretation I propose, we can nevertheless give an inferential construal post-data, while still regarding μ as fixed. For example, there is good evidence μ ≥ a (the lower estimation limit) because if μ < a, then with high probability .95 (or .975 if viewed as one-sided) we would have observed a smaller value of M than we did. Likewise for inferring μ ≤ b. To understand a method’s capability to probe flaws in the case at hand, we cannot just consider the observed data, unlike in strict Bayesian accounts. We need to consider what the method would have inferred if other data had been observed. For each point μ’ in the interval, we assess how severely the claim μ > μ’ has been probed. I apply the severity account to the problems discussed by earlier speakers in our session. The problem with multiple testing (and selective reporting) when attempting to distinguish genuine effects from noise, is not merely that it would, if regularly applied, lead to inferences that were often wrong. Rather, it renders the method incapable, or practically so, of probing the relevant mistaken inference in the case at hand. In other cases, by contrast, (e.g., DNA matching) the searching can increase the test’s probative capacity. In this way the severe testing account can explain competing intuitions about multiplicity and data-dredging, while blocking inferences based on problematic data-dredging
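The post-data severity assessment described above can be computed directly in the known-σ Normal case. The sketch below is an illustrative computation, not code from the paper (the numbers n = 100, σ = 1, and observed mean 0.2 are hypothetical): for each μ′, SEV(μ > μ′) is the probability of observing a sample mean no larger than the one actually observed, were μ equal to μ′.

```python
# Illustrative sketch: severity for claims of the form "mu > mu_prime" after
# observing sample mean m from n IID Normal observations with known sigma.
# SEV(mu > mu_prime) = P(M <= m_obs; mu = mu_prime), evaluated at the boundary.
import numpy as np
from scipy import stats

def severity_mu_greater(mu_prime: float, m: float, sigma: float, n: int) -> float:
    """SEV(mu > mu_prime) for the Normal mean with known sigma."""
    se = sigma / np.sqrt(n)
    return float(stats.norm.cdf((m - mu_prime) / se))

# Hypothetical numbers: n = 100 observations, sigma = 1, observed sample mean m = 0.2.
m, sigma, n = 0.2, 1.0, 100
for mu_prime in (0.0, 0.1, 0.2, 0.3):
    print(f"SEV(mu > {mu_prime:.1f}) = {severity_mu_greater(mu_prime, m, sigma, n):.3f}")

# The lower limit a = m - 1.645 * sigma / sqrt(n) of a one-sided 95% interval
# satisfies SEV(mu > a) ≈ 0.95, matching the abstract's reading of "good evidence mu >= a".
```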
The Statistics Wars and Their Casualties (w/refs)
High-profile failures of replication in the social and biological sciences underwrite a minimal requirement of evidence: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. A claim is severely tested to the extent it has been subjected to and passes a test that probably would have found flaws, were they present. This probability is the severity with which a claim has passed. The goal of highly well-tested claims differs from that of highly probable ones, explaining why experts so often disagree about statistical reforms. Even where today’s statistical test critics see themselves as merely objecting to misuses and misinterpretations, the reforms they recommend often grow out of presuppositions about the role of probability in inductive-statistical inference. Paradoxically, I will argue, some of the reforms intended to replace or improve on statistical significance tests enable rather than reveal illicit inferences due to cherry-picking, multiple testing, and data-dredging. Some preclude testing and falsifying claims altogether. These are the “casualties” on which I will focus. I will consider Fisherian vs Neyman-Pearson tests, Bayes factors, Bayesian posteriors, likelihoodist assessments, and the “screening model” of tests (a quasi-Bayesian-frequentist assessment). Whether or not one accepts this philosophy of evidence, I argue that it provides a standpoint for avoiding both the fallacies of statistical testing and the casualties of today’s statistics wars.
On the interpretation of the mathematical characteristics of statistical test...
Statistical hypothesis tests are often misused and misinterpreted. Here I focus on one source of such misinterpretation, namely an inappropriate notion regarding what the mathematical theory of tests implies, and does not imply, when it comes to the application of tests in practice. The view taken here is that it is helpful and instructive to be consciously aware of the essential difference between mathematical model and reality, and to appreciate the mathematical model and its implications as a tool for thinking rather than something that has a truth value regarding reality. Insights are presented regarding the role of model assumptions, unbiasedness and the alternative hypothesis, Neyman-Pearson optimality, multiple and data-dependent testing.
The role of background assumptions in severity appraisal
In the past decade discussions around the reproducibility of scientific findings have led to a re-appreciation of the importance of guaranteeing claims are severely tested. The inflation of Type 1 error rates due to flexibility in the data analysis is widely considered one of the underlying causes of low replicability rates. Solutions, such as study preregistration, are becoming increasingly popular to combat this problem. Preregistration only allows researchers to evaluate the severity of a test, but not all preregistered studies provide a severe test of a claim. The appraisal of the severity of a test depends on background information, such as assumptions about the data generating process, and auxiliary hypotheses that influence the final choice for the design of the test. In this article, I will discuss the difference between subjective and inter-subjectively testable assumptions underlying scientific claims, and the importance of separating the two. I will stress the role of justifications in statistical inferences, the conditional nature of scientific conclusions following these justifications, and highlight how severe tests could lead to inter-subjective agreement, based on a philosophical approach grounded in methodological falsificationism. Appreciating the role of background assumptions in the appraisal of severity should shed light on current discussions about the role of preregistration, interpreting the results of replication studies, and proposals to reform statistical inferences.
The two statistical cornerstones of replicability: addressing selective infer...
Tukey’s last published work in 2020 was an obscure entry on multiple comparisons in the Encyclopedia of Behavioral Sciences, addressing the two topics in the title. Replicability was not mentioned at all, nor was any other connection made between the two topics. I shall demonstrate how these two topics critically affect replicability using recently completed studies. I shall review how these have been addressed in the past. I shall review in more detail the available ways to address selective inference. My conclusion is that conducting many small replicability studies without strict standardization is the way to assure replicability of results in science, and we should introduce policies to make this happen.
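One standard way of addressing selective inference, not named in the abstract but closely associated with Benjamini's own work, is control of the false discovery rate via the Benjamini-Hochberg step-up procedure. The sketch below is an illustrative implementation, with a made-up set of p-values chosen only to show the mechanics.

```python
# Illustrative sketch: the Benjamini-Hochberg step-up procedure, one standard way
# of addressing selective inference by controlling the false discovery rate (FDR).
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking which hypotheses are rejected at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * (np.arange(1, m + 1) / m)     # q * k / m for the k-th smallest p
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.nonzero(below)[0].max()         # largest k with p_(k) <= q * k / m
        rejected[order[: k_max + 1]] = True        # reject the k_max + 1 smallest p-values
    return rejected

# Made-up example: a few small p-values among mostly unremarkable ones.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.76]
print(benjamini_hochberg(pvals, q=0.05))
```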
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
1. Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
A conversation I had with Sir David Cox ("Statistical Science and Philosophy of Science: Where Do/Should They Meet (in 2011 and beyond)?") began with his asking:
Cox: Deborah, in some fields foundations do not seem very important, but we both think that foundations of statistical inference are important; why do you think that is?
Mayo: I think because they ask about fundamental questions about evidence, inference, and probability. … In statistics we’re so intimately connected to the scientific interest in learning about the world, we invariably cross into philosophical questions about empirical knowledge and inductive inference.
When Sir David Cox was unable to make it to give my seminar at the LSE last December, I presented my construal of his philosophy of "frequentist" statistical methods (based on scads of papers over 60 years, 2 recent books of his on "Principles", and our 2 joint papers).
2. Moving away from caricatures of the standard methods (e.g., significance tests, confidence intervals) on which critics often focus, the tools emerge as satisfying piecemeal learning goals within a multiplicity of methods and interconnected checks, and reports, of error.
We deny the assumption that the overall unified principle for these methods must be merely controlling the low long-run error probabilities of methods (the behavioristic rationale).
We advance an epistemological principle that renders hypothetical error probabilities relevant for learning about particular phenomena.
We contrast it with current Bayesian attempts at unifications of frequentist and Bayesian methods.
3. Decision and Inference: Cox also distinguishes uses of statistical methods for inference and for decision: the former is open ended (inferences are not fixed in advance), without explicit quantification of costs.
Background Knowledge: In the statistical analysis of scientific and technological data, there is virtually always external information that should enter in reaching conclusions about what the data indicate with respect to the primary question of interest.
Typically, these background considerations enter not by a probability assignment (as in the Bayesian paradigm), but by identifying the question to be asked, designing the study, specifying an appropriate statistical model, interpreting the statistical results, relating those inferences to primary scientific ones, and using them to extend and support underlying theory.
This is a very large issue in current foundations.
4. Roles of Probability in Induction
Inductive inference: the premises (background and evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is "evidence transcending".
Probability naturally arises in capturing induction, but there is more than one way this can occur. It may be used to measure:
1. how much belief or support there is for a hypothesis;
2. how reliable or how well tested a hypothesis is.
It is too often presupposed it can only be the former (probabilism), leading to well-known difficulties with using frequentist ideas ("universes are not as plentiful as blackberries"—Peirce).
Our interest is in how frequentist probability may be used for the latter (for measuring corroboration, trustworthiness, severity of tests)…
5. For instance, significance testing uses probability to characterize the proportion of cases in which a null hypothesis H0 would be erroneously rejected in a hypothetical long run of sampling, an error probability.
We may even refer to frequentist statistical inference as error probability statistics, or just error statistics.
Cox’s analogy with calibrating instruments:
"Arguments involving probability only via its (hypothetical) long-run frequency interpretation are called frequentist. That is, we define procedures for assessing evidence that are calibrated by how they would perform were they used repeatedly. In that sense they do not differ from other measuring instruments." (Cox 2006, p. 8)
As with the use of measuring instruments, we employ the performance features to make inferences about aspects of the particular thing that is measured (aspects that the measuring tool is appropriately capable of revealing).
6. It’s not hard to see how we do so, if we remember the central problem of inductive inference.
The argument from:
H entails data y (H "fits" y),
y is observed,
therefore H is correct,
is, of course, deductively invalid.
A central problem is to be able nevertheless to warrant inferring H, or regarding y as good evidence for H.
Although y accords with H, even in deterministic cases, many rival hypotheses (some would say infinitely many) would also fit or predict y, and would pass as well as H.
7. In order for a test to be probative, one wants the prediction from H to be something
that would be very difficult to achieve, and not easily accounted for, were H false and
rivals to H correct (whatever they may be)
i.e., the test should have high probability of falsifying H, if H is false….
8. Statistical Significance Tests
While taking elements from Fisherian and Neyman-Pearson approaches, our
conception differs from both
1. We have empirical data y treated as observed values of a random variable (vector)
Y.
y is of interest to provide information about the probability distribution of Y as
defined by the relevant statistical model—often an idealized representation of
the underlying data generating process.
2. We have a hypothesis framed in terms of parameters of the distribution, the
hypothesis under test or the null hypothesis H0.
An elementary example:
Data Y consists of n Independent and Identically Distributed (IID) components
(r.v.’s), Normally distributed with unknown mean µ and possibly unknown standard
deviation σ:
Yi ~ NIID(µ, σ²), i = 1, 2, …, n.
9. A simple hypothesis is obtained if the value of σ is known and H0 asserts that
µ = µ0, a given constant.
We find a function T = t(Y) of the sample, the test statistic, such that:
(i) The larger the value of t the more inconsistent or discordant the data are with
H0 in the respect tested
(ii) The statistic T = t(Y) has a (numerically) known probability distribution when
H0 is true.
The p-value (observed statistical significance level) corresponding to t(y) is:
P(t(Y) ≥ t(y); H0) := P(T ≥ t; H0),
regarded as a measure of accordance with H0 in the respect tested.
So we have: (1) data (and the corresponding sample space), (2) hypotheses, (3) a test
statistic, (4) a p-value (or other error probability) as a measure of discordance or
accordance with H0 in the respect tested.
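A minimal numerical sketch of this elementary setup (not part of the original slides;
the data values, σ, and µ0 below are hypothetical), in Python:

from math import erf, sqrt

def Phi(z):
    # standard Normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical data: n IID Normal observations, sigma taken as known
y = [1.2, 0.4, 0.9, 1.5, 0.3, 1.1, 0.8, 1.6]
n, sigma, mu0 = len(y), 1.0, 0.0          # H0: mu = mu0

ybar = sum(y) / n
t_obs = (ybar - mu0) / (sigma / sqrt(n))  # test statistic t(y)
p_value = 1 - Phi(t_obs)                  # P(t(Y) >= t(y); H0), one-sided
print(round(t_obs, 2), round(p_value, 4))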
10. Low p-values, if properly computed, count as evidence against H0 (evidence of a
specified discrepancy or inconsistency with H0).
11. Our treatment contrasts with what is often taken as the strict Neyman-Pearson (N-P)
formulation:
Rejects the recipe-like view of tests, e.g., set a preassigned threshold value α and
"reject" H0
if and only if p ≤ α.
However, such “behavioristic” construals have roles in this conception: While a
relatively mechanical use of p-values is widely lampooned, there are contexts where
it serves as a screening device, decreasing the rates of publishing misleading results
(e.g., microarrays)
Virtues: impartiality, relative independence from manipulation, and protection via
known and desirable long-run properties.
But the main interest and novelty here is developing an inferential or evidential
rationale.
P(t(Y) ≥ t(y); H0) := P(T ≥ t; H0)
12. For example: low p-values, if properly computed, count as evidence against H0
(evidence of a specified discrepancy or inconsistency with H0).
Why?
13. It’s true that such a rule provides low error rates (i.e., erroneous rejections) in the
long run when H0 is true, a behavioristic argument:
Suppose that we were to treat the data as just decisive evidence against H0, then, in
hypothetical repetitions, H0 would be rejected in a long-run proportion p of the
cases in which it is actually true.
(so a p-value is an error probability, contrary to what many allege)
But what matters for us is that such a rule also provides a way to determine whether
a specific data set provides evidence of inconsistency with or discrepancy from H0.
…along the lines of the probative demand for tests…
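A small simulation can illustrate the behavioristic claim: applying the rule "reject
when p ≤ α" to data generated under H0 yields erroneous rejections in roughly a
proportion α of hypothetical repetitions. A sketch in Python (the choices of α, n, and
the Normal model are illustrative assumptions, not from the slides):

import random
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_value(sample, mu0=0.0, sigma=1.0):
    n = len(sample)
    t = (sum(sample) / n - mu0) / (sigma / sqrt(n))
    return 1 - Phi(t)

alpha, n, reps = 0.05, 20, 50_000
# Generate data under H0 (mu = 0) and count how often p <= alpha
rejections = sum(p_value([random.gauss(0.0, 1.0) for _ in range(n)]) <= alpha
                 for _ in range(reps))
print(rejections / reps)   # close to alpha: the long-run proportion of erroneous rejections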
14. FEV: Frequentist Inductive Inference Principle
The reasoning is based on the following frequentist inference principle (with respect
to H0):
(FEV) y fails to count as evidence against H0
if such discordant results are fairly frequent even when H0 is correct. (This is
actually FEV(i); Mayo and Cox 2006/2010, 254.)
y counts as evidence of a discrepancy only if (and to the extent that) a less discordant
result would probably have occurred, if H0 correctly described the distribution
generating y.
So if the p-value is not small, it is fairly easy (frequent) to generate such discordant
results even if the null is true, so this is not good evidence of a discrepancy from the
null…
Where “such discordant results” are those as or even further from H0 than the
outcome observed.
15. Weight Gain Example.
To distinguish between this "evidential" use of significance test reasoning, and the
familiar appeal to “low long-run erroneous behavior” (N-P), consider a very informal
example:
Suppose that weight gain is measured by a multiplicity of well-calibrated and stable
methods, and the results show negligible change over a test period (e.g., before and
after a trip to England).
This is grounds for inferring that my weight gain is negligible within limits set by the
sensitivity of the scales.
Why?
While it is true that by following such a procedure in the long run one would rarely
report weight gains erroneously, that is not the rationale for the particular inference.
16. The justification is rather that the error probabilistic properties of the weighing
procedure reflect what is actually the case in the specific instance.
(It informs about the specific cause of the lack of a recorded increase in weight).
Low long run error, while necessary, is not sufficient for warranted evidence in a
particular case.
General Severity Account of Evidence for statistical and non-statistical cases:
FEV falls under a more general account of evidence or inductive inference, extending
beyond statistical hypotheses:
Data y provide evidence for a claim H, just to the extent that H passes a severe test
with y
17. The intuition behind requiring severity is that:
Data y0 in test T provide good evidence for inferring H (just) to the extent that H
passes severely with y0, i.e., to the extent that H would (very probably) not have
survived the test so well were H false.
This may be a quantitative or a qualitative assessment.
(A core question that permeates "inductive philosophy", both in statistics and
philosophy, is:
What is the nature and role of probabilistic concepts, methods, and models in making
inferences in the face of limited data, uncertainty and error?)
18. Mayo and Cox (2010) identify a key principle —
Frequentist Evidential Principle (FEV):
(FEV) — by which hypothetical error probabilities may be used for inductive
inference from specific data and consider how FEV may direct and justify:
(a) Different uses and interpretations of statistical significance levels in testing a
variety of different types of null hypotheses,
and
(b) when and why "selection effects" need to be taken account of in data dependent
statistical testing.
19. The Role of Probability in Frequentist Induction
The defining feature of an inductive inference is that the premises (background
and evidence statements) can be true while the conclusion inferred may be false
without a logical contradiction:
the conclusion is "evidence transcending".
Probability naturally arises but there is more than one way this can occur.
Two distinct philosophical traditions for using probability in inference are summed
up by Pearson (1950, p. 228) (probabilism and performance):
“For one school, the degree of confidence in a proposition, a quantity varying with
the nature and extent of the evidence, provides the basic notion to which the
numerical scale should be adjusted.
The other school notes the relevance in ordinary life of a knowledge of the relative
frequency of occurrence of a particular class of events in a series of repetitions, and
suggests that "it is through its link with relative frequency that probability has the
most direct meaning for the human mind".”
Frequentist induction employs probability in the second manner:
For instance, significance testing appeals to probability to characterize the proportion
of cases in which a null hypothesis H0 would be rejected in a hypothetical long-run of
repeated sampling, an error probability.
We may even refer to frequentist statistical inference as error probability statistics, or
just error statistics.
Some, following Neyman, describe the role of probability in frequentist error
statistics as “inductive behavior”:
Here the inductive reasoner "decides to infer" the conclusion, and probability
quantifies the associated risk of error.
The idea that one role of probability arises in science to characterize the "riskiness"
or probativeness or severity of the tests to which hypotheses are put is reminiscent of
the philosophy of Karl Popper.
21. Popper and Neyman:
Popper and Neyman have a broadly analogous approach based on the idea that
we can speak of a hypothesis having been well-tested in some sense, quite
different from its being accorded a degree of probability, belief or confirmation;
Both also broadly shared the view that in order for data to "confirm" or
"corroborate" a hypothesis H, that hypothesis would have to have been subjected
to a test with high probability or power to have rejected it if false.
But the similarities between Neyman and Popper do not go as deeply as one might
suppose.
Despite the close connection of the ideas, there appears to be no reference to
Popper in the writings of Neyman (Lehmann, 1995, p. 3) and the references by
Popper to Neyman are scarce (e.g., LSD).
Moreover, because Popper denied that any inductive claims were justifiable, his
philosophy forced him to deny that even the method he espoused (conjectures
and refutations) was reliable (not that it’s known to be unreliable)
22. Although H might be true, Popper made it clear that he regarded corroboration
at most as a report of the past performance of H: it warranted no claims about
its reliability in future applications.
23. Popper and Fisher:
Popper’s focus on falsification as the goal of tests, and indeed of science, is
clearly strongly redolent of Fisher.
While, again, evidence of direct influence is virtually absent, Popper would have
concurred with Fisher (1935a, p. 16) that ‘every experiment may be said to exist only
in order to give the facts the chance of disproving the null hypothesis’.
However, Popper's position denies ever having grounds for inference about
reliability, or for inferring reproducible deviations.
By contrast, a central feature of frequentist statistics is to actually assess and
control the probability that a test would have rejected a hypothesis, if false.
The advantage in the modern statistical framework is that the probabilities arise from
defining a probability (statistical) model to represent the phenomenon of interest.
Neyman describes frequentist statistics as modelling the phenomenon of the stability
of relative frequencies of results of repeated "trials", granting there are other
possibilities concerned with modelling psychological phenomena connected with
intensities of belief, or with readiness to bet specified sums, etc.
Frequentist statistical methods can (and often do) supply tools for inductive
inference by providing methods for evaluating the severity or probativeness of tests
—although they don’t directly do so, the severity rationale gives guidance in using
them in this way.
25. A hypothesis H passes a severe test T with data y0 if,
(S-1) y0 agrees with H, and
(S-2) with very high probability, test T would have produced a result that
accords less well with H than y0 does, if H were false.
In statistical contexts, the test statistic t(Y) must define an appropriate measure of
accordance or distance (as required by severity condition (S-1)).
Note: when t(Y) is statistically significantly different from H0, it is agreeing with not-H0.
‘The severity with which inference H passes test T with outcome y’ may be
abbreviated by:
SEV(Test T, outcome y, claim H)
26. Just as one makes inferences about changes in body mass based on performance
characteristics of various scales, error probabilities of tests indicate the capacity
of the particular test to have revealed inconsistencies in the respects probed, and
this in turn allows relating p-values to statements about the underlying process.
Peirce: shall we assume instead that the scales read my mind and mislead me just when I
don't know the weight, but do fine with items of known weight, e.g., 5 lb. potatoes?
—That is itself a highly unreliable way to proceed in reasoning from instruments…
Note: a clue to showing counterinduction is unreliable.
27. A Multiplicity of Types of Null Hypotheses (Cox’s taxonomy)
Considerable criticism and abuse of significance tests would have been avoided had
Cox’s (1958) taxonomy been an integral part of significance testing.
Types of Null Hypothesis (multiple uses in piecemeal steps involved in
linking data and models to learn about modeled phenomena)
1. Embedded null hypotheses
2. Dividing null hypotheses
3. Null hypotheses of absence of structure
4. Null hypotheses of model adequacy
5. Substantively-based null hypotheses.
Evaluating such local hypotheses for these piecemeal learning goals demands its
own criteria
28. 1. Embedded null hypotheses
Consider a parametric family of distributions f(y; θ) indexed by an unknown (vector of)
parameter θ = (ψ, λ),
where ψ is the parameter of interest and λ denotes unknown nuisance parameter(s).
The null and alternative hypotheses of interest are:
H0: ψ = ψ0 vs. H1: ψ ≠ ψ0.
Central tasks, often ignored, take center stage in Cox’s discussion and are indeed an
important part of the justification of this approach (though I must be skimpy here):
(a) —need to choose appropriate test statistic T(Y)
(b) —need to be able to compute the probability distribution of T under the null (and
perhaps alternative) hypotheses (it should not depend on nuisance parameters; see the
sketch after item (c) below)
29. (c) —need to collect data y so that they adequately satisfy the assumptions of the
relevant probability model.
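Returning to requirement (b): in the Normal-mean case the studentized statistic is one
standard choice whose null distribution is free of the nuisance σ. A rough simulation
sketch (the sample size, cutoff, and σ values below are arbitrary illustrative choices,
not from the slides):

import random
from math import sqrt

def t_stat(sample, mu0=0.0):
    # studentized distance of the sample mean from mu0
    n = len(sample)
    ybar = sum(sample) / n
    s = sqrt(sum((yi - ybar) ** 2 for yi in sample) / (n - 1))
    return (ybar - mu0) / (s / sqrt(n))

def upper_tail(sigma, n=10, reps=20_000, cut=2.0):
    # P(T > cut) under H0, simulated with nuisance sigma
    return sum(t_stat([random.gauss(0.0, sigma) for _ in range(n)]) > cut
               for _ in range(reps)) / reps

print(upper_tail(1.0), upper_tail(5.0))   # roughly equal: T's null distribution is free of sigma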
If we succeed with all that….
p-value not small—evidence the data are consistent with the null…
But no evidence against is not automatically evidence for the null…
but we can infer the absence of a discrepancy from (or concordance with) H0
To infer the absence of a discrepancy from H0 as large as δ, examine the probability Π(δ):
We have a result and it differs from H0 by some observed amount d.
Π(δ) = Prob[a result that differs as much as or more from H0 (than d); evaluated under
the assumption that ψ = ψ0 + δ]
= P(t(Y) > t(y); ψ0 + δ).
If Π(δ) is near 1, then, following FEV, the data are good evidence that ψ < ψ0 + δ.
Terms: discrepancy refers to parameter differences; discordance and fit refer to
observed differences.
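A minimal sketch of Π(δ) for the Normal-mean case (where ψ is the mean µ; the observed
t(y), σ, and n below are hypothetical), in Python:

from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 1.0, 25
se = sigma / sqrt(n)     # standard error of the sample mean
t_obs = 1.0              # hypothetical observed t(y) = (ybar - mu0)/se, p ~ .16 (not small)

def Pi(delta):
    # P(t(Y) > t(y); mu = mu0 + delta): probability of a result even more
    # discordant than the one observed, were a discrepancy delta to exist
    return 1 - Phi(t_obs - delta / se)

for delta in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(delta, round(Pi(delta), 3))
# Pi(0.8) ~ .999: by FEV, good evidence the discrepancy from mu0 is less than 0.8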
30. Interpreting Π(δ)
Π(δ) may be regarded as the stringency or severity with which the test has probed the
discrepancy δ;
that is, ψ < ψ0 + δ has passed a severe test.
(other terms: precision, sensitivity)
Avoids unwarranted interpretations of “consistency with H0” with insensitive
tests (“fallacies of acceptance”).
A fallacy of acceptance takes an insignificant difference as evidence that the null
is true; but tests have limited discernment power.
Π(δ) is more relevant to the specific data than is a test’s power, which is calculated
relative to a predesignated critical value c beyond which the test "rejects" the null.
Power at δ:
POW(δ) = P(t(Y) > c; ψ0 + δ)
In contrast:
Π(δ) = P(t(Y) > t(y); ψ0 + δ)
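The contrast can be made concrete in the same hypothetical Normal-mean setting: power
is evaluated at the fixed cutoff c, while Π(δ) is evaluated at the result actually
observed. A sketch, with illustrative numbers:

from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 1.0, 25
se = sigma / sqrt(n)
c = 1.96                 # predesignated critical value for t(Y) (alpha = .025, one-sided)
t_obs = 1.0              # hypothetical observed, statistically insignificant t(y)

def POW(delta):          # power at delta: uses the fixed cutoff c
    return 1 - Phi(c - delta / se)

def Pi(delta):           # attained sensitivity: uses the observed t(y)
    return 1 - Phi(t_obs - delta / se)

for delta in (0.2, 0.4, 0.6):
    print(delta, round(POW(delta), 3), round(Pi(delta), 3))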
31. An important passage in Cox (2008, p. 25):
“In the Neyman-Pearson theory of tests, sensitivity of a test is assessed by the notion
of power….. In the approach adopted here the assessment is via the distribution of
the random variable P, again considered for various alternatives”
P here is the significance probability regarded as a random variable—which it is…
Cox makes an equivalent move referring to confidence intervals—I'll come back to…
Small p-values (evidence of some discrepancy in the direction of the alternative):
Critics correctly note that the p-value alone doesn’t tell you the size of the
discrepancy indicated (the “effect size”), but it, together with the sample size, can be
used to determine this:
If a test is so sensitive that so small a p-value is probable even when ψ ≤ ψ0 + δ,
then a small p-value is not evidence of a discrepancy from H0 in excess of δ.
So by this reasoning we also avoid “fallacies of rejection”.
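A hypothetical large-n illustration of this guard against fallacies of rejection (a
sketch with made-up numbers, using the severity-style computation for a claim of a
discrepancy in excess of δ):

from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 1.0, 10_000
se = sigma / sqrt(n)     # very sensitive test: se = 0.01
t_obs = 3.0              # hypothetical observed t(y): p ~ .001, statistically significant

def sev_excess(delta):
    # P(t(Y) <= t(y); mu = mu0 + delta): how well "mu > mu0 + delta" has passed
    return Phi(t_obs - delta / se)

for delta in (0.0, 0.02, 0.05):
    print(delta, round(sev_excess(delta), 3))
# the small p-value indicates some discrepancy, but not one in excess of 0.05 (~ .02)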
32. 5. Substantively-based null hypotheses.
A theory T predicts that H0 is at least a very close approximation to the true situation
(perhaps T has already passed several theoretical and empirical tests)
Rival theory T* predicts a specified discrepancy from H0 and the test is designed to
discriminate between T and T* in a thus far untested domain.
Focus on interpreting: p-value not small.
H0 may be formed deliberately to let T “stick its neck out”
Discrepancies from T in the direction of T* are given a very good chance to be
detected, so if no significant departure is found, this constitutes evidence for T in the
respect tested
Famous “null results” take this form (e.g., set the GTR-predicted values at 0):
rivals to GTR predicted a breakdown of the Weak Equivalence Principle (WEP)
for massive self-gravitating bodies, the earth-moon system: this effect, the Nordtvedt
effect, would be 0 for GTR (identified with the null hypothesis) and non-0 for rivals.
33. Measurements of the round trip travel times between the earth and moon (between
1969 and 1975) set upper bounds to the possible violation of the WEP, and because
the tests were sufficiently sensitive, these measurements provided good evidence that
the Nordtvedt effect is absent, i.e., evidence for the null hypothesis.
Such a negative result does not provide evidence for all of GTR (in all its areas of
prediction), but it does provide evidence for its correctness with respect to this effect.
Some argue I should allow inferring all of GTR (Chalmers, Laudan, Musgrave) —but
I think it is this piecemeal way that inquiries enable progress (realizing that NOT all
of GTR has been warranted provoked developing rivals…)
Unlike the large-scale theory testers, we are not after an account of “theory choice”
but rather learning about aspects of theoretical phenomena by local probes…
Statistical reasoning is at the heart of inquiries which are broken down into questions
about parameters in statistical distributions intermediate between the full theory and
the actual data.
My conception is to view them as probing for piecemeal errors—and note, an error
can be any mistaken claim or flawed understanding—both empirical and theoretical!
34. A fundamental tenet of the conception of inductive learning most at home with
the frequentist philosophy is that inductive inference requires building up
incisive arguments and inferences by putting together several different piecemeal results.
Although the complexity of the story makes it more difficult to set out neatly, as,
for example, if a single algorithm is the whole of inductive inference, the payoff
is an account that approaches the kind of full-bodied arguments that scientists
build up in order to obtain reliable knowledge.
The goal is not representing beliefs or opinions (as in personalist Bayesian
accounts) but avoiding being misled by beliefs and opinions
35. Series of Confidence Intervals (goal: to distinguish inferential and
behaviorist justifications while avoiding an infamous fallacy of R.A. Fisher)
Consider our Normal mean “embedded” example; rather than running a
significance test we may form the
1 − α upper confidence bound, CIU(Y; α),
for estimating the mean µ. Let σ be known, with σȳ = σ/√n.
The upper 1 − α limit is Ȳ + k(α)σȳ,
where k(α) is the upper α-point of the standard Normal distribution. Pick a small α, say .025.
The upper .975 limit is Ȳ + 1.96σȳ.
Consider the inference:
Infer (or regard the data as evidence for)
µ ≤ ȳ + 1.96σȳ,
i.e., µ ≤ CIU(y; .025).
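A minimal numerical sketch of the upper bound (the data values, σ, and α below are
hypothetical):

from math import sqrt

# Hypothetical data for the Normal-mean example, sigma assumed known
y = [1.2, 0.4, 0.9, 1.5, 0.3, 1.1, 0.8, 1.6]
n, sigma, alpha = len(y), 1.0, 0.025
k_alpha = 1.96                     # upper alpha-point of the standard Normal, alpha = .025

ybar = sum(y) / n
se = sigma / sqrt(n)               # sigma_ybar = sigma / sqrt(n)
ci_upper = ybar + k_alpha * se     # CIU(y; .025)
print(round(ci_upper, 3))          # infer: mu <= ci_upper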
36. One rationale is that it instantiates an inference rule that yields true claims with high
probability (.975), because
P(µ ≤ Ȳ + 1.96σȳ) = .975,
whatever the true value of µ.
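A quick simulation of this coverage rationale (the values of µ, σ, and n below are
arbitrary illustrative choices):

import random
from math import sqrt

mu_true, sigma, n, reps = 3.0, 1.0, 25, 20_000
se = sigma / sqrt(n)

# Proportion of hypothetical repetitions in which the upper bound covers the true mean
covered = sum(mu_true <= sum(random.gauss(mu_true, sigma) for _ in range(n)) / n + 1.96 * se
              for _ in range(reps))
print(covered / reps)   # close to .975, whatever value mu_true is set to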
37. The procedure has high long-run “coverage probabilities.”
They "rub off" on the particular case.
Instead we might view µ ≤ ȳ + 1.96σȳ as an inference from a type of reductio ad
absurdum argument:
suppose in fact that this inference is false and the true mean is µ*, where
µ* > ȳ + 1.96σȳ.
Then it is very probable that we would have observed a larger sample mean:
P(Ȳ > ȳ; µ*) > .975.
Therefore, one can reason, ȳ is inconsistent at level .975 (or .025) with having been
generated from a population with µ in excess of the upper limit.
This reasoning is captured in FEV.
-------------------------------------------------
38. Aside: This statistic is directly related to a test of H0: µ ≤ µ0 against H1: µ > µ0.
In particular, Ȳ is statistically significantly smaller than values of µ in excess of
CIU(Ȳ; α), at level α.
39. Fisher's fiducial error
This may make µ look like a random variable, but it is not; these claims do not
hold once a specific ȳ is plugged in for Ȳ:
P(µ < Ȳ + 0σȳ) = .5
P(µ < Ȳ + .5σȳ) = .7
P(µ < Ȳ + 1σȳ) = .84
P(µ < Ȳ + 1.5σȳ) = .93
P(µ < Ȳ + 1.96σȳ) = .975
“Our (Cox and Hinkley) attitude toward the statement [µ < Ȳ + 1.96σȳ] might then
be taken to be the same as that to other uncertain statements for which the probability
is [.975] of their being true, and hence…is virtually indistinguishable from a
probability statement about [µ]. However, this attitude is in general incorrect, in our
view because the confidence statement can be known to be true or untrue… The
system of confidence limits simply summarizes what the data tell us about µ, given
the model… It is wrong to combine confidence limit statements about different
parameters as though the parameters were random variables.” (Cox and Hinkley
1974, p. 227)
But what exactly is it telling us about µ?
41.
P(µ < Ȳ + 0σȳ) = .5
P(µ < Ȳ + .5σȳ) = .7
P(µ < Ȳ + 1σȳ) = .84
P(µ < Ȳ + 1.5σȳ) = .93
P(µ < Ȳ + 1.96σȳ) = .975
Answer (Mayo): it tells you which discrepancies are and are not indicated (at various
degrees of severity).
If we replace P with SEV (the degree of severity or even “corroboration”), the claims
are true even after substituting the observed sample mean
(the severity with which the inference has passed the test with the data).
42. SEV(µ < ȳ0 + kσȳ) abbreviates:
the severity with which a test T with an observed result ȳ0 passes the claim
(µ < ȳ0 + kσȳ).
SEV(µ < ȳ0 + 0σȳ) = .5
SEV(µ < ȳ0 + .5σȳ) = .7
SEV(µ < ȳ0 + 1σȳ) = .84
SEV(µ < ȳ0 + 1.5σȳ) = .93
SEV(µ < ȳ0 + 1.96σȳ) = .975
But aren’t I just using this as another way to say how probable each claim is?
No. This would lead to inconsistencies (if we mean mathematical probability), but
the main thing is, or so I argue, that probability gives the wrong logic for “how
well-tested” (or “corroborated”) a claim is.
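In this one-sided Normal case the SEV values in the table reduce to Φ(k), the standard
Normal CDF at k, whatever the particular ȳ0 and σȳ happen to be; a short check (a
sketch, not from the slides):

from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# SEV(mu < ybar0 + k*sigma_ybar) = P(Ybar > ybar0; mu = ybar0 + k*sigma_ybar) = Phi(k),
# independent of the particular ybar0 and sigma_ybar
for k in (0.0, 0.5, 1.0, 1.5, 1.96):
    print(k, round(Phi(k), 3))   # 0.5, 0.691, 0.841, 0.933, 0.975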
43. What Would a Logic of Severe Testing be?
If H has passed a severe test T with x, there is no problem in saying x warrants H, or,
if one likes, x warrants believing in H….
(H could be the confidence interval estimate)
If SEV(H) is high, its denial is low, i.e., SEV(~H) is low
But it does not follow that a severity assessment should obey the probability calculus, or
be a posterior probability….
For example, if SEV(H) is low, SEV(~H) could be high or low
……in conflict with a probability assignment
(there may be a confusion of ordinary language use of “probability”)
44. Other points lead me to deny that probability yields the logic we want for well-testedness
(e.g., the problem of irrelevant conjunctions, the tacking paradox for Bayesians and
hypothetico-deductivists).
Contrast this again with what I call the “rubbing-off” construal: the procedure is rarely
wrong, therefore, the probability it is wrong in this case is low.
The long-run reliability of the rule is a necessary but not a sufficient condition to
infer H (with severity)
The reasoning instead is counterfactual:
H: µ < ȳ0 + 1.96σȳ
(i.e., µ < CIU)
passes severely because, were this inference false and the true mean µ* > CIU,
then, very probably, we would have observed a larger sample mean:
P(Ȳ > ȳ0; µ*) > .975.
45. A very well-known criticism of frequentist confidence intervals (many Bayesians say
it is a reductio) stems from assuming that confidence levels must give degrees of belief;
critics mount examples such as:
a 95% confidence interval might be known to be correct (given some additional
constraints on the parameter)--the trivial interval.
In our construal, this is just saying no parameter values are ruled out with severity…..
“Viewed as a single statement [the trivial interval] is trivially true, but, on the other hand,
viewed as a statement that all parameter values are consistent with the data at a particular
level is a strong statement about the limitations of the data.” (Cox and Hinkley
1974, p. 226)
46. Mayo and Spanos: Severity Interpretations for test T+ of “accept” and “reject”
(SEV) for non-statistically-significant results:
For “Accept H0” (a statistically insignificant difference from µ0),
µ ≤ ȳ0 + kε(σ/√n) passes with severity (1 − ε),
where P(d(X) > kε) = ε (for any 0 < ε < 1).
(SEV) for statistically-significant results:
For “Reject H0” (a statistically significant difference from µ0),
µ > ȳ0 − kε(σ/√n) passes with severity (1 − ε).
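A minimal sketch of these two severity assessments for a hypothetical T+ test (the
observed mean, σ, n, and ε = .025 below are illustrative assumptions, not from the
slides):

from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, n = 1.0, 100
se = sigma / sqrt(n)
ybar0 = 0.15                 # hypothetical observed mean for test T+ of H0: mu <= 0

def sev_upper(mu1):          # severity for the claim  mu <= mu1  ("accept" direction)
    return Phi((mu1 - ybar0) / se)

def sev_lower(mu1):          # severity for the claim  mu > mu1   ("reject" direction)
    return 1 - Phi((mu1 - ybar0) / se)

k_eps = 1.96                 # P(d(X) > k_eps) = eps for eps = .025
print(round(sev_upper(ybar0 + k_eps * se), 3),   # ~ .975 = 1 - eps
      round(sev_lower(ybar0 - k_eps * se), 3))   # ~ .975 = 1 - eps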
47. FEV(ii): A moderate p-value is evidence of the absence of a discrepancy δ from H0
only if there is a high probability the test would have given a worse fit with H0 (i.e., a
smaller p-value) were a discrepancy δ to exist.