p-values: A significant problem in science? - John Carlin
 


P-values and Null Hypothesis Significance Testing abound in journal articles and other literature on bioinformatics and quantitative bioscience. Yet there has been confusion and controversy about what these approaches do and don’t mean—see e.g., Scientific method: Statistical errors (Nature 506, 150–152, 13 February 2014 doi:10.1038/506150a). Bioinformatics FOAM 2014 is fortunate to have two eminent statisticians—Professor John Carlin and Professor Gordon Smyth—to lead us in exploring the subtleties and science of statistical significance so that we can all become wiser to the value of p-values.


    Presentation Transcript

    • P-values: A Significant Problem in Science?
      John Carlin, Murdoch Children’s Research Institute & University of Melbourne
      Bioinformatics FOAM, 28-Mar-14
    • Nature is suddenly concerned about statistics? But wait, that sounds familiar…
    • Outline
      • The reproducibility crisis (not just p-values)
      • The p-value as the currency of research (‘findings’)
      • Tutorial time: what is a p-value anyway?
      • A brief history of a significant problem
      • Ways forward?
    • The reproducibility crisis
      Basic concern: many scientific claims cannot be replicated.
      Dominant themes:
      • Pressures to publish, and pressures to be first, original, or novel
        – Replication studies have less appeal and are harder to publish
      • Significance tests & p-values are widely misunderstood and misused (our main topic)
        – “research teams… fall prey to an honest confusion between the sweet signal of a genuine discovery and a freak of the statistical noise” (Economist, 19/10/13)
      • The peer review process is imperfect (at best)
      • Well-documented examples from laboratory science (Begley – Amgen) & psychology
    • Beauty, sex, and power
      Gelman (2007), critique of Kanazawa (J. Theor. Biol. 2005-07):
      • “Beautiful parents have more daughters”
      • “Violent men have more sons” (& more: “Ten politically incorrect truths about human nature”, Psychology Today, 2007)
      • Almost certainly false “findings”, since any reasonable consideration of other studies of similar questions makes even moderate effects highly implausible
      • The studies had essentially no power to detect effects of a plausible size, so the findings are almost certainly false positives
    • The p-value as research currency: an everyday example (current issue of Nature)
      First paragraph of results: Transcriptional profiling has demonstrated significant changes in the expression of neuronal genes in the prefrontal cortex of ageing humans (refs 9, 10). Analysis of this data set using the Ingenuity Systems IPA platform indicates that the transcription factor most strongly predicted to be activated in the ageing brain is REST (P = 9 × 10^−10). Moreover, the 21-base-pair canonical RE1 recognition motif for REST is highly enriched in the age-downregulated gene set (P = 3 × 10^−7) (Fig. 1a).
    • T Lu et al. Nature 000, 1-7 (2014) doi:10.1038/nature13163 Induction of REST in the ageing human prefrontal cortex. […] For c and e, values are expressed as fold change relative to the young adult group, and represent the mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001 by Student’s unpaired t-test…
    • P-value as currency
      • If you find a significant p-value, you are more likely to get your research published
        – And in your report you are allowed to say that you “found” X (e.g. “found that Y was higher with drug A than drug B”), implying that this is a factual claim
      • If you find non-significant p-values, it is harder to get published…
      • Why wouldn’t you try to find significant p-values (if you believe you are “on to something”)?
    • P-hacking
      • Also known as data dredging, fishing, etc.
      • N.B. p-hacking may be done unconsciously!
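To see why fishing through many analyses so readily turns up a “significant” result, here is a minimal Python sketch. It assumes, purely for illustration, that every null hypothesis is true and that the tests are independent (real analyses of one data set are usually correlated, so this is only a rough guide). The function name is hypothetical.

```python
def prob_any_significant(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests of true
    null hypotheses gives p < alpha. Independence is an assumption."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # The more analyses are tried, the more likely a spurious "finding".
    for k in (1, 5, 20, 100):
        chance = prob_any_significant(k)
        print(f"{k:>3} tests: P(at least one p < 0.05) = {chance:.2f}")
```

Under these assumptions, trying 20 different analyses gives roughly a two-in-three chance of at least one p < 0.05 even when nothing is going on.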
    • The problem of false positives
      • The emphasis on “findings” (i.e. rejection of null hypotheses) leads to the plausible claim that a majority of published findings are false (Ioannidis, 2005)
      • E.g. can calculate frequency of “accepting” & “rejecting” true and false null hypotheses, if:
        – 90% of hypotheses tested are actually true nulls
        – Significance level = 0.05
        – Power = 50%
      • Then nearly half of the “significant” findings (45 of 95, about 47%) are false positives …
    • Frequency of “accepting” & “rejecting” true and false null hypotheses (Sterne & Davey Smith, BMJ 2001)

      Result of study   Null hypothesis true           Null hypothesis false       Total
                        (association doesn’t exist)    (association does exist)
      “Non-signif”      855                            50                          905
      “Significant”     45                             50                          95
      Total             900                            100                         1000
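The counts in the Sterne & Davey Smith table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function and key names are hypothetical, and the default inputs are the ones stated on the preceding slide (1000 hypotheses, 90% true nulls, significance level 0.05, power 50%).

```python
def significance_table(n_hypotheses=1000, prop_true_null=0.9,
                       alpha=0.05, power=0.5):
    """Expected counts of significant / non-significant results,
    split by whether the null hypothesis is actually true."""
    true_nulls = n_hypotheses * prop_true_null
    false_nulls = n_hypotheses - true_nulls
    false_positives = true_nulls * alpha   # true nulls wrongly "rejected"
    true_positives = false_nulls * power   # real effects detected
    significant = false_positives + true_positives
    return {
        "null_true_nonsig": true_nulls - false_positives,
        "null_true_sig": false_positives,
        "null_false_nonsig": false_nulls - true_positives,
        "null_false_sig": true_positives,
        # Share of "significant" results that are false positives.
        "false_share_of_significant": false_positives / significant,
    }


if __name__ == "__main__":
    for name, value in significance_table().items():
        print(f"{name}: {value:.3f}")
```

Changing `prop_true_null` or `power` shows how strongly the share of false positives among “significant” results depends on inputs that a p-value alone does not reveal.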
    • It can’t happen to me, I understand my P-value
      Let’s see, take an example from a published article (“randomly selected”):
      “Occupational Exposure to Extremely Low Frequency Magnetic Fields and Mortality from Cardiovascular Disease”, Håkansson et al, American Journal of Epidemiology (15 Sept 2003)
      Quote from the Abstract: “The authors found a low-level increase in AMI [acute myocardial infarction] risk in the highest exposure group (relative risk = 1.3, 95% confidence interval: 0.9, 1.9) and suggestions of an exposure-response relation (p = 0.02).”
    • Question
      • The results quote a p-value in support of a claim that there may be an exposure-response relationship
      • This is actually quite well presented (no mention of significance or implication that they have a “true finding”)
      Which of the following is a valid interpretation of the P value?
    • Candidate interpretations:
      • The probability that the exposure-response relationship is due to chance alone is 0.02.
      • The probability that the null hypothesis (i.e. there is no exposure-response relationship) is false is 0.02, i.e. 2%.
      • If we did a similar study again, the probability that we would obtain a similar or greater level of association than found in these data, if the null hypothesis (of no exposure-response relationship) is true, is 0.02.
      • There is a very low probability (i.e. around 2%) that these results can be explained by chance if there is truly no association.
      • The probability that the investigators make a Type I error if they conclude that the association is real is 2%.
      • It doesn’t really matter because it’s just a scientific convention that if P < 0.05, then the association is significant.
    • Fisher’s interpretation (1920s)
      P = Prob(we would obtain a result at least as extreme as the actual one, if the null hypothesis is true)
      • An index of “surprise”: if P is small, EITHER something surprising has occurred (under the null hypothesis),
      • OR the null hypothesis is false, i.e. we should adopt some other theory about the truth.
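Fisher’s definition can be made concrete with a small exact calculation. The sketch below is not from the talk: it uses a hypothetical coin-tossing example (60 heads in 100 tosses, null hypothesis of a fair coin) and computes the one-sided tail probability of a result at least that extreme. The function name is illustrative.

```python
from math import comb


def binomial_upper_tail_p(successes: int, trials: int,
                          p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null),
    i.e. the chance of a result at least this extreme if the null is true."""
    return sum(
        comb(trials, k) * p_null ** k * (1.0 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical data: 60 heads in 100 tosses of a supposedly fair coin.
    p_value = binomial_upper_tail_p(60, 100)
    print(f"one-sided p-value = {p_value:.4f}")
```

Note what the number is conditioned on: it is a probability about the data given the null hypothesis, not the probability that the null hypothesis is true given the data.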
    • Neyman & Pearson’s version (1930s)
      “…no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we ensure that, in the long run of experience, we shall not often be wrong.”
    • The historical legacy…
      • A mess: widespread misunderstanding and confusion!
      • Adoption of “P < 0.05” as a mantra with the word “significant” attached
        – Applying Neyman-Pearson thinking (“accept/reject” null hypothesis & claim positive/negative finding) in a context that requires inductive inference (conditional on the data)
        – Leads to over-interpretation in both directions (“significant” and “non-significant”)
    • Genetics: a good-news story?
      • Early days of genomics (post HGP, early 2000s?): “candidate genes” widely sought & “found” to be associated with disease risk
      • Many (most?) such findings failed to replicate
      • Genome-wide approach
        – Unstructured searching for associations
        – Multiple comparisons on massive scale
        – Many true null hypotheses
        – Recognition of need to drastically control “over-calling” (P < 10^-7, not P < 0.05)
        – P-values for ranking, not declaring “findings”
        – All apparent associations followed up for replication, pathways etc.
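One common way to arrive at a threshold of this order is a Bonferroni-style correction, sketched below. The slide does not say how the P < 10^-7 figure was derived, so treat this only as an illustration: the count of 500,000 tests is an assumed round number, and the function names are hypothetical.

```python
def bonferroni_threshold(family_alpha: float, num_tests: int) -> float:
    """Per-test significance threshold that keeps the chance of any
    false positive across `num_tests` tests at or below `family_alpha`."""
    return family_alpha / num_tests


def expected_false_positives(per_test_alpha: float, num_true_nulls: int) -> float:
    """Expected number of 'significant' results among true null hypotheses."""
    return per_test_alpha * num_true_nulls


if __name__ == "__main__":
    num_tests = 500_000  # assumed number of variants tested, for illustration
    print(f"Bonferroni threshold: {bonferroni_threshold(0.05, num_tests):.1e}")
    print("Expected false positives at P < 0.05: "
          f"{expected_false_positives(0.05, num_tests):.0f}")
```

With that many tests, an uncorrected 0.05 cut-off would be expected to flag tens of thousands of null associations, which is why the stricter threshold and follow-up replication matter.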
    • Ways forward
      • Avoid use of “statistically significant”
        – Immediately shifts emphasis away from artificial dichotomisation (whether at 0.05 or anywhere else)
      • Change style of presentation, away from “findings” to incremental evidence
        – Think about directions and magnitudes rather than rejecting or accepting
      • Embrace more Bayesian inference
      • Full disclosure of all data and all data manipulations
      • Support and perform pre-registered replication?