The Legend of the P Value



ECONOMICS, EDUCATION, AND HEALTH SYSTEMS RESEARCH
SECTION EDITOR: RONALD D. MILLER

EDITORIAL

The Legend of the P Value

Zeev N. Kain, MD, MBA
Center for the Advancement of Perioperative Health and Department of Anesthesiology & Pediatrics & Child Psychiatry, Yale University School of Medicine, New Haven, Connecticut

Anesth Analg 2005;101:1454–6. DOI: 10.1213/01.ANE.0000181331.59738.66. ©2005 by the International Anesthesia Research Society. 0003-2999/05
Supported, in part, by National Institutes of Health grants NICHD, R01HD37007–02.
Accepted for publication June 16, 2005.
Address correspondence and reprint requests to Zeev N. Kain, MD MBA, Department of Anesthesiology, Yale University School of Med, 333 Cedar Street, New Haven, CT 06510. Address e-mail to zeev.kain@yale.edu.

Although there is a growing body of literature criticizing the use of mere statistical significance as a measure of clinical impact, much of this literature remains out of the purview of the discipline of anesthesiology. Currently, the magical boundary of P < 0.05 is a major factor in determining whether a manuscript will be accepted for publication or a research grant will be funded. Similarly, the Food and Drug Administration does not currently consider the magnitude of an advantage that a new drug shows over placebo. As long as the difference is statistically significant, a drug can be advertised in the United States as “effective” whether clinical trials proved it to be 10% or 200% more effective than placebo. We submit that if a treatment is to be useful to our patients, it is not enough for treatment effects to be statistically significant; they also need to be large enough to be clinically meaningful.

Unfortunately, physicians often misinterpret statistically significant results as showing clinical significance as well. One should realize, however, that with a large sample it is quite possible to have a statistically significant result between groups despite a minimal impact of treatment (i.e., small effect size). Also, study outcomes with lower P values are typically misinterpreted by physicians as having stronger effects than those with higher P values. That is, most clinicians agree that a result with a P value of 0.002 has a much greater treatment effect than a result with a P value of 0.045. Although this is true if the sample size is the same in both studies, it is not true if the sample size is larger in the study with the smaller P value. This is of particular concern when one realizes that most pharmaceutically funded studies have very large sample sizes and effect sizes are typically not reported in these types of studies. In the following editorial I highlight some of the issues related to this complex problem. Please note that a detailed discussion of the underlying statistics involved in this topic is beyond the scope of this editorial.

When examining the report of a clinical trial investigating a new treatment, clinicians should be interested in answering the following three basic questions:
1. Could the findings of the clinical trial be solely a result of a chance occurrence? (i.e., statistical significance)
2. How large is the difference between the primary end-points of the study groups? (i.e., impact of treatment, effect size)
3. Is the difference of primary end-points between groups meaningful to a patient? (i.e., clinical significance)

It was Sir Ronald A. Fisher, an extraordinarily influential British statistician, who first suggested the use of a boundary to accept or reject a null hypothesis, and he arbitrarily set this boundary at P < 0.05, where “P” stands for probability related to chance (1,2). That is, the level of statistical significance as defined by Fisher in 1925 and as used today refers to the probability that the difference between two groups would have occurred solely by chance (i.e., probability of 5 in 100 is reported as P < 0.05). Fisher’s emphasis on significance testing and the arbitrary boundary of P < 0.05 has been widely criticized over the past 80 yr. This criticism was based on the rationale that focusing on the P value does not take into account the size and clinical significance of the observed effect. That is, a small effect in a study with large sample size has the same P value as a large effect in a study with a small sample size. Also, the P value is commonly misinterpreted when there are multiple comparisons, in which case a traditional level of statistical significance of P < 0.05 is no longer valid. Fisher himself indicated some 25 yr after his initial publication that “If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05. . .” (3). Indeed, this
issue has been addressed in multiple recent review articles and editorials in the general medical and psychological literature (4–8).

In an attempt to address some of the limitations of the P value, the use of confidence intervals (CI) has been advocated by some clinicians (9). One should realize, however, that these two definitions of statistical significance are essentially reciprocal (10). That is, getting a P < 0.05 is the same as having a 95% CI that does not overlap zero. CIs can also, however, be used to estimate the size of difference between groups in addition to merely indicating the existence or absence of statistical significance (11). This latter approach, however, is not widely used in the medical and psychological literature, and today CIs are mostly used as surrogates for the hypothesis test rather than considering the full range of likely effect size.

The group of statistics called “effect sizes” designate indices that measure the magnitude of difference between groups, controlling for variation within the groups; effect sizes can be thought of as a standardized difference. In other words, although a P value denotes whether the difference between two groups in a particular study is likely to occur solely by chance, the effect size quantifies the amount of difference between the two groups. Quantification of effect size does not rely on sample size but instead relies on the strength of the intervention. There are a number of different types of effect sizes and a description of these various types and formulae is beyond the scope of this editorial. We refer the interested reader to review articles that describe the various types of effect sizes and their calculation methodology (12,13). Effect sizes of the d type are the most commonly used in the medical literature, as they are primarily used to compare two treatment groups. The d type effect size is defined as the magnitude of difference between two means, divided by the SD [(Mean of control group − Mean of treatment group)/SD of the control group]. Thus, the d effect size is dependent on variation within the control group and the differences between the control and intervention groups. Values of the d type effect sizes range from −∞ to +∞, where zero denotes no effect and values less than or more than zero are treated as absolute values when interpreting magnitude. Conventionally, d type effect sizes that are near 0.20 are interpreted as “small,” effect sizes near 0.50 are considered “medium,” and effect sizes in the range of 0.80 are considered “large” (14). However, interpretation of the magnitude of an effect size depends on the type of data gathered and the discipline involved. Effect sizes of another type, the risk potency type, include likelihood ratios such as odds ratio, risk ratio, risk difference, and relative risk reduction. Clinicians are probably more familiar with these less abstract statistics and it may be helpful to realize that likelihood statistics are a type of effect size.

Clinicians should be cautioned to not interpret magnitude of change (effect size) as an indication of clinical significance. The clinical significance of a treatment should be based on external standards provided by patients and clinicians. That is, a small effect size may still be clinically significant and, likewise, a large effect size may not be clinically significant, depending on what is being studied. Indeed, there is a growing recognition that traditional methods used, such as statistical significance tests and effect sizes, should be supplemented with methods for determining clinically significant changes. Although there is little consensus about the criteria for these efficacy standards, the most prominent definitions of clinically significant change include: 1) treated patients make a statistically reliable improvement in the change scores; 2) treated patients are empirically indistinguishable from a normal population after treatment; or 3) changes of at least one SD. The most frequently used method for evaluating the reliability of change scores is the Jacobson-Truax method in combination with clinical cutoff points (15). Using this method, change is considered reliable, or unlikely to be the product of measurement error, if the reliable change index (RCI) is more than 1.96. That is, when the individual has a change score more than 1.96, one can reasonably assume that the individual has improved.

Unfortunately, most of the methods above are difficult to adopt in the perioperative arena, as comparison with a normal population is not an option in most trials, and the RCI, which controls for statistical issues involving the assessment tool, is a somewhat complicated and controversial technique. Thus, clinical significance in the perioperative arena may be best assessed by posing a particular question such as “is a change of 8.5% reduction in intraoperative bleed clinically significant?” or “how many SD does this change represent?” Obviously, both of these questions have a subjective component in them and although it is traditionally agreed that at least a 1-SD change is generally needed for clinical significance, this boundary has no scientific underpinning. The validity of a clinical cutoff for these last two methods can be improved by establishing external validity (e.g., patient perspective) for the decision. For example, Flor et al. (16) have conducted a large meta-analysis that was aimed at evaluating the effectiveness of multidisciplinary rehabilitation for chronic pain. The investigators found that pain among the patients who received the intervention was indeed reduced by 25%. This reduction was certainly statistically significant and had an effect size of 0.7. Colvin et al. (17), however, reported earlier that patients would consider only a 50% improvement in their pain levels as a treatment “success.” Thus, in this example, a reduction of 25% in pain scores may be statistically, but not clinically, significant. Clearly this is a developing area that warrants further discussion.
In conclusion, we suggest that reporting of perioperative medical research should continue beyond reporting results consisting primarily of descriptive and statistically significant or nonsignificant findings. The interpretation of findings should occur in the context of the magnitude of change that occurred and the clinical significance of the findings.

References
1. Fisher RA. Statistical methods for research workers, 1st ed. Edinburgh: Oliver and Boyd, 1925. Reprinted by Oxford University Press.
2. Fisher RA. Design of experiments. 1st ed. Edinburgh: Oliver and Boyd, 1935. Reprinted by Oxford University Press.
3. Fisher RA. Statistical methods for research workers. London: Oliver and Boyd, 1950:80.
4. Borenstein M. Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol 1997;78:5–11.
5. Matthey S. P < 0.05: but is it clinically significant? Practical examples for clinicians. Behav Change 1998;15:140–6.
6. Cummings P, Rivara FP. Reporting statistical information in medical journal articles. Arch Pediatr Adolesc Med 2003;157:321–4.
7. Greenstein G. Clinical versus statistical significance as they relate to the efficacy of periodontal therapy. J Am Dent Assoc 2003;134:1168–70.
8. Sterne JAC, Smith GD, Cox DR. Sifting the evidence: what’s wrong with significance tests? Another comment on the role of statistical methods. BMJ 2001;322:226–31.
9. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med 1986;105:429–35.
10. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol 1998;51:355–60.
11. Gardner MG, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746–50.
12. Kirk R. Practical significance: A concept whose time has come. Educ Psychol Meas 1996;56:746–59.
13. Snyder P, Lawson S. Evaluating results using corrected and uncorrected effect size estimates. J Exper Educ 1993;61:334–349.
14. Cohen J. Statistical power analysis for the behavioral sciences, 2nd ed. Mahwah, New Jersey: Lawrence Erlbaum, 1988.
15. Jacobson NS, Truax P. Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. J Consult Clinic Psych 1991;59:12–9.
16. Flor H, Fydrich T, Turk DC. Efficacy of multidisciplinary pain treatment centers: a meta-analytic review. Clin J Pain 1992;49:221–30.
17. Colvin DF, Bettinger R, Knapp R, et al. Characteristics of patients with chronic pain. South Med J 1980;73:1020–3.