ASU Health & Medical Odds Ratio for Lorcaserin Producing Questions
Assignment Week 2, Risk Benefit

Age-, gender- and weight-matched patients were treated with lorcaserin, a new drug for weight
loss, or placebo, in conjunction with a diet and exercise program. The tables below summarize
results from different trials. A newer drug, semaglutide, was also studied.

Table 1 shows weight loss results for patients that completed the trial.

                                        Placebo   Lorcaserin
Weight loss > 10% body weight           243       748
Total participants completing trial     5083      5135

                                        Placebo   Semaglutide
Weight loss > 10% body weight           12        68
Total participants completing trial     655       1306

Table 2 shows some adverse events.

                        Placebo   Lorcaserin   Placebo   Semaglutide
Headache                15        37           81        198
Nausea                  19        35           114       544
Suicidal ideation       11        21           83        124
Total participants      5992      5995         655       1306

Table 3 shows outcomes for cardiovascular events for lorcaserin and semaglutide, another new
weight loss drug.

                        Placebo   Lorcaserin
Cardiovascular event    369       364
Total                   6000      6000

                        Placebo   Semaglutide
Cardiovascular event    70        107
Total                   655       1306

Analyze the benefits and risks of lorcaserin and semaglutide compared to placebo. In
particular, answer the following questions:

1. What is the odds ratio for lorcaserin producing greater than 10% weight loss?
2. What is the odds ratio for semaglutide producing greater than 10% weight loss?
3. What is the relative risk for each drug (lorcaserin, semaglutide) for:
   a. Headache RR=
   b. Nausea RR=
   c. Suicidal ideation RR=
4. What is the relative risk or relative risk reduction for major cardiovascular events for
   each drug (lorcaserin, semaglutide)? RRR=
5. Do you think the benefits of lorcaserin outweigh the risks given that the odds ratio for
   myocardial infarction is 1.44 for a patient with a BMI over 30, which all of these patients
   had at the beginning of the study?
6. What about the patients treated with semaglutide, who also had a BMI over 30 at the
   beginning of the study?
7. Which drug would you choose, or do you think neither has a good enough risk benefit ratio
   to be used?

CMAJ 2005: Tips for Learners of Evidence-Based
Medicine: A 5-Part Series

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, Moyer V, Guyatt G. Tips for learners
of evidence-based medicine: 1. relative risk reduction, absolute risk reduction and number
needed to treat. Can Med Assoc J 2004; 171:353–358.

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, Guyatt G. Tips for learners of
evidence-based medicine: 2. measures of precision (confidence intervals). Can Med Assoc J
2004; 171:611–615.

McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G. Tips for learners of
evidence-based medicine: 3. measures of observer variability (kappa statistic). Can Med Assoc
J 2004; 171:1369–1373.

Hatala R, Keitz S, Wyer P, Guyatt G. Tips for learners of evidence-based medicine: 4.
assessing heterogeneity of primary studies in systematic reviews and whether to combine their
results. Can Med Assoc J 2005; 172:661–665.

Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G. Tips for learners of evidence-based
medicine: 5. the effect of spectrum of disease on the performance of diagnostic tests. Can
Med Assoc J 2005; 173:385–390.

Review Synthèse

Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk
reduction and number needed to treat

Alexandra Barratt, Peter C. Wyer, Rose Hatala, Thomas McGinn, Antonio L. Dans, Sheri Keitz,
Virginia Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

See related article page 347

Physicians, patients and policy-makers are influenced not only
by the results of studies but also by how authors present the results.1–4 Depending on
which measures of effect authors choose, the impact of an intervention may appear very
large or quite small, even though the underlying data are the same. In this article we present
3 measures of effect — relative risk reduction, absolute risk reduction and number needed
to treat — in a fashion designed to help clinicians understand and use them. We have
organized the article as a series of “tips” or exercises. This means that you, the reader, will
have to do some work in the course of reading this article (we are assuming that most
readers are practitioners, as opposed to researchers and educators). The tips in this article
are adapted from approaches developed by educators with experience in teaching
evidence-based medicine skills to clinicians.5,6 A related article, intended for people who
teach these concepts to clinicians, is available online at
www.cmaj.ca/cgi/content/full/171/4/353/DC1.

DOI:10.1503/cmaj.1021197

Clinician learners’ objectives

Understanding risk and risk reduction
• Learn how to determine control and treatment event rates in published studies.
• Learn how to determine relative and absolute risk reductions from published studies.
• Understand how relative and absolute risk reductions usually apply to different populations.

Balancing benefits and adverse effects in individual patients
• Learn how to use a known relative risk reduction to estimate the risk of an event for a
  patient undergoing treatment, given an estimate of that patient’s risk of the event without
  treatment.
• Learn how to use absolute risk reductions to assess whether the benefits of therapy
  outweigh its harms.

Calculating and using number needed to treat
• Develop an understanding of the concept of number needed to treat (NNT) and how it is
  calculated.
• Learn how to interpret the NNT and develop an understanding of how the “threshold NNT”
  varies depending on the patient’s values and preferences, the severity of possible outcomes
  and the adverse effects (harms) of therapy.

Tip 1: Understanding risk and risk reduction
You can calculate relative and absolute risk
reductions using simple mathematical formulas (see Appendix 1). However, you might find
it easier to understand the concepts through visual presentation. Fig. 1A presents data from
a hypothetical trial of a new drug for acute myocardial infarction, showing the 30-day
mortality rate in a group of patients at high risk for the adverse event (e.g., elderly patients
with congestive heart failure and anterior wall infarction). On the basis of information in
Fig. 1A, how would you describe the effect of the new drug? (Hint: Consider the event rates in
people not taking the new drug and those who are taking it.)

We can describe the difference in mortality (event) rates in both relative and absolute terms.
In this case, these high-risk patients had a relative risk reduction of 25% and an absolute
risk reduction of 10%.

Now, let’s consider Fig. 1B, which shows the results of a second hypothetical trial of the
same new drug, but in a patient population with a lower risk for the outcome (e.g., younger
patients with uncomplicated inferior wall myocardial infarction). Looking at Fig. 1B, how
would you describe the effect of the new drug?

The relative risk reduction with the new drug remains at 25%, but the event rate is lower in
both groups, and hence the absolute risk reduction is only 2.5%.

Although the relative risk reduction might be similar across different risk groups (a safe
assumption in many if not most cases7,8), the absolute gains, represented by absolute risk
reductions, are not. In sum, the absolute risk reduction becomes smaller when event rates are
low, whereas the relative risk reduction, or “efficacy” of the treatment, often remains
constant.

These phenomena may be factors in the design of drug trials. For example, a drug may be tested
in severely affected people in whom the absolute risk reduction is likely to be impressive,
but is subsequently marketed for use by less severely affected patients, in whom the absolute
risk reduction will be substantially less.

The bottom line
Relative risk reduction is often more impressive than absolute risk reduction. Furthermore,
the lower the event rate in the control group, the larger the difference between relative
risk reduction and absolute risk reduction.

Risk and risk reduction: definitions
Event rate: the number of people experiencing an event as a proportion of the number of
people in the population
Relative risk reduction: the difference in event rates between 2 groups, expressed as a
proportion of the event rate in the untreated group; usually constant across populations with
different risks7,8
Absolute risk reduction: the arithmetic difference between 2 event rates; varies with the
underlying risk of an event in the individual patient

[Fig. 1, panels A and B: bar charts of placebo versus treatment; y-axis: Risk for outcome of
interest, %]

Panel A (Trial 1: high-risk patients): Among high-risk patients in trial 1, the event rate in
the control group (placebo) is 40 per 100 patients, and the event rate in the treatment group
is 30 per 100 patients. Absolute risk reduction (also called the risk difference) is the
simple difference in the event rates (40% – 30% = 10%). Relative risk reduction is the
difference between the event rates in relative terms. Here, the event rate in the treatment
group is 25% less than the event rate in the control group (i.e., the 10% absolute difference
expressed as a proportion of the control rate is 10/40 or 25% less).

Panel B (Trial 1: high-risk patients; Trial 2: low-risk patients): Among low-risk patients in
trial 2, the event rate in the control group (placebo) is only 10%. If the treatment is just
as effective in these low-risk patients, what event rate can we expect in the treatment
group? The event rate in the treated group would be 25% less than in the control group or
7.5%. Therefore, the absolute risk reduction for the low-risk patients (second pair of
columns) is only 2.5%, even though the relative risk reduction is the same as for the
high-risk patients (first pair of columns).

Fig. 1: Results of hypothetical placebo-controlled trials of a new drug for acute myocardial
infarction. The bars represent the 30-day mortality rate in different groups of patients with
acute myocardial infarction and heart failure. A: Trial involving patients at high risk for
the adverse outcome. B: Trials involving a group of patients at high risk for the adverse
outcome and another group of patients at low risk for the adverse outcome.

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article
online at www.cmaj.ca/cgi/content/full/171/4/353/DC1. It contains the exercises found in this
article in fill-in-the-blank format, commentaries from the authors on the challenges they
encounter when teaching these concepts to clinician learners and links to useful online
resources.

CMAJ • AUG. 17, 2004; 171 (4) © 2004 Canadian Medical Association or its licensors

Tip 2: Balancing benefits and adverse effects in individual patients
In prescribing medications or
other treatments, physicians consider both the potential benefits and the potential harms.
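
Before weighing harms, it may help to see the Tip 1 arithmetic written out. The following is a minimal Python sketch, not part of the original article: the function name `risk_reductions` is an illustrative assumption, and the event rates are the hypothetical ones shown in Fig. 1.

```python
def risk_reductions(control_rate: float, treatment_rate: float) -> tuple[float, float]:
    """Return (absolute risk reduction, relative risk reduction) as proportions."""
    arr = control_rate - treatment_rate   # simple difference in event rates
    rrr = arr / control_rate              # that difference relative to the control rate
    return arr, rrr


# Hypothetical event rates from Fig. 1, as (control, treatment) proportions.
trials = {
    "Trial 1 (high risk)": (0.40, 0.30),
    "Trial 2 (low risk)": (0.10, 0.075),
}

for name, (control, treated) in trials.items():
    arr, rrr = risk_reductions(control, treated)
    print(f"{name}: ARR = {arr:.1%}, RRR = {rrr:.0%}")
```

Both trials yield the same relative risk reduction (25%), while the absolute risk reduction falls from 10% to 2.5% as the control event rate drops.
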
We have just demonstrated that the benefits of treatment (presented as absolute risk
reductions) will generally be greater in patients at higher risk of adverse outcomes than in
patients at lower risk of adverse outcomes. You must now incorporate the possibility of
harm into your decision-making. First, you need to quantify the potential benefits. Assume
you are managing 2 patients for high blood pressure and are considering the use of a new
antihypertensive drug, drug X, for which the relative risk reduction for stroke over 3 years
is 33%, according to published randomized controlled trials. Pat is a 69-year-old woman
whose blood pressure during a routine examination is 170/100 mm Hg; her blood pressure
remains unchanged when you see her again 3 weeks later. She is otherwise well and has no
history of cardiovascular or cerebrovascular disease. You assess her risk of stroke at about
1% (or 1 per 100) per year.9 Dorothy is also 69 years of age, and her blood pressure is the
same as Pat’s, 170/100 mm Hg; however, because she had a stroke recently, you assess her
risk of subsequent stroke as higher than Pat’s, perhaps 10% per year.10 One way of
determining the potential benefit of a new treatment is to complete a benefit table such as
Table 1A. To do this, insert your estimated 3-year event rates for Pat and Dorothy, and then
apply the relative risk reduction (33%) expected if they take drug X. It is clear from Table
1A that the absolute risk reduction for patients at higher risk (such as Dorothy) is much
greater than for those at lower risk (such as Pat). Now, you need to factor the potential
harms (adverse effects associated with using the drug) into the clinical decision. In the
clinical trials of drug X, the risk of severe gastric bleeding increased 3-fold over 3 years in
patients who received the drug (relative risk of 3). A population-based study has reported
the risk of severe gastric bleeding for women in your patients’ age group at about 0.1% per
year (regardless of their risk of stroke). These data can now be added to the table to allow a
more balanced assessment of the benefits and harms that could arise from treatment (Table
1B). Considering the results of this process, would you give drug X to Pat, to Dorothy or to
both? In making your decisions, remember that there is not necessarily one “right answer”
here. Your analysis might go something like this: Pat will experience a small benefit (absolute risk
reduction over 3 years of about 1%), but this will be considerably offset by the increased
risk of gastric bleeding (absolute risk increase over 3 years of 0.6%). The potential benefit
for Dorothy (absolute risk reduction over 3 years of about 10%) is much greater than the
increased risk of harm (absolute risk increase over 3 years of 0.6%). Therefore, the benefit
of treatment is likely to be greater for Dorothy (who is at higher risk of stroke) than for Pat
(who is at lower risk). Assessment of the balance between benefits and harms depends on
the value that patients place on reducing their risk of stroke in relation to the increased risk
of gastric bleeding. Many patients might be much more concerned about the former than
the latter.

Table 1A: Benefit table*

3-yr event rate for stroke, %
Patient group                     No treatment   With treatment   Absolute risk reduction, %
                                                 (drug X)         (no treatment – treatment)
At lower risk (e.g., Pat)         3              2                1
At higher risk (e.g., Dorothy)    30             20               10

*Based on data from a randomized controlled trial of drug X, which reported a 33% relative
risk reduction for the outcome (stroke) over 3 years.

Table 1B: Benefit and harm table*

3-yr event rate for stroke, %
Patient group                     No treatment   With treatment   Absolute risk reduction
                                                 (drug X)         (no treatment – treatment)
At lower risk (e.g., Pat)         3              2                1
At higher risk (e.g., Dorothy)    30             20               10

3-yr event rate for severe gastric bleeding, %
Patient group                     No treatment   With treatment   Absolute risk increase
                                                 (drug X)         (treatment – no treatment)
At lower risk (e.g., Pat)         0.3            0.9              0.6
At higher risk (e.g., Dorothy)    0.3            0.9              0.6

*Based on data from randomized controlled trials of drug X reporting a 33% relative risk
reduction for the outcome (stroke) over 3 years and a 3-fold increase for the adverse effect
(severe gastric bleeding) over the same period.

The bottom line
When available, trial data regarding relative risk reductions (or increases), combined with
estimates of baseline (untreated) risk in individual patients, provide the basis for
clinicians to balance the benefits and harms of therapy for their patients.

Number needed to treat: definitions
Number needed to treat: the number of patients who would have to receive the treatment for 1
of them to benefit; calculated as 100 divided by the absolute risk reduction expressed as a
percentage (or 1 divided by the absolute risk reduction expressed as a proportion; see
Appendix 1)
Number needed to harm: the number of patients who would have to receive the treatment for 1
of them to experience an adverse effect; calculated as 100 divided by the absolute risk
increase expressed as a percentage (or 1 divided by the absolute risk increase expressed as a
proportion)

Tip 3: Calculating and using number needed to treat
Some
physicians use another measure of risk and benefit, the number needed to treat (NNT), in
considering the consequences of treating or not treating. The NNT is the number of patients
to whom a clinician would need to administer a particular treatment to prevent 1 patient
from having an adverse outcome over a predefined period of time. (It also reflects the
likelihood that a particular patient to whom treatment is administered will benefit from it.)
If, for example, the NNT for a treatment is 10, the practitioner would have to give the
treatment to 10 patients to prevent 1 patient from having the adverse outcome over the
defined period, and each patient who received the treatment would have a 1 in 10 chance of
being a beneficiary. If the absolute risk reduction is large, you need to treat only a small
number of patients to observe a benefit in at least some of them. Conversely, if the absolute
risk reduction is small, you must treat many people to observe a benefit in just a few. An
analogous calculation to the one used to determine the NNT can be used to determine the
number of patients who would have to be treated for 1 patient to experience an adverse
event. This is the number needed to harm (NNH), which is the inverse of the absolute risk
increase. How comfortable are you with estimating the NNT for a given treatment? For
example, consider the following questions: How many 60-year-old patients with
hypertension would you have to treat with diuretics for a period of 5 years to prevent 1
death? How many people with myocardial infarction would you have to treat with
β-blockers for 2 years to prevent 1 death? How many people with acute myocardial
infarction would you have to treat with streptokinase to prevent 1 person from dying in the
next 5 weeks? Compare your answers with estimates derived from published studies (Table
2). How accurate were your estimates? Are you surprised by the size of the NNT values?
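
For readers who want to check the published estimates, the NNT values in Table 2 can be reproduced from the control and treatment event rates. This is a hedged sketch rather than anything from the original article: the function name `nnt_from_event_rates` and the dictionary labels are illustrative assumptions, and rounding up to a whole patient follows the usual convention for reporting NNT.

```python
import math


def nnt_from_event_rates(control_pct: float, treatment_pct: float) -> int:
    """NNT from two event rates given as percentages, rounded up to a whole patient."""
    arr_pct = control_pct - treatment_pct  # absolute risk reduction, in percent
    # Round to 6 decimals first so floating-point noise is not rounded up to the next patient.
    return math.ceil(round(100 / arr_pct, 6))


# (control %, treatment %) event rates as listed in Table 2.
table_2 = {
    "diuretics, 5 years": (2.9, 1.9),
    "beta-blockers, 2 years": (9.8, 7.3),
    "streptokinase, 5 weeks": (12.0, 9.2),
}

for label, (control, treated) in table_2.items():
    print(f"{label}: NNT = {nnt_from_event_rates(control, treated)}")
```

The three results match the NNT column of Table 2 (100, 40 and 36).
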
Physicians often experience problems in this type of exercise, usually because they are
unfamiliar with the calculation of NNT. Here is one way to think about it. If a disease has a
mortality rate of 100% without treatment and therapy reduces that mortality rate to 50%,
how many people would you need to treat to prevent 1 death? From the numbers given, you
can probably figure out that treating 100 patients with the otherwise fatal disease results in
50 survivors. This is equivalent to 1 out of every 2 treated. Since all were destined to die, the
NNT to prevent 1 death is 2. The formula reflected in this calculation is as follows: the NNT
to prevent 1 adverse outcome equals the inverse of the absolute risk reduction. Table 3
illustrates this concept further. Note that, if the absolute risk reduction is presented as a
percentage, the NNT is 100/absolute risk reduction; if the absolute risk reduction is
expressed as a proportion, the NNT is 1/absolute risk reduction. Both methods give the same
answer, so use whichever you find easier.

Table 2: Benefit table for patients with cardiovascular problems

                                                        Event rate, %
Clinical question                                       Control   Treatment   ARR, %   NNT
                                                        group     group
What is the reduction in risk of stroke within 5
years among 60-year-old patients with hypertension
who are treated with diuretics?11                       2.9       1.9         1.00     100
What is the reduction in risk of death within 2
years after MI among 60-year-old patients treated
with β-blockers?12                                      9.8       7.3         2.50     40
What is the reduction in risk of death within 5
weeks after acute MI among 60-year-old patients
treated with streptokinase?13                           12.0      9.2         2.80     36

Note: MI = myocardial infarction, ARR = absolute risk reduction, NNT = number needed to treat.

Table 3: Calculation of NNT from absolute risk reduction*

Form of absolute risk reduction   Calculation of NNT   Example
Percentage (e.g., 2.8%)           100/ARR              100/2.8 = 36
Proportion (e.g., 0.028)          1/ARR                1/0.028 = 36

*Using absolute risk reduction in last row of Table 2.13

It can be challenging
for clinicians to estimate the baseline risks for specific populations. For example, some
physicians may have little idea of the risk of stroke over 5 years among patients with
hypertension. Physicians may also overestimate the effect of treatment, which leads them to
ascribe larger absolute risk reductions and smaller NNT values than are actually the case.14
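
To see how strongly the NNT depends on the assumed baseline risk, the two patients from Tip 2 can be run through the same arithmetic. The sketch below is illustrative only: the function name `treatment_tradeoff` and its parameter names are assumptions, and the inputs are the drug X figures quoted in Tip 2 (33% relative risk reduction for stroke, a 3-fold relative risk of severe gastric bleeding, and a 0.3% 3-year baseline bleeding risk).

```python
def treatment_tradeoff(baseline_risk: float, rrr: float,
                       harm_baseline: float, harm_relative_risk: float) -> dict:
    """Benefit and harm of a treatment for one patient; all risks are proportions."""
    treated_risk = baseline_risk * (1 - rrr)               # apply the relative risk reduction
    arr = baseline_risk - treated_risk                     # absolute risk reduction
    ari = harm_baseline * harm_relative_risk - harm_baseline  # absolute risk increase
    return {"arr": arr, "nnt": 1 / arr, "ari": ari, "nnh": 1 / ari}


# 3-year baseline stroke risks assumed in Tip 2: 3% for Pat, 30% for Dorothy.
patients = {"Pat": 0.03, "Dorothy": 0.30}

for name, baseline in patients.items():
    result = treatment_tradeoff(baseline, rrr=0.33,
                                harm_baseline=0.003, harm_relative_risk=3.0)
    print(f"{name}: ARR = {result['arr']:.1%}, NNT ~ {result['nnt']:.0f}, "
          f"ARI = {result['ari']:.1%}, NNH ~ {result['nnh']:.0f}")
```

A tenfold difference in baseline risk produces a roughly tenfold difference in NNT, while the number needed to harm is the same for both patients. This is why an inflated estimate of baseline risk or treatment effect translates directly into an NNT that is too small.
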
Now that you know how to determine the NNT from the absolute risk reduction, you must
also consider whether the NNT is reasonable. In other words, what is the maximum NNT
that you and your patients will accept as justifying the benefits and harms of therapy? This
is referred to as the threshold NNT.15 If the calculated NNT is above the threshold, the
benefits are not large enough (or the risk of harm is too great) to warrant initiating the
therapy. Determinants of the threshold NNT include the patient’s own values and
preferences, the severity of the outcome that would be prevented, and the costs and side
effects of the intervention. Thus, the threshold NNT will almost certainly be different for
different patients, and there is no simple answer to the question of when an NNT is
sufficiently low to justify initiating treatment.

The bottom line
NNT is a concise, clinically useful presentation of the effect of an intervention. You can
easily calculate it from the absolute risk reduction (just remember to check whether the
absolute risk reduction is presented as a percentage or a proportion and use a numerator of
100 or 1 accordingly). Be careful not to overestimate the effect of treatments (i.e., use a
value of absolute risk reduction that is too high) and thus underestimate the NNT.

Conclusions
Clinicians seeking to apply clinical evidence to the care of individual patients need to
understand and be able to calculate relative risk reduction, absolute risk reduction and NNT
from data presented in clinical trials and systematic reviews. We have described and defined
these concepts and presented tabular tools and equations to help clinicians overcome common
pitfalls in acquiring these skills.

This article has been peer reviewed.

From the School of Public Health,
University of Sydney, Sydney, Australia (Barratt); the Columbia University College of
Physicians and Surgeons, New York, NY (Wyer); the Department of Medicine, University of
British Columbia, Vancouver, BC (Hatala); Mount Sinai Medical Center, New York, NY
(McGinn); the Department of Internal Medicine, University of the Philippines College of
Medicine, Manila, The Philippines (Dans); Durham Veterans Affairs Medical Center and
Duke University Medical Center, Durham, NC (Keitz); the Department of Pediatrics,
University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine and of Clinical
Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt) Competing
interests: None declared. Contributors: Alexandra Barratt contributed tip 2, drafted the
manuscript, coordinated input from coauthors and reviewers and from field-testing and
revised all drafts. Peter Wyer edited drafts and provided guidance in developing the final
format. Rose Hatala contributed tip 1, coordinated the internal review process and provided
comments throughout development of the manuscript. Thomas McGinn contributed tip 3
and provided comments throughout development of the manuscript. Antonio Dans
reviewed all drafts and provided comments throughout development of the manuscript.
Sheri Keitz conducted field-testing of the tips and contributed material from the field-
testing to the manuscript. Virginia Moyer reviewed and contributed to the final version of
the manuscript. Gordon Guyatt helped to write the manuscript (as an editor and coauthor).
References
1. Malenka DJ, Baron JA, Johansen S, Wahrenberger JW, Ross JM. The framing effect of relative
   and absolute risk. J Gen Intern Med 1993;8:543-8.
2. Forrow L, Taylor WC, Arnold RM. Absolutely relative: How research results are summarized
   can affect treatment decisions. Am J Med 1992;92:121-4.
3. Naylor CD, Chen E, Strauss B. Measured enthusiasm: Does the method of reporting trial
   results alter perceptions of therapeutic effectiveness? Ann Intern Med 1992;117:916-21.
4. Fahey T, Griffiths S, Peters TJ. Evidence based purchasing: understanding results of
   clinical trials and systematic reviews. BMJ 1995;311:1056-60.
5. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures of
   association. In: Guyatt G, Rennie D, editors. The users’ guides to the medical literature:
   a manual of evidence-based clinical practice. Chicago: AMA Publications; 2002. p. 351-68.
6. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and
   teaching evidence-based medicine: introduction to the series. CMAJ 2004;171(4):347-8.
7. Schmid CH, Lau J, McIntosh MW, Cappelleri JC. An empirical study of the effect of the
   control rate as a predictor of treatment efficacy in meta-analysis of clinical trials.
   Stat Med 1998;17:1923-42.
8. Furukawa TA, Guyatt GH, Griffith LE. Can we individualise the number needed to treat? An
   empirical study of summary effect measures in meta-analyses. Int J Epidemiol 2002;31:72-6.
9. SHEP Cooperative Research Group. Prevention of stroke by anti-hypertensive drug treatment
   in older persons with isolated systolic hypertension. Final results of the Systolic
   Hypertension in the Elderly Program (SHEP). JAMA 1991;265:3255-64.
10. SALT Collaborative Group. Swedish Aspirin Low-dose Trial (SALT) of 75mg aspirin as
    secondary prophylaxis after cerebrovascular events. Lancet 1991;338:1345-9.
11. Psaty BM, Smith NL, Siscovick DS, Koepsell TD, Weiss NS, Heckbert SR. Health outcomes
    associated with antihypertensive therapies used as first-line agents. A systematic review
    and meta-analysis. JAMA 1997;277:739-45.
12. β-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol in
    patients with acute myocardial infarction. I. Mortality results. JAMA 1982;247:1707-14.
13. ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin,
    both or neither among 17 187 cases of suspected acute myocardial infarction: ISIS-2.
    Lancet 1988;2:349-60.
14. Chatellier G, Zapletal E, Lemaitre D, Menard J, Degoulet P. The number needed to treat: a
    clinically useful nomogram in its proper context. BMJ 1996;312:426-9.
15. Sinclair JC, Cook RJ, Guyatt GH, Pauker SG, Cook DJ. When should an effective treatment
    be used? Derivation of the threshold number needed to treat and the minimum event rate
    for treatment. J Clin Epidemiol 2001;54:253-62.

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 212
305-6792; pwyer@worldnet.att.net

Members of the Evidence-Based Medicine Teaching Tips Working
Group: Peter C. Wyer (project director), Columbia University College of Physicians and
Surgeons, New York, NY; Deborah Cook, Gordon Guyatt (general editor), Ted Haines, Roman
Jaeschke, McMaster University, Hamilton, Ont.; Rose Hatala (internal review coordinator),
Department of Medicine, University of British Columbia, Vancouver, BC; Robert Hayward
(editor, online version), Bruce Fisher, University of Alberta, Edmonton, Alta.; Sheri Keitz
(field-test coordinator), Durham Veterans Affairs Medical Center and Duke University,
Durham, NC; Alexandra Barratt, University of Sydney, Sydney, Australia; Pamela Charney,
Albert Einstein College of Medicine, Bronx, NY; Antonio L. Dans, University of the
Philippines College of Medicine, Manila, The Philippines; Barnet Eskin, Morristown
Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory University, Atlanta, Ga.; Hui
Lee, formerly Group Health Centre, Sault Ste. Marie, Ont. (deceased); Rosanne Leipzig,
Thomas McGinn, Mount Sinai Medical Center, New York, NY; Victor M. Montori, Department
of Medicine, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia Moyer, University of
Texas, Houston, Tex.; Thomas B. Newman, University of California, San Francisco, Calif.; Jim
Nishikawa, University of Ottawa, Ottawa, Ont.; W. Scott Richardson, Wright State University,
Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa

Appendix 1: Formulas for commonly used measures of therapeutic effect

Measure of effect          Formula
Relative risk              (Event rate in intervention group) ÷ (event rate in control group)
Relative risk reduction    1 – relative risk
                           or (Absolute risk reduction) ÷ (event rate in control group)
Absolute risk reduction    (Event rate in control group) – (event rate in intervention group)
Number needed to treat     1 ÷ (absolute risk reduction)

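The Appendix 1 formulas can also be applied directly to trial counts such as those in the Week 2 assignment tables at the top of this document. The sketch below is an illustration under stated assumptions, not the article's own method: the function names are hypothetical, and the odds ratio is not part of Appendix 1. It is included only because the assignment asks for it, using the standard definition (odds of the event in the treated group divided by odds in the control group).

```python
def event_rate(events: int, total: int) -> float:
    """Proportion of participants who experienced the event."""
    return events / total


def relative_risk(events_t: int, total_t: int, events_c: int, total_c: int) -> float:
    """Event rate in the intervention group divided by event rate in the control group."""
    return event_rate(events_t, total_t) / event_rate(events_c, total_c)


def odds_ratio(events_t: int, total_t: int, events_c: int, total_c: int) -> float:
    """Odds of the event with treatment divided by odds of the event with control."""
    odds_t = events_t / (total_t - events_t)
    odds_c = events_c / (total_c - events_c)
    return odds_t / odds_c


# Assignment Table 3, semaglutide vs placebo: cardiovascular events / total participants.
rr = relative_risk(107, 1306, 70, 655)
rrr = 1 - rr                                        # relative risk reduction
arr = event_rate(70, 655) - event_rate(107, 1306)   # control rate minus intervention rate
nnt = 1 / arr                                       # number needed to treat

# Assignment Table 1, lorcaserin vs placebo: >10% weight loss / participants completing trial.
or_weight_loss = odds_ratio(748, 5135, 243, 5083)

print(f"RR = {rr:.3f}, RRR = {rrr:.1%}, ARR = {arr:.2%}, NNT = {nnt:.1f}")
print(f"Odds ratio for >10% weight loss with lorcaserin = {or_weight_loss:.2f}")
```

Working from raw counts rather than pre-computed percentages avoids rounding the event rates before they are divided, which matters when event rates are small.
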
Review Synthèse

Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals)

Victor M. Montori, Jennifer Kleinbart, Thomas B. Newman, Sheri Keitz, Peter C. Wyer, Virginia
Moyer, Gordon Guyatt, for the Evidence-Based Medicine Teaching Tips Working Group

DOI:10.1503/cmaj.1031667

In the first article in this series,1
we presented an approach to understanding how to estimate a treatment’s effectiveness
that covered relative risk reduction, absolute risk reduction and number needed to treat.
But how precise are these estimates of treatment effect? In reading the results of clinical
trials, clinicians often come across 2 related but different statistical measures of an
estimate’s precision: p values and confidence intervals. The p value describes how often
apparent differences in treatment effect that are as large as or larger than those observed in
a particular trial will occur in a long run of identical trials if in fact no true effect exists. If the
observed differences are sufficiently unlikely to occur by chance alone, investigators reject
the hypothesis that there is no effect. For example, consider a randomized trial comparing
diuretics with placebo that finds a 25% relative risk reduction for stroke with a p value of
0.04. This p value means that, if diuretics were in fact no different in effectiveness than
placebo, we would expect, by the play of chance alone, to observe a reduction — or increase
— in relative risk of 25% or more in 4 out of 100 identical trials. Although they are useful
for investigators planning how large a study needs to be to demonstrate a particular
magnitude of effect, p values fail to provide clinicians and patients with the information
they most need, i.e., the range of values within which the true effect is likely to reside.
However, confidence intervals provide exactly that information in a form that pertains
directly to the process of deciding whether to administer a therapy to patients. If the range
of possible true effects encompassed by the confidence interval is overly wide, the clinician
may choose to administer the therapy only selectively or not at all. Confidence intervals are
therefore the topic of this article. For a nontechnical explanation of p values and their
limitations, we refer interested readers to the Users’ Guides to the Medical Literature.2 As
with the first article in this series,1 we present the information as a series of “tips” or
exercises. This means that you, the reader, will have to do some work in the course of
reading the article. The tips we present here have been adapted from approaches developed
by educators experienced in teaching evidence-based medicine skills to clinicians.2-4 A
related article, intended for people who teach these concepts to clinicians, is available
online at www.cmaj.ca/cgi/content/full/171/6/611/DC1.

Clinician learners’ objectives

Making confidence intervals intuitive
• Understand the dynamic relation between confidence intervals and sample size.

Interpreting confidence intervals
• Understand how the confidence intervals around estimates of treatment effect can affect
  therapeutic decisions.

Estimating confidence intervals for extreme proportions
• Learn a shortcut for estimating the upper limit of the 95% confidence intervals for
  proportions with very small numerators and for proportions with numerators very close to
  the corresponding denominators.

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article
online at www.cmaj.ca/cgi/content/full/171/6/611/DC1. It contains the exercises found in
this article in fill-in-the-blank format, commentaries from the authors on the challenges
they encounter when teaching these concepts to clinician learners and links to useful online
resources.

CMAJ • SEPT. 14, 2004; 171 (6) © 2004 Canadian Medical Association or its licensors

Tip 1: Making confidence intervals intuitive

Imagine a hypothetical series of 5 trials (of equal duration but different sample sizes) in
which investigators have experimented with treatments for patients who have a particular
condition (elevated low-density lipoprotein cholesterol) to determine whether a drug (a novel
cholesterol-lowering agent) would work better than a placebo to prevent strokes (Table 1A).
The smallest trial enrolled only 8 patients, and the largest enrolled 2000 patients, and half
of the patients in each trial underwent the experimental treatment. Now
imagine that all of the trials showed a relative risk reduction for the treatment group of
50% (meaning that patients in the drug treatment group were only half as likely as those in
the placebo group to have a stroke). In each individual trial, how confident can we be that
the true value of the relative risk reduction is important for patients (i.e., “patient-
important”)?5 If you were to look at the studies individually, which ones would lead you to
recommend the treatment unequivocally to your patients? Most clinicians might intuitively
guess that we could be more confident in the results of the larger trials. Why is this? In the
absence of bias or systematic error, the results of a trial can be interpreted as an estimate of
the true magnitude of effect that would occur if all possible eligible patients had been
included. When only a few of these patients are included, the play of chance alone may lead
to a result that is quite different from the true value. Confidence intervals are a numeric
measure of the range within which such variation is likely to occur. The 95% confidence
intervals that we often see in biomedical publications represent the range within which we
are likely to find the underlying true treatment effect. To gain a better appreciation of
confidence intervals, go back to Table 1A (don’t look yet at Table 1B!) and take a guess at
what you think the confidence intervals might be for the 5 trials presented. In a moment
you’ll see how your estimates compare to 95% confidence intervals calculated using a formula,
but for now, try figuring out intervals that you intuitively feel to be appropriate.

Table 1A: Relative risk and relative risk reduction observed in 5 successively larger
hypothetical trials

Control event rate   Treatment event rate   Relative risk, %   Relative risk reduction, %*
2/4                  1/4                    50                 50
10/20                5/20                   50                 50
20/40                10/40                  50                 50
50/100               25/100                 50                 50
500/1000             250/1000               50                 50

*Calculated as the absolute difference between the control and treatment event rates
(expressed as a fraction or a percentage), divided by the control event rate. In the first
row in this table, relative risk reduction = (2/4 – 1/4) ÷ 2/4 = 1/2 or 50%. If the control
event rate were 3/4 and the treatment event rate 1/4, the relative risk reduction would be
(3/4 – 1/4) ÷ 3/4 = 2/3. Using percentages for the same example, if the control event rate
were 75% and the treatment event rate were 25%, the relative risk reduction would be
(75% – 25%) ÷ 75% = 67%.

Now, consider the first trial, in which 2 out of 4
patients who receive the control intervention and 1 out of 4 patients who receive the
experimental treatment suffer a stroke. The risk in the treatment group is half that in the
control group, which gives us a relative risk of 50% and a relative risk reduction of 50%
(see Table 1A).1,6 Given the substantial relative risk reduction, would you be ready to
recommend this treatment to a patient? Before you answer this question, consider whether
it is plausible, with so few patients in the study, that the investigators might just have gotten
lucky and the true treatment effect is really a 50% increase in relative risk. In other words,
is it plausible that the true event rate in the group that received treatment was 3 out of 4
instead of 1 out of 4? If you accept that this large, harmful effect might represent the
underlying truth, would you also accept that a relative risk reduction of 90%, i.e., a very
large benefit of treatment, is consistent with the experimental data in these few patients? To
the extent that these suggestions are plausible, we can intuitively create a range of plausible
truth of “-50% to 90%” surrounding the relative risk reduction of 50% that was actually
observed. Now, do this for each of the other 4 trials. In the trial with 20 patients in each
group, 10 of those in the control group suffered a stroke, as did 5 of those in the treatment
group. Both the relative risk and the relative risk reduction are again 50%. Do you still
consider it plausible that the true event rate in the treatment group is 15 out of 20 rather
than 5 out of 20 (the same proportions as we considered in the smaller trial)? If not, what
about 12 out of 20? The latter would represent a 20% increase in risk over the control rate
(12/20 v. 10/20). A true relative risk reduction of 90% may still be plausible, given the
observed results and the numbers of patients involved. In short, given this larger number of
patients and the lower chance of a “bad sample,” the “range of plausible truth” around the
observed relative risk reduction of 50% might be narrower, perhaps from a relative risk
increase of 20% (represented as –20%) to a relative risk reduction of 90%. You can develop
similar intuitively derived confidence intervals for the larger trials. We’ve done this in
Table 1B, which also shows the 95% confidence intervals that we calculated using a
statistical program called StatsDirect (available commercially through www.statsdirect.com).
You can see that in some instances we intuitively overestimated or underestimated the
intervals relative to those we derived using the statistical formulas.

Table 1B: Confidence intervals (CIs) around the relative risk reduction in 5 successively
larger hypothetical trials

Control      Treatment    Relative   Relative risk   CI around relative risk reduction, %
event rate   event rate   risk, %    reduction, %    Intuitive CI*   Calculated 95% CI*†
2/4          1/4          50         50              –50 to 90       –174 to 92
10/20        5/20         50         50              –20 to 90       –14 to 79.5
20/40        10/40        50         50              0 to 90         9.5 to 73.4
50/100       25/100       50         50              20 to 80        26.8 to 66.4
500/1000     250/1000     50         50              40 to 60        43.5 to 55.9

*Negative values represent an increase in risk relative to control. See text for further
explanation.
†Calculated by statistical software.

The bottom line

Confidence intervals inform clinicians about the range within which the true treatment effect
might plausibly lie, given the trial data.
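
As a rough check on the calculated column of Table 1B, the sketch below approximates a 95%
confidence interval for the relative risk reduction using the common log (Katz) method for a
risk ratio. That method is an assumption made here for illustration: the article says only
that its intervals came from statistical software, which may use a different algorithm, so
the results agree closely for the largest trial and only loosely for the smallest. The
function name rrr_ci is hypothetical.

```python
import math

def rrr_ci(control_events, control_n, treat_events, treat_n, z=1.96):
    """Approximate 95% CI for relative risk reduction (in %), log (Katz) method.

    Assumption: a textbook approximation, not necessarily the algorithm
    behind the calculated column of Table 1B.
    Returns (lower bound, point estimate, upper bound).
    """
    rr = (treat_events / treat_n) / (control_events / control_n)
    # Standard error of ln(RR) for two independent proportions
    se = math.sqrt(1 / treat_events - 1 / treat_n
                   + 1 / control_events - 1 / control_n)
    rr_low = rr * math.exp(-z * se)
    rr_high = rr * math.exp(z * se)
    # RRR = 1 - RR, so the bounds swap; negative values mean increased risk
    return (1 - rr_high) * 100, (1 - rr) * 100, (1 - rr_low) * 100

# Event counts from Table 1A:
# (control events, control n, treatment events, treatment n)
trials = [(2, 4, 1, 4), (10, 20, 5, 20), (20, 40, 10, 40),
          (50, 100, 25, 100), (500, 1000, 250, 1000)]

for c_e, c_n, t_e, t_n in trials:
    low, point, high = rrr_ci(c_e, c_n, t_e, t_n)
    print(f"{c_e}/{c_n} v. {t_e}/{t_n}: "
          f"RRR {point:.0f}% (95% CI {low:.1f} to {high:.1f})")
```

The point estimate is 50% in every row, but the interval narrows steadily as the number of
patients and events grows.
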
Greater precision (narrower confidence intervals) results from larger sample sizes and
consequent larger number of events. Statisticians (and statistical software) can calculate
95% confidence intervals around any estimate of treatment effect.

Tip 2: Interpreting confidence intervals

You should now have an understanding of the relation between the width of the confidence
interval around a measure of outcome in a clinical trial and the number of participants and
events in that study. You are ready to consider whether a study is sufficiently large, and
the resulting confidence intervals sufficiently narrow, to reach a definitive conclusion
about recommending the therapy, after taking into account your patient’s values, preferences
and circumstances. The concept of a minimally important treatment effect proves useful in
considering the issue of when a study is large enough and has therefore generated confidence
intervals that are narrow enough to recommend for or against the therapy. This concept
requires the clinician to think about the smallest amount of benefit that would justify
therapy. Consider a set of hypothetical trials. Fig. 1A displays the results of trial 1. The
uppermost point of the bell curve is the observed treatment effect (the point estimate), and
the tails of the bell curve represent the boundaries of the 95% confidence interval. For the
medical condition being investigated, assume that a 1% absolute risk reduction is the
smallest benefit that patients would consider to outweigh the downsides of therapy. Given the
information in Fig. 1A, would you recommend this treatment to your patients if the point
estimate represented the truth? What if the upper boundary of the confidence interval
represented the truth? Or the lower boundary? For all 3 of these questions, the answer is
yes, provided that 1% is in fact the smallest patient-important difference. Thus, the trial
is definitive and allows a strong inference about the treatment decision. In the case of
trial 2 (see Fig. 1B), would your patients choose to undergo the treatment if either the
point estimate or the upper boundary of the confidence interval represented the true effect?
What about the lower boundary? The answer regarding the lower boundary is no, because the
effect is less than the smallest difference that patients would consider large enough for
them to undergo the treatment.

[Fig. 1: Results of 4 hypothetical trials (panel A: trial 1; panel B: trials 1 and 2; panel
C: trials 3 and 4; horizontal axis: % absolute risk reduction, labelled “Treatment harms”
and “Treatment helps”). For the medical condition under investigation, an absolute risk
reduction of 1% (double vertical rule) is the smallest benefit that patients would consider
important enough to warrant undergoing treatment. In each case, the uppermost point of the
bell curve is the observed treatment effect (the point estimate), and the tails of the bell
curve represent the boundaries of the 95% confidence interval. See text for further
explanation.]

Although trial 2 shows a “positive” result (i.e., the confidence interval does not encompass
zero), the sample size was inadequate and the result remains compatible with risk
reductions below the minimal patient-important difference. When a study result is positive,
you can determine whether the sample size was adequate by checking the lower boundary
of the confidence interval, the smallest plausible treatment effect compatible with the
results. If this value is greater than the smallest difference your patients would consider
important, the sample size is adequate and the trial result definitive. However, if the lower
boundary falls below the smallest patient-important difference, leaving patients uncertain
as to whether taking the treatment is in their best interest, the trial is not definitive. The
sample size is inadequate, and further trials are required. What happens when the
confidence interval for the effect of a therapy includes zero (where zero means “no effect”
and hence a negative result)? For studies with negative results — those that do not exclude
a true treatment effect of zero — you must focus on the other end of the confidence interval,
that representing the largest plausible treatment effect consistent with the trial data. You
must consider whether the upper boundary of the confidence interval falls below the
smallest difference that patients might consider important. If so, the sample size is
adequate, and the trial is definitively negative (see trial 3 in Fig. 1C). Conversely, if the
upper boundary exceeds the smallest patient-important difference, then the trial is not
definitively negative, and more trials with larger sample sizes are needed (see trial 4 in Fig.
1C).

The bottom line

To determine whether a trial with a positive result is sufficiently large, clinicians should
focus on the lower boundary of the confidence interval and determine if it is greater than
the smallest treatment benefit that patients would consider important enough to warrant
taking the treatment. For studies with a negative result, clinicians should examine the upper
boundary of the confidence interval to determine if this value is lower than the smallest
treatment benefit that patients would consider important enough to warrant taking the
treatment. In either case, if the confidence interval overlaps the smallest treatment benefit
that is important to patients, then the study is not definitive and a larger study is needed.

Table 2: The 3/n rule to estimate the upper limit of the 95% confidence interval (CI) for
proportions with 0 in the numerator

n      Observed proportion   3/n      Upper limit of 95% CI
20     0/20                  3/20     0.15 or 15%
100    0/100                 3/100    0.03 or 3%
300    0/300                 3/300    0.01 or 1%
1000   0/1000                3/1000   0.003 or 0.3%

Tip 3: Estimating confidence intervals for extreme proportions

When reviewing journal
articles, readers often encounter proportions with small numerators or with numerators
very close in size to the denominators. Both situations raise the same issue. For example, an
article might assert that a treatment is safe because no serious complications occurred in
the 20 patients who received it; another might claim near-perfect sensitivity for a test that
correctly identified 29 out of 30 cases of a disease. However, in many cases such articles do
not present confidence intervals for these proportions. The first step of this tip is to learn
the “rule of 3” for zero numerators,7 and the next step is to learn an extension (which might
be called the “rule of 5, 7, 9 and 10”) for numerators of 1, 2, 3 and 4.8 Consider the following
example. Twenty people undergo surgery, and none suffer serious complications. Does this
result allow us to be confident that the true complication rate is very low, say less than 5%
(1 out of 20)? What about 10% (2 out of 20)? You will probably appreciate that if the true
complication rate were 5% (1 in 20), it wouldn’t be that unusual to observe no
complications in a sample of 20, but for increasingly higher true rates, the chance of
observing no complications in a sample of 20 gets increasingly smaller. What we are after is
the upper limit of a 95% confidence interval for the proportion 0/20. The following is a
simple rule for calculating this upper limit: if an event occurs 0 times in n subjects, the
upper boundary of the 95% confidence interval for the event rate is about 3/n (Table 2).
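
Where the 3/n figure comes from can be sketched numerically. If the true event rate is p,
the chance of seeing 0 events in n subjects is (1 − p)^n; setting that chance to 5% and
solving gives p = 1 − 0.05^(1/n), which is close to −ln(0.05)/n, or roughly 3/n. This is the
standard derivation (strictly, a one-sided 95% bound) and is supplied here as background
rather than taken from the article; the function names are illustrative.

```python
def exact_upper_limit(n, alpha=0.05):
    """Largest true rate p for which 0 events in n subjects still has
    probability alpha: solve (1 - p)**n = alpha for p."""
    return 1 - alpha ** (1 / n)

def rule_of_three(n):
    """Shortcut from the text: the upper limit is about 3/n."""
    return 3 / n

for n in (20, 100, 300, 1000):  # sample sizes used in Table 2
    print(f"n={n}: exact {exact_upper_limit(n):.4f}, 3/n {rule_of_three(n):.4f}")
```

The shortcut slightly overstates the exact value, and the two converge as n grows.
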
You can use the same formula when the observed proportion is 100%, by translating 100%
into its complement. For example, imagine that the authors of a study on a diagnostic test
report 100% sensitivity when the test is performed for 20 patients who have the disease.
That means that the test identified all 20 with the disease as positive and identified none as
falsely negative. You would like to know how low the sensitivity of the test could be, given
that it was 100% for a sample of 20 patients. Using the 3/n rule for the proportion of false
negatives (0 out of 20), we find that the proportion of false negatives could be as high as
15% (3 out of 20). Subtract this result from 100% to obtain the lower limit of the 95%
confidence interval for the sensitivity (in this example, 85%).

What if the numerator is not zero but is still very small? There is a shortcut rule for small
numerators other than zero (i.e., 1, 2, 3 or 4) (Table 3). For example, out of 20 people
receiving surgery imagine that 1 person suffers a serious complication, yielding an observed
proportion of 1/20 or 5%. Using the corresponding value from Table 3 (i.e., 5) and the sample
size, we find that the upper limit of the 95% confidence interval will be about 5/20 or 25%.
If 2 of the 20 (10%) had suffered complications, the upper limit would be about 7/20, or 35%.

Table 3: Method for obtaining an approximation of the upper limit of the 95% CI*

Observed numerator   Numerator for calculating approximate upper limit of 95% CI
0                    3
1                    5
2                    7
3                    9
4                    10

*For any observed numerator listed in the left hand column, divide the corresponding
numerator in the right hand column by the number of study subjects to get the approximate
upper limit of the 95% CI. For example, if the sample size is 15 and the observed numerator
is 3, the upper limit of the 95% confidence interval is approximately 9 ÷ 15 = 0.6 or 60%.

The bottom line

Although statisticians (and statistical software) can calculate 95% confidence intervals,
clinicians can readily estimate the upper boundary of confidence intervals for proportions
with very small numerators. These estimates highlight the greater precision attained with
larger sample sizes and help to calibrate intuitively derived confidence intervals.

Conclusions

Clinicians
need to understand and interpret confidence intervals to properly use research results in
making decisions. They can use thresholds, based on differences that patients are likely to
consider important, to interpret confidence intervals and to judge whether the results are
definitive or whether a larger study (with more patients and events) is necessary. For
proportions with extremely small numerators, a simple rule is available for estimating the
upper limit of the confidence interval. This article has been peer reviewed. From the
Department of Medicine, Mayo Clinic College of Medicine, Rochester, Minn. (Montori); the
Hospital Medicine Unit, Division of General Medicine, Emory University, Atlanta, Ga.
(Kleinbart); the Departments of Epidemiology and Biostatistics and of Pediatrics, University
of California, San Francisco, San Francisco, Calif. (Newman); Durham Veterans Affairs
Medical Center and Duke University Medical Center, Durham, NC (Keitz); the Columbia
University College of Physicians and Surgeons, New York, NY (Wyer); the Department of
Pediatrics, University of Texas, Houston, Tex. (Moyer); and the Departments of Medicine
and of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)
Competing interests: None declared. Contributors: Victor Montori, as principal author,
decided on the structure and flow of the article, and oversaw and contributed to the writing
of the manuscript. Jennifer Kleinbart reviewed the manuscript at all phases of development
and contributed to the writing of tip 1. Thomas Newman developed the original idea for tip
3 and reviewed the manuscript at all phases of development. Sheri Keitz used all of the tips
as part of a live teaching exercise and submitted comments, suggestions and the possible
variations that are described in the article. Peter Wyer reviewed and revised the final draft
of the manuscript to achieve uniform adherence with format specifications. Virginia Moyer
reviewed and revised the final draft of the manuscript to improve clarity and style. Gordon
Guyatt developed the original ideas for tips 1 and 2, reviewed the manuscript at all phases
of development, contributed to the writing as coauthor, and reviewed and revised the final
draft of the manuscript to achieve accuracy and consistency of content as general editor.

References

1. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of
   evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number
   needed to treat. CMAJ 2004;171(4):353-8.
2. Guyatt G, Jaeschke R, Cook D, Walter S. Therapy and understanding the results: hypothesis
   testing. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a
   manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 329-38.
3. Guyatt G, Walter S, Cook D, Jaeschke R. Therapy and understanding the results: confidence
   intervals. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a
   manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 339-49.
4. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and
   teaching evidence-based medicine: introduction to the series [editorial]. CMAJ
   2004;171(4):347-8.
5. Guyatt G, Montori V, Devereaux PJ, Schunemann H, Bhandari M. Patients at the center: in
   our practice, and in our use of language. ACP J Club 2004;140:A11-2.
6. Jaeschke R, Guyatt G, Barratt A, Walter S, Cook D, McAlister F, et al. Measures of
   association. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature: a
   manual of evidence-based clinical practice. Chicago: AMA Press; 2002. p. 351-68.
7. Hanley J, Lippman-Hand A. If nothing goes wrong, is everything all right? Interpreting
   zero numerators. JAMA 1983;249:1743-5.
8. Newman TB. If almost nothing goes wrong, is almost everything all right? [letter]. JAMA
   1995;274:1013.

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 212
305-6792; pwyer@worldnet.att.net

Members of the
Evidence-Based Medicine Teaching Tips Working Group: Peter C. Wyer (project director),
College of Physicians and Surgeons, Columbia University, New York, NY; Deborah Cook,
Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University,
Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia,
Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta,
Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical
Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of
Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY;
Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines;
Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory
University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste.
Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New
York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia
Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San
Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.;
Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright
State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa

Articles to date in this series

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of
evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number
needed to treat. CMAJ 2004;171(4):353-8.

Correction

In part 2 of the series “Tips for learners of evidence-based medicine”1 the information in
Fig. 1 did not fully correspond with the information provided in the text. Specifically, the
data for hypothetical trial 2 in Fig. 1B should have been centred at 5% absolute risk
reduction, as described in the text; instead, the figure showed trial 2 as being centred at
about 6.5% absolute risk reduction. The corrected figure is presented here.

[Fig. 1 (corrected): Results of 4 hypothetical trials (panel A: trial 1; panel B: trials 1
and 2; panel C: trials 3 and 4; horizontal axis: % absolute risk reduction, labelled
“Treatment harms” and “Treatment helps”). For the medical condition under investigation, an
absolute risk reduction of 1% (double vertical rule) is the smallest benefit that patients
would consider important enough to warrant undergoing treatment. In each case, the uppermost
point of the bell curve is the observed treatment effect (the point estimate), and the tails
of the bell curve represent the boundaries of the 95% confidence interval. See the text1 for
further explanation.]

Reference
1. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of
   evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ
   2004;171(6):611-5.

DOI:10.1503/cmaj.1041761

Online access to a for-profit CMAJ

Wayne Kondro, quoting CMA Secretary-General Bill Tholl, reports that “Physicians will
continue to receive their free subscription to CMAJ as a benefit of association membership
‘for the foreseeable future’” after CMA Publications is sold to CMA Holdings in January
2004.1 That’s all to the good, but what then of CMAJ’s worldwide readers? Will access to CMAJ
remain free for all online users, despite the shift to for-profit status? I found it strange
that this issue was not addressed in Kondro’s news article.

Adam L. Scheffler
Independent researcher
Chicago, Ill.

Reference
1. Kondro W. CMAJ enters for-profit market. CMAJ 2004;171(11):1334.

DOI:10.1503/cmaj.1041759

[Editor’s note] CMAJ’s editors have addressed the topic of open access in this issue’s
Editorial (see page 149).

DOI:10.1503/cmaj.1041760

CMAJ • JAN. 18, 2005; 172 (2)

Review

Tips for learners of evidence-based medicine: 3. Measures of observer variability (kappa
statistic)

Thomas McGinn, Peter C. Wyer, Thomas B. Newman, Sheri Keitz, Rosanne Leipzig, Gordon Guyatt,
for the Evidence-Based Medicine Teaching Tips Working Group

DOI:10.1503/cmaj.1031981

Imagine that
you’re a busy family physician and that you’ve found a rare free moment to scan the recent
literature. Reviewing your preferred digest of abstracts, you notice a study comparing
emergency physicians’ interpretation of chest radiographs with radiologists’
interpretations.1 The article catches your eye because you have frequently found that your
own reading of a radiograph differs from both the official radiologist reading and an
unofficial reading by a different radiologist, and you’ve wondered about the extent of this
disagreement and its implications. Looking at the abstract, you find that the authors have
reported the extent of agreement using the κ statistic. You recall that κ stands for “kappa”
and that you have encountered this measure of agreement before, but your grasp of its
meaning remains tentative. You therefore choose to take a quick glance at the authors’
conclusions as reported in the abstract and to defer downloading and reviewing the full text
of the article. Practitioners, such as the family physician just described, may benefit from
understanding measures of observer variability. For many studies in the medical literature,
clinician readers will be interested in the extent of agreement among multiple observers.
For example, do the investigators in a clinical study agree on the presence or absence of
physical, radiographic or laboratory findings? Do investigators involved in a systematic
overview agree on the validity of an article, or on whether the article should be included in
the analysis? In perusing these types of studies, where investigators are interested in
quantifying agreement, clinicians will often come across the kappa statistic. In this article
we present tips aimed at helping clinical learners to use the concepts of kappa when
applying diagnostic tests in practice. The tips presented here have been adapted from
approaches developed by educators experienced in teaching evidence-based medicine skills
to clinicians.2 A related article, intended for people who teach these concepts to clinicians,
is available online at www.cmaj.ca/cgi/content/full/171/11/1369/DC1.

Clinician learners’ objectives

Defining the importance of kappa
• Understand the difference between measuring agreement and measuring agreement beyond
  chance.
• Understand the implications of different values of kappa.

Calculating kappa
• Understand the basics of how the kappa score is calculated.
• Understand the importance of “chance agreement” in estimating kappa.

Calculating chance agreement
• Understand how to calculate the kappa score given different distributions of positive and
  negative results.
• Understand that the more extreme the distributions of positive and negative results, the
  greater the agreement that will occur by chance alone.
• Understand how to calculate chance agreement, agreement beyond chance and kappa for any set
  of assessments by 2 observers.

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article
online at www.cmaj.ca/cgi/content/full/171/11/1369/DC1. It contains the exercises found in
this article in fill-in-the-blank format, commentaries from the authors on the challenges
they encounter when teaching these concepts to clinician learners and links to useful online
resources.

CMAJ • NOV. 23, 2004; 171 (11) © 2004 Canadian Medical Association or its licensors

Tip 1: Defining the importance of kappa

A common stumbling block for clinicians is the basic concept of agreement beyond chance and,
in turn, the importance of correcting for chance agreement. People making a decision on the
basis of presence or absence of an element of the physical examination, such as Murphy’s
sign, will sometimes agree simply by chance. The kappa statistic corrects for this chance
agreement and tells us how much of the possible agreement over and above chance the reviewers
have achieved. A simple example should help to clarify the importance of correcting for
chance agreement. Two radiologists independently read the same 100 mammograms. Reader 1 is
having a bad day and reads all the films as negative without looking at them in great detail.
Reader 2 reads the films more carefully and identifies 4 of the 100 mammograms
as positive (suspicious for malignancy). How would you characterize the level of agreement
between these 2 radiologists? The percent agreement between them is 96%, even though
one of the readers has, on cursory review, decided to call all of the results negative. Hence,
measuring the simple percent agreement overestimates the degree of clinically important
agreement in a fashion that is misleading. The role of kappa is to indicate how much the 2
observers agree beyond the level of agreement that could be expected by chance. Table 1
presents a rating system that is commonly used as a guideline for evaluating kappa scores.
Purely to illustrate the range of kappa scores that readers can expect to encounter, Table 2
gives some examples of commonly reported assessments and the kappa scores that resulted
when investigators studied their reproducibility. The bottom line If clinicians neglect the
possibility of chance agreement, they will come to misleading conclusions about the
reproducibility of clinical tests. The kappa statistic allows us to measure agreement above
and beyond that expected by chance alone. Examples of kappa scores for frequently ordered
tests sometimes show surprisingly poor levels of agreement beyond chance.

Table 1: Qualitative classification of kappa values as degree of agreement beyond chance
(ref. 3)

Kappa value    Degree of agreement beyond chance
0              None
0–0.2          Slight
0.2–0.4        Fair
0.4–0.6        Moderate
0.6–0.8        Substantial
0.8–1.0        Almost perfect

Table 2: Representative kappa values for common tests and clinical assessments

Assessment                                                        Kappa value   Ref.
Interpretation of T wave changes on an exercise stress test       0.25          4
Presence of jugular venous distension                             0.56          5
Detection of alcohol dependence using CAGE questionnaire          0.75          6
Presence of goitre                                                0.82–0.95     7
Bone marrow interpretation by hematologist                        0.84          8
Straight leg raising test                                         0.82          9
Diagnosis of pulmonary embolus by helical CT                      0.82          10
Diagnosis of lower extremity arterial disease by arteriography    0.39–0.64     11

Tip 2: Calculating kappa

What is the maximum potential for agreement between 2 observers doing a clinical
assessment, such as presence or absence of Murphy’s sign in patients with abdominal pain?
In Fig. 1, the upper horizontal bar represents 100% agreement between 2 observers. For
the hypothetical situation represented in the figure, the estimated chance agreement
between the 2 observers is 50%. This would occur if, for example, each of the 2 observers
randomly called half of the assessments positive. Given this information, what is the
possible agreement beyond chance? The vertical line in Fig. 1 intersects the horizontal bars
at the 50% point that we identified as the expected agreement by chance. All agreement to
the right of this line corresponds to agreement beyond chance. Hence the maximum
agreement beyond chance is 50% (100% – 50%). The other number you need to calculate
the kappa score is the degree of agreement beyond chance. The observed agreement, as
shown by the lower horizontal bar in Fig. 1, is 75%, so the degree of agreement beyond
chance is 25% (75% – 50%). Kappa is calculated as the observed agreement beyond chance
(25%) divided by the maximum agreement beyond chance (50%); here, kappa is 0.50.
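
The arithmetic in this tip can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original article; the function name kappa_from_agreement is our own, and both inputs are assumed to be percentages, as in Fig. 1.

```python
def kappa_from_agreement(observed_pct, chance_pct):
    """Kappa from observed agreement and chance agreement, both in percent.

    Hypothetical helper for illustration: it divides the observed agreement
    beyond chance by the maximum possible agreement beyond chance.
    """
    if not 0 <= observed_pct <= 100:
        raise ValueError("observed agreement must be between 0% and 100%")
    if not 0 <= chance_pct < 100:
        raise ValueError("chance agreement must be at least 0% and below 100%")
    observed_beyond_chance = observed_pct - chance_pct  # 75% - 50% = 25%
    maximum_beyond_chance = 100 - chance_pct            # 100% - 50% = 50%
    return observed_beyond_chance / maximum_beyond_chance


# Fig. 1 example: 75% observed agreement, 50% agreement expected by chance.
print(kappa_from_agreement(75, 50))  # 25 / 50 = 0.5
```

If the observed agreement is lower than the chance agreement, the same calculation gives a negative value, meaning that the observers agree less often than chance alone would predict.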
[Fig. 1 bar labels: agreement expected by chance, 50%; observed agreement, 75%; observed
agreement above chance, 25%; possible agreement above chance, 50%; kappa = 25/50 = 0.5
(moderate agreement).]

Fig. 1: Two observers independently assess the presence or absence of a finding or outcome.
Each observer determines that the finding is present in exactly 50% of the subjects. Their
assessments agree in 75% of the cases. The yellow horizontal bar represents potential
agreement (100%), and the turquoise bar represents actual agreement. The portion of each
coloured bar that lies to the left of the dotted vertical line represents the agreement
expected by chance (50%). The observed agreement above chance is half of the possible
agreement above chance. The ratio of these 2 numbers is the kappa score.

Tips for EBM learners: kappa statistic

The bottom line

Kappa allows us to measure agreement above and beyond that expected by chance alone. We
calculate kappa by estimating the chance agreement and then comparing the observed agreement
beyond chance with the maximum possible agreement beyond chance.

Tip 3: Calculating chance agreement

A conceptual understanding of kappa may still leave the actual calculations a
mystery. The following example is intended for those who desire a more complete
understanding of the kappa statistic. Let us assume that 2 hopeless clinicians are assessing
the presence of Murphy’s sign in a group of patients. They have no idea what they are doing,
and their evaluations are no better than blind guesses. Let us say they are each guessing the
presence and absence of Murphy’s sign in a 50:50 ratio: half the time they guess that
Murphy’s sign is present, and the other half that it is absent. If you were completing a 2 × 2
table, with these 2 clinicians evaluating the same 100 patients, how would the cells, on
average, get filled in? Fig. 2 represents the completed 2 × 2 table. Guessing at random, the 2
hopeless clinicians have agreed on the assessments of 50% of the patients. How did we
arrive at the numbers shown in the table? According to the laws of chance, each clinician
guesses that half of the 50 patients assessed as positive by the other clinician (i.e., 25
patients) have Murphy’s sign. How would this exercise work if the same 2 hopeless
clinicians were to randomly guess that 60% of the patients had a positive result for
Murphy’s sign? Fig. 3 provides the answer in this situation. The clinicians would agree for
52 of the 100 patients (or 52% of the time) and would disagree for 48 of the patients. In a
similar way, using 2 × 2 tables for higher and higher positive proportions (i.e., how often
the observer makes the diagnosis), you
can figure out how often the observers will, on average, agree by chance alone (as
delineated in Table 3). At this point, we have demonstrated 2 things. First, even if the
reviewers have no idea what they are doing, there will be substantial agreement by chance
alone. Second, the magnitude of the agreement by chance increases as the proportion of
positive (or negative) assessments increases. But how can we calculate kappa when the
clinicians whose assessments are being compared are no longer “hopeless,” in other words,
when their assessments reflect a level of expertise that one might actually encounter in
practice? It’s not very hard. Let’s take a simple example, returning to the premise that each
of the 2 clinicians assesses Murphy’s sign as being present in 50% of the patients. Here, we
assume that the 2 clinicians now have some knowledge of Murphy’s sign and their
assessments are no longer random. Each decides that 50% of the patients have Murphy’s
sign and 50% do not, but they still don’t agree on every patient. Rather, for 40 patients they
agree that Murphy’s sign is present, and for 40 patients they agree that Murphy’s sign is
absent. Thus, they agree on the diagnosis for 80% of the patients, and they disagree for 20%
of the patients (see Fig. 4A). How do we calculate the kappa score in this situation? Recall
that if each clinician found that 50% of the patients had Murphy’s sign but their decision
about the presence of the sign in each patient was random, the clinicians would be in
agreement 50% of the time, each cell of the 2 × 2 table would have 25 patients (as shown in
Fig. 2), chance agreement would be 50%, and maximum agreement beyond chance would also be
50%. The no-longer-hopeless clinicians’ agreement on 80% of the patients is therefore 30%
above chance. Kappa is a comparison of the observed agreement above chance with the maximum
agreement above chance: 30%/50% = 60% of the possible agreement above chance, which gives
these clinicians a kappa of 0.6, as shown in Fig. 4B.

                  Clinician 2
Clinician 1       Sign present   Sign absent   Total
Sign present           25             25         50
Sign absent            25             25         50
Total                  50             50

Fig. 2: Agreement table for 2 hopeless clinicians who randomly guess whether Murphy’s sign
is present or absent in 100 patients with abdominal pain. Each clinician determines that
half of the patients have a positive result. The numbers in each box reflect the number of
patients in each agreement category.

                  Clinician 2
Clinician 1       Sign present   Sign absent   Total
Sign present           36             24         60
Sign absent            24             16         40
Total                  60             40

Fig. 3: As in Fig. 2, the 2 clinicians again guess at random whether Murphy’s sign is
present or absent. However, each clinician now guesses that the sign is present in 60 of
the 100 patients. Under these circumstances, of the 60 patients for whom clinician 1 guesses
that the sign is present, clinician 2 guesses that it is present in 60%; 60% of 60 is 36
patients. Of the 60 patients for whom clinician 1 guesses that the sign is present,
clinician 2 guesses that it is absent in 40%; 40% of 60 is 24 patients. Of the 40 patients
for whom clinician 1 guesses that the sign is absent, clinician 2 guesses that it is present
in 60%; 60% of 40 is 24 patients. Of the 40 patients for whom clinician 1 guesses that the
sign is absent, clinician 2 guesses that it is absent in 40%; 40% of 40 is 16 patients.

Table 3: Chance agreement when 2 observers randomly assign positive and negative results,
for successively higher rates of a positive call

Proportion positive (%)    Agreement by chance (%)
50                         50
60                         52
70                         58
80                         68
90                         82

Fig. 4, panel A:

                  Clinician 2
Clinician 1       Sign present   Sign absent
Sign present           40             10
Sign absent            10             40

The bottom line

Chance agreement is not always 50%; rather, it varies from one clinical situation to another.
When the prevalence of a disease or outcome is low, 2 observers will guess that most
patients are normal and the symptom of the disease is absent. This situation will lead to a
high percentage of agreement simply by chance. When the prevalence is high, there will also
be high apparent agreement, with most patients judged to exhibit the symptom. Kappa
measures the agreement after correcting for this variable degree of chance agreement.
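
The dependence of chance agreement on the proportion of positive calls, and the calculation of kappa from a 2 × 2 table, can be sketched in Python as follows. This is an illustration under stated assumptions, not code from the article: the function names are our own, and chance agreement is computed from each observer's proportion of positive calls on the assumption that the 2 observers rate independently, as in Figs. 2 and 3.

```python
def chance_agreement(p1, p2):
    """Expected agreement if 2 observers make positive calls independently.

    p1, p2: each observer's proportion of positive calls (0 to 1).
    They agree by chance when both say positive (p1 * p2) or both say
    negative ((1 - p1) * (1 - p2)).
    """
    return p1 * p2 + (1 - p1) * (1 - p2)


def kappa_from_table(both_pos, only_1_pos, only_2_pos, both_neg):
    """Kappa from the 4 cells (patient counts) of a 2 x 2 agreement table."""
    n = both_pos + only_1_pos + only_2_pos + both_neg
    observed = (both_pos + both_neg) / n
    p1 = (both_pos + only_1_pos) / n  # proportion called positive by observer 1
    p2 = (both_pos + only_2_pos) / n  # proportion called positive by observer 2
    chance = chance_agreement(p1, p2)
    if chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return (observed - chance) / (1 - chance)


# Table 3: both observers guess "positive" at the same rate.
for pct in (50, 60, 70, 80, 90):
    p = pct / 100
    print(pct, round(100 * chance_agreement(p, p)))

# Fig. 4: agreement on 40 "present" and 40 "absent", 10 + 10 disagreements.
print(round(kappa_from_table(40, 10, 10, 40), 2))

# Mammogram example: reader 1 calls all 100 films negative, reader 2 calls
# 4 positive. Agreement is 96%, but all of it is expected by chance.
print(round(kappa_from_table(0, 0, 4, 96), 2))
```

The loop reproduces the chance agreement values of Table 3 (50%, 52%, 58%, 68% and 82%), and the Fig. 4 table gives a kappa of 0.6. For the 2 radiologists, the 96% agreement is entirely accounted for by chance, so kappa is 0.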
Fig. 4, panel B (numbers in parentheses are those expected by chance):

                  Clinician 2
Clinician 1       Sign present   Sign absent   Total
Sign present         40 (25)        10 (25)      50
Sign absent          10 (25)        40 (25)      50
Total                  50             50

κ = (observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)
  = (80% – 50%) ÷ (100% – 50%) = 30% ÷ 50% = 0.6

Fig. 4: Two clinicians who have been trained to assess Murphy’s sign in patients with
abdominal pain do an actual assessment on 100 patients. A: A 2 × 2 table reflecting actual
agreement between the 2 clinicians. B: A 2 × 2 table illustrating the correct approach to
determining the kappa score. The numbers in parentheses correspond to the results that would
be expected were each clinician randomly guessing that half of the patients had a positive
result (as in Fig. 2).

Formula for calculating kappa

(Observed agreement – agreement expected by chance) ÷ (100% – agreement expected by chance)

Another way of expressing this formula:

(Observed agreement beyond chance) ÷ (maximum possible agreement beyond chance)

Hence, to calculate kappa when only 2 alternatives are possible (e.g., presence or absence
of a finding), you need just 2 numbers: the percentage of patients that the 2 assessors
agreed on and the expected agreement by chance. Both can be determined by constructing a
2 × 2 table exactly as illustrated above.

Conclusions

Armed with this understanding of kappa as a measure of
agreement between different observers, you are able to return to the study of agreement in
chest radiography interpretations between emergency physicians and radiologists1 in a
more informed fashion. You learn from the abstract that the kappa score for overall
agreement between the 2 classes of practitioners was 0.40, with a 95% confidence interval
ranging from 0.35 to 0.46. This means that the agreement between emergency physicians
and radiologists represented 40% of the potentially achievable agreement beyond chance.
You understand that this kappa score would be conventionally considered to represent fair
to moderate agreement but is inferior to many of the kappa values listed in Table 2. You are
now much more confident about going to the full text of the article to review the methods
and assess the clinical applicability of the results to your own patients. The ability to
understand measures of variability in data presented in clinical trials and systematic
reviews is an important skill for clinicians. We have presented a series of tips developed and
used by experienced teachers of evidence-based medicine for the purpose of facilitating
such understanding.

This article has been peer reviewed.

From the Department of Medicine, Division of General Internal Medicine (McGinn), and the
Department of Geriatrics (Leipzig), Mount Sinai Medical Center, New York, NY; the Columbia
University College of Physicians and Surgeons, New York, NY (Wyer); the Departments of
Epidemiology and Biostatistics and of Pediatrics, University of California, San Francisco,
San Francisco, Calif. (Newman); Durham Veterans Affairs Medical Center and Duke University
Medical Center, Durham, NC (Keitz); and the Departments of Medicine and of Clinical
Epidemiology and Biostatistics, McMaster University, Hamilton, Ont. (Guyatt)

Competing interests: None declared.

Contributors: Thomas McGinn developed
the original idea for tips 1 and 2 and, as principal author, oversaw and contributed to the
writing of the manuscript. Thomas Newman and Rosanne Leipzig reviewed the manuscript
at all phases of development and contributed to the writing as coauthors. Sheri Keitz used
all of the tips as part of a live teaching exercise and submitted comments, suggestions and
the possible variations that are described in the article. Peter Wyer reviewed and revised
the final draft of the manuscript to achieve uniform adherence with format specifications.
Gordon Guyatt developed the original idea for tip 3, reviewed the manuscript at all phases
of development, contributed to the writing as a coauthor, and, as general editor, reviewed
and revised the final draft of the manuscript to achieve accuracy and consistency of content.
References

1. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs in the
   emergency department: Is the radiologist really necessary? Postgrad Med J 2003;79:214-7.
2. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and
   teaching evidence-based medicine: introduction to the series [editorial]. CMAJ
   2004;171(4):347-8.
3. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J
   Epidemiol 1987;126:161-9.
4. Blackburn H. The exercise electrocardiogram: differences in interpretation. Report of a
   technical group on exercise electrocardiography. Am J Cardiol 1968;21:871-80.
5. Cook DJ. Clinical assessment of central venous pressure in the critically ill. Am J Med
   Sci 1990;299:175-8.
6. Aertgeerts B, Buntinx F, Fevery J, Ansoms S. Is there a difference between CAGE
   interviews and written CAGE questionnaires? Alcohol Clin Exp Res 2000;24:733-6.
7. Kilpatrick R, Milne JS, Rushbrooke M, Wilson ESB. A survey of thyroid enlargement in two
   general practices in Great Britain. BMJ 1963;1:29-34.
8. Guyatt GH, Patterson C, Ali M, Singer J, Levine M, Turpie I, et al. Diagnosis of
   iron-deficiency anemia in the elderly. Am J Med 1990;88:205-9.
9. McCombe PF, Fairbank JC, Cockersole BC, Pynsent PB. 1989 Volvo Award in clinical
   sciences. Reproducibility of physical signs in low-back pain. Spine 1989;14:908-18.
10. Perrier A, Howarth N, Didier D, Loubeyre P, Unger PF, de Moerloose P, et al. Performance
    of helical computed tomography in unselected outpatients with suspected pulmonary
    embolism. Ann Intern Med 2001;135:88-97.
11. Koelemay MJ, Legemate DA, Reekers JA, Koedam NA, Balm R, Jacobs MJ. Interobserver
    variation in interpretation of arteriography and management of severe lower leg arterial
    disease. Eur J Vasc Endovasc Surg 2001;21:417-22.

Correspondence to: Dr. Peter C. Wyer, 446 Pelhamdale Ave., Pelham NY 10803, USA; fax 914
738-9368; pwyer@att.net

Members
of the Evidence-Based Medicine Teaching Tips Working Group: Peter C. Wyer (project
director), College of Physicians and Surgeons, Columbia University, New York, NY; Deborah
Cook, Gordon Guyatt (general editor), Ted Haines, Roman Jaeschke, McMaster University,
Hamilton, Ont.; Rose Hatala (internal review coordinator), University of British Columbia,
Vancouver, BC; Robert Hayward (editor, online version), Bruce Fisher, University of Alberta,
Edmonton, Alta.; Sheri Keitz (field test coordinator), Durham Veterans Affairs Medical
Center and Duke University Medical Center, Durham, NC; Alexandra Barratt, University of
Sydney, Sydney, Australia; Pamela Charney, Albert Einstein College of Medicine, Bronx, NY;
Antonio L. Dans, University of the Philippines College of Medicine, Manila, The Philippines;
Barnet Eskin, Morristown Memorial Hospital, Morristown, NJ; Jennifer Kleinbart, Emory
University School of Medicine, Atlanta, Ga.; Hui Lee, formerly Group Health Centre, Sault Ste.
Marie, Ont. (deceased); Rosanne Leipzig, Thomas McGinn, Mount Sinai Medical Center, New
York, NY; Victor M. Montori, Mayo Clinic College of Medicine, Rochester, Minn.; Virginia
Moyer, University of Texas, Houston, Tex.; Thomas B. Newman, University of California, San
Francisco, San Francisco, Calif.; Jim Nishikawa, University of Ottawa, Ottawa, Ont.;
Kameshwar Prasad, Arabian Gulf University, Manama, Bahrain; W. Scott Richardson, Wright
State University, Dayton, Ohio; Mark C. Wilson, University of Iowa, Iowa City, Iowa

Articles to date in this series

Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of
evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number
needed to treat. CMAJ 2004;171(4):353-8.

Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of
evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ
2004;171(6):611-5.

Review

Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies
in systematic reviews and whether to combine their results

Rose Hatala, Sheri Keitz, Peter Wyer, Gordon Guyatt, for the Evidence-Based Medicine
Teaching Tips Working Group

DOI:10.1503/cmaj.1031920

Clinicians wishing to
quickly answer a clinical question may seek a systematic review, rather than searching for
primary articles. Such a review is also called a meta-analysis when the investigators have
used statistical techniques to combine results across studies. Databases useful for this
purpose include the Cochrane Library (www.thecochranelibrary.com) and the ACP Journal
Club (www.acpjc.org; use the search term “review”), both of which are available through
personal or institutional subscription. Clinicians can use systematic reviews to guide clinical
practice if they are able to understand and interpret the results. Systematic reviews differ
from traditional reviews in that they are usually confined to a single focused question,
which serves as the basis for systematic searching, selection and critical evaluation of the
relevant research.1 Authors of systematic reviews use explicit methods to minimize bias
and consider using statistical techniques to combine the results of individual studies. When
appropriate, such pooling allows a more precise estimate of the magnitude of benefit or
harm of a therapy. It may also increase the applicability of the result to a broader range of
patient populations. Clinicians encountering a meta-analysis frequently find the pooling
process mysterious. Specifically, they wonder how authors decide whether the ranges of
patients, interventions and outcomes are too broad to sensibly pool the results of the
primary studies. In this article we present an approach to evaluating potentially important
differences in the results of individual studies being considered for a meta-analysis. These
differences are frequently referred to as heterogeneity.1 Our discussion focuses on the
qualitative, rather than the statistical, assessment of heterogeneity (see Box 1). Two
concepts are commonly implied in the assessment of heterogeneity. The first is an
assessment for heterogeneity within 4 key elements of the design of the original studies: the
patients, interventions, outcomes and methods. This assessment bears on the question of
whether pooling the results is at all sensible. The second concept relates to assessing
heterogeneity among the results of the original studies. Even if the study designs are
similar, the researchers must decide whether it is useful to combine the primary studies’
results. Our discussion assumes a basic familiarity with how investigators present the
magnitude2,3 and precision4 of treatment effects in individual randomized trials. The tips
in this article are adapted from approaches developed by educators with experience in
teaching evidence-based medicine skills to clinicians.1,5,6 A related article, intended for
people who teach these concepts to clinicians, is available online at
www.cmaj.ca/cgi/content/full/172/5/661/DC1.

Clinician learners’ objectives

Qualitative assessment of the design of primary studies
• Understand the concepts of heterogeneity of study design among the individual studies
  included in a systematic review.

Qualitative assessment of the results of primary studies
• Understand how to qualitatively determine the appropriateness of pooling estimates of
  effect from the individual studies by assessing (1) the degree of overlap of the
  confidence intervals around these point estimates of effect and (2) the disparity between
  the point estimates themselves.
• Understand how to estimate the “true” value of the estimate of effect from a graphic
  display of the results of individual studies.

Teachers of evidence-based medicine: See the “Tips for teachers” version of this article
online at www.cmaj.ca/cgi/content/full/172/5/661/DC1. It contains the exercises found in
this article in fill-in-the-blank format, commentaries from the authors on the challenges
they encounter when teaching these concepts to clinician learners and links to useful online
resources.

Hatala et al. CMAJ • MAR. 1, 2005; 172 (5): 661. © 2005 CMA Media Inc. or its licensors

Box 1: Statistical assessments of heterogeneity

Meta-analysts
typically use 2 statistical approaches to evaluate the extent of variability in results between
studies: Cochran’s Q test and the I² statistic.

Cochran’s Q test
• Cochran’s Q test is the traditional test for heterogeneity. It begins with the null
  hypothesis that all of the apparent variability is due to chance. That is, the true
  underlying magnitude of effect (whether measured with a relative risk, an odds ratio or a
  risk difference) is the same across studies.
• The test then generates a probability, based on a χ² distribution, that differences in
  results between studies as extreme as or more extreme than those observed could occur
  simply by chance.
• If the p value is low (say, less than 0.1) investigators should look hard for possible
  explanations of variability in results between studies (including differences in patients,
  interventions, measurement of outcomes and study design).
• As the p value gets very low (less than 0.01) we may be increasingly uncomfortable about
  using single best estimates of treatment effects.
• The traditional test for heterogeneity is limited, in that it may be underpowered (when
  studies have included few patients it may be difficult to reject the null hypothesis even
  if it is false) or overpowered (when sample sizes are very large, small and unimportant
  differences in magnitude of effect may nevertheless generate low p values).

I² statistic
• The I² statistic, the second approach to measuring heterogeneity, attempts to deal with
  potential underpowering or overpowering. I² provides an estimate of the percentage of
  variability in results across studies that is likely due to true differences in treatment
  effect, as opposed to chance.
• When I² is 0%, chance provides a satisfactory explanation for the variability we have
  observed, and we are more likely to be comfortable with a single pooled estimate of
  treatment effect.
• As I² increases, we get increasingly uncomfortable with a single pooled estimate, and the
  need to look for explanations of variability other than chance becomes more compelling.
• For example, one rule of thumb characterizes I² of less than 0.25 (25%) as low
  heterogeneity, 0.25 to 0.5 (25% to 50%) as moderate heterogeneity and over 0.5 (50%) as
  high heterogeneity.

Tip 1: Qualitative assessment of the design of primary studies

Consider the following 3 hypothetical systematic reviews. For
which of these systematic reviews does it make sense to combine the primary studies?

• A systematic review of all therapies for all types of cancer, intended to generate a
  single estimate of the impact of these therapies on mortality.
• A systematic review that examines the effect of different antibiotics, such as
  tetracyclines, penicillins and chloramphenicol, on improvement in peak expiratory flow
  rates and days of illness in patients with acute exacerbation of obstructive lung disease,
  including chronic bronchitis and emphysema.7
• A systematic review of the effectiveness of tissue plasminogen activator (tPA) compared
  with no treatment or placebo in reducing mortality among patients with acute myocardial
  infarction.8

Most clinicians would instinctively reject the first of these proposed reviews as
overly broad but would be comfortable with the idea of combining the results of trials
relevant to the third question. What about the second review? What aspects of the primary
studies must be similar to justify combining their results in this systematic review? Table 1
lists features that would be relevant to the question considered in the second review and
categorizes them according to the 4 key elements of study design: the patients,
interventions, outcomes and methods of the primary studies. Combining results is
appropriate when the biology is such that across the range of patients, interventions,
outcomes and study methods, one can anticipate more or less the same magnitude of
treatment effect. In other words, the judgement as to whether the primary studies are
similar enough to be combined in a systematic review is based on whether the underlying
pathophysiology would predict a similar treatment effect across the range of patients,
interventions, outcomes and study methods of the primary studies.

Table 1: Relevant features of study design to be considered when deciding whether to pool
studies in a systematic review (for a review examining the effect of antibiotics in patients
with obstructive lung disease)

Patients: patient age; patient sex; type of lung disease (e.g., emphysema, chronic
  bronchitis)
Interventions: same antibiotic in all studies; same class of antibiotic in all studies;
  comparison of antibiotic with placebo; comparison of one antibiotic with another
Outcomes: death; peak expiratory flow; forced expiratory volume in the first second
Study methods: all randomized trials; only blinded randomized trials; cohort studies

If you think back to the first systematic review (all therapies for all cancers), you
probably recognize that there is significant variability in the pathophysiology of different
cancers (“patients” in Table 1) and in
the mechanisms of action of different cancer therapies (“interventions” in Table 1). If you
were inclined to reject pooling the results of the studies to be considered in the second
systematic review, you might have reasoned that we would expect substantially different
effects with different antibiotics, different infecting agents or different underlying lung
pathology. If you were inclined to accept pooling of results in this review, you might argue
that the antibiotics used in the different studies are all effective against the most common
organisms underlying pulmonary exacerbations. You might also assert that the biology of an
acute exacerbation of an obstructive lung disease (e.g., inflammation) is similar, despite
variability in the underlying pathology. In other words, we would expect more or less the
same effect across agents and across patients. Finally, you probably accepted the validity of
pooling results for the third systematic review — tPA for myocardial infarction — because
you consider that the mechanism of myocardial infarction is relatively constant across a
broad range of patients.

The bottom line

• Similarity in the aspects of primary study design outlined in Table 1 (patients,
  interventions, outcomes, study methods) guides the decision as to whether it makes sense
  to combine the results of primary studies in a systematic review.
• The range of characteristics of the primary studies across which it is sensible to combine
  results is a matter of judgment based on the researcher’s understanding of the underlying
  biology of the disease.

Tip 2: Qualitative assessment of the results of primary studies

You should now understand that combining the results of different studies is sensible only
when we expect more or less the same magnitude of treatment effects across the range of
patients, interventions and outcomes that the investigators have included in their
systematic review. However, even when we are confident of the similarity in design among the
individual studies, we may still wonder whether the results of the studies should be pooled.
The following graphic demonstration shows how to qualitatively assess the results of the
primary studies to decide if meta-analysis (i.e., statistical pooling) is appropriate. You
can find discussions of quantitative, or statistical, approaches to the assessment of
heterogeneity elsewhere (see Box 1 or Higgins and associates9).

Consider the results of the studies in 2 hypothetical systematic reviews (Fig. 1A and
Fig. 1B). The central vertical line, labelled “no difference,” represents a treatment effect
of 0. This would be equivalent to a risk ratio or relative risk of 1 or an absolute or
relative risk reduction of 0 (ref. 2). Values to the left of the “no difference” line
indicate that the treatment is superior to the control, whereas those to the right of the
line indicate that the control is superior to the treatment. For each of the 4 studies
represented in the figures, the dot represents the point estimate of the treatment effect
(the value observed in the study), and the horizontal line represents the confidence
interval around that observed effect. For which systematic review does it make sense to
combine results? Decide on the answer to this question before you read on.

[Fig. 1, panels A and B: horizontal axis labelled “Favours new treatment”, “No difference”,
“Favours control”.]

Fig. 1: Results of the studies in 2 hypothetical systematic reviews. The central vertical
line represents a treatment effect of 0. Values to the left of this line indicate that the
treatment is superior to the control, whereas those to the right of the line indicate that
the control is superior to the treatment. For each of the 4 studies in each figure, the dot
represents the point estimate of the treatment effect (the value observed in the study), and
the horizontal line represents the confidence interval around that observed effect.

[Fig. 2: horizontal axis labelled “Favours new treatment”, “No difference”, “Favours
control”.]

Fig. 2: Point estimates and confidence intervals for 4 studies. Two of the point estimates
favour the new treatment, and the other 2 point estimates favour the control. Investigators
doing a systematic review with these 4 studies would be satisfied that it is appropriate to
pool the results.

[Fig. 3: horizontal axis labelled “Favours new treatment”, “No difference”, “Favours
control”; bottom row labelled “Pooled estimate of underlying effect”.]

Fig. 3: Results of the hypothetical systematic review presented in Fig. 1B. The pooled
estimate at the bottom of the chart (large diamond) provides the best guess as to the
underlying treatment effect. It is centred on the midpoint of the area of overlap of the
confidence intervals around the estimates of the individual trials.

You have probably concluded that pooling is appropriate for the studies represented in
Fig. 1B but not for those represented in Fig. 1A. Can you explain why? Is it because the
point estimates for the studies in Fig. 1A lie on opposite sides of the “no difference”
line, whereas those for the
studies in Fig. 1B lie on the same side of the “no difference” line? Before you answer this
question, consider the studies represented in Fig. 2. Here, the point estimates of 2 studies
are on the “favours new treatment” side of the “no difference” line, and the point estimates
of 2 other studies are on the “favours control” side. However, all 4 point estimates are very
close to the “no difference” line, and, in this case, investigators doing a systematic review
will be satisfied that it is appropriate to pool the results. Therefore, it is not the position of
the point estimates relative to the “no difference” line that determines the appropriateness
of pooling. There are 2 criteria for not combining the results of studies in a meta-analysis:
highly disparate point estimates and confidence intervals with little overlap, both of which
are exemplified by Fig. 1A. When pooling is appropriate on the basis of these criteria, where
is the best estimate of the underlying magnitude of effect likely to be? Look again at Fig. 1B
and make a guess. Now look at Fig. 3. The pooled estimate at the bottom of Fig. 3 is centred
on the midpoint of the area of overlap of the confidence intervals around the estimates of
the individual trials. It provides our best guess as to the underlying treatment effect. Of
course, we cannot actually know the “truth” and must be content with potentially
misleading estimates. The intent of a meta-analysis is to include enough studies to narrow
the confidence interval around the resulting pooled estimate sufficiently to provide
estimates of benefit for our patients in which we can be confident. Thus, our best estimate
of the truth will lie in the area of overlap among the confidence intervals around the point
estimates of treatment effect presented in the primary studies. What is the clinician to do
when presented with results such as those in Fig. 1A? If the investigators have done a good
job of planning and executing the meta-analysis, they will provide some assistance [6]. Before
examining the study results in detail, they will have generated a priori hypotheses to
explain the heterogeneity in magnitude of effect across studies that they are liable to
encounter. These hypotheses will include differences in patients (effects may be larger in
sicker patients), in interventions (larger doses may result in larger effects), in outcomes
(longer follow-up may diminish the magnitude of effect) and in study design
(methodologically weaker studies may generate larger effects). The investigators will then
have examined the extent to which these hypotheses can explain the differences in
magnitude of effect across studies. These subgroup analyses may be misleading, but if they
meet 7 criteria suggested elsewhere [10] (see Box 2), they may provide credible and satisfying
explanations for the variability in results.

The bottom line
• Readers can decide for themselves whether there is clinically important heterogeneity among
the results of primary studies through a qualitative assessment of the graphic results. This
assessment is based on the amount

Box 2: Questions to ask when evaluating a subgroup analysis [10] in a …
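
The visual reasoning described for Fig. 1 and Fig. 3 (look for overlap among the confidence intervals, then place the best guess at the midpoint of the shared region) can be sketched in a few lines of Python. This is only an illustration of that qualitative heuristic, not the method a real meta-analysis uses to compute a pooled estimate, which typically weights each study (for example, by the inverse of its variance). The function names and the interval values below are hypothetical; they are not taken from the figures.

```python
from typing import List, Optional, Tuple

# A confidence interval as (lower bound, upper bound) of the treatment effect.
Interval = Tuple[float, float]


def common_overlap(intervals: List[Interval]) -> Optional[Interval]:
    """Return the region shared by every interval, or None if no such region exists."""
    lower = max(lo for lo, _ in intervals)
    upper = min(hi for _, hi in intervals)
    return (lower, upper) if lower <= upper else None


def overlap_midpoint(intervals: List[Interval]) -> Optional[float]:
    """Midpoint of the shared region: a rough visual guess at the underlying effect."""
    region = common_overlap(intervals)
    if region is None:
        return None
    return (region[0] + region[1]) / 2


# Hypothetical data: negative values favour the new treatment.
# Four studies with similar estimates and widely overlapping intervals
# (the situation described for Fig. 1B).
studies_b = [(-0.75, -0.25), (-0.5, -0.125), (-0.875, -0.25), (-0.625, -0.125)]

# Four studies with highly disparate estimates and no shared region
# (the situation described for Fig. 1A).
studies_a = [(-0.875, -0.5), (-0.75, -0.375), (0.25, 0.75), (0.375, 0.875)]

print(common_overlap(studies_b))    # (-0.5, -0.25)
print(overlap_midpoint(studies_b))  # -0.375
print(common_overlap(studies_a))    # None
```

A result of None corresponds to the case in which pooling looks inappropriate on visual inspection; a non-empty region suggests only that pooling may be reasonable, and does not replace the judgment about whether the point estimates are too disparate.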
