Grading Strength of Evidence
  • Grading Strength of Evidence
  • Systematic Review Process Overview This module focuses on a “late step” in the development of comparative effectiveness reviews (CERs) and systematic reviews more generally: developing and documenting “grades” for the strength of a body of evidence. This module primarily addresses comparative effectiveness reviews of interventions and not medical tests. Earlier steps leading to this step in the CER process include developing key questions and analytic frameworks, identifying critical outcomes and comparisons for which strength of evidence should be graded, and rating the internal validity (quality, risk of bias) of individual studies. Not covered herein are tasks such as assessing the applicability of bodies of evidence (i.e., formally evaluating the generalizability or external validity of either individual studies or bodies of evidence). Issues concerning meta-analysis are covered in earlier modules.
  • Learning Objectives
  • Grading Strength of Evidence The approach used by the Evidence-based Practice Centers (EPCs) to grade strength of evidence is based in large measure on the approach used by the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) Working Group. EPC and GRADE experts explore ways to harmonize the two methods and to offer reviewers and decisionmakers a coordinated model for grading strength of evidence.
  • Why Grade Strength of Evidence? The goal of our evidence-grading system is to facilitate the use of the reports by decisionmakers who may have differing perspectives. Among organizations that make practice guidelines or coverage decisions and among experts who develop systematic reviews, assessment of the strength of a body of evidence is widely accepted. A growing number of organizations adopt systematic approaches to making judgments about the strength of evidence. A wide variety of grading systems is available for this purpose, and different organizations may weigh features, or domains, of a body of evidence differently. Consequently, discrepant, contradictory, or variable ratings may arise, and results may not be of practical help to some organizations. We note the important distinction between strength-of-evidence systems and evidence hierarchies. Evidence hierarchies traditionally focus only on study design, with systematic reviews of randomized controlled trials (RCTs) and individual RCTs at the highest levels. By contrast, strength-of-evidence systems incorporate not only study design but also many other facets of the evidence, including study conduct, presence or absence of bias, quantity of evidence, directness (or indirectness) of evidence, consistency of evidence, and precision of estimates. By including these additional components in our approach, we have attempted to give decisionmakers a more comprehensive evaluation of the evidence. The aims of this work are to ensure appropriate methodological consistency in how different Evidence-based Practice Centers (EPCs) grade the strength of evidence and to facilitate interpretations of those grades by users and their application of them in guideline development or other decisionmaking tasks. Attaining these goals rests in part on consistency and predictability in the domains that EPC reviewers use in this effort. 
Although no one system for reporting results and grading the related strength of evidence is likely to suit all users, documentation and consistent reporting of the most important summary information about a body of literature will make reviews more useful to a broader range of potential audiences. One part of the American Recovery and Reinvestment Act of 2009, sometimes referred to as the “stimulus bill,” provided AHRQ with $300 million to support various types of research and other activities. A significant portion of this is being used to produce comparative effectiveness reviews and other systematic reviews (as well as methods research that may cover issues raised for grading strength of evidence).
  • Three Steps to Grading Strength of Evidence
  • Four Required Domains: Risk of Bias The risk of bias for an evidence base is derived from assessing the risk of bias (quality, internal validity) for individual studies. Risk of bias incorporates both study design and study conduct. Typically, determining risk of bias entails giving each included article a rating of “good,” “fair,” or “poor.” In some cases, bodies of evidence may include only articles rated good or fair and not deal further with articles rated poor. The grading step requires reviewers to assess the aggregate quality of studies within each major study design and to integrate those assessments into an overall risk-of-bias score. Risk-of-bias scores are denoted as high, medium, or low. High risk of bias lowers the strength-of-evidence grade; low risk of bias raises the strength-of-evidence grade. If studies included in a systematic review differ substantially in risk of bias, reviewers at the Evidence-based Practice Centers (EPCs) may give greater weight or emphasis to the studies with a lower risk of bias. In formal meta-analyses, EPC reviewers may choose to evaluate the influence of studies with differing risk of bias to aid in their assessment of the overall strength of evidence.
  • Four Required Domains: Consistency Consistency refers to the degree of similarity in the effect sizes of different studies within an evidence base. If effect sizes indicate the same direction of effect, and if the range of effect sizes is narrow, an evidence base can be judged to be consistent. This assessment enhances the overall strength-of-evidence grade. Nonoverlapping confidence intervals, significant unexplained clinical or statistical heterogeneity, or similar problems may reflect inconsistency. The presence of inconsistency is the chief concern for grading strength of evidence in this domain, and it would lead reviewers at the Evidence-based Practice Centers to reduce the overall strength-of-evidence grade. Some bodies of evidence may show statistical heterogeneity in effect sizes but consistency in the direction of effect. Even if EPC reviewers cannot explain the heterogeneity satisfactorily, they can still judge the evidence base to be consistent with respect to the direction of effect. With substantial unexplained heterogeneity, however, EPC reviewers should be appropriately cautious about estimating treatment effects. EPC reviewers should designate an evidence base as being inconsistent when different studies show statistically significant effect sizes in opposite directions. In the absence of statistical testing or measurement of heterogeneity, EPC reviewers can assess consistency on the basis of similarity of populations, interventions, and outcome measures.
  • Four Required Domains: Consistency Scores Evaluation of consistency ideally requires an evidence base with independent replication of findings; therefore, reviewers at the Evidence-based Practice Centers (EPCs) cannot evaluate consistency properly in an evidence base with a single study. Even if the study is a large multicenter trial (i.e., a mega-trial), findings from different centers within such a study are rarely reported separately. If the results are reported separately for each center, EPC reviewers may be able to evaluate consistency within the overall trial, but this is not truly independent replication. Any flaw (reported or unreported) in the trial design or conduct will likely be replicated at every center. Even pairs of mega-trials that address the same clinical question (i.e., the same patient-intervention-outcome combinations) may report discrepant results, and the methodology of mega-trials has been questioned further. Thus, EPC reviewers cannot be certain that a single trial, no matter how large or well designed, presents the definitive picture of any particular clinical benefit or harm for a given treatment. Accordingly, with respect to consistency, we recommend that EPC reviewers judge single-study evidence bases to be "consistency unknown (single study)," which would generally decrease the strength-of-evidence grade. If meta-analysis is appropriate, EPC reviewers can evaluate consistency by using statistical tests and measures of heterogeneity, such as Cochran’s Q test or the I² statistic, as noted in the first module on quantitative synthesis. References: Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf. Owens DK, Lohr KN, Atkins D, et al.
Grading the strength of a body of evidence when comparing medical interventions —Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523. Quantitative Synthesis. In: Agency for Healthcare Research and Quality. Methods Guide for Comparative Effectiveness Reviews, Version 1.0 [Draft posted Oct. 2007] Rockville, MD. Available at: http://www.effectivehealthcare.ahrq.gov/repFiles/2007_10DraftMethodsGuide.pdf.
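The heterogeneity measures named in the consistency notes, Cochran's Q and I², can be computed from per-study effect estimates and standard errors. The following is a minimal sketch, assuming fixed-effect inverse-variance pooling; the function name and the example numbers are illustrative and do not come from the Methods Guide.

```python
def heterogeneity(effects, std_errors):
    """Return (pooled_effect, Q, df, I2_percent) for a set of study estimates.

    Assumes each study contributes one effect estimate on an additive scale
    (e.g., mean difference or log odds ratio) with its standard error.
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching standard errors")
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, df, i2


# Hypothetical data: three studies with similar effects in the same direction.
pooled, q, df, i2 = heterogeneity([-0.30, -0.25, -0.35], [0.10, 0.10, 0.10])
print(f"pooled={pooled:.2f} Q={q:.2f} df={df} I2={i2:.0f}%")
```

A single-study evidence base has no degrees of freedom for this calculation, which matches the guidance that such evidence be scored "consistency unknown (single study)."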
  • Four Required Domains: Directness Directness concerns whether the evidence being assessed reflects a single, direct link between the interventions of interest and the ultimate health outcome under consideration (whether a benefit or harm). If direct evidence linking an intervention to the ultimate health outcomes is lacking, then two or more bodies of evidence are needed to link the intervention to health outcomes. When several bodies of evidence are involved, the ultimate decision about using an intervention may depend on the strength of evidence for every link in the causal chain. Some links in the causal chain will be more important than others. Thus, the final assessment of directness requires reviewers at the Evidence-based Practice Centers (EPCs) to consider the strength of evidence for each link, as well as the importance of each link in the chain. Of particular salience is the extent to which evidence pertains to intermediate or surrogate outcomes rather than to ultimate patient-centered outcomes, such as mortality, morbidity, and quality of life. More direct links enhance strength-of-evidence assessments (and vice versa). As illustrated in the chapter “Grading the Strength of a Body of Evidence When Comparing Medical Interventions,” in the Methods Guide for Comparative Effectiveness Reviews (an update of chapter 11 in version 1.0), use of an analytic framework is an important heuristic for determining how to evaluate evidence in a causal chain (e.g., in an overarching link or only in subsidiary linkages). References: Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf. Owens DK, Lohr KN, Atkins D, et al.
Grading the strength of a body of evidence when comparing medical interventions —Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
  • Four Required Domains: Aspects of Indirectness For comparative effectiveness reviews in particular, directness also applies to comparisons of interventions. For example, if there are three alternative interventions — A, B, and C — then it is desirable to have evidence that compares them directly: A versus B, A versus C, and B versus C. In many circumstances, such head-to-head evidence is not available. Under these circumstances, reviewers must look to indirect evidence, such as evidence for A versus B and B versus C, but not A versus C. Grades for such indirect evidence will not be as strong as those obtained from truly direct evidence. A single body of evidence is preferable to two bodies of evidence, particularly if the strengths of evidence for those two bodies of evidence differ in material ways. Assessing directness clarifies the degree to which evidence between the intervention and the ultimate health outcome does or does not meet the ideal set of studies addressing the overarching question.
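One way to see why indirect evidence for A versus C is weaker than head-to-head evidence is to combine the A-versus-B and B-versus-C estimates numerically. The sketch below uses an adjusted indirect comparison (often attributed to Bucher and colleagues), which is a technique introduced here for illustration and is not prescribed by these notes. It assumes the two bodies of evidence are independent, that effects are on an additive scale, and that the populations are similar enough to combine; all names and numbers are hypothetical.

```python
import math


def indirect_comparison(d_ab, se_ab, d_bc, se_bc, z=1.96):
    """Estimate A versus C from A-versus-B and B-versus-C estimates.

    d_ab is the effect of A minus the effect of B; d_bc is B minus C.
    Both must be on an additive scale (e.g., mean difference, log odds ratio).
    Returns (d_ac, se_ac, (ci_lower, ci_upper)).
    """
    d_ac = d_ab + d_bc
    # Variances add, so the indirect estimate is always less precise
    # than either of the direct estimates it is built from.
    se_ac = math.sqrt(se_ab ** 2 + se_bc ** 2)
    return d_ac, se_ac, (d_ac - z * se_ac, d_ac + z * se_ac)


# Hypothetical inputs for illustration only.
d_ac, se_ac, ci = indirect_comparison(d_ab=-0.2, se_ab=0.3, d_bc=-0.1, se_bc=0.4)
print(f"A vs C: {d_ac:.2f} (SE {se_ac:.2f}), 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

Because the standard error of the indirect estimate exceeds that of either input, the resulting confidence interval is wider, which is consistent with grading indirect evidence less strongly than direct evidence.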
  • Related Issue of Applicability
  • Four Required Domains: Directness Scores “Direct” means that an overarching link can be made, in the relevant body of evidence on the topic at hand, between use of the intervention (e.g., technology, health care service) and important health or ultimate outcomes (e.g., death, disease/morbidity, patient-reported quality of life). Reviewers do not need to make assumptions about the link between use of the intervention and proxy, surrogate, or intermediate measures (e.g., laboratory tests, clinical observations) and the ultimate health outcomes of interest to patients or their families. “Indirect” refers to a situation in which more than one body of evidence is needed to move from use of the health care service to ultimate outcomes. Information about one link (e.g., between intermediate and final/ultimate outcomes, or between use of the service and the intermediate or proxy measures such as laboratory test results) may be supported by citations and not necessarily reviewed formally in the comparative effectiveness review (CER) or systematic review. Depending upon the question and literature reviewed, CERs and systematic reviews can contain a mix of direct and indirect evidence, but strength of evidence is graded independently for each important outcome (or comparison).
  • Four Required Domains: Precision Precision is the degree of certainty surrounding an estimate of effect with respect to a specific outcome. A precise estimate should enable decisionmakers to draw conclusions about whether one treatment is, clinically speaking, inferior, superior, or equivalent (neither inferior nor superior) to another. This domain is a complicated concept, because judgments about precision may depend on the importance of the outcome being measured, other clinically important outcomes, the context of decisionmaking, whether the central issue is harms or benefits, and the relative effect or size of those harms or benefits. Reviewers at the Evidence-based Practice Centers (EPCs) should assess the boundaries of the pooled confidence interval for that effect estimate in relation to a threshold that would allow users of comparative effectiveness reviews to make judgments about the treatments being compared. Relevant thresholds for precision include the boundary of statistical significance, that is, whether the estimate of an effect reaches accepted levels for statistical significance. Substantial variability does not necessarily render an estimate imprecise. A truly imprecise estimate is one with a confidence interval so wide that it does not rule out the superiority or inferiority of either treatment being compared; that is, the confidence interval includes two incompatible possibilities (e.g., that one treatment is clinically significantly better than the other and that it is clinically significantly worse). In this case, no conclusion can be reached about the relative effectiveness of the two treatments.
  • Four Required Domains: Precision Scores This domain should be rated as “precise” or “imprecise” for each important outcome and comparison. If a meta-analysis was performed, this will be the confidence interval around the summary effect size. The Methods Guide for Comparative Effectiveness Reviews makes no recommendations about how to choose “important” outcomes or comparisons. This must be left to the teams producing the systematic review or comparative effectiveness review.
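The description of a truly imprecise estimate (a confidence interval that includes both a clinically significant benefit and a clinically significant harm) can be expressed as a simple check. This is a sketch under stated assumptions: the effect is a difference where positive values favor treatment A, and `min_important_difference` is a hypothetical, reviewer-chosen threshold for clinical significance that the notes do not specify.

```python
def rate_precision(ci_lower, ci_upper, min_important_difference):
    """Score a confidence interval as 'precise' or 'imprecise'.

    The interval is 'imprecise' only when it is wide enough to include both
    a clinically important advantage for A and a clinically important
    advantage for B. The threshold is an assumption supplied by the reviewer.
    """
    if min_important_difference <= 0:
        raise ValueError("min_important_difference must be positive")
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    includes_a_better = ci_upper >= min_important_difference
    includes_a_worse = ci_lower <= -min_important_difference
    return "imprecise" if (includes_a_better and includes_a_worse) else "precise"


# Hypothetical intervals with a threshold of 0.5 units.
print(rate_precision(-0.8, 0.9, 0.5))  # spans important harm and benefit
print(rate_precision(0.1, 0.4, 0.5))   # narrow interval
```

Reviewers may reasonably apply other thresholds, such as the boundary of statistical significance mentioned in the notes; the check above covers only the two-incompatible-conclusions case.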
  • Additional Domains This second set of four domains supplements the four required domains and consists of secondary constructs that reviewers at the Evidence-based Practice Centers should use and report if they are relevant to a particular comparative effectiveness review or systematic review. These domains are dose-response association, the existence of confounders that would diminish an observed effect, strength of association (i.e., magnitude of effect), and publication bias. These domains are found in other schemes for grading strength of evidence, including GRADE. Generally, expect that three of these domains (dose-response association, plausible confounders, and strength of association) will be relevant more often to evidence from observational studies (of all types) than to evidence from randomized controlled trials.
  • Additional Domains: Dose-Response Association This association, either across or within studies, refers to a pattern of a larger effect with greater exposure (dose, duration, adherence). This additional domain should be rated if studies in the evidence base have noted levels of exposure. If dose-response association is present, the reviewer may upgrade the strength of evidence.
  • Additional Domains: Dose-Response Scores This slide describes the three possible scores for dose response. If a dose-response association is present, the EPC reviewer may increase the strength of evidence.
  • Additional Domains: Plausible Confounding
  • Additional Domains: Plausible Confounding Scores This additional domain should be considered if plausible confounding exists that would decrease the observed effect. Use one of two scores: Present: confounding factors that would decrease the observed effect may be present. Absent: confounding factors that would decrease the observed effect are not likely to be present.
  • Additional Domains: Strength of Association If strength of association is strong, that may increase the strength of evidence.
  • Additional Domains: Strength of Association Scores This additional domain should be considered when the effect size is particularly large and then given one of two scores: either a strong association or a weak association, perhaps explained by confounding factors.
  • Additional Domains: Publication Bias Reviewers should invoke publication bias concerns when they have reason to believe that relevant empirical findings have not been published or are not otherwise available. Three situations are particularly relevant: 1) when negative, no-difference, or other studies with results that are substantially different from published studies are unavailable; 2) when the results of completed studies (e.g., those noted in ClinicalTrials.gov as having ended 3 or more years in the past) have clearly not been published (save, perhaps, in abstract form); and 3) when trial protocols specify certain secondary end points for which results have not been reported (even if other results have been published). Reviewers at the Evidence-based Practice Centers should consider and report on publication bias insofar as it appears to influence scores for either required (e.g., consistency or precision) or other domains.
  • Additional Domains: Publication Bias Scores According to the Methods Guide for Comparative Effectiveness Reviews, publication bias need not be formally scored. However, it can influence ratings of consistency, precision, magnitude of effect, and, to a lesser degree, risk of bias and directness. If reviewers at the Evidence-based Practice Centers identify unpublished trials, and if the results differ from those of published studies, they can take these factors into account in their rating for consistency and in calculating a summary confidence interval for an effect. We encourage authors to comment on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are not otherwise available.
  • Procedures for Assessing Domains Reviewers need not score strength of evidence for every comparison or outcome; rather, they should focus on major outcomes and comparisons. It is up to reviewers at the Evidence-based Practice Centers to identify what are considered to be “major” outcomes or comparisons; this may be done in conjunction with advice from a Technical Expert Panel. The point is that reviewers should grade each required domain for each of these major outcomes or comparisons.
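Recording one score per required domain for each major outcome and comparison can be sketched as a small validated record. The score labels below are taken from the slide notes (risk of bias: low, medium, high; consistency: consistent, inconsistent, unknown (single study); directness: direct, indirect; precision: precise, imprecise); the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass

# Allowed scores per required domain, as named in the notes.
ALLOWED_SCORES = {
    "risk_of_bias": {"low", "medium", "high"},
    "consistency": {"consistent", "inconsistent", "unknown (single study)"},
    "directness": {"direct", "indirect"},
    "precision": {"precise", "imprecise"},
}


@dataclass(frozen=True)
class DomainAssessment:
    """Required-domain scores for one major outcome and comparison."""
    outcome: str
    comparison: str
    risk_of_bias: str
    consistency: str
    directness: str
    precision: str

    def __post_init__(self):
        # Reject any score that is not one of the documented labels.
        for domain, allowed in ALLOWED_SCORES.items():
            value = getattr(self, domain)
            if value not in allowed:
                raise ValueError(
                    f"{domain}={value!r} is not one of {sorted(allowed)}"
                )


# Hypothetical example entry.
example = DomainAssessment(
    outcome="all-cause mortality",
    comparison="drug A vs. drug B",
    risk_of_bias="low",
    consistency="consistent",
    directness="direct",
    precision="imprecise",
)
print(example)
```

Keeping one such record per graded outcome and comparison makes it straightforward to show readers which domains drove each grade.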
  • Strength of Evidence Grades (I) The overall grade for strength of evidence reflects a global assessment that takes the required domains directly into account and, as needed, incorporates judgments about the additional domains. The aims are to provide “actionable” information for a variety of different users, readers, and stakeholders, and be transparent in how the strength-of-evidence grades are reached.
  • Strength of Evidence Grades (II) As mentioned, for each comparison of interest, reviewers in the Evidence-based Practice Centers (EPCs) should rate strength of evidence for each major benefit (e.g., positive effects on health outcomes such as physical function or quality of life, or effects on laboratory measures or other surrogate variables) and each major harm (ranging from rare, serious, or life-threatening adverse events to common but bothersome effects). For both benefits and harms, EPC reviewers should focus on the outcomes most relevant to patients, clinicians, and policymakers. Systematic reviews and comparative effectiveness reviews can be broad in scope, encompassing multiple patient populations, interventions, and outcomes. EPC reviewers are not expected to grade every possible comparison for every outcome. Rather, reviewers should set clear priorities, assigning grades to those combinations (patients-interventions-outcomes) that are likely to be of greatest interest to users of the report. EPC reviewers should also state clearly which interventions, outcomes, and comparators they included for each strength-of-evidence grade. For example, an evidence grade might apply to a link in an analytic framework, or it might apply to a specific intervention for a specific set of outcomes in a particular population. EPC reviewers should also make clear which of the comparators or interventions is favored for each strength-of-evidence grade.
  • Strength of Evidence Grades and Definitions Reviewers in the Evidence-based Practice Centers (EPCs) should use four levels for grading the strength of a body of evidence. Each grade level has two components: The principal definition, which is highlighted in italic type, concerns the level of confidence the reviewers place in the estimate of effect for the benefit or harm (i.e., their judgment that the evidence reflects the true effect). The subsidiary definition involves a subjective assessment of the likelihood that future research may affect the level of confidence in the estimate or actually change that estimate. Grades are denoted as high, moderate, low, and insufficient. They are not designated by Roman numerals or other symbols. The definitions used by EPC reviewers are different from those used by the GRADE Working Group (as of 2008); the definitions used by EPC reviewers focus on the confidence decisionmakers have about the evidence and only secondarily invoke the GRADE decisions, which involve only possible (future) research and the unknowable results of such research, should it be conducted.
  • Strength of Evidence Grades: Additional Points (I) Assigning a grade of high, moderate, or low implies that an evidence base is available from which to estimate an effect. Reviewers in the Evidence-based Practice Centers (EPCs) understand that even when evidence is low, consumers, clinicians, and policymakers may have to make decisions. The designations of high, moderate, and low should convey how secure reviewers feel about decisions based on evidence of differing grades. EPC reviewers should apply discrete grades and avoid designations such as “low to moderate” strength of evidence.
  • Strength of Evidence Grades: Additional Points (II) In some cases, reviewers cannot draw conclusions for a particular outcome, specific comparison, or other question of interest. In these situations, reviewers in the Evidence-based Practice Centers (EPCs) should assign a grade of insufficient. Such situations arise in two main ways. First, evidence for an outcome receives a grade of insufficient when no evidence is available from the included studies. This circumstance includes the absence of any relevant studies whatsoever. In comparative effectiveness reviews, for example, certain drug comparisons may never have been studied (or published) in head-to-head trials, and placebo-controlled trials of the multiple drugs of interest may not provide adequate indirect evidence for any comparisons. Second, a grade of insufficient is also appropriate when evidence on the outcome is too weak, sparse, or inconsistent to permit any conclusion to be drawn. This situation can reflect several complicated conditions, such as unacceptably high risk of bias or a major inconsistency that cannot be explained (e.g., two studies with the same risk of bias that found opposite results, with no clear explanation for the discrepancy). Imprecise data may also lead to a grade of insufficient, specifically when the confidence interval is so wide that it includes two incompatible conclusions: that one treatment is clinically significantly better than the other and that it is worse. Indirect data based on only one study or comparison could also receive a grade of insufficient. If a single quantitative estimate is desired, the strength of evidence may be insufficient if an effect size cannot be calculated from reported information or if heterogeneity cannot be explained. This same evidence base may still be sufficient to permit a conclusion about the general direction of the effect, but EPC reviewers need to take care not to conflate “low” with “insufficient” strength of evidence.
  • Scoring and Reporting: General Guidance To assign an overall grade to the strength of a body of evidence, reviewers in the Evidence-based Practice Centers (EPCs) must decide how to incorporate multiple domains into an overall assessment. Apart from deciding how best to proceed, reviewers need to understand that this is a complex step requiring some training and practice, because no one way may be right and because consistency and transparency in making what can be subjective decisions are important. Three options are available to EPC reviewers, one of which is the system promulgated by the GRADE Working Group. Because this system differs in some ways from the guidance developed for EPC reviewers, the AHRQ EPC program will be collaborating with the GRADE Working Group to address ongoing challenges in assessing the strength of evidence. In some systems, such as that of the GRADE Working Group, the overall grade for strength of evidence (which they call quality of evidence) is calculated from the ratings for each domain by using a method that provides guidance on how to upgrade or downgrade the rating of the evidence. Such a system has the advantage of transparency, because it clearly delineates a direct path from the evidence to its grade. With respect to non-GRADE approaches, although a system that uses such a method may offer advantages in terms of transparency, as yet no empirical evidence supports the superiority of a particular point system when compared with a more qualitative approach. Furthermore, some evidence suggests no difference in accuracy between quantitative and qualitative systems. More methods-oriented research is needed, for instance, in comparing a point-system approach with other grading systems. Thus, EPC reviewers may use different approaches to incorporate multiple domains into an overall strength-of-evidence grade. 
EPC reviewers should explain the rationale for their approach to strength-of-evidence rating and note which domains were important in upgrading or downgrading the strength of evidence. The GRADE Working Group uses an algorithm to help reviewers clarify how they consider domains to produce the grade. EPC reviewers may use the GRADE system or their own weighting system, or they may elect to use a qualitative approach, as long as the rationale for strength-of-evidence ratings is clear. References: Atkins D, Best D, Briss PA, et al, for the GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:1490. http://www.ncbi.nlm.nih.gov/pubmed/15205295. Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf. Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions — Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
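For reviewers who choose their own weighting system, a point-based rule might look like the sketch below. Every weight here is hypothetical and chosen only for illustration: it is not the GRADE algorithm and not an EPC-endorsed formula. Only the directions of adjustment (downgrade for risk of bias, inconsistency, indirectness, and imprecision; upgrade for the additional domains) follow the notes. The mapping of the lowest score to "insufficient" is also an assumption of this sketch.

```python
GRADES = ["insufficient", "low", "moderate", "high"]

# Hypothetical point deductions per required domain score.
PENALTIES = {
    "risk_of_bias": {"low": 0, "medium": 1, "high": 2},
    "consistency": {"consistent": 0, "inconsistent": 1, "unknown (single study)": 1},
    "directness": {"direct": 0, "indirect": 1},
    "precision": {"precise": 0, "imprecise": 1},
}


def overall_grade(risk_of_bias, consistency, directness, precision,
                  dose_response=False, strong_association=False,
                  confounding_would_reduce_effect=False):
    """Combine domain scores into one grade using illustrative weights."""
    points = len(GRADES) - 1  # start at the top grade ("high")
    points -= PENALTIES["risk_of_bias"][risk_of_bias]
    points -= PENALTIES["consistency"][consistency]
    points -= PENALTIES["directness"][directness]
    points -= PENALTIES["precision"][precision]
    # Each additional domain that is present adds one point.
    points += sum([dose_response, strong_association,
                   confounding_would_reduce_effect])
    # Clamp to the range of defined grades.
    return GRADES[max(0, min(len(GRADES) - 1, points))]


print(overall_grade("medium", "consistent", "direct", "imprecise"))
```

Whatever rule is used, writing it down at this level of detail lets readers see which domains moved a grade up or down and replicate the result.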
  • Guiding Principles: Risk of Bias The risk of bias based on the design and conduct of the available studies is an essential component to rating the overall body of evidence. In considering the risk-of-bias domain, reviewers in the Evidence-based Practice Centers should consider which study design is most appropriate to reduce bias for each key question about outcomes and comparisons.
  • Guiding Principles: Risk of Bias Example This is an example using pharmaceuticals. For many of the traditional therapeutic interventions, evidence that is based on well-conducted randomized controlled trials (RCTs) will have less risk of bias than evidence based on observational studies. For these outcomes, if RCT data are available, reviewers in the Evidence-based Practice Centers may choose to start with a rating of low for the risk-of-bias domain and change the assessment of this domain if the RCTs have important flaws. For these traditional therapeutic intervention questions, observational data would generally start with a high risk of bias but may be altered, depending on the conduct of the study. However, the overall strength of evidence must incorporate assessments of other domains in addition to risk of bias.
  • Further Guidance: Principles for Scoring Reviewers in the Evidence-based Practice Centers (EPCs) should take specific steps to ensure reliability and transparency within their own work (both in individual reviews and across them) when incorporating domains into an overall grade. The first step is for EPC reviewers to be explicit about whether the evidence grade will be determined by a point system for combining ratings of the domains or by a qualitative consideration of the domains. They should carefully document the procedures used to grade strength of evidence and provide enough detail within the report to ensure that users can grasp the methods that were employed. Furthermore, EPC reviewers should keep records of their procedures and results for each review so that they may contribute to the overall EPC expertise and to the science of grading evidence. The second step is to identify the domains that are most important for the targeted body of evidence and decide how to weight the domains when assigning the evidence grade. For the sake of consistency across reviews, the domains should be defined by using the terminology presented herein. In the absence of evidence to support specific systems for weighting the domains, both qualitative and quantitative approaches are acceptable. EPC reviewers may also choose to follow GRADE guidance for downgrading and upgrading evidence based on assessments of each domain. In general, the highest priority should be given to the risk-of-bias domain, because it has been well established that evidence is strongest when the study design and conduct have the lowest risk of bias. The third step is to develop an explicit procedure for ensuring a high degree of interrater reliability for rating individual domains. As mentioned earlier, this step assumes that at least two reviewers with appropriate clinical and methodological expertise will rate each domain. 
In addition, EPC reviewers should assess the resulting interrater reliability for each domain. Although EPC reviewers generally will not include the details of the reliability assessment in their comparative effectiveness reviews, they should keep records of this information. By documenting this information, EPC reviewers will be able to increase knowledge about the reliability of the grading system. The fourth step is to use the ratings of the domains to assign an overall strength-of-evidence grade according to the decisions made in the first three steps. If this action involves a qualitative approach, with subjective weighting of the domains, EPC reviewers should consider using at least two reviewers and assessing the interrater reliability of this step. This will not be necessary if the approach involves a formulaic calculation or algorithm based on the ratings of the domains. However, the scoring system or algorithm should be specified in sufficient detail to permit readers to replicate it if desired.
  • Further Guidance: Principles for Reporting (I) Reviewers should present information about all comparisons of interest for the outcomes that are most important to patients and other decisionmakers. Thus, strength of evidence should relate to those important outcomes. Complete and perfect information is rarely available. For some treatments, data may be lacking about one or more of the outcomes. In other cases, the available evidence is imprecise, is not applicable to some populations of interest, or comes from studies that have important flaws. For these reasons, reviewers in the Evidence-based Practice Centers should also present information that will help decisionmakers judge the risk of bias in the estimates of effect, assess the applicability of the evidence to populations of interest, and take imprecision and other factors into account. In arriving at an overall strength-of-evidence grade, the crucial requirement is transparency. The EPC method allows EPC reviewers, if they choose, to make a global assessment of the overall quality of evidence rather than explicitly scoring each domain and then combining the scores. Nevertheless, EPC reviewers are encouraged to make judgments for individual domains as a first step and to be especially sensitive to the effects of any “borderline” scores for those domains and their impact on the overall score. Being explicit and transparent about what criteria are used to raise or lower grades is the essential element in this step.
  • Further Guidance: Principles for Reporting (II) References: Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf. Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions - Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
  • Grading Strength of Evidence: Presentation of Results — Moderate and High Grades This slide uses hypothetical data to illustrate one approach to providing actionable information to decisionmakers that reflects strength of evidence. It shows how reviewers might present information pertinent to magnitude of effect and strength of evidence from different types of studies. Typically, the information in these tables would be combined into a single table; they are split here just for ease of reading. For each outcome of interest, the tables give the numbers of studies and subjects (left column), the grades for the four required domains (middle columns), and estimates of the magnitude of effect and strength of evidence (right column), which in this example is characterized as the “absolute risk difference per 100 patients.” The row identifying the outcome as a whole (e.g., severe diarrhea) also gives the overall rating (e.g., moderate SOE [strength of evidence]). This slide depicts the magnitude of effect and strength of evidence grade for two outcomes: severe diarrhea (moderate SOE) and improved quality of life (high SOE). It shows that 4 fair-quality randomized controlled trials (RCTs), with a total of 256 subjects, reported severe diarrhea, which was lower by 4 patients per 100 treated (i.e., 4%), a difference that was not statistically significant (95% confidence interval [CI], −8 to +1). The data were consistent and direct but imprecise. For the same comparison, 14 fair-quality retrospective cohort studies, with a total of 28,400 subjects, produced an absolute risk difference of −5, which was statistically significant (95% CI of −8 to −2). These estimates were consistent, direct, and precise. Note that when different bodies of evidence are considered (e.g., trials and also observational studies), the data are provided for different types of study designs; in this example, RCTs are separate from cohort studies. 
Note that although the tables on these slides illustrate how reviewers in the Evidence-based Practice Centers might organize information about the strength of evidence and magnitude of effect in ways useful to decisionmakers, this is not a “complete” account of this aspect of any review. First, the tables do not convey any information about the applicability of the evidence. As emphasized earlier, applicability is dealt with and presented through other means (in text or in a table). Second, a narrative summary of the results is also essential for interpreting the results of a literature synthesis.
  • Grading Strength of Evidence: Presentation of Results - Insufficient and Low This table uses hypothetical data to illustrate how to present actionable information to decisionmakers that reflects strength of evidence. In this case, the evidence pertains to mortality and myocardial infarction and is either insufficient to grade strength of evidence (SOE) or has low SOE. The table gives the numbers of studies and subjects (left column), the grades for the four required domains (middle columns), and the estimates of the magnitude of effect and strength of evidence (right column). In the row identifying the outcome as a whole (e.g., mortality), the table also gives the overall rating (e.g., insufficient SOE). As illustrated on the previous slide, when different bodies of evidence are considered (e.g., trials and observational studies), the data are provided separately for different types of study designs. For example, for mortality, randomized controlled trials are separate from retrospective cohort studies. This table shows that one fair-quality randomized controlled trial (RCT) with 80 subjects reported mortality, which was lower by 1 patient per 100 treated (i.e., 1%), a difference that was not statistically significant (95% confidence interval [CI], −4 to +3). Consistency was unknown (a single study), and the data were direct but imprecise. For the same comparison, 14 retrospective cohort studies, with 384 subjects, had a wide range of effect sizes (range, −7% to +5%). Although the linkage was direct, the studies were both inconsistent and imprecise. Had these estimates been precise and consistent (i.e., a narrower CI for the RCT, consistent cohort studies to allow a summary effect size), one might have been able to reach a conclusion. However, the evidence is insufficient to allow a conclusion for mortality.
  • Comparison With the GRADE System Our recommendations address specific circumstances of the Evidence-based Practice Center (EPC) program that differ from the circumstances of some groups that use GRADE. The EPC program produces systematic reviews, but it is not directly involved in the development of recommendations or guidelines. Rather, EPC reports are used by a spectrum of government agencies, professional societies, and other stakeholders. Our approach for grading strength of evidence and discussing applicability of the evidence is meant to facilitate the use of EPC reports by this broad group of users. EPC reviewers may believe that for certain outcomes, such as harms, observational studies have less risk of bias when compared with randomized controlled trials (RCTs) or that the available RCTs have a substantial risk of bias. In such instances, EPC reviewers may either raise the initial strength-of-evidence rating based on observational studies to “moderate” or lower the initial rating based on RCTs to “moderate” or “low.” The EPC approach emphasizes that applicability should be assessed separately from strength of evidence. GRADE also addresses applicability, which is incorporated within the general concept of directness. The rationale for the EPC approach is that many stakeholders use EPC reviews to develop guidelines or make clinical or health policy decisions, and they may have different views on how much the evidence applies to populations of interest to them. Future EPC reports will have a discussion of and information about applicability, and the intention is for the various users and audiences to read this section of the report and make their own judgments. References: Atkins D, Best D, Briss PA, et al, for the GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:1490. http://www.ncbi.nlm.nih.gov/pubmed/152052 Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf. Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions - Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
  • Summary: Grading Strength of Evidence The grading system used by reviewers in the Evidence-based Practice Centers is conceptually similar to the GRADE (Grading of Recommendations Assessment, Development and Evaluation) system of evidence rating. It requires assessment of four domains: risk of bias, consistency, directness, and precision. Additional domains to be used when appropriate are dose-response association, presence of confounders that would diminish an observed effect, strength of association, and publication bias. Strength of evidence receives a single grade: high, moderate, low, or insufficient. EPC reviewers should grade strength of evidence separately for each major outcome and, for comparative effectiveness reviews, all major comparisons.
  • References
  • Author

Grading Strength of Evidence: Presentation Transcript

  • Grading Strength of Evidence Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for Systematic Reviews Methods Guide www.ahrq.gov
  • Systematic Review Process Overview
    • To define what “grading strength of evidence (SOE)” is
    • To describe why grading SOE is important
    • To distinguish between grading SOE and rating the quality of individual articles
    • To list primary and additional domains for grading SOE
    • To describe options for scoring SOE domains
    • To describe how to score and present SOE grades
    Learning Objectives
    • Is distinct from rating the quality of individual studies
    • Is generally used only to assess:
      • Major outcomes (benefits and harms)
      • Major comparisons, when relevant
    Grading Strength of Evidence
    • To facilitate use of systematic reviews by diverse decisionmakers and stakeholders
    • To give decisionmakers:
      • A comprehensive evaluation of the evidence
      • A sense of how much confidence they can place in the evidence
    • To foster transparency and documentation
    Why Grade Strength of Evidence?
    • Scoring four required domains
      • Risk of bias
      • Consistency
      • Directness
      • Precision
    • Considering, and possibly scoring, four additional domains
      • Dose-response association
      • Plausible confounders
      • Strength of association
      • Publication bias
    • Combining scores from required domains into a single strength-of-evidence score, taking scores on additional domains into account as needed
    Three Steps to Grading Strength of Evidence
    • Concerns both study design and study conduct for individual studies, rated by usual methods
    • Assesses the aggregate quality of studies within each major study design and integrates those assessments into an overall risk-of-bias score
    • Risk-of-bias scores:
      • High — lowers strength-of-evidence grade
      • Medium
      • Low — raises strength-of-evidence grade
    Four Required Domains: Risk of Bias
    • Defined as the degree of similarity in the effect sizes of different studies within an evidence base
      • Consistent evidence bases:
        • Have the same direction of effect (same side of “no effect”)
        • Have a narrow range of effect sizes
      • Inconsistent evidence bases:
        • Have nonoverlapping confidence intervals
        • Have significant unexplained clinical or statistical heterogeneity
    Four Required Domains: Consistency
    • Only three possible scores for consistency:
      • Consistent (i.e., no inconsistency)
      • Inconsistent
      • Unknown or not applicable (single study cannot be assessed)
    • Meta-analysis:
      • Use appropriate tests, such as Cochran’s Q test or the I² statistic
    Four Required Domains: Consistency Scores
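As an illustration of the heterogeneity statistics named on this slide, the following minimal Python sketch computes Cochran's Q and the I² statistic around a fixed-effect, inverse-variance pooled estimate. The function name and the input numbers are hypothetical and do not come from the module.

```python
def cochran_q_i2(effects, std_errors):
    """Return (pooled_effect, Q, degrees_of_freedom, I2_percent).

    effects    -- per-study effect estimates (e.g., risk differences)
    std_errors -- per-study standard errors, same order as effects
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")

    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1

    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, df, i2


# Hypothetical example: two studies with very different effects.
pooled, q, df, i2 = cochran_q_i2([0.0, 4.0], [1.0, 1.0])
print(pooled, q, df, i2)
```

A large I² would point toward an "inconsistent" score, but the module leaves the cutoff to reviewer judgment, so no threshold is built into the sketch.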
    • Defined as whether the evidence being assessed:
      • Reflects a single, direct link between the interventions of interest and the ultimate health outcome under consideration
      • Relies on multiple links in a causal chain
    • If multiple links are involved, strength of evidence can be only as strong as the weakest link
    • Using analytic frameworks* is important
    Four Required Domains: Directness *See the “Analytic Frameworks” module
    • Intermediate or surrogate outcomes instead of health or patient-centered outcomes
      • Example: laboratory test results or radiographic findings versus patient-reported functional outcomes or death
    • Indirect comparisons rather than direct, head-to-head comparisons
      • Direct (e.g., A vs. B, A vs. C, and B vs. C):
        • Head-to-head studies in the evidence base
        • Generally assumes use of health outcomes, not surrogate/proxy outcomes
        • Better strength of evidence
      • Indirect (e.g., A vs. B, B vs. C, but not A vs. C):
        • No head-to-head studies that cover all interventions or outcomes of interest
        • Problematic situation for all types of comparisons
        • Strength-of-evidence grades not as strong as with direct evidence
    Four Required Domains: Aspects of Indirectness
    • Applicability is evaluated separately from directness for the Evidence-based Practice Center (EPC) program.
      • For decisionmakers, the applicability of evidence depends on the different interests of diverse groups.
      • A PICOS framework (patient populations, interventions, comparators, outcomes, and settings) is used for applicability assessment in the EPC program.
      • Although the EPC program separates applicability from strength-of-evidence grading, other systems that work with one decisionmaker may incorporate applicability issues into their evaluations of directness.
    Related Issue of Applicability* *See the “Assessing Applicability” module
    • Only two possible scores for directness:
      • Direct:
        • Evidence is based on a single link between the intervention and health outcomes
      • Indirect:
        • Evidence relies on:
          • Surrogate/proxy outcomes
          • More than one body of evidence
          • Both situations
    Four Required Domains: Directness Scores
    • Defined as the degree of certainty for estimate of effect with respect to a specific outcome
    • Is a complicated concept that:
      • Asks the question:
        • What can decisionmakers conclude about whether one treatment is, clinically speaking, inferior, superior, or equivalent (neither inferior nor superior) to another?
      • Includes considerations of:
        • Statistical significance for effect estimates
        • Confidence intervals for those effect estimates
    Four Required Domains: Precision
    • Are rated separately for each important outcome or comparison, including for any summary estimate of effect size
    • Only two scores are possible
      • Precise: estimate allows a clinically useful conclusion
      • Imprecise: confidence interval is so wide it could include clinically distinct (even conflicting) conclusions
    Four Required Domains: Precision Scores
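One way to make the precise/imprecise distinction concrete is to ask whether the whole confidence interval falls in a single clinical region. The sketch below assumes a symmetric minimal important difference (the `threshold` parameter), which is an assumption introduced here for illustration; the module does not prescribe any particular threshold or rule.

```python
def region(value, threshold):
    """Classify an effect value relative to a hypothetical clinical threshold."""
    if value < -threshold:
        return "lower"       # clinically meaningful decrease
    if value > threshold:
        return "higher"      # clinically meaningful increase
    return "equivalent"      # within the zone of no important difference


def precision_score(ci_lower, ci_upper, threshold):
    """Return 'precise' if both CI limits support the same clinical conclusion."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    same = region(ci_lower, threshold) == region(ci_upper, threshold)
    return "precise" if same else "imprecise"


# Intervals from the module's hypothetical severe-diarrhea example,
# with an assumed threshold of 1 patient per 100.
print(precision_score(-8, 1, threshold=1))   # imprecise
print(precision_score(-8, -2, threshold=1))  # precise
```

Because the three regions are contiguous, checking only the two interval limits is enough: if both land in the same region, so does everything between them.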
    • Four “discretionary” domains:
      • Dose-response association
      • Plausible confounders
      • Strength of association
      • Publication bias
    • Use when they are:
      • Applicable
      • Helpful in reaching conclusions about overall grades for strength of evidence
    Additional Domains
    • Pattern of a larger effect with greater exposure (dose, duration, adherence) either across or within studies
    • Rate if studies give levels of exposure
    Additional Domains: Dose-Response Association
    • Three scores are possible for dose-response:
      • Present: dose-response pattern observed
        • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
      • Not present: no dose-response pattern observed (dose-response relationship not present)
      • Not applicable or not tested
    Additional Domains: Dose-Response Scores
    • In an observational study, sometimes plausible confounding factors work in the direction opposite that of the observed effect.
      • Had such “effect-weakening” confounders not been present, the observed effect would have been even larger than the one observed.
      • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
    • Consider whether or not plausible confounding exists that would decrease the observed effect.
    Additional Domains: Plausible Confounding
    • Two scores are possible for plausible confounding:
      • Present: confounding factors that would decrease the observed effect may be present
      • Absent: confounding factors that would decrease the observed effect are not likely to be present
    Additional Domains: Plausible Confounding Scores
    • Magnitude of effect:
      • Defined as the likelihood that the observed effect is large enough that it cannot have occurred solely as a result of bias from potential confounding factors
      • Consider when effect size is particularly large
    Additional Domains: Strength of Association
    • Two scores are possible for strength of association:
      • Strong: large effect size that is unlikely to have occurred in the absence of a true effect of the intervention
        • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
      • Weak: small enough effect size that it could have occurred solely as a result of bias from confounding factors
    Additional Domains: Strength of Association Scores
    • Studies may have been published selectively.
      • Example: only a small proportion of relevant trials or other studies has been published.
    • As a result, estimated effects of an intervention that are based only on published studies may not reflect the true effect.
    • Publication bias may undermine the overall robustness of a body of evidence.
    Additional Domains: Publication Bias
    • Publication bias scores:
      • Need not be formally computed but can influence ratings of required domains
      • Should take these possible publication bias factors into account:
        • Rating for consistency
        • Calculating a summary confidence interval for an effect
    • Add comments on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are not otherwise available.
    Additional Domains: Publication Bias Scores
    • Use two or more reviewers with the appropriate clinical and methodological expertise.
    • Assess separately:
      • Each required domain (or each optional domain, as relevant)
      • Each major outcome, including benefits and harms
    • Resolve differences by consensus or mediation by an additional expert; consensus scores should appear in tables.
    • Keep records of each reviewer's individual judgments about domains as background documentation.
    Procedures for Assessing Domains
    • Reflect a global assessment that:
      • Takes the required domains directly into account
      • Incorporates judgments about the additional domains as needed
    • Aim to:
      • Provide “actionable” information for a variety of different users, readers, and stakeholders
      • Be transparent in how the strength-of-evidence grades are reached
    Strength of Evidence Grades (I)
    • For each comparison of interest, rate the strength of evidence for:
      • Each major benefit (e.g., positive effects on health outcomes such as physical function or quality of life, or effects on laboratory measures or other surrogate variables)
      • Each major harm (ranging from rare, serious, or life-threatening adverse events to common but bothersome effects)
    • For both benefits and harms:
      • Focus on the outcomes most relevant to patients, clinicians, and policymakers
    Strength of Evidence Grades (II)
    • High: High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.
    • Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
    • Low: Low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate.
    • Insufficient: Evidence either is unavailable or does not permit a conclusion.
    Strength of Evidence Grades and Definitions
    • Using the high, moderate, or low strength-of-evidence grade:
      • Implies that a body of evidence actually exists
      • Is intended to convey how confident reviewers are about decisions that may be made based on evidence graded one way or another
      • Requires the use of only one designation, not a range (e.g., not “low to moderate”)
    Strength of Evidence Grades: Additional Points (I)
    • The insufficient strength-of-evidence grade:
      • Is applied when:
        • Reviewers cannot draw conclusions about an outcome, comparison, or other question
      • Is appropriate when:
        • No evidence is available at all
        • Evidence is too insubstantial to permit conclusions to be drawn (e.g., opposing results from studies with a similar risk of bias; wide and overlapping confidence intervals)
    Strength of Evidence Grades: Additional Points (II)
    • Use different approaches to incorporate multiple domains into an overall strength-of-evidence grade
      • GRADE algorithm
      • Weighting system of the Evidence-based Practice Center
      • Some qualitative approach
    • Use (at least) two reviewers
    • Assess resulting interrater reliability for each domain score, and keep records
    Scoring and Reporting: General Guidance
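The guidance above asks reviewers to assess interrater reliability for each domain score but does not name a statistic. Cohen's kappa is one common choice for two raters; the sketch below is an illustration under that assumption, with hypothetical ratings.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both reviewers must rate the same non-empty item list")
    n = len(ratings_a)

    # Observed agreement: share of items given the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement from each reviewer's marginal rating frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2

    if expected == 1.0:
        # Both reviewers used a single identical category for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical risk-of-bias ratings for four outcomes by two reviewers.
reviewer_1 = ["low", "low", "high", "high"]
reviewer_2 = ["low", "high", "high", "high"]
print(cohen_kappa(reviewer_1, reviewer_2))
```

Kappa corrects raw agreement for agreement expected by chance, so it can be low even when the reviewers agree on most items if one rating dominates.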
    • Risk of bias (given design and conduct of available studies) is the essential component in determining the strength-of-evidence grade.
      • First, consider which study design is most appropriate to reduce bias for each question.
      • Next, consider the risk of bias from available studies.
    Guiding Principles: Risk of Bias
    • Drug comparisons in randomized controlled trials (RCTs), with either placebo or an active comparator as an appropriate design:
      • Evidence from well-conducted RCTs will have less risk of bias than evidence based on observational studies.
      • For RCTs, reviewers can start with a rating of low for risk of bias and change the assessment if the RCTs have important flaws.
      • For observational data, reviewers can start with a rating of high for risk of bias and change the assessment, depending upon how well studies were conducted.
    Guiding Principles: Risk of Bias Example
    • Be explicit about how the evidence grade will be determined.
      • A point system for combining ratings of the domains
      • A qualitative consideration of the domains
    • Carefully document procedures.
    • Keep records of procedures and results for each review so that they may contribute to the overall expertise of the Evidence-based Practice Center and the science of grading evidence.
    Further Guidance: Principles for Scoring
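To show what "a point system for combining ratings of the domains" could look like when written down explicitly, here is a deliberately simple Python sketch. The starting points, the one-level penalties, and the `upgrades` parameter are all hypothetical choices made for illustration; they are not the EPC weighting system or the GRADE algorithm, and the sketch does not capture the judgment that existing evidence may still be insufficient to permit a conclusion.

```python
GRADES = ["low", "moderate", "high"]

# Hypothetical starting level by aggregate risk of bias.
START_LEVEL = {"low": 2, "medium": 1, "high": 0}


def grade_soe(n_studies, risk_of_bias, consistency, directness, precision,
              upgrades=0):
    """Combine required-domain ratings into one grade (illustrative only).

    upgrades -- number of additional domains (e.g., dose-response, strong
                association) that reviewers judged to raise the grade.
    """
    if n_studies == 0:
        return "insufficient"  # no body of evidence exists

    level = START_LEVEL[risk_of_bias]

    # One-level penalty per unfavorable required domain. An "unknown"
    # consistency rating (single study) is not penalized in this sketch.
    level -= sum([
        consistency == "inconsistent",
        directness == "indirect",
        precision == "imprecise",
    ])
    level += upgrades

    # Clamp to the available grades so the result is always a single grade.
    level = max(0, min(len(GRADES) - 1, level))
    return GRADES[level]


print(grade_soe(6, "low", "consistent", "direct", "precise"))
print(grade_soe(4, "low", "consistent", "direct", "imprecise"))
```

Writing the rule as code is one way to meet the requirement that a formulaic approach be specified in enough detail for readers to replicate it.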
    • Explain the rationale for the approach used and identify which domains were important in upgrading or downgrading the strength of evidence.
    • Explain judgments about the degree to which any additional domains altered the overall strength-of-evidence grade.
    • Provide enough detail within the report to ensure that users can grasp the methods.
    Further Guidance: Principles for Reporting (I)
    • Use the terms high, moderate, low, or insufficient.
    • Do not use Roman numerals or other symbols.
    • Use or adapt the illustrative tabular approach to reporting (see the publications listed below for examples).
      • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf.
      • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions - Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
    Further Guidance: Principles for Reporting (II)
  • Grading Strength of Evidence: Presentation of Results - Moderate and High Grades
    CI = confidence interval; RCT = randomized controlled trial; SOE = strength of evidence
    Columns: Outcome (overall SOE) | Number of Studies (Subjects) | Risk of Bias: Design/Quality | Consistency | Directness | Precision | Absolute Risk Difference per 100 Patients
    Severe diarrhea (moderate SOE) | 4 (256) | RCT/Fair | Consistent | Direct | Imprecise | −4 (95% CI −8 to +1)
    Severe diarrhea (moderate SOE) | 14 (28,400) | Cohort/Fair | Consistent | Direct | Precise | −5 (95% CI −8 to −2)
    Improved quality of life (high SOE) | 6 (265) | RCTs/Good | Consistent | Direct | Precise | +5 (95% CI +1 to +7)
  • Grading Strength of Evidence: Presentation of Results - Insufficient and Low
    CI = confidence interval; RCT = randomized controlled trial; SOE = strength of evidence
    Columns: Outcome (overall SOE) | Number of Studies (Subjects) | Risk of Bias: Design/Quality | Consistency | Directness | Precision | Absolute Risk Difference per 100 Patients
    Mortality (insufficient SOE) | 1 (80) | RCT/Fair | Unknown | Direct | Imprecise | −1 (95% CI −4 to +3)
    Mortality (insufficient SOE) | 14 (384) | Retrospective cohort/Fair | Inconsistent | Direct | Imprecise | −7 to +5 (range)
    Myocardial infarction (low SOE) | 7 (625) | Retrospective cohort/Low | Consistent | Direct | Imprecise | −3 (95% CI −5 to −1)
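The example tables express magnitude of effect as an absolute risk difference per 100 patients with a 95% confidence interval. For readers who want to see how such a figure can arise from a single two-arm comparison, the sketch below uses a simple Wald (normal-approximation) interval. The event counts are hypothetical, and the module does not say how its illustrative numbers were derived.

```python
import math


def risk_difference_per_100(events_treated, n_treated,
                            events_control, n_control, z=1.96):
    """Return (difference, ci_lower, ci_upper), each per 100 patients.

    Negative values mean fewer events in the treated group.
    z=1.96 gives an approximate 95% Wald interval.
    """
    p_t = events_treated / n_treated
    p_c = events_control / n_control
    diff = p_t - p_c

    # Standard error of a difference between two independent proportions.
    se = math.sqrt(p_t * (1 - p_t) / n_treated + p_c * (1 - p_c) / n_control)

    return 100 * diff, 100 * (diff - z * se), 100 * (diff + z * se)


# Hypothetical trial: 10 of 100 treated vs. 20 of 100 control patients
# experience the outcome.
diff, lower, upper = risk_difference_per_100(10, 100, 20, 100)
print(f"{diff:.1f} per 100 (95% CI {lower:.1f} to {upper:.1f})")
```

An interval that excludes zero corresponds to a statistically significant difference at roughly the 5% level, which is how the notes characterize the cohort-study row for severe diarrhea.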
    • The grading system used by the Evidence-based Practice Centers (EPCs) is similar to the GRADE system.
    • The EPC grading system reflects the needs of AHRQ stakeholders for reviews on a wide variety of topics and not for recommendations or guidelines.
    • The main differences between the two grading systems:
      • The definitions of domains differ slightly; in the EPC system “directness” excludes “applicability,” which is handled separately.
      • In the EPC system, reviewers may judge observational studies to have less risk of bias than randomized controlled trials for outcomes such as harms, which can raise the initial grade to “moderate.”
      • The definition of overall grade differs; the EPC system emphasizes confidence in estimate, whereas the GRADE system emphasizes effect of future research.
      • The EPC system permits three different ways to reach an overall strength-of-evidence grade; the GRADE formula has one.
    Comparison With the GRADE System
    • Is a critical last step in analysis and presentation
    • Is done after the quality of articles is rated by at least two independent reviewers
    • Helps users of systematic reviews understand the body of evidence and how much confidence they can have in making decisions based on that evidence
    • Uses scores on four primary (mandatory) domains and four additional (discretionary) domains
    • Focuses on major outcomes and comparisons
    • Is denoted in terms of high, moderate, or low strength or insufficient evidence
    • Presents strength-of-evidence grades in tabular form
    Summary: Grading Strength of Evidence
    • Atkins D, Best D, Briss PA, et al, for the GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:1490.
    • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Agency for Healthcare Research and Quality. Methods Guide for Comparative Effectiveness Reviews [posted July 2009]. Rockville, MD. Available at: http://effectivehealthcare.ahrq.gov/healthInfo.cfm?infotype=rr&ProcessID=60.
    • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions —Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.
    References
    • This presentation was prepared by Kathleen N. Lohr, Ph.D., a Distinguished Fellow at RTI International.
    • This module is based on an update of chapter 11 in version 1.0 of the Methods Guide for Comparative Effectiveness Reviews (updated chapter available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf).
    Author