# Statistical Methods for Rater Agreement


Recep ÖZCAN
recepozcan06@gmail.com
http://recepozcan06.blogcu.com/
2009
## Index

1. Statistical Methods for Rater Agreement
   - 1.0 Basic Considerations
   - 1.1 Know the goals
   - 1.2 Consider theory
   - 1.3 Reliability vs. validity
   - 1.4 Modeling vs. description
   - 1.5 Components of disagreement
   - 1.6 Keep it simple (1.6.1 An example)
   - 1.7 Recommended Methods (1.7.1 Dichotomous data; 1.7.2 Ordered-category data; 1.7.3 Nominal data; 1.7.4 Likert-type items)
2. Raw Agreement Indices
   - 2.0 Introduction
   - 2.1 Two Raters, Dichotomous Ratings
   - 2.2 Proportion of overall agreement
   - 2.3 Positive agreement and negative agreement
   - 2.4 Significance, standard errors, interval estimation (2.4.1 Proportion of overall agreement; 2.4.2 Positive agreement and negative agreement)
   - 2.5 Two Raters, Polytomous Ratings
   - 2.6 Overall Agreement
   - 2.7 Specific agreement
   - 2.8 Generalized Case
   - 2.9 Specific agreement
   - 2.10 Overall agreement
   - 2.11 Standard errors, interval estimation, significance
3. Intraclass Correlation and Related Methods
   - 3.0 Introduction
   - 3.1 Different Types of ICC
   - 3.2 Pros and Cons (3.2.1 Pros; 3.2.2 Cons)
   - 3.3 The Comparability Issue
4. Kappa Coefficients
   - 4.0 Summary
5. Tests of Marginal Homogeneity
   - 5.0 Introduction
   - 5.1 Graphical and descriptive methods
   - 5.2 Nonparametric tests
   - 5.3 Bootstrapping
   - 5.4 Loglinear, association and quasi-symmetry modeling
   - 5.5 Latent trait and related models
6. The Tetrachoric and Polychoric Correlation Coefficients
   - 6.0 Introduction (6.0.1 Summary)
   - 6.1 Pros and Cons (6.1.1 Pros; 6.1.2 Cons)
   - 6.2 Intuitive Explanation
7. Detailed Description
   - 7.0 Introduction
   - 7.1 Measurement Model
   - 7.2 Using the Polychoric Correlation to Measure Agreement
   - 7.3 Extensions and Generalizations (7.3.1 Examples)
   - 7.4 Factor analysis and SEM (7.4.1 Programs for tetrachoric correlation; 7.4.2 Programs for polychoric and tetrachoric correlation; 7.4.3 Generalized latent correlation)
8. Latent Trait Models for Rater Agreement
   - 8.0 Introduction
   - 8.1 Measurement Model
   - 8.2 Evaluating the Assumptions
   - 8.3 What the Model Provides
9. Odds Ratio and Yule's Q
   - 9.0 Introduction
   - 9.1 Intuitive explanation
   - 9.2 Yule's Q
   - 9.3 Log-odds ratio
   - 9.4 Pros and Cons: the Odds Ratio (9.4.1 Pros; 9.4.2 Cons)
   - 9.5 Extensions and alternatives (9.5.1 Extensions; 9.5.2 Alternatives)
10. Agreement on Interval-Level Ratings
    - 10.0 Introduction
    - 10.1 General Issues
    - 10.2 Rater Association
    - 10.3 Rater Bias
    - 10.4 Rating Distribution
    - 10.5 Rater vs. Rater or Rater vs. Group
    - 10.6 Measuring Rater Agreement
    - 10.7 Measuring Rater Association
    - 10.8 Measuring Rater Bias
    - 10.9 Rater Distribution Differences
    - 10.10 Using the Results
    - 10.11 The Delphi Method
    - 10.12 Rater Bias
    - 10.13 Rater Association
    - 10.14 Distribution of Ratings
    - 10.15 Discussion of Ambiguous Cases
## 1. Statistical Methods for Rater Agreement

### 1.0 Basic Considerations

In many fields it is common to study agreement among ratings of multiple judges, experts, diagnostic tests, etc. We are concerned here with categorical ratings: dichotomous (Yes/No, Present/Absent, etc.), ordered-categorical (Low, Medium, High, etc.), and nominal (Schizophrenic, Bi-Polar, Major Depression, etc.) ratings. Likert-type ratings, intermediate between ordered-categorical and interval-level ratings, are also considered.

There is little consensus about which statistical methods are best for analyzing rater agreement. (We use the generic words "raters" and "ratings" here to include observers, judges, diagnostic tests, etc., and their ratings or results.) To the non-statistician, the number of alternatives and the lack of consistency in the literature are no doubt cause for concern. This site aims to reduce confusion and help researchers select appropriate methods for their applications.

Despite the many apparent options for analyzing agreement data, the basic issues are very simple. Usually there are one or two methods best suited to a particular application. But it is necessary to clearly identify the purpose of the analysis and the substantive questions to be answered.

### 1.1 Know the goals

The most common mistake made when analyzing agreement data is not having an explicit goal. It is not enough for the goal to be "measuring agreement" or "finding out if raters agree." There is presumably some reason why one wants to measure agreement, and which statistical method is best depends on that reason.

For example, rating agreement studies are often used to evaluate a new rating system or instrument. If such a study is conducted during the development phase of the instrument, one may wish to analyze the data using methods that identify how the instrument could be changed to improve agreement.
However, if an instrument is already in its final form, the same methods might not be helpful.

Very often agreement studies are an indirect attempt to validate a new rating system or instrument. That is, lacking a definitive criterion variable or "gold standard," the accuracy of a scale or instrument is assessed by comparing its results when used by different raters. Here one may wish to use methods that address the issue of real concern: how well do ratings reflect the true trait one wants to measure?

In other situations one may be considering combining the ratings of two or more raters to obtain evaluations of suitable accuracy. If so, again, specific methods suitable for this purpose should be used.

### 1.2 Consider theory

A second common problem in analyzing agreement is the failure to think about the data from the standpoint of theory. Nearly all statistical methods for analyzing agreement make
assumptions. If one has not thought about the data from a theoretical point of view, it will be hard to select an appropriate method.

The theoretical questions one asks do not need to be complicated. Even simple questions help: is the trait being measured really discrete, like presence/absence of a pathogen, or is it really continuous and merely divided into discrete levels (e.g., "low," "medium," "high") for convenience? If the latter, is it reasonable to assume that the trait is normally distributed, or is some other distribution plausible? Sometimes one will not know the answers to these questions. That is fine, too, because there are methods suitable for that case also. The main point is to be inclined to think about data in this way, and to be attuned to the issue of matching method and data on this basis.

These two issues, knowing one's goals and considering theory, are the main keys to successful analysis of agreement data. Following are some other, more specific issues that pertain to the selection of methods appropriate to a given study.

### 1.3 Reliability vs. validity

One can broadly distinguish two reasons for studying rating agreement. Sometimes the goal is to estimate the validity (accuracy) of ratings in the absence of a "gold standard." This is a reasonable use of agreement data: if two ratings disagree, then at least one of them must be incorrect. Proper analysis of agreement data therefore permits certain inferences about how likely a given rating is to be correct. Other times one merely wants to know the consistency of ratings made by different raters. In some cases the issue of accuracy may even have no meaning; for example, ratings may concern opinions, attitudes, or values.

### 1.4 Modeling vs. description

One should also distinguish between modeling and describing agreement.
Ultimately, there are only a few simple ways to describe the amount of agreement: for example, the proportion of times two ratings of the same case agree, the proportion of times raters agree on specific categories, the proportions of times different raters use the various rating levels, etc. Quantifying agreement in any other way inevitably involves a model of how ratings are made and why raters agree or disagree. This model is either explicit, as with latent structure models, or implicit, as with the kappa coefficient. With this in mind, two basic principles are evident:

- It is better to have a model that is explicitly understood than one that is only implicit and potentially not understood.
- The model should be testable.

Methods vary with respect to how well they meet these two criteria.

### 1.5 Components of disagreement
Consider that disagreement has different components. With ordered-category (including dichotomous) ratings, one can distinguish between two different sources of disagreement. Raters may differ: (a) in the definition of the trait itself; or (b) in their definitions of specific rating levels or categories.

A trait definition can be thought of as a weighted composite of several variables. Different raters may define or understand the trait as different weighted combinations. For example, to one rater Intelligence may mean 50% verbal skill and 50% mathematical skill; to another it may mean 33% verbal skill, 33% mathematical skill, and 33% motor skill. Thus their essential definitions of what the trait means differ. Similarity in raters' trait definitions can be assessed with various estimates of the correlation of their ratings, or analogous measures of association.

Category definitions, on the other hand, differ because raters divide the trait into different intervals. For example, by "low skill" one rater may mean subjects from the 1st to the 20th percentile. Another rater, though, may take it to mean subjects from the 1st to the 10th percentile. When this occurs, rater thresholds can usually be adjusted to improve agreement. Similarity of category definitions is reflected as marginal homogeneity between raters. Marginal homogeneity means that the frequencies (or, equivalently, the "base rates") with which two raters use the various rating categories are the same.

Because disagreement on trait definition and disagreement on rating-category widths are distinct components of disagreement, with different practical implications, a statistical approach to the data should ideally quantify each separately.

### 1.6 Keep it simple

All other things being equal, a simpler statistical method is preferable to a more complicated one. Very basic methods can reveal far more about agreement data than is commonly realized.
For the most part, advanced methods are complements to, not substitutes for, simple methods.

#### 1.6.1 An example

To illustrate these principles, consider the example of rater agreement on screening mammograms, a diagnostic imaging method for detecting possible breast cancer. Radiologists often score mammograms on a scale such as "no cancer," "benign cancer," "possible malignancy," or "malignancy." Many studies have examined rater agreement on applying these categories to the same set of images.

In choosing a suitable statistical approach, one would first consider theoretical aspects of the data. The trait being measured, degree of evidence for cancer, is continuous. So the actual rating levels would be viewed as somewhat arbitrary discretizations of the underlying trait. A reasonable view is that, in the mind of a rater, the overall weight of evidence for cancer is an aggregate composed of various physical image features and weights attached to each feature. Raters may vary in terms of which features they notice and the weights they associate with each.

One would also consider the purpose of analyzing the data. In this application, the purpose of studying rater agreement is not usually to estimate the accuracy of ratings by a single rater. That
can be done directly in a validity study, which compares ratings to a definitive diagnosis made from a biopsy. Instead, the aim is more to understand the factors that cause raters to disagree, with the ultimate goal of improving their consistency and accuracy. For this, one should separately assess whether raters have the same definition of the basic trait (that is, whether different raters weight the various image features similarly) and whether they use similar widths for the various rating levels. The former can be accomplished with, for example, latent trait models. Moreover, latent trait models are consistent with the theoretical assumptions about the data noted above. Raters' rating-category widths can be studied by visually representing raters' rates of use for the different rating levels and/or their thresholds for the various levels, and statistically comparing them with tests of marginal homogeneity. Another possibility would be to examine whether some raters are biased such that they make generally higher or lower ratings than other raters. One might also note which images are the subject of the most disagreement and then try to identify the specific image features that cause the disagreement.

Such steps can help one identify specific ways to improve ratings. For example, raters who seem to define the trait much differently than other raters, or who use a particular category too often, can have this pointed out to them, and this feedback may promote their making ratings in a way more consistent with the other raters.

### 1.7 Recommended Methods

This section suggests statistical methods suitable for various levels of measurement, based on the principles outlined above. These are general guidelines only; it follows from the discussion that no one method is best for all applications. But these suggestions will at least give the reader an idea of where to start.
#### 1.7.1 Dichotomous data

Two raters:

- Assess raw agreement, overall and specific to each category.
- Use Cohen's kappa: (a) from its p-value, establish that agreement exceeds that expected under the null hypothesis of random ratings; (b) interpret the magnitude of kappa as an intraclass correlation. If different raters are used for different subjects, use the Scott/Fleiss kappa instead of Cohen's kappa.
- Alternatively, calculate the intraclass correlation directly instead of a kappa statistic.
- Use McNemar's test to evaluate marginal homogeneity.
- Use the tetrachoric correlation coefficient if its assumptions are sufficiently plausible.
- Possibly test association between raters with the log-odds ratio.

Multiple raters:

- Assess raw agreement, overall and specific to each category.
- Calculate the appropriate intraclass correlation for the data. If different raters are used for each subject, an alternative is the Fleiss kappa.
- If the trait being rated is assumed to be latently discrete, consider use of latent class models.
- If the trait being rated can be interpreted as latently continuous, latent trait models can be used to assess association among raters and to estimate the correlation of ratings with the true trait; these models can also be used to assess marginal homogeneity.
- In some cases latent class and latent trait models can be used to estimate the accuracy (e.g., sensitivity and specificity) of diagnostic ratings even when a "gold standard" is lacking.

#### 1.7.2 Ordered-category data

Two raters:

- Use weighted kappa with Fleiss-Cohen (quadratic) weights; note that quadratic weights are not the default in SAS, and you must specify (WT=FC) with the AGREE option in PROC FREQ.
- Alternatively, estimate the intraclass correlation.
- Ordered rating levels often imply a latently continuous trait; if so, measure association between the raters with the polychoric correlation or one of its generalizations.
- Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.
- Test (a) for differences in rater thresholds associated with each rating category and (b) for a difference in the raters' overall bias, using the respectively applicable McNemar tests.
- Optionally, use graphical displays to visually compare the proportion of times raters use each category (base rates).
- Consider association models and related methods for ordered-category data (see Agresti A., Categorical Data Analysis, New York: Wiley, 2002).

Multiple raters:

- Estimate the intraclass correlation.
- Test for differences in rater bias using ANOVA or the Friedman test.
- Use latent trait analysis as a multi-rater generalization of the polychoric correlation. Latent trait models can also be used to test for differences among raters in individual rating-category thresholds.
- Graphically examine and compare rater base rates and/or thresholds for the various rating categories.
- Alternatively, consider each pair of raters and proceed as described for two raters.

#### 1.7.3 Nominal data

Two raters:

- Assess raw agreement, overall and specific to each category.
- Use the p-value of Cohen's unweighted kappa to verify that raters agree more than chance alone would predict. Often (perhaps usually), disregard the actual magnitude of kappa here; it is problematic with nominal data because ordinarily one can neither assume that all types of
disagreement are equally serious (unweighted kappa) nor choose an objective set of differential disagreement weights (weighted kappa). If, however, it is genuinely true that all pairs of rating categories are equally "disparate," then the magnitude of Cohen's unweighted kappa can be interpreted as a form of intraclass correlation.
- Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.
- Test marginal homogeneity relative to individual categories using McNemar tests.
- Consider use of latent class models.
- Another possibility is use of loglinear, association, or quasi-symmetry models.

Multiple raters:

- Assess raw agreement, overall and specific to each category.
- If different raters are used for different subjects, use the Fleiss kappa statistic; again, as with two raters, attend only to the p-value of the test unless one has a genuine basis for regarding all pairs of rating categories as equally "disparate."
- Use latent class modeling. Conditional tests of marginal homogeneity can be made within the context of latent class modeling.
- Use graphical displays to visually compare the proportion of times raters use each category (base rates).
- Alternatively, consider each pair of raters individually and proceed as described for two raters.

#### 1.7.4 Likert-type items

Very often, Likert-type items can be assumed to produce interval-level data. (By a "Likert-type item" here we mean one where the format clearly implies to the rater that the rating levels are evenly spaced, such as:)

    lowest                                          highest
       |-------|-------|-------|-------|-------|-------|
       1       2       3       4       5       6       7
               (circle the level that applies)

Two raters:

- Assess association among raters using the regular Pearson correlation coefficient.
- Test for differences in rater bias using the t-test for dependent samples.
- Possibly estimate the intraclass correlation.
- Assess marginal homogeneity as with ordered-category data.
- See also the methods listed in the section "Methods for Likert-type or interval-level data."

Multiple raters:

- Perform a one-factor common factor analysis; examine/report the correlation of each rater with the common factor (for details, see the section "Methods for Likert-type or interval-level data").
- Test for differences in rater bias using two-way ANOVA models.
- Possibly estimate the intraclass correlation.
- Use histograms to describe raters' marginal distributions.
- If greater detail is required, consider each pair of raters and proceed as described for two raters.
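To make the two-rater dichotomous recommendations above concrete, the kappa and McNemar computations can be sketched in a few lines. This is a hedged illustration, not a packaged implementation: the 2×2 counts are hypothetical, and the formulas are the standard ones for a table [[a, b], [c, d]].

```python
# Minimal sketch of the 1.7.1 two-rater workflow: Cohen's kappa plus
# McNemar's test statistic for marginal homogeneity. Counts are hypothetical.

def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table [[a, b], [c, d]]."""
    n = a + b + c + d
    po = (a + d) / n                                     # observed agreement
    pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
    return (po - pc) / (1 - pc)

def mcnemar_chi2(b, c):
    """McNemar chi-square statistic (1 df) from the discordant cells b and c."""
    return (b - c) ** 2 / (b + c)

kappa = cohens_kappa(40, 10, 5, 45)   # 0.70 for these counts
chi2 = mcnemar_chi2(10, 5)            # compare to the chi-square(1) critical value 3.84
```

A large chi2 here would indicate that the two raters use the positive category at different rates, i.e., that marginal homogeneity fails.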
## 2. Raw Agreement Indices
### 2.0 Introduction

Much neglected, raw agreement indices are important descriptive statistics. They have unique common-sense value. A study that reports only simple agreement rates can be very useful; a study that omits them but reports complex statistics may fail to inform readers at a practical level. Raw agreement measures and their calculation are explained below. We examine first the case of agreement between two raters on dichotomous ratings.

### 2.1 Two Raters, Dichotomous Ratings

Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized by Table 1:

Table 1. Summary of dichotomous ratings by two raters

| Rater 1 | Rater 2: + | Rater 2: - | Total |
|---------|------------|------------|-------|
| +       | a          | b          | a+b   |
| -       | c          | d          | c+d   |
| Total   | a+c        | b+d        | N     |

The values a, b, c, and d here denote the observed frequencies for each possible combination of ratings by Rater 1 and Rater 2.

### 2.2 Proportion of overall agreement

The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2 agree. That is:

    po = (a + d) / (a + b + c + d) = (a + d) / N    (1)

This proportion is informative and useful but, taken by itself, has limitations. One is that it does not distinguish between agreement on positive ratings and agreement on negative ratings. Consider, for example, an epidemiological application where a positive rating corresponds to a positive diagnosis for a very rare disease, say one with a prevalence of 1 in 1,000,000. Here we might not be much impressed if po is very high, even above .99. This result would be due almost entirely to agreement on disease absence; we are not directly informed as to whether diagnosticians agree on disease presence.
Further, one may consider Cohen's (1960) criticism of po: that it can be high even with hypothetical raters who randomly guess on each case according to probabilities equal to the observed base rates. In this example, if both raters simply guessed "negative" the large majority of times, they would usually agree on the diagnosis. Cohen proposed to remedy this by comparing po to a corresponding quantity, pc, the proportion of agreement expected from raters who randomly guess. As described on the kappa coefficients page, this logic is questionable; in particular, it is not clear what advantage there is in comparing an actual level of agreement, po, with a hypothetical value, pc, which would occur under an obviously unrealistic model. A much simpler way to address the issue is described immediately below.

### 2.3 Positive agreement and negative agreement

We may also compute observed agreement relative to each rating category individually. Generically, the resulting indices are called the proportions of specific agreement (Spitzer & Fleiss, 1974). With binary ratings, there are two such indices, positive agreement (PA) and negative agreement (NA). They are calculated as follows:

    PA = 2a / (2a + b + c);    NA = 2d / (2d + b + c).    (2)

PA, for example, estimates the conditional probability that, given that one of the raters, randomly selected, makes a positive rating, the other rater will also do so. A joint consideration of PA and NA addresses the potential concern that, when base rates are extreme, po is liable to chance-related inflation or bias. Such inflation, if it exists at all, would affect only the more frequent category. Thus if both PA and NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-predicted agreement using a kappa statistic.
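The sketch below illustrates Eqs. (1) and (2) together on a hypothetical rare-trait table, showing how po can be very high while PA remains modest. The counts are invented for illustration only.

```python
# Overall and specific agreement for a 2x2 table [[a, b], [c, d]].
# Hypothetical counts mimic the rare-disease case: almost all
# agreement is agreement on absence of the trait.
def agreement_indices(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n               # Eq. (1): overall agreement
    pa = 2 * a / (2 * a + b + c)   # Eq. (2): positive agreement
    na = 2 * d / (2 * d + b + c)   # Eq. (2): negative agreement
    return po, pa, na

po, pa, na = agreement_indices(2, 3, 3, 992)
# po = 0.994 and na ~ 0.997, yet pa is only 0.40:
# the raters rarely agree on presence of the trait.
```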
But in any case, PA and NA provide more information relevant to understanding and improving ratings than a single omnibus index (see Cicchetti and Feinstein, 1990).

### 2.4 Significance, standard errors, interval estimation

#### 2.4.1 Proportion of overall agreement

Statistical significance. In testing the significance of po, the null hypothesis is that raters are independent, with their marginal assignment probabilities equal to the observed marginal proportions. For a 2×2 table, the test is the same as a usual test of statistical independence in a contingency table. Any of the following could potentially be used:

- a test of a nonzero kappa coefficient
- a test of a nonzero log-odds ratio
- a Pearson chi-squared (X²) or likelihood-ratio chi-squared (G²) test of independence
- the Fisher exact test
- a test of fit of a loglinear model with main effects only
A potential advantage of a kappa significance test is that the magnitude of kappa can be interpreted as approximately an intraclass correlation coefficient. All of these tests, except the last, can be done with SAS PROC FREQ.

Standard error. One can use standard methods applicable to proportions to estimate the standard error and confidence limits of po. For a sample size N, the standard error of po is:

    SE(po) = sqrt[po(1 - po)/N]    (3.1)

One can alternatively estimate SE(po) using resampling methods, e.g., the nonparametric bootstrap or the jackknife, as described in the next section.

Confidence intervals. The Wald or "normal approximation" method estimates the confidence limits of a proportion as follows:

    CL = po - SE × zcrit    (3.2)
    CU = po + SE × zcrit    (3.3)

where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and upper confidence limits, and zcrit is the z-value associated with a confidence range with coverage probability crit. For a 95% confidence range, zcrit = 1.96; for a 90% confidence range, zcrit = 1.645. When po is either very large or very small (and especially with small sample sizes), the Wald method may produce confidence limits less than 0 or greater than 1; in that case better approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below) can be used instead.

#### 2.4.2 Positive agreement and negative agreement

Statistical significance. Logically, there is only one test of independence in a 2×2 table; therefore if PA differs significantly from chance, so too does NA, and vice versa. Spitzer and Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 table there are two such "specific kappas," but both have the same value and statistical significance as the overall kappa.

Standard errors. As shown by Mackinnon (2000, p.
130), the asymptotic (large-sample) standard errors of PA and NA are estimated by the following formulas:

    SE(PA) = sqrt[4a(c + b)(a + c + b)] / (2a + b + c)^2    (3.4)
    SE(NA) = sqrt[4d(c + b)(d + c + b)] / (2d + b + c)^2    (3.5)

Alternatively, one can estimate the standard errors using the nonparametric bootstrap or the jackknife. These are described here with reference to PA:

- With the nonparametric bootstrap (Efron & Tibshirani, 1993), one constructs a large number of simulated data sets of size N by sampling with replacement from the observed data. For a 2×2 table, this can be done simply by using random numbers to assign simulated cases to cells with probabilities a/N, b/N, c/N, and d/N (with large N, however, more efficient algorithms are preferable). One then computes the proportion of positive agreement for each simulated data set, which we denote PA*.
The standard deviation of PA* across all simulated data sets estimates the standard error SE(PA).

- The delete-1 jackknife (Efron, 1982) works by calculating PA for four alternative tables, each formed by subtracting 1 from one of the four cells of the original 2×2 table. A few simple calculations then provide an estimate of the standard error SE(PA). The delete-1 jackknife requires less computation, but the nonparametric bootstrap is usually considered more accurate.

Confidence intervals.

- Asymptotic confidence limits for PA and NA can be obtained as in Eqs. 3.2 and 3.3, substituting PA and NA for po and using the asymptotic standard errors given by Eqs. 3.4 and 3.5.
- Alternatively, the bootstrap can be used; again, we describe the method for PA. As with bootstrap standard error estimation, one generates a large number (e.g., 100,000) of simulated data sets, computing an estimate PA* for each one. Results are then sorted by increasing value of PA*, and confidence limits of PA are obtained with reference to the percentiles of this ranking. For example, the 95% confidence range of PA is estimated by the values of PA* that correspond to the 2.5th and 97.5th percentiles of this distribution.

An advantage of bootstrapping is that one can use the same simulated data sets to estimate not only the standard errors and confidence limits of PA and NA, but also those of po or any other statistic defined for the 2×2 table.

A SAS program to estimate the asymptotic standard errors and asymptotic confidence limits of PA and NA has been written; for a free standalone program that supplies both bootstrap and asymptotic standard errors and confidence limits, please email the author. Readers are referred to Graham and Bull (1998) for fuller coverage of this topic, including a comparison of different methods for estimating confidence intervals for PA and NA.
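The standard-error and confidence-limit computations of Section 2.4 can be sketched as below. This is a minimal illustration under stated assumptions: the counts are hypothetical, Eq. (3.4) is reproduced as printed above, and only 2,000 bootstrap replicates are drawn for speed where the text suggests far more.

```python
# Hedged sketch: Wald limits for po (Eqs. 3.1-3.3), the asymptotic SE of PA
# as printed in Eq. (3.4), and a nonparametric bootstrap SE and percentile CI
# for PA, all for a 2x2 table [[a, b], [c, d]] with hypothetical counts.
import math
import random

def wald_ci_po(a, b, c, d, z=1.96):
    """SE and Wald confidence limits for overall agreement po."""
    n = a + b + c + d
    po = (a + d) / n
    se = math.sqrt(po * (1 - po) / n)      # Eq. (3.1)
    return se, (po - z * se, po + z * se)  # Eqs. (3.2)-(3.3)

def asymptotic_se_pa(a, b, c):
    """Asymptotic SE of positive agreement, as printed in Eq. (3.4)."""
    return math.sqrt(4 * a * (c + b) * (a + c + b)) / (2 * a + b + c) ** 2

def bootstrap_pa(a, b, c, d, reps=2000, seed=1):
    """Nonparametric bootstrap SE and 95% percentile CI for PA."""
    rng = random.Random(seed)
    n = a + b + c + d
    cum = [a, a + b, a + b + c, n]          # cumulative cell counts
    stats = []
    for _ in range(reps):
        cells = [0, 0, 0, 0]
        for _ in range(n):                  # resample n cases with replacement
            r = rng.random() * n
            cells[next(i for i, edge in enumerate(cum) if r < edge)] += 1
        denom = 2 * cells[0] + cells[1] + cells[2]
        if denom > 0:                       # PA undefined if no positives drawn
            stats.append(2 * cells[0] / denom)
    stats.sort()
    mean = sum(stats) / len(stats)
    se = math.sqrt(sum((s - mean) ** 2 for s in stats) / (len(stats) - 1))
    ci = (stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats))])
    return se, ci

se_po, ci_po = wald_ci_po(40, 10, 5, 45)
se_pa_boot, ci_pa = bootstrap_pa(40, 10, 5, 45)
```

As the text notes, the same simulated tables could also be reused to bootstrap po or NA by computing those statistics on each replicate.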
2.5 Two Raters, Polytomous Ratings

We now consider results for two raters making polytomous (either ordered-category or purely nominal) ratings. Let C denote the number of rating categories or levels. Results for the two raters may be summarized as a C×C table such as Table 2.

Table 2. Summary of polytomous ratings by two raters

                     Rater 2
Rater 1       1      2     ...     C     total
   1         n11    n12    ...    n1C     n1.
   2         n21    n22    ...    n2C     n2.
   .          .      .     ...     .       .
   C         nC1    nC2    ...    nCC     nC.
 total       n.1    n.2    ...    n.C      N

Here nij denotes the number of cases assigned rating category i by Rater 1 and category j by Rater 2, with i, j = 1, ..., C. When a "." appears in a subscript, it denotes a marginal sum over the corresponding index; e.g., ni. is the sum of nij for j = 1, ..., C, or the row marginal sum for category i; n.. = N denotes the total number of cases.

2.6 Overall Agreement

For this design, po is the sum of frequencies on the main diagonal of table {nij} divided by the sample size, or

po = (1/N) SUM(i=1 to C) nii     (4)

Statistical significance

One may test the statistical significance of po with Cohen's kappa. If kappa is significant/nonsignificant, then po may be assumed significant/nonsignificant, and vice versa. Note that the numerator of kappa is the difference between po and the level of agreement expected under the null hypothesis of statistical independence.

The parametric bootstrap can also be used to test statistical significance. This is like the nonparametric bootstrap already described, except that samples are generated from the null hypothesis distribution. Specifically, one constructs many -- say 5000 -- simulated samples of size N from the probability distribution {πij}, where

πij = (ni. n.j) / N²     (5)

and then tabulates the overall agreement, denoted p*o, for each simulated sample. The po for the actual data is considered statistically significant if it exceeds a specified percentage (e.g., 95%) of the p*o values. If one already has a computer program for nonparametric bootstrapping, only slight modifications are needed to adapt it to perform a parametric bootstrap significance test.

Standard error and confidence limits

Here the standard error and confidence intervals of po can again be calculated with the methods described for 2×2 tables.
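Eq. (4) and the parametric-bootstrap significance test can be sketched as follows (our function names; plain Python, with null cells sampled under the independence distribution described above):

```python
import random

def overall_agreement(table):
    """po for a CxC frequency table (list of rows), per Eq. (4)."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n

def parametric_bootstrap_p(table, n_sim=1000, seed=1):
    """One-tailed p-value for po under the independence null,
    with cell probabilities pi_ij = (n_i. * n_.j) / N^2 (Eq. 5)."""
    rng = random.Random(seed)
    c = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(table[i]) for i in range(c)]
    cols = [sum(table[i][j] for i in range(c)) for j in range(c)]
    probs = [rows[i] * cols[j] / n ** 2 for i in range(c) for j in range(c)]
    po_obs = overall_agreement(table)
    exceed = 0
    for _ in range(n_sim):
        sim = [[0] * c for _ in range(c)]
        for _ in range(n):                      # assign each case to a cell
            u, acc = rng.random(), 0.0
            for idx, p in enumerate(probs):
                acc += p
                if u < acc:
                    sim[idx // c][idx % c] += 1
                    break
            else:
                sim[c - 1][c - 1] += 1          # guard against float rounding
        if overall_agreement(sim) >= po_obs:
            exceed += 1
    return exceed / n_sim
```

A small returned value (e.g., below .05) indicates po is significantly above the chance level.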
2.7 Specific agreement

With respect to Table 2, the proportion of agreement specific to category i is:

ps(i) = 2nii / (ni. + n.i)     (6)

Statistical significance

Eq. (6) amounts to collapsing the C×C table into a 2×2 table relative to category i, considering this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq. (2). This is done for each category i successively. In each reduced table one may perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher exact test.

Standard errors and confidence limits

Again, for each category i, we may collapse the original C×C table into a 2×2 table, taking i as the 'positive' rating level. The asymptotic standard error formula Eq. (3.4) for PA may then be used, and the Wald method confidence limits given by Eqs. (3.1) and (3.2) may be computed.

Alternatively, one can use the nonparametric bootstrap to estimate standard errors and/or confidence limits. Note that this does not require successive collapsings of the original table.

The delete-1 jackknife can be used to estimate standard errors, but this does require successive collapsings of the C×C table.

2.8 Generalized Case

We now consider generalized formulas for the proportions of overall and specific agreement. They apply to binary, ordered-category, or nominal ratings and permit any number of raters, with potentially different numbers of raters or different raters for each case.

2.9 Specific agreement

Let there be K rated cases indexed by k = 1, ..., K. The ratings made on case k are summarized as:

{njk} (j = 1, ..., C) = {n1k, n2k, ..., nCk}

where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and {njk} = {3, 2}.
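Returning to the two-rater table of Section 2.7, Eq. (6) and the collapse to a 2×2 table can be sketched as follows (our helper names, plain Python):

```python
def specific_agreement(table, i):
    """Proportion of agreement specific to category i, Eq. (6):
    ps(i) = 2*n_ii / (n_i. + n_.i)."""
    nii = table[i][i]
    ni_dot = sum(table[i])                   # row marginal for category i
    n_dot_i = sum(row[i] for row in table)   # column marginal for category i
    return 2 * nii / (ni_dot + n_dot_i)

def collapse(table, i):
    """Collapse a CxC table to 2x2 cells (a, b, c, d), treating
    category i as 'positive' and all other categories as 'negative'."""
    c = len(table)
    a = table[i][i]
    b = sum(table[i][j] for j in range(c) if j != i)
    cc = sum(table[j][i] for j in range(c) if j != i)
    d = sum(table[j][k] for j in range(c) for k in range(c)
            if j != i and k != i)
    return a, b, cc, d
```

Applying Eq. (2) to the collapsed table reproduces ps(i), which is the equivalence the text describes.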
Let nk denote the total number of ratings made on case k; that is,

nk = SUM(j=1 to C) njk     (7)
For case k, the number of actual agreements on rating level j is

njk (njk - 1).     (8)

The total number of agreements specifically on rating level j, across all cases, is

S(j) = SUM(k=1 to K) njk (njk - 1).     (9)

The number of possible agreements specifically on category j for case k is equal to

njk (nk - 1)     (10)

and the number of possible agreements on category j across all cases is:

Sposs(j) = SUM(k=1 to K) njk (nk - 1).     (11)

The proportion of agreement specific to category j is equal to the total number of agreements on category j divided by the total number of opportunities for agreement on category j, or

ps(j) = S(j) / Sposs(j).     (12)

2.10 Overall agreement

The total number of actual agreements, regardless of category, is equal to the sum of Eq. (9) across all categories, or

O = SUM(j=1 to C) S(j).     (13)

The total number of possible agreements is

Oposs = SUM(k=1 to K) nk (nk - 1).     (14)

Dividing Eq. (13) by Eq. (14) gives the overall proportion of observed agreement, or

po = O / Oposs.     (15)
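Eqs. (8) through (15) translate directly into code. In the sketch below (ours), each case is represented as a dict mapping a rating category to its count njk:

```python
def generalized_agreement(cases):
    """cases: list of dicts, each mapping category j -> count n_jk for
    one case. Returns overall po (Eq. 15) and per-category ps(j) (Eq. 12)."""
    categories = set()
    for counts in cases:
        categories.update(counts)
    s = {j: 0 for j in categories}        # actual agreements on j, Eq. (9)
    s_poss = {j: 0 for j in categories}   # possible agreements on j, Eq. (11)
    o_poss = 0                            # total possible agreements, Eq. (14)
    for counts in cases:
        nk = sum(counts.values())
        o_poss += nk * (nk - 1)
        for j, njk in counts.items():
            s[j] += njk * (njk - 1)       # Eq. (8)
            s_poss[j] += njk * (nk - 1)   # Eq. (10)
    ps = {j: s[j] / s_poss[j] for j in categories if s_poss[j] > 0}
    po = sum(s.values()) / o_poss         # Eqs. (13) and (15)
    return po, ps
```

For the example in the text, a single case with ratings 1, 1, 1, 2, 2 (i.e., {njk} = {3, 2}) gives ps(1) = 6/12 = 0.5, ps(2) = 2/8 = 0.25, and po = 8/20 = 0.4.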
2.11 Standard errors, interval estimation, significance

The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one assumes cases are independent and identically distributed (iid). In general, this assumption will be accepted when:

• the same raters rate each case, and either there are no missing ratings or ratings are missing completely at random;
• the raters for each case are randomly sampled and the number of ratings per case is constant or random;
• in a replicate rating (reproducibility) study, each case is rated by the procedure the same number of times, or else the number of replications for any case is completely random.

In these cases, one may construct each simulated sample by repeated random sampling with replacement from the set of K cases. If cases cannot be assumed iid (for example, if ratings are not missing at random, or, say, a study systematically rotates raters), simple modifications of the bootstrap method--such as two-stage sampling--can be made.

The parametric bootstrap can be used for significance testing. A variation of this method, patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:

    Loop through s, where s indexes simulated data sets
        Loop through all cases k
            Loop through all ratings on case k
                For each actual rating, generate a random simulated rating,
                chosen such that:
                    Pr(Rating category = j | Rater = i)
                        = base rate of category j for Rater i.
                (If rater identities are unknown, or for a reproducibility
                study, the total base rate for category j is used.)
            End loop through case k's ratings
        End loop through cases
        Calculate p*o and p*s(j) (and any other statistics of interest)
        for sample s.
    End main loop

The significance of po, ps(j), or any other statistic calculated, is determined with reference to the distribution of corresponding values in the simulated data sets. For example, po is significant at the .05 level (1-tailed) if it exceeds 95% of the p*o values obtained for the simulated data sets.

References
Agresti A. An introduction to categorical data analysis. New York: Wiley, 1996.

Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 1990, 43, 551-558.

Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.

Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families of agreement measures. Canadian Journal of Statistics, 1995, 23, 333-344.

Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.

Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.

Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76, 378-381.

Fleiss JL. Statistical methods for rates and proportions, 2nd ed. New York: John Wiley, 1981.

Graham P, Bull B. Approximate standard errors and confidence intervals for indices of positive and negative agreement. Journal of Clinical Epidemiology, 1998, 51(9), 763-771.

Mackinnon A. A spreadsheet for the calculation of comprehensive statistics for the assessment of diagnostic tests and inter-rater agreement. Computers in Biology and Medicine, 2000, 30, 127-134.

Spitzer R, Fleiss J. A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 1974, 341-347.

Uebersax JS. A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 1982-1983, 17(4), 335-342.
3. Intraclass Correlation and Related Method

3.0 Introduction

The Intraclass Correlation (ICC) assesses rating reliability by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects. The theoretical formula for the ICC is:

ICC = s²(b) / [s²(b) + s²(w)]     [1]

where s²(w) is the pooled variance within subjects, and s²(b) is the variance of the trait between subjects.
It is easily shown that s²(b) + s²(w) = the total variance of ratings--i.e., the variance of all ratings, regardless of whether they are for the same subject or not. Hence the interpretation of the ICC as the proportion of total variance accounted for by between-subject variation.

Equation [1] would apply if we knew the true values s²(w) and s²(b). But we rarely do, and must instead estimate them from sample data; this adds correction terms to Equation [1]. For example, s²(b) is the variance of true trait levels between subjects. Since we do not know a subject's true trait level, we estimate it from the subject's mean rating across the raters who rate the subject. Each mean rating is subject to sampling variation--deviation from the subject's true trait level, or its surrogate, the mean rating that would be obtained from a very large number of raters. Since the actual mean ratings are often based on only two or a few ratings, these deviations are appreciable and inflate the estimate of between-subject variance. We can estimate the amount of this extra, error variation and correct for it.

If all subjects have k ratings, then for the Case 1 ICC (see definition below) the extra variation is estimated as (1/k) s²(w), where s²(w) is the pooled estimate of within-subject variance. When all subjects have k ratings, s²(w) equals the average variance of the k ratings of each subject (each calculated using k-1 as denominator). To get the ICC we then:

1. Estimate s²(b) as [s²(m) - s²(w)/k], where s²(m) is the variance of subjects' mean ratings,
2. Estimate s²(w) by the pooled within-subject variance, and
3. Apply Equation [1].

For the various other types of ICCs, different corrections are used, each producing its own equation.
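Assuming an equal number of ratings k for every subject, the three steps can be sketched numerically as follows (our function name; this corresponds to the Case 1 single-rating ICC):

```python
def icc_case1(ratings):
    """Case 1 ICC for `ratings`: a list of subjects, each a list of k
    ratings. Applies the correction described above: subtract s2(w)/k
    from the variance of subject means, then use Equation [1]."""
    k = len(ratings[0])
    n = len(ratings)
    means = [sum(r) / k for r in ratings]
    grand = sum(means) / n
    # variance of subjects' mean ratings (denominator n - 1)
    var_means = sum((m - grand) ** 2 for m in means) / (n - 1)
    # pooled within-subject variance: average of per-subject variances
    within = [sum((x - m) ** 2 for x in r) / (k - 1)
              for r, m in zip(ratings, means)]
    s2w = sum(within) / n
    s2b = var_means - s2w / k      # corrected between-subject variance
    return s2b / (s2b + s2w)       # Equation [1]
```

With perfect agreement the within-subject variance is zero and the ICC is 1; constant disagreement between two raters lowers it, as the bias discussion below would predict.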
Unfortunately, these formulas are usually expressed in their computational form--with terms arranged in a way that facilitates calculation--rather than their derivational form, which would make clear the nature and rationale of the correction terms.

3.1 Different Types of ICC

In their important paper, Shrout and Fleiss (1979) describe three classes of ICC for reliability, which they term Case 1, Case 2 and Case 3. Each Case applies to a different rater agreement study design.

Case 1: Raters for each subject are selected at random.
Case 2: The same raters rate each case. These are a random sample of raters.
Case 3: The same raters rate each case. These are the only raters.

Case 1. One has a pool of raters. For each subject, one randomly samples k different raters from the rater pool to rate this subject. Therefore the raters who rate one subject are not necessarily the same as those who rate another. This design corresponds to a 1-way Analysis of Variance (ANOVA) in which Subject is a random effect, and Rater is viewed as measurement error.

Case 2. The same set of k raters rate each subject. This corresponds to a fully-crossed (Rater × Subject), 2-way ANOVA design in which both Subject and Rater are separate effects. In Case
2, Rater is considered a random effect; this means the k raters in the study are considered a random sample from a population of potential raters. The Case 2 ICC estimates the reliability of the larger population of raters.

Case 3. This is like Case 2--a fully-crossed, 2-way ANOVA design. But here one estimates the ICC that applies only to the k raters in the study. Since this does not permit generalization to other raters, the Case 3 ICC is not often used.

Shrout and Fleiss (1979) also show that for each of the three Cases above, one can use the ICC in two ways:

• To estimate the reliability of a single rating, or
• To estimate the reliability of a mean of several ratings.

For each of the Cases, then, there are two forms, producing a total of 6 different versions of the ICC.

3.2 Pros and Cons

3.2.1 Pros

• Flexible. The ICC, and more broadly, ANOVA analysis of ratings, is very flexible. Besides the six ICCs discussed above, one can consider more complex designs, such as a grouping factor among raters (e.g., experts vs. nonexperts), or covariates. See Landis and Koch (1977a,b) for examples.

• Software. Software to estimate the ICC is readily available (e.g., SPSS and SAS). Output from almost any ANOVA software will contain the values needed to calculate the ICC.

• Reliability of mean ratings. The ICC allows estimation of the reliability of both single and mean ratings. "Prophecy" formulas let one predict the reliability of mean ratings based on any number of raters.

• Combines information about bias and association. An alternative to the ICC for Cases 2 and 3 is to calculate the Pearson correlation between all pairs of raters. The Pearson correlation measures association between raters, but is insensitive to rater mean differences (bias). The ICC decreases in response to both lower correlation between raters and larger rater mean differences. Some may see this as an advantage, but others (see Cons) as a limitation.
• Number of categories.
The ICC can be used to compare the reliability of different instruments. For example, the reliability of a 3-level rating scale can be compared to the reliability of a 5-level scale (provided they are assessed relative to the same sample or population; see Cons).

3.2.2 Cons

• Comparability across populations. The ICC is strongly influenced by the variance of the trait in the sample/population in which it is assessed. ICCs measured for different populations might not be comparable. For example, suppose one has a depression rating scale. When applied to a random sample of the adult population the scale might have a high ICC. However, if the scale is applied to a very homogeneous population--such as patients hospitalized for acute depression--it might have a low ICC. This is evident from the definition of the ICC as s²(b)/[s²(b) + s²(w)]. In both populations above, s²(w), the variance of different raters' opinions of the same subject, may be the same. But between-subject variance, s²(b), may be much smaller in the clinical population than in the general population. Therefore the ICC would be smaller in the clinical population. The same instrument may be judged "reliable" or "unreliable," depending on the population in which it is assessed. This issue is similar to, and just as much a concern as, the "base rate" problem of the kappa coefficient. It means that:

1. One cannot compare ICCs for samples or populations with different between-subject variance; and
2. The often-reproduced table which shows specific ranges for "acceptable" and "unacceptable" ICC values should not be used.

For more discussion of the implications of this topic see The Comparability Issue below.

• Assumes equal spacing. To use the ICC with ordered-category ratings, one must assign the rating categories numeric values.
Usually categories are assigned values 1, 2, ..., C, where C is the number of rating categories; this assumes all categories are equally wide, which may not be true. An alternative is to assign ordered categories numeric values from their cumulative frequencies via probit (for a normally distributed trait) or ridit (for a rectangularly distributed trait) scoring; see Fleiss (1981).

• Association vs. bias. The ICC combines, or some might say confounds, two ways in which raters differ: (1) association, which concerns whether the raters understand the meaning of the trait in the
same way, and (2) bias, which concerns whether some raters' mean ratings are higher or lower than others'. If a goal is to give feedback to raters to improve future ratings, one should distinguish between these two sources of disagreement. For discussion of alternatives that separate these components, see the Likert Scale page of this website.

• Reliability vs. agreement. With ordered-category or Likert-type data, the ICC discounts the fact that we have a natural unit with which to evaluate rating consistency: the number or percent of agreements on each rating category. Raw agreement is simple, intuitive, and clinically meaningful. With ordered-category data, it is not clear why one would prefer the ICC to raw agreement rates, especially in light of the comparability issue discussed below. A good idea is to report reliability using both the ICC and raw agreement rates.

3.3 The Comparability Issue

Above it was noted that the ICC is strongly dependent on the trait variance within the population for which it is measured. This can complicate comparisons of ICCs measured in different populations, or generalizing results from a single population.

Some suggest avoiding this problem by eliminating or holding constant the "problematic" term, s²(b). Holding the term constant would mean choosing some fixed value for s²(b), and using this in place of the different value estimated in each population. For example, one might pick as s²(b) the trait variance in the general adult population--regardless of what population the ICC is measured in.

However, if one is going to hold s²(b) constant, one may well question using it at all! Why not simply report as the index of unreliability the value of s²(w) for a study? Indeed, this has been suggested, though not much used in practice.
But if one is going to disregard s²(b) because it complicates comparisons, why not go a step further and express reliability simply as raw agreement rates--for example, the percent of times two raters agree on the exact same category, and the percent of times they are within one level of one another?

An advantage of including s²(b) is that it automatically controls for the scaling factor of an instrument. Thus (at least within the same population), ICCs for instruments with different numbers of categories can be meaningfully compared. Such is not the case with raw agreement measures or with s²(w) alone. Therefore, someone reporting the reliability of a new scale may wish to include the ICC along with other measures if they expect later researchers to compare their results to those of a different instrument with fewer or more categories.
4. Kappa Coefficients

4.0 Summary

There is wide disagreement about the usefulness of kappa statistics to assess rater agreement. At the least, it can be said that (1) kappa statistics should not be viewed as the unequivocal standard or default way to quantify agreement; (2) one should be concerned about using a statistic that is the source of so much controversy; and (3) one should consider alternatives and make an informed choice.

One can distinguish between two possible uses of kappa: as a way to test rater independence (i.e., as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size measure). The first use involves testing the null hypothesis that there is no more agreement than might occur by chance given random guessing; that is, one makes a qualitative, "yes or no" decision about whether raters are independent or not. Kappa is appropriate for this purpose (although to know that raters are not independent is not very informative; raters are dependent by definition, inasmuch as they are rating the same cases).

It is the second use of kappa--quantifying actual levels of agreement--that is the source of concern. Kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone.
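For concreteness, kappa and its chance-agreement term can be computed from a C×C table as follows (a sketch in Python; the function name is ours):

```python
def cohen_kappa(table):
    """Cohen's kappa for a CxC frequency table: (po - pe) / (1 - pe),
    where pe, the 'chance agreement' term discussed above, is the sum
    over categories of row marginal x column marginal proportions."""
    c = len(table)
    n = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(c)) / n
    pe = sum(sum(table[i]) * sum(row[i] for row in table)
             for i in range(c)) / n ** 2
    return (po - pe) / (1 - pe)
```

The numerator po - pe is exactly the quantity referred to above: observed agreement minus the agreement expected under statistical independence.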
However, the term is relevant only under the condition of statistical independence of raters. Since raters are clearly not independent, the relevance of this term, and its appropriateness as a correction to actual agreement levels, is very questionable. Thus, the common statement that kappa is a "chance-corrected measure of agreement" is misleading. As a test statistic, kappa can verify that agreement exceeds chance levels. But as a measure of the level of agreement, kappa is not "chance-corrected"; indeed, in the absence of some explicit model of rater decision-making, it is by no means clear how chance affects the decisions of actual raters and how one might correct for it.

A better case for using kappa to quantify rater agreement is that, under certain conditions, it approximates the intraclass correlation. But this too is problematic in that (1) these conditions are not always met, and (2) one could instead directly calculate the intraclass correlation.

5. Tests of Marginal Homogeneity

5.0 Introduction

Consider symptom ratings (1 = low, 2 = moderate, 3 = high) by two raters on the same sample of subjects, summarized by a 3×3 table as follows:

Table 1. Summarization of ratings by Rater 1 (rows) and Rater 2 (columns).

          1      2      3
  1      p11    p12    p13     p1.
  2      p21    p22    p23     p2.
  3      p31    p32    p33     p3.
         p.1    p.2    p.3     1.0

Here pij denotes the proportion of all cases assigned to category i by Rater 1 and category j by Rater 2. (The table elements could as easily be frequencies.) The terms p1., p2., and p3. denote the
marginal proportions for Rater 1--i.e., the total proportion of times Rater 1 uses categories 1, 2 and 3, respectively. Similarly, p.1, p.2, and p.3 are the marginal proportions for Rater 2.

Marginal homogeneity refers to equality (lack of significant difference) between one or more of the row marginal proportions and the corresponding column proportion(s). Testing marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree is that they have different propensities to use each rating category. When such differences are observed, it may be possible to provide feedback or improve instructions to make raters' marginal proportions more similar and so improve agreement.

Differences in raters' marginal rates can be formally assessed with statistical tests of marginal homogeneity (Barlow, 1998; Bishop, Fienberg & Holland, 1975, Ch. 8). If each rater rates different cases, testing marginal homogeneity is straightforward: one can compare the marginal frequencies of different raters with a simple chi-squared test. However, this cannot be done when different raters rate the same cases--the usual situation in rater agreement studies; then the ratings of different raters are not statistically independent, and this must be accounted for.

Several statistical approaches to this problem are available. Alternatives include:

• Nonparametric tests
• Bootstrap methods
• Loglinear, association, and quasi-symmetry models
• Latent trait and related models

These approaches are outlined here.

5.1 Graphical and descriptive methods

Before discussing formal statistical methods, non-statistical methods for comparing raters' marginal distributions should be briefly mentioned. Simple descriptive methods can be very useful. For example, a table might report each rater's rate of use for each category. Graphical methods are especially helpful. A histogram can show the distribution of each rater's ratings across categories.
The following example is from the output of the MH program:

[ASCII histogram: marginal distributions of categories for Rater 1 (**) and Rater 2 (==); the x-axis is category number or level (1-6), the y-axis is the proportion of cases (maximum shown 0.304).]
Vertical or horizontal stacked-bar histograms are good ways to summarize the data. With ordered-category ratings, a related type of figure shows the cumulative proportion of cases below each rating level for each rater. An example, again from the MH program:

[ASCII plot: cumulative proportion of cases below each rating level (1-6) for Rater 1 and Rater 2, plotted against a 0 to 1 scale.]

These are merely examples. Many other ways to graphically compare marginal distributions are possible.

5.2 Nonparametric tests

The main nonparametric test for assessing marginal homogeneity is the McNemar test. The McNemar test assesses marginal homogeneity in a 2×2 table. Suppose, however, that one has an N×N cross-classification frequency table that summarizes ratings by two raters for an N-category rating system. By collapsing the N×N table into various 2×2 tables, one can use the McNemar test to assess marginal homogeneity of each rating category. With ordered-category data one can also collapse the N×N table in other ways to test rater equality of category thresholds, or to test raters for overall bias (i.e., a tendency to make higher or lower ratings than other raters).

The Stuart-Maxwell test can be used to test marginal homogeneity between two raters across all categories simultaneously. It thus complements McNemar tests of individual categories by providing an overall significance value. Further explanation of these methods and their calculation can be found by clicking on the test names above.
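The McNemar statistic itself is simple; the sketch below (our helper names; the Stuart-Maxwell test is not shown) also illustrates collapsing an N×N table around one category before applying it:

```python
def mcnemar_chi2(table2x2):
    """McNemar chi-squared (1 df) from the off-diagonal cells b and c of
    a 2x2 table; tests equality of the row and column marginal rates."""
    b, c = table2x2[0][1], table2x2[1][0]
    return (b - c) ** 2 / (b + c) if b + c else 0.0

def collapse_category(table, i):
    """Collapse an NxN table into a 2x2 table: category i vs. all others."""
    n = len(table)
    others = [j for j in range(n) if j != i]
    return [
        [table[i][i], sum(table[i][j] for j in others)],
        [sum(table[j][i] for j in others),
         sum(table[j][k] for j in others for k in others)],
    ]
```

Applying `collapse_category` for each i in turn yields a McNemar test of marginal homogeneity for every rating category, as described above.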
MH, a computer program for testing marginal homogeneity with these methods, is available online. For more information, click here.
These tests are remarkably easy to use and are usually just as effective as more complex methods. Because the tests are nonparametric, they make few or no assumptions about the data. While some of the methods described below are potentially more powerful, this comes at the price of making assumptions which may or may not be true. The simplicity of the nonparametric tests lends persuasiveness to their results. A mild limitation is that these tests apply only to comparisons of two raters. With more than two raters, of course, one can apply the tests to each pair of raters.

5.3 Bootstrapping

Bootstrap and related jackknife methods (Efron, 1982; Efron & Tibshirani, 1993) provide a very general and flexible framework for testing marginal homogeneity. Again, suppose one has an N×N cross-classification frequency table summarizing agreement between two raters on an N-category rating. Using what is termed the nonparametric bootstrap, one would repeatedly sample from this table to produce a large number (e.g., 500) of pseudo-tables, each with the same total frequency as the original table. Various measures of marginal homogeneity would be calculated for each pseudo-table; for example, one might calculate the difference between the row marginal proportion and the column marginal proportion for each category, or construct an overall measure of row vs. column marginal differences. Let d* denote such a measure calculated for a given pseudo-table, and let d denote the same measure calculated for the original table. From the pseudo-tables, one can empirically calculate the standard deviation of d*, denoted sd(d*). Let d' denote the true population value of d. Assuming that d' = 0 corresponds to the null hypothesis of marginal homogeneity, one can test this null hypothesis by calculating the z value:

z = d / sd(d*)

and determining the significance of the standard normal deviate z by usual methods (e.g., a table of z-value probabilities).
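A sketch of this bootstrap z test (our code; here d is taken to be the sum of absolute row-vs-column marginal differences, one of the example measures mentioned above):

```python
import random

def bootstrap_mh_z(table, n_boot=2000, seed=1):
    """z statistic for marginal homogeneity via the nonparametric
    bootstrap: d is the sum of absolute row-minus-column marginal
    differences as proportions; sd(d*) is estimated from pseudo-tables."""
    rng = random.Random(seed)
    c = len(table)
    n = sum(sum(row) for row in table)
    probs = [table[i][j] / n for i in range(c) for j in range(c)]

    def d_stat(t):
        tn = sum(sum(row) for row in t)
        return sum(
            abs(sum(t[i]) - sum(t[j][i] for j in range(c))) for i in range(c)
        ) / tn

    d_obs = d_stat(table)
    ds = []
    for _ in range(n_boot):
        sim = [[0] * c for _ in range(c)]
        for _ in range(n):                      # resample cases into cells
            u, acc = rng.random(), 0.0
            for idx, p in enumerate(probs):
                acc += p
                if u < acc:
                    sim[idx // c][idx % c] += 1
                    break
            else:
                sim[c - 1][c - 1] += 1          # guard against float rounding
        ds.append(d_stat(sim))
    mean = sum(ds) / len(ds)
    sd = (sum((x - mean) ** 2 for x in ds) / (len(ds) - 1)) ** 0.5
    return d_obs / sd if sd > 0 else float("inf")
```

A large z (e.g., above 1.96) suggests the raters' marginal distributions differ.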
The method above is merely an example. Many variations are possible within the framework of bootstrap and jackknife methods.

An advantage of bootstrap and jackknife methods is their flexibility. For example, one could potentially adapt them for simultaneous comparisons among more than two raters. A potential disadvantage of these methods is that the user may need to write a computer program to apply them. However, such a program could also be used for other purposes, such as providing bootstrap significance tests and/or confidence intervals for various raw agreement indices.

5.4 Loglinear, association and quasi-symmetry modeling

If one is using a loglinear, association or quasi-symmetry model to analyze agreement data, one can adapt the model to test marginal homogeneity.
For each type of model the basic approach is the same. First one estimates a general form of the model--that is, one without assuming marginal homogeneity; let this be termed the "unrestricted model." Next one adds the assumption of marginal homogeneity to the model. This is done by applying equality restrictions to some model parameters so as to require homogeneity of one or more marginal probabilities (Barlow, 1998). Let this be termed the "restricted model." Marginal homogeneity can then be tested using the difference G² statistic, calculated as:

difference G² = G²(restricted) - G²(unrestricted)

where G²(restricted) and G²(unrestricted) are the likelihood-ratio chi-squared model fit statistics (Bishop, Fienberg & Holland, 1975) calculated for the restricted and unrestricted models. The difference G² can be interpreted as a chi-squared value, and its significance can be determined from a table of chi-squared probabilities. The df are equal to the difference in df between the unrestricted and restricted models. A significant value implies that the rater marginal probabilities are not homogeneous.

An advantage of this approach is that one can test marginal homogeneity for one category, several categories, or all categories using a unified approach. Another is that, if one is already analyzing the data with a loglinear, association, or quasi-symmetry model, the addition of marginal homogeneity tests may require relatively little extra work.

A possible limitation is that loglinear, association, and quasi-symmetry models are only well-developed for the analysis of two-way tables. Another is that use of the difference G² test typically requires that the unrestricted model fit the data, which sometimes might not be the case. For an excellent discussion of these and related models (including linear-by-linear models), see Agresti (2002).
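The G² fit statistic and the difference test themselves are simple to compute once a model's expected frequencies are in hand. A sketch (our code; `observed` and `expected` are flattened cell frequencies):

```python
import math

def g2(observed, expected):
    """Likelihood-ratio chi-squared: G^2 = 2 * sum obs * ln(obs / exp).
    Cells with zero observed frequency contribute nothing to the sum."""
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

def difference_g2(g2_restricted, g2_unrestricted):
    """Difference G^2, to be compared to a chi-squared distribution with
    df equal to the difference in df between the two models."""
    return g2_restricted - g2_unrestricted
```

Obtaining the expected frequencies for a restricted or unrestricted model still requires fitting the model itself, e.g., with loglinear modeling software.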
5.5 Latent trait and related models

Latent trait models and related methods, such as the tetrachoric and polychoric correlation coefficients, can be used to test marginal homogeneity for dichotomous or ordered-category ratings.

The general strategy with these methods is similar to that described for loglinear and related models. That is, one estimates both an unrestricted version of the model and a restricted version that assumes marginal homogeneity, and compares the two models with a difference G² test. With latent trait and related models, the restricted models are usually constructed by assuming that the thresholds for one or more rating levels are equal across raters.

A variation of this method tests overall rater bias. That is done by estimating a restricted model in which the thresholds of one rater are equal to those of another plus a fixed constant. A comparison of this restricted model with the corresponding unrestricted model tests the hypothesis that the fixed constant, which corresponds to the bias of a rater, is 0.

Another way to test marginal homogeneity using latent trait models is with the asymptotic standard errors of the estimated category thresholds. These can be used to estimate the standard
error of the difference between the thresholds of two raters for a given category, and this standard error used to test the significance of the observed difference.

An advantage of the latent trait approach is that it can be used to assess marginal homogeneity among any number of raters simultaneously. A disadvantage is that these methods require more computation than nonparametric tests. If one is only interested in testing marginal homogeneity, the nonparametric methods might be a better choice. However, if one is already using latent trait models for other reasons, such as to estimate the accuracy of individual raters or to estimate the correlation of their ratings, one might also use them to examine marginal homogeneity; even in this case, though, it might be simpler to use the nonparametric tests of marginal homogeneity.

If there are many raters and categories, data may be sparse (i.e., many possible patterns of ratings across raters with 0 observed frequencies). With very sparse data, the difference G2 statistic is no longer distributed as chi-squared, so that standard methods cannot be used to determine its statistical significance.

References

Agresti A. Categorical data analysis. New York: Wiley, 2002.

Barlow W. Modeling of categorical agreement. In: Armitage P, Colton T, eds. The encyclopedia of biostatistics, pp. 541-545. New York: Wiley, 1998.

Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.

Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.

Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.
6. The Tetrachoric and Polychoric Correlation Coefficients

6.0 Introduction

This page describes the tetrachoric and polychoric correlation coefficients, explains their meaning and uses, gives examples and references, provides programs for their estimation, and discusses other available software. While the discussion is primarily oriented to rater agreement problems, it is general enough to apply to most other uses of these statistics.

A clear, concise description of the tetrachoric and polychoric correlation coefficients, including issues relating to their estimation, is found in Drasgow (1988). Olsson (1979) is also helpful. What distinguishes the present discussion is the view that the tetrachoric and polychoric correlation models are special cases of latent trait modeling. (This is not a new observation, but it is sometimes overlooked.) Recognizing this opens up important new possibilities. In particular, it allows one to relax the distributional assumptions, which are the most limiting feature of the "classical" tetrachoric and polychoric correlation models.

6.0.1 Summary

The tetrachoric correlation (Pearson, 1901), for binary data, and the polychoric correlation, for ordered-category data, are excellent ways to measure rater agreement. They estimate what the correlation between raters would be if ratings were made on a continuous scale; they are, theoretically, invariant over changes in the number or "width" of rating categories. The tetrachoric and polychoric correlations also provide a framework that allows testing of marginal homogeneity between raters. Thus, these statistics let one separately assess both components of
rater agreement: agreement on trait definition and agreement on definitions of specific categories. These statistics make certain assumptions, however. With the polychoric correlation, the assumptions can be tested. The assumptions cannot be tested with the tetrachoric correlation if there are only two raters; in some applications, though, theoretical considerations may justify the use of the tetrachoric correlation without a test of model fit.

6.1 Pros and Cons: Tetrachoric and Polychoric Correlation Coefficients

6.1.1 Pros:

• These statistics express rater association in a familiar form--a correlation coefficient.
• They provide a way to separately quantify association and similarity of category definitions.
• They do not depend on the number of rating levels; results can be compared for studies where the number of rating levels is different.
• They can be used even if different raters have different numbers of rating levels.
• The assumptions can be easily tested for the polychoric correlation.
• Estimation software is routinely available (e.g., SAS PROC FREQ, and PRELIS).

6.1.2 Cons:

• Model assumptions are not always appropriate--for example, if the latent trait is truly discrete.
• For only two raters, there is no way to test the assumptions of the tetrachoric correlation.

6.2 Intuitive Explanation

Consider the example of two psychiatrists (Raters 1 and 2) making a diagnosis for presence/absence of Major Depression. Though the diagnosis is dichotomous, we allow that depression as a trait is continuously distributed in the population.
Figure 1 (draft). Latent continuous variable (depression severity, Y) and discretizing threshold (t).

In diagnosing a given case, a rater considers the case's level of depression, Y, relative to some threshold, t: if the judged level is above the threshold, a positive diagnosis is made; otherwise the diagnosis is negative. Figure 2 portrays the situation for two raters. It shows the distribution of cases in terms of depression level as judged by Rater 1 and Rater 2.

Figure 2. Joint distribution (ellipse) of depression severity as judged by two raters (Y1 and Y2), and discretizing thresholds (t1 and t2).

Here a, b, c and d denote the proportions of cases that fall in each region defined by the two raters' thresholds. For example, a is the proportion below both raters' thresholds and therefore diagnosed negative by both. These proportions correspond to a summary of the data as a 2 x 2 cross-classification of the raters' ratings:

                 Rater 1
                  -     +
               +-----+-----+
            -  |  a  |  b  | a+b
    Rater 2    +-----+-----+
            +  |  c  |  d  | c+d
               +-----+-----+
                 a+c   b+d    1

Figure 3 (draft). Cross-classification proportions for binary ratings by two raters.

Again, a, b, c and d in Figure 3 represent proportions (not frequencies). Once we know the observed cross-classification proportions a, b, c and d for a study, it is a simple matter to estimate the model represented by Figure 2. Specifically, we estimate the location of the discretizing thresholds, t1 and t2, and a third parameter, rho, which determines the "fatness" of the ellipse. Rho is the tetrachoric correlation, or r*. It can be interpreted here as the correlation between judged disease severity (before application of thresholds) as viewed by Rater 1 and Rater 2.

The principle of estimation is simple: basically, a computer program tries various combinations of t1, t2 and r* until values are found for which the expected proportions for a, b, c and d in Figure 2 are as close as possible to the observed proportions in Figure 3. The parameter values that do so are regarded as (estimates of) the true, population values.

The polychoric correlation, used when there are more than two ordered rating levels, is a straightforward extension of the model above. The difference is that there are more thresholds, more regions in Figure 2, and more cells in Figure 3. But again the idea is to find the values for the thresholds and r* that maximize similarity between model-expected and observed cross-classification proportions.
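The estimation idea just described can be sketched directly. The following Python fragment is a rough illustration only, not a production estimator: the observed proportions are hypothetical, and real programs typically use maximum likelihood rather than the simple least-squares criterion used here. It searches for values of t1, t2 and r* so that the model-implied cell proportions match the observed ones:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

# Hypothetical observed proportions a, b, c, d from a 2 x 2 table (sum to 1).
obs = np.array([0.40, 0.10, 0.15, 0.35])

def expected_props(t1, t2, rho):
    """Cell proportions implied by bivariate normal (Y1, Y2) with correlation
    rho and discretizing thresholds t1 (Rater 1) and t2 (Rater 2)."""
    a = multivariate_normal.cdf([t1, t2], mean=[0.0, 0.0],
                                cov=[[1.0, rho], [rho, 1.0]])  # negative by both
    b = norm.cdf(t2) - a  # Rater 1 positive, Rater 2 negative
    c = norm.cdf(t1) - a  # Rater 1 negative, Rater 2 positive
    d = 1.0 - a - b - c   # positive by both
    return np.array([a, b, c, d])

def loss(params):
    t1, t2, rho = params
    return np.sum((expected_props(t1, t2, rho) - obs) ** 2)

fit = minimize(loss, x0=[0.0, 0.0, 0.0],
               bounds=[(-3, 3), (-3, 3), (-0.99, 0.99)])
t1_hat, t2_hat, r_star = fit.x  # r_star estimates the tetrachoric correlation
```

Note that the fitted thresholds reproduce the observed marginal proportions (here, a + c and a + b), while r* captures the association that remains after the thresholds are accounted for.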
7. Detailed Description

7.0 Introduction

In many situations, even though a trait may be continuous, it may be convenient to divide it into ordered levels. For example, for research purposes, one may classify levels of headache pain into the categories none, mild, moderate and severe. Even for traits usually viewed as discrete, one might still consider continuous gradations--for example, people infected with the flu virus exhibit varying levels of symptom intensity.

The tetrachoric and polychoric correlation coefficients are appropriate when the latent trait that forms the basis of ratings can be viewed as continuous. We will outline here the measurement model and assumptions for the tetrachoric correlation. The model and assumptions for the polychoric correlation are the same--the only difference is that there are more threshold parameters, corresponding to the greater number of ordered rating levels.

7.1 Measurement Model

We begin with some notation and definitions. Let:

X1 and X2 be the manifest (observed) ratings by Raters (or procedures, diagnostic tests, etc.) 1 and 2; these are discrete-valued variables;

Y1 and Y2 be latent continuous variables associated with X1 and X2; these are the pre-discretized, continuous "impressions" of the trait level, as judged by Raters 1 and 2;

T be the true, latent trait level of a case.

A rating or diagnosis of a case begins with the case's true trait level, T. This information, along with "noise" (random error) and perhaps other information unrelated to the true trait which a given rater may consider (unique variation), leads to each rater's impression of the case's trait level (Y1 and Y2). Each rater applies discretizing thresholds to this judged trait level to yield a dichotomous or ordered-category rating (X1 and X2).
Stated more formally, we have:

Y1 = bT + u1 + e1,
Y2 = bT + u2 + e2,

where b is a regression coefficient, u1 and u2 are the unique components of the raters' impressions, and e1 and e2 represent random error or noise. It turns out that unique variation and error variation behave more or less the same in the model, and the former can be subsumed under the latter. Thus we may consider the simpler model:

Y1 = b1T + e1,
Y2 = b2T + e2.

The tetrachoric correlation assumes that the latent trait T is normally distributed. As scaling is arbitrary, we specify that T ~ N(0, 1). Error is similarly assumed to be normally distributed (and independent both between raters and across cases). For reasons we need not pursue here, the model loses no generality by assuming that var(e1) = var(e2). We therefore stipulate that e1, e2 ~ N(0, sigma_e).

A consequence of these assumptions is that Y1 and Y2 must also be normally distributed. To fix the scale, we specify that var(Y1) = var(Y2) = 1. It follows that b1 = b2 = b = the correlation of both Y1 and Y2 with the latent trait. We define the tetrachoric correlation, r*, as

r* = b^2.

A simple "path diagram" may clarify this:

Y1 <--b-- T --b--> Y2

Figure 4 (draft). Path diagram.

Here b is the path coefficient that reflects the influence of T on both Y1 and Y2. Those familiar with the rules of path analysis will see that the correlation of Y1 and Y2 is simply the product of their degrees of dependence on T--that is, b^2. As an aside, one might consider that the value of b is interesting in its own right, inasmuch as it offers a measure of the association of ratings with the true latent trait--i.e., a measure of rating validity or accuracy.

The tetrachoric correlation r* is readily interpretable as a measure of the association between the ratings of Rater 1 and Rater 2.
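The relation r* = b^2 implied by the path diagram can be checked by simulation; a minimal sketch (the value b = 0.8 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
b = 0.8                       # loading of each rater's impression on T
sigma_e = np.sqrt(1 - b**2)   # chosen so that var(Y1) = var(Y2) = 1

T = rng.standard_normal(n)    # latent trait, T ~ N(0, 1)
Y1 = b * T + sigma_e * rng.standard_normal(n)
Y2 = b * T + sigma_e * rng.standard_normal(n)

r = np.corrcoef(Y1, Y2)[0, 1]  # sample correlation, close to b**2 = 0.64
```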
Because it estimates the correlation that exists between the pre-discretized judgements of the raters, it is, in theory, not affected by (1) the number of rating
levels, or (2) the marginal proportions for the rating levels (i.e., the 'base rates'). The fact that this association is expressed in the familiar form of a correlation is also helpful.

The assumptions of the tetrachoric correlation coefficient may be expressed as follows:

1. The trait on which ratings are based is continuous.
2. The latent trait is normally distributed.
3. Rating errors are normally distributed.
4. Var(e) is homogeneous across levels of T.
5. Errors are independent between raters.
6. Errors are independent between cases.

Assumptions 1--4 can be alternatively expressed as the assumption that Y1 and Y2 follow a bivariate normal distribution. We will assume that one has sufficient theoretical understanding of the application to accept the assumption of latent continuity. The second assumption--that of a normal distribution for T--is potentially more questionable. Absolute normality, however, is probably not necessary; a unimodal, roughly symmetrical distribution may be close enough. Also, the model implicitly allows for a monotonic transformation of the latent continuous variables. That is, a more exact way to express Assumptions 1--4 is that one can obtain a bivariate normal distribution by some monotonic transformation of Y1 and Y2.

The model assumptions can be tested for the polychoric correlation. This is done by comparing the observed numbers of cases for each combination of rating levels with those predicted by the model, using the likelihood-ratio chi-squared test, G2 (Bishop, Fienberg & Holland, 1975), which is similar to the usual Pearson chi-squared test. (The Pearson chi-squared test can also be used; for more information on these tests, see the FAQ on testing model fit on the Latent Class Analysis web site.) The G2 test is assessed by considering the associated p value, with the appropriate degrees of freedom (df).
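A minimal sketch of this goodness-of-fit computation in Python, using hypothetical observed and model-predicted frequencies for a 3 x 3 table (the df formula RC - R - C is the one given below for the polychoric model):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical frequencies for illustration: observed counts for a 3 x 3
# cross-classification, and the counts predicted by a fitted polychoric model.
observed = np.array([20, 5, 2, 6, 18, 7, 1, 6, 25], dtype=float)
expected = np.array([19.0, 5.8, 2.2, 6.5, 17.0, 7.5, 1.5, 6.2, 24.3])

# Likelihood-ratio chi-squared statistic G2
g2 = 2.0 * np.sum(observed * np.log(observed / expected))

R, C = 3, 3
df = R * C - R - C    # df = RC - R - C for the polychoric model
p = chi2.sf(g2, df)   # a large p value means the model is not rejected
```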
The df are given by:

df = RC - R - C,

where R is the number of levels used by the first rater and C is the number of levels used by the second rater. As this is a "goodness-of-fit" test, it is standard practice to set the alpha level fairly high (e.g., .10). A p value below the alpha level is evidence that the model does not fit; a p value above it means the model is not rejected.

For the tetrachoric correlation, R = C = 2, and there are no df with which to test the model. It is possible to test the model, though, when there are more than two raters.

7.2 Using the Polychoric Correlation to Measure Agreement
Here are the steps one might follow to use the tetrachoric or polychoric correlation to assess agreement in a study. For convenience, we will mainly refer to the polychoric correlation, which includes the tetrachoric correlation as a special case.

i) Calculate the value of the polychoric correlation. For this a computer program, such as those described in the software section, is required.

ii) Evaluate model fit. The next step is to determine whether the assumptions of the polychoric correlation are empirically valid. This is done with the goodness-of-fit test, described previously, that compares observed cross-classification frequencies to model-predicted frequencies. As noted, this test cannot be done for the tetrachoric correlation. PRELIS includes a test of model fit when estimating the polychoric correlation. It is unknown whether SAS PROC FREQ includes such a test.

iii) Assess the magnitude and significance of the correlation. Assuming that model fit is acceptable, the next step is to note the magnitude of the polychoric correlation. Its value is interpreted in the same way as a Pearson correlation. As the value approaches 1.0, more agreement on the trait definition is indicated. Values near 0 indicate little agreement on the trait definition.

One may wish to test the null hypothesis of no correlation between raters. There are at least two ways to do this. The first makes use of the estimated standard error of the polychoric correlation under the null hypothesis of r* = 0. At least for the tetrachoric correlation, there is a simple closed-form expression for this standard error (Brown, 1977). Knowing this value, one may calculate a z value as:

z = r* / sigma_r*(0),

where the denominator is the standard error of r* when r* = 0. One may then assess statistical significance by evaluating the z value in terms of the associated tail probabilities of the standard normal curve. The second method is via a chi-squared test.
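The first, z-based method can be sketched as follows. The correlation estimate and its null standard error here are hypothetical; in practice sigma_r*(0) would come from the estimation program or, for the tetrachoric correlation, from the closed-form expression cited above (Brown, 1977):

```python
from scipy.stats import norm

# Hypothetical values for illustration: an estimated polychoric correlation
# and its standard error under the null hypothesis r* = 0.
r_star = 0.42
se_null = 0.16

z = r_star / se_null
p = 2 * norm.sf(abs(z))  # two-sided p value from the standard normal curve
```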
If r* = 0, the polychoric correlation model is the same as the model of statistical independence. It therefore seems reasonable to test the null hypothesis of r* = 0 by testing the statistical independence model. Either the Pearson (X2) or likelihood-ratio (G2) chi-squared statistic can be used to test the independence model. The df for either test is (R - 1)(C - 1). A significant chi-squared value implies that r* is not equal to 0.

[I now question whether the above is correct. For the polychoric correlation, data may fail the test of independence even when r* = 0 (i.e., there may be some other kind of 'structure' to the data). If so, a better alternative would be to calculate a difference G2 statistic as:

G2(H0) - G2(H1),