How to assess the reliability of measurements in rehabilitation
INVITED REVIEW
Statistics

Authors: Jan E. Lexell, MD, PhD; David Y. Downham, PhD

Affiliations: From the Department of Rehabilitation, Lund University Hospital, Lund, Sweden (JEL); the Department of Health Sciences, Lund University, Malmö, Sweden (JEL); the Department of Health Sciences, Luleå University of Technology, Boden, Sweden (JEL); and the Department of Mathematical Sciences, University of Liverpool, Liverpool, UK (DYD).

Correspondence: All correspondence and requests for reprints should be addressed to Jan E. Lexell, MD, PhD, Department of Rehabilitation, Lund University Hospital, Orupssjukhuset, 22185 Lund, Sweden.

Disclosures: Supported, in part, by grants from the Gun and Bertil Stohne Foundation, Magn. Bergvall Foundation, the Crafoord Foundation, the Swedish Society of Medicine, the Swedish Stroke Association, the Swedish Association of Neurologically Disabled (NHR), the Council for Medical Health Care Research in South Sweden, Lund University Hospital, and Region Skåne.

0894-9115/05/8409-0719/0
American Journal of Physical Medicine & Rehabilitation
Copyright © 2005 by Lippincott Williams & Wilkins
DOI: 10.1097/01.phm.0000176452.17771.20
Am. J. Phys. Med. Rehabil., Vol. 84, No. 9, September 2005

ABSTRACT

Lexell JE, Downham DY: How to assess the reliability of measurements in rehabilitation. Am J Phys Med Rehabil 2005;84:719–723.

To evaluate the effects of rehabilitation interventions, we need reliable measurements. The measurements should also be sufficiently sensitive to enable the detection of clinically important changes. In recent years, the assessment of reliability in clinical practice and medical research has developed from the use of correlation coefficients to a comprehensive set of statistical methods. In this review, we present methods that can be used to assess reliability and describe how data from reliability analyses can aid the interpretation of results from rehabilitation interventions.

Key Words: Reference Values, Reproducibility of Results, Research Design, Statistics

In rehabilitation, measurements are obtained for clinical practice and research purposes. To be clinically and scientifically useful, measurements must be reliable. Reliability refers to the reproducibility of measurements. Measurements are reliable if they are stable over time and show adequate levels of measurement variability (1). They must also be sufficiently sensitive to detect clinically important changes after rehabilitation interventions. The more sensitive your measurement, the easier it is to detect improvements after interventions or deterioration over time.

Reliability in clinical practice and medical research is most commonly determined from measurements of the same subjects on two occasions: so-called retest reliability (2). By applying well-defined statistical methods, physiatrists and clinical scientists can determine whether measurements are sufficiently reliable for a particular purpose. In recent years, there has been a growing interest in the statistical methods for the assessment of reliability in clinical practice and medical research. Today, there is general agreement that a comprehensive set of statistical methods is required to address the reliability of measurements (3–8).

In this review, we present the statistical methods that are part of the current approach to the assessment of retest reliability based on continuous data and illustrate them using measurements of isokinetic muscle strength. We also describe how data from reliability analyses can aid the interpretation of results from rehabilitation interventions and the detection of clinically important changes. In the Appendix, we present the equations that are used to calculate reliability. This provides physiatrists and clinical scientists with a source of reference for the determination of the characteristics of measurements and how to interpret data obtained from rehabilitation interventions.

RETEST CORRELATION COEFFICIENTS

The most common approach in the assessment of retest reliability is to measure a group of subjects on two occasions, separated by hours or days. This is done on the same type of subjects or patients whom one intends to study in an intervention. Once the data are obtained, the first step is to plot the data. In Figure 1, data on concentric ankle dorsiflexor muscle strength at 30 degrees/sec obtained from 30 healthy young men and women at two test occasions 7 days apart are presented. The data come from a larger study in which we evaluated the reliability of peak torque, work, and torque at a specific time at different angular velocities using the Biodex dynamometer (5).

FIGURE 1. To illustrate the analysis of reliability, measurements of concentric ankle dorsiflexor muscle strength at 30 degrees/sec are used. The data were obtained from 30 healthy young men and women at two test occasions 7 days apart (5).

Clearly, the closer the values are to a straight line, the better the reliability. The usual Pearson's correlation coefficient (Pearson's r) can be used to quantify the reliability. The value of Pearson's r is here 0.91; as a value of 1.00 represents perfect agreement and a value of 0.00 no agreement, we can already at this stage presume that our measurements are highly reliable.

The intraclass correlation coefficient (ICC) is nowadays the preferred retest correlation coefficient (9). The ICC has several advantages: for example, it can be used with small samples and with data from more than two test occasions. There are different types of ICCs available for different study designs, but in practice, their values are often very similar (6,10). The ICC2,1 (Equation 1, Appendix), which is calculated from a two-way repeated-measures analysis of variance, covers most situations and also has the advantage that it provides the basis for the calculations of some of the other reliability indices. (Reliability can also be determined for ordinal data obtained from, for example, clinical rating scales. The equivalent to a correlation coefficient is then the kappa coefficient (11).)

The data plotted in Figure 1 were analyzed using a two-way analysis of variance, and the values from the analysis of variance table were substituted into Equation 1. The value of ICC2,1 is 0.915, which is very close to the Pearson's r. This is often the case when measurements of the same subjects on two occasions are analyzed (5).

How do we then interpret the values of ICC? As a matter of fact, no generally agreed ICC "cut-off" points exist. In their original publication from 1979, Shrout and Fleiss (9) suggested that specific values of ICC could be considered to represent "acceptable," "good," and "fair" reliability. Fleiss (12) later recommended that ICC values of >0.75 represent "excellent reliability" and values between 0.4 and 0.75 represent "fair to good reliability."

It is becoming clear that the use of only ICCs for the analyses of reliability is not sufficient (3,4,7,8). The ICC can give misleading results: for example, if the sample is homogeneous, the value of ICC (and Pearson's r) may be low. We should therefore avoid assessing reliability just from retest correlations and also avoid using global terms, such as "good reliability," and instead interpret reliability from several statistical methods.

CHANGES IN THE MEAN

The next step in the reliability analysis is to calculate changes in the mean from the measurements obtained from the two test occasions. The change in the mean values between two test occasions can consist of two components: a random change and a systematic change. A random change is often referred to as the "sampling error" and comes from the variability in the actual test situation. It can be due to the variability in the equipment or method used and to the inherent biological variability. A systematic change is a nonrandom change in the mean values between the two test occasions. This occurs if the subject or patient systematically performs better (or worse) on the second test occasion as a result of, for example, a change in behavior, a learning effect, or fatigue if performance is measured.

To detect a systematic change, several indices can be calculated (Equations 2–4). Based on these calculations, a 95% confidence interval for the mean difference between the two test occasions can be formed (Equation 5). This will allow you to determine if there is a true systematic difference between the two test occasions. If the mean difference is positive or negative, the measurements from one test occasion tend to be larger or smaller than those from the other occasion. When the 95% confidence interval does not include zero, this indicates a systematic change in the mean, for example, due to a significant learning effect. If this happens, one should choose tests or equipment that minimize this learning effect or let the subjects familiarize themselves before the real trials. This is less of a problem in a controlled study if both groups are equally affected by systematic changes in the mean.

Using the data from Figure 1, the indices representing the changes in the mean are calculated from Equations 2–5 and are presented in Table 1. The mean difference between the two test occasions is here positive (test occasion 2 minus 1), indicating that the isokinetic muscle strength at the second test occasion tended to be larger than at the first occasion. However, as zero is well within the 95% confidence interval, there is no significant change in the mean muscle strength between the two test occasions, and we can therefore conclude that there is no systematic change between the two test occasions.

TABLE 1. Indices of changes in the mean between two test occasions

    Mean difference between two test occasions (d̄)                              0.03
    Standard deviation of the differences between two test occasions (SDdiff)   2.88
    Standard error of d̄ (SE)                                                    0.53
    95% confidence interval of d̄ (95% CI)                                       −1.04 to 1.10

The next step is then to present the data graphically in so-called Bland–Altman plots (2). In these plots, the differences between measurements from the two test occasions are plotted against the mean of the two test occasions for each subject, and any systematic bias or outliers can be seen. The Bland–Altman plot for the data in Figure 1 is presented in Figure 2; the mean difference in muscle strength between the two test occasions (test occasion 2 minus 1) and the 95% confidence interval are also included. It is clearly seen that the mean difference is close to zero, that zero is included in the 95% confidence interval, and that there are no apparent systematic biases or outliers in the data.

FIGURE 2. Bland–Altman plot for the data in Figure 1, with the approximate 95% confidence interval (dashed line) of the mean difference (solid line) between the two test occasions.

MEASUREMENT VARIABILITY

Once we have established if there are any systematic changes in the mean, we want to quantify the actual size of the variability between the measurements obtained from the two test occasions. This is often referred to as the "within-subject variation," "typical error," or "typical variation" (7) and is one of the more important reliability indices. Quite naturally, a change after any intervention will be easier to detect if we have established that the variability between measurements from two test occasions is small (3,7).

A simple index of the measurement variability is the standard deviation of the differences between the two test occasions (SDdiff) (Equation 3). Dividing the SDdiff by √2 yields the "method error" (ME) (Equation 6) (13). An alternative index is the "standard error of measurement" (SEM). There are different ways to calculate SEM; a preferred way is to take the square root of the mean square error term (WMS) from the analysis of variance (Equation 7). If the sample size is sufficiently large and the mean difference small, both highly likely conditions, ME and SEM take similar values. These values are, on their own, not easily interpreted because we do not know how much of the variability comes from the change in the mean and how much is due to the typical variation (7).

This can be overcome by expressing the measurement variability as a coefficient of variation. The ME or the SEM is then divided by the mean of all the measurements and multiplied by 100 to give a percentage value. These indices, the CV% (Equation 8) and the SEM% (Equation 9), are independent of the units of measurements and are therefore more easily interpreted. The CV% and SEM% indicate the typical variation expressed as a percentage value and can be used to interpret the results of an improvement after an intervention.

Substituting the data from Figure 1 into Equations 6–9, we can calculate the values of ME, SEM, CV%, and SEM%. As can be seen in Table 2, ME and SEM are very similar, as are CV% and SEM%. We can then determine that with a CV% of 6.36 and a mean isokinetic muscle strength of 31.9 Nm (based on the data in Fig. 1), the average person's muscle strength has a typical variation from one test occasion to the other of 2.03 Nm. An improvement after an intervention that is smaller than the typical variation does not, in most situations, indicate a clinically important improvement.

TABLE 2. Indices of measurement variability

    Method error (ME)                              2.03
    Standard error of the measurement (SEM)        2.01
    Coefficient of variation (CV%)                 6.36
    Standard error of the measurement (SEM%)       6.30

CLINICALLY IMPORTANT CHANGES

To evaluate more specifically if a measurement represents a clinically important change, we can use the data from the two test occasions to calculate an interval, which represents the 95% likely range for the difference between a subject's measurements from these two test occasions. This is a form of "reference range" that can be used to determine if differences between measurements subsequently obtained before and after "real" interventions represent clinically important changes. If the difference for a subject or patient is outside (or within) this reference range, it does (or does not) represent a clinically important change. Clearly, the smaller the reference range, the more sensitive the measurements. It also follows that measurements can be considered highly reliable and yet have a reference range that is too wide to be clinically or scientifically useful.

The "smallest real difference" (SRD), introduced by Beckerman et al. (14), is one way to evaluate clinically important changes. The SRD is similar to the "limits of agreement" proposed by Bland and Altman (4). The SRD (Equation 10) is formed by taking the measurement variability, represented by SEM, and multiplying it by √2 and by 1.96 to include 95% of the observations of the difference between the two measurements. To obtain a reference range or "error band" (14) around the mean difference of the measurements from the two test occasions, a 95% SRD can be calculated (Equation 11). The SRD is then divided by the mean of all the measurements and multiplied by 100 to give a percentage value, the SRD% (Equation 12) (15). The SRD%, like the SEM% and CV%, is independent of the units of measurements and therefore more easily interpreted.

To form practically useful 95% SRD and SRD%, a fairly large sample size is required. This brings us to the question of how many subjects or patients you need to determine the reliability of a measurement. In comparison with clinical trials, sample sizes have received little attention in the reliability literature. Some scientists have recommended specific sample sizes, but these recommendations are usually based on practical experience. A sample size of 15–20 is often used in reliability studies with continuous data (12). Larger sample sizes, 30–50, have been suggested more recently (7) and would be required to form practically useful 95% SRD and SRD%.

The values of SRD, 95% SRD, and SRD% are then calculated from Equations 10–12 using the data from Figure 1 and are presented in Table 3. The SRD% is 18.1, which indicates that a change in a measurement should exceed that percentage to indicate a clinically important change. Taking the mean value of the isokinetic muscle strength in Figure 1 (31.9 Nm), this has to change by 5.7 Nm to indicate a clinically important improvement in strength.

TABLE 3. Indices of clinically important changes

    Smallest real difference (SRD)         5.81
    Smallest real difference (95% SRD)     −5.78 to 5.84
    Smallest real difference (SRD%)        18.1%

Having done all this, one should remember that there is an essential difference between the reliability indices described here and clinically important changes. The reliability indices describe the clinometric property of a measurement, whereas clinically important changes are more arbitrarily chosen values that physiatrists and clinical scientists judge as minimally (and clinically) important. An interesting area for future research is to explore the clinometric property of measurements and how that corresponds to what physiatrists and clinical scientists judge as clinically important. Such research will help us to define the optimal outcome measures for clinical practice and research purposes.
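As a rough illustration of the steps described under changes in the mean, measurement variability, and clinically important changes, the following Python sketch computes the indices of Appendix Equations 2–12 from paired measurements. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `retest_indices`, its parameters, and the six data pairs are hypothetical, and WMS is assumed to be the within-subjects mean square of a one-way analysis of variance, which for two occasions reduces to the sum of squared differences divided by 2n.

```python
import math
import statistics


def retest_indices(occasion1, occasion2, ci_multiplier=2.0, srd_multiplier=1.96):
    """Retest reliability indices for paired data (Appendix Equations 2-12)."""
    if len(occasion1) != len(occasion2) or len(occasion1) < 2:
        raise ValueError("need two equally long sequences with at least 2 subjects")
    n = len(occasion1)
    # Differences are taken as test occasion 2 minus test occasion 1.
    diffs = [second - first for first, second in zip(occasion1, occasion2)]
    grand_mean = statistics.fmean(list(occasion1) + list(occasion2))

    d_bar = statistics.fmean(diffs)                 # Eq. 2: mean difference
    sd_diff = statistics.stdev(diffs)               # Eq. 3: sample SD (n - 1)
    se = sd_diff / math.sqrt(n)                     # Eq. 4
    ci = (d_bar - ci_multiplier * se, d_bar + ci_multiplier * se)  # Eq. 5

    me = sd_diff / math.sqrt(2)                     # Eq. 6: method error
    # Assumption: WMS is the within-subjects mean square of a one-way ANOVA,
    # which for two occasions equals sum(d^2) / (2n).
    wms = sum(d * d for d in diffs) / (2 * n)
    sem = math.sqrt(wms)                            # Eq. 7
    cv_pct = me / grand_mean * 100                  # Eq. 8
    sem_pct = sem / grand_mean * 100                # Eq. 9

    srd = srd_multiplier * sem * math.sqrt(2)       # Eq. 10
    srd_range = (d_bar - srd, d_bar + srd)          # Eq. 11
    srd_pct = srd / grand_mean * 100                # Eq. 12

    return {
        "d_bar": d_bar, "sd_diff": sd_diff, "se": se, "ci": ci,
        "me": me, "sem": sem, "cv_pct": cv_pct, "sem_pct": sem_pct,
        "srd": srd, "srd_range": srd_range, "srd_pct": srd_pct,
    }


# Hypothetical strength values (Nm) for six subjects; these are NOT the paper's data.
occasion1 = [30.0, 28.0, 35.0, 32.0, 31.0, 29.0]
occasion2 = [31.0, 27.0, 36.0, 33.0, 30.0, 30.0]
result = retest_indices(occasion1, occasion2)
for name, value in result.items():
    print(name, value)
```

The two multipliers are left as parameters because the appropriate value depends on the sample size. With the reported SDdiff of 2.88 and n of 30, Equation 4 gives 2.88/√30 ≈ 0.53, in line with Table 1. With the reported SEM of 2.01, a multiplier of 1.96 in Equation 10 gives about 5.57, whereas the tabulated SRD of 5.81 is what a multiplier of 2.045 would give; that reading is an inference from the arithmetic, not something the text states.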
APPENDIX

RETEST CORRELATION COEFFICIENTS

The intraclass correlation coefficient (ICC2,1) for n subjects and for two test occasions is defined by

$\mathrm{ICC}_{2,1} = \dfrac{\mathrm{BMS} - \mathrm{EMS}}{\mathrm{BMS} + \mathrm{EMS} + 2(\mathrm{JMS} - \mathrm{EMS})/n}$   [1]

where BMS is the between-subjects mean square, EMS is the residual mean square, JMS is the within-subjects mean square, all obtained from the two-way analysis of variance, and n is the number of subjects.

CHANGES IN THE MEAN

To evaluate changes in the mean between two test occasions, d̄ is defined by

$\bar{d} = \text{the mean difference between the two test occasions}$   [2]

The variation about d̄ is assessed by

$\mathrm{SD}_{\mathrm{diff}} = \text{the standard deviation of the differences}$   [3]

The variation of d̄ is assessed by the standard error of the mean (SE)

$\mathrm{SE} = \mathrm{SD}_{\mathrm{diff}} / \sqrt{n}$   [4]

where n is the number of subjects. SDdiff represents the variability of the differences between the two test occasions and SE represents the precision of d̄ as an estimate of the underlying change in the mean. An approximate 95% confidence interval (95% CI) of the mean difference (d̄) between the two test occasions is defined by

$95\%\ \mathrm{CI} = \bar{d} \pm 2 \times \mathrm{SE}$   [5]

The multiplier of SE in Equation 5 depends on the number of subjects, n; 2 is a good approximation when n is >20. For the data in Figure 1, it should be 2.045, which is obtained from the t-table with 29 (n − 1) degrees of freedom.

MEASUREMENT VARIABILITY

The method error (ME) is defined by

$\mathrm{ME} = \mathrm{SD}_{\mathrm{diff}} / \sqrt{2}$   [6]

where SDdiff comes from Equation 3.

The standard error of measurement (SEM) is defined by

$\mathrm{SEM} = \sqrt{\mathrm{WMS}}$   [7]

where WMS is the mean square error term from the analysis of variance.

The coefficient of variation (CV%) is defined by

$\mathrm{CV\%} = (\mathrm{ME} / \mathrm{mean}) \times 100$   [8]

The SEM% is defined by

$\mathrm{SEM\%} = (\mathrm{SEM} / \mathrm{mean}) \times 100$   [9]

where mean in Equations 8 and 9 is the mean of all the data from the two test occasions.

CLINICALLY IMPORTANT CHANGES

The smallest real difference (SRD) is defined by

$\mathrm{SRD} = 1.96 \times \mathrm{SEM} \times \sqrt{2}$   [10]

The 95% SRD is defined by

$95\%\ \mathrm{SRD} = \bar{d} \pm \mathrm{SRD}$   [11]

The SRD% is defined by

$\mathrm{SRD\%} = (\mathrm{SRD} / \mathrm{mean}) \times 100$   [12]

where mean in Equation 12 is the mean of all the data from the two test occasions.

REFERENCES

1. Rothstein JM: Measurement and clinical practise: Theory and application, in Rothstein JM (ed): Measurement in Physical Therapy. New York, Churchill Livingstone, 1985
2. Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–10
3. Atkinson G, Nevill AM: Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med 1998;26:217–38
4. Bland JM, Altman DG: Measuring agreement in method comparison studies. Stat Methods Med Res 1999;8:135–60
5. Holmbäck AM, Porter MM, Downham DY, et al: Reliability of isokinetic ankle dorsiflexor strength measurements in healthy young men and women. Scand J Rehabil Med 1999;31:229–39
6. Holmbäck AM, Porter MM, Downham DY, et al: Ankle dorsiflexor muscle performance in healthy young men and women: Reliability of eccentric peak torque and work. J Rehabil Med 2001;33:90–6
7. Hopkins WG: Measures of reliability in sports medicine and science. Sports Med 2000;30:1–15
8. Rankin G, Stokes M: Reliability of assessment tools in rehabilitation: An illustration of appropriate statistical analyses. Clin Rehabil 1998;12:187–99
9. Shrout PE, Fleiss JL: Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 1979;86:420–8
10. Baumgartner TA: Norm-referenced measurement: Reliability, in Safrit M, Wood T (eds): Measurement Concepts in Physical Education and Exercise Science. Champaign, IL, Human Kinetics, 1989, pp 45–72
11. Sim J, Wright CC: The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Phys Ther 2005;85:257–68
12. Fleiss JL: The Design and Analysis of Clinical Experiments. New York, John Wiley & Sons, 1986, pp 1–31
13. Portney LG, Watkins MP: Statistical measures of reliability, in: Foundations of Clinical Research: Applications to Practice. Englewood Cliffs, NJ, Prentice Hall, 1993, pp 525–6
14. Beckerman H, Roebroeck ME, Lankhorst GJ, et al: Smallest real difference: A link between reproducibility and responsiveness. Qual Life Res 2001;10:571–8
15. Flansbjer UB, Holmbäck AM, Downham DY, et al: Reliability of gait performance test in men and women with hemiparesis after stroke. J Rehabil Med 2005;37:75–82
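As a supplement to Appendix Equation 1, the following Python sketch derives the mean squares of a two-way analysis of variance for n subjects measured on two occasions and substitutes them into the ICC2,1 formula. It is a sketch under stated assumptions rather than the authors' procedure: the function name `icc_2_1` and the data are hypothetical, and JMS is computed here as the between-occasions mean square, which is how the corresponding term is defined in Shrout and Fleiss's formulation (9), although the Appendix text describes it as the within-subjects mean square.

```python
def icc_2_1(occasion1, occasion2):
    """ICC(2,1) for n subjects measured on two occasions (Appendix Equation 1)."""
    if len(occasion1) != len(occasion2) or len(occasion1) < 2:
        raise ValueError("need two equally long sequences with at least 2 subjects")
    n, k = len(occasion1), 2
    pairs = list(zip(occasion1, occasion2))
    grand_mean = sum(a + b for a, b in pairs) / (n * k)
    subject_means = [(a + b) / k for a, b in pairs]
    occasion_means = [sum(occasion1) / n, sum(occasion2) / n]

    # Sums of squares of the two-way layout (subjects x occasions, one value per cell).
    ss_total = sum((x - grand_mean) ** 2 for pair in pairs for x in pair)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_occasions = n * sum((m - grand_mean) ** 2 for m in occasion_means)
    ss_residual = ss_total - ss_subjects - ss_occasions

    bms = ss_subjects / (n - 1)               # between-subjects mean square
    # Assumption: JMS is taken as the between-occasions mean square.
    jms = ss_occasions / (k - 1)
    ems = ss_residual / ((n - 1) * (k - 1))   # residual mean square
    icc = (bms - ems) / (bms + ems + 2 * (jms - ems) / n)
    return {"BMS": bms, "JMS": jms, "EMS": ems, "ICC": icc}


# Hypothetical strength values (Nm) for six subjects; these are NOT the paper's data.
occasion1 = [30.0, 28.0, 35.0, 32.0, 31.0, 29.0]
occasion2 = [31.0, 27.0, 36.0, 33.0, 30.0, 30.0]
anova = icc_2_1(occasion1, occasion2)
print(anova)
```

For two occasions the residual mean square equals half the variance of the differences, so its square root coincides with the method error of Equation 6. The sketch does not guard against the degenerate case in which every measurement is identical, where the denominator of Equation 1 is zero.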