Reliability what is it, and how is it measured


Published in: Technology, Business

Reliability: What is it, and how is it measured?

by Anne Bruton, Joy H Conway, Stephen T Holgate

Key Words: Reliability, measurement, quantitative measures, statistical method.

Bruton, A, Conway, J H and Holgate, S T (2000). ‘Reliability: What is it and how is it measured?’ Physiotherapy, 86, 2, 94-99.
Physiotherapy February 2000/vol 86/no 2

Summary

Therapists regularly perform various measurements. How reliable these measurements are in themselves, and how reliable therapists are in using them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value. The aim of this paper is to explain the nature of reliability, and to describe some of the commonly used estimates that attempt to quantify it. An understanding of reliability, and how it is estimated, will help therapists to make sense of their own clinical findings, and to interpret published studies.

Although reliability is generally perceived as desirable, there is no firm definition as to the level of reliability required to reach clinical acceptability. As with hypothesis testing, statistically significant levels of reliability may not translate into clinically acceptable levels, so that some authors’ claims about reliability may need to be interpreted with caution. Reliability is generally population specific, so that caution is also advised in making comparisons between studies.

The current consensus is that no single estimate is sufficient to provide the full picture about reliability, and that different types of estimate should be used together.

Introduction

Therapists regularly perform various measurements of varying reliability. The term ‘reliability’ here refers to the consistency or repeatability of such measurements. Irrespective of the area in which they work, therapists take measurements for any or all of the reasons outlined in table 1. How reliable these measurements are in themselves, and how reliable therapists are in performing them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value.

Table 1: Common reasons why therapists perform measurements
  • As part of patient assessment.
  • As baseline or outcome measures.
  • As aids to deciding upon treatment plans.
  • As feedback for patients and other interested parties.
  • As aids to making predictive judgements, eg about outcome.

This article focuses on the reliability of measures that generate quantitative data, and in particular ‘interval’ and ‘ratio’ data. Interval data have equal intervals between numbers but these are not related to true zero, so do not represent absolute quantity. Examples of interval data are IQ and degrees Centigrade or Fahrenheit. In the temperature scale, the difference between 10° and 20° is the same as between 70° and 80°, but is based on the numerical value of the scale, not the true nature of the variable itself. Therefore the actual difference in heat and molecular motion generated is not the same and it is not appropriate to say that someone is twice as hot as someone else.

With ratio data, numbers represent units with equal intervals, measured from true zero, eg distance, age, time, weight, strength, blood pressure, range of motion, height. Numbers therefore reflect actual amounts of the variable being measured, and it is appropriate to say that one person is twice as heavy, tall, etc, as another. The kind of quantitative measures that therapists often carry out are outlined in table 2.

Table 2: Examples of quantitative measures performed by physiotherapists
  • Strength measures (eg in newtons of force, kilos lifted).
  • Angle or range of motion measures (eg in degrees, centimetres).
  • Velocity or speed measures (eg in litres per minute for peak expiratory flow rate).
  • Length or circumference measures (eg in metres, centimetres).

The aim of this paper is to explain the nature of reliability, and to describe, in general terms, some of the commonly used methods for quantifying it. It is not intended to be a detailed account of the statistical
minutiae associated with reliability measures, for which readers are referred to standard books on medical statistics.

Measurement Error

It is very rare to find any clinical measurement that is perfectly reliable, as all instruments and observers or measurers (raters) are fallible to some extent and all humans respond with some inconsistency. Thus any observed score (X) can be thought of as a function of two components, ie a true score (T) and an error component (E):

    X = T ± E

The difference between the true value and the observed value is measurement error. In statistical terms, ‘error’ refers to all sources of variability that cannot be explained by the independent (also known as the predictor, or explanatory) variable. Since the error components are generally unknown, it is only possible to estimate the amount of any measurement that is attributable to error and the amount that represents an accurate reading. This estimate is our measure of reliability.

Measurement errors may be systematic or random. Systematic errors are predictable errors, occurring in one direction only, constant and biased. For example, when using a measurement that is susceptible to a learning effect (eg strength testing), a retest may be consistently higher than a prior test (perhaps due to improved motor unit co-ordination). Such a systematic error would not therefore affect reliability, but would affect validity, as test values are not true representations of the quantity being measured. Random errors are due to chance and unpredictable, thus they are the basic concern of reliability.

Types of Reliability

Baumgarter (1989) has identified two types of reliability, ie relative reliability and absolute reliability.

Relative reliability is the degree to which individuals maintain their position in a sample over repeated measurements. Tables 3 and 4 give some maximum inspiratory pressure (MIP) measures taken on two occasions, 48 hours apart. In table 3, although the differences between the two measures vary from –16 to +22 centimetres of water, the ranking remains unchanged. That is, on both day 1 and day 2 subject 4 had the highest MIP, subject 1 the second highest, subject 5 the third highest, and so on. This form of reliability is often assessed by some type of correlation coefficient, eg Pearson’s correlation coefficient, usually written as r. For table 3 the data give a Pearson’s correlation coefficient of r = 0.94, generally accepted to indicate a high degree of correlation.

Table 3: Repeated maximum inspiratory pressure measures data demonstrating good relative reliability

             MIP (cm of water)              Rank
  Subject    Day 1    Day 2    Difference   Day 1    Day 2
  1          110      120      +10          2        2
  2          94       105      +11          4        4
  3          86       70       –16          5        5
  4          120      142      +22          1        1
  5          107      107      0            3        3

In table 4, however, although the differences between the two measures look similar to those in table 3 (ie –15 to +22 cm of water), on this occasion the ranking has changed. Subject 4 has the highest MIP on day 1, but is second highest on day 2, subject 1 had the second highest MIP on day 1, but the lowest MIP on day 2, and so on. For table 4 data r = 0.51, which would be interpreted as a low degree of correlation. Correlation coefficients thus give information about association between two variables, and not necessarily about their proximity.

Table 4: Repeated maximum inspiratory pressure measures data demonstrating poor relative reliability

             MIP (cm of water)              Rank
  Subject    Day 1    Day 2    Difference   Day 1    Day 2
  1          110      95       –15          2        5
  2          94       107      +13          4        3
  3          86       97       +11          5        4
  4          120      120      0            1        2
  5          107      129      +22          3        1

Absolute reliability is the degree to which repeated measurements vary for individuals, ie the less they vary, the higher the reliability. This type of reliability is expressed either in the actual units of measurement, or as a proportion of the measured values. The standard error of measurement (SEM), coefficient of variation (CV) and Bland and Altman’s 95% limits of agreement (1986) are all examples of measures of absolute reliability. These will be described later.
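
The relative reliability comparison for tables 3 and 4 can be checked with a short calculation. The sketch below is illustrative only: the function names `pearson_r` and `ranks` are assumptions introduced here, not part of the original paper, and the data are the MIP values from tables 3 and 4. It should give r of about 0.94 for table 3 with unchanged rankings, and a value close to 0.5 for table 4 (the text quotes 0.51) with changed rankings.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank each value, 1 = highest (assumes no tied values)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


# MIP in cm of water, subjects 1 to 5 (tables 3 and 4 share the same day 1 data)
day1 = [110, 94, 86, 120, 107]
table3_day2 = [120, 105, 70, 142, 107]
table4_day2 = [95, 107, 97, 120, 129]

r_table3 = pearson_r(day1, table3_day2)
r_table4 = pearson_r(day1, table4_day2)

print("Table 3: r =", round(r_table3, 2), "ranks kept:", ranks(day1) == ranks(table3_day2))
print("Table 4: r =", round(r_table4, 2), "ranks kept:", ranks(day1) == ranks(table4_day2))
```

Note that the size of the individual differences is similar in both tables; only the preservation of rank order separates them, which is what a correlation coefficient responds to.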
Authors

Anne Bruton MA MCSP is currently involved in postgraduate research, Joy H Conway PhD MSc MCSP is a lecturer in physiotherapy, and Stephen T Holgate MD DSc FRCP is MRC professor of immunopharmacology, all at the University of Southampton.

This article was received on November 16, 1998, and accepted on September 7, 1999.

Address for Correspondence

Ms Anne Bruton, Health Research Unit, School of Health Professions and Rehabilitation Sciences, University of Southampton, Highfield, Southampton SO17 1BJ.

Funding

Anne Bruton is currently sponsored by a South and West Health Region R&D studentship.

Why Estimate Reliability?

Reliability testing is usually performed to assess one of the following:
  • Instrumental reliability, ie the reliability of the measurement device.
  • Rater reliability, ie the reliability of the researcher/observer/clinician administering the measurement device.
  • Response reliability, ie the reliability/stability of the variable being measured.

How is Reliability Measured?

As described earlier, observed scores consist of the true value ± the error component. Since it is not possible to know the true value, the true reliability of any test is not calculable. It can however be estimated, based on the statistical concept of variance, ie a measure of the variability of differences among scores within a sample. The greater the dispersion of scores, the larger the variance; the more homogeneous the scores, the smaller the variance.

If a single measurer (rater) were to record the oxygen saturation of an individual 10 times, the resulting scores would not all be identical, but would exhibit some variance. Some of this total variance is due to true differences between scores (since oxygen saturation fluctuates), but some can be attributable to measurement error (E). Reliability (R) is the measure of the amount of the total variance attributable to true differences and can be expressed as the ratio of true score variance (T) to total variance or:

    R = T / (T + E)

This ratio gives a value known as a reliability coefficient. As the observed score approaches the true score, reliability increases, so that with zero error there is perfect reliability and a coefficient of 1, because the observed score is the same as the true score. Conversely, as error increases reliability diminishes, so that with maximal error there is no reliability and the coefficient approaches 0. There is, however, no such thing as a minimum acceptable level of reliability that can be applied to all measures, as this will vary depending on the use of the test.

Indices of Reliability

In common with medical literature, physiotherapy literature shows no consistency in authors’ choice of reliability estimate calculated for their data. Table 5 summarises the more common reliability indices found in the literature, which are described below.

Table 5: Reliability indices in common use
  • Hypothesis tests for bias, eg paired t-test, analysis of variance.
  • Correlation coefficients, eg Pearson’s, ICC.
  • Standard error of measurement (SEM).
  • Coefficient of variation (CV).
  • Repeatability coefficient.
  • Bland and Altman 95% limits of agreement.

Indices Based on Hypothesis Testing for Bias

The paired t-test, and analysis of variance techniques are statistical methods for detecting systematic bias between groups of data. These estimates, based upon hypothesis testing, are often used in reliability studies. However, they give information only about systematic differences between the means of two sets of data, not about individual differences. Such tests should, therefore, not be used in isolation, but be complemented by other methods, eg Bland and Altman agreement tests (1986).

Correlation Coefficients (r)

As stated earlier, correlation coefficients give information about the degree of association between two sets of data, or the consistency of position within the two distributions. Provided the relative positions of each subject remain the same from test to test, high measures of correlation will be obtained. However, a correlation coefficient will not detect any systematic errors. So it is possible to have two sets of scores that are highly correlated, but not highly repeatable, as in table 6 where the hypothetical data give a Pearson’s correlation coefficient of r = 1, ie perfect correlation despite a systematic difference of 40 cm of water for each subject.

Thus correlation only tells how two sets of scores vary together, not the extent of agreement between them. Often researchers need to know that the actual values obtained by two measurements are the same, not just proportional to one another. Although published studies abound with correlation used as the sole indicator of reliability, their results can be misleading, and it is now recommended that they be no longer used in isolation (Keating and Matyas, 1998; Chinn, 1990).
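
The two ideas above, the reliability ratio R = T / (T + E) and the blindness of correlation to systematic error, can be illustrated with a minimal sketch. The function names are assumptions made for illustration, and the variance figures passed to `reliability_coefficient` are invented; the day 1 and day 2 values are the hypothetical MIP data of table 6.

```python
from math import sqrt
from statistics import mean


def reliability_coefficient(true_variance, error_variance):
    """R = T / (T + E): share of total variance due to true differences."""
    return true_variance / (true_variance + error_variance)


def pearson_r(x, y):
    """Pearson's product-moment correlation for two equal-length lists."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented variance components: zero error gives R = 1, more error lowers R
print(reliability_coefficient(9.0, 0.0))
print(reliability_coefficient(9.0, 1.0))

# Table 6: every day 2 value is exactly 40 cm of water above day 1
day1 = [110, 94, 86, 120, 107]
day2 = [150, 134, 126, 160, 147]
differences = [b - a for a, b in zip(day1, day2)]

r = pearson_r(day1, day2)
systematic_difference = mean(differences)
print("r =", round(r, 2), "mean difference =", systematic_difference)
```

The correlation is perfect even though no subject produced the same value twice, which is why the mean of the paired differences (or a hypothesis test for bias) is needed alongside r.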
Table 6: Repeated maximum inspiratory pressure measures data demonstrating a high Pearson’s correlation coefficient, but poor absolute reliability

             MIP (cm of water)              Rank
  Subject    Day 1    Day 2    Difference   Day 1    Day 2
  1          110      150      +40          2        2
  2          94       134      +40          4        4
  3          86       126      +40          5        5
  4          120      160      +40          1        1
  5          107      147      +40          3        3

Intra-class Correlation Coefficient (ICC)

The intra-class correlation coefficient (ICC) is an attempt to overcome some of the limitations of the classic correlation coefficients. It is a single index calculated using variance estimates obtained through the partitioning of total variance into between and within subject variance (known as analysis of variance or ANOVA). It thus reflects both degree of consistency and agreement among ratings.

There are numerous versions of the ICC (Shrout and Fleiss, 1979) with each form being appropriate to specific situations. Readers interested in using the ICC can find worked examples relevant to rehabilitation in various published articles (Rankin and Stokes, 1998; Keating and Matyas, 1998; Stratford et al, 1984; Eliasziw et al, 1994). The use of the ICC implies that each component of variance has been estimated appropriately from sufficient data (at least 25 degrees of freedom), and from a sample representing the population to which the results will be applied (Chinn, 1991). In this instance, degrees of freedom can be thought of as the number of subjects multiplied by the number of measurements.

As with other reliability coefficients, there is no standard acceptable level of reliability using the ICC. It will range from 0 to 1, with values closer to one representing the higher reliability. Chinn (1991) recommends that any measure should have an intra-class correlation coefficient of at least 0.6 to be useful. The ICC is useful when comparing the repeatability of measures using different units, as it is a dimensionless statistic. It is most useful when three or more sets of observations are taken, either from a single sample or from independent samples. It does, however, have some disadvantages as described by Rankin and Stokes (1998) that make it unsuitable for use in isolation. As described earlier, any reliability coefficient is determined as the ratio of variance between subjects to the sum of error variance and subject variance. If the variance between subjects is sufficiently high (that is, the data come from a heterogeneous sample) then reliability will inevitably appear to be high. Thus if the ICC is applied to data from a group of individuals demonstrating a wide range of the measured characteristic, reliability will appear to be higher than when applied to a group demonstrating a narrow range of the same characteristic.

Standard Error of Measurement (SEM)

As mentioned earlier, if any measurement test were to be applied to a single subject an infinite number of times, it would be expected to generate responses that vary a little from trial to trial, as a result of measurement error. Theoretically these responses could be plotted and their distribution would follow a normal curve, with the mean equal to the true score, and errors occurring above and below the mean.

The more reliable the measurement response, the less error variability there would be around the mean. The standard deviation of measurement errors is therefore a reflection of the reliability of the test response, and is known as the standard error of measurement (SEM). The value for the SEM will vary from subject to subject, but there are equations for calculating a group estimate, eg

    SEM = s_x × √(1 – r_xx)

where s_x is the standard deviation of the set of observed test scores and r_xx is the reliability coefficient for those data (often the ICC is used here).

The SEM is a measure of absolute reliability and is expressed in the actual units of measurement, making it easy to interpret, ie the smaller the SEM, the greater the reliability. It is only appropriate, however, for use with interval data (Atkinson and Nevill, 1998) since with ratio data the amount of random error may increase as the measured values increase.
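
The group estimate of the SEM described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is invented, the sample standard deviation (n – 1 denominator) is assumed for s_x, the scores are simply the day 1 MIP values from table 3, and the reliability coefficient of 0.90 is a hypothetical figure, not a value reported in the paper.

```python
from math import sqrt
from statistics import stdev


def standard_error_of_measurement(scores, reliability):
    """Group SEM = s_x * sqrt(1 - r_xx), in the units of the scores."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    # stdev() is the sample standard deviation (n - 1 denominator); an assumption
    return stdev(scores) * sqrt(1.0 - reliability)


scores = [110, 94, 86, 120, 107]   # cm of water, day 1 values from table 3
hypothetical_icc = 0.90            # assumed reliability coefficient, for illustration

sem = standard_error_of_measurement(scores, hypothetical_icc)
print("SEM =", round(sem, 2), "cm of water")
```

With a reliability coefficient of 1 the SEM is zero, and with a coefficient of 0 it equals the standard deviation of the scores, which matches the interpretation that a smaller SEM means greater reliability.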
Coefficient of Variation (CV)

The CV is an often-quoted estimate of measurement error, particularly in laboratory studies where multiple repeated tests are standard procedure. One form of the CV is calculated as the standard deviation of the data, divided by the mean and multiplied by 100 to give a percentage score. This expresses the standard deviation as a proportion of the mean, making it unit independent. However, as Bland (1987) points out, the problem with expressing the error as a percentage is that x% of the smallest observation will differ markedly from x% of the largest observation. Chinn (1991) suggests that it is preferable to use the ICC rather than the CV, as the former relates the size of the error variation to the size of the variation of interest. It has been suggested that the above form of the CV should no longer be used to estimate reliability, and that other more appropriate methods should be employed based on analysis of variance of logarithmically transformed data (Atkinson and Nevill, 1998).

Repeatability Coefficient

Another way to present measurement error over two tests, as recommended by the British Standards Institution (1979), is the value below which the difference between the two measurements will lie with probability 0.95. This is based upon the within-subject standard deviation (s). Provided the measurement errors are from a normal distribution this can be estimated by 1.96 × √(2s²), ie approximately 2.77s (or 2.83s if the multiplier 1.96 is rounded up to 2), and is known as the repeatability coefficient (Bland and Altman, 1986). This name is rather confusing, as other coefficients (eg reliability coefficient) are expected to be unit free and in a range from zero to one. The method of calculation varies slightly in two different references (Bland and Altman, 1986; Bland, 1987), and to date it is not a frequently quoted statistic.

Bland and Altman Agreement Tests

In 1986 The Lancet published a paper by Bland and Altman that is frequently cited and has been instrumental in encouraging changing use of reliability estimates in the medical literature. In the past, studies comparing the reliability of two different instruments designed to measure the same variable (eg two different types of goniometer) often quoted correlation coefficients and ICCs. These can both be misleading, however, and are not appropriate for method comparison studies for reasons described by Bland and Altman in their 1986 paper. These authors have therefore proposed an approach for assessing agreement between two different methods of clinical measurement. This involves calculating the mean for each method and using this in a series of agreement tests.

Step 1 consists of plotting the difference in the two results against the mean value from the two methods. Step 2 involves calculating the mean and standard deviation of the differences between the measures. Step 3 consists of calculating the 95% limits of agreement (as the mean difference plus or minus two standard deviations of the differences), and 95% confidence intervals for these limits of agreement. The advantages of this approach are that by using scatterplots, data can be visually interpreted fairly swiftly. Any outliers, bias, or relationship between variance in measures and size of the mean can therefore be observed easily. The 95% limits of agreement provide a range of error that may relate to clinical acceptability, although this needs to be interpreted with reference to the range of measures in the raw data.

In the same paper, Bland and Altman have a section headed ‘Repeatability’ in which they recommend the use of the ‘repeatability coefficient’ (described earlier) for studies involving repeated measures with the same instrument. In their final discussion, however, they suggest that their agreement testing approach may be used either for analysis of repeatability of a single measurement method, or for method comparison studies. Worked examples using Bland and Altman agreement tests can be found in their original paper, and more recently in papers by Atkinson and Nevill (1998) and Rankin and Stokes (1998).

Nature of Reliability

Unfortunately, the concept of reliability is complex, with less of the straightforward ‘black and white’ statistical theory that surrounds hypothesis testing. When testing a research hypothesis there are clear guidelines to help researchers and clinicians decide whether results indicate that the hypothesis can be supported or not. In contrast, the decision as to whether a particular measurement tool or method is reliable or not is more open to interpretation. The decision to be made is whether the level of measurement error is
considered acceptable for practical use. There are no firm rules for making this decision, which will inevitably be context based. An error of ±5° in goniometry measures may be clinically acceptable in some circumstances, but may be less acceptable if definitive clinical decisions (eg surgical intervention) are dependent on the measure. Because of this dependence on the context in which they are produced, it is therefore very difficult to make comparisons of reliability across different studies, except in very general terms.

Conclusion

This paper has attempted to explain the concept of reliability and describe some of the estimates commonly used to quantify it. Key points to note about reliability are summarised in the panel below. Reliability should not necessarily be conceived as a property that a particular instrument or measurer does or does not possess. Any instrument will have a certain degree of reliability when applied to certain populations under certain conditions. The issue to be addressed is what level of reliability is considered to be clinically acceptable. In some circumstances there may be a choice only between a measure with lower reliability or no measure at all, in which case the less than perfect measure may still add useful information.

In recent years several authors have recommended that no single reliability estimate should be used for reliability studies. Opinion is divided over exactly which estimates are suitable for which circumstances. Rankin and Stokes (1998) have recently suggested that a consensus needs to be reached to establish which tests should be adopted universally. In general, however, it is suggested that no single estimate is universally appropriate, and that a combination of approaches is more likely to give a true picture of reliability.

Key Messages

Reliability is:
  • Population specific.
  • Related to the variability in the group studied.
  • Not the same as clinical acceptability.
  • Not an all-or-none phenomenon.
  • Open to interpretation.
  • Best estimated by more than one index.

References

Atkinson, G and Nevill, A M (1998). ‘Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine’, Sports Medicine, 26, 217-238.
Baumgarter, T A (1989). ‘Norm-referenced measurement: reliability’ in: Safrit, M J and Wood, T M (eds) Measurement Concepts in Physical Education and Exercise Science, Champaign, Illinois, pages 45-72.
Bland, J M (1987). An Introduction to Medical Statistics, Oxford University Press.
Bland, J M and Altman, D G (1986). ‘Statistical methods for assessing agreement between two methods of clinical measurement’, The Lancet, February 8, 307-310.
British Standards Institution (1979). ‘Precision of test methods. 1: Guide for the determination and reproducibility for a standard test method’ BS5497, part 1. BSI, London.
Chinn, S (1990). ‘The assessment of methods of measurement’, Statistics in Medicine, 9, 351-362.
Chinn, S (1991). ‘Repeatability and method comparison’, Thorax, 46, 454-456.
Eliasziw, M, Young, S L, Woodbury, M G et al (1994). ‘Statistical methodology for the concurrent assessment of inter-rater and intra-rater reliability: Using goniometric measurements as an example’, Physical Therapy, 74, 777-788.
Keating, J and Matyas, T (1998). ‘Unreliable inferences from reliable measurements’, Australian Journal of Physiotherapy, 44, 5-10.
Rankin, G and Stokes, M (1998). ‘Reliability of assessment tools in rehabilitation: An illustration of appropriate statistical analyses’, Clinical Rehabilitation, 12, 187-199.
Shrout, P E and Fleiss, J L (1979). ‘Intraclass correlations: Uses in assessing rater reliability’, Psychological Bulletin, 86, 420-428.
Stratford, P, Agostino, V, Brazeau, C and Gowitzke, B A (1984). ‘Reliability of joint angle measurement: A discussion of methodology issues’, Physiotherapy Canada, 36, 1, 5-9.
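
As a closing worked illustration of the absolute reliability indices discussed in this paper (the coefficient of variation, the repeatability coefficient, and the Bland and Altman limits of agreement), the sketch below computes each one. It is not the authors' own procedure: the function names are assumptions, the sample standard deviation is assumed throughout, the repeated readings used for the CV and the within-subject standard deviation of 5 are invented, and the limits of agreement use the day 1 and day 2 MIP values of table 3 with the "plus or minus two standard deviations" rule stated in the text. Confidence intervals for the limits are not computed.

```python
from math import sqrt
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV (%) = standard deviation / mean * 100."""
    return stdev(values) / mean(values) * 100


def repeatability_coefficient(within_subject_sd):
    """1.96 * sqrt(2 * s^2), ie about 2.77 * s."""
    return 1.96 * sqrt(2 * within_subject_sd ** 2)


def limits_of_agreement(first, second):
    """Return (mean difference, lower limit, upper limit) using +/- 2 SD."""
    differences = [b - a for a, b in zip(first, second)]
    bias = mean(differences)
    spread = 2 * stdev(differences)
    return bias, bias - spread, bias + spread


# Invented repeated readings on one subject, for the CV only
repeated_readings = [100, 102, 98, 101, 99]
cv = coefficient_of_variation(repeated_readings)

# Invented within-subject standard deviation of 5 units
rc = repeatability_coefficient(5.0)

# Table 3 MIP data (cm of water)
day1 = [110, 94, 86, 120, 107]
day2 = [120, 105, 70, 142, 107]
bias, lower, upper = limits_of_agreement(day1, day2)

print("CV (%):", round(cv, 2))
print("Repeatability coefficient:", round(rc, 2))
print("Bias and limits of agreement:", round(bias, 1), round(lower, 1), round(upper, 1))
```

For the table 3 data the limits of agreement are wide relative to the measured values even though r = 0.94, which illustrates why a relative index and an absolute index are best reported together.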