Hopkins. Adis International Limited. All rights reserved. Sports Med 2000 Jul; 30 (1)

tween repeated measurements; they are also a concern for anyone interested in a single measurement.

Studying the reliability of a measure is a straightforward matter of repeating the measurement a reasonable number of times on a reasonable number of individuals. The most important measurement error to come out of such a study is the random error or ‘noise’ in the measure: the smaller the error, the better the measure. How best to represent this error and several other measures of reliability is a matter of debate. Atkinson and Nevill[1] contributed a useful point of view in their review of reliability in this journal recently, but I have a different perspective on the relative merits of the various measures of reliability. In the present article I justify my choice of the most appropriate measures. I also explore the uses of reliability and deal with the design and analysis of reliability studies. My approach to reliability is appropriate for most variables that have numbers as values (e.g. 71.3kg for body mass). Reliability of measures that have labels as values (e.g. female for sex) is beyond the scope of the present article.

1. Measures of Reliability

When we speak of reliability, we refer to the repeatability or reproducibility of a measure or variable. I will sometimes follow the popular but inaccurate convention of referring not to the reliability of a measure but to the reliability of the test, assay or instrument that provided the measure. I will also use the word ‘trials’ to mean repeated administrations of a test or assay.

Researchers quantify reliability in a variety of ways. I deal here with what I believe are the only 3 important types of measure: within-subject variation, change in the mean, and retest correlation.[2]

1.1 Within-Subject Variation

Within-subject variation is the most important type of reliability measure for researchers, because it affects the precision of estimates of change in the variable of an experimental study. It is also the most important type of reliability measure for coaches, physicians, scientists and other professionals using tests to monitor the performance or health of their clients. In these situations, the smaller the within-subject variation, the easier it will be to notice or measure a change in performance or health.

An easy way to understand the meaning of within-subject variation is to regard it as the random variation in a measure when one individual is tested many times. For example, if the values for many trials of one individual are 71, 76, 74, 79, 79 and 76, there is a random variation of a few units between trials. A statistic that captures this notion of random variability of a single individual’s values on repeated testing is the standard deviation of the individual’s values. This within-subject standard deviation is also known as the standard error of measurement. In plain language, it represents the typical error in a measurement, and that is how I will refer to it hereafter.

The variation represented by typical error comes from several sources. The main source is usually biological. For example, an individual’s maximum power output changes between trials because of changes in mental or physical state. Equipment may also contribute noise to the measurements, although in simple reliability studies this technological source of error is often unavoidably lumped in with the biological error. When the same individual is retested on different equipment or by different operators, additional error due to differences in the calibration or functioning of the equipment or in the ability of the operators can surface. An analogous situation occurs when different judges rate the same athlete in different locations. I will deal with these and other complex examples of reliability in section 3.3.

In most situations where reliability is an issue, we are interested in the simple question of reproducibility of an individual’s values obtained on the same piece of equipment by the same operator. To estimate typical error in these situations, we usually use many participants and a few trials rather than 1 participant and many trials. For example, for 5 participants in 2 trials, with the values shown in table I, the typical error is 2.9. We can still interpret the typical error of 2.9 as the variation we would
expect to see from trial to trial if any one of these participants performed multiple trials.

Table I. Data from a reliability study for a variable measured twice in 5 participants

Participant   Trial 1   Trial 2
Kim           62        67
Lou           78        76
Pat           81        87
Sam           55        55
Vic           66        63

When a group of volunteers performs 2 or more trials, there is always a change in the mean value between trials. In the above example, the means in the first and second trial are 68.4 and 69.6, respectively, so there is a change in the mean of 1.2. Change in the mean is itself a measure of reliability that I discuss in more detail in the next section. I introduce the concept here to point out that, for almost all applications of reliability, it is important to have an estimate of typical error that is unaffected by a change in the mean. The values of the change score or difference score for each volunteer yield such an estimate: simply divide the standard deviation of the difference score by $\sqrt{2}$. In the above example, the difference scores are 5, -2, 6, 0 and -3; the standard deviation of these scores is 4.1, so the typical error is $4.1/\sqrt{2} = 2.9$. This method for calculating the typical error follows from the fact that the variance of the difference score ($s_{\mathrm{diff}}^2$) is equal to the sum of the variances representing the typical error in each trial: $s_{\mathrm{diff}}^2 = s^2 + s^2$, so $s = s_{\mathrm{diff}}/\sqrt{2}$.

For many measurements in sports medicine and science, the typical error gets bigger as the value of the measure gets bigger.[3] For example, several trials on an ergometer for one athlete might yield power output with a mean and typical error of 378.6 ± 4.4W, whereas a stronger athlete performing the same trials might produce 453.1 ± 6.1W. Although the absolute values of the typical errors are somewhat different, the values expressed as a percentage of their respective means are similar: 1.2 and 1.3%. This form of the typical error is a coefficient of variation. It is sometimes more applicable to every participant than the raw typical error. As a dimensionless measure, it also allows direct comparison of reliability of measures irrespective of calibration or scaling. Thus it facilitates comparison of reliability between ergometers, analysers, tests or populations of volunteers. I will refer to it in plain language as the typical percentage error.

Another measure of within-subject variation, limits of agreement, has begun to appear in reliability studies. Bland and Altman,[4] the researchers who devised this measure, realised that the difference scores between trials give a good indication of the reliability of the test. Instead of using the standard deviation of the difference scores directly, they calculated the range within which an individual’s difference scores would fall most (95%) of the time. In the above example of 5 individuals tested twice, the 95% limits of agreement are -10.1 and 12.5. The interpretation of these limits is as follows: on the basis of our 2 trials with 5 participants, when we test and then retest another participant, the score in the second trial has 1 chance in 20 of being more than 12.5 higher or more than 10.1 lower than the score in the first trial. Note that the limits in this example are not quite symmetrical, because the participants showed an average improvement of 1.2 in the second trial. It is preferable to take this improvement out of each limit and express the limits as 1.2 ± 11.3.

The relationship between the typical error and the limits of agreement is straightforward. Let the limits of agreement be $L$. As before, let the within-subject standard deviation (typical error) be $s$, and the standard deviation of the difference score be $s_{\mathrm{diff}}$. For simplicity, we will ignore any change in the mean between the trials. It follows from basic statistical theory that $L = \pm t_{0.975,\nu} \cdot s_{\mathrm{diff}}$, where $t_{0.975,\nu}$ is the value of the t statistic with cumulative probability 0.975 and $\nu$ degrees of freedom. But $s_{\mathrm{diff}} = s \cdot \sqrt{2}$, so:

$L = \pm t_{0.975,\nu} \cdot s \cdot \sqrt{2}$ (Eq. 1)

In our example of 5 participants, $s = 2.9$, $\nu = 4$ and $t_{0.975,4} = 2.8$, so the limits of agreement are $\pm (2.8)(\sqrt{2})s = \pm 3.9s = \pm 11.3$. When a reliability study
has a large sample size, $t_{0.975,\nu} = 1.96$, so $L = \pm 1.96 s \cdot \sqrt{2} = \pm 2.77s$, or approximately ±3 times the typical error. This formula is still valid when the typical error is expressed as a coefficient of variation; the corresponding limits of agreement are then percentage limits.

Should researchers use the typical error or the limits of agreement as a measure of within-subject variation? Atkinson and Nevill[1] favoured limits of agreement. I believe typical error is better. Here are my reasons.

• As I have just shown, the values of the limits of agreement depend on the sample size of the reliability study from which they are estimated. In statistical terms, the limits are biased. The bias is < 5% when there are more than 25 degrees of freedom (e.g. > 25 participants and 2 trials, or > 13 participants and 3 trials), but it rises to 21% for 7 degrees of freedom (8 participants and 2 trials). In most studies of reliability, between 8 and 30 volunteers perform only 2 trials. The resulting bias ranges from 21 to < 5%, so anyone comparing the magnitude of limits of agreement between studies must account for the number of degrees of freedom between the studies. This problem does not occur with the typical error, which has an expected value independent of sample size. Defenders of limits of agreement might argue that we should compute limits of agreement in all studies by multiplying the typical error by 2.77 rather than by the exact value derived from the t statistic with the right number of degrees of freedom. In that case, though, the level of confidence of the limits would not be well defined.

• Limits of agreement apply to the special case of variability of an individual’s values between pairs of trials, but they do not apply to the simplest situation of only one trial (e.g. a urine test for a banned substance). With a single trial, the user is interested in the error in the value of that trial, not in the error in the difference between the trial and some hypothetical previous or future trial. Characterising the variability of a single measurement with confidence limits for a difference score is therefore fatuous. Confidence limits for a single measurement would be more appropriate, but as a generic measure of within-subject variation this statistic would have the same bias problem as limits of agreement.

• The widespread use of 95% confidence limits to represent precision of the estimate of population parameters is not a basis for using 95% to define agreement limits for an individual participant’s difference scores. Even the use of 95% for confidence intervals is debatable, but I will not go into that issue here. Instead, I will show that 95% is too stringent for a decision limit, at least when the participant is an athlete. Let us assume we are monitoring the performance of a runner with a reasonably good running test, one that has 95% limits of agreement of ±7.0%. Proponents of limits of agreement would argue that an athlete or coach should be satisfied that something beneficial has happened between 2 trials only when there is an increase in performance of 7.0% or more. But with an observed change of +7.0%, there is a 97.5% probability (odds of 39 to 1) that performance is indeed better, or a 2.5% probability (odds of 1 to 39) that it is worse. In my view, this degree of certainty about a true change in performance is unrealistic: an individual would or should act on less. For example, half the limits of agreement seems a more reasonable threshold for action; with an observed enhancement of 3.5%, the probability that a true enhancement has occurred is still 84%, or odds of about 5 to 1 that performance is really better. Even smaller changes in performance are worthwhile for top runners,[2] but you would need a test with better reliability to be confident that such changes were more than just chance occurrences in this simple test-retest situation with a single athlete.

• There is an extensive theoretical base for reliability, the most developed form of which is known as generalisability theory.[5,6] Variances are the common coin for all computations in this literature. Anyone wishing to perform computations using a published typical error has only to square the published value to convert it to a variance. Procedures for calculating confidence limits of the variance (and therefore of the typical error) are also available. On the other hand, limits of agreement have to be converted to a variance by factoring in the appropriate number of degrees of freedom. The conversion is straightforward for simple reliability studies, but for more complex measures of reliability involving several variance components, counting the degrees of freedom may be a challenge. I am also uncertain whether the factor that converts typical error to limits of agreement is the appropriate factor to convert the confidence limits of the typical error to confidence limits of the limits of agreement, at least for < 25 degrees of freedom.

• Which measure is better for the purpose of teaching or learning about measurement error? Although the numerical difference between them is only a factor of approximately 3, conceptually they are quite different. In my opinion the concept of typical error is self-explanatory, and it conveys what measurement error is all about: variation in the values of repeated measurements. The concept of 95% confidence limits for the difference between 2 measurements narrows the focus of measurement error to one application: decision-making in a test-retest situation. This appears to be the only situation where limits of agreement would have an advantage over the typical error, if 95% confidence limits were appropriate for decisions affecting an individual.

Researchers and editors now have to consider which of these 2 measures they will publish in reliability studies. Publishing both is probably inappropriate, because they are too closely related.

1.2 Change in the Mean

This measure of reliability is simply the change in the mean value between 2 trials of a test. The change consists of 2 components: a random change and a systematic change (also known as systematic bias).

Random change in the mean is due to so-called sampling error. This kind of change arises purely from the random error of measurement, which inevitably makes the mean for each trial different. The random change is smaller with larger sample sizes, because the random errors from each measurement tend to cancel out when more measurements are added together for calculation of the mean.

Systematic change in the mean is a non-random change in the value between 2 trials that applies to all study participants. The simplest example of a systematic change is a learning effect or training effect: the participants perform the second trial better than the first, because they benefit from the experience of the first trial. In tests of human performance that depend on effort or motivation, volunteers might also perform the second trial better because they want to improve. Performance can be worse in a second trial if fatigue from the first trial is present at the time of the second trial. Performance can also decline in a series of trials, owing to loss of motivation.

Systematic change in the mean is an important issue when volunteers perform a series of trials as part of a monitoring programme. The volunteers are usually monitored to determine the effects of an intervention (e.g. a change in diet or training), so it is important to perform enough trials to make learning effects or other systematic changes negligible before applying the intervention.

Systematic changes are seemingly less important for researchers performing a controlled study, because it is the relative change in means for both groups that provides evidence of an effect. However, the magnitude of the systematic change is likely to differ between individuals, and these individual differences make the test less reliable by increasing the typical error (see section 2.3). Researchers should therefore choose or design tests or equipment with small learning effects, or they should get volunteers to perform practice (or familiarisation) trials to reduce learning effects.

1.3 Retest Correlation

This type of measure represents how closely the values of one trial track the values of another as we move our attention from individual to individual. If each participant has an identical value in both
trials, the correlation coefficient has a value of 1, and in a plot of the values of the 2 trials all points fall on a straight line. When the random error in the measurement swamps the real measurement, a plot of the values for 2 trials shows a random scatter of points, and the correlation coefficient approaches zero. The correlation also represents how well the rank order of participants in one trial is replicated in the second trial: the closer the correlation gets to 1, the better the replication.

The retest correlation is clearly a good measure of reliability, and it shares with typical percentage error the advantages of being dimensionless. However, the within-subject error is the better measure.[1,2] The main problem with retest correlation is that the value of the correlation is sensitive to the heterogeneity (spread) of values between participants. You can see this effect in a plot of points that have a strong correlation. If you focus on a small subsample of the participants in one part of the plot, the points for those individuals seem to be scattered randomly. As you expand the range of the subsample, the linearity in the scatter gradually emerges. This effect is also obvious from a formula that can be derived from the definition of reliability correlation:[7]

r = (pure subject variance)/(pure subject variance + typical error variance)
$= (S^2 - s^2)/S^2$
$= 1 - (s/S)^2$ (Eq. 2)

where $S$ is the usual between-subject standard deviation and $s$ is the typical error.

If the sample takes in a wide range of participants, $S$ is much greater than $s$, so $(s/S)^2$ approaches zero and the correlation approaches 1. As we focus in on a homogeneous subgroup, $S$ gets smaller until it equals $s$ in magnitude (i.e. any apparent difference between individuals is due entirely to the random error of measurement); therefore $(s/S)^2$ approaches 1, so the correlation approaches zero. Notice that the value of the retest correlation changes as we change the sample of participants, but at no time does the test itself change, and at no time does the typical error change. The typical error therefore captures the essence of the reliability of the test, but the retest correlation does not.

An important corollary is that the typical error can often be estimated from a sample of individuals that is not particularly representative of a population, or it can be estimated from multiple retests on just a few volunteers. Either way, the resulting typical error often applies to most individuals in the population, whereas the retest correlation applies only to individuals similar to those sampled to estimate the correlation. A further important corollary is that you cannot compare the reliability of 2 measures on the basis of their retest correlations alone: the worse measure (the one with the larger typical error) could have a higher retest correlation if its reliability was determined with a more heterogeneous sample.

Suppose you are satisfied that your participants are similar to those in the published reliability study. How do you decide whether the magnitude of the published correlation is acceptable for your purposes? Authors of reliability studies sometimes give what they consider to be acceptable values. For example, Kovaleski and co-workers[8] cited the classic Shrout and Fleiss paper on reliability[9] to support their claim that a clinically acceptable correlation was 0.75[8] or 0.80.[10] It turns out that Shrout and Fleiss[9] did not assess the utility of magnitudes of retest correlations. Atkinson and Nevill[1] were of the opinion that no-one had defined acceptable magnitudes of the retest correlation for practical use, although they did cite my statistics website[11] for the relationship between retest correlation and sample size in experimental studies (see section 2.2). In fact, there is another study,[12] on acceptable values of the validity correlation, that applies to reliability. In that study, Manly and I found that a test used to assign pass-fail grades needs to have a validity correlation of at least 0.90 to keep the error rate acceptable. Assigning 3 or more grades needs a test with even higher validity. If the only source of error in a test is random error of measurement (the typical error), it is easy to show that the validity correlation is the square root of the retest reliability. Thus tests need to have reliabilities of at
least $0.90^2 = 0.81$ to be trustworthy for yes-no decisions about further treatment of an individual, about selection of a team member, or for similar criterion-referenced assessments. I emphasise that this rule applies only when the between-subject standard deviation of your participants is similar to that in the reliability study.

2. Uses of Reliability

I have already mentioned how reliability affects the precision of single measurements and change scores. Anyone making decisions based on such measurements should take this precision into account. In particular, I give advice here on monitoring an individual for a real change. Another practical application of reliability is in the assessment of competing brands of equipment (section 3.3).

In research settings, an important use of reliability is to estimate sample size for experimental studies. Reliability can also be used to estimate the magnitude of individual differences in the response to the treatments in such studies. I outline procedures for these 2 uses below.

2.1 Monitoring an Individual

In section 1.1, I argued that an observed change equal in magnitude to the limits of agreement was probably too large to use as a threshold for deciding that a real change has occurred. A more realistic threshold appears to be about 1.5 to 2.0 times the typical error (or a little more than half the limits of agreement), because the corresponding odds of a real change are between 6 and 12 to 1. For example, if an anthropometrist’s typical error of measurement for the sum of 7 skinfolds is 1.6mm, an observed change of at least 2 to 3mm in an athlete’s skinfolds would indicate that a real change was likely.

The value of the typical error to use in such situations needs to come from a short term or concurrent reliability study, in which there is no true change in the individuals’ measurements between trials. For example, the typical error of measurement between skinfold assessments taken within 1 day would be appropriate for making decisions about changes in an individual over any time frame. In contrast, the typical error for use in estimation of sample size and individual differences in experiments needs to come from a reliability study of the same duration as the experiment.

2.2 Estimation of Sample Size

Most experiments consist of a pretest, a treatment and a post-test. The aim in these studies is to measure the change in the mean of a dependent variable between the pre- and post-tests. The typical error of the dependent variable represents noise that tends to obscure any change in the mean, so the magnitude of the typical error has a direct effect on the sample size needed to give a clear indication of the change in the mean.

In this section I develop formulae for estimating sample sizes from the typical error or retest correlation. The resulting sample sizes are often beyond the resources or inclination of researchers, but studies with smaller sample sizes nevertheless produce confidence limits that are more useful than nothing at all. These studies should therefore be published, perhaps designated as pilot studies, so they can be included in meta-analyses.

I advocate a new approach to sample size estimation, in which sample size is chosen to give adequate precision for an outcome.[2] Precision is defined by confidence limits: the range within which the true value of the outcome is 95% likely to occur. Adequate precision means that the outcome has no substantial change in impact on an individual volunteer over the range of values represented by the confidence limits. Let us apply this approach to an experiment.

For a crossover or simple test-retest experiment without a control group, basic statistical theory predicts confidence limits of $\pm t_{0.975,n-1} \cdot s \cdot \sqrt{2}/\sqrt{n}$ for a change in the mean, where $n$ is the sample size, $s$ is the typical error and $t$ is the t statistic. Equating this expression to the value of the confidence limits representing adequate precision, $\pm d$ say, and rearranging:

$n = 2(t \cdot s/d)^2 \approx 8s^2/d^2$ (Eq. 3)
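As a rough illustration of Eq. 3, the following Python sketch computes the approximate sample size from a typical error and a target precision. It is a minimal sketch under stated assumptions, not a procedure given in the article: the function names `half_width` and `approx_sample_size` are hypothetical, and the t statistic is fixed at 2, which is the large-sample simplification behind $8s^2/d^2$.

```python
import math


def half_width(typical_error: float, n: int, t: float = 2.0) -> float:
    """Half-width of the confidence limits for a change in the mean.

    Implements t * s * sqrt(2) / sqrt(n). Assumption of this sketch:
    t defaults to 2.0 instead of the exact t statistic for n - 1
    degrees of freedom.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    return t * typical_error * math.sqrt(2) / math.sqrt(n)


def approx_sample_size(typical_error: float, precision: float) -> int:
    """Large-sample form of Eq. 3: n is about 8 * (s / d)**2, rounded up."""
    if typical_error < 0 or precision <= 0:
        raise ValueError("typical_error must be >= 0 and precision > 0")
    return math.ceil(8 * (typical_error / precision) ** 2)


# Typical error equal to the target precision (s = d), then twice as large.
n_equal = approx_sample_size(1.0, 1.0)
n_double = approx_sample_size(2.0, 1.0)
print(n_equal, n_double)
```

Because t is held at 2, the result understates the sample size somewhat when n is small; using the exact t statistic for n - 1 degrees of freedom would require iterating, since t itself depends on n.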
The fact that sample size is proportional to the square of the typical error in this formula underscores the importance of high reliability in experimental research. For example, when the typical error of the test has the same magnitude as the smallest worthwhile effect ($s = d$), a sample of about 8 volunteers (more precisely 10) gives adequate precision in a simple experiment; a test with twice the typical error entails a study with about 4 times as many participants. This formula is easily adapted to more complex designs. For example, sample size for a study with participants equally divided between an experimental group and a control group is $4n$, or $32s^2/d^2$.

Choosing the value for $d$ depends on the nature of the outcome variable and the participants. In research on factors affecting athletic performance, $d$ is about half the typical error of an athlete’s performance between races.[2] The resulting sample sizes can be very large. For example, if race performance has half the typical error of performance in a laboratory test, a study with a control group needs a sample size of $n = 32s^2/((s/2)/2)^2 = 512$ to delimit the smallest worthwhile effect on performance.

When interest centres on experiments involving the average person in a population, Cohen[13] argued that clinical judgement should be guided by the spread of raw scores (not change scores) in the population, and suggested that the smallest worthwhile value of $d$ is 0.2 of the between-subject standard deviation. Thus, $0.2S = d = t_{0.975,n-1} \cdot s \cdot \sqrt{2}/\sqrt{n}$, so $n = 50(t \cdot s/S)^2$. But $(s/S)^2 = 1 - r$, where $r$ is the retest correlation, so:

$n = 50t^2(1 - r) \approx 200(1 - r)$ (Eq. 4)

Total sample size for a study with a control group is again $4n$, or $800(1 - r)$. The profound effect of reliability on sample size is again apparent: the sample size dwindles to a few individuals for a retest correlation that is nearly perfect, whereas the sample size is about 200 (800 with a control group) when the retest correlation is zero.

In the above estimate of sample size, the between-subject standard deviation, $S$, is made up of true between-subject variation ($S_T$) and an independent concurrent error of measurement ($e$), such that $S^2 = S_T^2 + e^2$. Ideally, we should consider the smallest worthwhile effect as a fraction of $S_T$ rather than of $S$, so the smallest worthwhile effect should be written as $0.2S_T = 0.2\sqrt{S^2 - e^2}$. If $e$ is the same as the typical error, $s$, it is easy to show from this equation that the sample size needs to be increased by a factor of $1/r$. This factor has little effect on sample size for high retest correlations, but sample size tends to infinity as $r$ tends to zero.

The concurrent error, $e$, may be different from the within-subject standard deviation, $s$. For example, in a 1-month study of skinfold thickness, $s$ is the error variation between an individual’s measurements separated by 1 month, but $e$ is the error variation between an individual’s skinfolds measured within a short period (e.g. the same day). Thus, $s$ includes variation due to real changes in skinfolds between individuals, but $e$ is simply the error in the technique of measurement. In this situation, sample size needs to be increased by a factor of $1/r_c$, where $r_c$ is the concurrent retest correlation, $(S^2 - e^2)/S^2$.

These formulae for sample size in studies of the average person in a population appear to show a primacy for retest correlation, but I must caution researchers that use of retest correlation is justified only if the sample in the reliability study is representative of the population in the experiment. In particular, it is wrong to use a retest correlation based on one population to estimate sample size in a study of a population with a different between-subject standard deviation. Most often there will be doubt about the applicability of the correlation from a published reliability study, so you should calculate sample size using, for example, $n = 50(t \cdot s/S)^2 \approx 200s^2/S^2$. Or, if you take concurrent reliability into account, $n \approx 200s^2/(S^2 - e^2)$. Reliability studies provide estimates of $s$ and $e$; $S$ comes either from a descriptive study of the population of interest or from a reliability study of a representative sample of the population.

Reliability has the same marked effect on sample size in the traditional approach to sample size estimation, which is usually based on 80% certainty of
observing statistical significance (p < 0.05) for the smallest worthwhile effect. The resulting sample sizes are about twice as big as those estimated using my approach. For an example related to human performance tests, see Eliasziw et al.[14]

The foregoing formulae for estimating sample size are based on the value of the typical error in the experiment itself. Of course, we do not know that value until we have performed the experiment, so we use the value from a reliability study instead. If the typical error in the experiment differs from that in the reliability study, the estimate of sample size will be misleading. For example, the time between trials may differ between the reliability study and the experiment, and this difference may have a substantial effect on the typical error. Other reasons for differences in the typical error between the experiment and reliability study include differences in equipment, researchers, environment and characteristics of the volunteers. The researcher who wants to perform a reliability study to estimate sample size for a subsequent experiment has some control over these factors, but 2 more factors that can affect the typical error are beyond his or her control. First, the treatment in the experiment may produce responses that differ between study participants. These individual differences in the response show up as an increased error in the post-test, thereby increasing the overall typical error in the experiment. Secondly, evidence from a recent study suggests that blinding participants to the treatment may increase the variability of responses between participants, again resulting in an increase in the typical error.[15] Any estimate of sample size based on typical error in a reliability study must therefore be regarded as a minimum.

2.3 Estimation of Individual Differences

When the response to an experimental treatment differs between participants, we say that there are individual differences in the response. For example, a treatment might increase the power output of athletes by a mean of 3%, but the variation in the true enhancement between individual athletes might be a standard deviation of 2.5%. In this example, most athletes would show positive responses to the treatment, some athletes would show little or no response and some would even respond negatively. Note that this figure of 2.5% is not simply the standard deviation of the difference scores, which would include variation due to typical error. When I refer to individual differences, I mean variation in the true effect free of typical error. Although the primary aim in an experiment is to estimate the mean enhancement, it is obviously important to know whether the individual differences are substantial. Analysis of reliability offers one approach to this problem.

When individual differences are present, study participants show a greater variability in the post-pre difference score. Analysis of the experimental group as a reliability study therefore yields an estimate of the typical error inflated by individual differences. Comparison of this inflated typical error with the typical error of the control group or with the typical error from a reliability study permits estimation of the magnitude of the individual differences as a standard deviation, $s_{ind}$ (2.5% in the above example). If the experiment consists of a pre-test, an intervention and a post-test, the estimate is readily derived from basic statistical principles as:

$s_{ind} = \sqrt{2s_{expt}^2 - 2s^2}$ (Eq. 5)

where $s_{expt}$ is the inflated typical error in the experimental group, and s is the typical error in the control group or in a reliability study. For example, if the typical error in the experimental group is 2% and the typical error in the control group or in a reliability study is 1%, the standard deviation of the individual differences ($s_{ind}$) is √6 ≈ 2.5%. Estimation of individual differences is also possible with mixed modelling,[16] which can also generate confidence limits for the estimate.

When individual differences are present, the obvious next step is to identify the participant characteristics that predict the individual differences. The appropriate analysis is repeated-measures analysis of covariance, with the likely participant characteristics (e.g. age, gender, fitness, genotype) as covariates.[16]
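The arithmetic of Eq. 5 can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis used in the article: the function names are hypothetical, and dividing the standard deviation of the difference scores by √2 assumes a simple 2-trial (pre-post) design with the same error in both trials. The factor of 2 in Eq. 5 reflects the fact that the variance of a post-pre difference score is twice the squared typical error, so the extra variance in the experimental group is attributed to the individual differences.

```python
import math
import statistics


def typical_error_from_differences(diffs):
    """Typical error for a 2-trial design: SD of the difference scores / sqrt(2)."""
    return statistics.stdev(diffs) / math.sqrt(2)


def individual_difference_sd(s_expt, s):
    """Eq. 5: SD of individual responses from the inflated typical error in the
    experimental group (s_expt) and the typical error in the control group or
    in a reliability study (s)."""
    variance = 2 * s_expt ** 2 - 2 * s ** 2
    if variance < 0:
        # Sampling variation can make s_expt smaller than s; this sketch
        # reports that case instead of returning a misleading number.
        raise ValueError("s_expt is smaller than s; Eq. 5 gives a negative variance")
    return math.sqrt(variance)


# Worked example from the text: typical errors of 2% and 1%.
s_ind = individual_difference_sd(2.0, 1.0)
print(round(s_ind, 2))  # 2.45, i.e. the square root of 6
```

Raising an error for a negative variance is a design choice of this sketch; an analysis by mixed modelling, as mentioned in the text, would handle that case differently and would also provide confidence limits.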
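For a simple pre-post design with a single covariate, the idea of relating individual responses to a participant characteristic can be sketched as an ordinary least-squares regression of the post-pre change score on the covariate. This is a deliberately simplified stand-in for a repeated-measures analysis of covariance, not the article's analysis: the function name and the data are hypothetical, only one covariate is handled, and no confidence limits are computed.

```python
def change_score_regression(pre, post, covariate):
    """Least-squares slope and intercept of (post - pre) on one covariate."""
    if not (len(pre) == len(post) == len(covariate)) or len(pre) < 3:
        raise ValueError("need at least 3 participants with pre, post and covariate values")
    change = [b - a for a, b in zip(pre, post)]
    n = len(change)
    mean_x = sum(covariate) / n
    mean_y = sum(change) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate has no variation between participants")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, change))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: power output (W) before and after a treatment, and age (y).
pre_power = [300.0, 310.0, 295.0, 320.0, 305.0]
post_power = [309.0, 316.0, 307.0, 323.0, 313.0]
age = [20, 25, 18, 30, 22]

slope, intercept = change_score_regression(pre_power, post_power, age)
print(f"change in response per year of age: {slope:.2f} W")
```

A slope that differs substantially from zero would suggest that the covariate accounts for some of the individual differences; judging whether it does requires confidence limits, which this sketch omits.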
3. Design and Analysis of Reliability Studies

A typical published reliability study consists of several trials performed on a sample of volunteers with 1 item of equipment and 1 operator of the equipment. The results of this simple kind of study meet the needs of most users of the test or equipment, provided the study has a sufficient number of participants and trials, and provided the analysis is appropriate. I will deal with design and analysis of such studies first, then discuss more complex studies.

3.1 Design of Simple Studies

The paramount concern in the design of any study is adequate precision for the estimates of the outcome measures. In a reliability study, the most important outcome measures are the typical error and the change in the mean between trials. The rationale for choosing a sample size that gives adequate precision for the estimate of systematic change in the mean presents a conundrum: the sample size must be the same as you would use in a simple experiment to delimit the smallest worthwhile effect of a treatment, but you cannot estimate that sample size without knowing the typical error. The researcher therefore has to base sample size for a reliability study solely on consideration of precision for the typical error.

Precision is defined, as usual, by the likely range (confidence limits) for the true value. Table II shows factors for computing the likely range of the typical error in reliability studies consisting of various numbers of participants and trials. Researchers can use this table to opt for a combination of trials and participants that gives an acceptable likely range for the typical error. The definition of ‘acceptable’ depends on the intended use of the typical error. Let us consider 2 common uses: estimation of sample size in an experiment and comparison of a new test with a published test.

Table II. Factors for generating the 95% likely range of the true value of a typical error from the value observed in a reliability study consisting of different numbers of participants and trials (a)

Participants    Trials
                2       3       4       5
7               1.94    1.55    1.42    1.35
10              1.68    1.42    1.32    1.26
15              1.49    1.32    1.24    1.21
20              1.40    1.26    1.20    1.17
30              1.30    1.20    1.16    1.14
50              1.22    1.15    1.12    1.10

(a) Multiply and divide an observed typical error by the factor to generate the upper and lower Tate and Klett[17] 95% confidence limits for the true value. Data were generated with a spreadsheet.[18]

Suppose we opt for 15 participants and 4 trials, and the observed typical error is 1.0%. From table II, the resulting likely range for the true typical error is 1.0 × 1.24 to 1.0 ÷ 1.24, or 1.24 to 0.81. The likely range for the sample size in the experiment could therefore be overestimated by a factor of 1.54 (= 1.24²) or underestimated by a factor of 0.65 (= 0.81²). These limits represent a large difference in the resources needed for the study, so we must conclude that 15 participants with 4 trials is hardly adequate for estimating reliability. Fifty participants and 3 trials reduce the factors to 1.32 and 0.76, which represent a more acceptable risk of wasting or underestimating resources for the experiment.

To compare the typical error of a new test with a published typical error for another test, we need the precision of the published typical error, or preferably the sample size and number of trials in that study. We then calculate confidence limits for the comparison of the typical errors, using the F ratio. For simplicity, let us assume that we perform our study with the same sample size and number of trials as in the published study, and that we obtain the same typical error. For 15 participants and 4 trials, the confidence limits for the ratio of the typical errors are 0.74 to 1.36. In other words, the typical error for our test could be as low as 0.74 of the typical error for the published test (which would make ours a far better test), or it could be as high as 1.4 of the published test (which would make ours far worse). Once again, 15 participants and 4 trials are clearly inadequate. For 50 participants in 3 trials, the confidence limits for the ratio of the typical errors are 0.82 to 1.22, from which we could conclude tentatively that there is no substantial difference between the 2 tests. Of course, if our test gave a substantially lower or higher typical error than that of the published test, we could make a firmer conclusion about the relative reliabilities, possibly with fewer participants or trials.

A further important design consideration is the number of practice trials needed before the typical error settles into its lowest value. Addressing this problem requires a reasonably accurate estimate of changes in the typical error between consecutive pairs of trials. In unpublished simulations, I have found that a sample size of at least 50 gives adequate precision for the estimate of the change in typical error. Reliability studies in which 50 or more volunteers perform 3 or more trials are rare in the literature. It seems we must accept most published reliability studies as pilot studies.

3.2 Analysis of Simple Studies

Analysis of reliability studies is straightforward when there are only 2 trials. The typical error can be derived from the standard deviation of the difference scores for each participant, and the change in the mean is simply the mean of the difference scores. For 3 or more trials, I urge researchers to check for learning effects on the typical error by performing separate analyses on consecutive pairs of trials (trials 1+2, trials 2+3, etc.). You can download a spreadsheet for this purpose.[19]

Consecutive trials with similar typical errors can be analysed together to produce a single more precise estimate of typical error for those trials. Estimates of changes in the mean between these trials will also be a little more precise when derived from a single analysis of 3 or more such trials than when derived from consecutive pairs of trials. The appropriate analysis is a linear model with participants and trials as effects and with estimation by analysis of variance or by restricted maximum likelihood. The typical error is the residual error term in such analyses, regardless of whether participants and trials are fixed or random effects, but trials has to be a fixed effect for estimation of changes in the mean.

A one-way analysis of variance with participants as the effect produces an unsuitable estimate of typical error: in such an analysis the identity of the trial is ignored, so changes in the mean between trials add to the typical error. The resulting statistic is biased high and is hard to interpret, because the relative contributions of random error and changes in the mean are unknown. For example, with 2 trials and a change in the mean equal in magnitude to the typical error, I have found in simulations that this method yields a typical error inflated by a factor of 1.23. One-way analysis of variance is equivalent to calculating a separate variance for each participant from 2 or more trials, then averaging the variances and taking the square root. Authors who have used this equivalent method have usually committed a further mistake by averaging the participants’ standard deviations instead of variances. In my simulations, averaging the standard deviations underestimates the typical error by a factor of 0.82 for 2 trials and 0.90 for 3 trials; the factor tends to 1.00 for a large number of trials. If the change in the mean between 2 tests is equal in magnitude to the typical error, the 2 mistakes virtually cancel each other out.

Having opted for an appropriate method of analysis, researchers should check their data for the presence of so-called heteroscedasticity. In the context of reliability or repeated-measures analyses, this term refers to a typical error that differs in some systematic way between participants. For example, participants with larger values of a variable often have larger typical errors, and typical errors for subgroups of participants (male vs female, competitive vs recreational, etc.) may also differ. Analysing the raw values of these measures with the usual statistical procedures is problematic, because the procedures are based on the assumption that the typical error is the same for every participant. If this assumption is violated, participants with the larger typical errors have a greater influence on the value of any derived statistic, and the value of the statistic may also be biased.

The generic method to check for heteroscedasticity is to examine plots of residual values versus predicted values provided by the analysis of variance or other statistical procedure used to estimate
the reliability statistics. The residuals are the individual values of the random error for each participant for each trial; indeed, the standard deviation of the residuals is the typical error. With pairwise analysis of trials, a simple but equivalent method is to plot each participant’s difference score against the mean for the 2 trials.[4] If the residuals for one group of participants are clearly different from another, or if the residuals or difference scores show a trend towards larger values for participants at one end of the plot, heteroscedasticity is present. The appropriate action in the case of groups with different residuals is to analyse the reliability of the groups separately. Variation in the magnitude of residuals with magnitude of the variable can be removed or reduced by an appropriate transformation of the variable.

As noted earlier, for many variables the typical error increases for volunteers with larger values of the variable, whereas the typical percentage error tends to be similar between volunteers. For these variables, analysis after logarithmic transformation addresses the problem of heteroscedasticity and provides an estimate of the typical percentage error. To see how, imagine that the typical percentage error is 5%, which means that the observed value for every volunteer is typically (1 ± 0.05) times the mean value for the volunteer. Therefore, log(observed value) = log[(mean value)(1 ± 0.05)] = log(mean value) + log(1 ± 0.05) ≈ log(mean value) ± 0.05, because log(1 ± 0.05) ≈ ± 0.05 for natural (base e) logarithms. The typical error in the log of every individual’s value is therefore the same (0.05). You obtain the estimate of the typical percentage error of the original variable by multiplying the typical error of the log-transformed measure by 100. Alternatively, if you use 100log(observed value) as the transformation, the errors in the analyses are automatically approximate percentages, as are the magnitudes of changes in the mean in the analyses. The approximation is accurate for errors or changes less than 5%, but for larger errors or changes the typical percentage error or change is $100(e^{err/100} - 1)$, where err is the typical error or change in the mean provided by the analysis of the 100log-transformed measure.[20] There is also a special way to interpret errors > 5%. For example, if the error is 23%, the variation about the mean value is typically 1/1.23 to 1.23 times the mean value, or 0.81 to 1.23. The typical variation is not 1 ± 0.23 times the mean.

When a sample is homogeneous (that is, when all participants have similar values for the measure in question), the typical error is the same for all participants, regardless of transformation. In this situation, transformation to reduce heteroscedasticity is not an issue. Analysis of the log-transformed variable is still a convenient method for obtaining the typical percentage error, although an equally accurate estimate is obtained by dividing the typical error (from an analysis of the raw variable) by the grand mean of all trials. Log transformation becomes more important as the sample becomes more heterogeneous, but I have found by simulation that estimates of typical percentage error from raw and log-transformed variables differ substantially (by a factor of 1.04 or more) only when the between-subject standard deviation is more than 35% of the mean. I doubt whether any variables in sports medicine and science show such large between-subject variation, so estimates of reliability derived from untransformed variables in previous studies are probably not substantially biased.

The estimate of the typical error for the average participant may be unbiased, but participants at either end of a heterogeneous sample who differ in the typical error before transformation may still differ in the typical percentage error after log transformation. For example, with increasing skinfold thickness the typical error increases but the typical percentage error decreases (Gore C, personal communication). A simple solution to this kind of problem is to rank-order participants, divide them into several groups, then compute the typical error or typical percentage error for each group. Alternatively, it may be possible to find a transformation that gives all participants the same typical error (absence of heteroscedasticity) for the transformed variable.

For researchers interested in retest correlation as a measure of reliability, the intraclass correlation coefficient derived from a mixed model (the
ICC(3,1)[9]) is unbiased for any sample size. Use of the intraclass correlation is also the only sensible approach to computing an average correlation between more than 2 trials. The usual Pearson correlation coefficient between a pair of trials is an adequate estimate of retest correlation, although it is biased slightly high for small samples: in simulations for 7 individuals, the bias is up to 0.04 units, depending on the value of the correlation.

Authors of many previous reliability studies have provided only a correlation coefficient as the measure of reliability. Nevertheless, it is usually possible to calculate the more useful typical error or typical percentage error from their data. By rearranging the relationship $r = (S^2 - s^2)/S^2$, we get the familiar:

$s = S\sqrt{1 - r}$ (Eq. 6)

where s is the typical error, S is the average of the standard deviations for the participants in each trial and r is the intraclass correlation. The typical percentage error is obtained by dividing the resulting estimate of the typical error by the mean for the participants in all trials, then multiplying by 100. This formula is exact when r is the intraclass correlation, but even for a Pearson correlation my simulations show that the formula is surprisingly accurate: for samples of 10 or more participants the resulting typical percentage error is underestimated by a factor of 0.95 at most, but for samples of 7 the bias can be a factor of 0.90.

All estimates of reliability should be accompanied by confidence limits for the true value. Statistical programs usually provide confidence limits for the change in the mean, or you can use the formula in section 2.2. Confidence limits for the typical error are derived from the chi-squared distribution. For small degrees of freedom, the upper limit tends to be skewed out relative to the lower limit. Tate and Klett[17] provided an adjustment that reduces the skewness by minimising the width of the confidence interval, although it is then not an equal-probability interval. With only slight adjustment the Tate and Klett limits can be represented conveniently by a single factor (table II).

3.3 Complex Studies

The foregoing sections concern studies aimed at determining the reliability of 1 group of individuals with 1 type of test or equipment. In this section I deal with more complex studies: reliability of the mean of several trials; comparison of the reliability of 2 groups of individuals; comparison of 2 test protocols, items of equipment or operators of the equipment; and studies of continuously graded reliability.

Researchers sometimes improve the reliability of their measurements by using the mean of multiple trials: if there are n independent trials, the typical error of the mean is 1/√n times the error of a single trial. If the multiple trials are conducted over a short period (e.g. on the same day, without recalibration of equipment), but the researcher is interested in reliability of the mean over a longer period (e.g. on different days, with recalibration), the longer period is likely to be a source of substantial error. Therefore, beyond a certain number of multiple trials no substantial increase in reliability will be possible. To determine the number of trials, researchers need to perform a reliability study with multiple trials, estimate the magnitude of the error between trials over the shorter period ($e_s$) and over the longer period ($e_l$), then choose n such that $e_s/\sqrt{n} \ll e_l$. The most appropriate analysis is by repeated measures with 2 within-subject effects (same day, different day), each modelled with its own within-subject error. A statistically less challenging approach is as follows: analyse reliability of the trials on the same day to determine the trial number beyond which learning effects are negligible (e.g. trial 2); now compute between-day reliability for the mean of an increasing number of contiguous same-day trials (e.g. trials 3+4, trials 3+4+5...) to determine the number of same-day trials beyond which there is no further increase in between-day reliability.

Comparing the reliability of 2 groups of participants is straightforward. The participants are independent of each other, so any study amounts to 2 separate reliability studies. Confidence limits for the ratio of the typical errors between corresponding trials in the 2 groups can be derived from an F
ratio. Changes in the mean between corresponding pairs of trials can be compared with unpaired t tests of the difference scores.

Comparing the reliability of 2 items (protocols, equipment or operators) is possible using the above approach for 2 groups of participants tested separately. Using the same participants has more power but requires analysis by an expert. Each participant performs at least 1 trial on 1 item of equipment and at least 2 trials on the other, preferably in a balanced, randomised fashion. The analysis needs a mixed model, in which the equipment is a fixed effect, trial number is a fixed effect, participants is a random effect, and a dummy random variable is introduced to account for the extra within-subject variance associated with measures on one of the items. Confidence limits for the extra variance address the question of the difference in typical error between the items. The model also provides an estimate of the difference in learning effects between the items.

When setting up a study to compare 2 items, keep in mind that the typical error always consists of biological variation arising from the individuals and technological variation arising from the items. Since the aim is to compare the technological variation, try to make the biological variation as small as possible, because it contributes to the uncertainty in your comparison of the items. For example, when comparing the reliability of 2 anthropometrists, you would get them to measure the same individuals on the same day, to avoid any substantial biological variation. Similarly, when comparing the reliability of measures of power provided by 2 ergometers, use athletes as study participants, because they appear to be more reliable than non-athletes.

The problem of a continuous gradation of reliability arises when randomly chosen items or installations of the same kind of equipment produce consistently different values. For example, one item might always give high values, another might give low values and so on. Possible sources of these differences between items include inadequate quality control in manufacture, different environmental effects at the same or different locations, and differences in calibration or other aspects of operation by different operators. When a volunteer is retested on different items of the equipment, this variation between items adds to what would otherwise be the typical error for retests on the same apparatus, with the result that the overall typical error is higher. This typical error is the one that best represents the typical error in a one-off measurement taken on a randomly chosen item of equipment. It is also the one to use in the somewhat unusual situation of repeated trials when each trial is with a different item of equipment.

Researchers who are aware of the concept of lower reliability when retesting on different items or installations have usually computed a retest correlation rather than a typical error. The appropriate correlation is the intraclass correlation ICC(2,1) of Shrout and Fleiss.[9] It is derived from the so-called fully random model, in which the identities of the participants and trials are considered random effects. Researchers have often misapplied this model to data obtained from a single item of equipment. The resulting reliability is degraded by the learning effect, not by consistent differences in values between items of equipment. The only correct way to estimate the reliability between items of equipment is to test volunteers with a sufficient number of different items. The identity of the items is a random effect, and an extra fixed effect representing trial number is introduced in the analysis to account for learning effects. The typical error for a volunteer retested on different items is derived by adding the residual variance to the variance for the items, then taking the square root. A similar analysis is appropriate when a number of different judges rate the performance of the same athletes at different competitions; in this case, the variance corresponding to judges needs to be divided by the number of judges before it is added to the residual variance to give the typical error variance for an athlete between competitions.

Unfortunately, even the 2-way random model with the addition of a fixed trial effect would still not account for the possibility that the magnitude of the typical error itself varies between items of equipment or between judges. As far as I know, no-one has developed a theoretical framework for
quantifying such continuous variability in the typical error. It is not part of generalisability theory, which is another name for mixed modelling and which can deal only with the impact of random effects I discussed in the previous paragraph. Modelling continuous differences in reliability of subjects also seems to be impossible at present. Thus, the only way to model the better reliability that you find, for example, with faster athletes or more experienced operators, is to divide the volunteers or operators appropriately into a small number of groups, then compare the typical errors between groups.

4. Conclusion

The concept of the typical error in an individual’s score should be comprehensible to most researchers and practitioners in sports medicine and science. I believe the concept is easier to grasp and to apply than limits of agreement. Change in the mean value of a measure between trials is also an important component of reliability, and it needs to be kept separate from typical error. Retest correlation is difficult to use, because its value is sensitive to the heterogeneity of the sample of participants. In my opinion, observed values and confidence limits of the typical error and changes in the mean are necessary and sufficient to characterise the reliability of a measure. Publication of these data in reliability studies would substantially enhance comparison of the reliability of tests, assays or equipment. Greater understanding of the theory of reliability by researchers would also help reduce the incidence of inappropriate analyses in the literature.

Acknowledgements

Chris Gore, John Hawley, Jenny Keating, Michael McMahon, Louis Passfield and Andy Stewart provided valuable feedback on drafts of this article.

References

1. Atkinson G, Nevill AM. Statistical methods for addressing measurement error (reliability) in variables relevant to sports medicine. Sports Med 1998; 26: 217-38
2. Hopkins WG, Hawley JA, Burke LM. Design and analysis of research on sport performance enhancement. Med Sci Sports Exerc 1999; 31: 472-85
3. Nevill AM, Atkinson G. Assessing agreement between measurements recorded on a ratio scale in sports medicine and sports science. Br J Sports Med 1997; 31: 314-8
4. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986 Feb; 8: 307-10
5. Roebroeck ME, Harlaar J, Lankhorst GJ. The application of generalizability theory to reliability assessment: an illustration using isometric force measurements. Phys Ther 1993; 73: 386-401
6. VanLeeuwen DM, Barnes MD, Pase M. Generalizability theory: a unified approach to assessing the dependability (reliability) of measurements in the health sciences. J Outcome Measures 1998; 2: 302-25
7. Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psych Reports 1966; 19: 3-11
8. Kovaleski JE, Heitman RJ, Gurchiek LR, et al. Reliability and effects of leg dominance on lower extremity isokinetic force and work using the Closed Chain Rider System. J Sport Rehabil 1997; 6: 319-26
9. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psych Bull 1979; 86: 420-8
10. Kovaleski JE, Ingersoll CD, Knight KL, et al. Reliability of the BTE Dynatrac isotonic dynamometer. Isokinet Exerc Sci 1996; 6: 41-3
11. Hopkins WG. A new view of statistics. Available from: http://sportsci.org/resource/stats [Accessed 2000 Apr 18]
12. Hopkins WG, Manly BFJ. Errors in assigning grades based on tests of finite validity. Res Q Exerc Sport 1989; 60: 180-2
13. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Mahwah (NJ): Lawrence Erlbaum, 1988
14. Eliasziw M, Young SL, Woodbury MG, et al. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther 1994; 74: 777-88
15. Clark VR, Hopkins WG, Hawley JA, et al. Placebo effect of carbohydrate feedings during a 40-km cycling time trial. Med Sci Sports Exerc. In press
16. Hopkins WG, Wolfinger RD. Estimating ‘individual differences’ in the response to an experimental treatment [abstract]. Med Sci Sports Exerc 1998; 30 (5): S135
17. Tate RF, Klett GW. Optimal confidence intervals for the variance of a normal distribution. J Am Statist Assoc 1959; 54: 674-82
18. Hopkins WG. Generalizing to a population. Available from: http://sportsci.org/resource/stats/generalize.html [Accessed 2000 Apr 18]
19. Hopkins WG. Reliability: calculations and more. Available from: http://sportsci.org/resource/stats/relycalc.html [Accessed 2000 Apr 18]
20. Schabort EJ, Hopkins WG, Hawley JA, et al. High reliability of performance of well-trained rowers on a rowing ergometer. J Sports Sci 1999; 17: 627-32

Correspondence and offprints: Dr Will G. Hopkins, Department of Physiology, Medical School, University of Otago, Box 913, Dunedin, New Zealand.
E-mail: will.hopkins@otago.ac.nz