Psychometrics in Neuropsychological Assessment
with Daniel J. Slick
OVERVIEW
The process of neuropsychological assessment depends to a large extent on the reliability and validity of neuropsychological tests. Unfortunately, not all neuropsychological tests are created equal, and, like any other product, published tests vary in terms of their "quality," as defined in psychometric terms such as reliability, measurement error, temporal stability, sensitivity, specificity, predictive validity, and with respect to the care with which test items are derived and normative data are obtained. In addition to commercial measures, numerous tests developed primarily for research purposes have found their way into wide clinical usage; these vary considerably with regard to psychometric properties. With few exceptions, when tests originate from clinical research contexts, there is often validity data but little else, which makes estimating measurement precision and stability of test scores a challenge.
Regardless of the origins of neuropsychological tests, their competent use in clinical practice demands a good working knowledge of test standards and of the specific psychometric characteristics of each test used. This includes familiarity with the Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 1999) and a working knowledge of basic psychometrics. Texts such as those by Nunnally and Bernstein (1994) and Anastasi and Urbina (1997) outline some of the fundamental psychometric prerequisites for competent selection of tests and interpretation of obtained scores. Other, neuropsychologically focused texts such as Mitrushina et al. (2005), Lezak et al. (2004), Baron (2004), Franklin (2003a), and Franzen (2000) also provide guidance. The following is intended to provide a broad overview of important psychometric concepts in neuropsychological assessment and coverage of important issues to consider when critically evaluating tests for clinical usage. Much of the information provided also serves as a conceptual framework for the test reviews in this volume.
THE NORMAL CURVE
The frequency distributions of many physical, biological, and psychological attributes, as they occur across individuals in nature, tend to conform, to a greater or lesser degree, to a bell-shaped curve (see Figure 1-1). This normal curve or normal distribution, so named by Karl Pearson, is also known as the Gaussian or Laplace-Gauss distribution, after the 18th-century mathematicians who first defined it. The normal curve is the basis of many commonly used statistical and psychometric models (e.g., classical test theory) and is the assumed distribution for many psychological variables.
Definition and Characteristics
The normal curve has a number of specific properties. It is unimodal, perfectly symmetrical, and asymptotic at the tails. With respect to scores from measures that are normally distributed, the ordinate, or height of the curve at any point along the x (test score) axis, is the proportion of persons within the sample who obtained a given score. The ordinates for a range of scores (i.e., between two points on the x axis) may also be summed to give the proportion of persons that obtained a score within the specified range. If a specified normal curve accurately reflects a population distribution, then ordinate values are also equivalent to the probability of observing a given score or range of scores when randomly sampling from the population. Thus, the normal curve may also be referred to as a probability distribution.
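
The relation between the curve and proportions of scores can be checked numerically. The following is a minimal sketch using Python's standard library; the test metric (mean 50, SD 10) and the helper name `proportion_between` are assumptions chosen for illustration. For a continuous curve, the proportion of scores in a range is the area under the curve over that range.

```python
from statistics import NormalDist

# Hypothetical test metric, assumed for illustration: mean 50, SD 10.
scores = NormalDist(mu=50, sigma=10)

def proportion_between(dist, low, high):
    """Proportion of the distribution that falls between two scores."""
    return dist.cdf(high) - dist.cdf(low)

# Ordinate (height) of the curve at the mean, where the curve peaks.
height_at_mean = scores.pdf(50)

# Proportion of scores within one SD of the mean (40 to 60).
within_one_sd = proportion_between(scores, 40, 60)

print(f"height at mean: {height_at_mean:.4f}")
print(f"proportion between 40 and 60: {within_one_sd:.3f}")
```

Because the curve is symmetrical, the heights at 40 and 60 are equal, and roughly two-thirds of scores fall between them.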
Figure 1-1 The normal curve.
A Compendium of Neuropsychological Tests
The normal curve is mathematically defined as follows:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \qquad [1]$$
Where:
x = measurement values (test scores)
μ = the mean of the test score distribution
σ = the standard deviation of the test score distribution
π = the constant pi (3.14 ...)
e = the base of natural logarithms (2.71 ...)
f(x) = the height (ordinate) of the curve for any given test score
Relevance for Assessment
As noted previously, because it is a frequency distribution, the area under any given segment of the normal curve indicates the frequency of observations or cases within that interval. From a practical standpoint, this provides psychologists with an estimate of the "normality" or "abnormality" of any given test score or range of scores (i.e., whether it falls in the center of the bell shape, where the majority of scores lie, or instead, at either of the tail ends, where few scores can be found). The way in which the degree of "normality" or "abnormality" of test scores is quantified varies, but perhaps the most useful and inherently understandable metric is the percentile.
Z Scores and Percentiles
A percentile indicates the percentage of scores that fall at or below a given test score. As an example, we will assume that a given test score is plotted on a normal curve. When all of the ordinate values at and below this test score are summed, the resulting value is the percentile associated with that test score (e.g., a score in the 75th percentile indicates that 75% of the reference sample obtained equal or lower scores).
To convert scores to percentiles, raw scores may be linearly transformed or "standardized" in several ways. The simplest and perhaps most commonly calculated standard score is the z score, which is obtained by subtracting the sample mean score from an obtained score and dividing the result by the sample SD, as shown below:
$$z = \frac{x - \bar{X}}{SD} \qquad [2]$$
Where:
x = measurement value (test score)
X̄ = the mean of the test score distribution
SD = the standard deviation of the test score distribution
The resulting distribution of z scores has a mean of 0 and an SD of 1, regardless of the metric of raw scores from which they were derived. For example, given a mean of 25 and an SD of 5, a raw score of 20 translates into a z score of -1. The percentile corresponding to any resulting z score can then be easily looked up in tables available in most statistical texts. Z score conversions to percentiles are also shown in Table 1-1.
Interpretation of Percentiles
An important property of the normal curve is that the relationship between raw or z scores (which for purposes of this discussion are equivalent, since they are linear transformations of each other) and percentiles is not linear. That is, a constant difference between raw or z scores will be associated with a variable difference in percentile scores, as a function of the distance of the two scores from the mean. This is due to the fact that there are proportionally more observations (scores) near the mean than there are farther from the mean; otherwise, the distribution would be rectangular, or non-normal. This can readily be seen in Figure 1-2, which shows the normal distribution with demarcation of z scores and corresponding percentile ranges.
The nonlinear relation between z scores and percentiles has important interpretive implications. For example, a one-point difference between two z scores may be interpreted differently, depending on where the two scores fall on the normal curve. As can be seen, the difference between a z score of 0 and a z score of +1 is 34 percentile points, because 34% of scores fall between these two z scores (i.e., the scores being compared are at the 50th and 84th percentiles). However, the difference between a z score of +2 and a z score of +3 is less than 3 percentile points, because only about 2% of the distribution falls between these two points (i.e., the scores being compared are at the 98th and 99.9th percentiles). On the other hand, interpretation of percentile-score differences is also not straightforward, in that an equivalent "difference" between two percentile rankings may entail different clinical implications if the scores occur at the tail end of the curve than if they occur near the middle of the distribution. For example, a 30-point difference between scores at the 1st percentile versus the 31st percentile may be more clinically meaningful than the same difference between scores at the 35th percentile versus the 65th percentile.
Linear Transformation of Z Scores: T Scores and Other Standard Scores
In addition to the z score, linear transformation can be used to produce other standardized scores that have the same properties with regard to easy conversion via table look-up (see Table 1-1). The most common of these are T scores (M = 50, SD = 10), scaled scores, and standard scores such as those used in most IQ tests (M = 10, SD = 3, and M = 100, SD = 15). It must be remembered that z scores, T scores, standard scores, and percentile equivalents are derived from samples; although these are often treated as population values, any limitations of generalizability due to reference sample composition or testing circumstances must be taken into consideration when standardized scores are interpreted.
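
As an illustration of the conversions described in this section, the sketch below computes a z score, rescales it to other standard-score metrics, and looks up percentiles from the standard normal curve instead of a printed table. The function names are hypothetical, and the percentile values hold only under the assumption that scores are normally distributed.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1

def z_score(raw, mean, sd):
    """Standardize a raw score against a sample mean and SD."""
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    """Linearly transform a z score to another standard-score metric."""
    return new_mean + new_sd * z

def percentile(z):
    """Percentage of a normal distribution falling at or below z."""
    return 100 * STANDARD_NORMAL.cdf(z)

z = z_score(20, mean=25, sd=5)   # the worked example from the text
t = rescale(z, 50, 10)           # T score metric
iq = rescale(z, 100, 15)         # IQ-style standard score metric
print(z, t, iq, round(percentile(z), 1))

# The same one-point z difference covers very different percentile ranges.
gap_near_mean = percentile(1) - percentile(0)
gap_in_tail = percentile(3) - percentile(2)
print(round(gap_near_mean, 1), round(gap_in_tail, 1))
```

The last two values make the nonlinearity concrete: a one-point z difference near the mean spans far more percentile points than the same difference in the tail.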
Figure 1-2 The normal curve demarcated by z scores.
The Meaning of Standardized Test Scores: Score Interpretation
As well as facilitating translation of raw scores to estimated population ranks, standardization of test scores, by virtue of conversion to a common metric, facilitates comparison of scores across measures. However, this is only advisable when the raw score distributions for tests that are being compared are approximately normal in the population. In addition, if standardized scores are to be compared, they should be derived from similar samples, or more ideally, from the same sample. A score at the 50th percentile on a test normed on a population of university students does not have the same meaning as an "equivalent" score on a test normed on a population of elderly individuals. When comparing test scores, one must also take into consideration both the reliability of the two measures and their intercorrelation before determining if a significant difference exists (see Crawford & Garthwaite, 2002). In some cases, relatively large disparities between standard scores may not actually reflect reliable differences, and therefore may not be clinically meaningful. Furthermore, statistically significant or reliable differences between test scores may be common in a reference sample; therefore, the base rate of differences must also be considered, depending on the level of the scores (an IQ of 90 versus 110 as compared to 110 versus 130). One should also keep in mind that when test scores are not normally distributed, standardized scores may not accurately reflect actual population rank. In these circumstances, differences between standard scores may be misleading.
Note also that comparability across tests does not imply equality in meaning and relative importance of scores. For example, one may compare standard scores on measures of pitch discrimination and intelligence, but it will rarely be the case that these scores are of equal clinical or practical meaning or significance.
Interpreting Extreme Scores
A final critical issue with respect to the meaning of standardized scores (e.g., z scores) has to do with extreme observations. In clinical practice, one may encounter standard scores that are either extremely low or extremely high. The meaning and comparability of such scores will depend critically on the characteristics of the normative sample from which they derive.
For example, consider a hypothetical case in which an examinee obtains a raw score that is below the range of scores found in a normal sample. Suppose further that the SD in the normal sample is very small and thus the examinee's raw score translates to a z score of -5, indicating that the probability of encountering this score in the normal population would be 3 in 10 million (i.e., a percentile ranking of .00003). This represents a considerable extrapolation from the actual normative data, as (1) the normative sample did not include 10 million individuals and (2) not a single individual in the normative sample obtained a score anywhere close to the examinee's score. The percentile value is therefore an extrapolation and confers a false sense of precision. While one may be confident that it indicates impairment, there may be no basis to assume that it represents a meaningfully "worse" performance than a z score of -3, or of -4.
The estimated prevalence value of an obtained z score (or T score, etc.) can be calculated to determine whether interpretation of extreme scores may be appropriate. This is simply accomplished by inverting the percentile score corresponding to the z score (i.e., dividing 1 by the percentile score expressed as a proportion). For example, a z score of -4 is associated with an estimated frequency of occurrence or prevalence of approximately 0.00003 (more precisely, about 0.0000317). Dividing 1 by the unrounded value gives a rounded result of 31,560. Thus, the estimated prevalence value of this score in the population is 1 in 31,560. If the normative sample from which a z score is derived is considerably smaller than the denominator of the estimated prevalence value (i.e., 31,560 in the example), then some caution may be warranted in interpreting the percentile. In addition, whenever such extreme scores are being interpreted, examiners should also verify that the examinee's raw score falls within the range of raw scores in the normative sample. If the normative sample size is substantially smaller than the estimated prevalence sample size and the examinee's score falls outside the sample range, then considerable caution may be indicated in interpreting the percentile associated with the standardized score. Regardless of the z score value, it must also be kept in mind that interpretation of the associated percentile value may not be justifiable if the normative sample has a significantly non-normal distribution (see later for further discussion of non-normality). In sum, the clinical interpretation of extreme scores depends to a large extent on the properties of the normal samples involved; one can have more confidence that the percentile is reasonably accurate if the normative sample is large and well constructed and the shape of the normative sample distribution is approximately normal, particularly in tail regions where extreme scores are found.
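
The estimated prevalence check for extreme scores can be sketched in a few lines. This is an illustrative outline only: it assumes a normal population, the helper `interpretation_flags` and the normative sample values (N = 500, raw score range 12 to 48) are hypothetical, and the flags simply restate the two cautions named in the text.

```python
from statistics import NormalDist

def estimated_prevalence(z):
    """Return (tail proportion, N) so that the score occurs about 1 in N times.

    Assumes a normal population and uses the tail beyond z on whichever
    side of the mean the score falls.
    """
    proportion = NormalDist().cdf(-abs(z))
    return proportion, 1 / proportion

def interpretation_flags(z, normative_n, raw, sample_min, sample_max):
    """Hypothetical helper: flag percentiles that outrun the normative data."""
    _, one_in_n = estimated_prevalence(z)
    return {
        "sample_smaller_than_prevalence_n": normative_n < one_in_n,
        "raw_score_outside_sample_range": not (sample_min <= raw <= sample_max),
    }

proportion, one_in_n = estimated_prevalence(-4)
print(f"z = -4: proportion {proportion:.7f}, about 1 in {one_in_n:,.0f}")

# Hypothetical normative sample: 500 people, raw scores ranging from 12 to 48.
flags = interpretation_flags(-4, normative_n=500, raw=9,
                             sample_min=12, sample_max=48)
print(flags)
```

When both flags are raised, the percentile attached to the score is an extrapolation well beyond anything observed in the normative sample.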
The Normal Curve and Test Construction
Although the normal curve is from many standpoints an ideal or even expected distribution for psychological data, test score samples do not always conform to a normal distribution. When a new test is constructed, non-normality can be "corrected" by examining the distribution of scores on the prototype test, adjusting test properties, and resampling until a normal distribution is reached. For example, when a test is first administered during a try-out phase and a positively skewed distribution is obtained (i.e., with most scores clustering at the lower end of the distribution), the test likely has too high a floor, causing most examinees to obtain low scores. Easy items can then be added so that the majority of scores fall in the middle of the distribution rather than at the lower end (Anastasi & Urbina, 1997). When this is successful, the greatest numbers of individuals obtain about 50% of items correct. This level of difficulty usually provides the best differentiation between individuals at all ability levels (Anastasi & Urbina, 1997).
It must be noted that a test with a normal distribution in the general population may show extreme skew or other divergence from normality when administered to a population that differs considerably from the average individual. For example, a vocabulary test that produces normally distributed scores in a general sample of individuals may display a negatively skewed distribution due to a low ceiling when administered to doctoral students in literature, and a positively skewed distribution due to a high floor when administered to preschoolers from recently immigrated, Spanish-speaking families (see Figure 1-3 for examples of positive and negative skew). In this case, the test would be incapable of effectively discriminating between individuals within either group because of ceiling effects and floor effects, respectively, even though it is of considerable utility in the general population. Thus, a test's distribution, including floors and ceilings, must always be considered when assessing individuals who differ from the normative sample in terms of characteristics that affect test scores (e.g., in this example, degree of exposure to English words). In addition, whether a test produces a normal distribution (i.e., without positive or negative skew) is also an important aspect of evaluating tests for bias across different populations (see Chapter 2 for more discussion of bias).
Depending on the characteristics of the construct being measured and the purpose for which a test is being designed, a normal distribution of scores may not be obtainable or even desirable. For example, the population distribution of the construct being measured may not be normally distributed. Alternatively, one may want only to identify and/or discriminate between persons at only one end of a continuum of abilities (e.g., a creativity test for gifted students). In this case, the characteristics of only one side of the sample score distribution (i.e., the upper end) are critical, while the characteristics of the other side of the distribution are of no particular concern. The measure may even be deliberately designed to have floor or ceiling effects. For example, if one is not interested in one tail (or even one-half) of the distribution, items that would provide discrimination in that region may be omitted to save administration time. In this case, a test with a high floor or low ceiling in the general population (and with positive or negative skew) may be more desirable than a test with a normal distribution. In most applications, however, a more normal-looking curve within the targeted subpopulation is usually desirable.
Figure 1-3 Skewed distributions (positive skew and negative skew).
Non-Normality
Although the normal curve is an excellent model for psychological data and many sample distributions of natural processes are approximately normal, it is not unusual for test score distributions to be markedly non-normal, even when samples are large (Micceri, 1989). For example, neuropsychological tests such as the Boston Naming Test (BNT) and Wisconsin Card Sorting Test (WCST) do not have normal distributions when raw scores are examined, and, even when demographic correction methods are applied, some tests continue to show a non-normal, multimodal distribution in some populations (Fastenau, 1998). (An example of a non-normal distribution is shown in Figure 1-4.)
The degree to which a given distribution approximates the underlying population distribution increases as the number of observations (N) increases and becomes less accurate as N decreases. This has important implications for norms comprised of small samples. Thus, a larger sample will produce a more normal distribution, but only if the underlying population distribution from which the sample is obtained is normal. In other words, a large N does not "correct" for non-normality of an underlying population distribution. However, small samples may yield non-normal distributions due to random sampling effects, even though the population from which the sample is drawn has a normal distribution. That is, one may not automatically assume, given a non-normal distribution in a small sample, that the population distribution is in fact non-normal (note that the converse may also be true).
Figure 1-4 A non-normal test score distribution (raw scores and percentiles; mean = 50, SD = 10).
Several factors may lead to non-normal test score distributions: (a) the existence of discrete subpopulations within the general population with differing abilities, (b) ceiling or floor effects, and (c) treatment effects that change the location of means, medians, and modes and affect variability and distribution shape (Micceri, 1989).
Skew
As with the normal curve, some varieties of non-normality may be characterized mathematically. Skew is a formal measure of asymmetry in a frequency distribution that can be calculated using a specific formula (see Nunnally & Bernstein, 1994). It is also known as the third moment of a distribution (the mean and variance are the first and second moments, respectively). A true normal distribution is perfectly symmetrical about the mean and has a skew of zero. A non-normal but symmetric distribution will have a skew value that is near zero. Negative skew values indicate that the left tail of the distribution is heavier (and often more elongated) than the right tail, which may be truncated, while positive skew values indicate that the opposite pattern is present (see Figure 1-3). When distributions are skewed, the mean and median are not identical because the mean will not be at the midpoint in rank, and z scores will not accurately translate into sample percentile rank values. The error in mapping of z scores to sample percentile ranks increases as skew increases.
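
A minimal sketch of a skew calculation follows. It uses one common definition, the mean of cubed z scores (the third standardized moment); this is an assumption, since the text does not reproduce the formula given in Nunnally and Bernstein (1994). The reaction time values are invented for illustration.

```python
from statistics import fmean, median, pstdev

def skewness(values):
    """Population skew: the mean of cubed z scores (third standardized moment)."""
    m = fmean(values)
    sd = pstdev(values)
    return fmean([((v - m) / sd) ** 3 for v in values])

# Hypothetical reaction times in milliseconds: bounded below, long right tail.
reaction_times = [310, 320, 330, 340, 350, 360, 380, 420, 600, 950]

print(f"skew:   {skewness(reaction_times):.2f}")
print(f"mean:   {fmean(reaction_times):.0f}")
print(f"median: {median(reaction_times):.0f}")

# A symmetric set of scores has a skew of zero.
print(f"symmetric skew: {skewness([1, 2, 3, 4, 5]):.2f}")
```

In the skewed sample the mean is pulled toward the long tail and sits above the median, which is why z scores computed from the mean and SD misstate percentile rank.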
Truncated Distributions
Significant skew often indicates the presence of a truncated distribution. This may occur when the range of scores is restricted on one side but not the other, as is the case, for example, with reaction time measures, which cannot be lower than several hundred milliseconds, but can reach very high positive values in some individuals. In fact, distributions of scores from reaction time measures, whether aggregated across trials on an individual level or across individuals, are often characterized by positive skew and positive outliers. Mean values may therefore be positively biased with respect to the "central tendency" of the distribution as defined by other indices, such as the median. Truncated distributions are also commonly seen on error scores. A good example of this is Failure to Maintain Set (FMS) scores on the WCST (see review in this volume). In the normative sample of 30- to 39-year-old persons, observed raw scores range from 0 to 21, but the majority of persons (84%) obtain scores of 0 or 1, and less than 1% obtain scores greater than 3.
Floor/Ceiling Effects
Floor and ceiling effects may be defined as the presence of truncated tails in the context of limitations in range of item difficulty. For example, a test may be said to have a high floor when a large proportion of the examinees obtain raw scores at or near the lowest possible score. This may indicate that the test lacks a sufficient number and range of easier items. Conversely, a test may be said to have a low ceiling when the opposite pattern is present (i.e., when a high number of examinees obtain raw scores at or near the highest possible score). Floor and ceiling effects may significantly limit the usefulness of a measure. For example, a measure with a high floor may not be suitable for use with low functioning examinees, particularly if one wishes to delineate level of impairment.
Multimodality and Other Types of Non-Normality
Multimodality is the presence of more than one "peak" in a frequency distribution (see histogram in Figure 1-4 for an example). Another form of significant non-normality is the uniform or near-uniform distribution (a distribution with no or minimal peak and relatively equal frequency across scores). When such distributions are present, linearly transformed scores (z scores, T scores, and other deviation scores) may be totally inaccurate with respect to actual sample/population percentile rank and should not be interpreted in that framework. In these cases, sample-derived rank percentile scores may be more clinically useful.
Non-Normality and Percentile Derivations
Non-normality is not trivial; it has major implications for derivation and interpretation of standard scores and comparison of such scores across tests: standardized scores derived by linear transformation (e.g., z scores) will not correspond to sample percentiles, and the degree of divergence may be quite large.
Consider the histogram in Figure 1-4, which shows the distribution of scores obtained for a hypothetical test. This test, with a sample size of 1000, has a mean raw score of 50 and a standard deviation of 10; therefore (and very conveniently), no linear transformation is required to obtain T scores. An expected normal distribution based on the observed mean and standard deviation has been overlaid on the observed histogram for purposes of comparison.
The histogram in Figure 1-4 shows that the distribution of scores for the hypothetical test is grossly non-normal, with a truncated lower tail and significant positive skew, indicating floor effects and the existence of two distinct subpopulations. If the distribution were normal (i.e., if we follow the normal curve, superimposed on the histogram in Figure 1-4, instead of the histogram itself), a raw score of 40 would correspond to a T score of 40, a score that is 1 SD or 10 points from the mean, and translate to the 16th percentile (percentile not shown in the graph). However, when we calculate a percentile for the actual score distribution (i.e., the histogram), a score of 40 is actually below the 1st percentile with respect to the observed sample distribution (percentile = 0.8). Clearly, the difference in percentiles in this example is not trivial and has significant implications for score interpretation.
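
The divergence between normal-theory and sample-derived percentiles can be reproduced with a small invented sample. The sketch below is not the data behind Figure 1-4; the score values and function names are hypothetical, chosen only to give a floor-limited, positively skewed distribution.

```python
from statistics import NormalDist, fmean, pstdev

def empirical_percentile(sample, score):
    """Percentage of sample scores at or below the given score."""
    return 100 * sum(1 for s in sample if s <= score) / len(sample)

def normal_theory_percentile(sample, score):
    """Percentile implied by a normal curve with the sample's mean and SD."""
    dist = NormalDist(fmean(sample), pstdev(sample))
    return 100 * dist.cdf(score)

# Hypothetical sample of 100 scores: nothing below 42, long upper tail.
sample = [42] * 30 + [45] * 30 + [50] * 20 + [60] * 12 + [75] * 8

score = 40
print(f"mean {fmean(sample):.1f}, SD {pstdev(sample):.1f}")
print(f"normal-theory percentile: {normal_theory_percentile(sample, score):.1f}")
print(f"sample percentile:        {empirical_percentile(sample, score):.1f}")
```

A score of 40 sits about 1 SD below the mean, so the normal curve places it near the 16th percentile, yet no one in the sample scored that low.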
Normalizing Test Scores
When confronted with problematic score distributions, many test developers employ "normalizing" transformations in an attempt to correct departures from normality (examples of this can be found throughout this volume, in the Normative Data section for tests reviewed). Although helpful, these procedures are by no means a panacea, as they often introduce problems of their own with respect to interpretation. Additionally, many test manuals contain only a cursory discussion of normalization of test scores. Anastasi and Urbina (1997) state that scores should only be normalized if: (1) they come from a large and representative sample, or (2) any deviation from normality arises from defects in the test rather than characteristics of the sample. Furthermore, as we have noted above, it is preferable to adjust score distributions prior to normalization by modifying test content (e.g., by adding or modifying items) rather than statistically transforming non-normal scores into a normal distribution. Although a detailed discussion of normalization procedures is beyond the scope of this chapter (interested readers are referred to Anastasi & Urbina, 1997), ideally, test makers should describe in detail the nature of any significant sample non-normality and the procedures used to correct it for derivation of standardized scores. The reasons for correction should also be justified, and direct percentile conversions based on the uncorrected sample distribution should be provided as an option for users. Despite the limitations inherent in correcting for non-normality, Anastasi and Urbina (1997) note that most test developers will probably continue to do so because of the need to use test scores in statistical analyses that assume normality of distributions. From a practical point of view, test users should be aware of the mathematical computations and transformations involved in deriving scores for their instruments. When all other things are equal, test users should choose tests that provide information on score distributions and any procedures that were undertaken to correct non-normality, over those that provide partial or no information.
Extrapolation/Interpolation
Despite all the best efforts, there are times when norms fall short in terms of range or cell size. This includes missing data in some cells, inconsistent age coverage, or inadequate demographic composition of some cells compared to the population. In these cases, data are often extrapolated or interpolated using the existing score distribution and techniques such as multiple regression. For example, Heaton and colleagues have published sets of norms that use multiple regression to correct for demographic characteristics and compensate for few subjects in some cells (Heaton et al., 2003). Although multiple regression is robust to slight violations of assumptions, estimation errors may occur when using normative data that violates the assumptions of homoscedasticity (uniform variance across the range of scores) and normal distribution of scores necessary for multiple regression (Fastenau & Adams, 1996; Heaton et al., 1996).
Age extrapolations beyond the bounds of the actual ages of the individuals in the samples are also sometimes seen in normative data sets, based on projected developmental curves. These norms should be used with caution due to the lack of actual data points in these age ranges. Extrapolation methods, such as those that employ regression techniques, depend on the shape of the distribution of scores. Including only a subset of the distribution of age scores in the regression (e.g., by omitting very young or very old individuals) may change the projected developmental slope of certain tests dramatically. Tests that appear to have linear relationships, when considered only in adulthood, may actually have highly positively skewed binomial functions when the entire age range is considered. One example is vocabulary, which tends to increase exponentially during the preschool years, shows a slower rate of progress during early adulthood, remains relatively stable with continued gradual increase, and then shows a minor decrease with advancing age. If only a subset of the age range (e.g., adults) is used to estimate performance at the tail ends of the distribution (e.g., preschoolers and elderly), the estimation will not fit the shape of the actual distribution.
Thus, normalization may introduce error when the relationship between a test and a demographic variable is non-linear. In this case, linear correction using multiple regression distorts the true relationship between variables (Fastenau, 1998).
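
The risk of extrapolating a linear trend beyond the sampled age range can be illustrated with a simple least-squares fit. The age and score values below are invented to mimic the vocabulary pattern described above; they are not normative data, and simple one-predictor regression stands in here for the multiple regression methods the text refers to.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical mean vocabulary scores by age: steep early growth,
# a long plateau, then a minor decline.
mean_score_by_age = {3: 10, 5: 22, 10: 40, 20: 52, 30: 55,
                     40: 57, 50: 58, 60: 58, 70: 56, 80: 53}

# Fit the line using adults aged 20 to 60 only.
adult_ages = [age for age in mean_score_by_age if 20 <= age <= 60]
adult_scores = [mean_score_by_age[age] for age in adult_ages]
slope, intercept = fit_line(adult_ages, adult_scores)

def predicted(age):
    """Score projected from the adult-only linear fit."""
    return intercept + slope * age

for age in (3, 80):
    print(f"age {age}: projected {predicted(age):.1f}, "
          f"observed {mean_score_by_age[age]}")
```

The adult-only line overestimates scores at both ends of the age range, most severely for preschoolers, because the true relationship is not linear.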
MEASUREMENT PRECISION: RELIABILITY AND STANDARD ERROR
Like all forms of measurement, psychological tests are not perfectly precise; rather, test scores must be seen as estimates of abilities or functions, each associated with some degree of measurement error. Each test differs in the precision of the scores that it produces. Of critical importance is the fact that no test has only one specific level of precision. Rather, precision always varies to some degree, and potentially substantially, across different populations and test-use settings. Therefore, estimates of measurement error relevant to specific testing circumstances are a prerequisite for correct interpretation. For example, even the most precise test may produce highly imprecise results if administered in a nonstandard fashion, in a nonoptimal environment, or to an uncooperative examinee. Aside from these obvious caveats, a few basic
Table 1-2 Sources of Error Variance in Relation to Reliability Coefficients
Type of Reliability Coefficient    Error Variance
Split-half                         Content sampling
Kuder-Richardson                   Content sampling
Coefficient alpha                  Content sampling
Test-retest                        Time sampling
Alternate-form (immediate)         Content sampling
Alternate-form (delayed)           Content sampling and time sampling
Interrater                         Interscorer differences
of the correlation between test scores and true scores. This is why it is used for estimating true scores and associated standard errors (Nunnally & Bernstein, 1994). All things being equal, longer tests will generally yield higher reliability estimates (Sattler, 2001). Internal reliability is usually assessed with some measure of the average correlation among items within a test (Nunnally & Bernstein, 1994). These include the split-half or Spearman-Brown reliability coefficient (obtained by correlating two halves of items from the same test) and coefficient alpha, which provides a general estimate of reliability based on all the possible ways of splitting test items. Alpha is essentially based on the average intercorrelation between test items and any other set of items, and is used for tests with items that yield more than two response types (i.e., possible scores of 0, 1, or 2). For additional useful references concerning alpha, see Cronbach (2004) and Streiner (2003a, 2003b). The Kuder-Richardson reliability coefficient is used for items with yes/no answers or heterogeneous tests where split-half methods must be used (i.e., the mean of all the different split-half coefficients if the test were split into all possible ways). Generally, Kuder-Richardson coefficients will be lower than split-half coefficients when tests are heterogeneous in terms of content (Anastasi & Urbina, 1997).
The Special Case of Speed Tests
Tests involving speed, where the score exclusively depends on
the number of items completed within a time limit rather
than the number correct, will cause spuriously high internal
reliability estimates if internal reliability indices such as split-half
reliability are used. For example, dividing the items into
two halves to calculate a split-half reliability coefficient will
yield two half-tests with 100% item completion rates, whether
the individual obtained a score of 4 (i.e., yielding two half-tests
totaling 2 points each, or perfect agreement) or 44 (i.e.,
yielding two half-tests both totaling 22 points, also yielding
perfect agreement). The result in both cases is a split-half
reliability of 1.00 (Anastasi & Urbina, 1997). Some alternatives
are to use test-retest reliability or alternate form reliability,
ideally with the alternate forms administered in immediate
succession to avoid time sampling error. Reliabilities can also
principles help in determining whether a test generally
provides precise measurements in most situations where it will be
used. We begin with an overview of the related concepts of
reliability, true scores, obtained scores, the various estimates of
measurement error, and the notion of confidence intervals.
These are reviewed below.
Definition of Reliability
Reliability refers to the consistency of measurement of a given
test and can be defined in several ways, including consistency
within itself (internal consistency reliability), consistency over
time (test-retest reliability), consistency across alternate forms
(alternate form reliability), and consistency across raters
(interrater reliability). Indices of reliability indicate the degree to
which a test is free from measurement error (or the proportion
of variance in observed scores attributable to variance in
true scores). The interpretation of such indices is often not so
straightforward.
It is important to note that the term "error" in this context
does not actually refer to "incorrect" or "wrong" information.
Rather, "error" consists of the multiple sources of variability
that affect test scores. What may be termed error variance in
one application may be considered part of the true score in
another, depending on the construct being measured (state or
trait), the nature of the test employed, and whether it is
deemed relevant or irrelevant to the purpose of the testing
(Anastasi & Urbina, 1997). An example relevant to neuropsychology
is that internal reliability coefficients tend to be
smaller at either end of the age continuum. This finding has
been attributed to both limitations of tests (e.g., measurement
error) and increased intrinsic performance variability among
very young and very old examinees.
Factors Affecting Reliability
Reliability coefficients are influenced by (a) test characteristics
(e.g., length, item type, item homogeneity, and influence of
guessing) and (b) sample characteristics (e.g., sample size,
range, and variability). The extent of a test's "clarity" is intimately
related to its reliability: reliable measures typically
have (a) clearly written items, (b) easily understood test instructions,
(c) standardized administration conditions, (d)
explicit scoring rules that minimize subjectivity, and (e) a
process for training raters to a performance criterion (Nunnally
& Bernstein, 1994). For a list of commonly used reliability
coefficients and their associated sources of error variance,
see Table 1-2.
Internal Reliability
Internal reliability reflects the extent to which items within a
test measure the same cognitive domain or construct. It is a
core index in classical test theory. A measure of the intercorrelation
of items, internal reliability is an estimate of the correlation
between randomly parallel test forms, and by extension,
Table 1-3 Common Sources of Bias and Error in
Test-Retest Situations
Source: From Lineweaver & Chelune, 2003. Reprinted with permission from
Elsevier.
may or may not be considered sources of measurement error.
Apart from these variables, one must consider, and possibly
parse out, effects of prior exposure, which are often conceptualized
as involving implicit or explicit learning. Hence the
term practice effects is often used. However, prior exposure to
a test does not necessarily lead to increased performance at
retest. Note also that the actual nature of the test may sometimes
change with exposure. For instance, tests that rely on a
"novelty effect" and/or require deduction of a strategy or
problem solving (e.g., WCST, Tower of London) may not be
conducted in the same way once the examinee has prior familiarity
with the testing paradigm.
Like some measures of problem-solving abilities, measures
of learning and memory are also highly susceptible to practice
effects, though these are less likely to reflect a fundamental
change in how examinees approach tasks. In either case, practice
effects may lead to low test-retest correlations by effectively
lowering the ceiling at retest, resulting in a restriction of
range (i.e., many examinees obtain scores at or near the maximum
possible at retest). Nevertheless, restriction of range
should not be assumed when test-retest correlations are low
until this has been verified by inspection of data.
The relationship between prior exposure and test stability
coefficients is complex, and although test-retest coefficients
may be affected by practice or prior exposure, the coefficient
does not indicate the magnitude of such effects. That is, retest
correlations will be very high when individual retest scores all
change by a similar amount, whether the practice effect is nil or
very large. When stability coefficients are low, then there may
be (1) no systematic effects of prior exposure, (2) the relation
be calculated for any test that can be divided into specific time
intervals; scores per interval can then be compared in a procedure
akin to the split-half method, as long as items are of relatively
equivalent difficulty (Anastasi & Urbina, 1997). For
most of the speed tests reviewed in this volume, reliability is
estimated by using the test-retest reliability coefficient, or else
by a generalizability coefficient (see below).
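The split-half artifact described for pure speed tests can be checked with a minimal Python sketch. The completion counts and the helper names (`pearson_r`, `odd_even_halves`) are hypothetical and are not taken from the source; the sketch assumes that every completed item earns one point and that items are split into odd- and even-numbered halves.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def odd_even_halves(items_completed):
    """Split a pure speed score into odd-item and even-item half scores.

    Every completed item is assumed correct, so items 1, 3, 5, ... count
    toward the odd half and items 2, 4, 6, ... toward the even half.
    """
    odd = (items_completed + 1) // 2
    even = items_completed // 2
    return odd, even

# Hypothetical numbers of items completed by six examinees.
completed = [4, 12, 20, 28, 36, 44]
halves = [odd_even_halves(c) for c in completed]
odd_scores = [h[0] for h in halves]
even_scores = [h[1] for h in halves]

split_half_r = pearson_r(odd_scores, even_scores)
print(halves[0], halves[-1])    # (2, 2) (22, 22)
print(round(split_half_r, 6))   # 1.0
```

Because the two halves are determined entirely by how far each examinee got, the coefficient says nothing about measurement precision, which is why a test-retest or alternate-form estimate is used instead.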
Test-Retest Reliability
Test-retest reliability, also known as temporal stability, provides
an estimate of the correlation between two test scores
from the same test administered at two different points in time.
A test with good temporal stability should show little change
over time, providing that the trait being measured is stable and
there are no differential effects of prior exposure. It is important
to note that tests measuring dynamic (i.e., changeable)
abilities will by definition produce lower test-retest reliabilities
than tests measuring domains that are more trait-like and
stable (Nunnally & Bernstein, 1994). See Table 1-3 for common
sources of bias and error in test-retest situations.
A test has an infinite number of possible test-retest
reliabilities, depending on the length of the time interval between
testing. In some cases, reliability estimates are inversely related
to the time interval between baseline and retest (Anastasi &
Urbina, 1997). In other words, the shorter the time interval
between test and retest, the higher the reliability coefficient
will be. However, the extent to which the time interval affects
the test-retest coefficient will depend on the type of ability
evaluated (i.e., stable versus more variable). Reliability may
also depend on the type of individual being assessed, as some
groups are intrinsically more variable over time than others.
For example, the extent to which scores fluctuate over time
may depend on subject characteristics, including age (e.g.,
normal preschoolers will show more variability than adults)
and neurological status (e.g., TBI examinees' scores may vary
more in the acute state than in the post-acute state). Ideally,
reliability estimates should be provided for both normal individuals
and the clinical populations in which the test is intended
to be used, and the specific demographic characteristics
of the samples should be fully specified. Test stability coefficients
presented in published test manuals are usually derived
from relatively small normal samples tested over much
shorter intervals than are typical for retesting in clinical practice
and should therefore be considered with due caution
when drawing inferences regarding clinical cases. However,
there is some evidence that duration of interval has less of
an impact on test-retest scores than subject characteristics
(Dikmen et al., 1999).
Prior Exposure and Practice Effects
Variability in scores on the same test over time may be related
to situational variables such as examinee state, examiner state,
examiner identity (same versus different examiner at retest),
or environmental conditions that are often unsystematic and
Bias
  Intervening variables
    Events of interest (e.g., surgery, medical intervention,
      rehabilitation)
    Extraneous events
  Practice effects
    Memory for content
    Procedural learning
    Other factors
      (a) Familiarity with testing context and examiner
      (b) Performance anxiety
  Demographic considerations
    Age (maturational effects and aging)
    Education
    Gender
    Ethnicity
    Baseline ability
Error
  Statistical errors
    Measurement error (SEM)
    Regression to the mean (SEe)
  Random or uncontrolled events
of prior exposure may be nonlinear, or (3) ceiling effects/
restriction of range related to prior exposure may be attenuating
the coefficient. For example, certain subgroups may benefit
more from prior exposure to test material than others (e.g.,
high-IQ individuals; Rapport et al., 1998), or some subgroups
may demonstrate more stable scores or consistent practice effects
than do others. This causes the score distribution to
change at retest (effectively "shuffling" the individuals' rankings
in the distribution), which will attenuate the correlation.
In these cases, the test-retest correlation may vary significantly
across subgroups and the correlation for the entire sample
will not be the best estimate of reliability for any of the subgroups,
overestimating reliability for some and underestimating
reliability for others. In some cases, practice effects, as
long as they are relatively systematic and accurately assessed,
will not render a test unusable from a reliability perspective,
though they should always be taken into account when retest
scores are interpreted. In addition, individual factors must
always be considered. For example, while improved performance
may usually be expected with a particular measure, an
individual examinee may approach tests that he or she had
difficulty with previously with heightened anxiety that leads
to decreased performance. Lastly, it must be kept in mind that
factors other than prior exposure (e.g., changes in environment
or examinee state) may affect test-retest reliability.
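The point that a stability coefficient does not reveal the size of a practice effect can be illustrated with a small Python sketch. All scores and gains below are hypothetical, and `pearson_r` is an illustrative helper rather than anything defined in the source.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical baseline standard scores for seven examinees.
baseline = [85, 90, 95, 100, 105, 110, 115]

# Case 1: no practice effect at all.
retest_none = list(baseline)
# Case 2: a large but uniform practice effect (everyone gains 10 points).
retest_uniform = [s + 10 for s in baseline]
# Case 3: uneven gains (hypothetical) that reshuffle examinees' rankings.
gains = [12, 0, 15, 2, 0, 14, 1]
retest_uneven = [s + g for s, g in zip(baseline, gains)]

r_none = pearson_r(baseline, retest_none)
r_uniform = pearson_r(baseline, retest_uniform)
r_uneven = pearson_r(baseline, retest_uneven)

print(round(r_none, 3), round(r_uniform, 3))  # 1.0 1.0
print(r_uneven < r_uniform)                   # True
```

Cases 1 and 2 give the same coefficient even though mean retest scores differ by 10 points, while the uneven gains in case 3 attenuate the correlation. Mean change therefore has to be examined separately from the coefficient.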
Alternate Forms Reliability
Some investigators advocate the use of alternate forms to
eliminate the confounding effects of practice when a test must
be administered more than once (e.g., Anastasi & Urbina,
1997). However, this practice introduces a second form of error
variance into the mix (i.e., content sampling error), in addition
to the time sampling error inherent in test-retest
paradigms (see Table 1-3; see also Lineweaver & Chelune,
2003). Thus, tests with alternate forms must have extremely
high correlations between forms in addition to high test-retest
reliability to confer any advantage over using the same form
administered twice. Moreover, they must demonstrate
equivalence in terms of mean scores from test to retest, as well as
consistency in score classification within individuals from test
to retest. Furthermore, alternate forms do not necessarily
eliminate effects of prior exposure, as exposure to stimuli and
procedures can confer some positive carry-over effect (e.g.,
procedural learning) despite the use of a different set of items.
These effects may be minimal across some types of well-constructed parallel
forms, such as those assessing acquired
knowledge. For measures such as the WCST, where specific
learning and problem solving are involved, it may be difficult
or impossible to produce an equivalent alternate form that
will be free of effects of prior exposure to the original form.
While it is possible to attain this degree of psychometric sophistication
through careful item analysis, reliability studies,
and administration to a representative normative group, it is
rare for alternate forms to be constructed with the same psychometric
rigor as were the original forms from which they
were derived. Even well-constructed alternate forms often lack
crucial validation evidence such as similar correlations to criterion
measures as the original test form. This is especially
true for older neuropsychological tests, particularly those
with original forms that were never subjected to any item
analysis or reliability studies whatsoever (e.g., BVRT).
Inadequate test construction and psychometric properties are also
found for alternate forms in more general published tests in
common usage (e.g., WRAT-3). Because so few alternate
forms are available and few of those that are meet these psychometric
standards, our tendency is to use reliable change
indices or standardized regression-based scores for estimating
change from test to retest.
Interrater Reliability
Most test manuals provide specific and detailed instructions
on how to administer and score tests according to standard
procedures to minimize error variance due to different examiners
and scorers. However, some degree of examiner variance
remains in individually administered tests, particularly when
scores involve a degree of judgment (e.g., multiple-response
verbal tests such as the Wechsler Vocabulary Scales, which require
the rater to administer a score from 0 to 2). In this case,
an estimate of the reliability of administration and scoring
across examiners is needed.
Interrater reliability can be evaluated using percentage
agreement, kappa, product-moment correlation, and intraclass correlation
coefficient (Sattler, 2001). For any given test,
Pearson correlations will provide an upper limit for the intraclass
correlations, but intraclass correlations are preferred because,
unlike the Pearson's r, they take into account whether paired
assessments were made by the same set of examiners or
by different examiners. Thus, the intraclass correlation
distinguishes those sets of scores ranked in the same order
from those that are ranked in the same order but have low,
moderate, or complete agreement with each other, and corrects
for interexaminer or test-retest agreement expected by
chance alone (Cicchetti & Sparrow, 1981). However, advantages
of the Pearson correlation are that it is familiar, is readily
interpretable, and can be easily compared using standard statistical
techniques; it is best for evaluating consistency in
ranking rather than agreement per se (Fastenau et al., 1996).
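The difference between consistency in ranking (Pearson r) and agreement (intraclass correlation) can be sketched in Python as follows. The ratings are hypothetical, and the choice of the two-way random-effects, absolute-agreement, single-rater form, often labelled ICC(2,1), is an assumption made for illustration; the text does not say which intraclass form is intended.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def icc_absolute_agreement(ratings):
    """Single-rater, absolute-agreement ICC from a two-way layout.

    `ratings` holds one row per examinee and one column per rater.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)              # between-examinee mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical raw scores: rater B ranks examinees exactly as rater A does
# but scores everyone 4 points higher.
rater_a = [10, 12, 14, 16, 18]
rater_b = [a + 4 for a in rater_a]

r = pearson_r(rater_a, rater_b)
icc = icc_absolute_agreement([list(pair) for pair in zip(rater_a, rater_b)])
print(round(r, 3), round(icc, 3))   # 1.0 0.556
```

The rankings are identical, so r is 1.0, but the systematic 4-point disagreement pulls the agreement index well below that ceiling.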
Generalizability Coefficients
One reliability coefficient type not covered in this list is the
generalizability coefficient, which is starting to appear more
frequently in test manuals, particularly in the larger test batteries
(e.g., Wechsler scales and NEPSY). In generalizability
theory, or G theory, reliability is evaluated by decomposing
test score variance using the general linear model (e.g., variance
components analysis). This is a variant of the mathematical
methods used to apportion variance in general linear
model analyses such as ANOVA. In the case of G theory, the
between-groups variance is considered an estimate of true
score variance and within-groups variance is considered an
estimate of error variance. The generalizability coefficient is
the ratio of estimated true variance to the sum of the estimated
true variance and estimated error variance. A discussion
of this flexible and powerful model is beyond the scope
of this chapter, but detailed discussions can be found in
Nunnally and Bernstein (1994) and Shavelson et al. (1989).
Nunnally and Bernstein (1994) also discuss related issues
pertinent to estimating reliabilities of variables reflecting
sums such as composite scores, and the fact that reliabilities
of difference scores based on correlated measures can be very
low.
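Two of the quantities mentioned here reduce to short formulas, sketched below in Python. The variance components and reliabilities are hypothetical, the function names are illustrative, and the difference-score formula is the classical one that assumes the two measures have equal variances; it is not quoted from the source.

```python
def generalizability_coefficient(true_var, error_var):
    """Ratio of estimated true variance to true plus error variance."""
    return true_var / (true_var + error_var)

def difference_score_reliability(r_xx, r_yy, r_xy):
    """Classical reliability of a difference score X - Y.

    Assumes equal variances for X and Y. r_xx and r_yy are the two
    reliabilities and r_xy is the correlation between the measures.
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Hypothetical variance components from a variance components analysis.
g = generalizability_coefficient(true_var=80.0, error_var=20.0)

# Two highly reliable measures (.90 each) that correlate .80 with each other.
r_dd = difference_score_reliability(0.90, 0.90, 0.80)

print(round(g, 2))     # 0.8
print(round(r_dd, 2))  # 0.5
```

Even with two measures of .90 reliability, the difference score in this example is only about .50 reliable, because most of the shared true variance cancels when the scores are subtracted.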
Evaluating a Test's Reliability
A test cannot be said to have a single or overall level of reliability.
Rather, tests can be said to exhibit different kinds of reliability,
the relative importance of which will vary depending
on how the test is to be used. Moreover, each kind of reliability
may vary across different populations. For instance, a test
may be highly reliable in normally functioning adults, but be
highly unreliable in young children or in individuals with
neurological illness. It is important to note that while high reliability
is a prerequisite for high validity, the latter does not
follow automatically from the former. For example, height
can be measured with great reliability, but it is not a valid index
of intelligence. It is usually preferable to choose a test of
slightly lesser reliability if it can be demonstrated that the test
is associated with a meaningfully higher level of validity
(Nunnally & Bernstein, 1994).
Some have argued that internal reliability is more important
than other forms of reliability; thus, if alpha is low but
test-retest reliability is high, a test should not be considered
reliable (Nunnally, 1978, as cited by Cicchetti, 1989). Note
that it is possible to have low alpha values and high test-retest
reliability (if a measure is made up of heterogeneous items
but yields the same responses at retest), or low alpha values
but high interrater reliability (if the test is heterogeneous in
item content but yields highly consistent scores across
trained experts; an example would be a mental status examination).
Internal consistency is therefore not necessarily the
primary index of reliability, but should be evaluated within
the broader context of test-retest and interrater reliability
(Cicchetti, 1989).
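The combination of low alpha and high test-retest reliability can be reproduced with a deliberately extreme Python sketch. The response matrix is hypothetical: three yes/no items constructed to be mutually uncorrelated, answered identically at retest. The function names are illustrative, and the alpha formula is the standard variance-based one, not quoted from the source.

```python
from math import sqrt
from statistics import variance

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cronbach_alpha(item_scores):
    """Coefficient alpha for rows of examinees by columns of items."""
    k = len(item_scores[0])
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses of 8 examinees to 3 unrelated yes/no items.
time1 = [
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
    [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
]
# Every examinee gives exactly the same responses at retest.
time2 = [row[:] for row in time1]

alpha = cronbach_alpha(time1)
retest_r = pearson_r([sum(r) for r in time1], [sum(r) for r in time2])

print(abs(alpha) < 1e-9)    # True: alpha is numerically zero
print(round(retest_r, 3))   # 1.0
```

Real tests fall between these extremes, but the sketch shows why a low alpha alone does not establish that scores are unstable.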
Some argue that test-retest reliability is not as important as
other forms of reliability if the test will only be used once and
is not likely to be administered again in future. However, depending
on the nature of the test and retest sampling procedures
(as discussed previously), stability coefficients may
provide valuable insight into the replicability of test results,
particularly as these coefficients are a gauge of "real-world"
reliability rather than accuracy of measurement of true scores
or hypothetical reliability across infinite randomly parallel
forms (as is internal reliability). In addition, as was stated previously,
clinical decision making will almost always be based
on the obtained score. Therefore, it is critically important to
know the degree to which scores are replicable at retesting,
whether or not the test may be used again in future.
It is our belief that test users should take an informed
and pragmatic, rather than dogmatic, approach to evaluating
reliability of tests used to inform diagnosis or other clinical
decisions. If a test has been designed to measure a single, one-dimensional
construct, then high internal consistency reliability
should be considered an essential property. High test-retest
reliability should also be considered an essential property unless
the test is designed to measure state variables that are expected
to fluctuate, or if systematic factors such as practice
effects attenuate stability coefficients.
What Is an Adequate Reliability Coefficient?
The reliability coefficient can be interpreted directly in terms
of the percentage of score variance attributed to different
sources (i.e., unlike the correlation coefficient, which must be
squared). Thus, with a reliability of .85, 85% of the variance
can be attributed to the trait being measured, and 15% can be
attributed to error variance (Anastasi & Urbina, 1997). When
all sources of variance are known for the same group (i.e.,
when one knows the reliability coefficients for internal, test-retest,
alternate form, and interrater reliability on the same
sample), it is possible to calculate the true score variance (for
an example, see Anastasi & Urbina, 1997, pp. 101-102). As
noted above, although a detailed discussion of this topic is beyond
the scope of this volume, the portioning of total score
variance into components is the crux of generalizability theory
of reliability, which forms the basis for reliability estimates
for many well-known speed tests (e.g., Wechsler scale
subtests such as Digit Symbol).
Sattler (2001) notes that reliabilities of .80 or higher are
needed for tests used in individual assessment. Tests used for
decision making should have reliabilities of .90 or above. Nunnally
and Bernstein (1994) note that a reliability of .90 is a
"bare minimum" for tests used to make important decisions
about individuals (e.g., IQ tests), and .95 should be the optimal
standard. When important decisions will be based on test
scores (e.g., placement into special education), small score differences
can make a great difference to outcome, and precision
is paramount. They note that even with a reliability of .90, the
SEM is almost one-third as large as the overall SD of test scores.
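The "almost one-third" figure can be verified with a short Python sketch. It assumes the standard classical-test-theory definition of the standard error of measurement, SEM = SD x sqrt(1 - reliability), which this passage does not spell out; the function name and the SD of 15 are illustrative.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM under classical test theory: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1.0 - reliability)

# SEM expressed in SD units for a reliability of .90.
sem_in_sd_units = standard_error_of_measurement(1.0, 0.90)

# The same reliability on a hypothetical scale with SD = 15.
sem_standard_score = standard_error_of_measurement(15.0, 0.90)

print(round(sem_in_sd_units, 3))     # 0.316
print(round(sem_standard_score, 2))  # 4.74
```

A SEM of roughly 0.32 SD is close to one-third of the score SD, consistent with the statement attributed to Nunnally and Bernstein (1994).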
Given these issues, what is a clinically acceptable level of
reliability? According to Sattler (2001), tests with reliabilities
below .60 are unreliable; those above .60 are marginally reliable,
and those above .70 are relatively reliable. Of note, tests
with reliabilities of .70 may be sufficient in the early stages of
validation research to determine whether the test correlates
with other validation evidence; if so, additional effort can be
expended to increase reliabilities to more acceptable levels
(e.g., .80) by reducing measurement error (Nunnally & Bernstein,
1994). In outcome studies using psychological tests, internal
consistencies of .80 to .90 and test-retest reliabilities of
.70 are considered a minimum acceptable standard (Andrews
et al., 1994; Burlingame et al., 1995).
Table 1-4 Magnitude of Reliability Coefficients
Magnitude of Coefficient
Very high (.90+)
High (.80-.89)
Adequate (.70-.79)
Marginal (.60-.69)
Low (<.59)
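The bands in Table 1-4 can be applied mechanically, as in the following Python sketch. The function name is hypothetical, and one assumption is needed: the table leaves values between .59 and .60 unassigned, so the sketch treats anything below .60 as "Low".

```python
def reliability_label(coefficient):
    """Map a reliability coefficient onto the Table 1-4 labels.

    Assumption: values below .60 are treated as "Low", which closes the
    small gap between "<.59" and ".60-.69" in the table.
    """
    if coefficient >= 0.90:
        return "Very high"
    if coefficient >= 0.80:
        return "High"
    if coefficient >= 0.70:
        return "Adequate"
    if coefficient >= 0.60:
        return "Marginal"
    return "Low"

# Hypothetical coefficients of the kind reported in test manuals.
for value in (0.93, 0.85, 0.74, 0.61, 0.45):
    print(value, reliability_label(value))
```

Such a lookup only labels the coefficient. Whether a given level is acceptable still depends on the kind of reliability and on the decision the score will inform.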
In terms of internal reliability of neuropsychological tests,
Cicchetti et al. (1990) have proposed that internal consistency
estimates of less than .70 are unacceptable, reliabilities
between .70 and .79 are fair, reliabilities between .80 and .89 are
good, and reliabilities above .90 are excellent.
For interrater reliabilities, Cicchetti and Sparrow (1981)
report that clinical significance is poor for reliability coefficients
below .40, fair between .40 and .59, good between .60
and .74, and excellent between .75 and 1.00. Fastenau et al.
(1996), in summarizing guidelines on the interpretation of
intraclass correlations and kappa coefficients for interrater reliability,
consider coefficients larger than .60 as substantial and
of .75 or .80 as almost perfect.
These are the general guidelines that we have used
throughout the text to evaluate the reliability of neuropsychological
tests (see Table 1-4) so that the text can be used as a
reference when selecting tests with the highest reliability.
Users should note that there is a great deal of variability with
regard to the acceptability of reliability coefficients for neuropsychological
tests, as perusal of this volume will indicate.
In general, for tests involving multiple subtests and multiple
scores (e.g., Wechsler scales, NEPSY, D-KEFS), including
those derived from qualitative observations of performance
(e.g., error analyses), the farther away a score gets from the
composite score itself and the more difficult the score is to
quantify, the lower the reliability. A quick review of the reliability
data presented in this volume also indicates that verbal
tests, with few exceptions, tend to have consistently higher reliability
than tests measuring other cognitive domains.
Lastly, as previously discussed, reliability coefficients do
not provide complete information on the reproducibility of
individual test scores. Thus, with regard to test-retest
reliability, it is possible for a test to have high reliability (r = .80) but
have retest means that are 10 points higher than baseline
scores. Reliability coefficients do not provide information on
whether individuals retain their relative place in the
distribution from baseline to retest. Procedures such as the Bland-Altman
method (Altman & Bland, 1983; Bland & Altman,
1986) are one way to determine the limits of agreement
between two assessments for individuals in a group.
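A minimal Python sketch of the limits-of-agreement idea follows. The scores are hypothetical, the function name is illustrative, and the sketch uses the common convention of the mean difference plus or minus 1.96 standard deviations of the differences, which assumes roughly normally distributed differences.

```python
from statistics import mean, stdev

def limits_of_agreement(first, second):
    """Return (bias, lower limit, upper limit) for paired assessments.

    Bias is the mean of (second - first); the limits are bias +/- 1.96
    sample standard deviations of those differences.
    """
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread

# Hypothetical baseline and retest scores for seven examinees.
baseline = [88, 94, 100, 103, 109, 115, 121]
retest = [100, 102, 110, 112, 120, 125, 131]

bias, lower, upper = limits_of_agreement(baseline, retest)
print(bias)                               # 10
print(round(lower, 2), round(upper, 2))   # 7.47 12.53
```

In this example the two administrations rank examinees almost identically, yet retest scores run about 10 points higher, which is exactly the information a correlation coefficient alone would hide.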
MEASUREMENT ERROR
A good working understanding of conceptual issues and methods
of quantifying measurement error is essential for competent
clinical practice. We start our discussion of this topic with
concepts arising from classical test theory.
True Scores
A central element of classical test theory is the concept of a
true score, or the score an examinee would obtain on a measure
in the absence of any measurement error (Lord & Novick,
1968). True scores can never be known. Instead, they are estimated,
and are conceptually defined as the mean score an examinee
would obtain across an infinite number of randomly
parallel forms of a test, assuming that the examinee's scores
were not systematically affected by test exposure/practice or
other time-related factors such as maturation (Lord & Novick,
1968). In contrast to true scores, obtained scores are the actual
scores yielded by tests. Obtained scores include any measurement
error associated with a given test. That is, they are the
sum of true scores and error.
In the classical model, the relation between obtained and
true scores is expressed in the following formula, where error
(e) is random and all variables are assumed to be normal in
distribution:
When test reliability is less than perfect, as is always the case,
the net effect of measurement error across examinees is to
bias obtained scores outward from the population mean. That
is, scores above the mean are most likely to be higher than
true scores, while those below the mean are most likely to be
lower than true scores (Lord & Novick, 1968). Estimated true
scores correct this bias by regressing obtained scores toward
the normative mean, with the amount of regression depending
on test reliability and deviation of the obtained score from
the mean. The formula for estimated true scores (t') is:
Limits of Reliability
Although it is possible to have a reliable test that is not valid for
some purposes, the converse is not the case (see later). Further,
it is also conceivable that there are some neuropsychological
domains that simply cannot be measured reliably. Thus, even
though there is the assumption that questionable reliability is
always a function of the test, reliability may depend on the nature
of the psychological process measured or on the nature of
the population evaluated. For example, many of the executive
functioning tests reviewed in this volume have relatively modest
reliabilities, suggesting that this ability is difficult to assess
reliably. Additionally, tests used in populations with high response
variability, such as preschoolers, elderly individuals, or
individuals with brain disorders, may invariably yield low
reliability coefficients despite the best efforts of test developers.
$X = t + e$    [3]
Where:
X = obtained score
t = true score
e = error
$t' = \bar{X} + r_{xx}(x - \bar{X})$    [4]
Where:
$\bar{X}$ = mean test score
$r_{xx}$ = test reliability (internal consistency reliability in
classical test theory)
x = obtained score
If working with z scores, the formula is simpler:
$t' = r_{xx} \times z$    [5]
The Use of True Scores in Clinical Practice
ancy between true and obtained scores. For a highly reliable
measure such as Test 1 (r = .95), true score regression is minimal,
even when an obtained score lies a considerable distance
from the sample mean; in this example, a standard score of
130, or two SDs above the mean, is associated with an estimated
true score of 129. In contrast, for a test with low reliability
such as Test 3 (r = .65), true score regression is quite
substantial. For this test, an obtained score of 130 is associated
with an estimated true score of 120; in this case, fully one-third
of the observed deviation is "lost" to regression when the
estimated true score is calculated.
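The figures quoted for Test 1 and Test 3 can be reproduced with a short Python sketch of the estimated true score calculation (normative mean plus reliability times the deviation of the obtained score from that mean). The function names are illustrative, and the normative mean of 100 is an assumption inferred from the statement that 130 lies two SDs above the mean on a standard-score scale.

```python
from math import floor

def estimated_true_score(obtained, reliability, mean=100.0):
    """Estimated true score: mean + reliability * (obtained - mean)."""
    return mean + reliability * (obtained - mean)

def round_half_up(value):
    """Round to the nearest integer, with halves rounded upward.

    The tiny epsilon guards against floating-point values that fall a
    hair below an exact .5.
    """
    return floor(value + 0.5 + 1e-9)

# Obtained standard score of 130 on a highly reliable and a less reliable test.
test1 = estimated_true_score(130, 0.95)   # about 128.5
test3 = estimated_true_score(130, 0.65)   # about 119.5
print(round_half_up(test1), round_half_up(test3))   # 129 120

# Obtained score of 120 on the highly reliable test.
print(round_half_up(estimated_true_score(120, 0.95)))   # 119
```

With r = .65 the estimate gives up 35% of the 30-point deviation from the mean, which is the "fully one-third" regression described in the text.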
Such information may have important implications with
respect to interpretation of test results. For example, as shown
in Table 1-5, as a result of differences in reliability, obtained
scores of 120 on Test 1 and 130 on Test 3 are associated with
essentially equivalent estimated true scores (i.e., 119 and 120,
respectively). If only obtained scores are considered, one
might interpret scores from Test 1 and Test 3 as significantly
different, even though these "differences" actually disappear
when measurement precision is taken into account. It should
also be noted that such differences may not be limited to comparisons
of scores across different tests within the same individual,
but may also apply to comparisons between scores
from the same test across different individuals when the individuals
come from different groups and the test in question
has variable reliability across those groups.
Regression to the mean may also manifest as pronounced
asymmetry of confidence intervals centered on true scores,
relative to obtained scores, as discussed in more detail later.
Although calculation of true scores is encouraged as a means
of gauging the limitations of reliability, it is important to consider
that any significant difference between characteristics of
an examinee and the sample from which a mean sample score
and reliability estimate were derived may invalidate the process.
For example, in some cases it makes little sense to estimate
true scores for severely brain-injured individuals on
tests of cognition using test parameters from healthy normative
samples, as mean scores within the brain-injured population
are likely to be substantially different from those seen in
healthy normative samples; reliabilities may differ
substantially as well. Instead, one may be justified in deriving estimated
true scores using data from a comparable clinical sample
if this is available. Overall, these issues underline the complexities
inherent in comparing scores from different tests in different
populations.
$$\hat{T} = M + r_{xx}(X - M) \qquad [4]$$
[5]
Formula 4 shows that an examinee's estimated true score is
the sum of the mean score of the group to which he or she belongs
(i.e., the normative sample) and the deviation of his or
her obtained score from the normative mean weighted by test
reliability (as derived from the same normative sample). Further,
as test reliability approaches unity (i.e., r = 1.0), estimated
true scores approach obtained scores (i.e., there is little
measurement error, so estimated true scores and obtained
scores are nearly equivalent). Conversely, as test reliability
approaches zero (i.e., when a test is extremely unreliable and
subject to excessive measurement error), estimated true scores
approach the mean test score. That is, when a test is highly
reliable, greater weight is given to obtained scores than to the
normative mean score, but when a test is very unreliable, greater
weight is given to the normative mean score than to obtained
scores. Practically speaking, estimated true scores will always
be closer to the mean than obtained scores are (except, of
course, where the obtained score is at the mean or reliability is
perfect).
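The weighting described above can be checked numerically. The following is a minimal Python sketch, assuming formula 4 has the form implied by the text (the normative mean plus the reliability-weighted deviation of the obtained score from that mean). The function name `estimated_true_score` and its parameter names are illustrative only; the input values are the reliabilities and observed scores of Table 1-5 (M = 100, SD = 15).

```python
def estimated_true_score(obtained, mean, reliability):
    """Regress an obtained score toward the normative mean.

    Assumed form of formula 4: mean + reliability * (obtained - mean).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return mean + reliability * (obtained - mean)


# Reliabilities and observed scores taken from Table 1-5.
for label, r in [("Test 1", 0.95), ("Test 2", 0.80), ("Test 3", 0.65)]:
    estimates = [estimated_true_score(x, 100, r) for x in (110, 120, 130)]
    print(label, r, [f"{t:.1f}" for t in estimates])
```

Values are printed to one decimal place so that the rounding step stays visible; with a reliability of 1.0 the function returns the obtained score unchanged, and with a reliability of 0 it returns the normative mean.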
Although the true score model is abstract, it has practical utility
and important implications for test score interpretation.
For example, what may not be immediately obvious from formulas
4 and 5 is readily apparent in Table 1-5: estimated true
scores translate test reliability (or lack thereof) into the same
metric as actual test scores.
As can be seen in Table 1-5, the degree of regression to the
mean of true scores is inversely related to test reliability and
directly related to degree of deviation from the reference
mean. This means that the more reliable a test is, the closer are
obtained scores to true scores and that the further away the
obtained score is from the sample mean, the greater the discrep-
Table 1-5 Estimated True Score Values for Three Observed Scores
at Three Levels of Reliability
                        Observed Scores (M = 100, SD = 15)
          Reliability   110     120     130
Test 1    .95           110     119     129
Test 2    .80           108     116     124
Test 3    .65           107     113     120
The Standard Error of Measurement
Examiners may wish to quantify the margin of error associated
with using obtained scores as estimates of true scores.
When the sample SD and the reliability of obtained scores are
known, an estimate of the SD of obtained scores about true
scores may be calculated. This value is known as the standard
error of measurement, or SEM (Lord & Novick, 1968). More
simply, the SEM provides an estimate of the amount of error
in a person's observed score. It is a function of the reliability
of the test, and of the variability of scores within the sample.
The SEM is inversely related to the reliability of the test. Thus,
the greater the reliability of the test is, the smaller the SEM is,
and the more confidence the examiner can have in the precision
of the score.
The SEM is defined by the following formula:
$$SEM = SD\sqrt{1 - r_{xx}} \qquad [6]$$
Where:
SD = the standard deviation of the test, as derived from an
appropriate normative sample
r_xx = the reliability coefficient of the test (usually internal
reliability)
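As a brief illustration of how the SEM shrinks as reliability rises, the sketch below evaluates the formula for the three reliabilities used in Table 1-5. The function name `standard_error_of_measurement` is a hypothetical label chosen for this example, and SD = 15 is taken from the standard-score metric used in that table.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r_xx), with r_xx between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# SD = 15 and the reliabilities of Tests 1-3 in Table 1-5.
for r in (0.95, 0.80, 0.65):
    print(f"r_xx = {r:.2f}  SEM = {standard_error_of_measurement(15, r):.2f}")
```

A perfectly reliable test yields an SEM of zero, whereas a reliability of zero yields an SEM equal to the SD itself.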
Confidence Intervals
While the SEM can be considered on its own as an index of
test precision, it is not necessarily intuitively interpretable,
and there is often a tendency to focus excessively on test scores
as point estimates at the expense of consideration of associated
estimation error ranges. Such a tendency to disregard
imprecision is particularly inappropriate when interpreting
scores from tests of lower reliability. Clinically, it may therefore
be very important to report, in a concrete and easily
understandable manner, the degree of precision associated with
specific test scores. One method of doing this is to use
confidence intervals.
The SEM is used to form a confidence interval (or range
of scores), around estimated true scores, within which obtained
scores are most likely to fall. The distribution of obtained scores
about the true score (the error distribution) is assumed to be
normal, with a mean of zero and an SD equal to the SEM;
therefore, the bounds of confidence intervals can be set to include
any desired range of probabilities by multiplying by the
appropriate z value. Thus, if an individual were to take a large
number of randomly parallel versions of a test, the resulting
obtained scores would fall within an interval of ±1 SEM of the
estimated true score 68% of the time, and within ±1.96 SEM
95% of the time (see Table 1-1).
Obviously, confidence intervals for unreliable tests (i.e.,
with a large SEM) will be larger than those for highly reliable
tests. For example, we may again use data from Table 1-5. For
a highly reliable test such as Test 1, a 95% confidence interval
for an obtained score of 110 ranges from 103 to 116. In
contrast, the confidence interval for Test 3, a less reliable test, is
larger, ranging from 89 to 124.
It is important to bear in mind that confidence intervals
for obtained scores that are based on the SEM are centered on
estimated true scores. Such confidence intervals will be symmetric
around obtained scores only when obtained scores are
at the test mean or when reliability is perfect. Confidence
intervals will be asymmetric about obtained scores to the same
degree that true scores diverge from obtained scores. Therefore,
when a test is highly reliable, the degree of asymmetry
will often be trivial, particularly for obtained scores within
one SD of the mean. For tests of lesser reliability, the asymmetry
may be marked. For example, in Table 1-5, consider the
obtained score of 130 on Test 2. The estimated true score in
this case is 124 (see equations 4 and 5). Using equation 6 and
a z-multiplier of 1.96, we find that a 95% confidence interval
for the obtained scores spans ±13 points, or from 111 to 137.
This confidence interval is substantially asymmetric about the
obtained score.
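The worked examples above can be reproduced with a short Python sketch. It assumes the estimated true score is the normative mean plus the reliability-weighted deviation of the obtained score, and that the SEM is SD multiplied by the square root of (1 - r_xx); the function name `sem_confidence_interval` and the default z of 1.96 are choices made for this illustration rather than anything prescribed by the source.

```python
import math


def sem_confidence_interval(obtained, mean, sd, reliability, z=1.96):
    """Return (lower, upper) bounds centered on the estimated true score."""
    true_estimate = mean + reliability * (obtained - mean)
    sem = sd * math.sqrt(1.0 - reliability)
    return true_estimate - z * sem, true_estimate + z * sem


# Cases discussed in the text (M = 100, SD = 15).
cases = [
    ("Test 1, obtained 110", 110, 0.95),
    ("Test 3, obtained 110", 110, 0.65),
    ("Test 2, obtained 130", 130, 0.80),
]
for label, obtained, r in cases:
    lower, upper = sem_confidence_interval(obtained, 100, 15, r)
    # Distances from the obtained score to each bound show the asymmetry.
    print(f"{label}: {lower:.1f} to {upper:.1f} "
          f"(below: {obtained - lower:.1f}, above: {upper - obtained:.1f})")
```

After rounding to whole numbers, the three intervals agree with those reported in the text (103 to 116, 89 to 124, and 111 to 137). For the obtained score of 130 on Test 2, the interval extends roughly 19 points below the obtained score but only about 7 points above it.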
It is also important to note that SEM-based confidence
intervals should not be used for estimating the likelihood of
obtaining a given score at retesting with the same measure, as
effects of prior exposure are not accounted for. In addition,
Nunnally and Bernstein (1994) point out that use of SEM-based
confidence intervals assumes that error distributions
are normally distributed and homoscedastic (i.e., equal in
spread) across the range of scores obtainable for a given test.
However, this assumption may often be violated. A number of
alternate error models do not require these assumptions and
may thus be more appropriate in some circumstances (see
Nunnally and Bernstein, 1994, for a detailed discussion).
Lastly, as with the derivation of estimated true scores, when
an examinee is known to belong to a group that markedly differs
from the normative sample, it may not be appropriate to
derive SEMs and associated confidence intervals using normative
sample parameters (i.e., SD and r_xx), as these would
likely differ significantly from parameters derived from an
applicable clinical sample.
The Standard Error of Estimation
In addition to estimating confidence intervals for obtained
scores, one may also be interested in estimating confidence
intervals for estimated true scores (i.e., the likely range of true
scores about the estimated true score). For this purpose, one
may construct confidence intervals using the standard error of
estimation (SE_E; Lord & Novick, 1968). The formula for this is:
$$SE_E = SD\sqrt{r_{xx}(1 - r_{xx})} \qquad [7]$$
Where:
SD = the standard deviation of the variable being
estimated
r_xx = the test reliability coefficient
The SE_E, like the SEM, is an indication of test precision. As
with the SEM, confidence intervals are formed around estimated
true scores by multiplying the SE_E by a desired z value.
That is, one would expect that over a large number of randomly
parallel versions of a test, an individual's true score would fall
within an interval of ±1 SE_E of the estimated true score 68%
of the time, and fall within ±1.96 SE_E 95% of the time. As with
confidence intervals based on the SEM, those based on the
SE_E will usually not be symmetric around obtained scores. All
of the other caveats detailed previously regarding SEM-based
confidence intervals also apply.
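A small sketch may help to show how the standard error of estimation relates to the SEM. It assumes the conventional form of the standard error of estimation, SD multiplied by the square root of r_xx(1 - r_xx); this form, the function names, and the example values (SD = 15, r_xx = .80, obtained score of 130, mean of 100) are assumptions made for illustration.

```python
import math


def standard_error_of_estimation(sd, reliability):
    """Assumed form: SE_E = SD * sqrt(r_xx * (1 - r_xx))."""
    return sd * math.sqrt(reliability * (1.0 - reliability))


def true_score_interval(obtained, mean, sd, reliability, z=1.96):
    """Likely range of true scores about the estimated true score."""
    true_estimate = mean + reliability * (obtained - mean)
    half_width = z * standard_error_of_estimation(sd, reliability)
    return true_estimate - half_width, true_estimate + half_width


see = standard_error_of_estimation(15, 0.80)
sem = 15 * math.sqrt(1.0 - 0.80)
print(f"SE_E = {see:.2f}, SEM = {sem:.2f}")
print("95% interval for the true score:", true_score_interval(130, 100, 15, 0.80))
```

Under this form the SE_E equals the SEM multiplied by the square root of the reliability, so it is never larger than the SEM and the two converge as reliability approaches 1.0.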
The choice of constructing confidence intervals based on
the SEM versus the SE_E will depend on whether one is more
interested in true scores or obtained scores. That is, while the
SEM is a gauge of test accuracy in that it is used to determine
the expected range of obtained scores about true scores over
parallel assessments (the range of error in measurement of the
true score), the SE_E is a gauge of estimation accuracy in that it
is used to determine the likely range within which true scores
fall (the range of error of estimation of the true score).
Regardless, both SEM-based and SE_E-based confidence intervals
are symmetric with respect to estimated true scores rather
than the obtained scores, and the boundaries of both will be
similar for any given level of confidence interval when a test is
highly reliable.
The Standard Error of Prediction
When the standard deviation of obtained scores for an alternate
form is known, one may calculate the likely range of obtained
scores expected on retesting with an alternate form.
For this purpose, the standard error of prediction (SE_P; Lord &
Novick, 1968) may be used to construct confidence intervals.
The formula for this is:
$$SE_P = SD_y\sqrt{1 - r_{xx}^2} \qquad [8]$$
Where:
SD_y = the standard deviation of the parallel form
administered at retest
r_xx = the reliability of the form used at initial testing
In this case, confidence intervals are formed around estimated
true scores (derived from initial obtained scores) by multiplying
the SE_P by a desired z value. That is, one would expect that
when retested over a large number of randomly parallel versions
of a test, an individual's obtained score would fall within
an interval of ±1 SE_P of the estimated true score 68% of the
time, and fall within ±1.96 SE_P 95% of the time. As with
confidence intervals based on the SEM, those based on the SE_P will
generally not be symmetric around obtained scores. All of the
other caveats detailed previously regarding the SEM-based
confidence intervals also apply. In addition, while it may be
tempting to use SE_P-based confidence intervals for evaluating
significance of change at retesting with the same measure, this
practice violates the assumptions that a parallel form is used
at retest and, particularly, that no prior exposure effects apply.
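The prediction interval for an alternate form can be sketched in the same way. The code below assumes the standard error of prediction is SD_y multiplied by the square root of (1 - r_xx squared) and that the interval is centered on the estimated true score from initial testing; the function names and example values (an initial obtained score of 130, M = 100, SD_y = 15, r_xx = .80) are hypothetical.

```python
import math


def standard_error_of_prediction(sd_retest, reliability):
    """Assumed form: SE_P = SD_y * sqrt(1 - r_xx ** 2)."""
    return sd_retest * math.sqrt(1.0 - reliability ** 2)


def alternate_form_interval(obtained, mean, sd_retest, reliability, z=1.96):
    """Likely range of obtained scores on a parallel form at retest."""
    true_estimate = mean + reliability * (obtained - mean)
    half_width = z * standard_error_of_prediction(sd_retest, reliability)
    return true_estimate - half_width, true_estimate + half_width


print(f"SE_P = {standard_error_of_prediction(15, 0.80):.2f}")
print("95% interval at retest:", alternate_form_interval(130, 100, 15, 0.80))
```

Because the SE_P reflects measurement error at both testings, it is larger than the SEM for the same SD and reliability. The sketch applies only to parallel forms and makes no allowance for practice effects, in keeping with the caveats above.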
SEMs and True Scores: Practical Issues
Nunnally and Bernstein (1994) note that most test manuals
do "an exceptionally poor job of reporting estimated true
scores and confidence intervals for expected obtained scores
on alternative forms. For example, intervals are often erroneously
centered about obtained scores rather than estimated
true scores. Often the topic is not even discussed" (p. 260).
Sattler (2001) also notes that test manuals often base confidence
intervals on the overall SEM for the entire standardization
sample, rather than on SEMs for each age band. Using the
average SEM across age is not always appropriate, given that
some age groups are inherently more variable than others
(e.g., preschoolers versus adults). In general, confidence intervals
based on age-specific SEMs are preferable to those based
on the overall SEM (particularly at the extremes of the age
distribution, where there is the most variability) and can often
be constructed using age-based SEMs found in most manuals.
It is important to acknowledge that while estimated true
scores and associated confidence intervals have merit, there
are practical reasons to focus on obtained scores instead. For
example, essentially all validity studies and actuarial prediction
methods for most tests are based on obtained scores.
Therefore, obtained scores must usually be employed for diagnostic
and other purposes to maintain consistency with prior
research and test usage. For more discussion regarding the
calculation and uses of the SEM, SE_E, SE_P, and alternative error
models, see Dudek (1979), Lord and Novick (1968), and
Nunnally and Bernstein (1994).
VALIDITY
Models of validity are not abstract conceptual frameworks
that are only minimally related to neuropsychological practice.
The Standards for Educational and Psychological Testing
(AERA et al., 1999) state that validation is the joint responsibility
of the test developer and the test user. Thus, a
working knowledge of validity models and the validity
characteristics of specific tests is a central requirement for responsible
and competent test use. From a practical perspective,
a working knowledge of validity allows users to determine
which tests are appropriate for use and which fall below standards
for clinical practice or research utility. Thus, neuropsychologists
who use tests to detect and diagnose neurocognitive
difficulties should be thoroughly familiar with commonly
used validity models and how these can be used to evaluate
neuropsychological tools. Assuming that a test is valid because
it was purchased from a reputable test publisher, appears to
have a large normative sample, or came with a large user's
manual can be a serious error, as some well-known and commonly
used neuropsychological tests are lacking with regard
to crucial aspects of validity.
Definition of Validity
Cronbach and Meehl (1955) were some of the first theorists to
discuss the concept of construct validity. Since then, the basic
definition of validity evolved as testing needs changed over
the years. Although construct validity was first introduced as a
separate type of validity (e.g., Anastasi & Urbina, 1997), it has
moved, in some models, to encompass all types of validity
(e.g., Messick, 1993). In other models, the term "construct
validity" has been deemed redundant and has simply been replaced
by "validity," since all types of validity ultimately inform
as to the construct measured by the test. Accordingly, the
term "construct validity" has not been used in the Standards
for Educational and Psychological Testing since 1974 (AERA
et al., 1999). However, whether it is deemed "construct validity"
or simply "validity," the concept is central to evaluating
the utility of a test in the clinical or research arena.
Test validity may be defined at the most basic level as the
degree to which a test actually measures what it is intended to
measure, or in the words of Nunnally and Bernstein (1994),
"how well it measures what it purports to measure in the context
in which it is to be applied" (p. 112). As with reliability, an
important point to be made here is that a test cannot be said
to have one single level of validity. Rather, it can be said to exhibit
various types and levels of validity across a spectrum of
usage and populations. That is, validity is not a property of a
test, but rather, validity is a property of the meaning attached to
a test score; validity can only arise and be defined in the specific
context of test usage. Therefore, while it is certainly necessary
to understand the validity of tests in particular contexts,
ultimate decisions regarding the validity of test score interpretation
must take into account any unique factors pertaining to
validity at the level of individual assessment, such as deviations
from standard administration, unusual testing environments,
examinee cooperation, and the like.
In the past, assessment of validity was generally test-centric.
That is, test validity was largely indexed by comparison
with other tests, especially "standards" in the field. Since
Cronbach (1971), there has been a move away from test-based
or "measure-centered validity" (Zimiles, 1996) toward the
interpretation and external utility of tests. Messick (1989, 1993)
expanded the definition of validity to encompass an overall
judgment of the extent to which empirical evidence and theoretical
rationales support the adequacy and effectiveness of
interpretations and actions resulting from test scores. Subsequently,
Messick (1995) proposed a comprehensive model of
construct validity wherein six different, distinguishable types
of evidence contribute to construct validity. These are (1)
content related, (2) substantive, (3) structural, (4) generalizability,
(5) external, and (6) consequential evidence sources
(see Table 1-6), and they form the "evidential basis for score
interpretation" (Messick, 1995, p. 743). Likewise, the Standards
for Educational and Psychological Testing (AERA et al.,
1999) follows a model very much like Messick's, where different
kinds of evidence are used to bolster test validity based on
each of the following sources: (1) evidence based on test content,
(2) response processes, (3) internal structure, (4) relations
to other variables, and (5) consequences of testing. The
most controversial aspect of these models is the requirement
for consequential evidence to support validity. Some argue
that judging validity according to whether use of a test results
in positive or negative social consequences is too far-reaching
and may lead to abuses of scientific inquiry, as when a test result
does not agree with the overriding social climate of the
time (Lees-Haley, 1996). Social and ethical consequences, although
crucial, may therefore need to be treated separately
from validity (Anastasi & Urbina, 1997).
Table 1-6 Messick's Model of Construct Validity
Type of Evidence
Substantive
Structural
Generalizability
^a See Lees-Haley (1996) for limitations of this component.
Validity Models
Since Cronbach and Meehl, various models of validity have
been proposed. The most frequently encountered is the tripartite
model whereby validity is divided into three components:
content validity, criterion-related validity, and construct validity
(see Anastasi & Urbina, 1997; Mitrushina et al., 2005; Nunnally
& Bernstein, 1994; Sattler, 2001). Other validity subtypes,
including convergent, divergent, predictive, treatment, clinical,
and face validity, are subsumed within these three domains.
For example, convergent and divergent validity are most often
treated as subsets of construct validity (Sattler, 2001) and concurrent
and predictive validity as subsets of criterion validity
(e.g., Mitrushina et al., 2005). Concurrent and predictive validity
only differ in terms of a temporal gradient; concurrent validity
is relevant for tests used to identify existing diagnoses or
conditions, whereas predictive validity applies when determining
whether a test predicts future outcomes (Anastasi & Urbina,
1997). Although face validity appears to have fallen out
of favor as a type of validity, the extent to which examinees believe
a test measures what it appears to measure can affect motivation,
self-disclosure, and effort. Consequently, face validity
can be seen as a moderator variable affecting concurrent and
predictive validity that can be operationalized and measured
(Bornstein, 1996; Nevo, 1985). Again, all these labels for distinct
categories of validity are ways of providing different types
of evidence for validity and are not, in and of themselves, different
types of validity, as older sources might claim (AERA et al.,
1999; Yun & Ulrich, 2002). Lastly, validity is a matter of degree
rather than an all-or-none property; validity is therefore never
actually "finalized," since tests must be continually reevaluated
as populations and testing contexts change over time (Nunnally
& Bernstein, 1994).
Type of Evidence    Definition
Content related     Relevance, representativeness, and technical
                    quality of test content
Substantive         Theoretical rationales for the test and test
                    responses
Structural          Fidelity of scoring structure to the structure
                    of the construct measured by the test
Generalizability    Scores and interpretations generalize across
                    groups, settings, and tasks
External            Convergent and divergent validity, criterion
                    relevance, and applied utility
Consequential       Actual and potential consequences of test use,
                    relating to sources of invalidity related to
                    bias, fairness, and distributive justice^a
How to Evaluate the Validity of a Test
Pragmatically speaking, all the theoretical models in the world
will be of no utility to the practicing clinician unless they
can be translated into specific, step-by-step procedures for
evaluating a test's validity. Table 1-7 presents a comprehensive
(but not exhaustive) list of specific features users can look for
when evaluating a test and reviewing test manuals. Each is organized
according to the type of validity evidence provided.
For example, construct validity can be assessed via correlations
with other tests, factor analysis, internal consistency
(e.g., subtest intercorrelations), convergent and discriminant
validation (e.g., multitrait-multimethod matrix), experimental
interventions (e.g., sensitivity to treatment), structural
equation modeling, and response processes (e.g., task decomposition,
protocol analysis; Anastasi & Urbina, 1997). Most
importantly, users should also remember that even if all other
conditions are met, a test cannot be considered valid if it is
not reliable (see previous discussion).
It is important to note that not all tests will have sufficient
evidence to satisfy all aspects of validity, but test users should
have a sufficiently broad knowledge of neuropsychological
tools to be able to select one test over another, based on the
quality of the validation evidence available. In essence, we
have used this model to critically evaluate all the tests reviewed
in this volume.
Note that there is a certain degree of overlap between categories
in Table 1-7. For example, correlations between a
specific test and another test measuring IQ can simultaneously
provide criterion-related evidence and construct-related
evidence of validity. Regardless of the terminology, it is important
to understand how specific techniques such as factor
analysis serve to inform the validity of test interpretation
across the range of settings in which neuropsychologists
work.
Table 1-7 Sources of Evidence and Techniques for Critically Evaluating the Validity of Neuropsychological Tests
Type of Evidence: Required Evidence
Content-related:
- Refers to themes, wording, format, tasks, or questions on a test, and administration and scoring
- Description of theoretical model on which test is based
- Review of literature with supporting evidence
- Definition of domain of interest (e.g., literature review, theoretical reasoning)
- Operationalization of definition through thorough and systematic review of test domain from which items are
  to be sampled, with listing of sources (e.g., word frequency sources for vocabulary tests)
- Collection of sample of items large enough to be representative of domain and with sufficient range of difficulty
  for target population
- Selection of panel of judges for expert review, based on specific selection criteria (e.g., academic and practical
  backgrounds or expertise within specific subdomains)
- Evaluation of items by expert panel based on specific criteria concerning accuracy and relevance
- Resolution of judgment conflicts within panel for items lacking cross-panel agreement (e.g., empirical means such
  as Index of Item Congruence; Hambleton, 1980)
Construct-related:
- Formal definition of construct
- Formulation of hypotheses to measure construct
- Gathering empirical evidence of construct validation
- Evaluating psychometric properties of instrument (i.e., reliability)
- Demonstration of test sensitivity to developmental changes, correlation with other tests, group differences studies,
  factor analysis, internal consistency (e.g., correlations between subtests, or to composites within the same test),
  convergent and divergent validation (e.g., multitrait-multimethod matrix), sensitivity to experimental
  manipulation (e.g., treatment sensitivity), structural equation modeling, and analysis of process variables
  underlying test performance
Criterion-related:
- Identification of appropriate criterion
- Identification of relevant sample group reflecting the entire population of interest; if only a subgroup is examined,
  then generalization must remain within subgroup definition (e.g., keeping in mind potential sources of error such
  as restriction of range)
- Analysis of test-criterion relationships through empirical means such as contrasting groups, correlations with
  previously available tests, classification accuracy statistics (e.g., positive predictive power), outcome studies,
  and meta-analysis
Response processes:
- Determining whether performance on the test actually relates to the domain being measured
- Analysis of individual responses to determine the processes underlying performance (e.g., questioning test takers
  about strategy, analysis of test performance with regard to other variables, determining whether the test measures
  the same construct in different populations, such as age)
Source: Adapted from Anastasi & Urbina, 1997; American Educational Research Association et al., 1999; Messick, 1995; and Yun and Ulrich, 2002.
What Is an Adequate Validity Coefficient?
Some investigators have proposed criteria for evaluating evidence
related to criterion validity in outcome assessments. For
instance, Andrews et al. (1994) and Burlingame et al. (1995)
recommend that a minimum level of acceptability for correlations
involving criterion validity is .50. However, Nunnally
and Bernstein (1994) note that validity coefficients rarely exceed
.30 or .40 in most circumstances involving psychological
tests, given the complexities involved in measuring and predicting
human behavior. There are no hard and fast rules
when evaluating evidence supportive of validity, and interpretation
should consider how the test results will be used. Thus,
tests with even quite modest predictive validities (r = .50) may
be of considerable utility, depending on the circumstances in
which they will be used (Anastasi & Urbina, 1997; Nunnally &
Bernstein, 1994), particularly if they serve to significantly increase
the test's "hit rate" over chance. It is also important to
note that in some circumstances, criterion validity may be
measured in a categorical rather than continuous fashion,
such as when test scores are used to inform binary diagnoses
(e.g., demented versus not demented). In these cases, one
would likely be more interested in indices such as predictive
power than other measures of criterion validity (see below for
a discussion of classification accuracy statistics).
USE OF TESTS IN THE CONTEXT OF
SCREENING AND DIAGNOSIS:
CLASSIFICATION ACCURACY STATISTICS
In some cases, clinicians use tests to measure how much of an
attribute (e.g., intelligence) an examinee has, while in other
cases, tests are used to help determine whether or not an examinee
has a specific attribute, condition, or illness that may be
either present or absent (e.g., Alzheimer's disease). In the latter
case, a special distinction in test use may be made. Screening
tests are those which are broadly or routinely used to detect a
specific attribute, often referred to as a condition of interest, or
COI, among persons who are not "symptomatic" but who may
nonetheless have the COI (Streiner, 2003c). Diagnostic tests
are used to assist in ruling in or out a specific condition in persons
who present with "symptoms" that suggest the diagnosis
in question. Another related use of tests is for purposes of prediction
of outcome. As with screening and diagnostic tests, the
outcome of interest may be defined in binary terms: it will either
occur or not occur (e.g., return to the same type and level
of employment). Thus, in all three cases, clinicians will be interested
in the relation of the measure's distribution of scores
to an attribute or outcome that is defined in binary terms.
Typically, data concerning screening or diagnostic accuracy
are obtained by administering a test to a sample of persons
who are also classified, with respect to the COI, by a so-called
gold standard. Those who have the condition according
to the gold standard are labeled COI+, while those who do not
have the condition are labeled COI-. In medicine, the gold
standard is often a highly accurate diagnostic test that is more
expensive and/or has a higher level of associated risk of
morbidity than some new diagnostic method that is being
evaluated for use as a screening measure or as a possible replacement
for the existing gold standard. In neuropsychology,
the situation is often more complex, as the COI may be a psychological
construct (e.g., malingering) for which consensus
with respect to fundamental definitions is lacking or diagnostic
gold standards may not exist. These issues may be less
problematic when tests are used to predict outcome (e.g., return
to work), though other problems that may afflict outcome
data such as intervening variables and sample attrition
may complicate interpretation of predictive accuracy.
The simplest way to relate test results to binary diagnoses or
outcomes is to utilize a cutoff score. This is a single point along
the continuum of possible scores for a given test. Scores at or
above the cutoff classify examinees as belonging to one of two
groups; scores below the cutoff classify examinees as belonging
to the other group. Those who have the COI according to the
test are labeled as Test Positive (Test+), while those who do not
have the COI are labeled Test Negative (Test-).
Table 1-8 shows the relation between examinee classifications
based on test results versus classifications based on a
gold standard measure. By convention, test classification is denoted
by row membership and gold standard classification is
denoted by column membership. Cell values represent the total
number of persons from the sample falling into each of
four possible outcomes with respect to agreement between a
test and respective gold standard. By convention, agreements
between gold standard and test classifications are referred to
as True Positive and True Negative cases, while disagreements
are referred to as False Positive and False Negative cases, with
positive and negative referring to the presence or absence of a
COI as indicated by the test. When considering
outcome data, observed outcome is substituted for the
gold standard. It is important to keep in mind while reading
the following section that while gold standard measures are
often implicitly treated as 100% accurate, this may not always
be the case. Any limitations in accuracy or applicability of a
gold standard or outcome measure need to be accounted for
when interpreting classification accuracy statistics.
Table 1-8 Classification/Prediction Accuracy of a Test in Relation to a "Gold Standard" or Actual
Outcome
                    Gold Standard
Test Result         COI+                  COI-                  Row Total
Test+               A (True Positive)     B (False Positive)    A + B
Test-               C (False Negative)    D (True Negative)     C + D
Column Total        A + C                 B + D                 N = A + B + C + D
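The cross-classification summarized in Table 1-8 can be illustrated with a minimal Python sketch. The function name `cross_classify`, the cutoff of 20, and the six scores with their gold standard classifications are all hypothetical values invented for this example; the sketch also assumes that scores at or above the cutoff are the ones labeled Test+, which would need to be reversed for measures on which low scores indicate the COI.

```python
def cross_classify(scores, has_coi, cutoff):
    """Count the four cells (A, B, C, D) of a table laid out like Table 1-8.

    scores  : test scores, one per examinee
    has_coi : gold standard classification (True = COI+, False = COI-)
    cutoff  : scores at or above this value are treated as Test+ (assumption)
    """
    cells = {"A": 0, "B": 0, "C": 0, "D": 0}
    for score, coi_present in zip(scores, has_coi):
        test_positive = score >= cutoff
        if test_positive and coi_present:
            cells["A"] += 1  # True Positive
        elif test_positive:
            cells["B"] += 1  # False Positive
        elif coi_present:
            cells["C"] += 1  # False Negative
        else:
            cells["D"] += 1  # True Negative
    return cells


# Hypothetical sample of six examinees.
scores = [12, 25, 30, 8, 27, 15]
has_coi = [False, True, True, False, False, True]
cells = cross_classify(scores, has_coi, cutoff=20)
print(cells)
print("Row totals:", cells["A"] + cells["B"], cells["C"] + cells["D"])
print("Column totals:", cells["A"] + cells["C"], cells["B"] + cells["D"])
```

The cell counts depend entirely on where the cutoff is placed and on the accuracy of the gold standard classification supplied, which is why limitations of the gold standard carry over to any statistics derived from the table.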