Psychometrics in Neuropsychological Assessment
with Daniel J. Slick
OVERVIEW
The process of neuropsychological assessment depends to a large extent on the reliability and validity of neuropsychological tests. Unfortunately, not all neuropsychological tests are created equal, and, like any other product, published tests vary in terms of their "quality," as defined in psychometric terms such as reliability, measurement error, temporal stability, sensitivity, specificity, predictive validity, and with respect to the care with which test items are derived and normative data are obtained. In addition to commercial measures, numerous tests developed primarily for research purposes have found their way into wide clinical usage; these vary considerably with regard to psychometric properties. With few exceptions, when tests originate from clinical research contexts, there is often validity data but little else, which makes estimating measurement precision and stability of test scores a challenge.
Regardless of the origins of neuropsychological tests, their competent use in clinical practice demands a good working knowledge of test standards and of the specific psychometric characteristics of each test used. This includes familiarity with the Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 1999) and a working knowledge of basic psychometrics. Texts such as those by Nunnally and Bernstein (1994) and Anastasi and Urbina (1997) outline some of the fundamental psychometric prerequisites for competent selection of tests and interpretation of obtained scores. Other, neuropsychologically focused texts such as Mitrushina et al. (2005), Lezak et al. (2004), Baron (2004), Franklin (2003a), and Franzen (2000) also provide guidance. The following is intended to provide a broad overview of important psychometric concepts in neuropsychological assessment and coverage of important issues to consider when critically evaluating tests for clinical usage. Much of the information provided also serves as a conceptual framework for the test reviews in this volume.
THE NORMAL CURVE
The frequency distributions of many physical, biological, and psychological attributes, as they occur across individuals in nature, tend to conform, to a greater or lesser degree, to a bell-shaped curve (see Figure 1-1). This normal curve or normal distribution, so named by Karl Pearson, is also known as the Gaussian or Laplace-Gauss distribution, after the 18th-century mathematicians who first defined it. The normal curve is the basis of many commonly used statistical and psychometric models (e.g., classical test theory) and is the assumed distribution for many psychological variables.¹
Definition and Characteristics
The normal curve has a number of specific properties. It is unimodal, perfectly symmetrical, and asymptotic at the tails. With respect to scores from measures that are normally distributed, the ordinate, or height of the curve at any point along the x (test score) axis, is the proportion of persons within the sample who obtained a given score. The ordinates for a range of scores (i.e., between two points on the x axis) may also be summed to give the proportion of persons that obtained a score within the specified range. If a specified normal curve accurately reflects a population distribution, then ordinate values are also equivalent to the probability of observing a given score or range of scores when randomly sampling from the population. Thus, the normal curve may also be referred to as a probability distribution.
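The idea that the area over a range of scores gives the proportion of persons expected in that range can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the test mean of 50 and SD of 10 are illustrative assumptions, and `proportion_between` is a hypothetical helper name, not something taken from the text.

```python
from statistics import NormalDist


def proportion_between(low, high, mean, sd):
    """Proportion of a normal population expected to score between low and high."""
    dist = NormalDist(mu=mean, sigma=sd)
    # The cumulative distribution function sums the curve up to a score,
    # so the difference of two values is the area between the two scores.
    return dist.cdf(high) - dist.cdf(low)


# Hypothetical test with a mean of 50 and an SD of 10 (illustrative values only).
within_one_sd = proportion_between(40, 60, mean=50, sd=10)
within_two_sd = proportion_between(30, 70, mean=50, sd=10)

print(f"Within 1 SD of the mean: {within_one_sd:.4f}")
print(f"Within 2 SD of the mean: {within_two_sd:.4f}")
```

Under these assumptions roughly two thirds of scores fall within 1 SD of the mean, which is the same quantity a printed normal-curve table would give.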
Figure 1-1 The normal curve.
A Compendium of Neuropsychological Tests
The normal curve is mathematically defined as follows:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \qquad [1]$$
Where:
x = measurement values (test scores)
μ = the mean of the test score distribution
σ = the standard deviation of the test score distribution
π = the constant pi (3.14 ...)
e = the base of natural logarithms (2.71 ...)
f(x) = the height (ordinate) of the curve for any given test score

Relevance for Assessment
As noted previously, because it is a frequency distribution, the area under any given segment of the normal curve indicates the frequency of observations or cases within that interval. From a practical standpoint, this provides psychologists with an estimate of the "normality" or "abnormality" of any given test score or range of scores (i.e., whether it falls in the center of the bell shape, where the majority of scores lie, or instead, at either of the tail ends, where few scores can be found). The way in which the degree of "normality" or "abnormality" of test scores is quantified varies, but perhaps the most useful and inherently understandable metric is the percentile.

Z Scores and Percentiles
A percentile indicates the percentage of scores that fall at or below a given test score. As an example, we will assume that a given test score is plotted on a normal curve. When all of the ordinate values at and below this test score are summed, the resulting value is the percentile associated with that test score (e.g., a score in the 75th percentile indicates that 75% of the reference sample obtained equal or lower scores).
To convert scores to percentiles, raw scores may be linearly transformed or "standardized" in several ways. The simplest and perhaps most commonly calculated standard score is the z score, which is obtained by subtracting the sample mean score from an obtained score and dividing the result by the sample SD, as shown below:

$$z = \frac{x - \bar{X}}{SD} \qquad [2]$$

Where:
x = measurement value (test score)
X̄ = the mean of the test score distribution
SD = the standard deviation of the test score distribution

The resulting distribution of z scores has a mean of 0 and an SD of 1, regardless of the metric of raw scores from which they were derived. For example, given a mean of 25 and an SD of 5, a raw score of 20 translates into a z score of -1. The percentile corresponding to any resulting z score can then be easily looked up in tables available in most statistical texts. Z score conversions to percentiles are also shown in Table 1-1.

Interpretation of Percentiles
An important property of the normal curve is that the relationship between raw or z scores (which for purposes of this discussion are equivalent, since they are linear transformations of each other) and percentiles is not linear. That is, a constant difference between raw or z scores will be associated with a variable difference in percentile scores, as a function of the distance of the two scores from the mean. This is due to the fact that there are proportionally more observations (scores) near the mean than there are farther from the mean; otherwise, the distribution would be rectangular, or non-normal. This can readily be seen in Figure 1-2, which shows the normal distribution with demarcation of z scores and corresponding percentile ranges.
The nonlinear relation between z scores and percentiles has important interpretive implications. For example, a one-point difference between two z scores may be interpreted differently, depending on where the two scores fall on the normal curve. As can be seen, the difference between a z score of 0 and a z score of +1 is 34 percentile points, because 34% of scores fall between these two z scores (i.e., the scores being compared are at the 50th and 84th percentiles). However, the difference between a z score of +2 and a z score of +3 is less than 3 percentile points, because only 2.5% of the distribution falls between these two points (i.e., the scores being compared are at the 98th and 99.9th percentiles). On the other hand, interpretation of percentile-score differences is also not straightforward, in that an equivalent "difference" between two percentile rankings may entail different clinical implications if the scores occur at the tail end of the curve than if they occur near the middle of the distribution. For example, a 30-point difference between scores at the 1st percentile versus the 31st percentile may be more clinically meaningful than the same difference between scores at the 35th percentile versus the 65th percentile.

Linear Transformation of Z Scores: T Scores and Other Standard Scores
In addition to the z score, linear transformation can be used to produce other standardized scores that have the same properties with regard to easy conversion via table look-up (see Table 1-1). The most common of these are T scores (M = 50, SD = 10), scaled scores, and standard scores such as those used in most IQ tests (M = 10, SD = 3, and M = 100, SD = 15). It must be remembered that z scores, T scores, standard scores, and percentile equivalents are derived from samples; although these are often treated as population values, any limitations of generalizability due to reference sample composition or testing circumstances must be taken into consideration when standardized scores are interpreted.
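The z score formula, its conversion to a percentile, and the linear rescaling to T scores, scaled scores, and IQ-type standard scores can be sketched in a few lines. This is an illustration only: it assumes a normal distribution, uses Python's standard-library `statistics.NormalDist` in place of a printed table, and the function names (`z_score`, `percentile_from_z`, `rescale`) are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_score(raw, sample_mean, sample_sd):
    """Equation [2]: distance of a raw score from the sample mean, in SD units."""
    return (raw - sample_mean) / sample_sd


def percentile_from_z(z):
    """Percentage of a normal distribution falling at or below z."""
    return 100.0 * STANDARD_NORMAL.cdf(z)


def rescale(z, new_mean, new_sd):
    """Linear transformation of z to another standard-score metric."""
    return new_mean + new_sd * z


# Worked example from the text: mean 25, SD 5, raw score 20.
z = z_score(20, sample_mean=25, sample_sd=5)
print("z score:", z)
print("percentile:", round(percentile_from_z(z), 1))
print("T score:", rescale(z, 50, 10))
print("scaled score:", rescale(z, 10, 3))
print("IQ-type standard score:", rescale(z, 100, 15))

# The same one-point z difference covers very different percentile ranges.
gap_near_mean = percentile_from_z(1) - percentile_from_z(0)
gap_in_tail = percentile_from_z(3) - percentile_from_z(2)
print(f"z 0 to +1: {gap_near_mean:.1f} percentile points")
print(f"z +2 to +3: {gap_in_tail:.1f} percentile points")
```

Because every metric here is a linear function of z, all of them share the same percentile equivalents; only the percentile conversion itself is nonlinear.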
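Table 1-1 Score Conversion Table

IQ(a)  | T     | SS(b) | Percentile | -z/+z     | Percentile | SS(b) | T     | IQ(a)
-------+-------+-------+------------+-----------+------------+-------+-------+--------
≤55    | ≤20   | ≤1    | ≤0.1       | ≥3.00     | ≥99.9      | ≥19   | ≥80   | ≥145
56-60  | 21-23 | 2     | <1         | 2.67-2.99 | >99        | 18    | 77-79 | 140-144
61-67  | 24-27 | 3     | 1          | 2.20-2.66 | 99         | 17    | 73-76 | 133-139
68-70  | 28-30 | 4     | 2          | 1.96-2.19 | 98         | 16    | 70-72 | 130-132
71-72  | 31    |       | 3          | 1.82-1.95 | 97         |       | 69    | 128-129
73-74  | 32-33 |       | 4          | 1.70-1.81 | 96         |       | 67-68 | 126-127
75-76  | 34    | 5     | 5          | 1.60-1.69 | 95         | 15    | 66    | 124-125
77     |       |       | 6          | 1.52-1.59 | 94         |       |       | 123
78     | 35    |       | 7          | 1.44-1.51 | 93         |       | 65    | 122
79     | 36    |       | 8          | 1.38-1.43 | 92         |       | 64    | 121
80     |       | 6     | 9          | 1.32-1.37 | 91         | 14    |       | 120
81     | 37    |       | 10         | 1.26-1.31 | 90         |       | 63    | 119
       |       |       | 11         | 1.21-1.25 | 89         |       |       |
82     | 38    |       | 12         | 1.16-1.20 | 88         |       | 62    | 118
83     |       |       | 13         | 1.11-1.15 | 87         |       |       | 117
84     | 39    |       | 14         | 1.06-1.10 | 86         |       | 61    | 116
       |       |       | 15         | 1.02-1.05 | 85         |       |       |
85     | 40    | 7     | 16         | .98-1.01  | 84         | 13    | 60    | 115
       |       |       | 17         | .94-.97   | 83         |       |       |
86     | 41    |       | 18         | .90-.93   | 82         |       | 59    | 114
87     |       |       | 19         | .86-.89   | 81         |       |       | 113
       |       |       | 20         | .83-.85   | 80         |       |       |
88     | 42    |       | 21         | .79-.82   | 79         |       | 58    | 112
       |       |       | 22         | .76-.78   | 78         |       |       |
89     |       |       | 23         | .73-.75   | 77         |       |       | 111
       | 43    |       | 24         | .70-.72   | 76         |       | 57    |
90     |       | 8     | 25         | .66-.69   | 75         | 12    |       | 110
       |       |       | 26         | .63-.65   | 74         |       |       |
91     | 44    |       | 27         | .60-.62   | 73         |       | 56    | 109
       |       |       | 28         | .57-.59   | 72         |       |       |
       |       |       | 29         | .54-.56   | 71         |       |       |
92     |       |       | 30         | .52-.53   | 70         |       |       | 108
       | 45    |       | 31         | .49-.51   | 69         |       | 55    |
93     |       |       | 32         | .46-.48   | 68         |       |       | 107
       |       |       | 33         | .43-.45   | 67         |       |       |
94     | 46    |       | 34         | .40-.42   | 66         |       | 54    | 106
       |       |       | 35         | .38-.39   | 65         |       |       |
       |       |       | 36         | .35-.37   | 64         |       |       |
95     |       | 9     | 37         | .32-.34   | 63         | 11    |       | 105
       | 47    |       | 38         | .30-.31   | 62         |       | 53    |
96     |       |       | 39         | .27-.29   | 61         |       |       | 104
       |       |       | 40         | .25-.26   | 60         |       |       |
       |       |       | 41         | .22-.24   | 59         |       |       |
97     | 48    |       | 42         | .19-.21   | 58         |       | 52    | 103
       |       |       | 43         | .17-.18   | 57         |       |       |
       |       |       | 44         | .14-.16   | 56         |       |       |
98     |       |       | 45         | .12-.13   | 55         |       |       | 102
       | 49    |       | 46         | .09-.11   | 54         |       | 51    |
99     |       |       | 47         | .07-.08   | 53         |       |       | 101
       |       |       | 48         | .04-.06   | 52         |       |       |
       |       |       | 49         | .02-.03   | 51         |       |       |
100    | 50    | 10    | 50         | .00-.01   | 50         | 10    | 50    | 100

(a) M = 100, SD = 15; (b) M = 10, SD = 3.
Note: SS = Scaled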
Figure 1-2 The normal curve demarcated by z scores.

The Meaning of Standardized Test Scores: Score Interpretation
As well as facilitating translation of raw scores to estimated population ranks, standardization of test scores, by virtue of conversion to a common metric, facilitates comparison of scores across measures. However, this is only advisable when the raw score distributions for tests that are being compared are approximately normal in the population. In addition, if standardized scores are to be compared, they should be derived from similar samples, or more ideally, from the same sample. A score at the 50th percentile on a test normed on a population of university students does not have the same meaning as an "equivalent" score on a test normed on a population of elderly individuals. When comparing test scores, one must also take into consideration both the reliability of the two measures and their intercorrelation before determining if a significant difference exists (see Crawford & Garthwaite, 2002). In some cases, relatively large disparities between standard scores may not actually reflect reliable differences, and therefore may not be clinically meaningful. Furthermore, statistically significant or reliable differences between test scores may be common in a reference sample; therefore, the base rate of differences must also be considered, depending on the level of the scores (an IQ of 90 versus 110 as compared to 110 versus 130). One should also keep in mind that when test scores are not normally distributed, standardized scores may not accurately reflect actual population rank. In these circumstances, differences between standard scores may be misleading.
Note also that comparability across tests does not imply equality in meaning and relative importance of scores. For example, one may compare standard scores on measures of pitch discrimination and intelligence, but it will rarely be the case that these scores are of equal clinical or practical meaning or significance.
Interpreting Extreme Scores
A final critical issue with respect to the meaning of standardized scores (e.g., z scores) has to do with extreme observations. In clinical practice, one may encounter standard scores that are either extremely low or extremely high. The meaning and comparability of such scores will depend critically on the characteristics of the normative sample from which they derive.
For example, consider a hypothetical case in which an examinee obtains a raw score that is below the range of scores found in a normal sample. Suppose further that the SD in the normal sample is very small and thus the examinee's raw score translates to a z score of -5, indicating that the probability of encountering this score in the normal population would be 3 in 10 million (i.e., a percentile ranking of .00003). This represents a considerable extrapolation from the actual normative data, as (1) the normative sample did not include 10 million individuals and (2) not a single individual in the normative sample obtained a score anywhere close to the examinee's score. The percentile value is therefore an extrapolation and confers a false sense of precision. While one may be confident that it indicates impairment, there may be no basis to assume that it represents a meaningfully "worse" performance than a z score of -3, or of -4.
The estimated prevalence value of an obtained z score (or T score, etc.) can be calculated to determine whether interpretation of extreme scores may be appropriate. This is simply accomplished by inverting the percentile score corresponding to the z score (i.e., dividing 1 by the percentile score). For example, a z score of -4 is associated with an estimated frequency of occurrence or prevalence of approximately 0.00003. Dividing 1 by this value gives a rounded result of 31,560. Thus, the estimated prevalence value of this score in the population is 1 in 31,560. If the normative sample from which a z score is derived is considerably smaller than the denominator of the estimated prevalence value (i.e., 31,560 in the example), then some caution may be warranted in interpreting the percentile. In addition, whenever such extreme scores are being interpreted, examiners should also verify that the examinee's raw score falls within the range of raw scores in the normative sample. If the normative sample size is substantially smaller than the estimated prevalence sample size and the examinee's score falls outside the sample range, then considerable caution may be indicated in interpreting the percentile associated with the standardized score. Regardless of the z score value, it must also be kept in mind that interpretation of the associated percentile value may not be justifiable if the normative sample has a significantly non-normal distribution (see later for further discussion of non-normality). In sum, the clinical interpretation of extreme scores depends to a large extent on the properties of the normal samples involved; one can have more confidence that the percentile is reasonably accurate if the normative sample is large and well constructed and the shape of the normative sample distribution is approximately normal, particularly in tail regions where extreme scores are found.
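The prevalence check described for extreme scores (invert the proportion associated with the z score, then compare the resulting "1 in N" against the normative sample) can be sketched as follows. This assumes a normal distribution and uses the exact normal CDF from Python's `statistics.NormalDist`, so the result for z = -4 is close to, but not identical with, the rounded figure quoted in the text. The helper names and the normative sample values (200 people, raw scores from 12 to 48) are hypothetical.

```python
from statistics import NormalDist


def estimated_prevalence(z):
    """Tail proportion and 'one in N' prevalence for a z score, assuming normality."""
    tail_proportion = NormalDist().cdf(-abs(z))
    return tail_proportion, 1.0 / tail_proportion


def extrapolation_flags(z, raw_score, normative_n, sample_min, sample_max):
    """Flag the two warning signs described in the text for an extreme score."""
    _, one_in_n = estimated_prevalence(z)
    return {
        # The normative sample is smaller than the prevalence denominator.
        "sample_smaller_than_prevalence_denominator": normative_n < one_in_n,
        # The examinee's raw score was never observed in the normative sample.
        "raw_score_outside_normative_range": not (sample_min <= raw_score <= sample_max),
    }


proportion, one_in_n = estimated_prevalence(-4)
print(f"z = -4: proportion = {proportion:.7f}, about 1 in {one_in_n:,.0f}")

# Hypothetical normative sample: 200 people with raw scores from 12 to 48.
flags = extrapolation_flags(z=-4, raw_score=5, normative_n=200,
                            sample_min=12, sample_max=48)
print(flags)
```

When both flags are raised, the percentile is an extrapolation well beyond the data, which is the situation in which the text advises considerable caution.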
The Normal Curve and Test Construction
Although the normal curve is from many standpoints an ideal or even expected distribution for psychological data, test score samples do not always conform to a normal distribution. When a new test is constructed, non-normality can be "corrected" by examining the distribution of scores on the prototype test, adjusting test properties, and resampling until a normal distribution is reached. For example, when a test is first administered during a try-out phase and a positively skewed distribution is obtained (i.e., with most scores clustering at the tail end of the distribution), the test likely has too high a floor, causing most examinees to obtain low scores. Easy items can then be added so that the majority of scores fall in the middle of the distribution rather than at the lower end (Anastasi & Urbina, 1997). When this is successful, the greatest numbers of individuals obtain about 50% of items correct. This level of difficulty usually provides the best differentiation between individuals at all ability levels (Anastasi & Urbina, 1997).
It must be noted that a test with a normal distribution in the general population may show extreme skew or other divergence from normality when administered to a population that differs considerably from the average individual. For example, a vocabulary test that produces normally distributed scores in a general sample of individuals may display a negatively skewed distribution due to a low ceiling when administered to doctoral students in literature, and a positively skewed distribution due to a high floor when administered to preschoolers from recently immigrated, Spanish-speaking families (see Figure 1-3 for examples of positive and negative skew). In this case, the test would be incapable of effectively discriminating between individuals within either group because of ceiling effects and floor effects, respectively, even though it is of considerable utility in the general population. Thus, a test's distribution, including floors and ceilings, must always be considered when assessing individuals who differ from the normative sample in terms of characteristics that affect test scores (e.g., in this example, degree of exposure to English words). In addition, whether a test produces a normal distribution (i.e., without positive or negative skew) is also an important aspect of evaluating tests for bias across different populations (see Chapter 2 for more discussion of bias).

Figure 1-3 Skewed distributions (panels: Positive Skew, Negative Skew).

Depending on the characteristics of the construct being measured and the purpose for which a test is being designed, a normal distribution of scores may not be obtainable or even desirable. For example, the population distribution of the construct being measured may not be normally distributed. Alternatively, one may want only to identify and/or discriminate between persons at only one end of a continuum of abilities (e.g., a creativity test for gifted students). In this case, the characteristics of only one side of the sample score distribution (i.e., the upper end) are critical, while the characteristics on the other side of the distribution are of no particular concern. The measure may even be deliberately designed to have floor or ceiling effects. For example, if one is not interested in one tail (or even one-half) of the distribution, items that would provide discrimination in that region may be omitted to save administration time. In this case, a test with a high floor or low ceiling in the general population (and with positive or negative skew) may be more desirable than a test with a normal distribution. In most applications, however, a more normal-looking curve within the targeted subpopulation is usually desirable.

Non-Normality
Although the normal curve is an excellent model for psychological data and many sample distributions of natural processes are approximately normal, it is not unusual for test score distributions to be markedly non-normal, even when samples are large (Micceri, 1989).² For example, neuropsychological tests such as the Boston Naming Test (BNT) and Wisconsin Card Sorting Test (WCST) do not have normal distributions when raw scores are examined, and, even when demographic correction methods are applied, some tests continue to show a non-normal, multimodal distribution in some populations (Fastenau, 1998). (An example of a non-normal distribution is shown in Figure 1-4.)

Figure 1-4 A non-normal test score distribution (axes: raw score and percentiles; Mean = 50, SD = 10).

The degree to which a given distribution approximates the underlying population distribution increases as the number of observations (N) increases and becomes less accurate as N decreases. This has important implications for norms comprised of small samples. Thus, a larger sample will produce a more normal distribution, but only if the underlying population distribution from which the sample is obtained is normal. In other words, a large N does not "correct" for non-normality of an underlying population distribution. However,
small samples may yield non-normal distributions due to random sampling effects, even though the population from which the sample is drawn has a normal distribution. That is, one may not automatically assume, given a non-normal distribution in a small sample, that the population distribution is in fact non-normal (note that the converse may also be true).
Several factors may lead to non-normal test score distributions: (a) the existence of discrete subpopulations within the general population with differing abilities, (b) ceiling or floor effects, and (c) treatment effects that change the location of means, medians, and modes and affect variability and distribution shape (Micceri, 1989).
Skew 
As with the normal curve, some varieties of non-normality may be characterized mathematically. Skew is a formal measure of asymmetry in a frequency distribution that can be calculated using a specific formula (see Nunnally & Bernstein, 1994). It is also known as the third moment of a distribution (the mean and variance are the first and second moments, respectively). A true normal distribution is perfectly symmetrical about the mean and has a skew of zero. A non-normal but symmetric distribution will have a skew value that is near zero. Negative skew values indicate that the left tail of the distribution is heavier (and often more elongated) than the right tail, which may be truncated, while positive skew values indicate that the opposite pattern is present (see Figure 1-3). When distributions are skewed, the mean and median are not identical because the mean will not be at the midpoint in rank and z scores will not accurately translate into sample percentile rank values. The error in mapping of z scores to sample percentile ranks increases as skew increases.
Truncated Distributions
Significant skew often indicates the presence of a truncated distribution. This may occur when the range of scores is restricted on one side but not the other, as is the case, for example, with reaction time measures, which cannot be lower than several hundred milliseconds, but can reach very high positive values in some individuals. In fact, distributions of scores from reaction time measures, whether aggregated across trials on an individual level or across individuals, are often characterized by positive skew and positive outliers. Mean values may therefore be positively biased with respect to the "central tendency" of the distribution as defined by other indices, such as the median. Truncated distributions are also commonly seen on error scores. A good example of this is Failure to Maintain Set (FMS) scores on the WCST (see review in this volume). In the normative sample of 30- to 39-year-old persons, observed raw scores range from 0 to 21, but the majority of persons (84%) obtain scores of 0 or 1, and less than 1% obtain scores greater than 3.
Floor/Ceiling Effects
Floor and ceiling effects may be defined as the presence of truncated tails in the context of limitations in range of item difficulty. For example, a test may be said to have a high floor when a large proportion of the examinees obtain raw scores at or near the lowest possible score. This may indicate that the test lacks a sufficient number and range of easier items. Conversely, a test may be said to have a low ceiling when the opposite pattern is present (i.e., when a high number of examinees obtain raw scores at or near the highest possible score). Floor and ceiling effects may significantly limit the usefulness of a measure. For example, a measure with a high floor may not be suitable for use with low functioning examinees, particularly if one wishes to delineate level of impairment.
Multimodality and Other Types of Non-Normality
Multimodality is the presence of more than one "peak" in a frequency distribution (see histogram in Figure 1-4 for an example). Another form of significant non-normality is the uniform or near-uniform distribution (a distribution with no or minimal peak and relatively equal frequency across scores). When such distributions are present, linearly transformed scores (z scores, T scores, and other deviation scores) may be totally inaccurate with respect to actual sample/population percentile rank and should not be interpreted in that framework. In these cases, sample-derived rank percentile scores may be more clinically useful.
Non-Normality and Percentile Derivations
Non-normality is not trivial; it has major implications for derivation and interpretation of standard scores and comparison of such scores across tests: standardized scores derived by linear transformation (e.g., z scores) will not correspond to sample percentiles, and the degree of divergence may be quite large.
Consider the histogram in Figure 1-4, which shows the distribution of scores obtained for a hypothetical test. This test, with a sample size of 1000, has a mean raw score of 50 and a standard deviation of 10; therefore (and very conveniently), no linear transformation is required to obtain T scores. An expected normal distribution based on the observed mean and standard deviation has been overlaid on the observed histogram for purposes of comparison.
The histogram in Figure 1-4 shows that the distribution of scores for the hypothetical test is grossly non-normal, with a truncated lower tail and significant positive skew, indicating floor effects and the existence of two distinct subpopulations. If the distribution were normal (i.e., if we follow the normal curve, superimposed on the histogram in Figure 1-4, instead of the histogram itself), a raw score of 40 would correspond to a T score of 40, a score that is 1 SD or 10 points from the mean, and translate to the 16th percentile (percentile not shown in the graph). However, when we calculate a percentile for the actual score distribution (i.e., the histogram), a score of 40 is actually below the 1st percentile with respect to the observed sample distribution (percentile = 0.8). Clearly, the difference in percentiles in this example is not trivial and has significant implications for score interpretation.
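The gap between a normal-curve percentile and a percentile read directly from the sample can be reproduced in miniature. The sketch below does not use the data behind Figure 1-4; the score list is made up to have a truncated lower tail and positive skew, and the two function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def empirical_percentile(scores, value):
    """Percentage of the observed sample scoring at or below value."""
    return 100.0 * sum(1 for s in scores if s <= value) / len(scores)


def normal_based_percentile(scores, value):
    """Percentile implied by a z score computed from the sample mean and SD."""
    z = (value - mean(scores)) / pstdev(scores)
    return 100.0 * NormalDist().cdf(z)


# Made-up sample of 100 scores: floor near 40, long right tail.
scores = [40] * 1 + [42] * 30 + [45] * 40 + [50] * 15 + [60] * 8 + [75] * 6

raw = 40
print("normal-based percentile:", round(normal_based_percentile(scores, raw), 1))
print("sample-based percentile:", round(empirical_percentile(scores, raw), 1))
```

With this made-up sample, the normal-curve conversion places a raw score of 40 far higher in rank than the sample itself does, mirroring the kind of divergence described for the hypothetical test.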
Normalizing Test Scores
When confronted with problematic score distributions, many test developers employ "normalizing" transformations in an attempt to correct departures from normality (examples of this can be found throughout this volume, in the Normative Data section for tests reviewed). Although helpful, these procedures are by no means a panacea, as they often introduce problems of their own with respect to interpretation. Additionally, many test manuals contain only a cursory discussion of normalization of test scores. Anastasi and Urbina (1997) state that scores should only be normalized if: (1) they come from a large and representative sample, or (2) any deviation from normality arises from defects in the test rather than characteristics of the sample. Furthermore, as we have noted above, it is preferable to adjust score distributions prior to normalization by modifying test content (e.g., by adding or modifying items) rather than statistically transforming non-normal scores into a normal distribution. Although a detailed discussion of normalization procedures is beyond the scope of this chapter (interested readers are referred to Anastasi & Urbina, 1997), ideally, test makers should describe in detail the nature of any significant sample non-normality and the procedures used to correct it for derivation of standardized scores. The reasons for correction should also be justified, and direct percentile conversions based on the uncorrected sample distribution should be provided as an option for users. Despite the limitations inherent in correcting for non-normality, Anastasi and Urbina (1997) note that most test developers will probably continue to do so because of the need to use test scores in statistical analyses that assume normality of distributions. From a practical point of view, test users should be aware of the mathematical computations and transformations involved in deriving scores for their instruments. When all other things are equal, test users should choose tests that provide information on score distributions and any procedures that were undertaken to correct non-normality, over those that provide partial or no information.
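The chapter does not specify a particular normalizing procedure. One common approach, shown here purely as an assumed example, is an area (rank-based) transformation: each raw score is converted to its percentile rank in the sample, and that rank is then mapped to the z score holding the same rank on a normal curve. The mid-rank convention for ties, the function name, and the sample data are all illustrative choices.

```python
from statistics import NormalDist


def normalized_t_scores(raw_scores):
    """Map each distinct raw score to a normalized T score (M = 50, SD = 10)."""
    n = len(raw_scores)
    standard_normal = NormalDist()
    t_scores = {}
    for value in sorted(set(raw_scores)):
        below = sum(1 for s in raw_scores if s < value)
        ties = sum(1 for s in raw_scores if s == value)
        # Mid-rank proportion: everyone below plus half of those tied.
        # It always lies strictly between 0 and 1, so inv_cdf is defined.
        proportion = (below + 0.5 * ties) / n
        t_scores[value] = 50 + 10 * standard_normal.inv_cdf(proportion)
    return t_scores


# Made-up, positively skewed raw scores.
sample = [3, 3, 3, 4, 4, 5, 5, 6, 8, 15]
for raw, t in normalized_t_scores(sample).items():
    print(raw, round(t, 1))
```

Note that the transformation preserves rank order but not the spacing between raw scores, which is one reason the text asks test makers to document any such correction and to offer direct percentile conversions as an alternative.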
Extrapolation/Interpolation
Despite all the best efforts, there are times when norms fall short in terms of range or cell size. This includes missing data in some cells, inconsistent age coverage, or inadequate demographic composition of some cells compared to the population. In these cases, data are often extrapolated or interpolated using the existing score distribution and techniques such as multiple regression. For example, Heaton and colleagues have published sets of norms that use multiple regression to correct for demographic characteristics and compensate for few subjects in some cells (Heaton et al., 2003). Although multiple regression is robust to slight violations of assumptions, estimation errors may occur when using normative data that violates the assumptions of homoscedasticity (uniform variance across the range of scores) and normal distribution of scores necessary for multiple regression (Fastenau & Adams, 1996; Heaton et al., 1996).
Age extrapolations beyond the bounds of the actual ages of the individuals in the samples are also sometimes seen in normative data sets, based on projected developmental curves. These norms should be used with caution due to the lack of actual data points in these age ranges. Extrapolation methods, such as those that employ regression techniques, depend on the shape of the distribution of scores. Including only a subset of the distribution of age scores in the regression (e.g., by omitting very young or very old individuals) may change the projected developmental slope of certain tests dramatically. Tests that appear to have linear relationships, when considered only in adulthood, may actually have highly positively skewed binomial functions when the entire age range is considered. One example is vocabulary, which tends to increase exponentially during the preschool years, shows a slower rate of progress during early adulthood, remains relatively stable with continued gradual increase, and then shows a minor decrease with advancing age. If only a subset of the age range (e.g., adults) is used to estimate performance at the tail ends of the distribution (e.g., preschoolers and elderly), the estimation will not fit the shape of the actual distribution.
Thus, normalization may introduce error when the relationship between a test and a demographic variable is non-linear. In this case, linear correction using multiple regression distorts the true relationship between variables (Fastenau, 1998).
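The risk of extrapolating a linear fit beyond the sampled age range can be illustrated with a small sketch. Everything in it is assumed for illustration: `vocabulary_curve` is a made-up growth function (rapid early gains, a plateau, a slight late decline), not normative data, and `fit_line` is an ordinary least-squares fit written out by hand.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def vocabulary_curve(age):
    """Hypothetical raw-score curve: fast early growth, plateau, mild decline."""
    return 60 * (1 - math.exp(-age / 8)) - 0.002 * max(age - 60, 0) ** 2

# Fit the line using adults only (ages 20 to 60), as if no other data existed.
adult_ages = list(range(20, 61, 5))
slope, intercept = fit_line(adult_ages, [vocabulary_curve(a) for a in adult_ages])

def extrapolation_error(age):
    """Linear prediction minus the (hypothetical) actual value at this age."""
    return (intercept + slope * age) - vocabulary_curve(age)

for age in (4, 40, 85):
    print(age, round(extrapolation_error(age), 1))
```

In this sketch the adult-only line stays close to the curve inside the fitted range but substantially overestimates the hypothetical preschool score, which is the kind of distortion the paragraph above describes.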
MEASUREMENT PRECISION: RELIABILITY
AND STANDARD ERROR
Like all forms of measurement, psychological tests are not perfectly precise; rather, test scores must be seen as estimates of abilities or functions, each associated with some degree of measurement error. Each test differs in the precision of the scores that it produces. Of critical importance is the fact that no test has only one specific level of precision. Rather, precision always varies to some degree, and potentially substantially, across different populations and test-use settings. Therefore, estimates of measurement error relevant to specific testing circumstances are a prerequisite for correct interpretation. For example, even the most precise test may produce highly imprecise results if administered in a nonstandard fashion, in a nonoptimal environment, or to an uncooperative examinee. Aside from these obvious caveats, a few basic
Table 1-2 Sources of Error Variance in Relation to Reliability Coefficients

Type of Reliability Coefficient    Error Variance
Split-half                         Content sampling
Kuder-Richardson                   Content sampling
Coefficient alpha                  Content sampling
Test-retest                        Time sampling
Alternate-form (immediate)         Content sampling
Alternate-form (delayed)           Content sampling and time sampling
Interrater                         Interscorer differences
of the correlation between test scores and true scores. This is why it is used for estimating true scores and associated standard errors (Nunnally & Bernstein, 1994). All things being equal, longer tests will generally yield higher reliability estimates (Sattler, 2001). Internal reliability is usually assessed with some measure of the average correlation among items within a test (Nunnally & Bernstein, 1994). These include the split-half or Spearman-Brown reliability coefficient (obtained by correlating two halves of items from the same test) and coefficient alpha, which provides a general estimate of reliability based on all the possible ways of splitting test items. Alpha is essentially based on the average intercorrelation between test items and any other set of items, and is used for tests with items that yield more than two response types (i.e., possible scores of 0, 1, or 2). For additional useful references concerning alpha, see Cronbach (2004) and Streiner (2003a, 2003b). The Kuder-Richardson reliability coefficient is used for items with yes/no answers or heterogeneous tests where split-half methods must be used (i.e., the mean of all the different split-half coefficients if the test were split into all possible ways). Generally, Kuder-Richardson coefficients will be lower than split-half coefficients when tests are heterogeneous in terms of content (Anastasi & Urbina, 1997).
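As a minimal sketch of the internal reliability indices described above, the following computes an odd-even split-half coefficient with the Spearman-Brown correction and coefficient alpha. The formulas are the standard textbook ones rather than anything quoted from this chapter, and the item scores (five examinees, four items scored 0 to 2) are invented for illustration.

```python
from statistics import pvariance

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(half_r):
    """Step a half-test correlation up to the reliability of the full test."""
    return 2 * half_r / (1 + half_r)

def split_half_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then apply Spearman-Brown."""
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(odd, even))

def coefficient_alpha(item_scores):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: rows are examinees, columns are items scored 0, 1, or 2.
scores = [[2, 2, 1, 2], [1, 1, 1, 0], [0, 1, 0, 0], [2, 1, 2, 2], [1, 0, 1, 1]]
print(split_half_reliability(scores), coefficient_alpha(scores))
```

The odd-even split is only one of many possible splits; alpha avoids that arbitrary choice, which is consistent with its description above as a general estimate across all possible splits.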
The Special Case of Speed Tests
Tests involving speed, where the score exclusively depends on the number of items completed within a time limit rather than the number correct, will cause spuriously high internal reliability estimates if internal reliability indices such as split-half reliability are used. For example, dividing the items into two halves to calculate a split-half reliability coefficient will yield two half-tests with 100% item completion rates, whether the individual obtained a score of 4 (i.e., yielding two half-tests totaling 2 points each, or perfect agreement) or 44 (i.e., yielding two half-tests both totaling 22 points, also yielding perfect agreement). The result in both cases is a split-half reliability of 1.00 (Anastasi & Urbina, 1997). Some alternatives are to use test-retest reliability or alternate form reliability, ideally with the alternate forms administered in immediate succession to avoid time sampling error. Reliabilities can also
principles help in determining whether a test generally provides precise measurements in most situations where it will be used. We begin with an overview of the related concepts of reliability, true scores, obtained scores, the various estimates of measurement error, and the notion of confidence intervals. These are reviewed below.
Definition of Reliability
Reliability refers to the consistency of measurement of a given test and can be defined in several ways, including consistency within itself (internal consistency reliability), consistency over time (test-retest reliability), consistency across alternate forms (alternate form reliability), and consistency across raters (interrater reliability). Indices of reliability indicate the degree to which a test is free from measurement error (or the proportion of variance in observed scores attributable to variance in true scores). The interpretation of such indices is often not so straightforward.
It is important to note that the term "error" in this context does not actually refer to "incorrect" or "wrong" information. Rather, "error" consists of the multiple sources of variability that affect test scores. What may be termed error variance in one application may be considered part of the true score in another, depending on the construct being measured (state or trait), the nature of the test employed, and whether it is deemed relevant or irrelevant to the purpose of the testing (Anastasi & Urbina, 1997). An example relevant to neuropsychology is that internal reliability coefficients tend to be smaller at either end of the age continuum. This finding has been attributed to both limitations of tests (e.g., measurement error) and increased intrinsic performance variability among very young and very old examinees.
Factors Affecting Reliability
Reliability coefficients are influenced by (a) test characteristics (e.g., length, item type, item homogeneity, and influence of guessing) and (b) sample characteristics (e.g., sample size, range, and variability). The extent of a test's "clarity" is intimately related to its reliability: reliable measures typically have (a) clearly written items, (b) easily understood test instructions, (c) standardized administration conditions, (d) explicit scoring rules that minimize subjectivity, and (e) a process for training raters to a performance criterion (Nunnally & Bernstein, 1994). For a list of commonly used reliability coefficients and their associated sources of error variance, see Table 1-2.
Internal Reliability
Internal reliability reflects the extent to which items within a test measure the same cognitive domain or construct. It is a core index in classical test theory. A measure of the intercorrelation of items, internal reliability is an estimate of the correlation between randomly parallel test forms, and by extension,
Table 1-3 Common Sources of Bias and Error in Test-Retest Situations
Source: From Lineweaver & Chelune, 2003, p. 308. Reprinted with permission from Elsevier.
may or may not be considered sources of measurement error. Apart from these variables, one must consider, and possibly parse out, effects of prior exposure, which are often conceptualized as involving implicit or explicit learning. Hence the term practice effects is often used. However, prior exposure to a test does not necessarily lead to increased performance at retest. Note also that the actual nature of the test may sometimes change with exposure. For instance, tests that rely on a "novelty effect" and/or require deduction of a strategy or problem solving (e.g., WCST, Tower of London) may not be conducted in the same way once the examinee has prior familiarity with the testing paradigm.
Like some measures of problem-solving abilities, measures of learning and memory are also highly susceptible to practice effects, though these are less likely to reflect a fundamental change in how examinees approach tasks. In either case, practice effects may lead to low test-retest correlations by effectively lowering the ceiling at retest, resulting in a restriction of range (i.e., many examinees obtain scores at or near the maximum possible at retest). Nevertheless, restriction of range should not be assumed when test-retest correlations are low until this has been verified by inspection of data.
The relationship between prior exposure and test stability coefficients is complex, and although test-retest coefficients may be affected by practice or prior exposure, the coefficient does not indicate the magnitude of such effects. That is, retest correlations will be very high when individual retest scores all change by a similar amount, whether the practice effect is nil or very large. When stability coefficients are low, then there may be (1) no systematic effects of prior exposure, (2) the relation
be calculated for any test that can be divided into specific time intervals; scores per interval can then be compared in a procedure akin to the split-half method, as long as items are of relatively equivalent difficulty (Anastasi & Urbina, 1997). For most of the speed tests reviewed in this volume, reliability is estimated by using the test-retest reliability coefficient, or else by a generalizability coefficient (see below).
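The spuriously perfect split-half coefficient for pure speed tests can be reproduced in a few lines. This is a sketch under simplifying assumptions: every item reached is answered correctly, items not reached score 0, and the halves are formed by an odd-even split. The function names and the list of completed-item counts are hypothetical.

```python
def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def speed_test_items(items_completed, n_items=50):
    """Item-level scores for a pure speed test: 1 if reached, 0 if not."""
    return [1 if i < items_completed else 0 for i in range(n_items)]

def odd_even_half_scores(items):
    """Total score on odd-numbered and even-numbered items."""
    return sum(items[0::2]), sum(items[1::2])

# Hypothetical examinees who completed 4, 10, 22, 30, and 44 items.
completed = [4, 10, 22, 30, 44]
halves = [odd_even_half_scores(speed_test_items(c)) for c in completed]
half_a = [h[0] for h in halves]
half_b = [h[1] for h in halves]
split_half_r = pearson_r(half_a, half_b)
print(halves, split_half_r)
```

Because the two halves are identical for every examinee, the correlation says nothing about measurement precision, which is why test-retest or alternate-form estimates are suggested for such tests.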
Test-Retest Reliability
Test-retest reliability, also known as temporal stability, provides an estimate of the correlation between two test scores from the same test administered at two different points in time. A test with good temporal stability should show little change over time, providing that the trait being measured is stable and there are no differential effects of prior exposure. It is important to note that tests measuring dynamic (i.e., changeable) abilities will by definition produce lower test-retest reliabilities than tests measuring domains that are more trait-like and stable (Nunnally & Bernstein, 1994). See Table 1-3 for common sources of bias and error in test-retest situations.
A test has an infinite number of possible test-retest reliabilities, depending on the length of the time interval between testing. In some cases, reliability estimates are inversely related to the time interval between baseline and retest (Anastasi & Urbina, 1997). In other words, the shorter the time interval between test and retest, the higher the reliability coefficient will be. However, the extent to which the time interval affects the test-retest coefficient will depend on the type of ability evaluated (i.e., stable versus more variable). Reliability may also depend on the type of individual being assessed, as some groups are intrinsically more variable over time than others. For example, the extent to which scores fluctuate over time may depend on subject characteristics, including age (e.g., normal preschoolers will show more variability than adults) and neurological status (e.g., TBI examinees' scores may vary more in the acute state than in the post-acute state). Ideally, reliability estimates should be provided for both normal individuals and the clinical populations in which the test is intended to be used, and the specific demographic characteristics of the samples should be fully specified. Test stability coefficients presented in published test manuals are usually derived from relatively small normal samples tested over much shorter intervals than are typical for retesting in clinical practice and should therefore be considered with due caution when drawing inferences regarding clinical cases. However, there is some evidence that duration of interval has less of an impact on test-retest scores than subject characteristics (Dikmen et al., 1999).
Prior Exposure and Practice Effects
Variability in scores on the same test over time may be related to situational variables such as examinee state, examiner state, examiner identity (same versus different examiner at retest), or environmental conditions that are often unsystematic and
Bias
  Intervening variables
    Events of interest (e.g., surgery, medical intervention, rehabilitation)
    Extraneous events
  Practice effects
    Memory for content
    Procedural learning
    Other factors
      (a) Familiarity with testing context and examiner
      (b) Performance anxiety
  Demographic considerations
    Age (maturational effects and aging)
    Education
    Gender
    Ethnicity
    Baseline ability
Error
  Statistical errors
    Measurement error (SEM)
    Regression to the mean (SEe)
  Random or uncontrolled events
of prior exposure may be nonlinear, or (3) ceiling effects/restriction of range related to prior exposure may be attenuating the coefficient. For example, certain subgroups may benefit more from prior exposure to test material than others (e.g., high-IQ individuals; Rapport et al., 1998), or some subgroups may demonstrate more stable scores or consistent practice effects than do others. This causes the score distribution to change at retest (effectively "shuffling" the individuals' rankings in the distribution), which will attenuate the correlation. In these cases, the test-retest correlation may vary significantly across subgroups and the correlation for the entire sample will not be the best estimate of reliability for any of the subgroups, overestimating reliability for some and underestimating reliability for others. In some cases, practice effects, as long as they are relatively systematic and accurately assessed, will not render a test unusable from a reliability perspective, though they should always be taken into account when retest scores are interpreted. In addition, individual factors must always be considered. For example, while improved performance may usually be expected with a particular measure, an individual examinee may approach tests that he or she had difficulty with previously with heightened anxiety that leads to decreased performance. Lastly, it must be kept in mind that factors other than prior exposure (e.g., changes in environment or examinee state) may affect test-retest reliability.
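The point that a stability coefficient is blind to uniform practice effects, but attenuated by uneven ones, can be checked numerically. The scores below are invented for illustration; `uniform_retest` adds the same 10-point gain to everyone, while `uneven_retest` adds hypothetical gains that differ by examinee and so reorder the rankings.

```python
def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

baseline = [85, 92, 100, 104, 110, 118]

# Every examinee gains exactly 10 points: a large but uniform practice effect.
uniform_retest = [score + 10 for score in baseline]

# Gains differ by examinee, so rank order changes at retest.
uneven_gains = [18, 0, 12, 1, 15, 2]
uneven_retest = [score + gain for score, gain in zip(baseline, uneven_gains)]

r_uniform = pearson_r(baseline, uniform_retest)
r_uneven = pearson_r(baseline, uneven_retest)
mean_uniform_gain = sum(uniform_retest) / len(baseline) - sum(baseline) / len(baseline)
print(r_uniform, r_uneven, mean_uniform_gain)
```

The uniform case yields a perfect correlation despite a 10-point mean shift, so mean change has to be examined separately from the coefficient.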
Alternate Forms Reliability
Some investigators advocate the use of alternate forms to eliminate the confounding effects of practice when a test must be administered more than once (e.g., Anastasi & Urbina, 1997). However, this practice introduces a second form of error variance into the mix (i.e., content sampling error), in addition to the time sampling error inherent in test-retest paradigms (see Table 1-3; see also Lineweaver & Chelune, 2003). Thus, tests with alternate forms must have extremely high correlations between forms in addition to high test-retest reliability to confer any advantage over using the same form administered twice. Moreover, they must demonstrate equivalence in terms of mean scores from test to retest, as well as consistency in score classification within individuals from test to retest. Furthermore, alternate forms do not necessarily eliminate effects of prior exposure, as exposure to stimuli and procedures can confer some positive carry-over effect (e.g., procedural learning) despite the use of a different set of items. These effects may be minimal across some types of well-constructed parallel forms, such as those assessing acquired knowledge. For measures such as the WCST, where specific learning and problem solving are involved, it may be difficult or impossible to produce an equivalent alternate form that will be free of effects of prior exposure to the original form. While it is possible to attain this degree of psychometric sophistication through careful item analysis, reliability studies, and administration to a representative normative group, it is rare for alternate forms to be constructed with the same psychometric rigor as were the original forms from which they were derived. Even well-constructed alternate forms often lack crucial validation evidence such as similar correlations to criterion measures as the original test form. This is especially true for older neuropsychological tests, particularly those with original forms that were never subjected to any item analysis or reliability studies whatsoever (e.g., BVRT). Inadequate test construction and psychometric properties are also found for alternate forms in more general published tests in common usage (e.g., WRAT-3). Because so few alternate forms are available and few of those that are meet these psychometric standards, our tendency is to use reliable change indices or standardized regression-based scores for estimating change from test to retest.
Interrater Reliability
Most test manuals provide specific and detailed instructions on how to administer and score tests according to standard procedures to minimize error variance due to different examiners and scorers. However, some degree of examiner variance remains in individually administered tests, particularly when scores involve a degree of judgment (e.g., multiple-response verbal tests such as the Wechsler Vocabulary Scales, which require the rater to assign a score from 0 to 2). In this case, an estimate of the reliability of administration and scoring across examiners is needed.
Interrater reliability can be evaluated using percentage agreement, kappa, product-moment correlation, and intraclass correlation coefficient (Sattler, 2001). For any given test, Pearson correlations will provide an upper limit for the intraclass correlations, but intraclass correlations are preferred because, unlike the Pearson's r, they distinguish paired assessments made by the same set of examiners from those made by different examiners. Thus, the intraclass correlation distinguishes those sets of scores ranked in the same order from those that are ranked in the same order but have low, moderate, or complete agreement with each other, and corrects for interexaminer or test-retest agreement expected by chance alone (Cicchetti & Sparrow, 1981). However, advantages of the Pearson correlation are that it is familiar, is readily interpretable, and can be easily compared using standard statistical techniques; it is best for evaluating consistency in ranking rather than agreement per se (Fastenau et al., 1996).
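The difference between consistency in ranking and agreement per se can be seen when one rater scores systematically higher than another. The sketch below compares Pearson's r with one common form of the intraclass correlation, the two-way random-effects, absolute-agreement, single-rater coefficient often labeled ICC(2,1). The choice of that particular form is an assumption, since the text does not specify one, and the ratings are invented.

```python
def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def icc_absolute_agreement(ratings):
    """ICC(2,1) from two-way ANOVA mean squares.

    `ratings` holds one row per examinee and one column per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical ratings: rater B is always 4 points higher than rater A.
rater_a = [10, 12, 14, 16, 18]
rater_b = [a + 4 for a in rater_a]
paired = [[a, b] for a, b in zip(rater_a, rater_b)]

r = pearson_r(rater_a, rater_b)
icc = icc_absolute_agreement(paired)
print(r, icc)
```

Here the rank order is identical, so Pearson's r is perfect, while the intraclass coefficient is penalized for the constant disagreement between raters.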
Generalizability Coefficients
One reliability coefficient type not covered in this list is the generalizability coefficient, which is starting to appear more frequently in test manuals, particularly in the larger test batteries (e.g., Wechsler scales and NEPSY). In generalizability theory, or G theory, reliability is evaluated by decomposing test score variance using the general linear model (e.g., variance components analysis). This is a variant of the mathematical methods used to apportion variance in general linear model analyses such as ANOVA. In the case of G theory, the between-groups variance is considered an estimate of true score variance and within-groups variance is considered an estimate of error variance. The generalizability coefficient is the ratio of estimated true variance to the sum of the estimated true variance and estimated error variance. A discussion of this flexible and powerful model is beyond the scope of this chapter, but detailed discussions can be found in Nunnally and Bernstein (1994) and Shavelson et al. (1989). Nunnally and Bernstein (1994) also discuss related issues pertinent to estimating reliabilities of variables reflecting sums such as composite scores, and the fact that reliabilities of difference scores based on correlated measures can be very low.
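Two of the quantities mentioned above reduce to short formulas. The first function is the variance ratio given in the text for the generalizability coefficient. The second is the classical formula for the reliability of a difference score, which is not stated in this chapter and assumes the two measures have equal variances; the numeric inputs are hypothetical.

```python
def generalizability_coefficient(true_variance, error_variance):
    """Ratio of estimated true variance to true plus error variance."""
    return true_variance / (true_variance + error_variance)

def difference_score_reliability(r_xx, r_yy, r_xy):
    """Classical reliability of X - Y, assuming equal variances for X and Y.

    r_xx and r_yy are the reliabilities of the two measures and r_xy is
    the correlation between them.
    """
    return (0.5 * (r_xx + r_yy) - r_xy) / (1 - r_xy)

# Two measures that are each highly reliable (.90) ...
uncorrelated = difference_score_reliability(0.90, 0.90, 0.0)
# ... lose half their difference-score reliability when they correlate .80.
correlated = difference_score_reliability(0.90, 0.90, 0.80)
print(uncorrelated, correlated)
```

Under these assumptions, two tests with reliabilities of .90 that correlate .80 with each other produce a difference score with a reliability of only about .50.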
Evaluating a Test's Reliability
A test cannot be said to have a single or overall level of reliability. Rather, tests can be said to exhibit different kinds of reliability, the relative importance of which will vary depending on how the test is to be used. Moreover, each kind of reliability may vary across different populations. For instance, a test may be highly reliable in normally functioning adults, but be highly unreliable in young children or in individuals with neurological illness. It is important to note that while high reliability is a prerequisite for high validity, the latter does not follow automatically from the former. For example, height can be measured with great reliability, but it is not a valid index of intelligence. It is usually preferable to choose a test of slightly lesser reliability if it can be demonstrated that the test is associated with a meaningfully higher level of validity (Nunnally & Bernstein, 1994).
Some have argued that internal reliability is more important than other forms of reliability; thus, if alpha is low but test-retest reliability is high, a test should not be considered reliable (Nunnally, 1978, as cited by Cicchetti, 1989). Note that it is possible to have low alpha values and high test-retest reliability (if a measure is made up of heterogeneous items but yields the same responses at retest), or low alpha values but high interrater reliability (if the test is heterogeneous in item content but yields highly consistent scores across trained experts; an example would be a mental status examination). Internal consistency is therefore not necessarily the primary index of reliability, but should be evaluated within the broader context of test-retest and interrater reliability (Cicchetti, 1989).
Some argue that test-retest reliability is not as important as other forms of reliability if the test will only be used once and is not likely to be administered again in future. However, depending on the nature of the test and retest sampling procedures (as discussed previously), stability coefficients may provide valuable insight into the replicability of test results, particularly as these coefficients are a gauge of "real-world" reliability rather than accuracy of measurement of true scores or hypothetical reliability across infinite randomly parallel forms (as is internal reliability). In addition, as was stated previously, clinical decision making will almost always be based on the obtained score. Therefore, it is critically important to know the degree to which scores are replicable at retesting, whether or not the test may be used again in future.
It is our belief that test users should take an informed and pragmatic, rather than dogmatic, approach to evaluating reliability of tests used to inform diagnosis or other clinical decisions. If a test has been designed to measure a single, one-dimensional construct, then high internal consistency reliability should be considered an essential property. High test-retest reliability should also be considered an essential property unless the test is designed to measure state variables that are expected to fluctuate, or if systematic factors such as practice effects attenuate stability coefficients.
What Is an Adequate Reliability Coefficient?
The reliability coefficient can be interpreted directly in terms of the percentage of score variance attributed to different sources (i.e., unlike the correlation coefficient, which must be squared). Thus, with a reliability of .85, 85% of the variance can be attributed to the trait being measured, and 15% can be attributed to error variance (Anastasi & Urbina, 1997). When all sources of variance are known for the same group (i.e., when one knows the reliability coefficients for internal, test-retest, alternate form, and interrater reliability on the same sample), it is possible to calculate the true score variance (for an example, see Anastasi & Urbina, 1997, pp. 101-102). As noted above, although a detailed discussion of this topic is beyond the scope of this volume, the portioning of total score variance into components is the crux of generalizability theory of reliability, which forms the basis for reliability estimates for many well-known speed tests (e.g., Wechsler scale subtests such as Digit Symbol).
Sattler (2001) notes that reliabilities of .80 or higher are needed for tests used in individual assessment. Tests used for decision making should have reliabilities of .90 or above. Nunnally and Bernstein (1994) note that a reliability of .90 is a "bare minimum" for tests used to make important decisions about individuals (e.g., IQ tests), and .95 should be the optimal standard. When important decisions will be based on test scores (e.g., placement into special education), small score differences can make a great difference to outcome, and precision is paramount. They note that even with a reliability of .90, the SEM is almost one-third as large as the overall SD of test scores.
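The "almost one-third" figure follows from the standard classical formula for the standard error of measurement, SEM = SD x sqrt(1 - reliability). That formula is assumed here rather than quoted from this passage, and the SD of 15 is simply the conventional standard-score metric used for illustration.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: score SD scaled by the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# With a reliability of .90 the SEM is about 0.316 SD, just under one-third.
ratio_at_90 = standard_error_of_measurement(1.0, 0.90)

# On a standard-score metric (SD = 15) that is roughly 4.7 points.
sem_points_at_90 = standard_error_of_measurement(15, 0.90)
sem_points_at_95 = standard_error_of_measurement(15, 0.95)
print(ratio_at_90, sem_points_at_90, sem_points_at_95)
```

Raising reliability from .90 to .95 shrinks the SEM by roughly 30% in this sketch, which gives some sense of why the higher value is described as the optimal standard.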
Given these issues, what is a clinically acceptable level of reliability? According to Sattler (2001), tests with reliabilities below .60 are unreliable; those above .60 are marginally reliable, and those above .70 are relatively reliable. Of note, tests with reliabilities of .70 may be sufficient in the early stages of validation research to determine whether the test correlates with other validation evidence; if so, additional effort can be expended to increase reliabilities to more acceptable levels (e.g., .80) by reducing measurement error (Nunnally & Bernstein, 1994). In outcome studies using psychological tests, internal consistencies of .80 to .90 and test-retest reliabilities of .70 are considered a minimum acceptable standard (Andrews et al., 1994; Burlingame et al., 1995).
Table 1-4 Magnitude of Reliability Coefficients

Magnitude of Coefficient
Very high (.90+)
High (.80-.89)
Adequate (.70-.79)
Marginal (.60-.69)
Low (≤.59)
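For readers who want to apply the table's labels programmatically, a small lookup is sketched below. The function name is hypothetical, and the handling of boundaries is an assumption: any coefficient below .60 is treated as "Low", and values are compared without rounding.

```python
def reliability_magnitude(coefficient):
    """Map a reliability coefficient to the descriptive labels in Table 1-4."""
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("reliability coefficients are expected to lie in [0, 1]")
    if coefficient >= 0.90:
        return "Very high"
    if coefficient >= 0.80:
        return "High"
    if coefficient >= 0.70:
        return "Adequate"
    if coefficient >= 0.60:
        return "Marginal"
    return "Low"

for value in (0.93, 0.85, 0.72, 0.61, 0.45):
    print(value, reliability_magnitude(value))
```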
In terms of internal reliability of neuropsychological tests, Cicchetti et al. (1990) have proposed that internal consistency estimates of less than .70 are unacceptable, reliabilities between .70 and .79 are fair, reliabilities between .80 and .89 are good, and reliabilities above .90 are excellent.
For interrater reliabilities, Cicchetti and Sparrow (1981) report that clinical significance is poor for reliability coefficients below .40, fair between .40 and .59, good between .60 and .74, and excellent between .75 and 1.00. Fastenau et al. (1996), in summarizing guidelines on the interpretation of intraclass correlations and kappa coefficients for interrater reliability, consider coefficients larger than .60 as substantial and of .75 or .80 as almost perfect.
These are the general guidelines that we have used throughout the text to evaluate the reliability of neuropsychological tests (see Table 1-4) so that the text can be used as a reference when selecting tests with the highest reliability. Users should note that there is a great deal of variability with regard to the acceptability of reliability coefficients for neuropsychological tests, as perusal of this volume will indicate. In general, for tests involving multiple subtests and multiple scores (e.g., Wechsler scales, NEPSY, D-KEFS), including those derived from qualitative observations of performance (e.g., error analyses), the farther away a score gets from the composite score itself and the more difficult the score is to quantify, the lower the reliability. A quick review of the reliability data presented in this volume also indicates that verbal tests, with few exceptions, tend to have consistently higher reliability than tests measuring other cognitive domains.
Lastly, as previously discussed, reliability coefficients do not provide complete information on the reproducibility of individual test scores. Thus, with regard to test-retest reliability, it is possible for a test to have high reliability (r = .80) but have retest means that are 10 points higher than baseline scores. Reliability coefficients do not provide information on whether individuals retain their relative place in the distribution from baseline to retest. Procedures such as the Bland-Altman method (Altman & Bland, 1983; Bland & Altman, 1986) are one way to determine the limits of agreement between two assessments for individuals in a group.
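A minimal sketch of the limits-of-agreement calculation is given below. It assumes the usual form of the method (mean difference plus or minus 1.96 standard deviations of the differences) and uses invented baseline and retest scores, so it illustrates the arithmetic only and is not an analysis from this volume.

```python
from statistics import mean, stdev

def limits_of_agreement(first, second, z=1.96):
    """Return (bias, lower limit, upper limit) for paired scores.

    Bias is the mean of the paired differences (second minus first); the
    limits are bias +/- z times the sample SD of those differences.
    """
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = z * stdev(diffs)
    return bias, bias - spread, bias + spread

# Hypothetical baseline and retest standard scores for five examinees.
baseline = [90, 95, 100, 105, 110]
retest = [102, 103, 111, 113, 121]

bias, lower, upper = limits_of_agreement(baseline, retest)
print(bias, lower, upper)
```

In this example the bias of 10 points captures the mean practice effect, and the limits indicate how far an individual's retest gain would typically fall from that average, information that the correlation alone does not convey.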
MEASUREMENT ERROR 
A good working understanding of conceptual issues and methods of quantifying measurement error is essential for competent clinical practice. We start our discussion of this topic with concepts arising from classical test theory.
True Scores 
A central element of classical test theory is the concept of a true score, or the score an examinee would obtain on a measure in the absence of any measurement error (Lord & Novick, 1968). True scores can never be known. Instead, they are estimated, and are conceptually defined as the mean score an examinee would obtain across an infinite number of randomly parallel forms of a test, assuming that the examinee's scores were not systematically affected by test exposure/practice or other time-related factors such as maturation (Lord & Novick, 1968). In contrast to true scores, obtained scores are the actual scores yielded by tests. Obtained scores include any measurement error associated with a given test. That is, they are the sum of true scores and error.
In the classical model, the relation between obtained and true scores is expressed in the following formula, where error (e) is random and all variables are assumed to be normal in distribution:
When test reliability is less than perfect, as is always the case, the net effect of measurement error across examinees is to bias obtained scores outward from the population mean. That is, scores above the mean are most likely to be higher than true scores, while those below the mean are most likely to be lower than true scores (Lord & Novick, 1968). Estimated true scores correct this bias by regressing obtained scores toward the normative mean, with the amount of regression depending on test reliability and deviation of the obtained score from the mean. The formula for estimated true scores (t') is:
Limits of Reliability
Although it is possible to have a reliable test that is not valid for some purposes, the converse is not the case (see later). Further, it is also conceivable that there are some neuropsychological domains that simply cannot be measured reliably. Thus, even though there is the assumption that questionable reliability is always a function of the test, reliability may depend on the nature of the psychological process measured or on the nature of the population evaluated. For example, many of the executive functioning tests reviewed in this volume have relatively modest reliabilities, suggesting that this ability is difficult to assess reliably. Additionally, tests used in populations with high response variability, such as preschoolers, elderly individuals, or individuals with brain disorders, may invariably yield low reliability coefficients despite the best efforts of test developers.
Vhere: 
X= oblained ;;core 
t = lrue score 
e=error 
X=f+e {3]
PsychoJnetrics in Neuropsychnlogiol issessment 15 
11ere: 
x = mean test seore 
rxx = tesl reliabilit y (internai consisleney rc1iability in 
dassieallesl theory) 
x= oill<lineJ seorc 
If working with z seores, lhe formula is ~implcr: 
lhe U~eof lrue Score~ in Clinicai Pradice 
ancy betweell true and obtaineJ scores. ror a highly rdiable 
mcasure such as Tesl 1 (r= .95), true score regressioll is mini-mal, 
even when an oblained scorc lies a considerablc distance 
from the sample mean; in lhis cxamplc, a SliUHl<fdscore of 
130, or two Sl.>s abovc the 1l1e,1ll,is associated with an esti-mated 
lrue score of 129. In contrast, lur a lesl with low rc!ia-bililY 
such as Tesl 3 (r=.65), true score regression is quite 
subslant ia!. For this test, an obtailled score of 130 is associated 
wilh ,In estimaled true score oC 120; in this case, fully one-third 
of lhe observed deviatioll is "losl" lo regression when the 
est imaled Irue scnre is calculated. 
Such infornl<llion Illay have importam implicatiorls wilh 
respect to inlerprelation of lest resu!ts. For example, as shown 
in .1~lblc1-5, as a result of differences in rdiability, obtained 
scores of 120 Oll Tes! 1 and 130 on Tesl J are associated with 
Cssclllial1yequivalcnl estimated true scores (i.e., 119 and 120, 
respeelivel}'). If only obtained scores are considercd, one 
might inlerprcl scores from Test I anJ Test 3 as signiticantly 
differcnt, even though these "difierences" actually disappear 
when measurell1ent precision is laken inlo Jccounl. lt should 
also be noled thal such differenees ma}' nOIhe limiled lo com-parisons 
of scores across differenl tesls within lhe sarne indi-viduai, 
but may also apply lo cOlllparisons belween scores 
from the same test across differenl individuaIs whcn lhe indi-viduais 
come from differenl groups anJ lhe tcsl in question 
has variable reliabililY acmss Ihose groups. 
Regression to the rnean may also m;lnifest as prunounced 
asymmetry of confldellee interv<lls celltered on Irue scores, 
relalive to oblained scores, as discus~ed in more detail later, 
Although calculalion of (rue scores is encouraged as a means 
of g<luginglhe limitations of reli<lbilily,il is important lo WIl-sidu 
Ihat an)' signiticant difference belween characteristics of 
an examincc and lhe samplc from which a lllean samplc score 
and rdiabililY estimate Vere derived may invalidatc the pru-cess. 
For example, in some cases il makes litlk sense lo esti-mate 
true scores for severdy brain-inillrcd individuais on 
lesls of cognition using leSI p,lfameters from healthy norma-tive 
samples, as mean scores wilhin the brilin-injured popul<l-tion 
are likely lo be suhslilntiall}' different Ccom Ihosc seen in 
hea1thy normative samples; reliabililies may Jiffer subsliln-ti< 
ll1yas well. Illsteild, olle mal' be justilied in deriving esli-maled 
lrue scores lIsing data frorn a cornparable clinicai sarnple 
if Ihis is avaiablc. Overall, these issues underline lhe complex-ities 
inherent in comparing scores from different tests in dif-ferenl 
populalions. 
[41 
[51 
formula 4 shows lhal ;m cxamin('(~'s estimated true score is 
the sum nf Ihc 111C,1sIc1ore of the group to which he or she bc-longs 
(i.c., lhe normative samp1e) and lhe devialion of his or 
her obtaineJ score from the normalive mean weighted br lesl 
rcliabililY (as derived from lhe same normativc sample). Fur- 
Iher, as tesl reliabililY appro<lehes unil}' (i.e., r= LO), esti-mated 
lrue scores approaeh oblained seures (i.e., there is little 
measurement error, so eSlim,led lrue scorc~ and oblainnl 
scores are nearly equiv<llcnt), Conversely, as test reliabililY ap-pro< 
lehes zero (i.e., whcn a tcst is eXlremely unreliablc and 
sllbjeCllo excessive lllea~urement error), e~limated lrue scores 
approach lhe mcan test score. Thar is, whell ti lest is hígh/y re!i-uh/ 
r, grratrr weight is givell to obtailler1 scores tlUlIl to the nor-miltive 
meml score, but whell 11 Int is very IIllre!illble, grelHo-weiglrt 
ís givell to the norma tive metlll score tllllll W obtallJed 
scorcs. l'ractically speaking, eSlimaled Irue scores will <llways 
be closer to lhe mean than nblJÍned scores are (cxccpt, of 
course, where the nblained score is ;lllhe mean). 
Although lhe Irue score modcl is abstract, it has practical ulil-ily 
and important implications for tcsl scorc interpretation. 
For example, whal may not be immeJiatd}' obvious from for-mulas 
4 and 5 is readil}' apparent in Table 1-5: estimat(~d true 
scores Iranslale tesl rdi,lbilil}' (or lack thereof) into the same 
metric as aclUal test scores. 
As can be seen in T;lble 1-5, the degree of regression to the 
rnean of true scores is inversd}' reLlled to test reliability and 
direclly rdated to degree of dcvialion from the reference 
mean. This rneans th<ltthe more rdiablc a test is, the doser are 
obtained scores 10lrue scores and that lhe further away lheob-tained 
scorc is frum the samplc mean, the grealer lhe discrep-loble 
1-5 Estimalt'tlTruc S(()rcVahwsfor Tnrce ObscrvcdS(()rcs 
011 Thrce Leveisof Reliahility lhe Stondord Error of Moo~urement 
Observetl Sçores 
(.'.1= IOO,5D", 15) 
Reiiability 110 120 DO 
.Iest I .95 IlO li' 12.'1 
Test2 .80 108 116 121 
Te'H3 .65 107 113 120 
Examiners may wish to quantify the margin of error associated
with using obtained scores as estimates of true scores.
When the sample SD and the reliability of obtained scores are
known, an estimate of the SD of obtained scores about true
scores may be calculated. This value is known as the standard
error of measurement, or SEM (Lord & Novick, 1968). More
simply, the SEM provides an estimate of the amount of error
in a person's observed score. It is a function of the reliability
of the test, and of the variability of scores within the sample.
The SEM is inversely related to the reliability of the test. Thus,
the greater the reliability of the test is, the smaller the SEM is,
and the more confidence the examiner can have in the precision
of the score.

The SEM is defined by the following formula:

$SEM = SD\sqrt{1 - r_{xx}}$   [6]

Where:
SD = the standard deviation of the test, as derived from an
appropriate normative sample
$r_{xx}$ = the reliability coefficient of the test (usually internal
reliability)
Confidence Intervals

While the SEM can be considered on its own as an index of
test precision, it is not necessarily intuitively interpretable,
and there is often a tendency to focus excessively on test scores
as point estimates at the expense of consideration of associated
estimation error ranges. Such a tendency to disregard
imprecision is particularly inappropriate when interpreting
scores from tests of lower reliability. Clinically, it may therefore
be very important to report, in a concrete and easily understandable
manner, the degree of precision associated with
specific test scores. One method of doing this is to use confidence
intervals.

The SEM is used to form a confidence interval (or range
of scores), around estimated true scores, within which obtained
scores are most likely to fall. The distribution of obtained scores
about the true score (the error distribution) is assumed to be
normal, with a mean of zero and an SD equal to the SEM;
therefore, the bounds of confidence intervals can be set to include
any desired range of probabilities by multiplying by the
appropriate z value. Thus, if an individual were to take a large
number of randomly parallel versions of a test, the resulting
obtained scores would fall within an interval of ±1 SEM of the
estimated true score 68% of the time, and within ±1.96 SEM
95% of the time (see Table 1-1).

Obviously, confidence intervals for unreliable tests (i.e.,
with a large SEM) will be larger than those for highly reliable
tests. For example, we may again use data from Table 1-5. For
a highly reliable test such as Test 1, a 95% confidence interval
for an obtained score of 110 ranges from 103 to 116. In contrast,
the confidence interval for Test 3, a less reliable test, is
larger, ranging from 89 to 124.
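
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the values M = 100 and SD = 15 are taken from Table 1-5. The sketch applies formula 4 (estimated true score) and formula 6 (SEM), then centers the interval on the estimated true score rather than on the obtained score.

```python
import math


def estimated_true_score(obtained, mean, reliability):
    """Formula 4: regress the obtained score toward the normative mean."""
    return mean + reliability * (obtained - mean)


def sem(sd, reliability):
    """Formula 6: standard error of measurement."""
    return sd * math.sqrt(1.0 - reliability)


def sem_interval(obtained, mean, sd, reliability, z=1.96):
    """Interval for obtained scores, centered on the estimated true score."""
    center = estimated_true_score(obtained, mean, reliability)
    half_width = z * sem(sd, reliability)
    return center - half_width, center + half_width


# Test 1 and Test 3 from Table 1-5, obtained score of 110 (M = 100, SD = 15).
for name, r in [("Test 1", 0.95), ("Test 3", 0.65)]:
    low, high = sem_interval(110, mean=100, sd=15, reliability=r)
    print(f"{name}: {low:.1f} to {high:.1f}")
```

Rounded to whole scores, the two intervals agree with the 103 to 116 and 89 to 124 ranges quoted above.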
It is important to bear in mind that confidence intervals
for obtained scores that are based on the SEM are centered on
estimated true scores. Such confidence intervals will be symmetric
around obtained scores only when obtained scores are
at the test mean or when reliability is perfect. Confidence intervals
will be asymmetric about obtained scores to the same
degree that true scores diverge from obtained scores. Therefore,
when a test is highly reliable, the degree of asymmetry
will often be trivial, particularly for obtained scores within
one SD of the mean. For tests of lesser reliability, the asymmetry
may be marked. For example, in Table 1-5, consider the
obtained score of 130 on Test 2. The estimated true score in
this case is 124 (see equations 4 and 5). Using equation 6 and
a z-multiplier of 1.96, we find that a 95% confidence interval
for the obtained scores spans ±13 points, or from 111 to 137.
This confidence interval is substantially asymmetric about the
obtained score.

It is also important to note that SEM-based confidence intervals
should not be used for estimating the likelihood of obtaining
a given score at retesting with the same measure, as
effects of prior exposure are not accounted for. In addition,
Nunnally and Bernstein (1994) point out that use of SEM-based
confidence intervals assumes that error distributions
are normally distributed and homoscedastic (i.e., equal in
spread) across the range of scores obtainable for a given test.
However, this assumption may often be violated. A number of
alternate error models do not require these assumptions and
may thus be more appropriate in some circumstances (see
Nunnally and Bernstein, 1994, for a detailed discussion).

Lastly, as with the derivation of estimated true scores, when
an examinee is known to belong to a group that markedly differs
from the normative sample, it may not be appropriate to
derive SEMs and associated confidence intervals using normative
sample parameters (i.e., SD and $r_{xx}$), as these would
likely differ significantly from parameters derived from an applicable
clinical sample.
The Standard Error of Estimation

In addition to estimating confidence intervals for obtained
scores, one may also be interested in estimating confidence intervals
for estimated true scores (i.e., the likely range of true
scores about the estimated true score). For this purpose, one
may construct confidence intervals using the standard error of
estimation (SEE; Lord & Novick, 1968). The formula for this is:

$SE_E = SD\sqrt{r_{xx}(1 - r_{xx})}$   [7]

Where:
SD = the standard deviation of the variable being
estimated
$r_{xx}$ = the test reliability coefficient

The SEE, like the SEM, is an indication of test precision. As
with the SEM, confidence intervals are formed around estimated
true scores by multiplying the SEE by a desired z value.
That is, one would expect that over a large number of randomly
parallel versions of a test, an individual's true score would fall
within an interval of ±1 SEE of the estimated true score 68%
of the time, and fall within ±1.96 SEE 95% of the time. As with
confidence intervals based on the SEM, those based on the
SEE will usually not be symmetric around obtained scores. All
of the other caveats detailed previously regarding SEM-based
confidence intervals also apply.

The choice of constructing confidence intervals based on
the SEM versus the SEE will depend on whether one is more
interested in true scores or obtained scores. That is, while the
SEM is a gauge of test accuracy in that it is used to determine
the expected range of obtained scores about true scores over
parallel assessments (the range of error in measurement of the
true score), the SEE is a gauge of estimation accuracy in that it
is used to determine the likely range within which true scores
fall (the range of error of estimation of the true score). Regardless,
both SEM-based and SEE-based confidence intervals
are symmetric with respect to estimated true scores rather
than the obtained scores, and the boundaries of both will be
similar for any given level of confidence interval when a test is
highly reliable.
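
To make the contrast between the two intervals concrete, the following Python sketch computes both for the obtained score of 130 on Test 2 from Table 1-5 (r = .80, M = 100, SD = 15). The helper names are hypothetical and the sketch assumes the standard forms of formulas 4, 6, and 7; it is meant as an illustration rather than a prescribed procedure.

```python
import math

MEAN, SD = 100.0, 15.0  # score metric used in Table 1-5


def interval(center, standard_error, z=1.96):
    """Symmetric interval of +/- z standard errors around a center."""
    half_width = z * standard_error
    return center - half_width, center + half_width


def compare_intervals(obtained, reliability, mean=MEAN, sd=SD):
    """Return the estimated true score with SEM-based and SEE-based intervals."""
    true_est = mean + reliability * (obtained - mean)        # formula 4
    sem = sd * math.sqrt(1.0 - reliability)                  # formula 6
    see = sd * math.sqrt(reliability * (1.0 - reliability))  # formula 7
    return {
        "estimated_true": true_est,
        "sem_ci": interval(true_est, sem),
        "see_ci": interval(true_est, see),
    }


result = compare_intervals(obtained=130, reliability=0.80)
print(result)
```

With these inputs both intervals are centered on the estimated true score of 124. The SEM-based interval runs from about 111 to 137 and the SEE-based interval from about 112 to 136, so each is asymmetric about the obtained score of 130, and the SEE-based interval is the narrower of the two.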
The Standard Error of Prediction

When the standard deviation of obtained scores for an alternate
form is known, one may calculate the likely range of obtained
scores expected on retesting with an alternate form.
For this purpose, the standard error of prediction (SEP; Lord &
Novick, 1968) may be used to construct confidence intervals.
The formula for this is:

$SE_P = SD_y\sqrt{1 - r_{xx}^2}$   [8]

Where:
$SD_y$ = the standard deviation of the parallel form
administered at retest
$r_{xx}$ = the reliability of the form used at initial testing

In this case, confidence intervals are formed around estimated
true scores (derived from initial obtained scores) by multiplying
the SEP by a desired z value. That is, one would expect that
when retested over a large number of randomly parallel versions
of a test, an individual's obtained score would fall within
an interval of ±1 SEP of the estimated true score 68% of the
time, and fall within ±1.96 SEP 95% of the time. As with confidence
intervals based on the SEM, those based on the SEP will
generally not be symmetric around obtained scores. All of the
other caveats detailed previously regarding the SEM-based
confidence intervals also apply. In addition, while it may be
tempting to use SEP-based confidence intervals for evaluating
significance of change at retesting with the same measure, this
practice violates the assumptions that a parallel form is used
at retest and, particularly, that no prior exposure effects apply.
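
As a rough illustration of formula 8, the Python sketch below predicts the range of obtained scores on an alternate form. The scenario is hypothetical: it assumes an initial obtained score of 130 on a test with r = .80 and M = 100, and an alternate form whose SD is also 15. The function names are illustrative only.

```python
import math


def sep(sd_retest_form, reliability):
    """Formula 8: standard error of prediction."""
    return sd_retest_form * math.sqrt(1.0 - reliability ** 2)


def predicted_retest_interval(obtained, mean, reliability, sd_retest_form, z=1.96):
    """Interval for alternate-form scores, centered on the estimated true score."""
    center = mean + reliability * (obtained - mean)  # formula 4
    half_width = z * sep(sd_retest_form, reliability)
    return center - half_width, center + half_width


# Hypothetical case: obtained score 130, r = .80, alternate form with SD = 15.
low, high = predicted_retest_interval(
    obtained=130, mean=100, reliability=0.80, sd_retest_form=15
)
print(f"SEP = {sep(15, 0.80):.2f}; 95% interval: {low:.1f} to {high:.1f}")
```

Under these assumptions the SEP is 9 points, noticeably larger than the SEM of about 6.7 for the same test, because the prediction interval has to absorb measurement error at both testings.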
SEMs and True Scores: Practical Issues

Nunnally and Bernstein (1994) note that most test manuals
do "an exceptionally poor job of reporting estimated true
scores and confidence intervals for expected obtained scores
on alternative forms. For example, intervals are often erroneously
centered about obtained scores rather than estimated
true scores. Often the topic is not even discussed" (p. 260).
Sattler (2001) also notes that test manuals often base confidence
intervals on the overall SEM for the entire standardization
sample, rather than on SEMs for each age band. Using the
average SEM across age is not always appropriate, given that
some age groups are inherently more variable than others
(e.g., preschoolers versus adults). In general, confidence intervals
based on age-specific SEMs are preferable to those based
on the overall SEM (particularly at the extremes of the age
distribution, where there is the most variability) and can often
be constructed using age-based SEMs found in most manuals.

It is important to acknowledge that while estimated true
scores and associated confidence intervals have merit, there
are practical reasons to focus on obtained scores instead. For
example, essentially all validity studies and actuarial prediction
methods for most tests are based on obtained scores.
Therefore, obtained scores must usually be employed for diagnostic
and other purposes to maintain consistency with prior
research and test usage. For more discussion regarding the
calculation and uses of the SEM, SEE, SEP, and alternative error
models, see Dudek (1979), Lord and Novick (1968), and
Nunnally and Bernstein (1994).
VALIDITY

Models of validity are not abstract conceptual frameworks
that are only minimally related to neuropsychological practice.
The Standards for Educational and Psychological Testing
(AERA et al., 1999) state that validation is the joint responsibility
of the test developer and the test user. Thus, a
working knowledge of validity models and the validity characteristics
of specific tests is a central requirement for responsible
and competent test use. From a practical perspective,
a working knowledge of validity allows users to determine
which tests are appropriate for use and which fall below standards
for clinical practice or research utility. Thus, neuropsychologists
who use tests to detect and diagnose neurocognitive
difficulties should be thoroughly familiar with commonly
used validity models and how these can be used to evaluate
neuropsychological tools. Assuming that a test is valid because
it was purchased from a reputable test publisher, appears to
have a large normative sample, or came with a large user's
manual can be a serious error, as some well-known and commonly
used neuropsychological tests are lacking with regard
to crucial aspects of validity.
Definition of Validity

Cronbach and Meehl (1955) were some of the first theorists to
discuss the concept of construct validity. Since then, the basic
definition of validity evolved as testing needs changed over
the years. Although construct validity was first introduced as a
separate type of validity (e.g., Anastasi & Urbina, 1997), it has
moved, in some models, to encompass all types of validity
(e.g., Messick, 1993). In other models, the term "construct
validity" has been deemed redundant and has simply been replaced
by "validity," since all types of validity ultimately inform
as to the construct measured by the test. Accordingly, the
term "construct validity" has not been used in the Standards
for Educational and Psychological Testing since 1974 (AERA
et al., 1999). However, whether it is deemed "construct validity"
or simply "validity," the concept is central to evaluating
the utility of a test in the clinical or research arena.

Test validity may be defined at the most basic level as the
degree to which a test actually measures what it is intended to
measure, or in the words of Nunnally and Bernstein (1994),
"how well it measures what it purports to measure in the context
in which it is to be applied" (p. 112). As with reliability, an
important point to be made here is that a test cannot be said
to have one single level of validity. Rather, it can be said to exhibit
various types and levels of validity across a spectrum of
usage and populations. That is, validity is not a property of a
test, but rather, validity is a property of the meaning attached to
a test score; validity can only arise and be defined in the specific
context of test usage. Therefore, while it is certainly necessary
to understand the validity of tests in particular contexts,
ultimate decisions regarding the validity of test score interpretation
must take into account any unique factors pertaining to
validity at the level of individual assessment, such as deviations
from standard administration, unusual testing environments,
examinee cooperation, and the like.
In the past, assessment of validity was generally test-centric.
That is, test validity was largely indexed by comparison
with other tests, especially "standards" in the field. Since
Cronbach (1971), there has been a move away from test-based
or "measure-centered validity" (Zimiles, 1996) toward the interpretation
and external utility of tests. Messick (1989, 1993)
expanded the definition of validity to encompass an overall
judgment of the extent to which empirical evidence and theoretical
rationales support the adequacy and effectiveness of
interpretations and actions resulting from test scores. Subsequently,
Messick (1995) proposed a comprehensive model of
construct validity wherein six different, distinguishable types
of evidence contribute to construct validity. These are (1)
content related, (2) substantive, (3) structural, (4) generalizability,
(5) external, and (6) consequential evidence sources
(see Table 1-6), and they form the "evidential basis for score
interpretation" (Messick, 1995, p. 743). Likewise, the Standards
for Educational and Psychological Testing (AERA et al.,
1999) follows a model very much like Messick's, where different
kinds of evidence are used to bolster test validity based on
each of the following sources: (1) evidence based on test content,
(2) response processes, (3) internal structure, (4) relations
to other variables, and (5) consequences of testing. The
most controversial aspect of these models is the requirement
for consequential evidence to support validity. Some argue
that judging validity according to whether use of a test results
in positive or negative social consequences is too far-reaching
and may lead to abuses of scientific inquiry, as when a test result
does not agree with the overriding social climate of the
time (Lees-Haley, 1996). Social and ethical consequences, although
crucial, may therefore need to be treated separately
from validity (Anastasi & Urbina, 1997).

Table 1-6 Messick's Model of Construct Validity

Type of Evidence    Definition
Content related     Relevance, representativeness, and technical
                    quality of test content
Substantive         Theoretical rationales for the test and test
                    responses
Structural          Fidelity of scoring structure to the structure
                    of the construct measured by the test
Generalizability    Scores and interpretations generalize across
                    groups, settings, and tasks
External            Convergent and divergent validity, criterion
                    relevance, and applied utility
Consequential       Actual and potential consequences of test use,
                    relating to sources of invalidity related to
                    bias, fairness, and distributive justice (a)

(a) See Lees-Haley (1996) for limitations of this component.

Validity Models

Since Cronbach and Meehl, various models of validity have
been proposed. The most frequently encountered is the tripartite
model whereby validity is divided into three components:
content validity, criterion-related validity, and construct validity
(see Anastasi & Urbina, 1997; Mitrushina et al., 2005; Nunnally
& Bernstein, 1994; Sattler, 2001). Other validity subtypes,
including convergent, divergent, predictive, treatment, clinical,
and face validity, are subsumed within these three domains.
For example, convergent and divergent validity are most often
treated as subsets of construct validity (Sattler, 2001) and concurrent
and predictive validity as subsets of criterion validity
(e.g., Mitrushina et al., 2005). Concurrent and predictive validity
only differ in terms of a temporal gradient; concurrent validity
is relevant for tests used to identify existing diagnoses or
conditions, whereas predictive validity applies when determining
whether a test predicts future outcomes (Anastasi & Urbina,
1997). Although face validity appears to have fallen out
of favor as a type of validity, the extent to which examinees believe
a test measures what it appears to measure can affect motivation,
self-disclosure, and effort. Consequently, face validity
can be seen as a moderator variable affecting concurrent and
predictive validity that can be operationalized and measured
(Bornstein, 1996; Nevo, 1985). Again, all these labels for distinct
categories of validity are ways of providing different types
of evidence for validity and are not, in and of themselves, different
types of validity, as older sources might claim (AERA et al.,
1999; Yun & Ulrich, 2002). Lastly, validity is a matter of degree
rather than an all-or-none property; validity is therefore never
actually "finalized," since tests must be continually reevaluated
as populations and testing contexts change over time (Nunnally
& Bernstein, 1994).

How to Evaluate the Validity of a Test

Pragmatically speaking, all the theoretical models in the world
will be of no utility to the practicing clinician unless they
can be translated into specific, step-by-step procedures for
evaluating a test's validity. Table 1-7 presents a comprehensive
(but not exhaustive) list of specific features users can look for
when evaluating a test and reviewing test manuals. Each is organized
according to the type of validity evidence provided.
For example, construct validity can be assessed via correlations
with other tests, factor analysis, internal consistency
(e.g., subtest intercorrelations), convergent and discriminant
validation (e.g., multitrait-multimethod matrix), experimental
interventions (e.g., sensitivity to treatment), structural
equation modeling, and response processes (e.g., task decomposition,
protocol analysis; Anastasi & Urbina, 1997). Most
importantly, users should also remember that even if all other
conditions are met, a test cannot be considered valid if it is
not reliable (see previous discussion).

It is important to note that not all tests will have sufficient
evidence to satisfy all aspects of validity, but test users should
have a sufficiently broad knowledge of neuropsychological
tools to be able to select one test over another, based on the
quality of the validation evidence available. In essence, we
have used this model to critically evaluate all the tests reviewed
in this volume.

Note that there is a certain degree of overlap between categories
in Table 1-7. For example, correlations between a
specific test and another test measuring IQ can simultaneously
provide criterion-related evidence and construct-related
evidence of validity. Regardless of the terminology, it is important
to understand how specific techniques such as factor
analysis serve to inform the validity of test interpretation
across the range of settings in which neuropsychologists
work.
Table 1-7 Sources of Evidence and Techniques for Critically
Evaluating the Validity of Neuropsychological Tests

Type of Evidence: Required Evidence

Content-related
  - Refers to themes, wording, format, tasks, or questions on a
    test, and administration and scoring
  - Description of theoretical model on which test is based
  - Review of literature with supporting evidence
  - Definition of domain of interest (e.g., literature review,
    theoretical reasoning)
  - Operationalization of definition through thorough and
    systematic review of test domain from which items are to be
    sampled, with listing of sources (e.g., word frequency sources
    for vocabulary tests)
  - Collection of sample of items large enough to be
    representative of domain and with sufficient range of
    difficulty for target population
  - Selection of panel of judges for expert review, based on
    specific selection criteria (e.g., academic and practical
    backgrounds or expertise within specific subdomains)
  - Evaluation of items by expert panel based on specific criteria
    concerning accuracy and relevance
  - Resolution of judgment conflicts within panel for items
    lacking cross-panel agreement (e.g., empirical means such as
    Index of Item Congruence; Hambleton, 1980)

Construct-related
  - Formal definition of construct
  - Formulation of hypotheses to measure construct
  - Gathering empirical evidence of construct validation
  - Evaluating psychometric properties of instrument (i.e.,
    reliability)
  - Demonstration of test sensitivity to developmental changes,
    correlation with other tests, group differences studies,
    factor analysis, internal consistency (e.g., correlations
    between subtests, or to composites within the same test),
    convergent and divergent validation (e.g.,
    multitrait-multimethod matrix), sensitivity to experimental
    manipulation (e.g., treatment sensitivity), structural
    equation modeling, and analysis of process variables
    underlying test performance

Criterion-related
  - Identification of appropriate criterion
  - Identification of relevant sample group reflecting the entire
    population of interest; if only a subgroup is examined, then
    generalization must remain within subgroup definition (e.g.,
    keeping in mind potential sources of error such as restriction
    of range)
  - Analysis of test-criterion relationships through empirical
    means such as contrasting groups, correlations with previously
    available tests, classification of accuracy statistics (e.g.,
    positive predictive power), outcome studies, and meta-analysis

Response processes
  - Determining whether performance on the test actually relates
    to the domain being measured
  - Analysis of individual responses to determine the processes
    underlying performance (e.g., questioning test takers about
    strategy, analysis of test performance with regard to other
    variables, determining whether the test measures the same
    construct in different populations, such as age)

Source: Adapted from Anastasi & Urbina, 1997; American Educational
Research Association et al., 1999; Messick, 1995; and Yun and
Ulrich, 2002.

What Is an Adequate Validity Coefficient?

Some investigators have proposed criteria for evaluating evidence
related to criterion validity in outcome assessments. For
instance, Andrews et al. (1994) and Burlingame et al. (1995)
recommend that a minimum level of acceptability for correlations
involving criterion validity is .50. However, Nunnally
and Bernstein (1994) note that validity coefficients rarely exceed
.30 or .40 in most circumstances involving psychological
tests, given the complexities involved in measuring and predicting
human behavior. There are no hard and fast rules
when evaluating evidence supportive of validity, and interpretation
should consider how the test results will be used. Thus,
tests with even quite modest predictive validities (r = .30) may
be of considerable utility, depending on the circumstances in
which they will be used (Anastasi & Urbina, 1997; Nunnally &
Bernstein, 1994), particularly if they serve to significantly increase
the test's "hit rate" over chance. It is also important to
note that in some circumstances, criterion validity may be
measured in a categorical rather than continuous fashion,
such as when test scores are used to inform binary diagnoses
(e.g., demented versus not demented). In these cases, one
would likely be more interested in indices such as predictive
power than other measures of criterion validity (see below for
a discussion of classification accuracy statistics).
USE OF TESTS IN THE CONTEXT OF 
SCREENING AND DIAGNOSIS: 
CLASSIFlCATlON ACCURACY STATlSTICS 
In some cases, clinicians use tests to measure how much of an
attribute (e.g., intelligence) an examinee has, while in other
cases, tests are used to help determine whether or not an examinee
has a specific attribute, condition, or illness that may be
either present or absent (e.g., Alzheimer's disease). In the latter
case, a special distinction in test use may be made. Screening
tests are those which are broadly or routinely used to detect a
specific attribute, often referred to as a condition of interest, or
COI, among persons who are not "symptomatic" but who may
nonetheless have the COI (Streiner, 2003c). Diagnostic tests
are used to assist in ruling in or out a specific condition in
persons who present with "symptoms" that suggest the diagnosis
in question. Another related use of tests is for purposes of
prediction of outcome. As with screening and diagnostic tests, the
outcome of interest may be defined in binary terms: it will either
occur or not occur (e.g., return to the same type and level
of employment). Thus, in all three cases, clinicians will be
interested in the relation of the measure's distribution of scores
to an attribute or outcome that is defined in binary terms.
Typically, data concerning screening or diagnostic accuracy
are obtained by administering a test to a sample of persons
who are also classified, with respect to the COI, by a so-called
gold standard. Those who have the condition according
to the gold standard are labeled COI+, while those who do not
have the condition are labeled COI-. In medicine, the gold
standard is often a highly accurate diagnostic test that is more
expensive and/or has a higher level of associated risk of
morbidity than some new diagnostic method that is being
evaluated for use as a screening measure or as a possible
replacement for the existing gold standard. In neuropsychology,
the situation is often more complex, as the COI may be a
psychological construct (e.g., malingering) for which consensus
with respect to fundamental definitions is lacking or diagnostic
gold standards may not exist. These issues may be less
problematic when tests are used to predict outcome (e.g., return
to work), though other problems that may afflict outcome
data, such as intervening variables and sample attrition,
may complicate interpretation of predictive accuracy.
The simplest way to relate test results to binary diagnoses or
outcomes is to utilize a cutoff score. This is a single point along
the continuum of possible scores for a given test. Scores at or
above the cutoff classify examinees as belonging to one of two
groups; scores below the cutoff classify examinees as belonging
to the other group. Those who have the COI according to the
test are labeled as Test Positive (Test+), while those who do not
have the COI are labeled Test Negative (Test-).
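
The cutoff rule described above can be sketched in a few lines of
Python. This is an illustration only: the function name
classify_by_cutoff, the example scores, and the cutoff of 30 are
hypothetical. The text does not say which side of the cutoff is
Test+, so the sketch assumes by default that scores at or above the
cutoff are positive and exposes the direction as a parameter.

```python
def classify_by_cutoff(scores, cutoff, positive_if_high=True):
    """Label each score 'Test+' or 'Test-' relative to a single cutoff.

    Scores at or above the cutoff fall in one group and scores below it
    in the other. Which group counts as Test+ depends on the test, so
    the direction is a parameter (an assumption of this sketch).
    """
    labels = []
    for score in scores:
        at_or_above = score >= cutoff
        is_positive = at_or_above if positive_if_high else not at_or_above
        labels.append("Test+" if is_positive else "Test-")
    return labels


# Hypothetical scores and cutoff, for illustration only.
example_scores = [22, 30, 35, 29, 41]
example_labels = classify_by_cutoff(example_scores, cutoff=30)
print(example_labels)
```

For a measure on which low scores indicate the condition, passing
positive_if_high=False reverses the labeling without changing the
cutoff itself.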
Table 1-8 shows the relation between examinee classifications
based on test results versus classifications based on a
gold standard measure. By convention, test classification is
denoted by row membership and gold standard classification is
denoted by column membership. Cell values represent the total
number of persons from the sample falling into each of
four possible outcomes with respect to agreement between a
test and respective gold standard. By convention, agreements
between gold standard and test classifications are referred to
as True Positive and True Negative cases, while disagreements
are referred to as False Positive and False Negative cases, with
positive and negative referring to the presence or absence of a
COI as indicated by the test, and true and false referring to
whether that classification agrees with the gold standard. When
considering outcome data, observed outcome is substituted for the
gold standard. It is important to keep in mind while reading
the following section that while gold standard measures are
often implicitly treated as 100% accurate, this may not always
be the case. Any limitations in accuracy or applicability of a
gold standard or outcome measure need to be accounted for
when interpreting classification accuracy statistics.
Table 1-8 Classification/Prediction Accuracy of a Test in Relation to a "Gold Standard" or Actual Outcome

                              Gold Standard
Test Result     COI+                  COI-                  Row Total
Test+           A (True Positive)     B (False Positive)    A + B
Test-           C (False Negative)    D (True Negative)     C + D
Column Total    A + C                 B + D                 N = A + B + C + D
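
The cell counts in Table 1-8 can be tallied directly from paired
classifications. The Python sketch below is illustrative only: the
function name tally_2x2 and the six example examinees are
hypothetical, and the gold standard is treated as correct purely
for bookkeeping purposes.

```python
def tally_2x2(test_positive, gold_positive):
    """Count the four cells of Table 1-8 from paired boolean classifications.

    test_positive[i] is True if examinee i is Test+;
    gold_positive[i] is True if examinee i is COI+ per the gold standard.
    """
    if len(test_positive) != len(gold_positive):
        raise ValueError("Both classifications must cover the same examinees.")
    cells = {"A": 0, "B": 0, "C": 0, "D": 0}
    for test, gold in zip(test_positive, gold_positive):
        if test and gold:
            cells["A"] += 1   # True Positive: Test+ and COI+
        elif test and not gold:
            cells["B"] += 1   # False Positive: Test+ but COI-
        elif gold:
            cells["C"] += 1   # False Negative: Test- but COI+
        else:
            cells["D"] += 1   # True Negative: Test- and COI-
    return cells


# Hypothetical classifications for six examinees.
test = [True, True, False, False, True, False]
gold = [True, False, True, False, True, False]
cells = tally_2x2(test, gold)
row_totals = (cells["A"] + cells["B"], cells["C"] + cells["D"])
column_totals = (cells["A"] + cells["C"], cells["B"] + cells["D"])
n = sum(cells.values())
print(cells, row_totals, column_totals, n)
```

The row totals count Test+ and Test- examinees, the column totals
count COI+ and COI- examinees, and both pairs sum to N.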
Psycometrics in neuropsychological assesment

  • 1. Psychometrics in Neuropsychological Assessment with Daniel ). Slick OVERVIEW lhe pracos of ncuropsychologicJI asscssmcnt dcpcnds lo a brge exlcnt OH lhe reliability and valiJity of llcuropsycholog-ieal lesls. UnfortullJtely, no! ali neuropsychological tests are crcated equal, and, like any olher product, published tests ViU}' in terms of lheir "quali'y," as defined in psychometric tcrms such as reliability, rncasurement crror, temporal slabil-ity, sCllsitivity, spccificity, prcdictive v,llidity, and with respect to lhe care with which t('st itcms are derivcJ anJ norm,llivc data are obtaincJ. In d,lditioll tu commcf(:ial mcasurC5, nu-meram tcsts dcvclopcd primarilr for rcscarch purposcs have founJ their war into wide clinicai usagc; Ihese vary wnsidcr-ably with rcgard to psychomctric propertics. With few cxcep-tions, whcn tests originate from clinicaI research conlcxts, thnc is ohcn validity data but littlc c!se, which makcs esti-lllating mcasurelllcnt precision and stability of test scores a challenge. Rcgardless of lhe origins of neuropsyclJOlogical tesls, lheir competcnt use in clinicai practice demanJs a good working knowledge of test standards and of lhe specific psychometric charaeteristics of each lest useJ. This includes familiarity with the StanJards for Educational anJ Psychological Testing (American Educational Research Associalion [AERA] et aI., 1999) and a working knowledge ofbasic psychometrics. 'iCxts sllch as those by Nunllally and Bernstein (19')4) and AnaSlasi <IndUrbina (1997) outline some of the fundamental psycho-metric prerequisites for competent sdectioll of tests and in-terpretation of oblained scores. Other, neuropsychologieally focuseJ teXls such as Mitrushina et ai. (2005), Lezak et aI. (2004), Baron (2004), Franklill (2003a), and Franzcn (2000) also proviJe guidance. 
The following is inlended lOprovide a broad overview of important psyehometric eoncepls in neu-rupsychological assessment and coverage of important issues to consider when crilicalty evaluating leSISfor clinicai usage. Much of the information provided also serves as a conceptual framework for the test reviews in this volume. 3 THE NORNAl CURVE Thc frequency Jistributions of many physical, biological, and psychological attributes, <lSlhey occur ilCroSSindividuais in nature, tend to conform, to a greater or lcsser degree, to a bell-shaped curve (see Figure I-I). This normal wrl'c or normal distributíoll, so namcd by Karl I'earson, is also known as the Gaussian or Laplace-Gauss distribution, aftcr the 18lh-century mathematicians who first defined it. The normal curve is lhe hasis of many commonly used stalislÍeal and psychometric moJels (e.g., classical test theory) atld is lhe assumed dislri-hulion for many psyehological variables.' Definilion ond Charocleristics The normal curve has a number of spccific propcrties. It is unimodal, perfectly symmetrical and asymptolie at the t<lils. With respcct to scores frum measurcs Ihat are normally dis-tributed, the ordinate, or hcight of lhe curve at any point along the x (tesl s(Ore) axis, is the proportion af persons wilhin the sample who ohlained a givcn score. The ordinates for a range of scores O.e., between two points on the x axis) ma}' alsa bc summed lo give the proportion of persons Lhat obtaineJ a score within the speófied range. If a spccified nor-mal curve accuratdy rdleets a population distribution, then ordinatc valucs are also cquivalcnl to lhe probahility of oh-serving a given seore or range of scores when randomly sam-pling fram the popllation. Thus, the normal curve ma}' also bc refcrred lo as a probilbilily distribution. Figure 1-1 Tnc llllfrnal UlrV( x
  • 2. A Compendium of Neuropsychological Tests

The normal curve is mathematically defined as follows:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \, e^{-(x-\mu)^{2}/(2\sigma^{2})}$ [1]

Where:
$x$ = measurement values (test scores)
$\mu$ = the mean of the test score distribution
$\sigma$ = the standard deviation of the test score distribution
$\pi$ = the constant pi (3.14 ...)
$e$ = the base of natural logarithms (2.71 ...)
$f(x)$ = the height (ordinate) of the curve for any given test score

Relevance for Assessment
As noted previously, because it is a frequency distribution, the area under any given segment of the normal curve indicates the frequency of observations or cases within that interval. From a practical standpoint, this provides psychologists with an estimate of the "normality" or "abnormality" of any given test score or range of scores (i.e., whether it falls in the center of the bell shape, where the majority of scores lie, or instead, at either of the tail ends, where few scores can be found). The way in which the degree of "normality" or "abnormality" of test scores is quantified varies, but perhaps the most useful and inherently understandable metric is the percentile.

Z Scores and Percentiles
A percentile indicates the percentage of scores that fall at or below a given test score. As an example, we will assume that a given test score is plotted on a normal curve. When all of the ordinate values at and below this test score are summed, the resulting value is the percentile associated with that test score (e.g., a score in the 75th percentile indicates that 75% of the reference sample obtained equal or lower scores).
To convert scores to percentiles, raw scores may be linearly transformed or "standardized" in several ways. The simplest and perhaps most commonly calculated standard score is the z score, which is obtained by subtracting the sample mean score from an obtained score and dividing the result by the sample SD, as shown below:

$z = (x - \bar{X}) / SD$ [2]

Where:
$x$ = measurement value (test score)
$\bar{X}$ = the mean of the test score distribution
$SD$ = the standard deviation of the test score distribution

The resulting distribution of z scores has a mean of 0 and an SD of 1, regardless of the metric of raw scores from which they were derived. For example, given a mean of 25 and an SD of 5, a raw score of 20 translates into a z score of -1. The percentile corresponding to any resulting z score can then be easily looked up in tables available in most statistical texts. Z score conversions to percentiles are also shown in Table 1-1.

Interpretation of Percentiles
An important property of the normal curve is that the relationship between raw or z scores (which for purposes of this discussion are equivalent, since they are linear transformations of each other) and percentiles is not linear. That is, a constant difference between raw or z scores will be associated with a variable difference in percentile scores, as a function of the distance of the two scores from the mean. This is due to the fact that there are proportionally more observations (scores) near the mean than there are farther from the mean; otherwise, the distribution would be rectangular, or non-normal. This can readily be seen in Figure 1-2, which shows the normal distribution with demarcation of z scores and corresponding percentile ranges.
The nonlinear relation between z scores and percentiles has important interpretive implications. For example, a one-point difference between two z scores may be interpreted differently, depending on where the two scores fall on the normal curve. As can be seen, the difference between a z score of 0 and a z score of +1 is 34 percentile points, because 34% of scores fall between these two z scores (i.e., the scores being compared are at the 50th and 84th percentiles). However, the difference between a z score of +2 and a z score of +3 is less than 3 percentile points, because only 2.5% of the distribution falls between these two points (i.e., the scores being compared are at the 98th and 99.9th percentiles). On the other hand, interpretation of percentile-score differences is also not straightforward, in that an equivalent "difference" between two percentile rankings may entail different clinical implications if the scores occur at the tail end of the curve than if they occur near the middle of the distribution. For example, a 30-point difference between scores at the 1st percentile versus the 31st percentile may be more clinically meaningful than the same difference between scores at the 35th percentile versus the 65th percentile.

Linear Transformation of Z Scores: T Scores and Other Standard Scores
In addition to the z score, linear transformation can be used to produce other standardized scores that have the same properties with regard to easy conversion via table look-up (see Table 1-1). The most common of these are T scores (M = 50, SD = 10), scaled scores, and standard scores such as those used in most IQ tests (M = 10, SD = 3, and M = 100, SD = 15). It must be remembered that z scores, T scores, standard scores, and percentile equivalents are derived from samples; although these are often treated as population values, any limitations of generalizability due to reference sample composition or testing circumstances must be taken into consideration when standardized scores are interpreted.
  • 3. Table 1-1 Score Conversion Table
Columns: IQ (a), T, SS (b), Percentile, -z/+z, Percentile, SS (b), T, IQ (a). [Table body not recoverable from the extracted text.]
(a) M = 100, SD = 15; (b) M = 10, SD = 3. Note: SS = Scaled.
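The conversions that a table such as Table 1-1 tabulates can be sketched in a few lines. This is a minimal illustration, assuming a normally distributed reference sample; the function names (`z_score`, `percentile_from_z`, `rescale`) are hypothetical and are not taken from the chapter.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, SD 1) used for percentile look-up.
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_score(raw, mean, sd):
    """Linear transformation of a raw score: z = (x - mean) / SD."""
    return (raw - mean) / sd


def percentile_from_z(z):
    """Percentage of a normal distribution falling at or below z."""
    return 100.0 * STD_NORMAL.cdf(z)


def rescale(z, mean, sd):
    """Linear transformation of z to another standard-score metric."""
    return mean + sd * z


if __name__ == "__main__":
    # The chapter's example: mean 25, SD 5, raw score 20.
    z = z_score(20, mean=25, sd=5)
    print("z:", z, "percentile:", round(percentile_from_z(z), 1))
    print("T score:", rescale(z, 50, 10))
    print("scaled score:", rescale(z, 10, 3))
    print("IQ-type standard score:", rescale(z, 100, 15))

    # Equal z differences do not give equal percentile differences.
    for low, high in [(0, 1), (2, 3)]:
        gap = percentile_from_z(high) - percentile_from_z(low)
        print(f"z {low} to {high}: {gap:.1f} percentile points")
```

The last loop illustrates the nonlinear relation between z scores and percentiles: the same one-point z difference spans far fewer percentile points in the tail than near the mean.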
  • 4. Figure 1-2 The normal curve demarcated by z scores.

The Meaning of Standardized Test Scores: Score Interpretation
As well as facilitating translation of raw scores to estimated population ranks, standardization of test scores, by virtue of conversion to a common metric, facilitates comparison of scores across measures. However, this is only advisable when the raw score distributions for tests that are being compared are approximately normal in the population. In addition, if standardized scores are to be compared, they should be derived from similar samples, or more ideally, from the same sample. A score at the 50th percentile on a test normed on a population of university students does not have the same meaning as an "equivalent" score on a test normed on a population of elderly individuals. When comparing test scores, one must also take into consideration both the reliability of the two measures and their intercorrelation before determining if a significant difference exists (see Crawford & Garthwaite, 2002). In some cases, relatively large disparities between standard scores may not actually reflect reliable differences, and therefore may not be clinically meaningful. Furthermore, statistically significant or reliable differences between test scores may be common in a reference sample; therefore, the base rate of differences must also be considered, depending on the level of the scores (an IQ of 90 versus 110 as compared to 110 versus 130). One should also keep in mind that when test scores are not normally distributed, standardized scores may not accurately reflect actual population rank. In these circumstances, differences between standard scores may be misleading.
Note also that comparability across tests does not imply equality in meaning and relative importance of scores.
For example, one may compare standard scores on measures of pitch discrimination and intelligence, but it will rarely be the case that these scores are of equal clinical or practical meaning or significance.

Interpreting Extreme Scores
A final critical issue with respect to the meaning of standardized scores (e.g., z scores) has to do with extreme observations. In clinical practice, one may encounter standard scores that are either extremely low or extremely high. The meaning and comparability of such scores will depend critically on the characteristics of the normative sample from which they derive.
For example, consider a hypothetical case in which an examinee obtains a raw score that is below the range of scores found in a normal sample. Suppose further that the SD in the normal sample is very small and thus the examinee's raw score translates to a z score of -5, indicating that the probability of encountering this score in the normal population would be 3 in 10 million (i.e., a percentile ranking of .00003). This represents a considerable extrapolation from the actual normative data, as (1) the normative sample did not include 10 million individuals and (2) not a single individual in the normative sample obtained a score anywhere close to the examinee's score. The percentile value is therefore an extrapolation and confers a false sense of precision. While one may be confident that it indicates impairment, there may be no basis to assume that it represents a meaningfully "worse" performance than a z score of -3, or of -4.
The estimated prevalence value of an obtained z score (or T score, etc.) can be calculated to determine whether interpretation of extreme scores may be appropriate. This is simply accomplished by inverting the percentile score corresponding to the z score (i.e., dividing 1 by the percentile score, expressed as a proportion). For example, a z score of -4 is associated with an estimated frequency of occurrence or prevalence of approximately 0.00003. Dividing 1 by this value gives a rounded result of 31,560. Thus, the estimated prevalence value of this score in the population is 1 in 31,560. If the normative sample from which a z score is derived is considerably smaller than the denominator of the estimated prevalence value (i.e., 31,560 in the example), then some caution may be warranted in interpreting the percentile. In addition, whenever such extreme scores are being interpreted, examiners should also verify that the examinee's raw score falls within the range of raw scores in the normative sample. If the normative sample size is substantially smaller than the estimated prevalence sample size and the examinee's score falls outside the sample range, then considerable caution may be indicated in interpreting the percentile associated with the standardized score. Regardless of the z score value, it must also be kept in mind that interpretation of the associated percentile value may not be justifiable if the normative sample has a significantly non-normal distribution (see later for further discussion of non-normality). In sum, the clinical interpretation of extreme scores depends to a large extent on the properties of the normal samples involved; one can have more confidence that the percentile is reasonably accurate if the normative sample is large and well constructed and the shape of the normative sample distribution is approximately normal, particularly in tail regions where extreme scores are found.

The Normal Curve and Test Construction
Although the normal curve is from many standpoints an ideal or even expected distribution for psychological data, test score
  • 5. samples do not always conform to a normal distribution. When a new test is constructed, non-normality can be "corrected" by examining the distribution of scores on the prototype test, adjusting test properties, and resampling until a normal distribution is reached. For example, when a test is first administered during a try-out phase and a positively skewed distribution is obtained (i.e., with most scores clustering at the tail end of the distribution), the test likely has too high a floor, causing most examinees to obtain low scores. Easy items can then be added so that the majority of scores fall in the middle of the distribution rather than at the lower end (Anastasi & Urbina, 1997). When this is successful, the greatest numbers of individuals obtain about 50% of items correct. This level of difficulty usually provides the best differentiation between individuals at all ability levels (Anastasi & Urbina, 1997).
It must be noted that a test with a normal distribution in the general population may show extreme skew or other divergence from normality when administered to a population that differs considerably from the average individual. For example, a vocabulary test that produces normally distributed scores in a general sample of individuals may display a negatively skewed distribution due to a low ceiling when administered to doctoral students in literature, and a positively skewed distribution due to a high floor when administered to preschoolers from recently immigrated, Spanish-speaking families (see Figure 1-3 for examples of positive and negative skew). In this case, the test would be incapable of effectively discriminating between individuals within either group because of ceiling effects and floor effects, respectively, even though it is of considerable utility in the general population. Thus, a test's distribution, including floors and ceilings, must always be considered when assessing individuals who differ from the normative sample in terms of characteristics that affect test scores (e.g., in this example, degree of exposure to English words). In addition, whether a test produces a normal distribution (i.e., without positive or negative skew) is also an important aspect of evaluating tests for bias across different populations (see Chapter 2 for more discussion of bias).

Figure 1-3 Skewed distributions. Panels: Positive Skew, Negative Skew.

Depending on the characteristics of the construct being measured and the purpose for which a test is being designed, a normal distribution of scores may not be obtainable or even desirable. For example, the population distribution of the construct being measured may not be normally distributed. Alternatively, one may want only to identify and/or discriminate between persons at only one end of a continuum of abilities (e.g., a creativity test for gifted students). In this case, the characteristics of only one side of the sample score distribution (i.e., the upper end) are critical, while the characteristics on the other side of the distribution are of no particular concern. The measure may even be deliberately designed to have floor or ceiling effects. For example, if one is not interested in one tail (or even one-half) of the distribution, items that would provide discrimination in that region may be omitted to save administration time. In this case, a test with a high floor or low ceiling in the general population (and with positive or negative skew) may be more desirable than a test with a normal distribution. In most applications, however, a more normal-looking curve within the targeted subpopulation is usually desirable.

Non-Normality
Although the normal curve is an excellent model for psychological data and many sample distributions of natural processes are approximately normal, it is not unusual for test score distributions to be markedly non-normal, even when samples are large (Micceri, 1989).² For example, neuropsychological tests such as the Boston Naming Test (BNT) and Wisconsin Card Sorting Test (WCST) do not have normal distributions when raw scores are examined, and, even when demographic correction methods are applied, some tests continue to show a non-normal, multimodal distribution in some populations (Fastenau, 1998). (An example of a non-normal distribution is shown in Figure 1-4.)

Figure 1-4 A non-normal test score distribution. Axes: Raw Score, Percentiles; Mean = 50, SD = 10.

The degree to which a given distribution approximates the underlying population distribution increases as the number of observations (N) increases and becomes less accurate as N decreases. This has important implications for norms comprised of small samples. Thus, a larger sample will produce a more normal distribution, but only if the underlying population distribution from which the sample is obtained is normal. In other words, a large N does not "correct" for non-normality of an underlying population distribution. However,
  • 6. small samples may yield non-normal distributions due to random sampling effects, even though the population from which the sample is drawn has a normal distribution. That is, one may not automatically assume, given a non-normal distribution in a small sample, that the population distribution is in fact non-normal (note that the converse may also be true).
Several factors may lead to non-normal test score distributions: (a) the existence of discrete subpopulations within the general population with differing abilities, (b) ceiling or floor effects, and (c) treatment effects that change the location of means, medians, and modes and affect variability and distribution shape (Micceri, 1989).

Skew
As with the normal curve, some varieties of non-normality may be characterized mathematically. Skew is a formal measure of asymmetry in a frequency distribution that can be calculated using a specific formula (see Nunnally & Bernstein, 1994). It is also known as the third moment of a distribution (the mean and variance are the first and second moments, respectively). A true normal distribution is perfectly symmetrical about the mean and has a skew of zero. A non-normal but symmetric distribution will have a skew value that is near zero. Negative skew values indicate that the left tail of the distribution is heavier (and often more elongated) than the right tail, which may be truncated, while positive skew values indicate that the opposite pattern is present (see Figure 1-3). When distributions are skewed, the mean and median are not identical because the mean will not be at the midpoint in rank and z scores will not accurately translate into sample percentile rank values. The error in mapping of z scores to sample percentile ranks increases as skew increases.

Truncated Distributions
Significant skew often indicates the presence of a truncated distribution.
This may occur when the range of scores is restricted on one side but not the other, as is the case, for example, with reaction time measures, which cannot be lower than several hundred milliseconds, but can reach very high positive values in some individuals. In fact, distributions of scores from reaction time measures, whether aggregated across trials on an individual level or across individuals, are often characterized by positive skew and positive outliers. Mean values may therefore be positively biased with respect to the "central tendency" of the distribution as defined by other indices, such as the median. Truncated distributions are also commonly seen on error scores. A good example of this is Failure to Maintain Set (FMS) scores on the WCST (see review in this volume). In the normative sample of 30- to 39-year-old persons, observed raw scores range from 0 to 21, but the majority of persons (84%) obtain scores of 0 or 1, and less than 1% obtain scores greater than 3.

Floor/Ceiling Effects
Floor and ceiling effects may be defined as the presence of truncated tails in the context of limitations in range of item difficulty. For example, a test may be said to have a high floor when a large proportion of the examinees obtain raw scores at or near the lowest possible score. This may indicate that the test lacks a sufficient number and range of easier items. Conversely, a test may be said to have a low ceiling when the opposite pattern is present (i.e., when a high number of examinees obtain raw scores at or near the highest possible score). Floor and ceiling effects may significantly limit the usefulness of a measure. For example, a measure with a high floor may not be suitable for use with low functioning examinees, particularly if one wishes to delineate level of impairment.
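The effect of truncation on skew and on the mean can be sketched with simulated data. This is a minimal illustration under stated assumptions: the reaction times are hypothetical (a 200 ms lower bound plus an exponential tail), skew is computed as the mean of cubed z scores (one common formula, not necessarily the one the cited texts give), and all function names are invented for the example.

```python
import random
import statistics


def skewness(scores):
    """Mean of cubed z scores (third standardized moment, population form)."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum(((x - mean) / sd) ** 3 for x in scores) / len(scores)


def proportion_at_or_below(scores, value):
    """Share of examinees scoring at or below a given value (e.g., the floor)."""
    return sum(1 for x in scores if x <= value) / len(scores)


def simulate_reaction_times(n=2000, seed=1):
    """Hypothetical reaction times in ms: bounded below, long upper tail."""
    rng = random.Random(seed)
    return [200.0 + rng.expovariate(1 / 150.0) for _ in range(n)]


if __name__ == "__main__":
    rts = simulate_reaction_times()
    print("skew:", round(skewness(rts), 2))
    print("mean:", round(statistics.fmean(rts), 1))
    print("median:", round(statistics.median(rts), 1))

    # Hypothetical error scores piled up at the lowest values.
    errors = [0] * 60 + [1] * 24 + [2] * 10 + [3] * 5 + [9]
    print("proportion scoring 0 or 1:", proportion_at_or_below(errors, 1))
```

In data of this shape the mean sits above the median, which is the positive bias in the mean that the passage on reaction time measures describes.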
Multimodality and Other Types of Non-Normality
Multimodality is the presence of more than one "peak" in a frequency distribution (see histogram in Figure 1-4 for an example). Another form of significant non-normality is the uniform or near-uniform distribution (a distribution with no or minimal peak and relatively equal frequency across scores). When such distributions are present, linearly transformed scores (z scores, T scores, and other deviation scores) may be totally inaccurate with respect to actual sample/population percentile rank and should not be interpreted in that framework. In these cases, sample-derived rank percentile scores may be more clinically useful.

Non-Normality and Percentile Derivations
Non-normality is not trivial; it has major implications for derivation and interpretation of standard scores and comparison of such scores across tests: standardized scores derived by linear transformation (e.g., z scores) will not correspond to sample percentiles, and the degree of divergence may be quite large.
Consider the histogram in Figure 1-4, which shows the distribution of scores obtained for a hypothetical test. This test, with a sample size of 1000, has a mean raw score of 50 and a standard deviation of 10; therefore (and very conveniently), no linear transformation is required to obtain T scores. An expected normal distribution based on the observed mean and standard deviation has been overlaid on the observed histogram for purposes of comparison.
The histogram in Figure 1-4 shows that the distribution of scores for the hypothetical test is grossly non-normal, with a truncated lower tail and significant positive skew, indicating floor effects and the existence of two distinct subpopulations.
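The divergence between linearly derived scores and sample percentiles can be reproduced with simulated data. The sketch below is an assumption-laden illustration, not the data behind Figure 1-4: it builds a hypothetical sample of 1000 scores from two subpopulations with a floor at 40, then compares the percentile implied by the z score with the percentile rank actually observed in the sample.

```python
import random
from statistics import NormalDist, fmean, pstdev


def empirical_percentile(sample, score):
    """Percentage of sample scores at or below the given score."""
    return 100.0 * sum(1 for x in sample if x <= score) / len(sample)


def normal_theory_percentile(sample, score):
    """Percentile implied by the z score if the sample were normal."""
    z = (score - fmean(sample)) / pstdev(sample)
    return 100.0 * NormalDist().cdf(z)


def make_bimodal_sample(seed=7):
    """Hypothetical mixture of two subpopulations with a floor at 40."""
    rng = random.Random(seed)
    low = [max(40.0, rng.gauss(44, 2.5)) for _ in range(700)]
    high = [rng.gauss(64, 6) for _ in range(300)]
    return low + high


if __name__ == "__main__":
    sample = make_bimodal_sample()
    print("mean:", round(fmean(sample), 1), "SD:", round(pstdev(sample), 1))
    print("normal-theory percentile of 40:",
          round(normal_theory_percentile(sample, 40.0), 1))
    print("empirical percentile of 40:",
          round(empirical_percentile(sample, 40.0), 1))
```

With a sample of this shape, the normal-theory percentile for a score of 40 is several times larger than the observed percentile rank, which is the kind of discrepancy the hypothetical test in the text is meant to show.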
If the distribution were normal (i.e., if we follow the normal curve, superimposed on the histogram in Figure 1-4, instead of the histogram itself), a raw score of 40 would correspond to a T score of 40, a score that is 1 SD or 10 points from the
  • 7. mean, and translate to the 16th percentile (percentile not shown in the graph). However, when we calculate a percentile for the actual score distribution (i.e., the histogram), a score of 40 is actually below the 1st percentile with respect to the observed sample distribution (percentile = 0.8). Clearly, the difference in percentiles in this example is not trivial and has significant implications for score interpretation.

Normalizing Test Scores
When confronted with problematic score distributions, many test developers employ "normalizing" transformations in an attempt to correct departures from normality (examples of this can be found throughout this volume, in the Normative Data section for tests reviewed). Although helpful, these procedures are by no means a panacea, as they often introduce problems of their own with respect to interpretation. Additionally, many test manuals contain only a cursory discussion of normalization of test scores. Anastasi and Urbina (1997) state that scores should only be normalized if: (1) they come from a large and representative sample, or (2) any deviation from normality arises from defects in the test rather than characteristics of the sample. Furthermore, as we have noted above, it is preferable to adjust score distributions prior to normalization by modifying test content (e.g., by adding or modifying items) rather than statistically transforming non-normal scores into a normal distribution. Although a detailed discussion of normalization procedures is beyond the scope of this chapter (interested readers are referred to Anastasi & Urbina, 1997), ideally, test makers should describe in detail the nature of any significant sample non-normality and the procedures used to correct it for derivation of standardized scores.
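One common way of normalizing scores is to replace each raw score by the normal deviate that corresponds to its percentile rank in the sample. The sketch below assumes that approach; the chapter does not specify a procedure, so the mid-rank convention, the T score metric, and the function name `normalized_t_scores` are all illustrative choices.

```python
from statistics import NormalDist


def normalized_t_scores(raw_scores):
    """Map each distinct raw score to a normalized T score (M = 50, SD = 10).

    Each score's mid-rank proportion (cases below plus half the ties,
    divided by N) is converted to the z score with that cumulative
    probability under the normal curve.
    """
    n = len(raw_scores)
    std_normal = NormalDist()
    table = {}
    for score in set(raw_scores):
        below = sum(1 for x in raw_scores if x < score)
        ties = sum(1 for x in raw_scores if x == score)
        proportion = (below + 0.5 * ties) / n  # always strictly between 0 and 1
        table[score] = 50.0 + 10.0 * std_normal.inv_cdf(proportion)
    return table


if __name__ == "__main__":
    # Hypothetical positively skewed raw scores.
    raw = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 5, 9]
    for score, t in sorted(normalized_t_scores(raw).items()):
        print(score, round(t, 1))
```

A transformation of this kind forces the score distribution toward a normal shape regardless of why the raw scores were non-normal, which is why the cautions in the text about sample size, representativeness, and the source of the non-normality still apply.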
The reasons for correction should also be justified, and direct percentile conversions based on the uncorrected sample distribution should be provided as an option for users. Despite the limitations inherent in correcting for non-normality, Anastasi and Urbina (1997) note that most test developers will probably continue to do so because of the need to use test scores in statistical analyses that assume normality of distributions. From a practical point of view, test users should be aware of the mathematical computations and transformations involved in deriving scores for their instruments. When all other things are equal, test users should choose tests that provide information on score distributions and any procedures that were undertaken to correct non-normality, over those that provide partial or no information.

Extrapolation/Interpolation
Despite all the best efforts, there are times when norms fall short in terms of range or cell size. This includes missing data in some cells, inconsistent age coverage, or inadequate demographic composition of some cells compared to the population. In these cases, data are often extrapolated or interpolated using the existing score distribution and techniques such as multiple regression. For example, Heaton and colleagues have published sets of norms that use multiple regression to correct for demographic characteristics and compensate for few subjects in some cells (Heaton et al., 2003). Although multiple regression is robust to slight violations of assumptions, estimation errors may occur when using normative data that violates the assumptions of homoscedasticity (uniform variance across the range of scores) and normal distribution of scores necessary for multiple regression (Fastenau & Adams, 1996; Heaton et al., 1996).
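The logic of regression-based demographic correction can be sketched with a single predictor. This is a deliberately simplified illustration: it uses ordinary least squares with age only rather than the multiple regression used in published norms, the normative data are hypothetical, and the function names are invented for the example. The second function is a rough check on the homoscedasticity assumption mentioned above.

```python
from statistics import fmean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def corrected_z(score, age, norm_ages, norm_scores):
    """z score of an obtained score relative to the score predicted for that age."""
    slope, intercept = fit_line(norm_ages, norm_scores)
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(norm_ages, norm_scores)]
    return (score - (intercept + slope * age)) / pstdev(residuals)


def residual_sd_by_age_half(norm_ages, norm_scores):
    """Residual SD for the younger and older halves of the normative sample.

    Markedly different values suggest non-uniform variance across age.
    """
    slope, intercept = fit_line(norm_ages, norm_scores)
    cut = sorted(norm_ages)[len(norm_ages) // 2]
    younger = [y - (intercept + slope * x)
               for x, y in zip(norm_ages, norm_scores) if x < cut]
    older = [y - (intercept + slope * x)
             for x, y in zip(norm_ages, norm_scores) if x >= cut]
    return pstdev(younger), pstdev(older)


if __name__ == "__main__":
    # Hypothetical normative data: raw scores declining with age.
    ages = [20, 30, 40, 50, 60, 70]
    scores = [52, 50, 47, 46, 42, 41]
    print("slope, intercept:", fit_line(ages, scores))
    print("corrected z for a score of 40 at age 65:",
          round(corrected_z(40, 65, ages, scores), 2))
    print("residual SDs (younger, older):",
          residual_sd_by_age_half(ages, scores))
```

Because the prediction comes from a straight line fitted to the available ages, applying it outside that age range, or to a measure whose relation to age is curved, inherits the estimation problems the surrounding text describes.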
Age extrapolations beyond the bounds of the actual ages of the individuals in the samples are also sometimes seen in normative data sets, based on projected developmental curves. These norms should be used with caution due to the lack of actual data points in these age ranges. Extrapolation methods, such as those that employ regression techniques, depend on the shape of the distribution of scores. Including only a subset of the distribution of age scores in the regression (e.g., by omitting very young or very old individuals) may change the projected developmental slope of certain tests dramatically. Tests that appear to have linear relationships, when considered only in adulthood, may actually have highly positively skewed binomial functions when the entire age range is considered. One example is vocabulary, which tends to increase exponentially during the preschool years, shows a slower rate of progress during early adulthood, remains relatively stable with continued gradual increase, and then shows a minor decrease with advancing age. If only a subset of the age range (e.g., adults) is used to estimate performance at the tail ends of the distribution (e.g., preschoolers and elderly), the estimation will not fit the shape of the actual distribution.
Thus, normalization may introduce error when the relationship between a test and a demographic variable is nonlinear. In this case, linear correction using multiple regression distorts the true relationship between variables (Fastenau, 1998).

MEASUREMENT PRECISION: RELIABILITY AND STANDARD ERROR
Like all forms of measurement, psychological tests are not perfectly precise; rather, test scores must be seen as estimates of abilities or functions, each associated with some degree of measurement error.³ Each test differs in the precision of the scores that it produces. Of critical importance is the fact that no test has only one specific level of precision.
Rather, precision always varies to some degree, and potentially substantially, across different populations and test-use settings. Therefore, estimates of measurement error relevant to specific testing circumstances are a prerequisite for correct interpretation. For example, even the most precise test may produce highly imprecise results if administered in a nonstandard fashion, in a nonoptimal environment, or to an uncooperative examinee. Aside from these obvious caveats, a few basic
  • 8. principles help in determining whether a test generally provides precise measurements in most situations where it will be used. We begin with an overview of the related concepts of reliability, true scores, obtained scores, the various estimates of measurement error, and the notion of confidence intervals. These are reviewed below.

Definition of Reliability
Reliability refers to the consistency of measurement of a given test and can be defined in several ways, including consistency within itself (internal consistency reliability), consistency over time (test-retest reliability), consistency across alternate forms (alternate form reliability), and consistency across raters (interrater reliability). Indices of reliability indicate the degree to which a test is free from measurement error (or the proportion of variance in observed scores attributable to variance in true scores). The interpretation of such indices is often not so straightforward.
It is important to note that the term "error" in this context does not actually refer to "incorrect" or "wrong" information. Rather, "error" consists of the multiple sources of variability that affect test scores. What may be termed error variance in one application may be considered part of the true score in another, depending on the construct being measured (state or trait), the nature of the test employed, and whether it is deemed relevant or irrelevant to the purpose of the testing (Anastasi & Urbina, 1997). An example relevant to neuropsychology is that internal reliability coefficients tend to be smaller at either end of the age continuum. This finding has been attributed to both limitations of tests (e.g., measurement error) and increased intrinsic performance variability among very young and very old examinees.

Factors Affecting Reliability
Reliability coefficients are influenced by (a) test characteristics (e.g., length, item type, item homogeneity, and influence of guessing) and (b) sample characteristics (e.g., sample size, range, and variability). The extent of a test's "clarity" is intimately related to its reliability: reliable measures typically have (a) clearly written items, (b) easily understood test instructions, (c) standardized administration conditions, (d) explicit scoring rules that minimize subjectivity, and (e) a process for training raters to a performance criterion (Nunnally & Bernstein, 1994). For a list of commonly used reliability coefficients and their associated sources of error variance, see Table 1-2.

Table 1-2 Sources of Error Variance in Relation to Reliability Coefficients
Type of Reliability Coefficient | Error Variance
Split-half | Content sampling
Kuder-Richardson | Content sampling
Coefficient alpha | Content sampling
Test-retest | Time sampling
Alternate-form (immediate) | Content sampling
Alternate-form (delayed) | Content sampling and time sampling
Interrater | Interscorer differences

Internal Reliability
Internal reliability reflects the extent to which items within a test measure the same cognitive domain or construct. It is a core index in classical test theory. A measure of the intercorrelation of items, internal reliability is an estimate of the correlation between randomly parallel test forms, and by extension, of the correlation between test scores and true scores. This is why it is used for estimating true scores and associated standard errors (Nunnally & Bernstein, 1994). All things being equal, longer tests will generally yield higher reliability estimates (Sattler, 2001). Internal reliability is usually assessed with some measure of the average correlation among items within a test (Nunnally & Bernstein, 1994). These include the split-half or Spearman-Brown reliability coefficient (obtained by correlating two halves of items from the same test) and coefficient alpha, which provides a general estimate of reliability based on all the possible ways of splitting test items. Alpha is essentially based on the average intercorrelation between test items and any other set of items, and is used for tests with items that yield more than two response types (i.e., possible scores of 0, 1, or 2). For additional useful references concerning alpha, see Cronbach (2004) and Streiner (2003a, 2003b). The Kuder-Richardson reliability coefficient is used for items with yes/no answers or heterogeneous tests where split-half methods must be used (i.e., the mean of all the different split-half coefficients if the test were split into all possible ways). Generally, Kuder-Richardson coefficients will be lower than split-half coefficients when tests are heterogeneous in terms of content (Anastasi & Urbina, 1997).

The Special Case of Speed Tests
Tests involving speed, where the score exclusively depends on the number of items completed within a time limit rather than the number correct, will cause spuriously high internal reliability estimates if internal reliability indices such as split-half reliability are used. For example, dividing the items into two halves to calculate a split-half reliability coefficient will yield two half-tests with 100% item completion rates, whether the individual obtained a score of 4 (i.e., yielding two half-tests totaling 2 points each, or perfect agreement) or 44 (i.e., yielding two half-tests both totaling 22 points, also yielding perfect agreement). The result in both cases is a split-half reliability of 1.00 (Anastasi & Urbina, 1997). Some alternatives are to use test-retest reliability or alternate form reliability, ideally with the alternate forms administered in immediate succession to avoid time sampling error.
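Two of the internal reliability coefficients discussed in this section can be sketched directly from an examinee-by-item score matrix. This is a minimal illustration using the standard textbook formulas for coefficient alpha and for an odd-even split-half correlation stepped up with the Spearman-Brown formula; the chapter itself does not give these formulas, and the data and function names are hypothetical.

```python
from statistics import fmean, pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of examinees, each a list of item scores."""
    k = len(item_scores[0])
    item_variances = [pvariance([row[i] for row in item_scores])
                      for i in range(k)]
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(item_scores):
    """Odd-even split-half correlation with the Spearman-Brown correction."""
    first_half = [sum(row[0::2]) for row in item_scores]
    second_half = [sum(row[1::2]) for row in item_scores]
    r = pearson_r(first_half, second_half)
    return 2 * r / (1 + r)


if __name__ == "__main__":
    # Hypothetical data: five examinees, four items scored 0, 1, or 2.
    data = [
        [2, 2, 1, 2],
        [1, 1, 1, 0],
        [0, 0, 1, 0],
        [2, 1, 2, 2],
        [1, 0, 0, 1],
    ]
    print("alpha:", round(cronbach_alpha(data), 3))
    print("split-half (Spearman-Brown):",
          round(split_half_reliability(data), 3))
```

Neither coefficient is meaningful for a pure speed test, for the reason given above: when every attempted item is passed, the two half-test totals agree by construction.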
  • 9. be calculated for any test that can be divided into specific time intervals; scores per interval can then be compared in a procedure akin to the split-half method, as long as items are of relatively equivalent difficulty (Anastasi & Urbina, 1997). For most of the speed tests reviewed in this volume, reliability is estimated by using the test-retest reliability coefficient, or else by a generalizability coefficient (see below).

Test-Retest Reliability

Test-retest reliability, also known as temporal stability, provides an estimate of the correlation between two test scores from the same test administered at two different points in time. A test with good temporal stability should show little change over time, providing that the trait being measured is stable and there are no differential effects of prior exposure. It is important to note that tests measuring dynamic (i.e., changeable) abilities will by definition produce lower test-retest reliabilities than tests measuring domains that are more trait-like and stable (Nunnally & Bernstein, 1994). See Table 1-3 for common sources of bias and error in test-retest situations.

Table 1-3 Common Sources of Bias and Error in Test-Retest Situations

Bias
  Intervening variables
    Events of interest (e.g., surgery, medical intervention, rehabilitation)
    Extraneous events
  Practice effects
    Memory for content
    Procedural learning
    Other factors
      (a) Familiarity with testing context and examiner
      (b) Performance anxiety
  Demographic considerations
    Age (maturational effects and aging)
    Education
    Gender
    Ethnicity
    Baseline ability
Error
  Statistical errors
    Measurement error (SEM)
    Regression to the mean (SEE)
  Random or uncontrolled events

Source: From Lineweaver & Chelune, 2003. Reprinted with permission from Elsevier.

A test has an infinite number of possible test-retest reliabilities, depending on the length of the time interval between testing. In some cases, reliability estimates are inversely related to the time interval between baseline and retest (Anastasi & Urbina, 1997). In other words, the shorter the time interval between test and retest, the higher the reliability coefficient will be. However, the extent to which the time interval affects the test-retest coefficient will depend on the type of ability evaluated (i.e., stable versus more variable). Reliability may also depend on the type of individual being assessed, as some groups are intrinsically more variable over time than others. For example, the extent to which scores fluctuate over time may depend on subject characteristics, including age (e.g., normal preschoolers will show more variability than adults) and neurological status (e.g., TBI examinees' scores may vary more in the acute state than in the post-acute state). Ideally, reliability estimates should be provided for both normal individuals and the clinical populations in which the test is intended to be used, and the specific demographic characteristics of the samples should be fully specified. Test stability coefficients presented in published test manuals are usually derived from relatively small normal samples tested over much shorter intervals than are typical for retesting in clinical practice and should therefore be considered with due caution when drawing inferences regarding clinical cases. However, there is some evidence that duration of interval has less of an impact on test-retest scores than subject characteristics (Dikmen et al., 1999).

Prior Exposure and Practice Effects

Variability in scores on the same test over time may be related to situational variables such as examinee state, examiner state, examiner identity (same versus different examiner at retest), or environmental conditions that are often unsystematic and may or may not be considered sources of measurement error. Apart from these variables, one must consider, and possibly parse out, effects of prior exposure, which are often conceptualized as involving implicit or explicit learning. Hence the term practice effects is often used. However, prior exposure to a test does not necessarily lead to increased performance at retest. Note also that the actual nature of the test may sometimes change with exposure. For instance, tests that rely on a "novelty effect" and/or require deduction of a strategy or problem solving (e.g., WCST, Tower of London) may not be conducted in the same way once the examinee has prior familiarity with the testing paradigm.

Like some measures of problem-solving abilities, measures of learning and memory are also highly susceptible to practice effects, though these are less likely to reflect a fundamental change in how examinees approach tasks. In either case, practice effects may lead to low test-retest correlations by effectively lowering the ceiling at retest, resulting in a restriction of range (i.e., many examinees obtain scores at or near the maximum possible at retest). Nevertheless, restriction of range should not be assumed when test-retest correlations are low until this has been verified by inspection of data.

The relationship between prior exposure and test stability coefficients is complex, and although test-retest coefficients may be affected by practice or prior exposure, the coefficient does not indicate the magnitude of such effects. That is, retest correlations will be very high when individual retest scores all change by a similar amount, whether the practice effect is nil or very large. When stability coefficients are low, then there may be (1) no systematic effects of prior exposure, (2) the relation
  • 10. of prior exposure may be nonlinear, or (3) ceiling effects/restriction of range related to prior exposure may be attenuating the coefficient. For example, certain subgroups may benefit more from prior exposure to test material than others (e.g., high-IQ individuals; Rapport et al., 1998), or some subgroups may demonstrate more stable scores or consistent practice effects than do others. This causes the score distribution to change at retest (effectively "shuffling" the individuals' rankings in the distribution), which will attenuate the correlation. In these cases, the test-retest correlation may vary significantly across subgroups and the correlation for the entire sample will not be the best estimate of reliability for any of the subgroups, overestimating reliability for some and underestimating reliability for others. In some cases, practice effects, as long as they are relatively systematic and accurately assessed, will not render a test unusable from a reliability perspective, though they should always be taken into account when retest scores are interpreted. In addition, individual factors must always be considered. For example, while improved performance may usually be expected with a particular measure, an individual examinee may approach tests that he or she had difficulty with previously with heightened anxiety that leads to decreased performance. Lastly, it must be kept in mind that factors other than prior exposure (e.g., changes in environment or examinee state) may affect test-retest reliability.

Alternate Forms Reliability

Some investigators advocate the use of alternate forms to eliminate the confounding effects of practice when a test must be administered more than once (e.g., Anastasi & Urbina, 1997).
However, this practice introduces a second form of error variance into the mix (i.e., content sampling error), in addition to the time sampling error inherent in test-retest paradigms (see Table 1-3; see also Lineweaver & Chelune, 2003). Thus, tests with alternate forms must have extremely high correlations between forms in addition to high test-retest reliability to confer any advantage over using the same form administered twice. Moreover, they must demonstrate equivalence in terms of mean scores from test to retest, as well as consistency in score classification within individuals from test to retest. Furthermore, alternate forms do not necessarily eliminate effects of prior exposure, as exposure to stimuli and procedures can confer some positive carry-over effect (e.g., procedural learning) despite the use of a different set of items. These effects may be minimal across some types of well-constructed parallel forms, such as those assessing acquired knowledge. For measures such as the WCST, where specific learning and problem solving are involved, it may be difficult or impossible to produce an equivalent alternate form that will be free of effects of prior exposure to the original form.

While it is possible to attain this degree of psychometric sophistication through careful item analysis, reliability studies, and administration to a representative normative group, it is rare for alternate forms to be constructed with the same psychometric rigor as were the original forms from which they were derived. Even well-constructed alternate forms often lack crucial validation evidence such as similar correlations to criterion measures as the original test form. This is especially true for older neuropsychological tests, particularly those with original forms that were never subjected to any item analysis or reliability studies whatsoever (e.g., BVRT).
Inadequate test construction and psychometric properties are also found for alternate forms in more general published tests in common usage (e.g., WRAT-3). Because so few alternate forms are available, and few of those that are available meet these psychometric standards, our tendency is to use reliable change indices or standardized regression-based scores for estimating change from test to retest.

Interrater Reliability

Most test manuals provide specific and detailed instructions on how to administer and score tests according to standard procedures to minimize error variance due to different examiners and scorers. However, some degree of examiner variance remains in individually administered tests, particularly when scores involve a degree of judgment (e.g., multiple-response verbal tests such as the Wechsler Vocabulary Scales, which require the rater to assign a score from 0 to 2). In this case, an estimate of the reliability of administration and scoring across examiners is needed.

Interrater reliability can be evaluated using percentage agreement, kappa, product-moment correlation, and the intraclass correlation coefficient (Sattler, 2001). For any given test, Pearson correlations will provide an upper limit for the intraclass correlations, but intraclass correlations are preferred because, unlike the Pearson's r, they distinguish paired assessments made by the same set of examiners from those made by different examiners. Thus, the intraclass correlation distinguishes those sets of scores ranked in the same order from those that are ranked in the same order but have low, moderate, or complete agreement with each other, and corrects for interexaminer or test-retest agreement expected by chance alone (Cicchetti & Sparrow, 1981).
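The difference between consistency in ranking and agreement can be illustrated with a small sketch. The two raters and their scores below are hypothetical, and the intraclass correlation shown is one common variant (a two-way random-effects, absolute-agreement, single-rater coefficient, often labeled ICC(2,1)); the source does not specify which form is intended, so this choice is an assumption made for illustration only.

```python
# Illustrative sketch (not from the source): Pearson r versus an
# absolute-agreement intraclass correlation for two hypothetical raters.
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_absolute_agreement(ratings):
    """Single-rater, two-way random-effects ICC (assumed variant: ICC(2,1)).

    `ratings` holds one row per examinee and one column per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares for examinees (rows), raters (columns), and residual error
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical data: rater B ranks examinees exactly as rater A does,
# but scores every examinee two points higher.
rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]

r = pearson_r(rater_a, rater_b)
icc = icc_absolute_agreement([list(pair) for pair in zip(rater_a, rater_b)])
print(f"Pearson r = {r:.3f}, ICC = {icc:.3f}")
```

In this made-up case the rankings are identical, so the Pearson correlation is perfect, while the absolute-agreement coefficient is considerably lower because the two raters never assign the same score.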
However, advantages of the Pearson correlation are that it is familiar, is readily interpretable, and can be easily compared using standard statistical techniques; it is best for evaluating consistency in ranking rather than agreement per se (Fastenau et al., 1996).

Generalizability Coefficients

One reliability coefficient type not covered in this list is the generalizability coefficient, which is starting to appear more frequently in test manuals, particularly in the larger test batteries (e.g., Wechsler scales and NEPSY). In generalizability theory, or G theory, reliability is evaluated by decomposing test score variance using the general linear model (e.g., variance components analysis). This is a variant of the mathematical methods used to apportion variance in general linear model analyses such as ANOVA. In the case of G theory, the between-groups variance is considered an estimate of a true
  • 11. score variance and within-groups variance is considered an estimate of error variance. The generalizability coefficient is the ratio of estimated true variance to the sum of the estimated true variance and estimated error variance. A discussion of this flexible and powerful model is beyond the scope of this chapter, but detailed discussions can be found in Nunnally and Bernstein (1994) and Shavelson et al. (1989). Nunnally and Bernstein (1994) also discuss related issues pertinent to estimating reliabilities of variables reflecting sums such as composite scores, and the fact that reliabilities of difference scores based on correlated measures can be very low.

Evaluating a Test's Reliability

A test cannot be said to have a single or overall level of reliability. Rather, tests can be said to exhibit different kinds of reliability, the relative importance of which will vary depending on how the test is to be used. Moreover, each kind of reliability may vary across different populations. For instance, a test may be highly reliable in normally functioning adults, but be highly unreliable in young children or in individuals with neurological illness. It is important to note that while high reliability is a prerequisite for high validity, the latter does not follow automatically from the former. For example, height can be measured with great reliability, but it is not a valid index of intelligence. It is usually preferable to choose a test of slightly lesser reliability if it can be demonstrated that the test is associated with a meaningfully higher level of validity (Nunnally & Bernstein, 1994).

Some have argued that internal reliability is more important than other forms of reliability; thus, if alpha is low but test-retest reliability is high, a test should not be considered reliable (Nunnally, 1978, as cited by Cicchetti, 1989).
Note that it is possible to have low alpha values and high test-retest reliability (if a measure is made up of heterogeneous items but yields the same responses at retest), or low alpha values but high interrater reliability (if the test is heterogeneous in item content but yields highly consistent scores across trained experts; an example would be a mental status examination). Internal consistency is therefore not necessarily the primary index of reliability, but should be evaluated within the broader context of test-retest and interrater reliability (Cicchetti, 1989).

Some argue that test-retest reliability is not as important as other forms of reliability if the test will only be used once and is not likely to be administered again in future. However, depending on the nature of the test and retest sampling procedures (as discussed previously), stability coefficients may provide valuable insight into the replicability of test results, particularly as these coefficients are a gauge of "real-world" reliability rather than accuracy of measurement of true scores or hypothetical reliability across infinite randomly parallel forms (as is internal reliability). In addition, as was stated previously, clinical decision making will almost always be based on the obtained score. Therefore, it is critically important to know the degree to which scores are replicable at retesting, whether or not the test may be used again in future.

It is our belief that test users should take an informed and pragmatic, rather than dogmatic, approach to evaluating reliability of tests used to inform diagnosis or other clinical decisions. If a test has been designed to measure a single, one-dimensional construct, then high internal consistency reliability should be considered an essential property. High test-retest
reliability should also be considered an essential property unless the test is designed to measure state variables that are expected to fluctuate, or if systematic factors such as practice effects attenuate stability coefficients.

What Is an Adequate Reliability Coefficient?

The reliability coefficient can be interpreted directly in terms of the percentage of score variance attributed to different sources (i.e., unlike the correlation coefficient, which must be squared). Thus, with a reliability of .85, 85% of the variance can be attributed to the trait being measured, and 15% can be attributed to error variance (Anastasi & Urbina, 1997). When all sources of variance are known for the same group (i.e., when one knows the reliability coefficients for internal, test-retest, alternate form, and interrater reliability on the same sample), it is possible to calculate the true score variance (for an example, see Anastasi & Urbina, 1997, pp. 101-102). As noted above, although a detailed discussion of this topic is beyond the scope of this volume, the partitioning of total score variance into components is the crux of the generalizability theory of reliability, which forms the basis for reliability estimates for many well-known speed tests (e.g., Wechsler scale subtests such as Digit Symbol).

Sattler (2001) notes that reliabilities of .80 or higher are needed for tests used in individual assessment. Tests used for decision making should have reliabilities of .90 or above. Nunnally and Bernstein (1994) note that a reliability of .90 is a "bare minimum" for tests used to make important decisions about individuals (e.g., IQ tests), and .95 should be the optimal standard. When important decisions will be based on test scores (e.g., placement into special education), small score differences can make a great difference to outcome, and precision is paramount.
They note that even with a reliability of .90, the SEM is almost one-third as large as the overall SD of test scores.

Given these issues, what is a clinically acceptable level of reliability? According to Sattler (2001), tests with reliabilities below .60 are unreliable; those above .60 are marginally reliable, and those above .70 are relatively reliable. Of note, tests with reliabilities of .70 may be sufficient in the early stages of validation research to determine whether the test correlates with other validation evidence; if so, additional effort can be expended to increase reliabilities to more acceptable levels (e.g., .80) by reducing measurement error (Nunnally & Bernstein, 1994). In outcome studies using psychological tests, internal consistencies of .80 to .90 and test-retest reliabilities of .70 are considered a minimum acceptable standard (Andrews et al., 1994; Burlingame et al., 1995).
  • 12. Table 1-4 Magnitude of Reliability Coefficients

Magnitude of Coefficient
  Very high   (.90+)
  High        (.80-.89)
  Adequate    (.70-.79)
  Marginal    (.60-.69)
  Low         (<.59)

In terms of internal reliability of neuropsychological tests, Cicchetti et al. (1990) have proposed that internal consistency estimates of less than .70 are unacceptable, reliabilities between .70 and .79 are fair, reliabilities between .80 and .89 are good, and reliabilities above .90 are excellent.

For interrater reliabilities, Cicchetti and Sparrow (1981) report that clinical significance is poor for reliability coefficients below .40, fair between .40 and .59, good between .60 and .74, and excellent between .75 and 1.00. Fastenau et al. (1996), in summarizing guidelines on the interpretation of intraclass correlations and kappa coefficients for interrater reliability, consider coefficients larger than .60 as substantial and of .75 or .80 as almost perfect.

These are the general guidelines that we have used throughout the text to evaluate the reliability of neuropsychological tests (see Table 1-4) so that the text can be used as a reference when selecting tests with the highest reliability. Users should note that there is a great deal of variability with regard to the acceptability of reliability coefficients for neuropsychological tests, as perusal of this volume will indicate. In general, for tests involving multiple subtests and multiple scores (e.g., Wechsler scales, NEPSY, D-KEFS), including those derived from qualitative observations of performance (e.g., error analyses), the farther away a score gets from the composite score itself and the more difficult the score is to quantify, the lower the reliability. A quick review of the reliability data presented in this volume also indicates that verbal tests, with few exceptions, tend to have consistently higher reliability than tests measuring other cognitive domains.
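The descriptive bands of Table 1-4 can be written as a small lookup, which may be convenient when screening many coefficients at once. The function name is hypothetical, and the handling of values that fall between the printed two-decimal bands (each coefficient is assigned to the band whose lower bound it meets) is an assumption rather than something the table specifies.

```python
# Illustrative sketch: map a reliability coefficient to its Table 1-4 label.
def reliability_band(r):
    """Return the Table 1-4 descriptor for a reliability coefficient r."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability coefficient expected in [0, 1]")
    if r >= 0.90:
        return "Very high"
    if r >= 0.80:
        return "High"
    if r >= 0.70:
        return "Adequate"
    if r >= 0.60:
        return "Marginal"
    return "Low"  # Assumption: anything below .60 is treated as low


for coefficient in (0.95, 0.85, 0.70, 0.65, 0.40):
    print(f"{coefficient:.2f} -> {reliability_band(coefficient)}")
```

These labels describe magnitude only; whether a given magnitude is acceptable still depends on the kind of reliability and the intended use of the test, as discussed above.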
Lastly, as previously discussed, reliability coefficients do not provide complete information on the reproducibility of individual test scores. Thus, with regard to test-retest reliability, it is possible for a test to have high reliability (r = .80) but have retest means that are 10 points higher than baseline scores. Reliability coefficients do not provide information on whether individuals retain their relative place in the distribution from baseline to retest. Procedures such as the Bland-Altman method (Altman & Bland, 1983; Bland & Altman, 1986) are one way to determine the limits of agreement between two assessments for individuals in a group.

Limits of Reliability

Although it is possible to have a reliable test that is not valid for some purposes, the converse is not the case (see later). Further, it is also conceivable that there are some neuropsychological domains that simply cannot be measured reliably. Thus, even though there is the assumption that questionable reliability is always a function of the test, reliability may depend on the nature of the psychological process measured or on the nature of the population evaluated. For example, many of the executive functioning tests reviewed in this volume have relatively modest reliabilities, suggesting that this ability is difficult to assess reliably. Additionally, tests used in populations with high response variability, such as preschoolers, elderly individuals, or individuals with brain disorders, may invariably yield low reliability coefficients despite the best efforts of test developers.

MEASUREMENT ERROR

A good working understanding of conceptual issues and methods of quantifying measurement error is essential for competent clinical practice. We start our discussion of this topic with concepts arising from classical test theory.

True Scores

A central element of classical test theory is the concept of a true score, or the score an examinee would obtain on a measure in the absence of any measurement error (Lord & Novick, 1968). True scores can never be known. Instead, they are estimated, and are conceptually defined as the mean score an examinee would obtain across an infinite number of randomly parallel forms of a test, assuming that the examinee's scores were not systematically affected by test exposure/practice or other time-related factors such as maturation (Lord & Novick, 1968). In contrast to true scores, obtained scores are the actual scores yielded by tests. Obtained scores include any measurement error associated with a given test. That is, they are the sum of true scores and error.

In the classical model, the relation between obtained and true scores is expressed in the following formula, where error (e) is random and all variables are assumed to be normal in distribution:

$X = t + e$   [3]

Where:
X = obtained score
t = true score
e = error

When test reliability is less than perfect, as is always the case, the net effect of measurement error across examinees is to bias obtained scores outward from the population mean. That is, scores above the mean are most likely to be higher than true scores, while those below the mean are most likely to be lower than true scores (Lord & Novick, 1968). Estimated true scores correct this bias by regressing obtained scores toward the normative mean, with the amount of regression depending on test reliability and deviation of the obtained score from the mean. The formula for estimated true scores (t') is:

$t' = \bar{x} + r_{xx}(x - \bar{x})$   [4]

Where:
$\bar{x}$ = mean test score
$r_{xx}$ = test reliability (internal consistency reliability in classical test theory)
x = obtained score

If working with z scores, the formula is simpler:

$t' = r_{xx} \times z$   [5]

Formula 4 shows that an examinee's estimated true score is the sum of the mean score of the group to which he or she belongs (i.e., the normative sample) and the deviation of his or her obtained score from the normative mean weighted by test reliability (as derived from the same normative sample). Further, as test reliability approaches unity (i.e., r = 1.0), estimated true scores approach obtained scores (i.e., there is little measurement error, so estimated true scores and obtained scores are nearly equivalent). Conversely, as test reliability approaches zero (i.e., when a test is extremely unreliable and subject to excessive measurement error), estimated true scores approach the mean test score. That is, when a test is highly reliable, greater weight is given to obtained scores than to the normative mean score, but when a test is very unreliable, greater weight is given to the normative mean score than to obtained scores. Practically speaking, estimated true scores will always be closer to the mean than obtained scores are (except, of course, where the obtained score is at the mean).

The Use of True Scores in Clinical Practice

Although the true score model is abstract, it has practical utility and important implications for test score interpretation. For example, what may not be immediately obvious from formulas 4 and 5 is readily apparent in Table 1-5: estimated true scores translate test reliability (or lack thereof) into the same metric as actual test scores.

Table 1-5 Estimated True Score Values for Three Observed Scores at Three Levels of Reliability

                          Observed Scores (M = 100, SD = 15)
          Reliability       110       120       130
Test 1        .95           110       119       129
Test 2        .80           108       116       124
Test 3        .65           107       113       120

As can be seen in Table 1-5, the degree of regression to the mean of true scores is inversely related to test reliability and directly related to degree of deviation from the reference mean. This means that the more reliable a test is, the closer are obtained scores to true scores and that the further away the obtained score is from the sample mean, the greater the discrepancy between true and obtained scores. For a highly reliable measure such as Test 1 (r = .95), true score regression is minimal, even when an obtained score lies a considerable distance from the sample mean; in this example, a standard score of 130, or two SDs above the mean, is associated with an estimated true score of 129. In contrast, for a test with low reliability such as Test 3 (r = .65), true score regression is quite substantial. For this test, an obtained score of 130 is associated with an estimated true score of 120; in this case, fully one-third of the observed deviation is "lost" to regression when the estimated true score is calculated.

Such information may have important implications with respect to interpretation of test results. For example, as shown in Table 1-5, as a result of differences in reliability, obtained scores of 120 on Test 1 and 130 on Test 3 are associated with essentially equivalent estimated true scores (i.e., 119 and 120, respectively). If only obtained scores are considered, one might interpret scores from Test 1 and Test 3 as significantly different, even though these "differences" actually disappear when measurement precision is taken into account. It should also be noted that such differences may not be limited to comparisons of scores across different tests within the same individual, but may also apply to comparisons between scores from the same test across different individuals when the individuals come from different groups and the test in question has variable reliability across those groups.

Regression to the mean may also manifest as pronounced asymmetry of confidence intervals centered on true scores, relative to obtained scores, as discussed in more detail later. Although calculation of true scores is encouraged as a means of gauging the limitations of reliability, it is important to consider that any significant difference between characteristics of an examinee and the sample from which a mean sample score and reliability estimate were derived may invalidate the process. For example, in some cases it makes little sense to estimate true scores for severely brain-injured individuals on tests of cognition using test parameters from healthy normative samples, as mean scores within the brain-injured population are likely to be substantially different from those seen in healthy normative samples; reliabilities may differ substantially as well. Instead, one may be justified in deriving estimated true scores using data from a comparable clinical sample if this is available. Overall, these issues underline the complexities inherent in comparing scores from different tests in different populations.

The Standard Error of Measurement

Examiners may wish to quantify the margin of error associated with using obtained scores as estimates of true scores. When the sample SD and the reliability of obtained scores are known, an estimate of the SD of obtained scores about true scores may be calculated. This value is known as the standard error of measurement, or SEM (Lord & Novick, 1968). More simply, the SEM provides an estimate of the amount of error in a person's observed score. It is a function of the reliability
  • 14. of the test, and of the variability of scores within the sample. The SEM is inversely related to the reliability of the test. Thus, the greater the reliability of the test is, the smaller the SEM is, and the more confidence the examiner can have in the precision of the score.

The SEM is defined by the following formula:

$SEM = SD\sqrt{1 - r_{xx}}$   [6]

Where:
SD = the standard deviation of the test, as derived from an appropriate normative sample
$r_{xx}$ = the reliability coefficient of the test (usually internal reliability)

Confidence Intervals

While the SEM can be considered on its own as an index of test precision, it is not necessarily intuitively interpretable, and there is often a tendency to focus excessively on test scores as point estimates at the expense of consideration of associated estimation error ranges. Such a tendency to disregard imprecision is particularly inappropriate when interpreting scores from tests of lower reliability. Clinically, it may therefore be very important to report, in a concrete and easily understandable manner, the degree of precision associated with specific test scores. One method of doing this is to use confidence intervals.

The SEM is used to form a confidence interval (or range of scores), around estimated true scores, within which obtained scores are most likely to fall. The distribution of obtained scores about the true score (the error distribution) is assumed to be normal, with a mean of zero and an SD equal to the SEM; therefore, the bounds of confidence intervals can be set to include any desired range of probabilities by multiplying by the appropriate z value. Thus, if an individual were to take a large number of randomly parallel versions of a test, the resulting obtained scores would fall within an interval of ±1 SEM of the estimated true score 68% of the time, and within ±1.96 SEM 95% of the time (see Table 1-1).
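The calculations described so far (regressing an obtained score toward the mean, computing the SEM, and forming an interval around the estimated true score) can be sketched in a few lines. The function names are hypothetical, and the standard-score scale used in the example (M = 100, SD = 15, obtained score of 130) is chosen only for illustration.

```python
# Illustrative sketch of estimated true scores, the SEM, and SEM-based
# confidence intervals under the classical test theory model.
from math import sqrt


def estimated_true_score(obtained, mean, rxx):
    """Regress an obtained score toward the normative mean (formula 4)."""
    return mean + rxx * (obtained - mean)


def standard_error_of_measurement(sd, rxx):
    """SEM = SD * sqrt(1 - rxx) (formula 6)."""
    return sd * sqrt(1.0 - rxx)


def sem_confidence_interval(obtained, mean, sd, rxx, z=1.96):
    """SEM-based interval, centered on the estimated true score."""
    center = estimated_true_score(obtained, mean, rxx)
    half_width = z * standard_error_of_measurement(sd, rxx)
    return center - half_width, center + half_width


# Hypothetical scale: M = 100, SD = 15, obtained score = 130
for rxx in (0.95, 0.80, 0.65):
    low, high = sem_confidence_interval(130, 100, 15, rxx)
    print(
        f"rxx={rxx:.2f}  t'={estimated_true_score(130, 100, rxx):.1f}  "
        f"SEM={standard_error_of_measurement(15, rxx):.2f}  "
        f"95% CI=({low:.1f}, {high:.1f})"
    )
```

With a reliability of .80, the obtained score of 130 yields an estimated true score of 124 and an interval of roughly 111 to 137, which is not symmetric about 130. The same SEM function also shows why a reliability of .90 still leaves an SEM of about 0.32 SD, that is, almost one-third of the SD.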
Obviously, confidence intervals for unreliable tests (i.e., with a large SEM) will be larger than those for highly reliable tests. For example, we may again use data from Table 1-5. For a highly reliable test such as Test 1, a 95% confidence interval for an obtained score of 110 ranges from 103 to 116. In contrast, the confidence interval for Test 3, a less reliable test, is larger, ranging from 89 to 124.

It is important to bear in mind that confidence intervals for obtained scores that are based on the SEM are centered on estimated true scores. Such confidence intervals will be symmetric around obtained scores only when obtained scores are at the test mean or when reliability is perfect. Confidence intervals will be asymmetric about obtained scores to the same degree that true scores diverge from obtained scores. Therefore, when a test is highly reliable, the degree of asymmetry will often be trivial, particularly for obtained scores within one SD of the mean. For tests of lesser reliability, the asymmetry may be marked. For example, in Table 1-5, consider the obtained score of 130 on Test 2. The estimated true score in this case is 124 (see equations 4 and 5). Using equation 6 and a z-multiplier of 1.96, we find that a 95% confidence interval for the obtained scores spans ±13 points, or from 111 to 137. This confidence interval is substantially asymmetric about the obtained score.

It is also important to note that SEM-based confidence intervals should not be used for estimating the likelihood of obtaining a given score at retesting with the same measure, as effects of prior exposure are not accounted for. In addition, Nunnally and Bernstein (1994) point out that use of SEM-based confidence intervals assumes that error distributions are normally distributed and homoscedastic (i.e., equal in spread) across the range of scores obtainable for a given test. However, this assumption may often be violated.
A number of alternate error models do not require these assumptions and may thus be more appropriate in some circumstances (see Nunnally and Bernstein, 1994, for a detailed discussion).

Lastly, as with the derivation of estimated true scores, when an examinee is known to belong to a group that markedly differs from the normative sample, it may not be appropriate to derive SEMs and associated confidence intervals using normative sample parameters (i.e., SD and $r_{xx}$), as these would likely differ significantly from parameters derived from an applicable clinical sample.

The Standard Error of Estimation

In addition to estimating confidence intervals for obtained scores, one may also be interested in estimating confidence intervals for estimated true scores (i.e., the likely range of true scores about the estimated true score). For this purpose, one may construct confidence intervals using the standard error of estimation (SEE; Lord & Novick, 1968). The formula for this is:

$SE_E = SD\sqrt{r_{xx}(1 - r_{xx})}$   [7]

Where:
SD = the standard deviation of the variable being estimated
$r_{xx}$ = the test reliability coefficient

The SEE, like the SEM, is an indication of test precision. As with the SEM, confidence intervals are formed around estimated true scores by multiplying the SEE by a desired z value. That is, one would expect that over a large number of randomly parallel versions of a test, an individual's true score would fall within an interval of ±1 SEE of the estimated true score 68% of the time, and fall within ±1.96 SEE 95% of the time. As with confidence intervals based on the SEM, those based on the SEE will usually not be symmetric around obtained scores. All of the other caveats detailed previously regarding SEM-based confidence intervals also apply.

The choice of constructing confidence intervals based on the SEM versus the SEE will depend on whether one is more
  • 15. interesled in true scores or obtained s(Ores. That is, while the SEM is ,I giluge of test accuracy in that it is used to determine lhe expeçted range of obtllillcd scores abolll true scores over parallel assessments (the range of error in 111C115r1rCmCI1/ of lhe trile score), the SEE is a gauge of estimation accuracy in that it is used to determine lhe likely range wilhin which trlle $Cores fJII (the range of error of estimati"n of the true $Core). Re-gardless, both SEM-based and SEE-based confidence intervals are symmetric wilh respecl O estimated true scores rather than lhe obtained scores, and lhe boundaries of both will be similar for any giwn levei of (Onfidence interval when a test is highly reli,lble. The Standard Error of Predietion When the standard devialion of obtained scores for an alier-nate form is known, one may cakulale lhe likcly range of ub-tained scores expected on retesting with an alternate formo For Ihis purpose, the stmulrml errar of prcdictioll (SEr; Lord & Novick, 1961'l) may be used to comlruct confidence intervals. The formula for this is: [SI SE!, "'SVy~l-r~ Where: SDy = the stdndJfd devi,llÍon of lhe parallel form administered at retest rxx = the reliability of the form used at initialtesting In this case, confidence inlervals are formed around cstimdled Irue scores (derivcd from initial abtained sClnes) by multiply-ing the SEr by a desired zvalue. That is, one would expect that when retested OVCf a large number of randomly pJrallcl ver-sions of a lest, an individual's obl<lined SClne would fali within <In inlerval af:tl SEI' of the estimated true score 68% oI' the time, and fali within 1.96 SEE 95% of the time. As wilh confi-dence intervals based on lhe SEM, those b,lsed un the SEI' will generally not be symmetric ,Iround obtained SClnes. 111of the other caveats detailed previously regarding the SEM-I}<Lsed confidence intervals also apply. 
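To make the difference between the three error terms concrete, the following Python sketch compares them for one hypothetical case. It assumes the Lord and Novick forms SD × sqrt(r_xx × (1 − r_xx)) for the standard error of estimation and SD_y × sqrt(1 − r_xx²) for the standard error of prediction, along with SEM = SD × sqrt(1 − r_xx). The mean of 100, SD of 15 on both forms, reliability of .80, and obtained score of 130 are illustrative assumptions, and the function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, r_xx):
    """SEM = SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1.0 - r_xx)


def standard_error_of_estimation(sd, r_xx):
    """SEE = SD * sqrt(r_xx * (1 - r_xx)); error in estimating the true score."""
    return sd * math.sqrt(r_xx * (1.0 - r_xx))


def standard_error_of_prediction(sd_y, r_xx):
    """SEP = SD_y * sqrt(1 - r_xx**2); error in predicting an alternate-form score."""
    return sd_y * math.sqrt(1.0 - r_xx ** 2)


def interval(center, standard_error, z=1.96):
    """Symmetric interval around the estimated true score."""
    return center - z * standard_error, center + z * standard_error


# Assumed example: mean 100, SD 15 on both forms, r_xx = .80, obtained score 130.
mean, sd, r_xx, obtained = 100.0, 15.0, 0.80, 130.0
true_score_estimate = mean + r_xx * (obtained - mean)

for name, se in [
    ("SEE", standard_error_of_estimation(sd, r_xx)),
    ("SEM", standard_error_of_measurement(sd, r_xx)),
    ("SEP", standard_error_of_prediction(sd, r_xx)),
]:
    low, high = interval(true_score_estimate, se)
    print(f"{name} = {se:.2f}; 95% interval {low:.1f} to {high:.1f}")
```

For any reliability strictly between 0 and 1 and equal standard deviations, these formulas give SEE < SEM < SEP, so the interval for the true score is the narrowest and the interval for a retest score on an alternate form is the widest. All three intervals are centered on the estimated true score rather than the obtained score.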
In addition, while it may be tempting to use SEP-based confidence intervals for evaluating significance of change at retesting with the same measure, this practice violates the assumptions that a parallel form is used at retest and, particularly, that no prior exposure effects apply.

SEMs and True Scores: Practical Issues

Nunnally and Bernstein (1994) note that most test manuals do "an exceptionally poor job of reporting estimated true scores and confidence intervals for expected obtained scores on alternative forms. For example, intervals are often erroneously centered about obtained scores rather than estimated true scores. Often the topic is not even discussed" (p. 260). Sattler (2001) also notes that test manuals often base confidence intervals on the overall SEM for the entire standardization sample, rather than on SEMs for each age band. Using the average SEM across age is not always appropriate, given that some age groups are inherently more variable than others (e.g., preschoolers versus adults). In general, confidence intervals based on age-specific SEMs are preferable to those based on the overall SEM (particularly at the extremes of the age distribution, where there is the most variability) and can often be constructed using age-based SEMs found in most manuals.

It is important to acknowledge that while estimated true scores and associated confidence intervals have merit, there are practical reasons to focus on obtained scores instead. For example, essentially all validity studies and actuarial prediction methods for most tests are based on obtained scores. Therefore, obtained scores must usually be employed for diagnostic and other purposes to maintain consistency with prior research and test usage.
For more discussion regarding the calculation and uses of the SEM, SEE, SEP, and alternative error models, see Dudek (1979), Lord and Novick (1968), and Nunnally and Bernstein (1994).

VALIDITY

Models of validity are not abstract conceptual frameworks that are only minimally related to neuropsychological practice. The Standards for Educational and Psychological Testing (AERA et al., 1999) state that validation is the joint responsibility of the test developer and the test user. Thus, a working knowledge of validity models and the validity characteristics of specific tests is a central requirement for responsible and competent test use. From a practical perspective, a working knowledge of validity allows users to determine which tests are appropriate for use and which fall below standards for clinical practice or research utility. Thus, neuropsychologists who use tests to detect and diagnose neurocognitive difficulties should be thoroughly familiar with commonly used validity models and how these can be used to evaluate neuropsychological tools. Assuming that a test is valid because it was purchased from a reputable test publisher, appears to have a large normative sample, or came with a large user's manual can be a serious error, as some well-known and commonly used neuropsychological tests are lacking with regard to crucial aspects of validity.

Definition of Validity

Cronbach and Meehl (1955) were some of the first theorists to discuss the concept of construct validity. Since then, the basic definition of validity evolved as testing needs changed over the years. Although construct validity was first introduced as a separate type of validity (e.g., Anastasi & Urbina, 1997), it has moved, in some models, to encompass all types of validity (e.g., Messick, 1993).
In other models, the term "construct validity" has been deemed redundant and has simply been replaced by "validity," since all types of validity ultimately inform as to the construct measured by the test. Accordingly, the term "construct validity" has not been used in the Standards for Educational and Psychological Testing since 1974 (AERA et al., 1999). However, whether it is deemed "construct validity" or simply "validity," the concept is central to evaluating the utility of a test in the clinical or research arena.

Test validity may be defined at the most basic level as the degree to which a test actually measures what it is intended to measure, or in the words of Nunnally and Bernstein (1994), "how well it measures what it purports to measure in the context in which it is to be applied" (p. 112). As with reliability, an important point to be made here is that a test cannot be said to have one single level of validity. Rather, it can be said to exhibit various types and levels of validity across a spectrum of usage and populations. That is, validity is not a property of a test, but rather, validity is a property of the meaning attached to a test score; validity can only arise and be defined in the specific context of test usage. Therefore, while it is certainly necessary to understand the validity of tests in particular contexts, ultimate decisions regarding the validity of test score interpretation must take into account any unique factors pertaining to validity at the level of individual assessment, such as deviations from standard administration, unusual testing environments, examinee cooperation, and the like.

In the past, assessment of validity was generally test-centric. That is, test validity was largely indexed by comparison with other tests, especially "standards" in the field. Since Cronbach (1971), there has been a move away from test-based or "measure-centered validity" (Zimiles, 1996) toward the interpretation and external utility of tests.
Messick (1989, 1993) expanded the definition of validity to encompass an overall judgment of the extent to which empirical evidence and theoretical rationales support the adequacy and effectiveness of interpretations and actions resulting from test scores. Subsequently, Messick (1995) proposed a comprehensive model of construct validity wherein six different, distinguishable types of evidence contribute to construct validity. These are (1) content related, (2) substantive, (3) structural, (4) generalizability, (5) external, and (6) consequential evidence sources (see Table 1-6), and they form the "evidential basis for score interpretation" (Messick, 1995, p. 743). Likewise, the Standards for Educational and Psychological Testing (AERA et al., 1999) follows a model very much like Messick's, where different kinds of evidence are used to bolster test validity based on each of the following sources: (1) evidence based on test content, (2) response processes, (3) internal structure, (4) relations to other variables, and (5) consequences of testing. The most controversial aspect of these models is the requirement for consequential evidence to support validity. Some argue that judging validity according to whether use of a test results in positive or negative social consequences is too far-reaching and may lead to abuses of scientific inquiry, as when a test result does not agree with the overriding social climate of the time (Lees-Haley, 1996). Social and ethical consequences, although crucial, may therefore need to be treated separately from validity (Anastasi & Urbina, 1997).

Table 1-6 Messick's Model of Construct Validity

Type of Evidence    Definition
Content related     Relevance, representativeness, and technical quality of test content
Substantive         Theoretical rationales for the test and test responses
Structural          Fidelity of scoring structure to the structure of the construct measured by the test
Generalizability    Scores and interpretations generalize across groups, settings, and tasks
External            Convergent and divergent validity, criterion relevance, and applied utility
Consequential       Actual and potential consequences of test use, relating to sources of invalidity related to bias, fairness, and distributive justice (a)

(a) See Lees-Haley (1996) for limitations of this component.

Validity Models

Since Cronbach and Meehl, various models of validity have been proposed. The most frequently encountered is the tripartite model whereby validity is divided into three components: content validity, criterion-related validity, and construct validity (see Anastasi & Urbina, 1997; Mitrushina et al., 2005; Nunnally & Bernstein, 1994; Sattler, 2001). Other validity subtypes, including convergent, divergent, predictive, treatment, clinical, and face validity, are subsumed within these three domains. For example, convergent and divergent validity are most often treated as subsets of construct validity (Sattler, 2001) and concurrent and predictive validity as subsets of criterion validity (e.g., Mitrushina et al., 2005). Concurrent and predictive validity only differ in terms of a temporal gradient; concurrent validity is relevant for tests used to identify existing diagnoses or conditions, whereas predictive validity applies when determining whether a test predicts future outcomes (Anastasi & Urbina, 1997). Although face validity appears to have fallen out of favor as a type of validity, the extent to which examinees believe a test measures what it appears to measure can affect motivation, self-disclosure, and effort. Consequently, face validity can be seen as a moderator variable affecting concurrent and predictive validity that can be operationalized and measured (Bornstein, 1996; Nevo, 1985). Again, all these labels for distinct categories of validity are ways of providing different types of evidence for validity and are not, in and of themselves, different types of validity, as older sources might claim (AERA et al., 1999; Yun & Ulrich, 2002). Lastly, validity is a matter of degree rather than an all-or-none property; validity is therefore never actually "finalized," since tests must be continually reevaluated as populations and testing contexts change over time (Nunnally & Bernstein, 1994).

How to Evaluate the Validity of a Test

Pragmatically speaking, all the theoretical models in the world will be of no utility to the practicing clinician unless they can be translated into specific, step-by-step procedures for evaluating a test's validity. Table 1-7 presents a comprehensive (but not exhaustive) list of specific features users can look for when evaluating a test and reviewing test manuals. Each is organized according to the type of validity evidence provided. For example, construct validity can be assessed via correlations with other tests, factor analysis, internal consistency (e.g., subtest intercorrelations), convergent and discriminant validation (e.g., multitrait-multimethod matrix), experimental interventions (e.g., sensitivity to treatment), structural equation modeling, and response processes (e.g., task decomposition, protocol analysis; Anastasi & Urbina, 1997). Most importantly, users should also remember that even if all other conditions are met, a test cannot be considered valid if it is not reliable (see previous discussion).

It is important to note that not all tests will have sufficient evidence to satisfy all aspects of validity, but test users should have a sufficiently broad knowledge of neuropsychological tools to be able to select one test over another, based on the quality of the validation evidence available. In essence, we have used this model to critically evaluate all the tests reviewed in this volume.

Note that there is a certain degree of overlap between categories in Table 1-7. For example, correlations between a specific test and another test measuring IQ can simultaneously provide criterion-related evidence and construct-related evidence of validity. Regardless of the terminology, it is important to understand how specific techniques such as factor analysis serve to inform the validity of test interpretation across the range of settings in which neuropsychologists work.

What Is an Adequate Validity Coefficient?

Some investigators have proposed criteria for evaluating evidence related to criterion validity in outcome assessments.
For instance, Andrews et al. (1994) and Burlingame et al. (1995) recommend that a minimum level of acceptability for correlations involving criterion validity is .50. However, Nunnally and Bernstein (1994) note that validity coefficients rarely exceed .30 or .40 in most circumstances involving psychological tests, given the complexities involved in measuring and predicting human behavior. There are no hard and fast rules when evaluating evidence supportive of validity, and interpretation should consider how the test results will be used. Thus, tests with even quite modest predictive validities (r = .50) may be of considerable utility, depending on the circumstances in which they will be used (Anastasi & Urbina, 1997; Nunnally & Bernstein, 1994), particularly if they serve to significantly increase the test's "hit rate" over chance. It is also important to note that in some circumstances, criterion validity may be measured in a categorical rather than continuous fashion, such as when test scores are used to inform binary diagnoses (e.g., demented versus not demented). In these cases, one would likely be more interested in indices such as predictive power than other measures of criterion validity (see below for a discussion of classification accuracy statistics).

Table 1-7 Sources of Evidence and Techniques for Critically Evaluating the Validity of Neuropsychological Tests

Type of Evidence: Content-related
Required Evidence:
  Refers to themes, wording, format, tasks, or questions on a test, and administration and scoring
  Description of theoretical model on which test is based
  Review of literature with supporting evidence
  Definition of domain of interest (e.g., literature review, theoretical reasoning)
  Operationalization of definition through thorough and systematic review of test domain from which items are to be sampled, with listing of sources (e.g., word frequency sources for vocabulary tests)
  Collection of sample of items large enough to be representative of domain and with sufficient range of difficulty for target population
  Selection of panel of judges for expert review, based on specific selection criteria (e.g., academic and practical backgrounds or expertise within specific subdomains)
  Evaluation of items by expert panel based on specific criteria concerning accuracy and relevance
  Resolution of judgment conflicts within panel for items lacking cross-panel agreement (e.g., empirical means such as Index of Item Congruence; Hambleton, 1980)

Type of Evidence: Construct-related
Required Evidence:
  Formal definition of construct
  Formulation of hypotheses to measure construct
  Gathering empirical evidence of construct validation
  Evaluating psychometric properties of instrument (i.e., reliability)
  Demonstration of test sensitivity to developmental changes, correlation with other tests, group differences studies, factor analysis, internal consistency (e.g., correlations between subtests, or to composites within the same test), convergent and divergent validation (e.g., multitrait-multimethod matrix), sensitivity to experimental manipulation (e.g., treatment sensitivity), structural equation modeling, and analysis of process variables underlying test performance

Type of Evidence: Criterion-related
Required Evidence:
  Identification of appropriate criterion
  Identification of relevant sample group reflecting the entire population of interest; if only a subgroup is examined, then generalization must remain within subgroup definition (e.g., keeping in mind potential sources of error such as restriction of range)
  Analysis of test-criterion relationships through empirical means such as contrasting groups, correlations with previously available tests, classification accuracy statistics (e.g., positive predictive power), outcome studies, and meta-analysis

Type of Evidence: Response processes
Required Evidence:
  Determining whether performance on the test actually relates to the domain being measured
  Analysis of individual responses to determine the processes underlying performance (e.g., questioning test takers about strategy, analysis of test performance with regard to other variables, determining whether the test measures the same construct in different populations, such as age)

Source: Adapted from Anastasi & Urbina, 1997; American Educational Research Association et al., 1999; Messick, 1995; and Yun and Ulrich, 2002.

USE OF TESTS IN THE CONTEXT OF SCREENING AND DIAGNOSIS: CLASSIFICATION ACCURACY STATISTICS

In some cases, clinicians use tests to measure how much of an attribute (e.g., intelligence) an examinee has, while in other cases, tests are used to help determine whether or not an examinee has a specific attribute, condition, or illness that may be either present or absent (e.g., Alzheimer's disease). In the latter case, a special distinction in test use may be made. Screening tests are those which are broadly or routinely used to detect a specific attribute, often referred to as a condition of interest, or COI, among persons who are not "symptomatic" but who may nonetheless have the COI (Streiner, 2003c).
Diagnostic tests are used to assist in ruling in or out a specific condition in persons who present with "symptoms" that suggest the diagnosis in question. Another related use of tests is for purposes of prediction of outcome. As with screening and diagnostic tests, the outcome of interest may be defined in binary terms: it will either occur or not occur (e.g., return to the same type and level of employment). Thus, in all three cases, clinicians will be interested in the relation of the measure's distribution of scores to an attribute or outcome that is defined in binary terms.

Typically, data concerning screening or diagnostic accuracy are obtained by administering a test to a sample of persons who are also classified, with respect to the COI, by a so-called gold standard. Those who have the condition according to the gold standard are labeled COI+, while those who do not have the condition are labeled COI-. In medicine, the gold standard is often a highly accurate diagnostic test that is more expensive and/or has a higher level of associated risk of morbidity than some new diagnostic method that is being evaluated for use as a screening measure or as a possible replacement for the existing gold standard. In neuropsychology, the situation is often more complex, as the COI may be a psychological construct (e.g., malingering) for which consensus with respect to fundamental definitions is lacking or diagnostic gold standards may not exist. These issues may be less problematic when tests are used to predict outcome (e.g., return to work), though other problems that may afflict outcome data such as intervening variables and sample attrition may complicate interpretation of predictive accuracy.

The simplest way to relate test results to binary diagnoses or outcomes is to utilize a cutoff score. This is a single point along the continuum of possible scores for a given test.
Scores at or above the cutoff classify examinees as belonging to one of two groups; scores below the cutoff classify examinees as belonging to the other group. Those who have the COI according to the test are labeled as Test Positive (Test+), while those who do not have the COI are labeled Test Negative (Test-).

Table 1-8 shows the relation between examinee classifications based on test results versus classifications based on a gold standard measure. By convention, test classification is denoted by row membership and gold standard classification is denoted by column membership. Cell values represent the total number of persons from the sample falling into each of four possible outcomes with respect to agreement between a test and respective gold standard. By convention, agreements between gold standard and test classifications are referred to as True Positive and True Negative cases, while disagreements are referred to as False Positive and False Negative cases, with positive and negative referring to the presence or absence of a COI as per classification by the gold standard. When considering outcome data, observed outcome is substituted for the gold standard.

It is important to keep in mind while reading the following section that while gold standard measures are often implicitly treated as 100% accurate, this may not always be the case. Any limitations in accuracy or applicability of a gold standard or outcome measure need to be accounted for when interpreting classification accuracy statistics.

Table 1-8 Classification/Prediction Accuracy of a Test in Relation to a "Gold Standard" or Actual Outcome

                          Gold Standard
Test Result     COI+                  COI-                  Row Total
Test+           A (True Positive)     B (False Positive)    A + B
Test-           C (False Negative)    D (True Negative)     C + D
Column Total    A + C                 B + D                 N = A + B + C + D
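As a rough illustration of how the cells of Table 1-8 are filled, the Python sketch below applies a cutoff score to a small set of test scores and cross-tabulates the result against a gold standard classification. The scores, the gold standard labels, and the cutoff of 13 are invented for illustration. The sketch also assumes that scores at or above the cutoff are the Test+ group; on many tests the direction is reversed. The names `ClassificationTable` and `tabulate` are hypothetical.

```python
from collections import namedtuple

# Cell labels follow Table 1-8: A = True Positive, B = False Positive,
# C = False Negative, D = True Negative.
ClassificationTable = namedtuple("ClassificationTable", ["A", "B", "C", "D"])


def tabulate(scores, has_coi, cutoff):
    """Cross-tabulate test classification (rows) against the gold standard (columns)."""
    a = b = c = d = 0
    for score, coi_present in zip(scores, has_coi):
        test_positive = score >= cutoff  # assumed direction of the cutoff
        if test_positive and coi_present:
            a += 1
        elif test_positive:
            b += 1
        elif coi_present:
            c += 1
        else:
            d += 1
    return ClassificationTable(a, b, c, d)


# Invented data: eight examinees, gold standard status, and a cutoff of 13.
scores = [12, 15, 9, 20, 7, 14, 18, 5]
has_coi = [True, True, False, True, False, False, True, True]
table = tabulate(scores, has_coi, cutoff=13)

print("Test+ row:", table.A, table.B, "row total", table.A + table.B)
print("Test- row:", table.C, table.D, "row total", table.C + table.D)
print("Column totals:", table.A + table.C, table.B + table.D)
print("N =", sum(table))
```

With these invented data the table holds 3 True Positives, 1 False Positive, 2 False Negatives, and 2 True Negatives. Moving the cutoff changes the row totals but never the column totals, since the gold standard classification does not depend on the test.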