1) A cyber crime is a crime that involves a computer and the Internet. A forensics investigation involves gathering and preserving evidence in a way that is suitable for presentation in a court of law. Use the library to research any recent (within the past 12 months) real-world cyber crime. Discuss each of the following:
· What was the cyber crime? Who or what did the cyber crime affect?
· How did the cyber crime occur?
· In your opinion, how could the cyber crime have been avoided?
· How would you conduct the forensics investigation for this cyber crime?
Use and list at least 2 sources to support your response to the question. You may use the textbook as a resource. Be sure to use APA formatting for all references.
Responses to Other Students: Respond to at least 2 of your fellow classmates with at least a 100-word reply about their Primary Task Response regarding items you found to be compelling and enlightening. To help you with your discussion, please consider the following questions:
· What did you learn from your classmate's posting?
· What additional questions do you have after reading the posting?
· What clarification do you need regarding the posting?
· What differences or similarities do you see between your posting and other classmates' postings?
2) Antiforensic techniques make proper forensic investigations more difficult. Antiforensic techniques are deliberate and can reduce the quantity and quality of digital evidence. Antiforensic techniques can also be used to increase security. Use the library to research antiforensic techniques, and discuss the following:
· What are at least 3 examples of antiforensic techniques, and how are they used?
· Discuss how antiforensic techniques affect computer forensics, file recovery, and security.
Use and list at least 2 sources to support your response to the question. You may use the textbook as a resource. Be sure to use APA formatting for all references.
3) Review and reflect on the knowledge you have gained from this course. Based on your review and reflection, write at least 3 paragraphs on the following:
· What was the most valuable concept that you learned in this class that you will most likely use in your future career?
· What concept in this course provided the most insight into the technical aspects of computer forensics? Explain.
· The main post should include at least 1 reference to research sources, and all sources should be cited using APA format.
A Primer on the Validity of Assessment Instruments
Gail M. Sullivan, MD, MPH

1. What is reliability?1
Reliability refers to whether an assessment instrument gives the same results each time it is used in the same setting with the same type of subjects. Reliability essentially means consistent or dependable results. Reliability is a part of the assessment of validity.
2. What is validity?1
Validity in research refers to how accurately a study answers the study question or the strength of the study conclusions. For outcome measures such as surveys or tests, validity refers to the accuracy of measurement. Here validity refers to how well the assessment tool actually measures the underlying outcome of interest. Validity is not a property of the tool itself, but rather of the interpretation or specific purpose of the assessment tool with particular settings and learners.

Assessment instruments must be both reliable and valid for study results to be credible. Thus, reliability and validity must be examined and reported, or references cited, for each assessment instrument used to measure study outcomes. Examples of assessments include resident feedback surveys, course evaluations, written tests, clinical simulation observer ratings, needs assessment surveys, and teacher evaluations. Using an instrument with high reliability is not sufficient; other measures of validity are needed to establish the credibility of your study.
3. How is reliability measured?2–4
Reliability can be estimated in several ways; the method will depend upon the type of assessment instrument. Sometimes reliability is referred to as the internal validity or internal structure of the assessment tool.

For internal consistency, 2 to 3 questions or items are created that measure the same concept, and the difference among the answers is calculated. That is, the correlation among the answers is measured. Cronbach alpha is a test of internal consistency and is frequently used to calculate the correlation values among the answers on your assessment tool.5 Cronbach alpha calculates correlation among all the variables, in every combination, and generates one number: the closer it is to 1, the higher the reliability estimate.
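To make the calculation concrete, the following is a minimal sketch of the standard Cronbach alpha formula applied to a small, invented subjects-by-items score matrix (the data and variable names are illustrative only, not from any study cited here):

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach alpha for a subjects-by-items matrix of scores."""
        k = items.shape[1]                         # number of items
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Invented responses: 6 subjects answering 3 items written to tap one concept.
    scores = np.array([[4, 5, 4],
                       [2, 3, 2],
                       [5, 5, 4],
                       [3, 3, 3],
                       [1, 2, 2],
                       [4, 4, 5]], dtype=float)
    print(round(cronbach_alpha(scores), 2))  # values near 1 indicate high internal consistency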
For test/retest, the test should give the same results each time, assuming there are no interval changes in what you are measuring; the two sets of results are often compared as a correlation, with a Pearson r.

Test/retest is a more conservative estimate of reliability than Cronbach alpha, but it takes at least 2 administrations of the tool, whereas Cronbach alpha can be calculated after a single administration. To perform a test/retest, you must be able to minimize or eliminate any change (ie, learning) in the condition you are measuring between the 2 measurement times. Administer the assessment instrument at 2 separate times for each subject and calculate the correlation between the 2 different measurements.
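A minimal sketch of the test/retest calculation, with invented paired scores for 6 subjects measured twice with no intervening change:

    import numpy as np

    # Invented scores from two administrations of the same instrument, 2 weeks apart.
    time_1 = np.array([78, 62, 90, 55, 70, 84], dtype=float)
    time_2 = np.array([80, 60, 88, 58, 67, 86], dtype=float)

    # The Pearson r between the two administrations is the test/retest estimate.
    r = np.corrcoef(time_1, time_2)[0, 1]
    print(round(r, 2))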
Interrater reliability is used to study the effect of different raters or observers using the same tool and is generally estimated by percent agreement, kappa (for binary outcomes), or Kendall tau.
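For a binary rating, percent agreement and Cohen kappa can be computed directly from their definitions; a minimal sketch with invented pass/fail ratings from two observers:

    import numpy as np

    def percent_agreement(a, b):
        return float((a == b).mean())

    def cohen_kappa(a, b):
        """Cohen kappa for two raters making a binary (0/1) judgment."""
        p_obs = (a == b).mean()  # observed agreement
        p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
        return (p_obs - p_chance) / (1 - p_chance)  # agreement beyond chance

    # Invented pass/fail ratings of 10 performances by two observers.
    rater_1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
    rater_2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])
    print(percent_agreement(rater_1, rater_2))      # 0.8
    print(round(cohen_kappa(rater_1, rater_2), 2))  # lower, after removing chance agreement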
Another method uses analysis of variance (ANOVA) to generate a generalizability coefficient, to quantify how much measurement error can be attributed to each potential factor, such as different test items, subjects, raters, dates of administration, and so forth. This model looks at the overall reliability of the results.6
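As a sketch of the idea in the simplest case, a fully crossed persons-by-raters design: the two-way ANOVA mean squares yield variance components that combine into a generalizability coefficient. The ratings below are invented, and real generalizability studies typically involve more facets and dedicated software:

    import numpy as np

    def g_coefficient(x):
        """Generalizability coefficient for a fully crossed persons x raters design."""
        n_p, n_r = x.shape
        grand = x.mean()
        ss_p = n_r * ((x.mean(axis=1) - grand) ** 2).sum()  # persons sum of squares
        ss_r = n_p * ((x.mean(axis=0) - grand) ** 2).sum()  # raters sum of squares
        ss_pr = ((x - grand) ** 2).sum() - ss_p - ss_r      # residual (interaction/error)
        ms_p = ss_p / (n_p - 1)
        ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
        var_p = (ms_p - ms_pr) / n_r       # variance component for persons (signal)
        return var_p / (var_p + ms_pr / n_r)  # relative G coefficient

    # Invented ratings: 5 persons (rows) each scored by the same 3 raters (columns).
    ratings = np.array([[7, 8, 7],
                        [4, 5, 5],
                        [9, 9, 8],
                        [5, 6, 5],
                        [3, 4, 4]], dtype=float)
    print(round(g_coefficient(ratings), 2))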
5. How is the validity of an assessment instrument determined?4,7,8
Validity of assessment instruments requires several sources of evidence to build the case that the instrument measures what it is supposed to measure.9,10 Determining validity can be viewed as constructing an evidence-based argument regarding how well a tool measures what it is supposed to do. Evidence can be assembled to support, or not support, a specific use of the assessment tool. Evidence can be found in content, response process, relationships to other variables, and consequences.

Content includes a description of the steps used to develop the instrument. Provide information such as who created the instrument (national experts would confer greater validity than local experts, who in turn would confer more validity than nonexperts) and other steps taken to ensure that the instrument has the appropriate content.

Response process includes information about whether the actions or thoughts of the subjects actually match the test, and also information regarding training for the raters/observers, instructions for the test-takers, instructions for scoring, and the clarity of these materials.

Relationship to other variables includes correlation of the new assessment instrument results with other performance outcomes that would likely be the same. If there is a previously accepted "gold standard" of measurement, correlate the instrument results to the subject's performance on the "gold standard." In many cases, no "gold standard" exists and comparison is made to other assessments that appear reasonable (eg, in-training examinations, objective structured clinical examinations, rotation "grades," similar surveys).
Consequences means that if there are pass/fail or cut-off performance scores, those grouped in each category tend to perform the same in other settings. Also, if lower performers receive additional training and their scores improve, this would add to the validity of the instrument.

Different types of instruments need an emphasis on different sources of validity evidence.7 For example, for observer ratings of resident performance, interrater agreement may be key, whereas for a survey measuring resident stress, relationship to other variables may be more important. For a multiple choice examination, content and consequences may be essential sources of validity evidence. For high-stakes assessments (eg, board examinations), substantial evidence to support the case for validity will be required.9

There are also other types of validity evidence, which are not discussed here.
6. How can researchers enhance the validity of their assessment instruments?
First, do a literature search and use previously developed outcome measures. If the instrument must be modified for use with your subjects or setting, modify it and describe how, in a transparent way. Include sufficient detail to allow readers to understand the potential limitations of this approach.

If no assessment instruments are available, use content experts to create your own and pilot the instrument prior to using it in your study. Test reliability and include as many sources of validity evidence as are possible in your paper. Discuss the limitations of this approach openly.
7. What are the expectations of JGME editors regarding assessment instruments used in graduate medical education research?
JGME editors expect that discussions of the validity of your assessment tools will be explicitly mentioned in your manuscript, in the methods section. If you are using a previously studied tool in the same setting, with the same subjects, and for the same purpose, citing the reference(s) is sufficient. Additional discussion about your adaptation is needed if you (1) have modified previously studied instruments; (2) are using the instrument for different settings, subjects, or purposes; or (3) are using different interpretation or cut-off points. Discuss whether the changes are likely to affect the reliability or validity of the instrument.

Researchers who create novel assessment instruments need to state the development process, reliability measures, pilot results, and any other information that may lend credibility to the use of homegrown instruments. Transparency enhances credibility.

In general, little information can be gleaned from single-site studies using untested assessment instruments; these studies are unlikely to be accepted for publication.
8. What are useful resources for reliability and validity of assessment instruments?
The references for this editorial are a good starting point.
References
1. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
2. Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38(9):1006–1012.
3. Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19(9):971–977.
4. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments. Am J Med. 2006;119(2):166.e7–166.e16.
5. Bland JM, Altman DG. Statistics notes: Cronbach's alpha. BMJ. 1997;314:572.
6. Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001.
7. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37(9):830–837.
8. Downing SM, Haladyna TM. Validity threats: overcoming interference with proposed interpretations of assessment data. Med Educ. 2004;38(3):327–333.
9. Kane M. Validating high-stakes testing programs. Educ Meas Issues Pract. 2002;1:31–41.
10. Kane M. The assessment of professional competence. Eval Health Prof. 1992;15(2):163–182.
Errata

The following are corrections to the June 2011 issue.

1. Sullivan GM. A primer on the validity of assessment instruments. J Grad Med Educ. 2011;3(2):119–120.
On p 119, the sentence should read: Cronbach alpha calculates correlation among all the variables, in every combination, and generates one number that the closer it is to 1, the higher the reliability estimate.
2. Salem JK, Jones RR, Sweet DB, Hasan S, Torregosa-Arcay H, Clough L. Improving care in a resident practice for patients with diabetes. J Grad Med Educ. 2011;3(2):196–202.
The Figure legends should read:
Figure 1 Description of Sample Selection for Outcomes Analysis
Figure 2 Timeline for Implementation of Interventions
3. Saeed F, Majeed MH, Kousar N. Easing international medical graduates' entry into US training. J Grad Med Educ. 2011;3(2):269.
The lead author's name is Fahad Saeed, MD.
4. Sweeney A, Stephany A, Whicker S, Bookman J, Turner DA. Senior Pediatric Residents as Teachers for an Innovative Multidisciplinary Mock Code Curriculum. J Grad Med Educ. 2011;3(2):188–195.
The Figure 3 label for the seventh column is: Communicating Effectively.
5. Le-Bucklin KT, Hicks R, Wong A. Impact of a Teaching Rotation on Residents' Attitudes Toward Teaching: A 5-Year Study. J Grad Med Educ. 2011;3(2):253–255.
The Results section of the Abstract should read:
Results: Four categories showed significant improvement, including feeling prepared to teach (P < .0001), having confidence in their teaching ability (P < .0001), being aware of their expectations as a teacher (P < .0001), and feeling that their anxiety about teaching was at a healthy level (P = .0037). There was an increase in the level of enthusiasm, but the P value did not reach a significant range (P = .12). The level of enthusiasm started high and was significantly higher on the pretest than every other tested category (P < .0001).
Footnote c to Table 2 should read: P value as calculated using the Mann-Whitney U test.
Validity? Don't Blame the Messenger
Sandip Sinharay (Educational Testing Service, Princeton, NJ), Shelby J. Haberman (Educational Testing Service, Princeton, NJ), and Howard Wainer (National Board of Medical Examiners, Philadelphia, PA)

Abstract
There are several techniques that increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several recent publications. In this note, the authors question the argument used in these publications and suggest both inherent limits to the validity argument and empirical issues worth examining.

Keywords: subscores, validity, augmented subscore
Introduction: Subscores and Adjusted Subscores
There are several techniques that increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several recent publications such as Skorupski and Carvajal (2010) and Stone, Ye, Zhu, and Lane (2010). In this note, we question the argument used in these publications and suggest both inherent limits to the validity argument and empirical issues worth examining. We begin with an introduction to the techniques that borrow information from other parts of the test as part of the subscore computation process and then evaluate the validity arguments advanced recently concerning these techniques.
Interest in subscores in educational testing reflects their potential remedial and instructional benefit. According to the National Research Council report "Knowing What Students Know" (2001), the target of assessment is to provide particular information about an examinee's knowledge, skill, and abilities. Subscores have the potential to provide such information; however, they are too often not reliable enough for their intended purposes. Several researchers have suggested methods that increase the precision of subscores by borrowing information from the other related scores or subscores. For example (a toy simulation of the general borrowing idea appears after this list):
• Wainer, Sheehan, and Wang (2000) and Wainer, Vevea, et al. (2001) suggest the augmented subscore, which is a function of an examinee's score on the subscale of interest and that examinee's score on the remaining subscales.
• Yen (1987) suggested the objective performance index (OPI), which is a weighted average of the observed subscore and an estimate of the observed subscore obtained using a unidimensional item response theory (IRT) model for the entire test.
• Haberman (2008a) suggested a weighted average of a subscore and the total score. Sinharay (2010) found that this weighted average is typically very similar to the augmented subscore (Wainer et al., 2000).
• Several researchers (de la Torre & Patz, 2005; Haberman & Sinharay, 2010; Luecht, 2003; Yao & Boughton, 2007) suggested using estimated abilities or their transformations obtained from a multivariate IRT (MIRT) model as subscores. For background on MIRT models, see, for example, Reckase (1997).
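The following toy simulation illustrates the borrowing idea in general, not the exact estimator of any paper listed above: correlated true subscores are generated, measurement error is added, and even a crude fixed-weight blend of a subscore with the (rescaled) total score tracks the true subscore better than the observed subscore alone. All numbers and weights are invented:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    # Correlated true subscores for two subscales, plus noisy observed versions.
    true_1, true_2 = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=n).T
    obs_1 = true_1 + rng.normal(0, 1.0, n)  # unreliable observed subscore
    obs_2 = true_2 + rng.normal(0, 1.0, n)
    total = obs_1 + obs_2

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    # Crude fixed-weight blend of subscore 1 and the total, rescaled to a common spread.
    blended_1 = 0.5 * obs_1 + 0.5 * (total - total.mean()) / total.std() * obs_1.std()

    print(round(corr(obs_1, true_1), 3))      # observed subscore vs. its true value
    print(round(corr(blended_1, true_1), 3))  # blended score sits closer to the true value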
The scores obtained from the above-mentioned approaches will be referred to as "adjusted subscores."1 Researchers have found that adjusted subscores are more reliable, often substantially so, than the subscores themselves (Dwyer, Boughton, Yao, Steffen, & Lewis, 2006; Sinharay, 2010; Skorupski & Carvajal, 2010; Stone, Ye, Zhu, & Lane, 2010).
Recent Criticisms of Adjusted Subscores
The validity of adjusted subscores has been questioned recently. Skorupski and Carvajal (2010) studied four subscores from a large statewide test and found that the corresponding OPIs and the augmented subscores (Wainer et al., 2000) were highly correlated among themselves. The correlations between augmented subscores were 0.97 or greater, and those between the OPIs were all 1.00. Skorupski and Carvajal (2010) commented that this phenomenon of high correlations among the adjusted subscores (which means that the rank orderings for the four adjusted subscores are very similar) leads to potential loss of meaning of the subscores and "reduces, if not eliminates, the utility of the subscores for the diagnostic purposes for which they are intended. This begs the question: Are the augmented subscores providing more useful information than the raw ones?" (p. 372). They went on to comment that "although augmentation dramatically improves the reliability of subscores, it may in fact negatively affect the validity of score interpretations" (p. 372). In the abstract of their article, they commented that the near-perfect correlations among the adjusted subscores "called into question the validity of the resultant subscores, and therefore the usefulness of the subscore augmentation process."
Stone et al. (2010) studied the four subscores for the spring 2006 assessment of the Delaware Student Testing Program 8th grade mathematics assessment. They found the augmented subscores, the OPIs, and the MIRT-based subscores to be highly correlated among themselves and commented that "it may be that adjusted subscale scores represent the measurement of a construct that is different from the construct being measured by the unadjusted subscale scores" (p. 80). They commented that borrowing information from other subscales causes a "potential threat to validity" of the adjusted subscores (p. 80).

It seems that Skorupski and Carvajal (2010) and Stone et al. (2010) have criticized the use of adjusted subscores in general (rather than criticizing their use with their data sets), and their criticisms might make some practitioners wonder whether it makes sense to use adjusted subscores at all.
Should One Report Diagnostic Scores for the Tests Considered in Skorupski and Carvajal (2010) and Stone et al. (2010)?
Let us look closely at the tests considered by Skorupski and Carvajal (2010) and Stone et al. (2010) and ask the question, "Should one report subscores, or, more generally, any kind of diagnostic scores for these tests?"
According to Standard 5.12 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), scores should not be reported for individuals unless the validity, comparability, and reliability of such scores have been established. This standard applies to subscores as well as to the overall or total score. Furthermore, Standard 1.12 of the Standards for Educational and Psychological Testing (1999) demands that, if a test provides more than one score, then the distinctiveness of the separate scores should be demonstrated.
Haberman (2008a) suggested an approach to determine if subscores and augmented subscores have added value over the total score. This approach has been applied in Lyren (2009); Puhan, Sinharay, Haberman, and Larkin (2010); and Sinharay (2010). In this approach, a subscore has added value if it is reliable and is distinct from the other subscores.
Sinharay (2010) applied the approach of Haberman to the data set considered in Stone et al. (2010) and concluded that none of the original subscores were of added value and that none of the weighted averages (or augmented subscores) were of added value. In addition, Stone et al. reported an exploratory factor analysis that suggested the presence of only one factor in the data set and found the disattenuated correlations between the subscores to be between 0.96 and 1.03.

The disattenuated correlations between the subscores of Skorupski and Carvajal (2010) were between 0.89 and 0.96, with an average of 0.94. None of the subscores, weighted averages, and augmented subscores had added value for this data set either.2
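For reference, a disattenuated correlation divides the observed correlation by the square root of the product of the two reliabilities, estimating the correlation between the underlying true scores; the numbers below are invented:

    def disattenuated_r(r_xy, rel_x, rel_y):
        """Observed correlation corrected for measurement error in both scores."""
        return r_xy / (rel_x * rel_y) ** 0.5

    # Invented example: observed r = .80 between two subscores
    # whose reliabilities are .70 and .75.
    print(round(disattenuated_r(0.80, 0.70, 0.75), 2))  # ~1.10; values near or above 1
                                                        # suggest the subscores are not distinct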
These results are enough to conclude that subscores or, more generally, any kind of diagnostic scores (including adjusted subscores) for the tests considered in Skorupski and Carvajal (2010) and Stone et al. (2010) will not satisfy professional quality standards (especially the above-mentioned Standard 1.12 on distinctiveness). Hence, it is true that the adjusted subscores for these tests lack validity (because Haberman, 2008b, showed that the validity of subscores is limited when the subscores are either not reliable or are highly correlated with total scores).

However, no reasonable person should blame the adjusted subscores for not being valid for the tests considered in Skorupski and Carvajal (2010) and Stone et al. (2010). If the bathroom scale tells us that we need to lose weight, it would be unfair to blame the scale. The tests considered in Skorupski and Carvajal and Stone et al. were unidimensional and were incapable of producing diagnostic scores of any kind. So it is no wonder that the adjusted subscores computed from these data are not valid. However, responsibility for the lack of validity lies not with the adjusted subscores but rather with the tests and those who try to report any diagnostic subscores from the tests in the first place. The adjusted subscores are just the messengers of the bad news that the data are not appropriate for diagnostic score reporting.
A General Defense of Adjusted Subscores
The examples of Skorupski and Carvajal (2010) and Stone et al. (2010) do not represent a complete picture of the empirical situation, as is evident from a recent review of subscores for operational tests (Sinharay, 2010). For example, consider the Swedish Scholastic Assessment Test considered in Lyren (2009), which included subscores and adjusted subscores that had added value. For this test, the correlations between the augmented subscores ranged between 0.58 and 0.94, with an average of 0.79.3 These correlations are much higher than the correlations between the unadjusted subscores, which ranged between 0.42 and 0.67, with an average of 0.55. However, the correlations between the augmented subscores are much lower than those in Skorupski and Carvajal (2010) and Stone et al. (2010) and demonstrate that the correlations between augmented subscores are not always extremely high.
When several subscores of an assessment are adjusted by use of the total score (or other parts of the test), the adjusted subscores share a common component, the total score (or score on the other parts), so the adjusted subscores will always be more highly correlated than are the original observed subscores.
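This mechanism is easy to verify in a small simulation (all data invented): blending the same total into two subscores raises their correlation, regardless of any validity question:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10000

    # Two moderately correlated observed subscores and their total.
    s1, s2 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T
    total = s1 + s2

    # Adjust each subscore by mixing in the shared total (arbitrary 70/30 weights).
    a1 = 0.7 * s1 + 0.3 * total
    a2 = 0.7 * s2 + 0.3 * total

    print(round(np.corrcoef(s1, s2)[0, 1], 2))  # about 0.50
    print(round(np.corrcoef(a1, a2)[0, 1], 2))  # noticeably higher (about 0.82)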
In general, increased correlations among adjusted subscores do not threaten validity. If the correlations are very high, then the adjusted subscores are essentially just versions of the total score, and the test is not able to produce useful diagnostic scores. If the correlations are not very high, measurement error has been reduced with the computation of the adjusted subscore (because the variance of the adjusted subscore is less than that of the subscore and hence the reliability is higher). If the measurement error is sufficiently reduced, then the correlation with external criterion scores is likely to increase rather than decrease when adjusted rather than observed subscores are employed, although empirical study is needed to verify this observation with real data (Haberman, 2008b).
To examine the validity issue in a simple setting, it is helpful to consider parallel forms. The subscore on a parallel form is a basic validity criterion for the corresponding observed and adjusted subscores on the original form. For example, let us consider the test TC2 considered in Table 1 of Sinharay (2010). The test, which measured achievement in a discipline, had 200 multiple choice items and three subscores, each having 66 or 67 items. We split the test into two tests, say Test A and Test B, of length 100 items each. Tests A and B were made roughly parallel in difficulty and content. We then computed the subscores and augmented subscores for Tests A and B. All three of the augmented subscores have added value for both Tests A and B according to the criteria of Haberman (2008a). Table 1 shows some correlations.

Table 1. Correlations Among Subscores and Augmented Subscores

    Subscore   r(Subscore on A,    r(Augmented Subscore on A,   r(Subscore on A,
               Subscore on B)      Subscore on B)               Augmented Subscore on B)
    1          0.85                0.88                         0.87
    2          0.79                0.84                         0.82
    3          0.83                0.85                         0.84

The table shows that any subscore on Test B (or A) has a higher correlation with the corresponding augmented subscore on parallel Test A (or B) than with the corresponding subscore on parallel Test A (or B).4 For example, the correlation between subscore 1 on Test B and augmented subscore 1 on Test A is 0.88, which is larger than 0.85, the correlation between subscore 1 on Test B and subscore 1 on Test A.
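Once subscores and augmented subscores have been computed for both halves, the check in Table 1 is easy to script: correlate each Test B subscore with its Test A counterpart and with the corresponding Test A augmented subscore. The sketch below uses simulated scores and a deliberately crude augmentation (a blend with the examinee's mean subscore), not the TC2 data or the optimal augmentation weights; with these settings the augmented column should come out higher, mirroring the pattern in Table 1:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000

    # Simulated true scores on 3 correlated subscales, then two parallel noisy forms.
    cov = np.full((3, 3), 0.7) + 0.3 * np.eye(3)
    true = rng.multivariate_normal(np.zeros(3), cov, size=n)
    sub_a = true + rng.normal(0, 0.8, size=(n, 3))  # Test A observed subscores
    sub_b = true + rng.normal(0, 0.8, size=(n, 3))  # parallel Test B observed subscores
    aug_a = 0.6 * sub_a + 0.4 * sub_a.mean(axis=1, keepdims=True)  # crude augmentation

    for k in range(3):
        r_sub = np.corrcoef(sub_a[:, k], sub_b[:, k])[0, 1]
        r_aug = np.corrcoef(aug_a[:, k], sub_b[:, k])[0, 1]
        print(f"subscale {k + 1}: r(sub A, sub B) = {r_sub:.2f}, "
              f"r(aug A, sub B) = {r_aug:.2f}")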
Figure 1, which is like Figure 4 of Skorupski and Carvajal (2010), shows the subscore profiles (top panel) and the profiles of augmented subscores (bottom panel) of five randomly chosen examinees. Although the three observed subscores for each examinee vary more than the three augmented subscores for the examinee, the profiles of augmented subscores are not all parallel, unlike in Figure 4 of Skorupski and Carvajal. Some of the profiles of augmented subscores even intersect with each other.

[Figure 1. Subscore profiles of five randomly chosen examinees]

Thus, the two facts, (a) the adjusted subscores (augmented subscores in this case) estimate the subscores on a parallel form better and (b) the profiles of the adjusted subscores are not all parallel, show that the adjusted subscores did not "lose their meaning" or "have their utility reduced or eliminated" (as commented in Skorupski & Carvajal, 2010, p. 372) and did not represent a construct different from that measured by the subscores (as mentioned in Stone et al., 2010).
Therefore, for a test that was designed to report diagnostic scores (e.g., the Swedish Scholastic Assessment Test or the test TC2 considered above), it is straightforward to gather evidence that supports the proposed interpretation of the adjusted subscores, and it will not be difficult to stand up to any criticism of the adjusted subscores as long as the accumulated evidence is evaluated in an evenhanded way (Kane, 2006, mentioned the need to stand up to criticism in establishing validity).
To make the validity claim foolproof, it is important also to collect empirical evidence concerning the validity of subscores and adjusted subscores. Haberman (2008b) suggested some theoretical results on the validity of subscores, but those results do not obviate the need for data on validity. Although modern concepts of validity of tests consider many aspects of test content, intended use, and consequences of use (Kane, 2006; Messick, 1989), a mature testing program requires empirical evidence that a reported test score is adequately related to appropriate criterion scores.

If the adjusted subscores have lower correlations with appropriate criterion variables than the total scores or the original subscores, then there is justification to criticize them for lack of validity. However, until that can be demonstrated, we think that it is premature to criticize their validity based on any current findings. It does not seem that any of the validity standards of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) have been violated by the use of the adjusted subscores.
We agree with Stone et al. (2010) that score users may not like or understand the dependence of, say, a reading subscore on a speaking subscore. However, many score users do not understand measurement concepts such as Cronbach's alpha or equipercentile equating, and that has not deterred the testing companies from reporting reliability values or equated scores. In addition, we believe that it would not be difficult to make an argument that, for example, a common language skill is required to answer both reading and listening items, which would justify the adjustment of listening subscores using reading subscores in addition to the listening subscores. Think of a test for which (a) the subscores are reliable and distinct, (b) the adjusted subscores have higher reliability than the subscores, (c) there is strong evidence of criterion validity of adjusted subscores, and (d) the subscales are somewhat connected conceptually (e.g., language skills such as reading and listening). Here, it makes sense to adjust subscores. In our opinion, for such a test, it is possible for the testing company to make a claim about the validity of the adjusted subscores that is strong enough to overcome the above-mentioned potential problem of explaining adjusted subscores to users and to convince the users that the adjusted subscores are reliable, valid, and useful.
Authors' Note
Any opinions expressed in this article are those of the authors and are not necessarily those of Educational Testing Service or the National Board of Medical Examiners.

Acknowledgements
The authors are grateful to Dan Eignor, Wendy Yen, Gautam Puhan, and George Marcoulides for their helpful comments and to William Skorupski for generously sharing with us some summary of a data set.

Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the authorship and/or publication of this article.

Funding
The research of the first two authors was funded by Educational Testing Service (ETS), which is the company that these two authors work for.
Notes
1. A better name would have been "augmented subscores," but that corresponds to the scores described in Wainer et al. (2000).
2. We thank William Skorupski for generously sharing with us some summary of their data that allowed us to perform these computations.
3. The correlations among OPIs or MIRT-based subscores would be of similar magnitudes.
4. The same result would have been obtained for OPIs and MIRT-based subscores.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
de la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application of multidimensional IRT in test scoring. Journal of Educational and Behavioral Statistics, 30, 295-311.
Dwyer, A., Boughton, K. A., Yao, L., Steffen, M., & Lewis, D. (2006). A comparison of subscale score augmentation methods using empirical data. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Haberman, S. J. (2008a). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204-229.
Haberman, S. J. (2008b). Subscores and validity (ETS Research Report No. RR-08-64). Princeton, NJ: Educational Testing Service.
Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209-227.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 18-64). Westport, CT: Praeger.
Luecht, R. M. (2003, April). Applications of multidimensional diagnostic scoring for certification and licensure tests. Paper presented at the meeting of the National Council on Measurement in Education, Chicago, IL.
Lyren, P. (2009). Reporting subscores from college admission tests. Practical Assessment, Research, and Evaluation, 14, 1-10.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: National Council on Measurement in Education and American Council on Education.
National Research Council. (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academies Press.
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2010). Comparison of subscores based on classical test theory. Applied Measurement in Education, 23, 1-20.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150-174.
Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores. Educational and Psychological Measurement, 70, 357-375.
Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information: A case study when the test is essentially unidimensional. Applied Measurement in Education, 23, 63-86.
Wainer, H., Sheehan, K., & Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational Measurement, 37, 113-140.
Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., Swygert, K. A., . . . Thissen, D. (2001). Augmented scores—"Borrowing strength" to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 343-387). Mahwah, NJ: Lawrence Erlbaum.
Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83-105.
Yen, W. M. (1987, June). A Bayesian/IRT index of objective performance. Paper presented at the annual meeting of the Psychometric Society, Montreal, Quebec, Canada.