This document provides a critical examination of using content validity evidence to validate personality tests for employee selection. It notes conflicting guidance between standards, with the Uniform Guidelines prohibiting content validity for traits like personality, while the SIOP Principles and Joint Standards support it. The document aims to resolve this ambiguity by analyzing whether content validity is appropriate for personality tests and how such evidence could be gathered. It argues content validity is underutilized but can provide valuable insights, especially given personality testing's prevalence in selection.
This inconsistency is concerning given the importance of gathering sound validity evidence, and the deference given to these standards/guidelines in contemporary employee selection practice. As a consequence, test users and practitioners are likely to be reticent or uncertain about gathering content-based evidence for personality measures, which, in turn, may cause such evidence to be underutilized when personality testing is of interest. The current investigation critically examines whether (and how) content validity evidence should be used for measures of personality in relation to employee selection. The ensuing discussion, which is especially relevant in highly litigious contexts such as personnel selection in the public sector, sheds new light on test validation practices.
Keywords: test validation, content validity, personality testing, employee selection
An essential consideration when using any test or measurement
tool for employee
selection is gathering and evaluating relevant validity evidence.
In the contemporary
employee selection context, validity evidence is generally
understood to mean
1The University of Tulsa, OK, USA
2Qualtrics, Provo, UT, USA
3Rice University, Houston, TX, USA

Corresponding Author:
David M. Fisher, Assistant Professor of Psychology, The University of Tulsa, 800 S. Tucker Drive, Tulsa, OK 74104, USA.
Email: [email protected]

Public Personnel Management, 50(2), 2020. Research article. DOI: 10.1177/0091026020935582
evidence that substantiates inferences made from test scores.
Various sources provide
standards and guidelines for gathering validity evidence,
including the Uniform
Guidelines on Employee Selection Procedures (Equal
Employment Opportunity
Commission, Civil Service Commission, Department of Labor,
& Department of
Justice, 1978; hereafter, Uniform Guidelines, 1978), the
Principles for the Validation
and Use of Personnel Selection Procedures (Society for
Industrial and Organizational
Psychology [SIOP], 2003; hereafter, SIOP Principles, 2003),
and Standards for
Educational and Psychological Testing (American Educational
Research Association,
American Psychological Association, & National Council on
Measurement in
Education, 1999/2014; hereafter, Joint Standards, 1999/2014),
as well as the academic
literature (e.g., Aguinis et al., 2001). Having such a variety of
sources available is
beneficial, but challenges arise when the various sources provide ambiguous or contradictory information. Such ambiguity can be particularly troublesome in highly litigious contexts, such as the public sector, where adherence to regulations governing
selection is of paramount importance.
The current investigation attempts to shed light on one such
area of ambiguity—
whether evidence based on test content should be used as a
means of validating per-
sonality test inferences for employee selection. Rothstein and
Goffin (2006) noted, “It
has been estimated that personality testing is a $400 million
industry in the United
States and it is growing at an average of 10% a year” (Hsu,
2004, p. 156). Given this
reality, it is important to carefully consider appropriate
validation procedures for such
measures. However, the various sources mentioned above
present conflicting direc-
tions on this issue, specifically in relation to content-based
validity evidence. On one
hand, evidence based on test content is one of five potential
sources of validity evi-
dence described by the Joint Standards (1999/2014), which is
similarly endorsed by
the SIOP Principles (2003). This form of evidence has further
been suggested by some
to be particularly relevant to personality tests (e.g., Murphy et
al., 2009; O’Neill et al.,
2009), and especially under challenging validation conditions,
such as small sample
sizes, test security concerns, or lack of a reliable criterion
measure (Landy, 1986; Tan,
2009; Thornton, 2009). On the other hand, the Uniform
Guidelines (1978) assert that
“. . . a content strategy is not appropriate for demonstrating the
validity of selection
procedures which purport to measure traits or constructs, such
as intelligence, aptitude, personality, common sense, judgment, leadership, and
spatial ability [emphasis
added]” (Section 14.C.1). Other sources similarly convey
reticence toward content
validity for measures of traits or constructs (e.g., Goldstein et
al., 1993; Lawshe, 1985;
Wollack, 1976). Thus, there appears to be conflicting guidance
on the use of content
validity evidence to support personality measures.
In light of this discrepancy, the current investigation offers a
critical examination of
content validity evidence and personality testing for employee
selection. Such an
investigation is valuable for several reasons. First, an important
consequence of the
inconsistency noted above is that content-based evidence may
be overlooked as a
valuable approach to validation when personality testing is of
interest. Evidence for
this can be seen in the fact that other approaches such as
criterion-related validation
are sometimes viewed as the only option for personality
measures (Biddle, 2011).
Similarly, prominent writings on personality testing in the
workplace (e.g., Morgeson
et al., 2007b; O’Neill et al., 2013; Ones et al., 2007; Rothstein
& Goffin, 2006; Tett &
Christiansen, 2007) have tended to ignore the applicability of
content validation to
personality measures. Furthermore, considering the deference
given to the various
standards and guidelines in contemporary employee selection
practice (Schmit &
Ryan, 2013), those concerned about strict adherence to such
standards/guidelines are
likely to be reticent or uncertain about gathering content-based
evidence for personal-
ity measures—in no small part due to conflicting or ambiguous
recommendations. The above circumstances tend to leave content-based evidence viewed as less desirable, or treated as an afterthought. In turn, this represents a missed opportunity for valuable insight into the use of personality measures.
Second, the neglect or underutilization of content-based
evidence is, in many ways,
antithetical to the broader goal of developing a theory-based
and scientifically
grounded understanding of tests and measures used for
employee selection (Binning
& Barrett, 1989). For example, as elaborated below, there are
various situations in
which content-based evidence may be preferable to criterion-based evidence, not the least of which is an insufficient sample size for a criterion-based investigation (McDaniel et al., 2011). Similarly, an exclusive focus on
empirical prediction
ignores the importance of underlying theory, which is critical
for advancing employee
selection research. Of relevance, the examination of content
validity evidence forces
one to carefully consider the correspondence between selection
measures and underly-
ing construct domains, as informed by theoretical
considerations. Evidence for the
value of content validity can also be found in trait activation
theory (Tett & Burnett,
2003; Tett et al., 2013), which highlights the importance of a
clear conceptual linkage
between the content of personality traits/constructs and the job
domain in question.
Thus, content validity evidence should be of primary
importance for personality test
validation.
Third, it is useful to acknowledge that the prohibition against
content validity evi-
dence in relation to personality measures noted in the Uniform
Guidelines (1978)
appears to be at odds with contemporary thinking on validation
(Joint Standards,
1999/2014). The focal passage quoted above from the Uniform
Guidelines has been
described as being “. . . as destructive to the interface of
psychological theory and
practice as any that might have been conceived” (Landy, 1986,
p. 1189). Although
there have been well-argued critiques of the Uniform Guidelines
(e.g., McDaniel
et al., 2011), in addition to thoughtful elaboration of issues
surrounding content valid-
ity (e.g., Binning & LeBreton, 2009), a direct attempt at
resolving the noted contradic-
tion remains conspicuously absent from the literature. This
contradiction, in
conjunction with the absence of a satisfactory explanation, is
problematic given the
importance of gathering sound validity evidence pertaining to
psychological test use.
As such, a critical examination of this issue is warranted.
Finally, the findings of the current investigation are likely to
have broad applicabil-
ity. Namely, although focused on personality testing, the
discussion below is relevant
to measures of other commonly assessed attributes classified
under the Uniform
Guidelines (1978) as “traits or constructs” (Section 14.C.1).
Similarly, while we
address the Uniform Guidelines—which some argue are
outdated (e.g., Jeanneret &
Zedeck, 2010) and further limited by their applicability to
employee selection in the
United States—we believe the value of this discussion extends
far beyond these guide-
lines. It is important to carefully consider appropriate validation
strategies in all cir-
cumstances where psychological tests are used. Hence, the
discussion presented herein
is likely to be of relevance for content-based validation efforts
in other areas beyond
employee selection in the United States (e.g., educational
testing, clinical practice,
international employee selection efforts).
Following a brief overview of validity and content-based
validation, our investiga-
tion is organized around three fundamental questions. Question
1 asks whether current
standards and guidelines support the use of content validity
evidence for validation of
personality test inferences in an employee selection context.
Based on the concerns
raised above, a preliminary answer to this question is that it is
unclear. Question 2 then
asks about the underlying bases of the inconsistency. Building
on the identified causes
of disagreement, Question 3 asks how one might actually gather
evidence based on test
content for personality measures. Ultimately, our goal in this
effort is to reduce ambigu-
ity and promote clarity regarding content-based validation of
personality measures.
Overview of Validity and Evidence Based on Test
Content
Broadly speaking, validity in measurement refers to how well an
assessment device
measures what it is supposed to (Schmitt, 2006). The focus of
measurement is typi-
cally described as a construct (Joint Standards, 1999/2014),
which represents a latent
attribute on which individuals can vary (e.g., cognitive ability,
diligence, interpersonal
skill, knowledge, the capacity to complete a given task).
Importantly, a person’s level
or relative standing with regard to the construct of interest is
inferred from the test
scores (SIOP Principles, 2003). As such, the notion of validity
addresses the simple yet
fundamental issue of whether test scores actually reflect the
attribute or construct that
the test is intended to measure. However, this succinct
characterization of validity also
belies the true complexity of this topic (Furr & Bacharach,
2014). Two particular com-
plexities bear discussion in light of our current aims.
First, contemporary thinking holds that validity is not the
property of a test per se,
but rather of the inferences made from test scores (Binning &
Barrett, 1989; Furr &
Bacharach, 2014; Joint Standards, 1999/2014; Landy, 1986;
SIOP Principles, 2003).
The value of this approach can be seen when the same test is
used for two different
purposes—for example, when an interpersonal skills test
developed for the selection
of sales personnel is used for hiring both sales representatives
and accountants.
Notably, the test itself does not change, but the inferences made
from the test scores
regarding the job performance potential of the applicants may
be more or less valid
given the focal job in question. In accord with this perspective,
the Joint Standards
(1999/2014) describe validity as “the degree to which evidence
and theory support the
interpretations of test scores for proposed uses of the test” (p.
11). Inherent in this view
is the idea that validity is difficult to fully assess without a
clear explication of the
intended interpretation of scores and corresponding purpose of
testing. Thus, substan-
tiating relevant inferences in terms of the intended purpose of
the test is of primary
concern in the contemporary view of validity.
Second, validity has come to be understood as a unitary
concept, as compared with
the dated notion of distinct types of validity (Binning & Barrett,
1989; Furr &
Bacharach, 2014; Joint Standards, 1999/2014; Landy, 1986;
SIOP Principles, 2003).
The older trinitarian view (Guion, 1980) posits three different
types of validity, includ-
ing criterion-related, content, and construct validity, each
relevant for different test
applications (Lawshe, 1985). By contrast, the more recent
unitarian perspective
(Landy, 1986) emphasizes that all measurement attempts are
ultimately about assess-
ing a target construct, and validation entails the collection of
evidence to support the
argument that test scores actually reflect the construct (and that
the construct is rele-
vant to the intended use of the test). Consistent with this latter
perspective, the Joint
Standards (1999/2014) espouse a unitary view of validity and
identify five sources of
validity evidence, including evidence based on test content,
response processes, inter-
nal structure, relations to other variables, and consequences of
testing. In summary, the
contemporary view of validity suggests that measurement
efforts ultimately implicate
constructs, and different sources of evidence can be marshaled
to substantiate the
validity of inferences based on test scores.
Drawing on the above discussion, evidence based on test
content represents one of
several potential sources of evidence for validity judgments.
The collection of content-
based evidence has become well-established as an important and
viable validation
strategy, as can be seen in the common discussion and
endorsement of content validity
in the academic literature (e.g., Aguinis et al., 2001; Binning &
Barrett, 1989; Furr &
Bacharach, 2014; Haynes et al., 1995; Landy, 1986) as well as
in legal, professional,
and technical standards or guidelines (e.g., Joint Standards,
1999/2014; SIOP
Principles, 2003; Uniform Guidelines, 1978). The specific
manner in which evidence
based on test content can substantiate the validity of test score
inferences is via an
informed and judicious examination of the match between the
content of an assess-
ment tool (e.g., test instructions, item wording, response
format) and the target con-
struct in light of the assessment purpose (Haynes et al., 1995).
For the sake of simplicity
and ease of exposition, throughout this article, we use various
terms interchangeably
to represent the concept of evidence based on test content, such
as content validity
evidence, content validation strategy, content-based strategy, or
simply content valid-
ity. However, each reference to this concept is intended to
reflect contemporary think-
ing regarding validity as described above—specifically, content
validity evidence is
not a separate “type” of validity but rather, a category of
evidence that can be used to
substantiate the validity of inferences regarding test scores.
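None of the sources reviewed here prescribes a particular statistic for such judgments, but a classic index from this literature for summarizing expert ratings of item relevance is Lawshe's content validity ratio (CVR), computed per item as (n_e − N/2)/(N/2), where n_e is the number of subject matter experts rating the item "essential" and N is the panel size. The sketch below is purely illustrative; the function name and the ratings data are hypothetical.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's content validity ratio (CVR) for a single item.

    CVR = (n_e - N/2) / (N/2), ranging from -1 (no expert rates the
    item "essential") through 0 (half do) to +1 (all experts do).
    """
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel: 10 subject matter experts judge whether each
# personality-test item is "essential" to the target construct/job domain.
ratings = {"item_1": 9, "item_2": 6, "item_3": 2}  # counts of "essential"
for item, n_e in ratings.items():
    # item_1 -> 0.8, item_2 -> 0.2, item_3 -> -0.6
    print(item, round(content_validity_ratio(n_e, 10), 2))
```

Items with CVR near +1 would be retained as content-relevant, while items near or below 0 would be flagged for revision; published critical values depend on panel size.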
Do Current Standards Support Content Validity for
Personality?
Having introduced the concepts of validity and evidence based
on test content, we now
turn to our primary purpose of discussing whether a content
validation strategy should
be used as a means of validating personality test inferences for
employee selection
purposes. In doing so, a preliminary question becomes whether
current standards and
guidelines support this practice. The following four
sources/areas are considered: (a)
the Uniform Guidelines (1978), (b) the SIOP Principles (2003),
(c) the Joint Standards
(1999/2014), and (d) a general review of relevant academic
literature. A summary of
information derived from these sources is shown in Table 1.
The Uniform Guidelines (1978)
The Uniform Guidelines (1978) are federally endorsed standards
pertaining to
employee selection procedures, which were jointly developed by
the Equal
Employment Opportunity Commission, the Civil Service
Commission, the Department
of Labor, and the Department of Justice in the United States.
Regarding content vali-
dation, the guidelines state that,
Evidence of the validity of a test or other selection procedure by
a content validity study
should consist of data showing that the content of the selection
procedure is representative
of important aspects of performance on the job for which the
candidates are to be
evaluated. (Section 5.B)
The guidelines go on to describe specific technical standards
and requirements for
content validity studies. For example, a content validity study
should include a review
of information about the job under consideration (Section 14.A;
Section 14.C.2).
Furthermore, when the selection procedure focuses on work
tasks or behaviors, it must
be shown that the selection procedure includes a representative
sample of on-the-job
behaviors or work products (Section 14.C.1; Section 14.C.4).
Conversely, under cer-
tain circumstances, the guidelines also permit content validation
where the selection
procedure focuses on worker requirements or attributes,
including knowledge, skills,
or abilities (KSAs). In such cases, beyond showing that the
selection procedure reflects
a representative sample of the implicated KSA, it must
additionally be documented
that the KSA is needed to perform important work tasks
(Section 14.C.1; Section
14.C.4), and the KSA must be operationally defined in terms of
observable work
behaviors (Section 14.C.4).
The above notwithstanding, the Uniform Guidelines (1978)
explicitly prohibit con-
tent validity for tests focusing on traits or constructs, including
personality (Section
14.C.1). The logic underlying this restriction appears to be
based on the seemingly
reasonable notion that content-based validation becomes
increasingly difficult as the
focus of the selection test is farther removed from actual work
behaviors (Section
14.C.4; Landy, 1986; Lawshe, 1985). This logic was confirmed
in a subsequent
“Questions and Answers” document, where it is stated that,
The Guidelines emphasize the importance of a close
approximation between the content of
the selection procedure and the observable behaviors or
products of the job, so as to minimize
the inferential leap between performance on the selection
procedure and job performance
[emphasis added]. (See
http://www.uniformguidelines.com/questionandanswers.html)
Table 1. Review of Various Sources Regarding Content Validity and Personality Testing.

Uniform Guidelines (1978)
Description of content validity: "Evidence of the validity of a test or other selection procedure by a content validity study should consist of data showing that the content of the selection procedure is representative of important aspects of performance on the job for which the candidates are to be evaluated" (Section 5.B).
Position on personality measures: Explicit prohibition related to the use of content validity for tests that focus on traits or constructs, such as personality: ". . . a content strategy is not appropriate for demonstrating the validity of selection procedures which purport to measure traits or constructs, such as intelligence, aptitude, personality, common sense, judgment, leadership, and spatial ability" (Section 14.C.1).

SIOP Principles (2003)
Description of content validity: "Evidence for validity based on content typically consists of a demonstration of a strong linkage between the content of the selection procedure and important work behaviors, activities, worker requirements, or outcomes on the job" (p. 21).
Position on personality measures: Approval of a content validity approach for personality measures can be inferred from [1] the absence of an explicit prohibition against the use of content validity evidence for tests that focus on traits or constructs and [2] the stated scope of applicability for content-based evidence, which includes tests that focus on knowledge, skills, abilities, and other personal characteristics.

Joint Standards (1999/2014)
Description of content validity: "Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure" (p. 14).
Position on personality measures: Approval of a content validity approach for personality measures can be inferred from [1] the absence of an explicit prohibition against the use of content validity evidence for tests that focus on traits or constructs, [2] the explicit description of content validity as pertaining to "the relationship between the content of a test and the construct it is intended to measure" (p. 14), and [3] the broad definition of the term construct (see p. 217), which makes it clear that personality variables would fall under the definition of a construct.

General review of academic literature
Description of content validity: Most, if not all, descriptions of content validity found in the literature embody the core notion of documenting the linkage between the content of a test and a particular domain that represents the target of measurement and/or purpose of testing (Haynes et al., 1995).
Position on personality measures: The sources that specifically discuss this issue collectively indicate mixed opinions; while some authors have expressed reticence toward the use of content-based evidence for measures of personality (e.g., Goldstein et al., 1993; Lawshe, 1985; Wollack, 1976), others consider this restriction to be problematic (e.g., Landy, 1986; McDaniel et al., 2011) or view content validity as particularly relevant to personality testing (e.g., Murphy et al., 2009; O'Neill et al., 2009).

Note. SIOP = Society for Industrial and Organizational Psychology.
Interestingly, in an apparent application of this logic, the
guidelines permit content
validation for selection procedures focusing on KSAs (as noted
in the preceding para-
graph). In such cases, the inferential leap necessary to link
KSAs to job performance
is ostensibly greater than if the selection procedures were to
focus directly on work
behaviors, which explains why the guidelines include additional
requirements related
to the content validation of tests focusing on these worker
attributes (see Sections
14.C.1 and 14.C.4). Presumably, these additional requirements
serve to bridge the
larger inferential leap made when the test does not directly
focus on work behaviors.
Thus, the Uniform Guidelines do not limit the use of content
validity to actual samples
of work behavior, but additional evidence is needed to help
bridge the larger inferen-
tial leap made when selection tests target worker attributes (i.e.,
KSAs)—yet this same
reasoning is not extended to what the guidelines characterize as
traits or constructs.
The SIOP Principles (2003)
The SIOP Principles (2003) embody the formal pronouncements
of the Society for
Industrial and Organizational Psychology pertaining to
appropriate validation and use
of employee selection procedures. For content validation, the
principles state that,
“Evidence for validity based on content typically consists of a
demonstration of a
strong linkage between the content of the selection procedure
and important work
behaviors, activities, worker requirements, or outcomes on the
job” (p. 21). Like the
Uniform Guidelines (1978), the SIOP Principles stress the
importance of capturing a
representative sample of the target of measurement and further
establishing a close
correspondence between the selection procedure and the work
domain. The principles
also acknowledge that content validity evidence can be either
“logical or empirical”
(p. 6), highlighting the role of job analysis and expert judgment
in generating content-
based evidence. However, unlike the Uniform Guidelines, the
SIOP Principles do not
make a substantive distinction between work tasks/behaviors
and worker require-
ments/attributes in relation to content-based evidence but
rather, collectively, consider
selection procedures that focus on “work behaviors, activities,
and/or worker KSAOs”
(p. 21). Importantly, the addition of “O” to the KSA acronym
represents “other per-
sonal characteristics,” which are generally understood to
include “interests, prefer-
ences, temperament, and personality characteristics [emphasis
added]” (Brannick
et al., 2007, p. 62). Accordingly, although not explicitly stated,
the use of content
validity evidence as a means of validating personality test
inferences for employee
selection purposes appears to be consistent with the SIOP
Principles.
The Joint Standards (1999/2014)
The Joint Standards (1999/2014) are a set of guidelines for test
development and valida-
tion in the areas of psychological and educational testing, which
were developed by a
joint committee including representatives from the American
Educational Research
Association, the American Psychological Association, and the
National Council of
Measurement in Education. According to the standards, content
validity is examined by
specifying the content domain to be measured and then
conducting “logical or empirical
analyses of the adequacy with which the test content represents
the content domain and of
the relevance of the content domain to the proposed
interpretation of test scores” (p. 14).
In other words, content validity is described as pertaining to
“the relationship between the
content of a test and the construct it is intended to measure” (p.
14), where “construct” is
defined as “The concept or characteristic that a test is designed
to measure” (p. 217).
Because personality traits are easily understood as constructs,
the Joint Standards suggest
that personality test inferences may be subject to content-based
validation.
Academic Literature
It is also informative to examine the academic literature
regarding validation and per-
sonality testing. In doing so, several general observations can
be made. First, most if
not all definitions of content validity share the core notion of
documenting the linkage
between the content of a test and a particular domain that
represents the target of mea-
surement and/or purpose of testing (e.g., Aguinis et al., 2001;
Goldstein et al., 1993;
Haynes et al., 1995; Sireci, 1998). Second, as noted previously,
prominent writings on
personality testing in the workplace (e.g., Morgeson et al.,
2007b; O’Neill et al., 2013;
Ones et al., 2007; Rothstein & Goffin, 2006; Tett &
Christiansen, 2007) have tended
to ignore the applicability of content validation to personality
measures. Third, the
sources that do specifically address this issue present mixed
opinions. While some
have expressed reticence about content-based evidence for
measures of personality
(e.g., Goldstein et al., 1993; Lawshe, 1985; Wollack, 1976),
others consider this
restriction to be problematic (e.g., Landy, 1986; McDaniel et
al., 2011) or view content
validity as particularly relevant to personality testing (e.g.,
Murphy et al., 2009;
O’Neill et al., 2009). Thus, as with the technical standards and
guidelines discussed
above, those turning to the academic literature for guidance
might similarly come
away uncertain regarding the use of content validity evidence to
23. support personality
measures in an employee selection context.
What Are the Bases of Inconsistency?
This section attempts to identify the conceptual issues that form
the bases for disagree-
ment/misunderstanding regarding the use of content validity
evidence for personality
measures. Making these underlying matters explicit will help to
identify some com-
mon ground and the potential for a way forward. Based on the
review of documents
and literature above, the primary areas to be addressed include
(a) vestiges of the trini-
tarian view of validity, (b) the focus of the content match, and
(c) a clear understanding
of the inferences to be substantiated.
Vestiges of the Trinitarian View of Validity
Although it is now well-established that validity should be
characterized in a manner
consistent with the contemporary perspective described above
(Binning & Barrett,
1989; Joint Standards, 1999/2014; Landy, 1986; SIOP
Principles, 2003), the outdated
trinitarian view continues to exert substantial influence. Perhaps
the most prominent
example of this can be found in the Uniform Guidelines (1978),
which clearly reflects
a trinitarian view of validity yet remains an important document
that holds consider-
able weight in contemporary employee selection practice
(McDaniel et al., 2011).
There are at least two important concerns related to this residual
influence of the trini-
tarian perspective.
First, the trinitarian view of validity suggests that constructs
represent a separate
category of measurement that is somehow distinct from other
types of measurement
efforts. This can most readily be seen in the simple fact that
there is a separate label for
construct validity, as compared with other categories of
validity. As a result, the deter-
mination of which “type” of validity to focus on rests on
whether or not a construct—as
opposed to some other type of attribute—is the target of
measurement (Landy, 1986).
This same logic is embodied in the Uniform Guidelines (1978),
where it is indicated
that certain validation strategies are appropriate when
measuring traits or constructs
(e.g., construct validity studies), whereas other strategies are
not (e.g., content validity
studies). Importantly, this perspective is in direct opposition to
contemporary thinking
regarding validity, which suggests that all measurement efforts
ultimately implicate
constructs (Joint Standards, 1999/2014). This notion is
illuminated in a telling example
provided by Landy (1986), where he contrasts a hypothetical
typing ability test with a
measure of reasoning ability. Of particular relevance is the idea
that the typing ability
test more readily implicates observable behaviors and, thus,
could be subject to content
validation according to the Uniform Guidelines. Conversely, the
reasoning ability test
is more easily described as trait- or construct-focused, in turn,
precluding the use of
content validation according to the Uniform Guidelines.
Landy’s point was that both of
these tests actually focus on constructs (i.e., typing ability,
reasoning ability), neither of
which is directly observable. Rather, in both cases, one must
infer the level of the con-
struct possessed by an individual via the administration of a
test.
The above example is intended to highlight that all variables
measured by psycho-
logical tests can be characterized as constructs. As such, the
notion that some variables
are constructs while others are not is problematic at best and
also in direct opposition
to prevailing conceptions of test validation. However, this is not
meant to imply that
all constructs are the same. In the example above, it is clear that
the typing ability
construct might more easily be linked to job performance, as
contrasted with the rea-
soning ability construct, given that typing ability has a more
direct and obvious behav-
ioral manifestation (i.e., typing). Conversely, one might say that
a greater inferential
leap is required when validating a measure of reasoning ability,
as reasoning ability is
relatively farther removed from actual job behavior (although
not irreconcilably far
removed). This critical distinction will be discussed further in
the sections to follow.
For now, the important conclusion is that all psychological
variables that might be
measured in an employee selection context, including
personality, are no more and no less
constructs than any other.
A second concern related to the residual influence of the
trinitarian perspective is
that there appears to be a de facto preference given to criterion-
related validity
242 Public Personnel Management 50(2)
evidence when considering employee selection procedures
(Binning & Barrett, 1989).
Nevertheless, it is critical to acknowledge that there are various
circumstances where
criterion-related validity evidence is far from optimal. It has
been suggested that the
minimum sample size for a criterion-related study be no fewer
than 250 study partici-
pants (Biddle, 2011). Yet, McDaniel et al. (2011) assert that
most employers simply do
not have enough employees or applicants to conduct such a
study. The outcome of a
criterion-related study is also highly contingent on the quality
of the criterion measure.
In this regard, it is useful to note that the appropriate
development and validation of
criterion measures is often given far less attention than the
predictor side of the equa-
tion (Binning & Barrett, 1989). Furthermore, in the context of a
high-stakes testing
situation, conducting a criterion-related study might represent a
test security risk, as
sensitive test content will be presented to study participants
who might subsequently
share confidential test information. None of the above is meant
to suggest that crite-
rion-related validity evidence is always bad. However, it is
similarly important to real-
ize that there are various situations in which evidence based on
test content might well
be the preferred approach toward validation.
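The sample-size point can be made concrete with a back-of-the-envelope power calculation. The sketch below is our illustration, not drawn from the cited sources; it uses the standard Fisher z approximation, and the assumed validity of r = .20 and the conventional alpha and power levels are assumptions chosen for illustration.

```python
from math import atanh, ceil
from statistics import NormalDist

def min_n_for_correlation(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size needed to detect a population correlation r
    in a two-tailed test, using the Fisher z transformation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_beta = z.inv_cdf(power)           # quantile corresponding to desired power
    c = atanh(r)                        # Fisher z of the expected validity
    return ceil(((z_alpha + z_beta) / c) ** 2 + 3)

# Detecting a modest uncorrected validity of r = .20 already requires
# roughly 200 cases, consistent with the suggested minimum of 250.
print(min_n_for_correlation(0.20))
```

Smaller validities push the requirement far higher (r = .10 demands several hundred cases), underscoring why many employers cannot field an adequately powered criterion-related study.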
Taken together, once one debunks the notion that personality
measures somehow
represent a different category of construct measurement than
other “non-construct”
variables—and further acknowledge that criterion-related
validity evidence (although
extremely useful in many situations) should not be de facto
treated as the strategy of
choice—then the idea of gathering content validity evidence for
personality measures
becomes much more relevant. In the words of Binning and
Barrett (1989), “One could
reasonably argue that content-related and construct-related
evidence, when based on
sound professional judgment about appropriate test use, are
often superior to criterion-
related evidence” (p. 484).
Focus of the Content Match
Another issue that may be fueling misunderstanding regarding
the use of content
validity evidence has to do with the focus of the content match.
This issue becomes
apparent when one carefully examines the various definitions
for content validity dis-
cussed previously and shown in Table 1. For example, the
Uniform Guidelines (1978)
indicate that content validity is applicable when it can be shown
that “the content of
the selection procedure is representative of important aspects of
performance on the
job” (Section 5.B). This suggests that the primary focus of
content match should be
between the content of the selection procedure and
representative elements of job
performance, such as work tasks or behaviors. In contrast, the
Joint Standards
(1999/2014) indicate that content validity is applicable
whenever it can be shown that
there is an overlap between the content of a test and the
construct that is the focus of
measurement, which for employee selection is often some
worker attribute or require-
ment. Hence, there appears to be a duality of focus when it
comes to content validity
evidence, where some would argue such evidence is derived
from documenting over-
lap with the job performance domain of application (e.g.,
specific tasks), whereas
others would argue that content-based evidence is based on
content overlap with the
construct domain the test is intended to measure (e.g., some
personal attribute).
Fisher et al. 243
Clarity regarding this issue can be found by considering the
distinction between
work samples versus signs (Binning & Barrett, 1989;
Wernimont & Campbell, 1968).
In the context of employee selection procedures, samples refer
to those assessments
that directly implicate or elicit behaviors relevant to
performance on the job, such as
the typing test from Landy’s (1986) example discussed above,
where the test elicits
behavior (i.e., typing) that can be seen as interchangeable with
relevant on-the-job
behavior. In contrast, employee selection measures
characterized as signs refer to
those assessments that do not directly target behaviors from the
performance domain,
but nonetheless attempt to assess attributes or capabilities that
are thought to be rele-
vant for job performance. An illustration of this would be the
reasoning test from
Landy’s example, where the actual behaviors elicited by the
assessment (e.g., reading
logic problems, completing multiple choice questions) may be
less obviously relevant
to primary job functions, but the test nevertheless measures an
attribute that is undoubt-
edly critical for effective performance in many occupations
(i.e., the capacity for
effective reasoning).
In light of this distinction, employee selection measures that
fall on the work-sam-
ple end of the spectrum are likely to focus on work tasks or
behaviors, while those that
fall on the sign end of the spectrum are likely to focus on
worker attributes and require-
ments. By extension, it appears that the Uniform Guidelines
(1978) primarily permit
the use of content validity evidence in relation to work samples
that target important
tasks and behaviors, whereas the SIOP Principles (2003) and
Joint Standards
(1999/2014) would additionally permit the use of content
validity evidence for sign-
based measures that focus on job-relevant personal capacities
and worker require-
ments. Importantly, although both work samples and signs can
be used as predictors
for employee selection, these two meaningfully differ in terms
of whether the behav-
iors implicated and/or elicited by the test are isomorphic with,
or functionally similar
to, the performance construct domain (Binning & Barrett, 1989;
Binning & LeBreton,
2009). More specifically, work samples tend to implicate or
elicit behaviors that
exhibit a high degree of isomorphism with performance
behaviors, whereas this tends
to be less true (although not necessarily untrue) for sign-based
measures.
The above discussion helps to clarify the divergent perspectives
concerning the focus
of the content match. First, work samples tend to be constructed
with the intention of
sampling from the performance domain, as exhibited in the
increased isomorphism with
performance behaviors (Binning & Barrett, 1989). As such, the
appropriate focus of
content validity in the case of work-sample-based measures
should be the degree of
match between the content of the measure and the job
performance domain (Binning &
LeBreton, 2009). Indeed, this appears to be the central logic
espoused in the Uniform
Guidelines (1978), which primarily permit the use of content
validity evidence in rela-
tion to work samples that target important tasks and behaviors.
Second, sign-based mea-
sures tend to be constructed with the intention of sampling from
a separate construct
domain that is technically distinct from—yet conceptually
relevant to—the job perfor-
mance domain (Binning & Barrett, 1989). Therefore, the
appropriate focus of content
validity in the case of sign-based measures should be the degree
of match between the
content of the selection measure and whatever distinct construct
domain represents the
target of measurement (Binning & LeBreton, 2009). The logic
behind this latter asser-
tion appears consistent with the SIOP Principles (2003) and
Joint Standards (1999/2014),
which additionally permit the use of content validity evidence
for sign-based measures
that focus on job-relevant worker requirements and attributes.
Understanding the Inferences to Be Substantiated
As described previously, the collection of validity evidence is
ultimately aimed at
substantiating the inferences made from test scores. However, a
careful consideration
of the validation process indicates that there are several
potential inferences that might
be of relevance, especially when the intended purpose of testing
is taken into account
(Binning & Barrett, 1989). For example, one might aim to
substantiate the inference
that test scores accurately reflect varying levels of the
underlying construct being mea-
sured. While this is, indeed, a crucial inference with regard to
validation, it does not
necessarily capture the intended use of the test. As such, it is
additionally relevant to
substantiate the inference that test scores (and corresponding
levels of the target con-
struct) have relevance with regard to the purpose of testing. In
the case of employee
selection tests, this purpose typically informs inferences about
job performance. This,
in turn, suggests the importance of understanding the
performance domain, which may
additionally implicate other inferences, such as substantiating
the degree to which
operational measures of job performance accurately reflect an
underlying performance
construct. Given these various potential inferences, it becomes
important to clearly
understand the specific inferences that must be addressed as
part of the validation
process. Toward this end, a better understanding of such
inferences can be achieved by
visually depicting the relevant inferences that are necessary for
linking the test in
question to the construct it is intended to measure, in addition
to the purpose of testing.
This can be seen in Figure 1a, which is based on the seminal
work of Binning and
Barrett (1989; also see Arthur & Villado, 2008; Binning &
LeBreton, 2009; Guion,
2004; Joint Standards, 1999/2014).
The framework depicted in Figure 1a allows one to clearly
discuss the specific infer-
ences involved in various approaches to validation. First and
foremost, given that the
ultimate purpose of most (if not all) employee selection
measures is to understand job
performance, it has been argued that the inference linking the
predictor measure to the
job performance construct domain represents the most critical
inference in the employee
selection context (Binning & Barrett, 1989). This inference is
depicted in Figure 1a as
Inference 1. To the extent that the predictor measure is work-
sample-based, and, thus,
exhibits a high degree of isomorphism with the job performance
domain, this primary
link can be directly substantiated by documenting the degree of
overlap (or content
match) between the predictor assessment and job performance
(Binning & LeBreton,
2009). This is consistent with what the Uniform Guidelines
(1978) describe as a content
validity study for a selection measure that focuses on work
tasks or behaviors, and is
further consistent with what the SIOP Principles (2003) and
Joint Standards (1999/2014)
would consider evidence for validity based on test content.
However, to the degree that
the selection measure in question is sign-based, and, thus,
departs from isomorphism
Figure 1. Inferences in the validation process: (a) common
framework for depicting
inferences in the validation process, (b) modified framework
for depicting inferences in
the validation process.
with job performance, then the direct substantiation of Inference
1 becomes more tenu-
ous, in turn, requiring indirect substantiation via the pairing of
additional inferences—
which nonetheless represents an appropriate means of validation
(Binning & Barrett,
1989; Binning & LeBreton, 2009; Joint Standards, 1999/2014).
A second viable approach for validation would be to
collectively substantiate
Inference 2 and Inference 3, as shown in Figure 1a. Inference
2 represents the degree
to which the operational predictor measure reflects the
underlying construct it is pur-
ported to measure, while Inference 3 represents the degree to
which the underlying
predictor construct is relevant to the job performance domain.
Using the nomenclature
of the traditional trinitarian view of validity, this approach
might be labeled as con-
struct validity (Binning & Barrett, 1989; Joint Standards,
1999/2014), given that refer-
ence is made to underlying constructs. However, in light of the
fact that contemporary
conceptualizations of validity eschew the notion of a distinct
form of “construct valid-
ity,” Binning and LeBreton (2009) argue that this approach is
better characterized as
content-based evidence. Specifically, although the indirect
approach discussed here is
distinct from the direct substantiation of Inference 1 described
above, both approaches
rely heavily on comparing predictor content to some underlying
construct domain.
The difference lies in the fact that the substantiation of
Inference 1 compares the pre-
dictor content with the job performance construct domain, while
the substantiation of
Inference 2 compares the predictor content with the underlying
predictor construct
domain. Furthermore, to account for the less direct route of
substantiation in the latter
approach, additional evidence is required to bridge the larger
inferential leap to the job
performance domain, which is reflected in Inference 3.
Importantly, the collective
examination of Inference 2 and Inference 3 represents an
appropriate means for deriv-
ing content validity evidence for sign-based selection measures
that exhibit less iso-
morphism with the job performance domain (Binning &
LeBreton, 2009). Indeed, this
logic is explicitly included in the Joint Standards (1999/2014;
see pp. 172–173) and
also implicitly described in the Uniform Guidelines, where it is
stated that,
For any selection procedure measuring a knowledge, skill, or
ability [i.e., sign-based
measure] the user should show that (a) the selection procedure
measures and is a
representative sample of that knowledge, skill, or ability [i.e.,
Inference 2]; and (b) that
knowledge, skill, or ability is used in and is a necessary
prerequisite to performance of
critical or important work behavior(s) [i.e., Inference 3].
(Section 14.C.4)
Another approach to validation suggested by Figure 1a would be
to collectively
substantiate Inference 4 and Inference 5. Here, Inference 4
represents the empirical
relationship between the predictor measure and a job
performance/criterion measure,
while Inference 5 represents the degree to which the
performance/criterion measure
reflects the underlying performance construct domain it is
intended to capture. This
approach is analogous to what the Uniform Guidelines (1978)
would characterize as a
criterion-related validity study, and is further consistent with
what the SIOP Principles
(2003) and Joint Standards (1999/2014) would describe as
evidence for validity based
on relations to other variables. In practice, however, criterion-
related validity studies
often focus primarily and/or exclusively on Inference 4, to the exclusion of Inference
exclusion of Inference
5. Unfortunately, this means that validation efforts of this
nature are typically com-
pleted with only cursory reference to underlying theory or
consideration for the implicated
construct domains. This again highlights the fact that
criterion-related validation
should not necessarily or always be seen as the optimal strategy, especially from
strategy, especially from
the perspective of generating a theory-based and scientifically
grounded (as opposed
to purely empirical) understanding of tests and measures used
for employee selection.
Conversely, when both Inference 4 and Inference 5 are given
due consideration, one
can see that this approach mirrors the approach pertaining to
Inferences 2 and 3, sug-
gesting two different but comparably informative approaches to
understanding the
relevance of the focal predictor measure to the underlying job
performance domain.
Returning to the topic of content validity, the above discussion
suggests two poten-
tially viable approaches for examining content-based evidence
pertaining to an
employee selection test. First, to the degree that the measure is
work-sample-based,
and, thus, exhibits a high degree of isomorphism with the job
performance domain,
then the focus should be on Inference 1. This can be referred to
as validity evidence
based on test content for work-sample-based measures. Second,
to the extent that the
measure is sign-based, and, thus, departs from isomorphism
with job performance,
then the focus should be collectively on Inference 2 and
Inference 3. This can be
referred to as validity evidence based on test content for sign-
based measures. Figure
1b presents a modified framework that accounts for these
divergent approaches to
validation. Importantly, this framework is consistent with a
contemporary conceptual-
ization of validity that places primary emphasis on inferences in
the validation process
and also explicitly acknowledges the purpose of testing. At the
same time, this frame-
work also explicitly represents the degree of inferential leap
necessary for linking the
predictor measure with the performance construct domain—an
issue that is of para-
mount importance in the Uniform Guidelines (1978).
Specifically, as sign-based mea-
sures exhibit lower isomorphism with the job performance
domain as compared with
work-sample-based measures, additional evidence is needed to
bridge the larger infer-
ential leap, which is manifested in the requirement to
substantiate two inferences (e.g.,
Inferences 2 and 3), as opposed to just one (i.e., Inference 1).
As aptly summarized by
Binning and LeBreton (2009), “content validation involves
either (a) directly match-
ing predictor content to criterion CDs [construct domains] or (b)
matching predictor
content to psychological CDs which are in turn related to
criterion CDs (i.e., delineat-
ing psychological traits believed to influence job behavior)” (p.
489).
How to Gather Content Evidence for Personality
Measures?
Building on the preceding sections, our view is that evidence
based on test content can
and should be used as a means of validating personality test
inferences for employee
selection purposes. This practice is consistent with
contemporary conceptualizations of
validity, as embodied in the SIOP Principles (2003) and Joint
Standards (1999/2014),
which both place primary emphasis on inferences in the
validation process and further
view content-based evidence as one of several viable
approaches to validating such
inferences. Furthermore, the explicit prohibition against this
practice in the Uniform
Guidelines (1978) appears to be largely based on outdated
notions of validity and the
fallacious idea that tests focusing on constructs somehow
represent a different category
of measurement than other “non-construct” variables. Thus,
content validity evidence
should be treated as an appropriate means of validating
personality test inferences.
Accordingly, we can now make informed recommendations
regarding how best to
collect content validity evidence for personality measures.
There are at least two pri-
mary conduits through which one might maximize the content-
relevance of personal-
ity tests and generate appropriate content-based evidence,
including (a) maximizing
isomorphism with the job performance domain during initial test
development, and (b)
substantiating the appropriate inferences via expert judgment
after test development,
but prior to operational use.
Maximizing Isomorphism During Test Development
One way to increase the content-relevance of personality
measures used for employee
selection is to draw directly from the job performance domain
while developing the
test. The difference between work-sample-based measures (that
sample primarily
from the job performance domain) and sign-based measures
(that sample primarily
from a distinct yet related construct domain) is a matter of
degree, as opposed to a
strict categorical distinction. In other words, psychological tests
can move along this
continuum by sampling predominantly from one domain or the
other, in addition to a
combination of both (Spengler et al., 2009). Of relevance,
sampling from the perfor-
mance domain has the effect of increasing the degree of
isomorphism between the test
content and job performance, in turn, moving the measure
toward the work-sample
end of the spectrum.
By way of illustration, traditional measures of personality such
as those that might
be found in the International Personality Item Pool
(http://ipip.ori.org/; also see
Goldberg et al., 2006) sample primarily from the personality
construct domain of
interest. As such, an example item for the trait of extraversion
might include “I am the
life of the party." Conversely, personality measures specifically
designed to be rele-
vant in the workplace (e.g., Ellingson et al., 2013) sample both
from the personality
construct domain as well as the intended domain of application
(i.e., work). Here, an
example item for extraversion would be “I involve my coworker
in what I am doing.”
Notably, the personality statement in the latter example is more
obviously and directly
applicable to the job performance domain. Related to this, there
is a growing body of
literature that supports the practice of contextualizing
personality scale content to spe-
cifically reference the domain of work (e.g., Hunthausen et al.,
2003; Lievens et al.,
2008; Schmit et al., 1995; Shaffer & Postlethwaite, 2012; also
see Ones & Viswesvaran,
2001). While this research on contextualization does not
primarily focus on the issue
of content validity, the practice of modifying personality scale
content to explicitly
reference the work context nonetheless has the consequence of
extending content-
relevance to the job performance domain. Indeed, it has been
suggested that the use of
custom-developed personality tests based on work-
contextualization “is extremely
valuable because it may open up content validation as a
potential validation strategy”
(Morgeson et al., 2007a, p. 1043).
In summary, the content-relevance of personality measures used
for employee
selection can be improved by sampling both from the
personality construct domain
and the job performance domain. This practice is ultimately
manifested in the creation
and use of personality scale items that directly reference and/or
implicate the perfor-
mance domain in question. Practically, this can be accomplished
by first identifying
critical features of performance via job analysis efforts and
subsequently creating
personality-based items that explicitly reflect the identified
performance elements. If
this is done, the personality construct of interest will ultimately
be operationalized in
terms of relevant behaviors and experiences that collectively
comprise important
aspects of performance on-the-job. Consequently, the more the
personality items
directly implicate or reference job behaviors (e.g., “I show up to
work on time”)—in
addition to cognitive and affective performance-related
experiences (e.g., “I get ner-
vous when I talk to clients”)—the lower the inferential leap
necessary for linking the
content of the test directly to the job performance domain.
Substantiation of Appropriate Inferences via Expert Judgment
The considerations discussed in the preceding section are only
applicable if a new test
is being created or if the opportunity to modify an existing
general-focused personality
test is available. This section focuses on
generating content-based evi-
dence for existing or unmodifiable personality measures to be
used for employee
selection. For this, the primary mechanism of generating such
evidence involves elicit-
ing expert judgment regarding content-relevance. More
specifically, when the person-
ality measure in question exhibits a high degree of isomorphism
with the performance
domain of interest (e.g., contextualized personality scales),
content validity evidence
can be generated by eliciting expert judgment regarding the
direct overlap (or content
match) between the personality measure content and the job
performance domain (i.e.,
Inference 1 from Figure 1). However, to the extent that this is
not the case, as with
noncontextualized or general-focused personality measures,
content-based validation
would proceed via expert judgment regarding Inference 2 and
Inference 3. Example
scales that might be used to substantiate these various
inferences via expert judgment
are shown in Figure 2. These scales can be used to assist in the
collection of content-
based validity evidence.
The first approach involves substantiating the direct link
between the predictor
measure and the job performance construct domain (i.e.,
Inference 1). In other words,
expert judges are asked to indicate the degree to which the
specific items that comprise
the test are directly relevant to performance on-the-job. A
common method for quantifying
such ratings is the content validity ratio (Lawshe, 1975), which—as originally
which—as originally
conceptualized—asks subject-matter experts (SMEs) whether a
particular skill or
knowledge area measured by test items is “Essential,” “Useful
but not essential,” or
“Not necessary” for performance of the job in question.
Although the content validity
ratio was originally intended for tests that focus on knowledge
or skills, with slight
modifications, it can be applied to measures of personality as
well. Examples of this
can be seen in Figure 2. Other scales may also be created to
serve the same purpose as
long as they adequately capture the degree to which the test
content is relevant to the
job performance domain of interest.
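For concreteness, a minimal sketch of the computation follows; the item wording and the panel of ratings are invented for illustration.

```python
def content_validity_ratio(ratings: list[str]) -> float:
    """Lawshe's (1975) content validity ratio for a single item:
    CVR = (n_e - N/2) / (N/2), where n_e is the number of SMEs rating
    the item "Essential" and N is the total number of panelists."""
    n = len(ratings)
    n_essential = sum(1 for r in ratings if r == "Essential")
    return (n_essential - n / 2) / (n / 2)

# Ten hypothetical SMEs rate a work-contextualized personality item
# (e.g., "I show up to work on time") for relevance to the job.
panel = ["Essential"] * 8 + ["Useful but not essential", "Not necessary"]
print(content_validity_ratio(panel))  # (8 - 5) / 5 = 0.6
```

The CVR ranges from -1 (no SME says essential) to +1 (all do); values above 0 indicate that more than half the panel judged the item essential.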
Figure 2. Example scales for substantiating relevant inferences
related to content validation.
The second approach involves substantiating the link between
the test in question
and the underlying construct that it is purported to measure
(i.e., Inference 2), in addi-
tion to the link between the predictor construct and the
underlying job performance
domain (i.e., Inference 3). Aguinis et al.’s (2001) discussion of
the content validation
ratio suggests that it can also be modified to serve the purpose
of substantiating
Inference 2, as was the case with Inference 1 above. As noted
by Aguinis et al., expert
judges can be asked to “rate whether each item is essential,
useful but not essential, or
not necessary for measuring the attribute [emphasis added]” (p.
38). Thus, in accor-
dance with the above discussion, the primary distinction
between the methods for
substantiating Inference 1 and Inference 2 is that the former
focuses on the job perfor-
mance domain whereas the latter focuses on the predictor
construct domain. Again,
other scales may be created to serve the same purpose as long as
they adequately
capture the degree to which the test content is reflective of the
underlying predictor
construct (see Figure 2).
In terms of Inference 3, Arthur and Villado (2008) indicate that
this inference “is
established via job analysis processes that are intended to
identify the predictor con-
structs deemed requisite for the successful performance of the
specified job (or perfor-
mance) behaviors in question” (p. 436). In this regard,
personality-oriented job analysis
efforts can be used to identify and substantiate the relevance of
particular traits for the
job under investigation (O’Neill et al., 2013; Raymark et al.,
1997; Tett & Burnett,
2003; Tett & Christiansen, 2007). For example, as shown in
Figure 2, Raymark et al.
(1997) adopted the following stem for their Personality-Related
Position Requirements
Form: “Effective performance in this position requires the
person to . . .” (p. 724).
Taken together, expert judgment regarding the above inferences
constitutes evidence
for validity based on test content (Binning & LeBreton, 2009).
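As a sketch of how such judgments might be aggregated for Inference 3, the following assumes a PPRF-style relevance scale and a simple mean-rating cutoff; the traits, ratings, scale, and cutoff are all hypothetical choices for illustration, not a published scoring standard.

```python
from statistics import mean

def job_relevant_traits(ratings_by_trait: dict[str, list[int]],
                        cutoff: float = 2.0) -> list[str]:
    """Return traits whose mean SME relevance rating meets the cutoff.
    Ratings follow a PPRF-style stem ("Effective performance in this
    position requires the person to ...") on a 0-3 relevance scale."""
    return [trait for trait, ratings in ratings_by_trait.items()
            if mean(ratings) >= cutoff]

# Hypothetical panel of five SMEs rating three traits for a sales role.
sme_ratings = {
    "conscientiousness": [3, 3, 2, 3, 2],  # mean 2.6 -> retained
    "extraversion":      [2, 3, 2, 2, 3],  # mean 2.4 -> retained
    "openness":          [1, 0, 1, 2, 1],  # mean 1.0 -> dropped
}
print(job_relevant_traits(sme_ratings))
```

In practice the retained traits would then be mapped onto the personality scales under consideration, completing the linkage from predictor construct to performance domain.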
Discussion
There are conflicting recommendations regarding the use of
content validity evidence
to support personality test inferences for employee selection.
Unfortunately, inconsis-
tencies of this nature may be the inevitable result of various
different constituencies
(e.g., legal, professional, scientific) jointly vying to determine
standards and guide-
lines within the collective enterprise of test use and validation
(see Binning & Barrett,
1989; Landy, 1986; McDaniel et al., 2011). As such, it may be
an unrealistic ideal to
achieve a perfect resolution that fully addresses any and all
potential inconsistencies.
Nonetheless, it is our hope that the discussion presented above
has ameliorated the
ambiguity to some meaningful degree. In particular, we believe
that the recommenda-
tions above concerning content validity evidence and
personality testing are—to the
greatest extent possible—consistent with the spirit and intention
of prevailing stan-
dards and guidelines, even if not with certain technical
proscriptions (Uniform
Guidelines, 1978). That being said, there remain a few caveats
to be discussed below.
First, aside from the explicit prohibition against content validity
for traits and con-
structs, another requirement in the Uniform Guidelines (1978) is
that tests validated
via a content-based strategy should focus on attributes that are
operationally defined
in terms of observable job behaviors (Section 14.C.4). This is
potentially problematic
for measures of personality, as personality constructs implicate
not only observable
behavioral manifestations (e.g., “I show up to work on time”),
but also various cogni-
tive and affective experiences (e.g., “I get nervous when I talk
to clients”). As a poten-
tial solution, when considering the cognitive and affective
experiences implicated by
a particular construct, it is helpful to think about the work-
relevant behavioral conse-
quences of such experiences (as informed by job analysis
efforts), and correspond-
ingly develop items to reflect such behaviors. Despite this
potential solution, we view
the strict requirement to operationalize all measures that are
subject to content valida-
tion in terms of observable behaviors as problematic.
Specifically, in an information-
based economy, many critical features of job performance may
not be directly or
obviously observable. As a result, measures that are limited to
observable behaviors
may suffer from construct deficiency to the extent that non-observable
experiences
represent important aspects of the construct domain.
Furthermore, the recommenda-
tions above provide viable paths for content validation
regardless of whether the test
in question exhibits complete isomorphism with performance in
terms of observable
job behaviors. To the extent that it does, the substantiation of
Inference 1 represents an
appropriate focus. Conversely, to the extent that it does not, the
collective substantia-
tion of Inference 2 and Inference 3 becomes an appropriate
alternative (Binning &
LeBreton, 2009). Regardless, those concerned about strict
adherence to the Uniform
Guidelines (1978) should consider limiting the personality items
used to those that
directly reference observable behavior.
Second, considering the critical attention given to the Uniform
Guidelines (1978),
it might come across as though our goal is to somehow vilify
these guidelines. This is
certainly not the case. The guidelines were created with
admirable intentions regard-
ing standardization of validation efforts and the prevention of
discrimination in
employment practices. These are extremely important concerns,
and we support efforts
that strive to accomplish these noble ideals. At the same time, it
is important that
efforts of this nature are concordant with contemporary
scientific understanding per-
taining to the subject matter. Unfortunately, as described by
McDaniel et al. (2011),
the Uniform Guidelines have “not been revised in over 3
decades [and as a result are]
substantially inconsistent with scientific knowledge and
professional guidelines and
practice” (p. 494). As such, our intention is not to disparage
these guidelines, but rather
to ensure that validation practices are consistent with
contemporary conceptualiza-
tions of validity. We also believe that the value of the above
discussion extends far
beyond the Uniform Guidelines and employee selection in the
United States. As noted
in the introduction, it is important to carefully consider
appropriate validation strate-
gies so that accurate inferences are made about all test-takers,
regardless of when,
where, or why testing and validation efforts occur. Accordingly,
the discussion pre-
sented herein is likely to be of relevance for content-based
validation efforts in all
areas that utilize psychological testing, including, for example,
educational testing,
clinical practice, and international employee selection efforts.
Third, not everyone may agree with our choice of terminology
in relation to the two
paths for content-based validation described above.
Specifically, we opted to label the
Fisher et al. 253
substantiation of Inference 1 as validity evidence based on test
content for work-sam-
ple-based measures and the collective substantiation of
Inferences 2 and 3 as validity
evidence based on test content for sign-based measures (see
Figure 1). In considering
our choice for terminology, we explored three possible avenues.
First, one possibility
was to adopt terminology that is consistent with the historical
trinitarian view of valid-
ity, which places primary emphasis on three “types” of
validity—content, criterion, and
construct. From this perspective, Inference 1 might be labeled
as content validity and
Inferences 2 and 3 as construct validity. However, as is clearly
outlined above, this trini-
tarian view is considered problematic and inconsistent with
contemporary thinking
regarding validation. Second, another possibility is to create a
new label to replace
construct validity for Inferences 2 and 3, given that a primary
criticism of the historical
trinitarian view is the notion of a separate category of construct
validity. While we
understand the appeal of this second approach, we also believe
that creating a new label
might have the unintended consequence of adding further
confusion to the literature,
especially considering that there is already much potential for
confusion in the arena of
validation terminology. This brings us to our third and favored
approach, which is to
adopt terminology that is consistent with a contemporary
perspective on validity,
wherein validity is a unitary concept with various sources of
supporting evidence. From
this perspective, the labeling of different inferential pathways
should be based on a
thoughtful consideration of which form of validity evidence
best aligns with the validation activities implicated by the paths of interest. Regarding
validity evidence based on
test content, this broadly refers to an analysis of whether the
content of a measure ade-
quately reflects (or samples from) a relevant underlying
construct domain, which is
precisely what is involved in both Inference 1 and the
combination of Inferences 2/3.
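One established way to quantify this kind of content-adequacy judgment is Lawshe's (1975) content validity ratio (CVR), cited in the references, in which each subject matter expert rates whether an item is essential to the construct domain. A minimal sketch (the panel size and ratings below are hypothetical):

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's (1975) CVR: (n_e - N/2) / (N/2), ranging from -1 to +1.

    n_essential -- number of panelists rating the item "essential"
    n_panelists -- total number of subject matter experts on the panel
    """
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical ratings for three personality items judged against
# a job-analysis-derived construct domain:
print(content_validity_ratio(9, 10))   # 0.8  (strong agreement)
print(content_validity_ratio(5, 10))   # 0.0  (chance-level agreement)
print(content_validity_ratio(2, 10))   # -0.6 (most judges reject the item)
```

Items whose CVR meets the critical value for the given panel size would be retained; the thresholds themselves are tabled in Lawshe (1975).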
Finally, and perhaps most critically, we are by no means
arguing that a content-
based strategy should necessarily be the preferred or only
method of validation. The
important point is that content-based evidence is not inherently
more or less appropri-
ate than other sources of validity evidence. Rather, evidence
based on test content
represents one of several potential sources (Joint Standards,
1999/2014), and the par-
ticular circumstances of the validation effort should be carefully
considered before
determining which source(s) of evidence are most appropriate.
Validity generaliza-
tion, validity transport, and synthetic validity also represent
viable options.
Furthermore, validation efforts need not (and should not) be
limited to just one form
of evidence. For example, content-based evidence can be
combined with evidence
pertaining to an empirical relationship with a criterion measure,
which, in turn, would
result in a stronger validity argument. Or, to the extent that
faking or response distor-
tion is a concern (Morgeson et al., 2007b; Ones et al., 2007),
evidence pertaining to
response processes might be gathered as well. In a similar vein,
although we discuss
two potential paths for content validation (i.e., Inference 1 vs.
Inferences 2 and 3; see
Figure 1), this does not preclude efforts at validating all three
inferences, which again
would result in a stronger validity argument. Ultimately, “the
process of validation
involves accumulating relevant evidence to provide a sound
scientific basis for the
proposed score interpretations” (Joint Standards, 1999/2014, p.
11). To the extent
feasible, the more evidence the better.
254 Public Personnel Management 50(2)
Authors’ Note
An earlier version of this manuscript was presented as a poster
session at the 31st Annual
Conference of the Society for Industrial and Organizational
Psychology in Anaheim, California.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with
respect to the research, authorship,
and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial
support for the research, authorship,
and/or publication of this article: This work was, in part,
supported by a Faculty Development
Summer Fellowship Grant awarded by The University of Tulsa
to the first author.
ORCID iD
David M. Fisher https://orcid.org/0000-0002-7810-3494
References
Aguinis, H., Henle, C. A., & Ostroff, C. (2001). Measurement in
work and organizational psy-
chology. In N. Anderson, D. S. Ones, H. K. Sinangil, & C.
Viswesvaran (Eds.), Handbook
of industrial, work and organizational psychology (Vol. 1, pp.
27–50). SAGE.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association. (Original work published 1999)
Arthur, W., Jr., & Villado, A. J. (2008). The importance of
distinguishing between constructs
and methods when comparing predictors in personnel selection
research and practice.
Journal of Applied Psychology, 93, 435–442.
https://doi.org/10.1037/0021-9010.93.2.435
Biddle, D. A. (2011). Adverse impact and test validation: A
practitioner’s handbook (3rd ed.).
Infinity Publishing.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel
decisions: A conceptual analysis
of the inferential and evidential bases. Journal of Applied
Psychology, 74, 478–494. https://
doi.org/10.1037/0021-9010.74.3.478
Binning, J. F., & LeBreton, J. M. (2009). Coherent
conceptualization is useful for many things,
and understanding validity is one of them. Industrial and
Organizational Psychology, 2,
486–492. https://doi.org/10.1111/j.1754-9434.2009.01178.x
Brannick, M. T., Levine, E. L., & Morgeson, F. P. (2007). Job
and work analysis: Methods,
research, and applications for human resource management (2nd
ed.). SAGE.
Ellingson, J. E., Heggestad, E. D., & Myers, H. (2013). The
workplace IPIP: A contextualized
measure of personality [Unpublished manuscript].
Equal Employment Opportunity Commission, Civil Service
Commission, Department of Labor,
& Department of Justice. (1978). Uniform guidelines on
employee selection procedures.
Federal Register, 43, 38290–38315.
Furr, R. M., & Bacharach, V. R. (2014). Psychometrics: An
introduction (2nd ed.). SAGE.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R.,
Ashton, M. C., Cloninger, C. R., &
Gough, H. G. (2006). The international personality item pool
and the future of public-
domain personality measures. Journal of Research in
Personality, 40, 84–96. https://doi.
org/10.1016/j.jrp.2005.08.007
Goldstein, I. L., Zedeck, S., & Schneider, B. (1993). An
exploration of the job analysis–content
validity process. In N. Schmitt & W. C. Borman (Eds.),
Personnel selection in organiza-
tions (pp. 3–34). Jossey-Bass.
Guion, R. M. (1980). On trinitarian doctrines of validity.
Professional Psychology, 11, 385–
398. https://doi.org/10.1037/0735-7028.11.3.385
Guion, R. M. (2004). Validity and reliability. In S. G.
Rogelberg (Ed.), Handbook of research
methods in industrial and organizational psychology (pp. 57–
76). Blackwell Publishing.
Haynes, S. N., Richard, D., & Kubany, E. S. (1995). Content
validity in psychological assess-
ment: A functional approach to concepts and methods.
Psychological Assessment, 7, 238–
247. https://doi.org/10.1037/1040-3590.7.3.238
Hsu, C. (2004). The testing of America. U.S. News and World
Report, 137, 68–69.
Hunthausen, J. M., Truxillo, D. M., Bauer, T. N., & Hammer, L.
B. (2003). A field study of
frame-of-reference effects on personality test validity. Journal
of Applied Psychology, 88,
545–551. https://doi.org/10.1037/0021-9010.88.3.545
Jeanneret, P. R., & Zedeck, S. (2010). Professional
guidelines/standards. In J. L. Farr & N. T.
Tippins (Eds.), Handbook of employee selection (pp. 593–625).
Routledge.
Landy, F. J. (1986). Stamp collecting versus science: Validation
as hypothesis testing. American
Psychologist, 41, 1183–1192. https://doi.org/10.1037/0003-
066X.41.11.1183
Lawshe, C. H. (1975). A quantitative approach to content
validity. Personnel Psychology, 28,
563–575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
Lawshe, C. H. (1985). Inferences from personnel tests and their
validity. Journal of Applied
Psychology, 70, 237–238. https://doi.org/10.1037/0021-
9010.70.1.237
Lievens, F., De Corte, W., & Schollaert, E. (2008). A closer
look at the frame-of-reference
effect in personality scale scores and validity. Journal of
Applied Psychology, 93(2), 268–
279. https://doi.org/10.1037/0021-9010.93.2.268
McDaniel, M. A., Kepes, S., & Banks, G. C. (2011). The
uniform guidelines are a detriment
to the field of personnel selection. Industrial and Organizational
Psychology, 4, 494–514.
https://doi.org/10.1111/j.1754-9434.2011.01382.x
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck,
J. R., Murphy, K., & Schmitt,
N. (2007a). Are we getting fooled again? Coming to terms with
limitations in the use of
personality tests for personnel selection. Personnel Psychology,
60, 1029–1047. https://doi.
org/10.1111/j.1744-6570.2007.00100.x
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck,
J. R., Murphy, K., & Schmitt,
N. (2007b). Reconsidering the use of personality tests in
personnel selection contexts.
Personnel Psychology, 60, 683–729.
https://doi.org/10.1111/j.1744-6570.2007.00089.x
Murphy, K. R., Dzieweczynski, J. L., & Zhang, Y. (2009).
Positive manifold limits the rel-
evance of content-matching strategies for validating selection
test batteries. Journal of
Applied Psychology, 94, 1018–1031.
https://doi.org/10.1037/a0014075
O’Neill, T. A., Goffin, R. D., & Rothstein, M. (2013).
Personality and the need for personality-
oriented work analysis. In N. D. Christiansen & R. P. Tett
(Eds.), Handbook of personality
at work (pp. 226–252). Routledge.
O’Neill, T. A., Goffin, R. D., & Tett, R. P. (2009). Content
validation is fundamental for opti-
mizing the criterion validity of personality tests. Industrial and
Organizational Psychology,
2, 509–513. https://doi.org/10.1111/j.1754-9434.2009.01184.x
Ones, D. S., Dilchert, S., Viswesvaran, C., & Judge, T. A.
(2007). In support of personality
assessment in organizational settings. Personnel Psychology,
60, 995–1027. https://doi.
org/10.1111/j.1744-6570.2007.00099.x
https://doi.org/10.1037/0735-7028.11.3.385
https://doi.org/10.1037/1040-3590.7.3.238
https://doi.org/10.1037/0021-9010.88.3.545
https://doi.org/10.1037/0003-066X.41.11.1183
https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
https://doi.org/10.1037/0021-9010.70.1.237
https://doi.org/10.1037/0021-9010.93.2.268
https://doi.org/10.1111/j.1754-9434.2011.01382.x
https://doi.org/10.1111/j.1744-6570.2007.00100.x
https://doi.org/10.1111/j.1744-6570.2007.00100.x
https://doi.org/10.1111/j.1744-6570.2007.00089.x
https://doi.org/10.1037/a0014075
https://doi.org/10.1111/j.1754-9434.2009.01184.x
https://doi.org/10.1111/j.1744-6570.2007.00099.x
https://doi.org/10.1111/j.1744-6570.2007.00099.x
Ones, D. S., & Viswesvaran, C. (2001). Integrity tests and other
criterion-focused occupational
personality scales (COPS) used in personnel selection.
International Journal of Selection
and Assessment, 9, 31–39. https://doi.org/10.1111/1468-
2389.00161
Raymark, P. H., Schmit, M. J., & Guion, R. M. (1997).
Identifying potentially useful person-
ality constructs for employee selection. Personnel Psychology,
50, 723–736. https://doi.
org/10.1111/j.1744-6570.1997.tb00712.x
Rothstein, M. G., & Goffin, R. D. (2006). The use of
personality measures in personnel selec-
tion: What does current research support? Human Resource
Management Review, 16, 155–
180. https://doi.org/10.1016/j.hrmr.2006.03.004
Schmit, M. J., & Ryan, A. M. (2013). Legal issues in
personality testing. In N. D. Christiansen
& R. P. Tett (Eds.), Handbook of personality at work (pp. 525–
542). Routledge.
Schmit, M. J., Ryan, A. M., Stierwalt, S. L., & Powell, A. B.
(1995). Frame-of-reference effects
on personality scale scores and criterion-related validity.
Journal of Applied Psychology,
80, 607–620. https://doi.org/10.1037/0021-9010.80.5.607
Schmitt, M. (2006). Conceptual, theoretical, and historical
foundations of multimethod assess-
ment. In M. Eid & E. Diener (Eds.), Handbook of multimethod
measurement in psychology
(pp. 9–25). American Psychological Association.
Shaffer, J. A., & Postlethwaite, B. W. (2012). A matter of
context: A meta-analytic investiga-
tion of the relative validity of contextualized and
noncontextualized personality measures.
Personnel Psychology, 65, 445–494.
https://doi.org/10.1111/j.1744-6570.2012.01250.x
Sireci, S. G. (1998). The construct of content validity. Social
Indicators Research, 45, 83–117.
https://doi.org/10.1023/A:1006985528729
Society for Industrial and Organizational Psychology. (2003).
Principles for the validation and
use of personnel selection procedures (4th ed.).
Spengler, M., Gelléri, P., & Schuler, H. (2009). The construct
behind content validity: New
approaches to a better understanding. Industrial and
Organizational Psychology, 2, 504–
508. https://doi.org/10.1111/j.1754-9434.2009.01183.x
Tan, J. A. (2009). Babies, bathwater, and validity: Content
validity is useful in the validation
process. Industrial and Organizational Psychology, 2, 514–516.
https://doi.org/10.1111/
j.1754-9434.2009.01185.x
Tett, R. P., & Burnett, D. D. (2003). A personality trait-based
interactionist model of job per-
formance. Journal of Applied Psychology, 88, 500–517.
https://doi.org/10.1037/0021-
9010.88.3.500
Tett, R. P., & Christiansen, N. D. (2007). Personality tests at
the crossroads: A response to
Morgeson, Campion, Dipboye, Hollenbeck, Murphy, and
Schmitt (2007). Personnel
Psychology, 60, 967–993. https://doi.org/10.1111/j.1744-
6570.2007.00098.x
Tett, R. P., Simonet, D. V., Walser, B., & Brown, C. (2013).
Trait activation theory: Applications,
developments, and implications for person–workplace fit. In N.
D. Christiansen & R. P.
Tett (Eds.), Handbook of personality at work (pp. 71–100).
Routledge.
Thornton, G. C., III. (2009). Evidence of content matching is
evidence of validity.
Industrial and Organizational Psychology, 2, 469–474.
https://doi.org/10.1111/j.1754-
9434.2009.01175.x
Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples and
criteria. Journal of Applied
Psychology, 52, 372–376. https://doi.org/10.1037/h0026244
Wollack, S. (1976). Content validity: Its legal and psychometric
basis. Public Personnel
Management, 5, 397–408.
Author Biographies
David M. Fisher is an assistant professor of psychology at The
University of Tulsa. Prior to his
academic position, he did consulting work that focused on
selection and testing for public safety
agencies. His research interests include employee selection,
organizational work teams, and
occupational health/resilience.
Christopher R. Milane is a senior project manager of research
services at Qualtrics. The
majority of this manuscript was written while he was a graduate
student at The University of
Tulsa. His research interests include employee selection,
organizational work teams, and leader-
ship development.
Sarah Sullivan is the department coordinator at the Doerr
Institute for New Leaders at Rice
University. The majority of this manuscript was written while
she was a graduate student at The
University of Tulsa. Her research interests include leadership
development, employee selection,
and organizational work teams.
Robert P. Tett is professor of Industrial-Organizational (I-O)
Psychology and director of the
I-O Graduate Program at The University of Tulsa where he
teaches courses in personnel selec-
tion, psychometrics, statistics, personality at work, and
evolutionary psychology. His research
targets personality trait-situation interactions, meta-analysis,
leadership competencies, and trait-
emotional intelligence.
Methodological and Statistical Advances in the Consideration of
Cultural
Diversity in Assessment: A Critical Review of Group
Classification and
Measurement Invariance Testing
Kyunghee Han, Stephen M. Colarelli, and Nathan C. Weed
Central Michigan University
One of the most important considerations in psychological and
educational assessment is the extent to
which a test is free of bias and fair for groups with diverse
backgrounds. Establishing measurement
invariance (MI) of a test or items is a prerequisite for
meaningful comparisons across groups as it ensures
that test items do not function differently across groups.
Demonstration of MI is particularly important
in assessment settings where test scores are used in decision
making. In this review, we begin with an
overview of test bias and fairness, followed by a discussion of
issues involving group classification,
focusing on categorizations of race/ethnicity and sex/gender.
We then describe procedures used to
establish MI, detailing steps in the implementation of
multigroup confirmatory factor analysis, and
discussing recent developments in alternative procedures for
establishing MI, such as the alignment
method and moderated nonlinear factor analysis, which
accommodate reconceptualization of group
categorizations. Lastly, we discuss a variety of important
statistical and conceptual issues to be
considered in conducting multigroup confirmatory factor
analysis and related methods and conclude with
some recommendations for applications of these procedures.
Public Significance Statement
This article highlights some important conceptual and statistical issues that researchers should
consider in research involving MI to maximize the
meaningfulness of their results. Additionally, it
offers recommendations for conducting MI research with
multigroup confirmatory factor analysis
and related procedures.
Keywords: test bias and fairness, categorizations of
race/ethnicity and sex/gender, measurement
invariance, multigroup CFA
Supplemental materials:
http://dx.doi.org/10.1037/pas0000731.supp
When psychological tests are used in diverse populations, it
is assumed that a given test score represents the same level of
the underlying construct across groups and predicts the same
outcome score. Suppose that two hypothetical examinees, a
middle-aged Mexican immigrant woman and a Jewish European
American male college student, each produced the same score
on a measure of depression. We would like to conclude that the
examinees exhibit the same severity and breadth of depression
symptoms and that their therapists would rate them similarly on
relevant behavioral and symptom measures. If empirical evidence indicates otherwise, and such conclusions are not justified, scores on the measure are said to be biased.
Although it has been defined variously, a representative definition refers to psychometric bias as “systematic error in estimation of a value.” A biased test “is one that systematically
overestimates or underestimates the value of the variable it is
intended to assess” due to group membership, such as ethnicity
or gender (Reynolds & Suzuki, 2013, p. 83). The “value of the
variable it is intended to assess” can either be a “true score”
(see
S1 in the online supplemental materials) on the latent construct
or a score on a specified criterion measure. The former application concerns what is sometimes termed measurement bias, in which the relationship between test scores and the latent attribute that these test scores measure varies for different groups (Borsboom, Romeijn, & Wicherts, 2008; Millsap, 1997),
whereas the latter application concerns what is referred to as
predictive bias, which entails systematic inaccuracies in the
prediction of a criterion from a test depending upon group
membership (Cleary, 1968; Millsap, 1997).
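The distinction can be made concrete with a small simulation (all numbers below are hypothetical, chosen only for illustration): give two groups identical latent trait distributions, but let one item's intercept shift for the second group. Observed item scores then differ across groups even though the latent attribute does not, which is the signature of measurement bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Identical latent trait distributions in both groups.
theta_a = rng.normal(0.0, 1.0, n)
theta_b = rng.normal(0.0, 1.0, n)

# A single item with loading 0.8; group B's version carries a
# +0.4 intercept shift (the hypothesized measurement bias).
loading = 0.8
item_a = loading * theta_a + rng.normal(0.0, 0.5, n)
item_b = loading * theta_b + 0.4 + rng.normal(0.0, 0.5, n)

print(f"Latent means:   A = {theta_a.mean():.2f}, B = {theta_b.mean():.2f}")
print(f"Observed means: A = {item_a.mean():.2f}, B = {item_b.mean():.2f}")
# Equal latent means but unequal observed item means: comparing raw
# scores across groups would misattribute the intercept shift to a
# genuine trait difference.
```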
Kyunghee Han, Stephen M. Colarelli, and Nathan C. Weed,
Department
of Psychology, Central Michigan University.
This article has not been published elsewhere, nor has it been
submitted simultaneously for publication elsewhere. The
author(s) de-
clared no potential conflicts of interest with respect to the
research,
authorship, and/or publication of this article. The author(s)
received no
funding for this study.
Correspondence concerning this article should be addressed to Kyunghee Han, Department of Psychology, Central Michigan University, Mount Pleasant, MI 48859. E-mail: [email protected]
Appraisals of test fairness include multifaceted aspects of the assessment process, lack of test bias being only one facet (American Educational Research Association, American Psychological Association [APA], & National Council on Measurement in Education, 2014; Society for Industrial and Organizational Psychology, 2018; see S2 in the online supplemental materials).
In the example above, the measure of depression may be unfair
for the Mexican female client if an English language version of
the
measure was used without evaluating her English proficiency, if
her score was derived using American norms only, if
computerized
administration was used, or if use of the test leads her to be less
likely than members of other groups to be hired for a job.
Although
test bias is not a necessary condition for test unfairness to exist,
it
may be a sufficient condition (Kline, 2013). Accordingly, it is
especially important to evaluate whether test scores are biased
against vulnerable groups.
The evaluation of test bias and test fairness each entails a
comparison of one group of people with another. While asking
the
question, “Is a test biased?” we are also implicitly asking
“against
or for which group?” Similarly, if we are concerned about using
a
test fairly, we must ask: are the outcomes based on the results
of
the test apportioned fairly to groups of people who have taken
the
test? Thus, the categorization of people into distinct groups is a
sine qua non of many aspects of psychological assessment re-
search. Racial/ethnic and sex/gender categories are prominent
fea-
tures of the social, cultural, and political landscapes in the
United
States (e.g., Helms, 2006; Hyde, Bigler, Joel, Tate, & van
Anders,
2019; Jensen, 1980; Newman, Hanges, & Outtz, 2007), and have
therefore been the most commonly studied group variables in
bias
research (e.g., Warne, Yoon, & Price, 2014). Most of the initial
research on and debates about test bias and fairness in the
United
States stemmed from political movements addressing race and
sex
discrimination (e.g., Sackett & Wilk, 1994). In service of
pressing
research on questions of discrimination and economic
inequality, it
thus became commonplace among psychologists and social
scien-
tists to categorize people crudely into groups (based primarily
on
race, ethnicity, and sex/gender) without much thought to the
mean-
ing and validity of those categorizations (e.g., Hyde et al.,
2019;
Yee, 1983; Yee, Fairchild, Weizmann, & Wyatt, 1993). This has
changed somewhat over the past two decades as scholarship by
psychologists and others has increasingly focused on nuances of
identity, multiculturalism, intersectionality, and multiple
position-
alities (Cole, 2009; Song, 2017). This scholarship has
emphasized
that racial, ethnic, and gender classifications can be complex,
ambiguous, and debatable—and that identities are often self-
constructed and can be fluid (Helms, 2006; Hyde et al., 2019).
The
first goal of this review, therefore, is to overview contemporary
issues involving race/ethnicity and sex/gender classifications in
bias research and to describe alternative approaches to the mea -
surement of these variables.
The psychometric methods used to examine test bias usually
depend on the definition of test bias operating for a given appli -
cation. Evaluating predictive bias (i.e., establishing predictive
in-
variance) often involves regressing total scores from a criterion
measure onto total scores on the measure of interest, and comparing regression slopes and intercepts across groups (Cleary, 1968).
Evaluating measurement bias (i.e., establishing measurement in-
variance [MI]) often necessitates more advanced quantitative
meth-
ods, such as confirmatory factor analysis (CFA) or methods
deriv-
ing from item response theory, to compare the properties of
item
scores and scores on latent variables across different groups.
Multigroup confirmatory factor analysis (MGCFA) has been one
of the most commonly used techniques to examine MI (Davidov,
Meuleman, Cieciuch, Schmidt, & Billiet, 2014) because it pro-
vides a comprehensive framework for evaluating different forms
of MI. The second goal of this review is to provide a broad
overview of MGCFA and related procedures and their relevance
to
psychological assessment.
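A Cleary-style check of predictive invariance can be sketched as follows (the data are simulated purely for illustration; in practice the comparison would be done with a moderated regression including a group-by-test interaction, plus formal significance tests):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

def criterion_on_test(test, criterion):
    """Return (slope, intercept) of the criterion-on-test regression."""
    slope, intercept = np.polyfit(test, criterion, deg=1)
    return slope, intercept

# Simulated data: both groups share the same slope (0.6), but group B's
# criterion scores sit 5 points higher at every test score, so a common
# regression line would systematically under-predict group B.
test_a = rng.normal(50.0, 10.0, n)
test_b = rng.normal(50.0, 10.0, n)
crit_a = 0.6 * test_a + 20.0 + rng.normal(0.0, 5.0, n)
crit_b = 0.6 * test_b + 25.0 + rng.normal(0.0, 5.0, n)

slope_a, int_a = criterion_on_test(test_a, crit_a)
slope_b, int_b = criterion_on_test(test_b, crit_b)

print(f"Group A: slope = {slope_a:.2f}, intercept = {int_a:.2f}")
print(f"Group B: slope = {slope_b:.2f}, intercept = {int_b:.2f}")
# Similar slopes with different intercepts indicate intercept-based
# predictive bias in the Cleary (1968) sense.
```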
Although MGCFA is a well-established procedure in the evaluation of MI, it has limitations. MGCFA is not an optimal method
for conducting MI tests when many groups are involved. More-
over, the grouping variable in MGCFA must be categorical, and
therefore does not permit MI testing with continuous grouping
variables (e.g., age). As modern research questions may require
MI
testing across many groups, and with continuous
reconceptualiza-
tions of some of the grouping variables (e.g., gender), more
flex-
ible techniques are needed. Our third goal, therefore, is to
describe
two recent alternative methods for MI testing, the alignment
method and moderated nonlinear factor analysis, that aim to
over-
come these limitations. We conclude the review with a
discussion
of some important statistical and conceptual issues to be consid-
ered when evaluating MI, and include a list of recommended
practices.
Group Classifications Used in Bias Research
Racial and Ethnic Classifications
Race and ethnicity (see S3 in the online supplemental materials)
are conceptually vague and empirically complex social
constructs
that have been examined by numerous researchers across many
disciplines (Betancourt & López, 1993; Helms, Jernigan, &
Mascher, 2005; Yee et al., 1993). Consider race. As a biological
concept, it is essentially meaningless. In most cases, there is
more
genetic variation within so-called racial groups than between
racial
groups (Witherspoon et al., 2007). Even if we allow race to be
defined by a combination of specific morphological features and
ancestry, few “racial” populations are pure (Gibbons, 2017).
Most
are mixed—like real numbers, with infinite gradations. For
exam-
ple, although many African Americans trace their ancestry to
West
Africa, about 20% to 30% of their genetic heritage is from
Euro-
pean and American Indian ancestors (Parra et al., 1998), and
racial
admixture continues as the frequency of interracial marriages
increases (Rosenfeld, 2006; U.S. Census Bureau, 2008). Even if
one were to accept race as a combination of biological features
and
cultural and social identities (shared cultural heritage,
hardships,
and discrimination), there is the problem of degree. For
example,
while many Black Americans share social and cultural identities
based on roots in American slavery and racial discrimination,
not
all do, such as recent Black immigrants from the Caribbean.
Racial
and ethnic classifications are often conflated. In psychological
research, “Asian” is commonly used both as a cultural (Nisbett,
1482 HAN, COLARELLI, AND WEED
Peng, Choi, & Norenzayan, 2001) and racial category (Rushton,
1994). Yet it is a catch-all term based primarily on geography.
It
typically refers to people from (or whose ancestors are from)
South, Southeast, and Eastern Asia. The term Hispanic often
conflates linguistic, cultural, and sometimes even
morphological
features (Humes, Jones, & Ramirez, 2010).
In public policy, mixtures of racial (or ethnic) background has
only recently begun to be addressed. The U.S. Census, for
exam-
ple, did not include a multiracial category until 2000 (Nobles,
2000). We are only beginning to see assessment studies that
parse
people from traditional broad groupings into smaller, more
mean-
ingful and homogeneous groups. In one of the few studies that
identified different types of Asians, Appel, Huang, Ai, and Lin
(2011) found significant (and sometimes major) differences in
physical, behavioral, and mental health problems among
Chinese,
Vietnamese, and Filipina women in the U.S. More recently, Tal-
helm et al. (2014) found important differences in culture and
thought patterns within only one Asian country, China. People
in
northern China were significantly more individualistic than
those
in southern China, who were more collectivistic. With current
and historical farming practices as their theoretical centerpiece, they examined these practices as causal factors. In northern China
wheat has been farmed as a staple crop for millennia, whereas
in
southern China rice has been (and is) the staple crop. Talhelm et
al.
argued that the farming practices required by these two crops
required different types of social organization that, over time,
influenced cultural values and cognition. The work by Talhelm
and
colleagues is important because it is one of the first studies to
show—along with a powerful theoretical rationale—that there
are
important cultural differences between people from what has
typ-
ically been thought of as a relatively homogeneous racial and
cultural group.
In another seminal article, Gelfand and colleagues (2011) ex-
amined the looseness-tightness dimension of cultures in 33
coun-
tries. This dimension reflects the strength of norms and the
toler-
ance of deviant behavior. Loose cultures have weaker norms and
are more tolerant of deviant behavior. While there was
substantial
variation between countries, there was still considerable
variation
among countries typically considered “Asian.” Hong Kong was
the
loosest (6.3), while Malaysia was the tightest (11.8), with the
People’s Republic of China (7.9), Japan (8.6), South Korea