673Foundations of Psychological TestingNoel Hendrick.docx

67
3Foundations of
Psychological Testing
Noel Hendrickson/Photodisc/Thinkstock
Learning Outcomes
After reading this chapter, you should be able to
• Identify the purpose and uses of psychological testing.
• Describe the characteristics of a high-quality psychological
assessment tool or selection method.
• Explain the importance of reliability and validity.
• Identify commonly used psychological test design formats.
• Recognize the types of tests used to assess individual
differences.
• List the steps needed to develop and administer tests most
effectively.
• Discuss special issues in testing, including applicants’
reactions to testing and online administration.
• Summarize the importance of testing for positivity.
you83701_03_c03_067-102.indd 67 4/20/17 4:22 PM

© 2017 Bridgepoint Education, Inc. All rights reserved. Not for
resale or redistribution.
68
Section 3.2 What Are Tests?
3.1 The Importance of Testing
When you hear the word test, what comes to mind? For many
people, tests are not a pleasant
activity. Most of us can remember wishing, as children, to grow
up and finish school so we
would never have to take a test again. Of course, as adults, we
discover that tests affect our
lives long after we earn our diplomas. Tests determine whether
we can drive a car, get into
a graduate or job-training program, or earn a professional
certification. They influence our
career choices and, quite often, our career advancement.
What profession do you plan to pursue? Do you want to be a
doctor or a lawyer? How about
a police officer or firefighter? Perhaps you would like to earn
your MBA and start your own
business? Each of these examples, along with most professions,
require many years of tests,
demand high levels of knowledge and skills, and require
continued education and recertifica-
tion testing.
Businesses use tests to help determine whether job applicants
possess the skills and abilities
needed to perform a job. After an applicant is hired, tests will
help determine placement in an

appropriate training and development program. Throughout an
employee’s career, the orga-
nization may require testing for new job placements or
promotions.
As you can see, tests can have a significant influence on
people’s lives; they can help identify
talent and promote deserving candidates. But they can also be
misused. Unfortunately, there
are many poorly designed psychological tests on the market.
They seduce organizations with
promises of fantastic results but do little to identify quality
employees. I/O psychologists pos-
sess the knowledge, skills, and education to design, implement,
and score measures that meet
the legal and ethical standards for an effective psychological
test.
The goal of this chapter is not to teach you how to design
quality psychological tests, but
rather to acquaint you with the requirements, challenges, and
advantages of doing so. Fur-
thermore, understanding the test-making methods that I/O
psychologists use will make you
a more informed consumer of tests for your own personal and
professional goals.
3.2 What Are Tests?
In general, a test is an instrument or procedure that measures
samples of behavior or perfor-
mance. In an employment situation, tests measure an
individual’s employment and career-
related qualifications and characteristics. The Uniform
Guidelines on Employee Selection
Procedures (1978) defines a test as any method used to make a
decision about whether to

hire, retain, promote, place, demote, or dismiss an employee or
potential employee. By this
definition, then, any procedure that eliminates an applicant from
the selection process would
be defined as a test. As discussed in Chapter 2, examples
include application forms that eval-
uate education and experience; résumé screening processes;
interviews; reference checks;
performance in training programs; and psychological, physical
ability, cognitive, or knowl-
edge-based tests.
you83701_03_c03_067-102.indd 68 4/20/17 4:22 PM
69
I/O psychologists are concerned with design-
ing and implementing selection systems that
identify quality job candidates. Clearly, orga-
nizations want to hire the best workers, but
it is a real challenge to screen candidates
who vary widely in their KSAOs, behaviors,
traits, and attitudes—or what is known as
their individual differences. This is especially
true when hiring decisions are made with
only basic tools such as application forms
and short interviews. To help organizations
better measure applicants’ personal charac-
teristics, I/O psychologists have developed

psychological measurements to identify how
and to what extent people vary regarding
individual differences (Berry, 2003).
What Is the Purpose of
Psychological Testing and Selection Methods?
In employment, tests and other selection tools are used to
predict job performance. Keep in
mind that job performance can never be predicted with 100%
accuracy. The only way employ-
ers could reach such a high level of accuracy would be to hire
all the applicants for a particu-
lar job, have them perform the job, and then choose those with
the highest performance. Of
course, this approach is neither practical nor cost effective.
Moreover, even if an organiza-
tion could afford to hire a large number of applicants and retain
only those who performed
best, performance prediction is still not perfectly accurate. For
example, many organizations
have probationary periods in which the employer and the
employee try each other out before
a more permanent arrangement is established. Employees may
be motivated to perform at
a much higher level during the probationary period in order to
secure permanent employ-
ment, but performance levels may drop once the probationary
period is over. Moreover, job
performance may be influenced over time by a myriad of factors
that cannot be predicted or
managed.
Although it is impossible to perfectly predict job performance,
psychological testing and
selection methods can provide reasonable levels of prediction if
they accurately and consis-

tently assess predictors that are related to specific performance
criteria. As briefly introduced
in Chapter 2, accurately predicting performance criteria—
usually referred to as validity—
ensures that test results indicate performance outcomes, so that
those who score favorably
on the test are more likely to be high performers than those who
do not. Simply put, valid-
ity reflects the correlation between applicants’ scores on the
test or selection tool and their
actual performance. A high correlation indicates that test scores
can accurately predict per-
formance. A low correlation indicates that test scores are poorly
related to performance.
In order to assess the validity of a selection tool, job
performance must be quantifiable, mean-
ing that there is a set of numbers associated with applicants’
test scores so one can calculate
a correlation. Performance scores are usually obtained from
performance appraisal systems,
which will be discussed in more detail in Chapter 4. Poorly
designed performance appraisal
Fuse/Thinkstock
Students take tests to demonstrate
their knowledge of a particular subject.
Similarly, employers administer exams to
job applicants to measure employment
and career-related qualifications and
characteristics.
you83701_03_c03_067-102.indd 69 4/20/17 4:22 PM

70
systems can hinder an organization’s ability to assess the
validity of its selection methods.
For example, performance evaluations are sometimes highly
subjective. Some managers tend
to score all of their employees similarly regardless of
performance in order to ensure they all
receive a raise. Some do so to sidestep confrontation or to avoid
having to justify the decision.
This lack of variability in scores can bias the results of the
statistical analysis’s underlying
validity, preventing it from adequately calculating or comparing
the validity of various selec-
tion methods.
Another important determining factor of tests and other
selection tools is reliability. Also
referred to as consistency, reliability is the extent to which test
scores can be replicated over
time and across situations. A reliable test will reflect an
applicant’s aptitude rather than the
influence of other factors such as the interviewer, room
temperature, or noise level. For exam-
ple, many organizations use multiple interviews or panel
interviews to evaluate applicants so
that there are multiple raters scoring each applicant. Such
processes have scorers assign both
an absolute score, which measures how the applicant did in
relation to the highest possible

score, and a relative score, which measures how the applicant
did in relation to the rest of the
interviewees. When the scores assigned by these multiple raters
are comparable in terms of
absolute scores for each applicant, as well as relative scores and
rankings across applicants,
the interview process is considered reliable. On the other hand,
if different raters score the
same applicant very differently, and if the interview process
yields different rankings across
applicants and thus different hiring recommendations, then the
process is unreliable. Similar
to validity, no test or selection method has perfect reliability,
but the more reliable and con-
sistent a selection tool is, the more accurate it will be in
determining quality candidates, and
the more legally defensible it will be if the organization is sued
for discriminatory hiring. An
objective and systematic selection process that leads to
consistent results across candidates
and raters is an organization’s first line of defense against such
accusations.
Ensuring that tests are both valid and reliable is an important
part of the assessment process.
Of course, the more accurate the testing process, the more likely
the best candidates will be
selected, promoted, or matched with the right job. However,
that is not the only reason to
do so: invalid or unreliable tests can be costly. Many tests need
to be purchased or require a
license of use to be obtained. Testing also takes time, both for
the candidate and the organiza-
tion. Tests need to be administered, be rated, and have the
results reported, which requires
managers’ and HR professionals’ time and effort. Tests that are

not valid or reliable also have
opportunity costs, such as the time spent using them as well as
the lower productivity of
those who were hired or promoted using the wrong test.
Finally, there are legal implications for ineffective testing. An
invalid test may not be related
to performance, but it may be discriminatory. It may favor
certain protected classes over oth-
ers. For example, if younger job applicants consistently score
higher than older ones on a test,
and these scores are not related to job performance, then that
test may be found discrimina-
tory. Similarly, if a particular test favors men over women or
places minority applicants at a
disadvantage, it can be considered discriminatory and thus
illegal. For example, complaints
were filed against the pharmacy chain CVS Caremark for using
a discriminatory personality
test. The test included questions about the applicant’s
propensity to get angry, trust others,
and build friendships. These questions were found to be
potentially discriminatory against
applicants with mental disabilities or emotional disorders
(Tahmincioglu, 2011). Although
the organization may have had no intent to discriminate, using
invalid discriminatory tests
can result in what was referred to in Chapter 2 as “disparate
impact,” which is also illegal.
you83701_03_c03_067-102.indd 70 4/20/17 4:22 PM

71
Uses of Tests
Companies of all sizes are integrating tests into their
employment practices. In 2001 a study
by the American Management Association found that 68% of
large U.S. companies used job-
skill testing as part of their employment process; psychological
(29%) and cognitive (20%)
measurements were used less frequently. More recent studies,
however, show that the test-
ing trend is on the rise, with nearly 80% of Fortune 500
organizations using assessments of
Consider This: Tests and Testing
Make a list of as many tests as you can remember having taken
during times when you were up
for selection from a pool of potential candidates (e.g., jobs,
volunteering opportunities, college
admissions, scholarships, military service). Remember that a
test can be any instrument or
procedure that measures samples of behavior or performance. It
does not have to be a written,
proctored exam.
Questions to Consider
1. What do you think each of those tests was attempting to
measure?
2. What is your opinion of each of those tests?
3. Did the tests adequately measure what they were trying to
measure?
4. What were some strengths and weaknesses of each test?

Find Out for Yourself: The Validity and Reliability of
Commonly
Used Selection Tools
Visit the following websites to read about the different types of
validity and reliability, as well
as how they are measured.
Validity & Reliability of Methods
Psychometric Assessment Validity
Psychometric Test Reliability
What Did You Learn?
1. Compare the validity of various selection methods such as
interviews, reference checks,
and others. If you were a recruiter, which ones would you use?
Why?
2. If you were to use an actual test, how long would you design
the test to be? Why?
3. If you were to design an interview process, how many
interviewers would you use for
each candidate? What are the benefits of using more than one
interviewer (rater)?
4. What are the key factors in increasing the validity of the
selection process?
5. What are the key factors in increasing the reliability of the
selection process?
you83701_03_c03_067-102.indd 71 4/20/17 4:22 PM

http://www.nicheconsulting.co.nz/validity_reliability.htm
http://www.nicheconsulting.co.nz/validity.htm
http://www.nicheconsulting.co.nz/reliability.htm
72
Section 3.3 Requirements of Psychological Measurement
some sort (Dattner, 2008). Moreover, about a quarter of
employers utilize online personality
testing to weed out applicants early in the recruitment process,
even before any other screen-
ing tool is used; this trend is expected to increase by about 20%
annually. Examples include
large, nationwide employers such as McDonald’s and CVS
Caremark (Tahmincioglu, 2011).
The growing popularity of testing within organizations has
resulted in the use of tests not
only for selection but also for a number of other HR functions.
One of the most important ways in which organizations use tests
is to evaluate job applicant
or employee fit. Organizations often administer tests to
accurately evaluate an applicant’s
job-related characteristics or determine an employee’s
suitability for promotion or place-
ment in a new position within the company. Because promotions
and training are expensive,
organizations place high importance on being able to determine
which employees possess
the higher level abilities and skills needed to assume advanced
job positions. Similarly, during

job reorganizations, companies must be able to place
individuals into new jobs that align with
their skills and abilities. Keep in mind that selection,
promotion, and job-placement processes
all involve employment decisions and thus must be well
designed in order to meet the requi-
site legal and professional standards.
HR professionals make use of tests in areas outside the realm of
employment selection. Train-
ing and development is one such example. Trainees are often
tested on their job knowledge
and skills to determine the level of training that will fit their
proficiency. At the end of train-
ing, they may take tests to assess their mastery of the training
materials or to identify areas
where they need to be retrained. Other types of tests help
individuals identify areas for self-
improvement, and sometimes job teams take tests to help
facilitate team-building activities.
Finally, tests can help individuals make educational or
vocational choices. People who work
at jobs that utilize their skills and interests are more likely to be
successful and satisfied, so
it is important that vocational and educational tests make
accurate matches and predictions.
Note that tests used solely for career exploration or counseling
need not meet the same strict
legal standards as tests used for employee selection.
3.3 Requirements of Psychological Measurement
Tests designed by I/O psychologists possess a number of
important characteristics that set
them apart from other tests you may have taken. Scientifically
designed tests differ from mag-
azine quizzes or informal tests that you find online in that they

are more than a set of ques-
tions related to a specific topic. Instead, they must meet
standards related to administration
(the way in which the test is given), scoring methods, score
interpretation, reliability, and
validity. Unfortunately, many employers and consultants think
they can simply put together a
test or an interview protocol that they believe measures what
they feel is necessary and start
using the selection tool without any statistical analysis to assess
its quality. In addition to the
fact that this approach does not effectively distinguish
applicants with higher performance
potential, which makes it a waste of time and resources, it can
also yield inconsistent and
discriminatory results, which makes it illegal.
you83701_03_c03_067-102.indd 72 4/20/17 4:22 PM
73
Standardized Administration
To administer a selection test properly, the conditions under
which applicants complete the
test must be standard—that is, they must be the same every time
the test is given. These con-
ditions include the test materials, instructions, testing facilities,
time allowed for testing, and
allowed resources and materials. To ensure standardization,

organizations put instructions
into written form or administer the test to large groups so that
all applicants hear the same
instructions. Additionally, applicants will all complete the test
in the same location using
well-functioning equipment and comfortable seating. Test
administrators are also careful to
keep the testing environment comfortable in terms of
temperature and humidity as well as
free from extraneous noise or other distractions.
Variations in testing conditions can significantly
interfere with results, making it impossible to
create accurate comparisons between different
applicants based on the conditions in which they
were tested.
Consider how changing even one aspect of test-
ing conditions can affect test performance. What
would happen if, on a cold day in the middle
of winter, the heat stopped working partway
through a series of applicant evaluations? Appli-
cants might not perform well on a typing test
because their hands were cold and stiff, or they
might not complete a written test because they
were shivering and could not concentrate. Now think about how
differently two groups of
test takers would perform if one group accidentally received
incomplete instructions from an
inexperienced administrator, while a second group received the
proper instructions from an
experienced administrator. You can easily see how unfair it
would be to try to compare test
results of applicants who were not all tested under equal
conditions!

Of course, it is sometimes not only appropriate but also
necessary to alter the testing condi-
tions. Applicants with disabilities may need specific
accommodations, such as a sign language
interpreter for a person with a hearing impairment or a reader or
Braille version of a written
test for a person with a visual impairment. For applicants with
disabilities, then, not allow-
ing for changes in the testing conditions would make it
difficult, if not impossible, for them to
perform their best.
A real-life example of this occurred when an on-campus
recruiter for a highly coveted intern-
ship program noticed that, despite performing as well as other
students when interviewed
on campus, minority students from very low-income areas fared
poorly when invited to an
on-site interview. The recruiter suspected that the
organization’s luxurious office building
and extravagant furnishings may have intimidated those
students and caused their poor per-
formance. To further investigate his point, he changed the
location of the interview to a local
community youth center—everything else about the interview
process was kept identical.
Under these conditions, the students’ performance improved
significantly and was no differ-
ent from the overall applicant pool. In other words, the change
was necessary to neutralize
the distracting effects of an otherwise unrelated testing
condition.
Lisa F. Young/iStock/Thinkstock

It is important that all applicants take the
same test under the same conditions.
you83701_03_c03_067-102.indd 73 4/20/17 4:22 PM
74
Objective Scoring
Just as performance on a test can be affected by testing
conditions, results can be affected
by the way in which the test is scored. To eliminate the
possibility of bias, test scores must
be standardized, which means that a test must be scored in the
same way for everyone who
takes it. Establishing a scoring key or a clear scoring criterion
before administering a test will
reduce an evaluator’s subjective judgments and produce the
same or similar test scores no
matter who evaluates the test.
Organizations can utilize both objective tests and subjective
tests. Objective tests, such
as achievement tests and some cognitive ability tests, have one
clearly correct answer—
multiple-choice tests are an example. As long as a scoring key
is available, scoring an objec-
tive test should be free from bias. In contrast, subjective tests,
such as résumé evaluations,
interviews, some personality tests, and work simulations, have

no definitive right or wrong
answers; their scores rely on the interpretation and judgment of
the evaluator. To minimize
the influence of personal biases and increase scoring accuracy,
the evaluator uses a predeter-
mined scoring guide or template, also sometimes referred to as a
rubric, which establishes
a very specific set of scoring criteria, usually with examples of
what to look for in order to
assign a particular score. For example, if a rater is scoring an
applicant on professionalism
using a scale of 1 to 5, with 3 being “average” or “meeting
expectations,” this score can also
be accompanied by a description of what is considered “meeting
expectations” so the assess-
ment is not subject to the rater’s interpretation of what is
expected. Although subjective tests
are commonly used to make employment decisions, objective
tests are preferred for making
fair, accurate evaluations of and comparisons between job
candidates. In the majority of situ-
ations, not everything can be objectively measured, so both
subjective and objective tests are
used to combine the specificity of objective tests and the
richness of subjective tests.
Score Interpretation
After a test has been scored, the score needs to be interpreted.
Once again, this process must
be standardized. Additionally, to be interpreted properly, a
person’s score on a test must be
compared to other people’s scores on the same test. I/O
psychologists use a standardization
sample, which is a large group of people who have taken the
test against whose scores an
individual’s score can be compared. The comparison scores

provided by the standardization
sample are called test norms. The demographics of the
standardization sample can be used
to establish test norms for various racial and ethnic groups, men
and women, and groups of
different ages and various education levels.
Let’s look at an example of how test-score interpretation works.
As part of a selection process,
an applicant might answer 30 out of 40 questions correctly on a
multiple-choice cognitive
ability test. By itself, this score provides little information
about the applicant’s level of cogni-
tive ability. However, if we can compare it to how others
performed on the same test—spe-
cifically, the test norms established by a standardization
sample—we will be able to ascribe
some meaning to the score.
Often, raw scores, the number of points a person scores on a
test, are transformed into per-
centile scores, which tell the evaluator the percentage of people
in the standardized sample
who scored below an individual’s raw score. Continuing the
example above, if the score of 30
falls in the 90th percentile, 90% of the people in the
standardized sample scored lower than
our fictional applicant.
you83701_03_c03_067-102.indd 74 4/20/17 4:22 PM

75
Test Reliability
As introduced earlier, test reliability refers to the dependability
and consistency of a test’s
measurements. If a person takes a test several times and scores
similarly on it each time, that
test is said to measure consistently. If a test measures
inconsistently, then outside factors
must be influencing the results. For example, if an applicant
takes a mechanical ability test one
week and correctly answers 90 out of 100 items, but then takes
another form of the test the
following week and gets only 50 out of 100 items correct, the
test evaluator must ask whether
the tests are actually doing what they’re supposed to be doing—
measuring mechanical abil-
ity—or if something else is influencing the scores. Examples of
common but often unreliable
interview questions include “tell me about yourself ” and
“discuss your strengths and weak-
nesses.” Such questions are unreliable because the answer may
vary widely depending on the
applicant’s mood or recollection of recent events. Moreover,
interpretation of the answers
is subjective and depends on whether the interviewer likes or
dislikes what the applicant
says. There is also a very limited scope for comparing answers
across applicants to determine
which answers are higher or lower quality. On the other hand,
more targeted and job-related
questions—such as “tell me about a situation where you had to .
. .” or “what would you do
if you were faced with this situation”—are more likely to yield

consistent and comparable
responses. Thus, before trusting scores from any test that
measures inconsistently, I/O psy-
chologists must discover what is influencing the scores.
The test taker’s emotional and physical state can influence his
or her score. A person’s mood,
state of anxiety, and level of fatigue may change from one test-
taking time to another, and
these factors can have a profound effect on test performance.
Illness can also impact perfor-
mance. If a person is healthy the first time she takes a test, but
then has a cold when she takes
the test again, her score will likely be lower the second time
around.
Changing environmental factors can also make a test measure
inconsistently. Differences
from one testing environment to another, such as room lighting,
noise level, temperature,
humidity, and equipment, will affect a person’s performance, as
will the relative completeness
or incompleteness of instructions and manner in which they are
given.
Differences between versions of the same test also influence
reliability. Many tests have
more than one version or form (the written portion of a driver’s
license test is an example).
Although the test versions aim to measure the same knowledge,
the test items or questions
vary. If the questions in one version are more difficult than
those in another, the test taker
may perform better on one version of the test.
Finally, some inconsistency in test scores stems from real

changes in the test taker’s KSAOs.
These changes often appear if a significant period of time
passes between tests. High school
students taking the SAT, for example, may show vast increases
in scores from their junior year
to their senior year because they have improved their cognitive
ability, subject knowledge,
and/or test-taking skills.
Measures of Reliability
Typically, reliability is measured by gathering scores from two
sets of tests and then deter-
mining their association. The reliability coefficient states the
correlation—or relation-
ship—between the two score sets, and ranges from 0 to +1.00.
Although we won’t go into
you83701_03_c03_067-102.indd 75 4/20/17 4:22 PM
76
mathematical details of calculating correlations here, it is
important to understand that the
closer the score sets approach a perfect +1.00 correlation, the
more reliable the test. For
employment tests, reliability coefficients above +.90 are
considered excellent, and those
above +.70 are considered adequate. Tests with reliability
estimates lower than +.70 may

possess sufficient errors to make them unusable in employment
situations. I/O psychologists
measure test reliability with the internal consistency, test–
retest, interrater, alternate-form
(or parallel-form), and split-halves methods described below.
Internal Consistency Reliability
Internal consistency reliability assesses the extent to which
different test items or questions
are correlated and thus consistently measure the same trait or
characteristic. For example,
if a 20-question test is used to measure extroversion, then
scores on these 20 items should
be highly correlated. If they are not, then some of the items may
be measuring a different
concept. The most common measure of internal consistency
reliability is Cronbach’s alpha,
which is an overall statistical measure of all the
intercorrelations across items in a particular
test. It can also pinpoint which items are poorly performing and
which should be removed to
improve the test’s internal consistency.
Test–Retest Reliability
Test–retest reliability involves administering the same test to
the same group of people at two
different times and then correlating the two sets of scores. To
the extent that the scores are
consistent over time, the reliability coefficient shows the test’s
stability (see Figure 3.1). This
method has a few limitations. First, it can be uneconomical,
because it requires considerable
time for employees to complete the tests on two or more
occasions. Second, it can be difficult
to determine the optimal length of time that should pass
between one test-taking session

and the next. If the interval is short, say, a few hours, test
takers may remember all the ques-
tions and simply answer everything the same way on the retest,
which could artificially inflate
the reliability coefficient. Conversely, waiting too long between
tests, say, 6 months to 1 year,
could cause retesting scores to be affected by changes that
result from outside learning. This
can artificially reduce the test’s reliability. The best time
interval is therefore relatively short,
such as a few weeks or up to a few months.
Interrater Reliability
As the name implies, and as introduced in earlier examples,
interrater reliability involves
allowing more than one evaluator (rater) to assess each
candidate’s performance, and then
correlating the scores of the raters across candidates. To the
extent that the scores are con-
sistent across raters, the test is considered more reliable. This
method is particularly relevant
for subjective tests that are prone to personal interpretation.
Even with well-designed rubrics
and valid questions, these methods introduce some variability in
assessment across candi-
dates and raters. Interrater reliability ensures that such
variability is under control and insuf-
ficient to bias the results of the process or alter hiring
decisions. Various advanced statistical
methods are available to take into account both the consistency
of the absolute scores of the
candidates across raters and their relative scores and rankings
compared to each other. Both
types of consistency are important. For example, absolute
scores can affect selection deci-
sions in situations where there is a cutoff score, such as in the

case of college admissions and
certifications. Relative scores and rankings can affect
promotion decisions or come into play
when only a predetermined number of candidates (e.g., top five)
can be selected.
you83701_03_c03_067-102.indd 76 4/20/17 4:22 PM
77
Mechanical
comprehension at first
test administration
Mechanical
comprehension at second
test administration
High
Low
High
Low
Mechanical

test administration
Mechanical
test administration
High
Low
High
Low
High reliability
Low reliability
Alternate-Form Reliability
Alternate- (or parallel-) form reliability has to do with how
consistent test scores are likely to
be if a person takes two similar, but not identical, forms of a
test. As with test–retest reliability,
this measure requires two sets of scores from a group of people
who have taken two varia-
tions of a test. If the score sets yield a high parallel-form
reliability coefficient, then the tests
are not materially different. If the reverse happens, the tests are
probably not equivalent and
therefore cannot be used interchangeably.
Figure 3.1: Test-retest reliability

One way to determine a test’s reliability is to conduct a test–
retest. The more consistent test scores
are over time, the more reliable the test is thought to be.
From Levy, P.E. (2016). Industrial/organizational psychology:
Understanding the workplace. (5 ed). p. 26, Fig. 2.3. Copyright
2017
by Worth Publishers. All rights reserved. Reprinted by
permission of Worth Publishers.
Mechanical
test administration
Mechanical
test administration
High
Low
High
Low
Mechanical
test administration
Mechanical

test administration
High
Low
High
Low
High reliability
Low reliability
you83701_03_c03_067-102.indd 77 4/20/17 4:22 PM
78
The major limitations of parallel-form reliability are that it is
time consuming and costly. Fur-
thermore, it requires the test developer to design two versions
of a test that both cover the
same subject matter and are equivalent in difficulty and reading
level.
Split-Halves Reliability
Split-halves reliability tests are more cost effective than either
test–retest or parallel-form
reliability because they can be assessed with scores from one

test administered just one time.
After the test has been given, it is split in half, and scores from
the two halves of the test are
correlated. A high reliability coefficient indicates that each
section of the test is consistently
measuring similar content, whereas the reverse is true with a
low reliability coefficient.
The tricky part of split-halves reliability is determining how
best to split the test. For example,
if test items increase in difficulty as the test progresses, or if
the first half of the test contains
fewer difficult questions than the second, it won’t work to
simply split the test down the
middle and compare the scores from each half. To solve this
dilemma, tests are often split by
odd- and even-numbered questions.
Find Out for Yourself: Test Reliability
This exercise is designed to help you grasp the challenges
involved in designing a reliable test.
To begin, choose a trait or skill in which you are interested. For
example, consider an academic
subject, mastery of an online game, or a personality trait that
you admire. Then, write 10 state-
ments you believe to be good measures of the selected trait or
skill. Ask three friends or family
members to rate themselves on each statement on a scale of 1–5
(1 = strongly disagree, 5 =
strongly agree). Finally, without looking at their scores, rate
each individual on the same 10
statements based on your own perceptions of that individual.
What Did You Learn?
1. Assess interrater reliability: Add up each individual’s scores

based on his or her own
assessment. Rank order the scores. Add up each individual’s
scores based on your
assessment of that individual. Rank order the scores. Did the
rankings change?
2. Assess split-halves reliability: Add up each individual’s
scores on the first five questions.
Add up each individual’s scores on the second set of five
questions. Are the scores of
each individual on the two test halves similar? Rank order the
three individuals based
on their first five questions. Rank order them again based on the
second set of five ques-
tions. Did the rankings change?
3. Ask the same three individuals to rate the same statements
again a week later. Assess
test–retest reliability: Add up each individual’s scores on the
first time he or she took
the test. Add up each individual’s scores on the second time he
or she took the test. Are
the scores similar? Rank order the three individuals based on
their first test. Rank order
them again based on their second test. Did the rankings change?
As you can probably appreciate from this exercise, anyone can
“whip up” a test, but the test
may be highly subjective and unreliable if the scores are not
consistent across raters, ques-
tions, and times of test administration. You probably now have
some idea as to how to improve
your test. Similarly, scientific test design requires numerous
iterations of writing and rewrit-
ing items and statistically examining results with multiple
samples to ensure reliability before

the test can be used.
you83701_03_c03_067-102.indd 78 4/20/17 4:22 PM
79
Test Validity
Validity is the most important aspect of an employment test.
Although a test is reliable if it is
able to make consistent measurements, a test is valid if it truly
measures the characteristics it
is supposed to measure. An employment test may yield reliable
scores, but if it doesn’t mea-
sure the skills that are needed to perform a job successfully, it
is not very useful. I/O psycholo-
gists use three methods to establish test validity: criterion-
related validity, content validity,
and construct validity.
Criterion-Related Validity
The purpose of criterion-related validity is to establish a
predictive, empirical (number-based)
link between test scores and actual job performance (see Figure
3.2). To do this, I/O psycholo-
gists compare applicants’ employment test scores with their
subsequent job performance.
This correlation is called the validity coefficient, and it ranges
from 0 to ±1.00. Tests that yield
validity coefficients ranging from +.35 to +.45 are considered

useful for making employment
decisions, whereas those with validity coefficients of less than
+.10 probably have little rela-
tionship to job performance.
I/O psychologists use two different methods to establish
criterion-related validity: predictive
validity and concurrent validity. Predictive validity involves
administering a new test to all job
applicants but not using the test scores to try to predict the
applicants’ job readiness. Instead,
the scores are filed away to be analyzed later. After a time,
managers will have accumulated
performance ratings or other information that indicates how new
hires are performing on
the job. At that point, the new hires’ preemployment test scores
will be correlated with their
performance, and the organization can look at how successfully
the test predicted perfor-
mance. If the test proves to be a valid predictor of performance,
it can be used in future hiring
decisions. Although this approach is considered the gold
standard of test validation, many
organizations are unwilling to use the predictive validity
method because filing away employ-
ment test scores lets some unqualified applicants slip through
the preemployment screening
process. However, scientifically developed tests regularly use
this method to validate them
prior to their use.
The concurrent validation approach is more popular because,
instead of job applicants, on-
the-job employees are used to establish the test’s validity. With
this method, both the current
employees’ test scores and their job performance ratings are

gathered at the same time, and
test validity is established by correlating these measures.
Organizations appreciate the cost-
effectiveness offered by concurrent validation. Tests that are
found to be of high concurrent
validity based on the results from current employees are then
used to assess job applicants.
Other forms of concurrent validation include convergent and
divergent validity. Convergent
validity refers to the correlation between test scores and scores
on other related tests. For
example, SAT and ACT tests are expected to correlate, so the
validation of new SAT questions
may involve examining their correlation with ACT questions.
Divergent validity refers to the
correlation between test scores and scores on other tests or
factors that should not be related.
For example, test scores should not be related to gender, race,
religion, or other protected
classes. To ensure that a test is not discriminatory, the
correlation between its scores and
each of these factors can be examined. Divergent validity is
supported when these correla-
tions are low or statistically insignificant. Taken together,
convergent and divergent validity
you83701_03_c03_067-102.indd 79 4/20/17 4:22 PM
80

Cognitive
ability score
Job
performance
High
Low
Excellent
Poor
Cognitive
ability score
Job
performance
High
Low
Excellent
Poor
High validity
Low validity
examine the extent to which a test relates to what it should be
related to and does not relate

to what it should not be related to, respectively.
Despite the time- and cost-saving advantages, concurrent
validation does have a number of
drawbacks. First, the group of employees who validate a test
could be very different from the
Figure 3.2: Criterion-related validity
In order to establish a connection between test scores and actual
job performance, I/O psychologists
use criterion-related validity. Tests with high correlations
between scores and job performance are
considered to be high-validity tests and can be useful for
making employment decisions. Tests with
low correlations between scores and job performance are
considered low-validity tests and would
not be ideal for assessing job performance.
From Levy, P.E. (2016). Industrial/organizational psychology:
Understanding the workplace. (5 ed). p. 29, Fig. 2.4. Copyright
2017
by Worth Publishers. All rights reserved. Reprinted by
permission of Worth Publishers.
Cognitive
ability score
Job
performance
High
Low
Excellent

Poor
Cognitive
ability score
Job
performance
High
Low
Excellent
Poor
High validity
Low validity
you83701_03_c03_067-102.indd 80 4/20/17 4:22 PM
81
group of applicants who actually make use of the test. The
former would likely skew the vali-
dation because lower performing employees (and their low
scores) would have already been

removed from their positions and thus not be part of a test-
validating process. This can cause
the validity of the test to appear higher than it really is. On the
other hand, employees may
also skew the validation by not trying as hard as job applicants.
Employees who already have
jobs might not be as motivated to do their best as applicants
eager for a job would be. This can
cause the test’s validity to appear lower than it really is.
Content-Related Validity
Content-related validity is the rational link between the test
content and the critical job-related
behaviors. In other words, test items should be directly related
to the important require-
ments and qualifications for the job. The rationale behind
content-related validation is that if
a test samples actual job behaviors, then individuals who
perform well on it will also perform
well on the job. Remember that the goal of testing is to predict
performance.
As you can probably guess, content-related validation studies
rely heavily on information
gathered from a job analysis. If test questions are directly
related to the specific skills needed
to perform a job, the test will have high content-related validity.
For example, a test for admin-
istrative professionals might ask questions related to effective
filing methods, schedule man-
agement, and typing techniques. Because these skills are
important for the administrative
professional position, this test would have high content-related
validity. Content validity can-
not be evaluated numerically or statistically as readily as
criterion validity. It is often qualita-

tively evaluated by subject matter experts. However, qualitative
evaluations can be quantified
using well-designed rubrics and assessed for reliability and
consistency.
Construct Validity
Construct validity is the extent to which a test accurately
assesses the abstract personal attri-
butes, or constructs, that it intends to measure. Although there
are valid measures of many
personality traits, numerous invalid measures of the same traits
can be found in less scientific
sources such as magazines or the Internet. Because constructs
are intangible, it can be chal-
lenging to design tests that measure them.
How do we know if tests of personality, reasoning, or
motivation actually measure these
intangible, unobservable characteristics? One way to establish
construct validity is to cor-
relate a new test with an established test that is known to
measure the construct in question.
If the new test correlates highly with the established test, the
new test is likely measuring the
construct it is intended to measure.
Validity Generalization
Initially, I/O psychologists thought that validity evidence was
situation specific; that is, a test
that was validated and used for applicants for one job could not
be used for applicants for a
different job unless an additional job-specific validation study
was performed. Further, I/O
psychologists believed that tests that had been validated for a
position at one company could
not be used for the same position at a different company—

again, unless it was validated for
the second company, which is a tedious and costly process.
you83701_03_c03_067-102.indd 81 4/20/17 4:22 PM
82
However, research over the past few decades has shown that
validity specificity is unfounded
(Lubinski & Dawis, 1992). I/O psychologists now believe that
validity evidence transfers
across situations, a notion that is referred to as validity
generalization. Researchers discov-
ered that the studies that supported validity specificity were
flawed (Schmidt & Hunter, 1981).
Validity generalization has been a huge breakthrough for
organizations. Establishing validity
evidence for every employment test in every situation was, for
most organizations, both cost
and time prohibitive. Because of its cost- and time-saving
benefits, the advent of validity gen-
eralization has meant that organizations are more willing to
integrate quality tests into their
employment practices.
Face Validity
Face validity is not a form of validity in a technical sense;
rather, it is the subjective impression

of how job relevant an applicant perceives a test to be. For
example, a bank teller would find
nothing strange about taking an employment test that dealt with
numerical ability or money-
counting skills, because these skills are obviously related to job
performance. On the other
hand, the applicant may not see the relevance of a personality
test that asks questions about
personal relationships. This test would thus have low face
validity for this job.
Organizations need to pay close attention to their applicants’
perceptions of a test’s face valid-
ity, because low face validity can cause an applicant to feel
negatively about the organization
(Chan, Schmitt, DeShon, Clause, & Delbridge, 1997; Smither,
Reilly, Millsap, Pearlman, & Stof-
fey, 1993). If organizations have the opportunity to pick
between two tests that are otherwise
equally valid, they should use the test with the greater level of
face validity.
Find Out for Yourself: Your Personality Type
Visit the 16 Personalities website and take the personality-type
assessment provided. Read
your results, then visit the Encyclopedia Britannica entry on
personality assessment and read
about the reliability and validity of assessment methods.
16 Personalities
Reliability and Validity of Assessment Methods
What Did You Learn?
1. What have you learned about yourself through this

assessment?
2. Is this assessment accurate? Which types of validity and
reliability apply to it?
3. Based on this assessment, what are some examples of jobs
that would fit your type?
you83701_03_c03_067-102.indd 82 4/20/17 4:22 PM
https://www.16personalities.com/free-personality-test
https://www.britannica.com/science/personality-
assessment/Reliability-and-validity-of-assessment-methods
83
Section 3.4 Test Formats
Qualitative Methods
Most of the methods discussed in this chapter are quantitative in
nature. However, employ-
ers often find it necessary to take into consideration other
factors that are not necessarily
quantifiable but can be very important in employee selection.
Qualitative selection methods
may include observation, unstructured interview questions,
consultation with references, or
solicitation of opinions from past or future managers and
coworkers. Qualitative methods can
yield a much richer and broader array of information about an
applicant that may be impos-
sible to capture using quantitative methods.
Unfortunately, however, qualitative methods are much harder to

assess in terms of validity
and reliability, and thus they may lead to erroneous decisions.
Their subjectivity can also lead
to discriminatory decisions. Qualitative methods can be
particularly problematic in ranking
applicants. Without a predetermined set of evaluation criteria
and a rating scale, interrater
reliability can be very low. That is why psychologists attempt to
quantify even what would
be considered qualitative criteria—such as person–organization
fit and person–job fit—by
creating survey measures of these factors. With the help of I/O
psychologists, employers can
also create quantitative scoring themes for qualitative data to
increase the integrity and legal-
ity of these methods.
3.4 Test Formats
Thousands of employment tests are on the market today.
Naturally, no two tests are the same
and may differ in their construction or administration. Tests
vary in their quality depending
on the rigor of their validation processes. They also vary in cost
depending on their extensive-
ness and popularity. However, it is important to note that
quality and cost do not always go
hand in hand. Some of the most valid and reliable tests are
available in the scientific literature
free of charge, but they are not very popular among
practitioners, who are often unfamiliar
with the scholarly literature. On the other hand, some of the
popular and expensive tests
marketed by well-known consulting companies have
questionable validity and reliability. In
some cases these tests are only taken at face validity and are
never statistically analyzed. In

other cases the test developers make sure their tests are valid
and reliable but are reluctant to
publicly share their analyses for proprietary reasons. In all
cases prudent employers should
demand evidence of validity and reliability in order to ensure
that they are (a) investing their
time and resources in the right selection tools that will yield the
most qualified workforce and
(b) using legally defensible and nondiscriminatory methods.
Commonly used test-design for-
mats include assessment centers, computer-adaptive tests, speed
and power tests, situational
judgment tests, and work-sample tests.
Assessment Centers
An assessment center is one of the most comprehensive tests
available and is often used to
select management and sales personnel because of its ability to
assess interpersonal, commu-
nication, and managerial skills. A typical assessment center
includes personality inventories,
cognitive assessments, and interviews, as well as simulated
activities that mimic the types of
activities performed on the job. Common types of simulated
activities include in-basket tasks,
leaderless group discussions, and role-play exercises.
you83701_03_c03_067-102.indd 83 4/20/17 4:22 PM
84

Section 3.4 Test Formats
Although assessment centers can predict the level of success
both in training and on the job,
they have a number of limitations. First, assessment centers are
expensive to design and
administer, because administrators must be specifically trained
to evaluate discussions and
perform role plays. Assessment centers for senior management
positions can cost more than
$10,000, a price tag that is prohibitive for many organizations.
Second, because scoring an
assessment center relies on the judgment of its assessors, it can
be difficult to standardize
scores across time and location. This issue can be mitigated by
training assessors to evaluate
behaviors against an established set of scoring criteria.
Computer-Adaptive Tests
Typical tests include items that sample all levels of a
candidate’s ability. In other words, they
contain some questions that are easy and will be answered
correctly by almost all test tak-
ers, some that are difficult and will be answered correctly by
only a few, and some that are
in-between. A computer-adaptive test, however, tailors the test
to each test taker’s individual
ability.
In this type of test, the candidate begins by answering a
question that has an average level of
difficulty. If he or she answers correctly, the next question will
be more difficult; if he or she
answers incorrectly, the next question will be easier. This
process continues until the candi-
date’s proficiency level is determined. The fact that candidates

do not waste time answering
questions of inappropriate difficulty is a clear advantage of the
computer-adaptive test. Addi-
tionally, because each test is tailored to the individual, test
security (i.e., cheating) is less of a
concern.
Speed and Power Tests
Tests can be designed to assess either an individual’s depth of
knowledge or rate of response.
The first type of test is called a power test. Power tests are
designed to be difficult, and very
few individuals are able to answer all of the items correctly.
Test takers receive either a gen-
erous time limit or no time limit at all. The overall purpose of
the power test is to evaluate
depth of knowledge in a particular domain. Therefore, response
accuracy is the focus, instead
of response speed.
Speed tests contain a homogenous content set, and test takers
receive a limited amount of
time to complete the test. These tests are well suited to jobs in
which tasks must be per-
formed both quickly and accurately, such as bookkeeping or
word processing. For these jobs,
a data-entry test would be an appropriate speed test for
measuring an applicant’s potential
for success.
Situational Judgment Tests
A situational judgment test is a type of job simulation that is
composed of a number of job-
related situations designed to assess the applicant’s judgment.
Each situation includes mul-
tiple options for how to respond. The applicant must select the

options that will produce
the most and least effective outcomes. Statistically, situational
judgment tests have validities
comparable to structured interviews, biographical data, and
assessment centers (Schmidt &
Hunter, 1998).
you83701_03_c03_067-102.indd 84 4/20/17 4:22 PM
85
Section 3.5 Testing for Individual Differences
Situational judgment tests are frequently admin-
istered to candidates for management positions,
although research shows them to be predictive
of performance for a wide variety of jobs. Stud-
ies have found validity evidence for situational
judgment tests’ ability to predict supervisory
performance (Motowidlo, Hanson, & Craft, 1997)
and to predict job performance for sales profes-
sionals (Phillips, 1992), insurance agents (Dales-
sio, 1994), and retail store employees (Weekley
& Jones, 1997).
Work-Sample Tests
Work-sample tests evaluate an applicant’s level of
performance when demonstrating a small sample
set of a job’s specific tasks. The two general areas for work-
sample tests are motor skills and
verbal abilities. A test that requires a machinist applicant to

properly operate a drill press is
an example of a motor skills work-sample test; a test that asks a
training applicant to present a
portion of the organization’s training program is a verbal ability
work-sample test.
One advantage of work-sample tests is that they generally show
a high degree of job related-
ness, so applicants perceive that they have a high degree of face
validity. Additionally, these
tests provide applicants with a realistic job preview. The
disadvantage is that they can be
expensive to develop and administer.
Goodshot/Thinkstock
Situational judgment tests present
applicants with job-related scenarios
in order to assess their decision-
making skills.
Consider This: The Costs of Testing
1. For most of the tests discussed in this section, expense is a
major drawback. Why do you
think an organization would go to the trouble of developing and
using employment tests?
2. How might the expense of a test be justified, offset, or
overcome?
3.5 Testing for Individual Differences
People differ in psychological and physical characteristics, and
identifying and categorizing
people in respect to these differences is important for
successfully predicting both job per-
formance and job satisfaction. The most commonly tested

categories of individual differences
are cognitive ability, physical ability, personality, integrity, and
vocational interests. Each of
these categories has an important theoretical foundation as well
as its own set of advantages
and disadvantages.
Cognitive Abilities
The past hundred years of study have produced two distinct
concepts of cognitive ability.
Beginning with Spearman’s seminal research in 1904 on general
intelligence, one concept is
based on the belief that cognitive ability is a single, unitary
construct (called the g, or general,
you83701_03_c03_067-102.indd 85 4/20/17 4:22 PM
86
factor) along with multiple subfactors (called s factors).
According to this two-factor, or hier-
archical, theory of cognitive ability, the g factor is important to
all cognitive performance,
whereas s factors influence specific intelligence domains. For
example, your performance on
a math test will be influenced by both your general, overall
intelligence (the g factor) and your
knowledge of the specific math topic being tested (an s factor).
From test to test, your scores

will be strongly correlated, because all performance is
influenced by the g factor; however, the
influence of s factors will keep the correlation from being
perfect. So, although a high g factor
of overall intelligence might mean that you would score higher
than most on both a math test
and a verbal reasoning test, your math test scores could be
lower than your verbal reasoning
scores simply because you never took a class that covered the
specific topics on the math test
(an s factor).
Led by Thurstone’s research, begun in 1938, scientists
challenged Spearman’s theories by
proposing that cognitive ability was a combination of multiple
distinct factors, with no over-
arching factor. Using a statistical technique called factor
analysis, Thurstone and his col-
leagues identified seven primary mental abilities: spatial
visualization, number facility, verbal
comprehension, word fluency, associative memory, perceptual
speed, and reasoning (Thur-
stone, 1947). This theory suggests that employment tests should
evaluate the primary men-
tal abilities that are most closely linked to a specific job. For
example, a test for engineering
applicants would focus on spatial and numerical abilities.
Although there is no consensus, research has supported
Spearman’s hierarchical model of
cognitive ability (Carroll, 1993; Schmid & Leiman, 1957).
Consequently, employers tend to
use tests to measure both general intelligence and specific
mental domains. Those that focus
on the g factor are called general cognitive ability tests. They
measure one or more broad

mental abilities, such as verbal, mathematical, or reasoning
skills. General cognitive ability
tests can be used to evaluate candidates for almost any job,
especially those in which cogni-
tive abilities such as reading, computing, analyzing, or
communicating are involved. Specific
cognitive ability tests measure the s factors and focus on
discrete mental abilities such as reac-
tion time, written comprehension, and mathematical reasoning.
These tests must be closely
linked to the job’s specific functions.
Cognitive ability tests are among the most widely used tests
because they are highly effec-
tive at predicting job and training success across many
occupations. In their landmark study,
Schmidt and Hunter (1998) examined validity evidence for 19
different selection processes
from thousands of studies over an 85-year period. After
compiling a meta-analysis (which is
a combination of the results of several studies that address a set
of related research hypoth-
eses), Schmidt and Hunter found that cognitive ability test
scores correlated with job perfor-
mance at .51 and training success at .53, which were the highest
validity coefficients among
all the types of tests they examined. Other researchers found
similar validities using data
from European countries (Bertua, Anderson, & Salgado, 2005;
Salgado, Anderson, Moscoso,
Bertua, & Fruyt, 2003). Interestingly, additional research has
found that a job’s complexity
positively affects the validity of cognitive ability tests. In other
words, the more complex the
job, the better the test is at predicting future job performance.
For jobs with low complexity,

on the other hand, high cognitive ability test scores are less
important for predicting success-
ful job performance (Hunter, 1980; Schmidt & Hunter, 2004).
The most popular general cognitive ability test is the Wonderlic
Cognitive Ability Test. Devel-
oped by Eldon F. Wonderlic in the 1930s, this test contains 50
items, has a 12-minute time
you83701_03_c03_067-102.indd 86 4/20/17 4:22 PM
87
limit, and is used by both private businesses and
government agencies. Test norms have been set by
more than 450,000 working adults and are avail-
able for over 140 different jobs. Test content cov-
ers numerical reasoning, verbal comprehension,
and spatial ability. The test begins with very easy
questions and progresses to very difficult ones;
due to this range and the short time limit, few peo-
ple are able to answer all 50 questions correctly.
The test is offered in a variety of versions, includ-
ing computer- and paper-based formats, and is
now available in nine languages, including French,
Spanish, British English, and American English.
In all, more than 130 million job applicants have
taken the Wonderlic.

The Wechsler Adult Intelligence Scale-Revised
(WAIS-R), developed by David Wechsler in 1955
and currently in its fourth edition, is another com-
monly used general cognitive ability test. It differs
from the Wonderlic in both length and scope and is composed of
11 different tests (6 verbal
and 5 performance), requiring 75 minutes to complete. The 6
verbal tests are comprehension,
information, digit span, vocabulary, arithmetic, and similarities.
The performance tests are
picture completion, picture arrangement, object assembly,
digital symbol, and block design.
Naturally, this complex psychological assessment requires well-
trained administrators to
ensure accurate scoring and score interpretation. The WAIS-R is
typically used when select-
ing for senior management or other positions that require
complex cognitive thinking.
Outside the world of work, a person’s cognitive ability also
predicts his or her academic suc-
cess. Using meta-analytic research, psychologists examined the
relationship between the
Miller Analogies cognitive ability test (a test commonly used to
select both graduate students
and professional-level employees) and student and professional
success. Interestingly, results
showed that there was no significant difference between the
cognitive abilities required for
academic and business success (Kuncel, Hezlett, & Ones, 2004).
Although they are valid performance predictors for many jobs,
cognitive ability tests can pro-
duce different selection rates for individuals in select classes.
Whites typically score higher

than African Americans, and there is much concern that these
differences are due to bias
within the test. If a test is biased to favor one ethnic group or
population over others, then
any employment selection or promotion program that utilizes it
will be inherently flawed.
Discovering bias within a test can be extremely difficult, but
I/O psychologists have been able
to reduce the impact of potentially biased cognitive ability tests
by adding noncognitive tests,
such as personality tests, to selection processes (Olson-
Buchanan et al., 1998).
Physical Abilities
Many jobs require a significant amount of physical effort.
Firefighters and police officers,
for example, may need physical strength, and mechanics may
need finger dexterity. Other
examples of hazardous or physically demanding work
environments include factories, power
Roy Delgado/CartoonStock
you83701_03_c03_067-102.indd 87 4/20/17 4:22 PM
88
plants, and hospitals. Organizations must be careful to use
information from a job analysis to

understand the specific physical requirements of a job before
purchasing or developing any
work-sample tests to use in the selection process. Fleishman
(1967) identifies nine physical
ability characteristics present in many jobs (see Table 3.1).
Measures of each physical ability
are not strongly correlated, however, which suggests that there
is no overall measure of gen-
eral physical ability.
Table 3.1: Fleishman’s physical ability dimensions
Physical ability Description
Static strength Maximum muscle force exerted by muscle
groups (e.g., legs, arms,
hands) over a continuous period of time
Explosive strength Explosive bursts of muscle energy over a
very short duration to move
the individual or an object
Dynamic strength Repeated use of a single muscle over an
extended period of time
Trunk strength Ability of back and core muscles to support the
body over repeated
lifting movements
Extent flexibility Depth of flexibility of arms, legs, and body
Dynamic flexibility Speed of flexibility of arms, legs, and body
Gross body coordination Ability to coordinate arms, legs, and
body to perform activities
requiring whole-body movements

Gross body equilibrium Ability to coordinate arms, legs, and
body to maintain balance and
remain upright in unstable positions
Stamina Ability to exert oneself physically over a long duration
Research consistently demonstrates that physical ability tests
predict job performance for
physically demanding jobs (Hogan, 1991). Identifying
individuals who cannot perform the
essential physical functions of a job—especially in hazardous
positions such as police officer,
firefighter, and military personnel—can minimize the risk of
physical harm to the job candi-
date, other employees, and civilians. Another positive feature of
physical ability tests is that
they are not strongly correlated with cognitive ability tests,
which as mentioned earlier tend to
be biased. Thus, using physical ability tests in conjunction with
cognitive ability tests can help
decrease potential bias and make job performance predictions
more accurate (Carroll, 1993).
I/O psychologists must be careful to design and measure
physical ability tests so they do
not discriminate against minority groups. Unfortunately,
although a job may legitimately
need candidates who possess specific physical skills, the
standards and measures of those
skills are often arbitrarily or inaccurately made. For example,
height and weight are often
used as a proxy for physical strength. Even though these
measurements are quick and easy
to make, they are not always the most accurate, and they have
resulted in the underselection

of women for physically demanding jobs (Meadows v. Ford
Motor Co., 1975). Companies that
fail to implement accurate, careful measurements of an
applicant’s ability to perform a job’s
necessary, specific physical requirements run the risk of
discriminating against protected
classes—something that will likely land them in court, where
judges and juries consistently
rule in favor of the plaintiff and award large settlements.
you83701_03_c03_067-102.indd 88 4/20/17 4:22 PM
89
One example of a valid, commercially available physical ability
test is the Crawford Small Parts
Dexterity Test. This test is used to assess the fine motor skills
and manual dexterity required
for industrial jobs. It examines applicants’ ability to place small
objects into small holes on
a board. For their first task, test takers use a pair of tweezers to
place 36 pins into holes and
position a collar around each pin. In their second task, test
takers use a screwdriver to insert
36 screws into threaded holes. The test is scored in two ways:
by measuring the amount of
time taken to complete both parts of the test and by measuring
the number of items com-
pleted during a set time limit (3 minutes for part 1 and 5

minutes for part 2).
Personality
I/O psychologists have studied how personality affects the
prediction of job performance
since the early 20th century. After examining 113 studies
published from 1913 to 1953, Ghis-
elli and Barthol (1953) found positive but small correlations
between personality and the
prediction of job performance. The researchers were surprised
that the correlation was not
stronger and suggested the companies had used personality tests
with weak validity evidence.
Developing and using tests with stronger validity evidence, they
concluded, would facilitate
better predictions of applicants’ future job performance.
Guion and Gottier (1965) disagreed with this notion, suggesting
instead that personality
measures were unrelated to job performance—even though the
two noted that their data
came from studies that used poorly designed or theoretically
unfounded personality mea-
sures. In fact, most researchers at the time developed their own
concepts of personality and
created tests to match them, which naturally led to considerable
inconsistency in the ways
they measured personality constructs. Thus, although
organizations continued to use person-
ality tests to select candidates for management and sales
positions, academic research in this
area waned for the next 20 years.
In the early 1990s Barrick and Mount’s (1991) landmark meta-
analysis established the five-
factor model of personality, now commonly referred to as the

Big Five personality factors—
extraversion, agreeableness, conscientiousness, neuroticism,
and openness to experience—
which are described in detail in Table 3.2. This model is
broader than earlier theories and
therefore lends itself more readily to a useful classification for
the interpretation of personal-
ity constructs (Digman, 1990).
Table 3.2: Big Five personality factors
Factor Also referred to as Description
Neuroticism Adjustment Insecure, untrusting, worried, guilty
Extraversion Sociability Sociable, gregarious, fun, people
person
Openness to experience Inquisitiveness Risk taking, creative,
independent
Agreeableness Interpersonal sensitivity Empathetic,
approachable, courteous
Conscientiousness Mindfulness Dependable, organized,
hardworking
you83701_03_c03_067-102.indd 89 4/20/17 4:22 PM
90

The most important advantage of the five-factor model is the
functional structure it provides
for predicting the relationships between personality and job
performance. Barrick and Mount
(1991) reviewed 117 criterion-related validation studies
published between 1952 and 1988
that measured at least one of the five personality factors.
Results showed that conscientious-
ness (a measure of dependability, planfulness, and persistence)
was predictive of job perfor-
mance for all types of jobs. Further, extraversion (a measure of
energy, enthusiasm, and gre-
gariousness) predicted performance in sales and management
jobs. The other three factors
were found to be valid but were weaker predictors of some
dimensions of performance in
some occupations. That same year, Tett, Jackson, and Rothstein
(1991) found strong positive
predictive evidence not only for conscientiousness and
extraversion but also for agreeable-
ness and openness to experience. On the other hand, they found
neuroticism to be negatively
related to job performance. These researchers also discovered
that validity was higher among
studies that referenced job analysis information to create tests
that linked specific personal-
ity traits with job requirements. In summary, then, measures of
the Big Five personality fac-
tors can significantly predict job performance, but to do so, they
must be carefully aligned
with critical job functions.
In addition to strong criterion-related validity, measures of the
Big Five factors also gener-

ally show very little bias. Across a number of personality
factors, score differences across
racial groups and/or genders are minor and would be unlikely to
adversely impact employ-
ment decisions; two areas that fall outside this generalization
are agreeableness, in which
women score higher than men, and dominance (an element of
extraversion), in which men
score higher (Feingold, 1994; Foldes, Duehr, & Ones, 2008).
Because they produce almost no
adverse impact, personality tests can be used in conjunction
with cognitive ability tests dur-
ing selection processes to increase validity while reducing the
threat of potential bias (Hough,
Oswald, & Ployhart, 2001).
The first test to examine the Big Five personality factors was
the NEO Personality Inventory,
named for the first three factors of neuroticism, extraversion,
and openness to experience.
Composed of 240 items, the test can be completed in 30 to 40
minutes and is available in a
number of languages, including Spanish, German, and British
English. However, research now
supports much shorter versions, as short as 10 items (Gosling,
Rentfrow, & Swann, 2003),
which are ideal for use in work settings, either separately or in
combination with other tests
and survey measures.
Types of Personality Tests
There are two basic types of personality tests: projective tests
and self-report inventories. The
former presents test takers with an ambiguous image, such as an
inkblot, and asks them to
describe what they see. Trained psychologists then interpret the

descriptions. The rationale
for this type of test is that test takers will project their
unconscious personalities into their
descriptions of the image. Two examples of projective tests are
the Rorschach (inkblot) test
and the Thematic Apperception Test.
Projective tests are most often used by clinical psychologists
and are rarely used in employee
selection processes because they are expensive, are time
consuming, and require professional
interpretation. Instead, employers make use of self-report
personality inventories, which ask
individuals to identify how a situation, behavior, activity, or
feeling is related to them. The
rationale for this type of test is that test takers know themselves
well enough to make an
you83701_03_c03_067-102.indd 90 4/20/17 4:22 PM
91
accurate report of their own personality. Some advantages of
the self-report inventories over
projective tests are cost-effectiveness, standardization of
administration practices, and ease
in scoring and interpreting results. For example, the website
below can help you assess your
Big Five personality traits.

Find Out for Yourself: Your Big Five Personality Traits
Visit the following website to find extensive information and
research about the Big Five per-
sonality traits, take the Big Five personality test, and get instant
feedback.
The Big Five Project Personality Test
Unfortunately, self-report inventories have drawbacks. A major
one is the tendency of test
takers to distort or fake their responses. Because test items
usually have no right or wrong
answers, test takers can easily choose to provide socially
acceptable responses instead of
true answers in order to make themselves look better to the
hiring organization. Indeed, in
one controlled study, researchers instructed test takers to try to
respond to a personality
inventory in a way they felt would create the best impression,
which resulted in more posi-
tive scores (Hough, Eaton, Dunnette, Kamp, & McCloy, 1990).
Furthermore, a significant num-
ber of actual applicants who took personality inventories as part
of a selection process were
found to have distorted their responses to appear more
attractive, even without having been
told to do so (Stark, Chernyshenko, Chan, Lee, & Drasgow,
2001).
The real question for I/O psychologists is whether response
distortion significantly affects the
validity of personality inventories. Ones, Viswesvaran, and
Reiss (1996) conducted a meta-
analysis that examined the effects of social desirability on the
relationship between measures

of Big Five factors and both job performance and
counterproductive behaviors. They found
that test takers’ attempts to provide socially acceptable—not
necessarily truthful—answers
did affect their scores in the areas of emotional stability and
conscientiousness but did not seri-
ously influence test validity. However, faking answers can
influence hiring decisions by chang-
ing the rank ordering of job candidates (Christiansen, Goffin,
Johnston, & Rothstein, 1994).
One interesting and paradoxical finding of personality test
research is that people who are
able to recognize the socially acceptable answers, whether they
accurately represent the truth
of that person, tend to perform better on the job than people
who are unable do so (Ones et
al., 1996). How can this be? One explanation is that people in
the former group are better at
reading a situation’s subtle social cues and are therefore more
able to extrapolate what they
need to do to fulfill coworkers’ and managers’ expectations.
To balance the advantages and disadvantages of projective and
self-report tests, a new type of
tests, called implicit measures, has emerged. Implicit measures
are self-report tests in which
the questions are intentionally designed to make the purpose of
the test less obvious and thus
less amenable to faking and social desirability biases. For
example, the test taker may be given
a few seemingly neutral situations and a list of thoughts,
feelings, and actions and be directed
to select the ones that most closely represent him or her in each
situation. This intentional
vagueness allows implicit measures to assess a construct more

accurately and comprehen-
sively (Bing, LeBreton, Davison, Migetz, & James, 2007;
LeBel, & Paunonen, 2011).
you83701_03_c03_067-102.indd 91 4/20/17 4:22 PM
http://www.outofservice.com/bigfive/
92
Honesty and Integrity
Organizations need to be able to identify individuals who are
likely to engage in dishonest
behaviors. Employee misconduct is more serious than distortion
of answers on a personality
test and can have a significant impact on both coworkers and
the organization as a whole.
Employee theft, embezzlement, and other forms of dishonesty
may cost American businesses
billions of dollars annually. According to the National White
Collar Crime Center, embezzle-
ment alone is estimated to cost companies as much as $90
billion each year (Bressler, 2009).
In the past, organizations used polygraph tests, but polygraph
test results are not always
accurate, and applicants may find them to be an invasion of
privacy. When the Employee Poly-
graph Protection Act was passed in 1988, most private
employers became unable to use these
tests, thus requiring them to find another way to identify this

trait.
A more valid way to measure employee dishonesty is with an
integrity test. Integrity tests fall
into two categories: overt integrity tests and personality-based
integrity tests. The first type
assesses an individual’s direct attitudes and actions toward theft
and employment dishon-
esty. Test items typically ask individuals to consider their
opinions about theft behaviors or
to think of their own dishonest behaviors. Sample questions
include “Is it OK to take money
from someone who is rich?” and “Have you taken illegal drugs
in the past year?”
Personality-based integrity tests typically contain disguised-
purpose, or covert, questions
that measure various personality factors—such as responsibility,
virtue, rule following,
excitement seeking, anger, hostility, and social conformity—
that are related to both produc-
tive and counterproductive employee behaviors. Although overt
integrity tests can predict
theft and other glaring forms of dishonesty, personality-based
integrity tests are able to pre-
dict behaviors that are more subtly or secretly dishonest, such
as absenteeism, insubordina-
tion, and substance abuse.
Vocational Interests
Unlike most of the tests we have discussed so far, vocational
interest inventories are designed
for career counseling and should not be used for employee
selection. In these inventories,
test takers respond to a series of statements pertaining to
various interests and preferences.

In theory, people who share the preferences and interests of
successful workers in a given
occupation should experience a high level of job satisfaction if
they pursue that line of work.
Vocational interest scores predict future occupational choices
reasonably well, with between
50% and 60% of test takers subsequently working in jobs
consistent with their vocational
interests (Hansen & Dik, 2005). However, even though people
are likely to get a job doing
something in which they are interested, research does not
support the notion that vocational
interest will always lead to high job performance or
satisfaction. In fact, interest congruence
is only weakly related to job satisfaction and does not
effectively predict either job or train-
ing performance (Tranberg, Slane, & Ekeberg, 1993; Schmidt &
Hunter, 1998). Keep in mind
that just because someone is interested in a certain job does not
mean he or she will be able
to perform it well.
One frequently used measure of vocational interest is the Strong
Interest Inventory (SII),
previously known as the Strong Vocational Interest Blank. The
SII is a self-report inventory
composed of 291 items divided into six themes: artistic,
conventional, social, realistic, enter-
prising, and investigative. The test requires 25 minutes to
complete and is administered and
you83701_03_c03_067-102.indd 92 4/20/17 4:22 PM

93
Section 3.6 Developing a Testing Program
scored by computer. The results help test takers identify
occupations, leisure activities, and
work preferences that match their interests. It possesses norms
for 211 different occupa-
tions. Because interests tend to remain stable throughout a
person’s life, it makes sense for
high school and college students to take an interest inventory
like the SII as they begin the
process of developing their professional careers.
Another example is the Armed Services Vocational Aptitude
Battery (ASVAB). This test
assesses a wide range of abilities that predict future success in
the military. More than 1 mil-
lion military applicants, high school students, and
postsecondary students take this test every
year. Test subcategories include reading comprehension, world
knowledge, science, math,
electronics, and mechanics.
Find Out for Yourself: Occupational Interests and Personal
Abilities
Complete the Interest Profiler available on the O*NET website.
O*NET Interest Profiler
What Did You Learn?
1. What did you find out about your occupational interests and

personal abilities?
2. Are you currently working in a job that aligns with your
occupational interests? Why or
why not?
3.6 Developing a Testing Program
Although creating, identifying, and using valid tests are
essential for any quality testing pro-
gram, organizations also face a number of administrative
decisions that can affect the pro-
gram’s overall success.
Deciding When Not to Test
Most important, an organization must decide whether to use
testing in the selection process.
Time and cost are extremely important considerations, and they
include test development
and design, necessary equipment, facility usage, and
administrator/evaluator training and
pay. Naturally, organizations will need to ensure that the
benefits of their testing program
outweigh the costs.
Sometimes, the level of employee productivity that a test is able
to identify is not high enough
to warrant the expense of a testing program. In these cases other
measures, such as improv-
ing employee training and development, can help advance new
hires’ performance. Alter-
nately, conducting a more careful review of applicants’
educational backgrounds or asking
more in-depth interview questions can provide greater insight
into the job-related skills and
abilities of potential employees, without having to add tests to
the preemployment process.

you83701_03_c03_067-102.indd 93 4/20/17 4:22 PM
http://www.mynextmove.org/explore/ip
94
Section 3.6 Developing a Testing Program
In summary, then, it is important for an organization both to
establish its employment needs
and to determine the potential benefits and expected costs of
testing programs before imple-
menting these useful selection tools.
Finding Quality Tests
Over the past few decades, the volume and variety of selection
tests have increased dramati-
cally. Unfortunately, the test publishers and consulting
companies that design them use vary-
ing levels of rigor and expertise. How, then, is an organization
to know which tests have met
acceptable standards of validity, reliability, and test norming?
Two periodicals, the Mental
Measures Yearbook and Tests in Print, are excellent reference
sources that provide descrip-
tions and expert reviews of a large number of tests, including
those used for selection. As you
learned in Chapter 1, you can also consult the original scientific
research that was conducted
to develop and validate various tests in order to assess their
quality and rigor.

Find Out for Yourself: Quality of Selection Methods
Research various selection methods with which you are familiar
or that you have undergone in
the past. Examples may include job applications, interviews,
reference checks, medical exams,
and referrals. Try to find validity and reliability scores for each
method. Which ones are more
valid? Which ones are more reliable? Why do you think that is
the case?
Test Administrators
A test’s usefulness depends in part on its proper administration,
scoring, and interpretation.
Organizations must not only train their testing administrators on
these key functions but also
establish quality controls and retrain administrators when
necessary. The requirements for
administrator qualifications and abilities vary from test to test,
so it is important for organi-
zations to be aware of the requirements outlined in each test’s
manual when selecting and
administering tests.
Addressing Ethical and Privacy Issues
Test security is a major concern for I/O psychologists and
organizations in order to maintain
high ethical testing practices. Tests and scores must remain
confidential. Questions should
never be published or distributed to the public, and tests should
only be administered to
qualified individuals.
Some applicants may view tests as an invasion of privacy,
particularly ones that assess per-
sonality and integrity or screen for drugs. As we have noted,

fear or mistrust in the selection
process can have adverse consequences. Organizations can
alleviate some of these concerns
by communicating to applicants the reasons for the test, how
test results will be used, and
how confidentiality will be maintained.
you83701_03_c03_067-102.indd 94 4/20/17 4:22 PM
95
Section 3.7 Psychological Testing: Special Issues
Testing People With Cultural and Language Differences
Differences in cultural backgrounds can shape how test takers
interpret and respond to test
questions. As the American workforce becomes more racially
and ethnically diverse, it is criti-
cal that organizations emphasize the use of culturally unbiased
tests. Moreover, English is
no longer the primary language for a growing number of
applicants. Naturally, applicants
who cannot fluently read or speak the language used in a test
will likely return artificially
low scores, not because they lack skills or knowledge but
simply because they cannot com-
prehend the instructions or understand the test questions. To
overcome language barriers,
a test can be translated into a variety of languages. However, a
common problem with this
approach is that expressions and phrases used in the test items

may be lost in translation,
which decreases the test’s validity generalization. Thus,
additional validation is necessary
to assess validity generalization whenever a test will be used for
a different racial or ethnic
group or translated into a different language, and the test may
need to be adapted accordingly.
Testing People With Disabilities
The ADA protects qualified individuals with disabilities from
discrimination in all areas of
employment, including employment testing. It can be
challenging for organizations to accom-
modate individuals with disabilities; they must aim to be
sensitive to the needs of the indi-
vidual while also maintain the integrity of the testing process
and avoid incurring undue
hardship. Test administrators require training to understand and
properly respond to accom-
modation requests. Examples of reasonable accommodations
include modifying test equip-
ment or seating, ensuring accessibility to the testing facility,
and providing a Braille or large-
print version of a test to visually impaired candidates.
Establishing Appeals and Retest Processes
Every applicant should have the opportunity to perform at his or
her best on a test. Despite
every intention to create this opportunity, sometimes it is
simply not possible. Equipment can
malfunction, the testing environment could be poor (noise,
temperature, bad odors, or even
disasters such as fire or flood), and candidates can be affected
by outside stressors (illness or
hospitalization of a family member, among others). With each
of these situations, candidates

could perform significantly better if given the opportunity to
retake the test under conditions
in which the negative influences are not present.
Test administrators should be trained to identify situations that
produce invalid results and
then implement a specific process to retest the candidate based
on information and guidance
provided by the test publisher. The organization should also
establish policies for handling
complaints regarding testing in order to resolve concerns fairly
and consistently.
3.7 Psychological Testing: Special Issues
Over the past decade, I/O psychologists have become interested
in a number of questions
related to employment testing. How do applicants feel about
being tested? Do these feel-
ings affect their perceptions of the company? Do online tests
show the same validity as
you83701_03_c03_067-102.indd 95 4/20/17 4:22 PM
96
paper-and-pencil tests, and how can organizations keep
applicants from cheating on them?
Recent research findings shed some light on these interesting
questions.

Applicants’ Reactions to Tests
Most research about testing has focused on technical aspects—
content, type, statistical mea-
sures, scoring, and interpretation—and not on social aspects.
The fact is, no matter how use-
ful and important tests are, applicants generally do not like
taking them. According to a study
by Schmit and Ryan (1997), 1 out of 3 Americans have a
negative perception of employment
testing. Another study found that students, after completing a
number of different selection
measures as part of a simulated application process, preferred
hiring processes that excluded
testing (Rosse, Ringer, & Miller, 1996). Additional research has
shown that applicants’ nega-
tive perceptions about tests significantly affect their opinions
about the organization giving
the test. It is important for I/O psychologists to understand how
and why this occurs so they
can adopt testing procedures that are more agreeable to test
takers and that reflect a more
positive organizational image.
Typically, negative reactions to tests lower applicants’
perceptions of several organizational
outcome variables, including organizational attraction (how
much they like a company), job
acceptance intentions (whether they will accept a job offer),
recommendation intentions
(whether they will tell others to patronize or apply for a job at
the company), and purchasing
intentions (whether they will shop at or do business with the
company). A study conducted
in 2006 found that “[e]mployment tests provide organizations
with a serious dilemma [: . . .]

how can [they] administer assessments in order to take
advantage of their predictive capa-
bilities without offending the applicants they are trying to
attract?” (Noon, 2006, p. 2). With
the increasing war for top-quality talent, organizations must
develop recruitment and selec-
tion tools that attract, rather than drive away, highly qualified
personnel.
What Can Be Done?
I/O psychologists have addressed this dilemma by identifying
several ways to improve testing
perceptions. One is to increase the test’s face validity. For
example, applicants typically view
cognitive ability tests negatively, but their reactions change if
the test items are rewritten
to reflect a business-situation perspective. Similarly,
organizations can use test formats that
already tend to be viewed positively, such as assessment centers
and work samples, because
they are easily relatable to the job (Smither et al., 1993).
Providing applicants with information about the test is another,
less costly way to improve
perceptions. Applicants can be told what a test is intended to
measure, why it is necessary,
who will see the results, and how these will be used (Ployhart &
Hayes, 2003). Doing so should
lead applicants to view both the organization and its treatment
of them during the selection
process more favorably (Gilliland, 1993). Noon’s 2006 study
investigated applicants’ reac-
tions to completing cognitive ability and personality tests as
part of the selection process for
a data-processing position. Half of the applicants received
detailed information explaining

you83701_03_c03_067-102.indd 96 4/20/17 4:22 PM
97
the testing process, while the other half received the standard
information normally provided
by the company. Applicants who received the detailed
information found the company more
attractive and were more likely to recommend it to others, even
if they did not receive a job
offer. Although this tactic is not always used during testing
process, providing detailed infor-
mation about a test is a quick, cost-effective, and practical way
for organizations to improve
applicant test perceptions (Lounsbury, Bobrow, & Jensen,
1989).
Online Administration
Over the past decade, online testing has increased dramatically.
Also referred to as unproctored
Internet testing, the online test has replaced many traditional
paper-and-pencil alternatives,
and almost every type of test is now available
through online administration. Online tests have
a number of advantages over paper-and-pencil
tests. They can be taken by anyone from nearly
anywhere in the world, increasing the pool of
applicants for a job. Brick-and-mortar testing

673Foundations of Psychological TestingNoel Hendrick.docx

673Foundations of Psychological TestingNoel Hendrick.docx

Recommended

Recommended

More Related Content

Similar to 673Foundations of Psychological TestingNoel Hendrick.docx

Similar to 673Foundations of Psychological TestingNoel Hendrick.docx (20)

More from alinainglis

More from alinainglis (20)

Recently uploaded

Recently uploaded (20)

673Foundations of Psychological TestingNoel Hendrick.docx