4. Reliability
Extent to which a score from a test or from
an evaluation is consistent and free from
error.
Determined in four ways: test-retest,
alternate-forms, internal, and scorer
reliability.
5. Test-retest reliability
Extent to which repeated administration of
the same test will achieve similar results.
Scores from the first administration of the
test are correlated with scores from the
second to determine whether they are
similar.
Temporal stability: consistency of test
scores across time.
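A minimal Python sketch (scores hypothetical): the test-retest coefficient is simply the correlation between the two administrations.
```python
# Test-retest reliability as a Pearson correlation (Python 3.10+).
# Scores are hypothetical; each list holds the same people's scores.
from statistics import correlation

time1 = [82, 75, 90, 68, 77, 85]  # first administration
time2 = [80, 78, 88, 70, 75, 86]  # second administration

r_test_retest = correlation(time1, time2)
print(f"Test-retest reliability: {r_test_retest:.2f}")
```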
6. Test-retest reliability
Time interval should be long enough so
that specific test answers have not been
memorized, but short enough so that the
individual has not changed significantly
(e.g., an administration of a personality
inventory).
Typical time intervals range from 3 days to
3 months.
7. Test-retest reliability
Longer interval = lower reliability coefficient.
The typical test-retest reliability coefficient
for tests in use is .86 (Hood, 2001).
8. Alternate-Forms Reliability
Extent to which two forms of the same test
are similar
Counterbalancing- method of controlling
for order effects by giving half of a sample
Test A first, followed by Test B, and giving
the other half of the sample Test B first,
followed by Test A.
9. Alternate-Forms Reliability
Scores on forms A and B are then
correlated to determine whether they are
similar. If yes, then the test has form
stability
Form Stability- extent to which the scores
on two forms of a test are similar.
10. Alternate-Forms Reliability
Why use this method? To prevent
cheating.
Time interval should be as short as
possible.
The average correlation between alternate
forms of tests is .89 (Hood, 2001).
11. Internal Reliability
Internal consistency- extent to which similar
items are answered in similar ways; measures
item stability.
Item stability- extent to which responses to the
same test items are consistent.
Longer test = higher internal consistency
(e.g., a test with 5 items vs. a test with 20
items).
12. Internal Reliability
Item homogeneity- extent to which test
items measure the same construct.
The more homogeneous the items, the
higher the internal consistency.
3 methods to determine internal
consistency: split-half, coefficient alpha,
and K-R 20 (Kuder-Richardson formula
20)
13. Split-Half method
Form of internal reliability in which the
consistency of item responses is
determined by comparing scores on half of
the items with scores on the other half of
the items.
Odd-numbered items in one group, even-numbered
items in another.
Scores of the 2 groups are then correlated
14. Split-Half method
Spearman-Brown prophecy formula- used
to correct reliability coefficients resulting
from the split-half method, because splitting
a test in half shortens it, and shorter tests
have lower reliability.
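A minimal Python sketch of the split-half method with the correction (item responses hypothetical):
```python
# Split-half reliability with the Spearman-Brown correction.
# Rows = people, columns = items; all responses hypothetical.
from statistics import correlation

responses = [
    [4, 3, 5, 4, 2, 3, 4, 5],
    [2, 2, 3, 1, 2, 2, 3, 2],
    [5, 4, 5, 5, 4, 5, 4, 5],
    [3, 3, 2, 3, 3, 2, 3, 3],
    [1, 2, 1, 2, 1, 1, 2, 1],
]

# Odd-numbered items in one half, even-numbered items in the other.
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]
r_half = correlation(odd_scores, even_scores)

# Spearman-Brown: correct for the halved test length.
r_full = 2 * r_half / (1 + r_half)
print(f"Split-half r = {r_half:.2f}, corrected r = {r_full:.2f}")
```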
15. Cronbach’s Coefficient Alpha
A statistic used to determine internal
reliability of tests that use interval or ratio
scales.
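A minimal Python sketch (responses hypothetical), using the standard form alpha = k/(k-1) x (1 - sum of item variances / variance of total scores):
```python
# Cronbach's coefficient alpha for interval/ratio items.
# Rows = people, columns = items; all responses hypothetical.
from statistics import pvariance

responses = [
    [4, 3, 5, 4],
    [2, 2, 3, 1],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

k = len(responses[0])            # number of items
items = list(zip(*responses))    # one tuple of scores per item
item_vars = sum(pvariance(item) for item in items)
total_var = pvariance([sum(row) for row in responses])

alpha = k / (k - 1) * (1 - item_vars / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```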
16. K-R 20
Statistic used to determine internal
reliability of tests that use items with
dichotomous answers (yes/no, true/false).
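A minimal Python sketch (answers hypothetical); K-R 20 is alpha with each item's variance replaced by p x q:
```python
# Kuder-Richardson formula 20 for dichotomous (0/1) items.
# 1 = yes/true/correct, 0 = no/false/incorrect; data hypothetical.
from statistics import pvariance

responses = [
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
]

k = len(responses[0])   # number of items
n = len(responses)      # number of test-takers

# p = proportion answering the item correctly, q = 1 - p
pq_sum = sum((sum(item) / n) * (1 - sum(item) / n)
             for item in zip(*responses))
total_var = pvariance([sum(row) for row in responses])

kr20 = k / (k - 1) * (1 - pq_sum / total_var)
print(f"K-R 20: {kr20:.2f}")
```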
17. Scorer reliability
Extent to which two people scoring a test
agree on the test score, or extent to which
a test is scored correctly.
When human judgement of performance is
involved, scorer reliability is discussed in
terms of interrater reliability.
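A minimal sketch of one simple interrater index, percent agreement (ratings hypothetical; chance-corrected indices such as Cohen's kappa are also common):
```python
# Percent agreement between two raters scoring the same tests.
# Ratings are hypothetical.
rater_a = ["pass", "fail", "pass", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "pass", "fail", "fail", "pass"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = agreements / len(rater_a)
print(f"Percent agreement: {percent_agreement:.0%}")  # 83% here
```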
18. Evaluating the reliability of a test
Consider the magnitude of the reliability
coefficient and the people who will be
taking the test.
19. Validity
Degree to which inferences from test
scores are justified by the evidence.
Reliability has a necessary but not
sufficient relationship with validity.
5 common strategies to investigate validity
of scores on a test: content, criterion,
construct, face, and known-group.
20. Content Validity
Extent to which tests or test items sample
the content that they are supposed to
measure.
21. Criterion Validity
Extent to which a test score is related to
some measure of job performance.
Criterion- measure of job performance,
such as attendance, productivity, or a
supervisor rating.
Criterion validity is established using one
of two research designs: concurrent or
predictive.
22. Criterion Validity
Concurrent validity- correlates test scores
with measures of job performance for
employees currently working for an
organization.
Predictive Validity- test scores of
applicants are compared at a later date
with a measure of job performance.
23. Criterion Validity
Concurrent design is weaker than predictive
because of the homogeneity of performance
scores.
Restricted range- narrow range of performance
scores that makes it difficult to obtain a significant
validity coefficient
Validity generalization (VG)- extent to which
inferences from test scores obtained in one
organization can be applied to another
organization
24. Criterion Validity
Research has indicated that a test valid for
a job in one organization is also valid for
the SAME job in another organization
Synthetic validity- form of VG in which
validity is inferred on the basis of a match
between job components and tests
previously found valid for those job
components.
25. Criterion Validity
Key difference between VG and SV is that
in VG we are trying to generalize the
results of studies conducted on a
particular job to the same job at another
organization. SV tries to generalize the
results of studies of different jobs to a job
that shares a common component
26. Construct Validity
Extent to which a test actually measures
the construct that it purports to measure.
Construct validity is concerned with
inferences about test scores; content
validity is concerned with inferences about
test construction.
27. Construct Validity
Construct validity is usually determined by
correlating scores on a test with scores
from other tests.
Convergent validity- tests that measure
the same construct
Discriminant validity- tests that do not
measure the same construct
28. Construct Validity
Known-group validity- form of validity in
which test scores from two contrasting
groups “known” to differ on a construct are
compared.
If known groups do not differ on test
scores, test is invalid.
If known groups differ, validity is still
unknown.
29. Face validity
Extent to which a test appears to be valid
Face-valid tests result in high levels of test-taking
motivation.
One downside is that it is tempting for
test-takers to fake their answers.
Barnum statements- statements that are so
general that they can be true of almost anyone.
30. MMY
Mental measurements yearbook- book
containing information about the reliability
and validity of various psychological tests.
31. Cost-efficiency
Choose the cheaper and easier to
administer test without compromising
validity and reliability.
Computer-adaptive testing (CAT)- type of
test taken on a computer in which the
computer adapts the difficulty of questions
asked to the test-taker’s success in
answering previous questions.
32. Taylor-Russell Tables
Series of tables based on the selection ratio,
base rate, and test validity that yield information
about the percentage of future employees who
will be successful if a particular test is used.
A test will be useful to an organization if 1) test is
valid, 2) organization can be selective in its
hiring because it has more applicants than
openings, and 3) there are plenty of current
employees who are not performing well, thus
there is room for improvement.
33. Taylor-Russell Tables
First piece of information needed is a test’s
criterion validity coefficient which can be
obtained in two ways.
The best way is to conduct a criterion validity
study in which test scores are correlated with
some measure of job performance.
The second way is to use VG.
The higher the validity coefficient, the greater the
possibility the test will be useful.
34. Taylor-Russell Tables
Second piece of information that must be
obtained is the Selection ratio.
Selection ratio- percentage of applicants an
organization hires.
Formula: SR = number hired / number of
applicants.
Lower selection ratio= greater potential
usefulness of the test.
35. Taylor-Russell Tables
Final piece of information needed is the
base rate of current performance
Base rate- percentage of current
employees who are considered
successful.
Base rate can be obtained in two ways.
36. Taylor-Russell Tables
The first method is simple but less accurate:
split employees into two equal groups
based on their scores on some criterion.
Base rate using this method is always .50
because one half of the employees are
considered satisfactory.
37. Taylor-Russell Tables
Second method is to choose a criterion
measure score above which all employees
are considered successful.
After validity, selection ratio, and base rate
figures have been obtained, consult the
Taylor-Russell tables.
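A minimal Python sketch of assembling the three inputs (all figures hypothetical); the final step is a lookup in the published tables:
```python
# Gather the three inputs to the Taylor-Russell tables.
# All figures are hypothetical.
validity = 0.40             # criterion validity coefficient (study or VG)
selection_ratio = 25 / 100  # SR = number hired / number of applicants
base_rate = 30 / 60         # successful employees / current employees

print(f"r = {validity:.2f}, SR = {selection_ratio:.2f}, "
      f"base rate = {base_rate:.2f}")
# With these three values, consult the Taylor-Russell tables for the
# expected proportion of successful future employees.
```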
39. Proportion of correct decisions
Utility method that compares the
percentage of times a selection decision
was accurate with the percentage of
successful employees.
Easier to do, but less accurate than Taylor-
Russell tables.
Only information needed is employee test
scores and the scores on the criterion
40. Proportion of correct decisions
The two scores are graphed on a chart.
Lines are drawn from the point on the Y-axis
(criterion score) that represents a
successful employee, and from the point on
the X-axis (test score) that represents the
lowest score of a hired applicant.
41. Proportion of correct decisions
Quadrant I- employees who scored poorly on
the test and were successful on the job
Quadrant II- employees who scored well on the
test and were successful on the job
Quadrant III- employees who scored well on the
test yet did poorly on the job
Quadrant IV- employees who scored low on the
test and did poorly on the job.
42. Proportion of correct decisions
To estimate a test’s effectiveness, the
number of points in each quadrant is
totaled, and the following formula is used:
points in Quadrants II and IV / total points
in all quadrants.
The quotient represents the percentage of
time that we expect to be accurate in
making a selection decision in the future.
43. Proportion of correct decisions
To determine whether this is an
improvement, we use the following
formula: points in Quadrants I and II / total
points in all quadrants.
If the percentage from the first formula is higher
than that from the second, the proposed test
should increase selection accuracy. If not,
stick with the selection method currently used.
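A minimal Python sketch of both formulas (scores and cutoffs hypothetical):
```python
# Proportion of correct decisions from the four quadrants.
# Each pair is (test score, criterion score); all values hypothetical.
employees = [(55, 80), (90, 85), (88, 45), (40, 30), (70, 75),
             (30, 82), (85, 90), (92, 50), (35, 40), (60, 88)]

test_cutoff = 50       # lowest test score of a hired applicant
criterion_cutoff = 60  # criterion score defining a successful employee

q1 = sum(t < test_cutoff and c >= criterion_cutoff
         for t, c in employees)   # low test, successful
q2 = sum(t >= test_cutoff and c >= criterion_cutoff
         for t, c in employees)   # high test, successful
q3 = sum(t >= test_cutoff and c < criterion_cutoff
         for t, c in employees)   # high test, unsuccessful
q4 = sum(t < test_cutoff and c < criterion_cutoff
         for t, c in employees)   # low test, unsuccessful

total = q1 + q2 + q3 + q4
with_test = (q2 + q4) / total   # expected accuracy using the test
baseline = (q1 + q2) / total    # accuracy of current selection
print(f"With test: {with_test:.0%}, baseline: {baseline:.0%}")
```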
44. Lawshe Tables
Uses base rate, test validity, and applicant
percentile on a test to determine the
probability of future success for that
applicant.
45. Brogden-Cronbach-Gleser Utility
formula
Method of ascertaining the extent to which an organization will
benefit from the use of a particular selection system.
To use this formula, 5 items of information must be known.
Number of employees hired per year (n)
Average tenure (t)- average amount of time employees in the
position tend to stay with the company. Number is computed by
using information from company records to identify the time that
each employee in that position stayed with the company. Number of
years of tenure for each employee is then summed and divided by
the total number of employees.
46. Brogden-Cronbach-Gleser Utility
formula
Test validity (r)- this figure is the criterion
validity coefficient that was obtained
through either a validity study or VG.
Standard deviation of performance in
dollars (SDy) – commonly estimated as
40% of the average annual salary; the
salaries of current employees in the
position in question should be averaged.
47. Brogden-Cronbach-Gleser Utility
formula
Mean standardized predictor score of selected
applicants (m)- can be obtained in one of two
ways. 1) Obtain the average score on the selection
test for both the applicants who are hired and
the applicants who are not hired. Average test
score of the nonhired applicants is subtracted
from the average test score of the hired
applicants. Difference is divided by the standard
deviation of all test scores.
48. Brogden-Cronbach-Gleser Utility
formula
2) compute the proportion of applicants who are
hired and then use a conversion table to convert
the proportion into a standard score. This
method is used when an organization plans to
use a test and knows the probable selection
ratio based on previous hirings, but does not
know the average test scores because the
organization has never used the test.
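A minimal Python sketch, assuming the usual form of the formula, savings = (n)(t)(r)(SDy)(m) minus the cost of testing; all figures are hypothetical and m is computed with the first method:
```python
# Brogden-Cronbach-Gleser utility estimate (assumed form:
# savings = n * t * r * SDy * m - cost of testing applicants).
# All figures are hypothetical.
from statistics import mean, pstdev

n = 10        # employees hired per year
t = 2.0       # average tenure in years
r = 0.40      # criterion validity coefficient (study or VG)

salaries = [38_000, 42_000, 40_000, 45_000, 35_000]
sdy = 0.40 * mean(salaries)   # SDy as 40% of the average annual salary

# m (method 1): standardized difference between hired and nonhired
# applicants' average test scores.
hired = [78, 85, 90, 82, 88]
nonhired = [55, 60, 48, 65, 52]
m = (mean(hired) - mean(nonhired)) / pstdev(hired + nonhired)

cost_of_testing = 10 * 25     # e.g., 10 applicants at $25 per test
savings = n * t * r * sdy * m - cost_of_testing
print(f"Estimated savings: ${savings:,.0f}")
```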
49. Determining the fairness of a test
Measurement bias- group differences in test
scores that are unrelated to the construct being
measured
Adverse impact- employment practice that
results in members of a protected class being
negatively affected at a higher rate than
members of the majority class. Adverse impact
is usually determined by the four-fifths rule.
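A minimal Python sketch of a four-fifths rule check (counts hypothetical):
```python
# Four-fifths rule: adverse impact is suggested when the protected
# class's selection rate is below 80% of the highest group's rate.
# All counts are hypothetical.
rate_protected = 10 / 50   # hired / applicants, protected class
rate_majority = 40 / 100   # hired / applicants, majority class

ratio = rate_protected / rate_majority
print(f"Impact ratio: {ratio:.2f} ->",
      "possible adverse impact" if ratio < 0.80
      else "no adverse impact indicated")
```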
50. Determining the fairness of a test
Predictive bias- situation in which the predicted level of
job success falsely favors one group over another
Single-group validity- characteristic of a test that
significantly predicts a criterion for one class of people
but not for another
Differential validity- characteristic of a test that
significantly predicts a criterion for two groups, such as
both minorities and nonminorities, but predicts
significantly better for one of the two groups.
51. Making the hiring decision
Multiple regression- statistical procedure in
which the scores from more than one criterion-valid
test are weighted according to how well
each test score predicts the criterion
Linear approaches to hiring usually take one of
four forms: unadjusted top-down selection, rule
of three, passing scores, or banding.
52. Unadjusted top-down selection
Selecting applicants in straight rank order of
their test scores.
Advantage: organization will gain the most utility
(Schmidt, 1991)
Disadvantage: can result in high levels of
adverse impact and it reduces an organization’s
flexibility to use nontest factors such as
references or organizational fit.
53. Unadjusted top-down selection
Compensatory approach- method of making
selection decisions in which a high score on one
test can compensate for a low score on another
test.
To determine whether a score on one test can
compensate for a score on another, multiple
regression is used in which each test score is
weighted according to how well it predicts the
criterion.
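A minimal Python sketch of regression-weighted test scores (scores and criterion values hypothetical):
```python
# Weight two selection tests by how well each predicts the criterion,
# then combine them so a high score on one can offset a low score on
# the other. All data are hypothetical.
import numpy as np

test1 = np.array([70, 85, 60, 90, 75, 80])
test2 = np.array([65, 80, 70, 85, 60, 90])
criterion = np.array([72, 88, 64, 92, 70, 86])  # e.g., performance ratings

# Least-squares fit with an intercept column.
X = np.column_stack([np.ones_like(test1), test1, test2])
intercept, w1, w2 = np.linalg.lstsq(X, criterion, rcond=None)[0]

# Predicted criterion for a new applicant (compensatory: 68 on test 1
# is offset by 92 on test 2).
predicted = intercept + w1 * 68 + w2 * 92
print(f"Weights: {w1:.2f}, {w2:.2f}; predicted criterion: {predicted:.1f}")
```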
54. Rule of three
Variation on top-down selection in which
the names of the top three applicants are
given to a hiring authority who can then
select any of the three.
55. Passing scores
Minimum test score that an applicant must
achieve to be considered for hire. A means
for reducing adverse impact and
increasing flexibility.
Multiple-cutoff strategy – selection strategy
in which applicants must meet or exceed
the passing score on more than one
selection test.
56. Passing scores
Multiple-hurdle approach – selection
practice of administering one test at a time
so that applicants must pass that test
before being allowed to take the next test.
57. Banding
Statistical technique based on the
standard error of measurement that allows
similar test scores to be grouped.
Standard error (SE) – number of points
that a test score could be off due to test
unreliability
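A minimal Python sketch (scores, SD, and reliability hypothetical; the 1.96 x sqrt(2) band width is one common convention, assumed here):
```python
# Band test scores using the standard error of measurement:
# SE = SD * sqrt(1 - reliability). Scores within the band of the top
# score are treated as equivalent. All figures are hypothetical.
import math

scores = [94, 93, 91, 88, 85, 82, 80]   # sorted high to low
sd = 10.0            # standard deviation of test scores
reliability = 0.90   # e.g., a test-retest or alpha coefficient

se = sd * math.sqrt(1 - reliability)
band_width = 1.96 * se * math.sqrt(2)   # assumed convention

top = scores[0]
band = [s for s in scores if top - s <= band_width]
print(f"SE = {se:.2f}; scores treated as equivalent: {band}")
```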