This presentation reviews research into inflated test scores and their possible causes. It summarizes the work of Dr. John J. Cannell, who investigated unexpectedly high achievement scores across US states in the 1980s. Cannell suspected outdated or invalid norms, lax security, and deliberate educator manipulation. Researchers at CRESST suspected outdated or invalid norms and high stakes that induce teaching to the test. The presentation tests the high-stakes explanation against Cannell's data and later studies, and concludes that score inflation requires means and opportunity: tests that cannot be traced to individual jurisdictions or schools (such as NAEP), and tests with tight security and ample item rotation, resist inflation.
1. The Source of Lake Wobegon
By Richard P. Phelps
(c)2007-2012, Richard P. Phelps
2. “Welcome to Lake Wobegon, where all the women
are strong, all the men are good-looking, and all
the children are above average.”
- Garrison Keillor, A Prairie Home Companion
3. John J. Cannell, M.D.
• Residency in rural West Virginia, 1980s
• Surprised by claims that the state and school district
scored “above average” on national tests
• Investigated, found that all 50 states claimed to
be “above average”
4. Cannell’s suspects
• Outdated or invalid norms
• Lax security
• Deliberate educator manipulation
– Showing test items to teachers beforehand
– Keeping test forms around for years
– Misleading reporting, etc.
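The "outdated norms" suspect can be illustrated with a small calculation. The sketch below is not from Cannell's reports; it assumes normally distributed scores, and the scale (mean 50, standard deviation 10) and the size of the shift since norming are hypothetical values chosen only for illustration. It shows how a population whose true average has risen since the norming year can have well over half of its students "above average" against the old norm, with no manipulation at all.

```python
from statistics import NormalDist


def share_above_old_median(current_mean, norm_mean=50.0, sd=10.0):
    """Fraction of today's test takers who score above the median
    of an older norming sample (all values are hypothetical)."""
    current = NormalDist(mu=current_mean, sigma=sd)
    # Under the normal assumption, the norming sample's median equals its mean.
    return 1.0 - current.cdf(norm_mean)


# Hypothetical shifts in the true population mean since the norming year.
for shift in (0.0, 2.0, 5.0):
    share = share_above_old_median(50.0 + shift)
    print(f"shift of {shift:.0f} points: {share:.1%} above the old 'average'")
```

With no shift, exactly half of test takers are above the old median; any upward shift pushes that share above one half.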
5. CRESST’s suspects
• Outdated or invalid norms
• High stakes, that induce “teaching to the test”
(i.e., test coaching)
(This hypothesis now generally accepted as accurate
among K-12 education researchers)
6. • “We know that tests that are
used for accountability tend to
be taught to in ways that
produce inflated scores.”
- Dan Koretz, CRESST, 1992
• “Corruption of indicators is a
continuing problem where tests
are used for accountability or
other high-stakes purposes.”
- Robert Linn, CRESST, 2000
7. Explanations for Spuriously High Achievement Scores
From Responses to Cannell in Educational Measurement:
Issues and Practice (1988)
Authors: A B C D E F
Inadequate norms X X X X
Outdated norms X X X X X
Curriculum alignment X X X
High stakes pressure X X
Teaching the test X X X
Incomplete population tested X X X
Inappropriate comparisons X X
8. More left-out-variable bias
• Linn (2000) cites higher gains on Title I pre-post testing
over 9 months than over 12 as evidence of inflation
– Does not consider 3 months of forgetting
• CRESST study (1991) in one school district also cited as
evidence of inflation
– Does not consider curricular misalignment, motivation, test
security, variation in stakes
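The forgetting objection to the Title I comparison can be made concrete with simple arithmetic. The sketch below assumes a linear model (constant learning per school month, constant loss per summer month); the rates are hypothetical and not taken from Linn (2000). Under those assumptions, a 9-month fall-to-spring gain exceeds a 12-month spring-to-spring gain even when neither score is inflated.

```python
def measured_gain(monthly_learning, school_months, monthly_summer_loss, summer_months):
    """Net score gain between two test dates under a simple linear model.
    All rates are hypothetical, in arbitrary score points per month."""
    return monthly_learning * school_months - monthly_summer_loss * summer_months


# 9-month interval: fall pretest to spring posttest, no summer in between.
fall_to_spring = measured_gain(1.0, 9, 0.5, 0)

# 12-month interval: spring to spring, which spans 3 summer months of forgetting.
spring_to_spring = measured_gain(1.0, 9, 0.5, 3)

print(f"9-month gain:  {fall_to_spring}")
print(f"12-month gain: {spring_to_spring}")
```

Any positive summer loss produces the same ordering, so the gap between the two intervals is not by itself evidence of inflation.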
9. Examining the high-stakes-cause-score-inflation hypothesis
• “Strong” version of hypothesis:
– There are no rival hypotheses
• “Weak” version of hypothesis:
– More inflation in grades closer to stakes
– Test coaching increases scores
– Correlation between stakes and inflation
11. Testing the strong hypothesis 1
State rotated items?         yes    no
Average “score inflation”    9.3    10.0

Level of test security       lax    med    tight
Average “score inflation”    10.6   9.7    8.9
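The averages on this slide can be compared directly. The sketch below uses only the five figures reported here; the variable names are illustrative. It checks the direction of each difference and nothing more, since the slide gives no units, sample sizes, or variances.

```python
# Average "score inflation" as reported on the slide.
by_item_rotation = {"yes": 9.3, "no": 10.0}
by_security_level = {"lax": 10.6, "med": 9.7, "tight": 8.9}

# Gap between states that did not rotate items and states that did.
rotation_gap = round(by_item_rotation["no"] - by_item_rotation["yes"], 1)

# Gap between the laxest and tightest security levels.
security_gap = round(by_security_level["lax"] - by_security_level["tight"], 1)

# Does average inflation fall at each step as security tightens?
falls_with_security = (
    by_security_level["lax"] > by_security_level["med"] > by_security_level["tight"]
)

print(f"no rotation vs rotation: +{rotation_gap}")
print(f"lax vs tight security:   +{security_gap}")
print(f"inflation falls as security tightens: {falls_with_security}")
```

Both differences point the same way: less rotation and laxer security go with more measured inflation. That is the pattern a rival explanation to high stakes would predict.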
12. Testing the strong hypothesis 2
Moreover…
Cannell found score inflation in elementary school
tests in dozens of states – none of those tests
had high stakes.
Cannell also found score inflation in secondary
school tests in dozens of states – only one had
high stakes.
13. Test Security in South Carolina: score-inflated test
Cannell, 1989, p.89:
“Unlike their other two tests, teachers are allowed to look at
test booklets, teachers may obtain test booklets before
the day of testing, booklets are not sealed, and testing is
not routinely monitored by state officials. Outside test
proctors are not used, test questions have not been
rotated every year, and answer sheets have not been
scanned for suspicious erasures or analyzed for cluster
variance. There are no state regulations that govern test
security and test administration for norm-referenced
testing done independently in the local school districts.”
14. Test Security in South Carolina: two high-stakes tests
Cannell, 1989, p.89:
“South Carolina also administers a graduation exam and a
criterion referenced test, both of which have significant
security measures. Teachers are not allowed to look at
either of these two test booklets, teachers may not
obtain booklets before the day of testing, the graduation
test booklets are sealed, testing is routinely monitored by
state officials, special education students are generally
included in all tests used in South Carolina unless their
IEP recommends against testing, outside test proctors
administer the graduation exam, and most test questions
are rotated every year on the criterion referenced test.”
15. Tomāto Tomăto
Is the high-stakes-cause-test-score-inflation
hypothesis caused by semantic distortion?
Tests are “high-stakes” when:
– teachers feel judged by the results?
– parents receive reports of their child’s test scores?
– test scores are widely reported in the newspapers?
16. Standards for Educational and Psychological Testing:
“High-stakes test. A test used to provide results that have
important, direct consequences for examinees,
programs, or institutions involved in the testing.” (p.176)
“Low-stakes test. A test used to provide results that have
only minor or indirect consequences for examinees,
programs, or institutions involved in the testing.” (p.178)
17. Shortcomings of Cannell’s studies
• Responses to his survey of state test security
practices do not always specify which practices
apply to which tests in states that administered more
than one
• He calculated score trends for NRTs and, with one
exception, not for standards-based tests
18. Testing the weak hypothesis 1
Q. Do grade levels closer to high-stakes event
(e.g., high school graduation exam) show
greater score increases?
Yes, in “washback” studies of: John Bishop (1997),
Linda Winfield (1990), Norman Frederiksen (1994)
No, in Cannell’s data
19. Q. Why disparate results?
A. Low-stakes comparison tests differed
Washback studies used untraceable,
sample-based tests, administered
with tight security (TIMSS, NAEP)
Cannell used traceable NRTs
administered with lax security
20. Testing the weak hypothesis 2
Q. Is there direct evidence that test coaching raises test
scores?
A. No, see Powers (1993), Becker (1990), Powers & Rock
(1994), Camara (2001), etc.
21. Testing the weak hypothesis 3
Perhaps low-stakes tests
are subject to score
inflation where a jurisdiction
administers a separate
high-stakes test, thereby
creating a general
environment of high-stakes
pressure?
23. [Figure: scatter plot of the amount of “inflation” (in percentile points; vertical axis, -5 to 25) against average NAEP percentile score (horizontal axis, 40 to 90)]
Pink squares: states with a high-stakes test
Blue diamonds: states without any high-stakes test
24. Two types of tests resist score inflation:
1. Those untraceable to individual jurisdictions or schools
(no incentive to cheat)
2. Those with tight security and ample item rotation (no
opportunity to cheat)
Traceable tests lacking security and item
rotation are candidates for score inflation
25. Artificial test score gains (score inflation) are
caused by neglect, incompetence, or
deliberate educator manipulation, but always
require means and opportunity.
• Motive is only present with
traceable tests.
• Means and opportunity exist
only in the absence of
security measures and item
rotation.