The effect of testing on student achievement: 1910-2010

Richard P. PHELPS

© 2012, Richard P. PHELPS
International Test Commission, 8th Conference, Amsterdam

Meta-analysis

• A method for
summarizing a large
research literature,
with a single,
comparable measure.

The effect of testing on student
achievement

• 12-year-long study

• analyzed close to 700 separate studies,
and more than 1,600 separate effects

• 2,000 other studies were reviewed and
found incomplete or inappropriate

• for lack of time and money,
hundreds of other studies will not be
reviewed

Looking for studies to include in the
meta-analyses

1. Included only those studies that found an effect from
testing on student achievement or on teacher
instruction…

Studies included in the meta-analyses

2. …when:
• a test is newly introduced, or newly removed
• quantity of testing is increased or reduced
• test stakes are introduced or increased, or removed or
reduced

Studies included in the meta-analyses

3. …plus previous research summaries (e.g.)

• Kulik, Kulik, Bangert-Drowns, &
Schwalb (1983-1991) on:
– mastery testing,
– frequency of testing, and
– programs for high-risk university
students
• Basol & Johanson (2009) on testing
frequency
• Jaekyung Lee (2007) on cross-state
studies
• W.J. Haynie (2007) in career-tech ed

Number of studies of effects,
by methodology type

Methodology type                                 Number of studies   Number of effects
Quantitative                                     177                 640
Surveys and public opinion polls (US & Canada)   247                 813
Qualitative                                      245                 245
TOTAL                                            669                 1,698

Effect size: Cohen’s d

$d = (Y_E - Y_C) / S_{pooled}$

$Y_E$ = mean, experimental group
$Y_C$ = mean, control group
$S_{pooled}$ = pooled standard deviation
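
A minimal sketch of this calculation in Python. The slide does not say how the pooled standard deviation is computed, so the sketch assumes the common definition (sample variances weighted by their degrees of freedom). The function names and the scores are hypothetical, invented for illustration.

```python
import math
from statistics import mean, variance


def pooled_sd(experimental, control):
    """Pooled standard deviation of two independent samples.

    Assumption: variances are weighted by degrees of freedom (n - 1);
    the slide does not specify which pooling formula was used.
    """
    n_e, n_c = len(experimental), len(control)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = ((n_e - 1) * variance(experimental)
                  + (n_c - 1) * variance(control)) / (n_e + n_c - 2)
    return math.sqrt(pooled_var)


def cohens_d(experimental, control):
    """Standardized mean difference: (mean_E - mean_C) / pooled SD."""
    return (mean(experimental) - mean(control)) / pooled_sd(experimental, control)


# Hypothetical scores, not taken from any study in the meta-analysis.
tested = [12, 14, 16]
untested = [11, 13, 15]
d = cohens_d(tested, untested)
print(d)
```

Both hypothetical groups have the same spread, so the pooled standard deviation equals each group's own standard deviation, and d is simply the one-point mean difference expressed in those units.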

Effect size: Other formulae

$d = t \sqrt{(n_1 + n_2) / (n_1 n_2)}$
$d = 2r / \sqrt{1 - r^2}$
$d = [(Y_{E,post} - Y_{E,pre}) - (Y_{C,post} - Y_{C,pre})] / S_{pooled,post}$
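
The three conversions can be sketched as small Python functions. The function and parameter names are hypothetical. The pre-post version assumes the conventional sign, that is, the experimental group's gain minus the control group's gain, so that a larger gain for the tested group yields a positive d.

```python
import math


def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to d."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))


def d_from_r(r):
    """Convert a correlation coefficient r to d (requires |r| < 1)."""
    return 2 * r / math.sqrt(1 - r ** 2)


def d_from_gains(e_pre, e_post, c_pre, c_post, s_pooled_post):
    """Difference in pre-to-post gains, scaled by the pooled posttest SD.

    Assumption: sign convention is experimental gain minus control gain.
    """
    return ((e_post - e_pre) - (c_post - c_pre)) / s_pooled_post


# Hypothetical inputs, for illustration only.
print(d_from_t(2.0, 50, 50))
print(d_from_r(0.6))
print(d_from_gains(e_pre=50, e_post=60, c_pre=50, c_post=54, s_pooled_post=12))
```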

Effect size: Interpretation

• d between 0.25 and 0.50: weak effect
• d between 0.50 and 0.75: medium effect
• d more than 0.75: strong effect
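
These bands can be written as a small lookup, sketched here in Python. The function name is hypothetical, and three choices are assumptions the slide leaves open: the boundary value 0.50 is placed in the medium band, 0.75 itself stays in the medium band (only values above it count as strong), and values below 0.25 get a label of their own.

```python
def interpret_effect_size(d):
    """Map an effect size onto the bands listed on the slide.

    Assumptions: uses the magnitude of d, puts 0.50 in "medium",
    and labels anything under 0.25 as below the listed bands.
    """
    size = abs(d)
    if size > 0.75:
        return "strong"
    if size >= 0.50:
        return "medium"
    if size >= 0.25:
        return "weak"
    return "below the listed bands"


for value in (0.30, 0.55, 0.88):
    print(value, interpret_effect_size(value))
```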

Quantitative studies
(population coverage ≈ 7 million persons)

Quantitative studies: Effect size

• “Bare bones” calculation:

d ≈ +0.55 …a medium effect

• Bare bones effect size adjusted for measurement error

d ≈ +0.71 …a stronger effect

• Using same-study-author aggregation

d ≈ +0.88 …a strong effect
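
The slides do not describe how these figures were computed. As a rough sketch only, the Python below assumes that "bare bones" means a sample-size-weighted mean of the study effect sizes, and that the measurement-error adjustment divides by the square root of the outcome measure's reliability (the usual correction for attenuation). The function names, study values, and reliability are all hypothetical.

```python
import math


def bare_bones_mean_d(effects):
    """Sample-size-weighted mean effect size.

    effects: list of (d, n) pairs, one per study or effect.
    Assumption: weights are raw sample sizes.
    """
    total_n = sum(n for _, n in effects)
    return sum(d * n for d, n in effects) / total_n


def correct_for_attenuation(d, reliability):
    """Adjust d for unreliability in the outcome measure.

    Assumption: classical correction, d / sqrt(reliability).
    """
    return d / math.sqrt(reliability)


# Hypothetical (d, n) pairs and reliability, for illustration only.
studies = [(0.40, 100), (0.60, 300), (0.70, 100)]
mean_d = bare_bones_mean_d(studies)
adjusted_d = correct_for_attenuation(mean_d, reliability=0.64)
print(mean_d, adjusted_d)
```

Because a reliability below 1 shrinks observed effects, the adjusted value is always at least as large in magnitude as the unadjusted one, which matches the direction of the change reported on the slide.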

Which predictors matter?

Treatment group…                                             Mean effect size
…is made aware of performance, and control group is not      +0.98
…receives targeted instruction (e.g., remediation)           +0.96
…is tested with higher stakes than control group             +0.87
…is tested more frequently than control group                +0.85

More Moderators – Source of Test

Source of test          Number of studies   Mean effect size
Researcher or Teacher   87                  0.93
National                24                  0.87
Commercial              38                  0.82
State or District       11                  0.72
Total                   160

More Moderators – Sponsor of Test

Sponsor of test   Number of studies   Mean effect size
International     5                   1.02
Local             99                  0.93
National          45                  0.81
State             11                  0.64
Total             160

More Moderators – Study Design

Study design                   Number of studies   Mean effect size
Pre-post                       12                  0.97
Experiment, Quasi-experiment   107                 0.94
Multivariate                   26                  0.80
Experiment, posttest only      7                   0.60
Pre-post (with shadow test)    8                   0.58
Total                          160

More Moderators – Scale of Analysis

Scale of analysis   Number of studies   Mean effect size
Aggregated          9                   1.60
Small-scale         118                 0.91
Large-scale         33                  0.57
Total               160

More Moderators – Scale of Administration

Scale of administration   Number of studies   Mean effect size
Classroom                 115                 0.95
Mid-scale                 6                   0.72
Large-scale               39                  0.71
Total                     160

Surveys and opinion polls

Percentage of survey items,
by respondent group and type of survey

[Figure: bar chart. Vertical axis: Percent (0 to 50). Horizontal axis: Public opinion polls; Program evaluation surveys*. Series: Education Providers; Education Consumers.]

Number and percent of survey items,
by test stakes and target group

Test stakes   Number   %     Target group   Number   %
High          507      62    Students       393      46
Medium        184      23    Schools        281      33
Low           33       4     Teachers       116      14
Unknown       89       11    No stakes      64       7
TOTAL         813            TOTAL          854

Opinion polls, by year

• 244 polls between 1958 and 2008, in the U.S. & Canada
• 813 unique question-response combinations
• close to 700,000 individual respondents

[Figure: chart by year. Vertical axis: 0 to 120. Horizontal axis: Year (1960 to 2005).]

Surveys and opinion polls:
Regular standardized tests, performance tests

Respondent opinion                       Regular tests (N ≈ 125)   Performance tests (N ≈ 50)
                                         d                         d
Achievement is increased                 1.2                       1.0
…weighted by size of study population    1.9                       0.5
Instruction is improved                  1.0                       1.4
Tests help align instruction             1.0                       1.0


Qualitative studies: Summary

(One cannot calculate an effect size.)

Qualitative studies, by methodology type

Methodology                                      Number of studies   %
Case study                                       120                 43
Experiment or pre-post study                     21                  7
Interviews (individual or group)                 75                  27
Journal                                          2                   1
Review of official records, documents, reports   33                  12
Research review                                  8                   3
Survey                                           22                  8
TOTAL                                            281                 100

Qualitative studies:
Effect on student achievement

244 studies conducted in the past century in over 30 countries

Direction of effect   Number of studies   Percent of studies   Percent without the inferred
Positive              204                 84                   93
Positive inferred     24                  10
Mixed                 5                   2                    2
No change             8                   3                    4
Negative              3                   1                    1
TOTAL                 244                 100                  100
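
The two percentage columns can be reproduced from the study counts in this table. The short Python check below uses those counts as given; the function name is hypothetical, and rounding to whole percentages is an assumption about how the slide's figures were produced.

```python
# Study counts taken from the table above.
counts = {
    "Positive": 204,
    "Positive inferred": 24,
    "Mixed": 5,
    "No change": 8,
    "Negative": 3,
}


def percent_shares(counts, exclude=()):
    """Whole-number percentage share of each category, optionally
    dropping some categories from both numerator and denominator."""
    kept = {name: n for name, n in counts.items() if name not in exclude}
    total = sum(kept.values())
    return {name: round(100 * n / total) for name, n in kept.items()}


all_studies = percent_shares(counts)
without_inferred = percent_shares(counts, exclude=("Positive inferred",))
print(all_studies)
print(without_inferred)
```

Dropping the 24 "positive inferred" studies shrinks the denominator from 244 to 220, which is why the positive share rises from 84 to 93 percent.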

Qualitative studies: Testing improves student achievement
and teacher instruction

Achievement is improved   Number of studies   %
Yes                       200                 95
Mixed results             1                   <1
No                        10                  5
TOTAL                     211                 100

Instruction is improved   Number of studies   %
Yes                       158                 96
No                        7                   4
TOTAL                     165                 100

Variation by rigor and test stakes

Level of rigor
Direction of effect   high   medium   low   Total
Positive              95     67       42    204
Positive inferred     10     8        6     24
Mixed                 3      1        1     5
No change             4      3        1     8
Negative              1      1        1     3
TOTAL                 113    80       51    244

Stakes
Direction of effect   high   medium   low   unknown   Total
Positive              133    27       38    6         204
Positive inferred     12     5        7               24
Mixed                 4               1               5
No change             2      1        5               8
Negative              3                               3
TOTAL                 154    33       51    6         244

Regular standardized tests and performance tests

Study results                            Regular tests (N = 176)   Performance tests (N = 69)
                                         %                         %
Generally positive                       93                        95
High-stakes tests                        71                        42
High level of study rigor                46                        48
Student attitudes toward test positive   60                        71
Teacher attitudes toward test positive   55                        80
Student achievement improved             95                        95
Instruction improved                     92                        100
Large-scale testing                      86                        68

An enormous research literature

• But, assertions that it does not
exist at all are common

– Some claims are made by
those who oppose standardized
testing, and may be wishful
thinking

– Others are “firstness” claims

Dismissive research reviews

• With a dismissive research
literature review, a
researcher assures all that
no other researcher has
studied the same topic

Firstness claims

• With a firstness
claim, a researcher
insists that he or
she is the first to
ever study a topic

Social costs are
enormous

• Research conducted by those
without power or celebrity is
dismissed -- ignored and lost
• Public policies are skewed, based
exclusively on the research results of
those with power or celebrity
• Society pays again and again for
research that has already been done
