Designing an Assessment System
Richard P. Phelps
International Research-to-Practice Conference
Nazarbayev Intellectual Schools AEO
Astana, Kazakhstan
October, 2016
© 2016, Richard P PHELPS International Research-to-Practice Conference, Astana, Kazakhstan, October, 2016 1
“If a thing exists, it exists in some amount. If it exists in some amount, then it is capable of being measured.”
-- René Descartes, Principles of Philosophy, 1644
Image of Protein Molecules Forming Memories
Albert Einstein College of Medicine, New York, January 2014
Learning Curve
Forgetting Curve (1870s)
Ebbinghaus: “Learning usually requires rehearsal or repetition”
Cognitive Load Theory
John Sweller, 1980s
Working Memory Capacity
George Miller, 1950s
Working Memory:
Ability to temporarily hold and manipulate information for cognitive tasks
Working Memory is challenged by:
new, unfamiliar information and
quantity of discrete bits of information
I am thinking of a type of object, what is it?
Description 1:
They are shapes, geometric plane figures, polygons, quadrilaterals, and parallelograms with opposite equal acute angles, opposite equal obtuse angles, and four equal sides
I am thinking of a type of object, what is it?
Description 2:
Two centuries of research on learning concludes…
“…repeated retrieval during learning is the key to
long-term retention.”
— Henry L. “Roddy” Roediger
Cognitive Scientists’ 6 Strategies for Effective Learning
Retrieval Practice
Spaced Practice
Dual Coding
Interleaving
Concrete Examples
Elaboration
Retrieval Practice
Implications for Teachers 1
Most teachers should test more frequently, …with smaller, shorter, low-stakes tests.
Understand that useful assessment can be short and simple.
Implications for Teachers 2
Does the test format matter?
• multiple-choice?
• essay?
• short answer?
• oral?
• demonstration?
• …etc.?
Not so much.
Implications for Teachers 3
Tests provide feedback to teachers about what works and what does not.
Just as students can learn by testing each other, teachers can help each other by reviewing each other's tests.
Cognitive Psychology experiments were conducted with “formative” tests in schools and classrooms
What about systemwide, large-scale tests?
First priority: do no harm to the formative testing programs in schools and classrooms
The effect of testing on student learning
• 12-year study, read >3,000 documents
• analyzed close to 700 separate studies, and more than 1,600 separate effects
• 2,000 other studies were reviewed and found incomplete or inappropriate
• hundreds of other studies remain to be reviewed
The effect of testing on student learning
245 Qualitative studies
813 Surveys or Polls
640 Quantitative Studies:
Experiments: School- and classroom-level
Multivariate studies: Large-scale testing programs
Meta-analysis
A method for summarizing a large research literature, with a single, comparable measure.
(0.5 effect size ≈ 1 grade level of learning)
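The rule of thumb above (0.5 effect size ≈ 1 grade level of learning) can be made concrete with a short sketch. The slides do not say which effect-size statistic the meta-analysis uses; the Python below assumes a standardized mean difference (Cohen's d with a pooled standard deviation). The scores, function names, and constant are hypothetical, invented only for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Assumption: the slide's rule of thumb, 0.5 effect size ≈ 1 grade level.
EFFECT_SIZE_PER_GRADE_LEVEL = 0.5


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_variance = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_variance)


def grade_levels(effect_size):
    """Convert an effect size to approximate grade levels of learning."""
    return effect_size / EFFECT_SIZE_PER_GRADE_LEVEL


# Hypothetical end-of-term scores for two groups (not data from the studies).
tested_often = [72, 75, 78, 81, 84]
tested_rarely = [70, 73, 76, 79, 82]

d = cohens_d(tested_often, tested_rarely)
print(f"effect size: {d:.2f}, approx. grade levels: {grade_levels(d):.2f}")
```

Because the effect size is expressed in standard-deviation units, results from studies that used different tests and score scales can be placed on one comparable measure, which is what lets a meta-analysis summarize them together.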
Findings from Phelps (2012):
• Survey study effect sizes average >1.0
• Over 90% of qualitative studies positive
• For quantitative studies, univariate effect sizes positive and stronger when:
– Testing more frequently
– Testing with feedback
– Testing with stakes
Findings from Phelps & Silva (2015):
For quantitative studies, effect sizes vary between 0.55 and 0.88:
+++ testing more frequently
++ testing with stakes
+ testing with feedback
Effect of scale on testing benefits
• size of study population
  – small +0.34 over large
• scale of test administration
  – small-scale +0.14 over large-scale
• responsible level of government
  – local tests +0.29 over state tests
Large-scale test, tight security
Large-scale test, lax security
Besides, systemwide tests are needed for
other purposes, such as…
…selection to programs with limited number of places
…monitoring and system diagnosis
…workforce planning
…accountability
…credentialing
That’s enough!
Some large-scale test advantages
On per-student basis, inexpensive
Cognitive laboratory pre-testing possible
Standardization offers comparisons across schools and regions.
May produce high-quality items that schools and teachers can use.
MOST IMPORTANT: provides reliable, comparative information to all those not involved in a particular school
The more systemwide decision points, the better?
Figure 1: Average TIMSS Score and Number of Quality Control Measures Used, by Country
[Figure. Horizontal axis: Number of Quality Control Measures Used (0 to 20). Vertical axis: Average Percent Correct, grades 7 & 8 (0 to 80). Series: Top-Performing Countries, Bottom-Performing Countries.]
SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001
Quality control has proportionally greater effect in poorer countries
Figure 2: Average TIMSS Score and Number of Quality Control Measures Used (each adjusted for GDP/capita), by Country
[Figure. Horizontal axis: Number of Quality Control Measures Used (per GDP/capita). Vertical axis: Average Percent Correct, grades 7 & 8 (per GDP/capita).]
SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001
IEA: TIMSS, PIRLS, CIVED, SITES, ICILS, PPP, ECES, TEDS
OECD PISA: PISA, PISA for schools, PISA for development
World Bank: READ, SABER
…provides funding for PISA
The effect of international testing programs
Freedom to design your testing
school tests
international tests
state and national tests
OECD and World Bank are run by economists
How well do economists understand PSYCH-ometrics?
Some interesting examples:
Chile’s national testing program, funded by the World Bank
OECD’s “Synergies for Better Learning” project
Some interesting oddities:
World Bank educational assessment chiefs are always Irish nationals affiliated with Boston College in the USA.
PISA is universally interpreted as an achievement test, even by the OECD. In reality, it has been an unvalidated aptitude test.
Designing an Assessment System
richard {at} nonpartisaneducation {dot} org
