
Designing an Assessment System


If standardized testing were invented just now, with no predispositions or expectations about its use, how would we use it? The most important theme to keep in mind is that standardized tests are not all the same. They vary in length, format, content, purpose, and in innumerable other ways. The same assessment may be highly appropriate in one circumstance and highly inappropriate in another. If one could design a system in which all the tests in an education system complemented one another to maximize their collective social benefit, what would that collection of tests look like? Which types of tests would be used where and when?

This presentation responds to these questions, recognizing that there is no single correct answer. An impressive body of research evidence will inform the talk; some of the most informative, from cognitive psychologists, is fairly recent. Topics will include cognitive load theory; the interplay between stakes and security, and stakes and motivation; retrieval, spacing, and other cognitive science concepts; the role of format (selected response, constructed response, authentic, etc.); and, more generally, the role of assessment in students’ intellectual development.

Published in: Education


  1. Designing an Assessment System. Richard P. Phelps. International Research-to-Practice Conference, Nazarbayev Intellectual Schools AEO, Astana, Kazakhstan, October 2016. © 2016, Richard P PHELPS, International Research-to-Practice Conference, Astana, Kazakhstan, October, 2016
  2. “If a thing exists, it exists in some amount. If it exists in some amount, then it is capable of being measured.” (René Descartes, Principles of Philosophy, 1664)
  3. Image of Protein Molecules Forming Memories. Albert Einstein College of Medicine, New York, January 2014
  4. Image of Protein Molecules Forming Memories. Albert Einstein College of Medicine, New York, January 2014
  5. Learning Curve
  6. Forgetting Curve (1870s)
  8. Ebbinghaus: “Learning usually requires rehearsal or repetition”
  9. Cognitive Load Theory: John Sweller, 1980s. Working Memory Capacity: George Miller, 1950s
  10. Working Memory: the ability to temporarily hold and manipulate information for cognitive tasks. Working memory is challenged by:
      • new, unfamiliar information
      • the quantity of discrete bits of information
  11. I am thinking of a type of object. What is it? Description 1: They are shapes, geometric plane figures, polygons, quadrilaterals, and parallelograms with opposite equal acute angles, opposite equal obtuse angles, and four equal sides.
  12. I am thinking of a type of object. What is it? Description 2:
  14. Two centuries of research on learning concludes… “…repeated retrieval during learning is the key to long-term retention.” (Henry L. “Roddy” Roediger)
  15. Cognitive Scientists’ 6 Strategies for Effective Learning:
      • Retrieval Practice
      • Spaced Practice
      • Dual Coding
      • Interleaving
      • Concrete Examples
      • Elaboration
  16. Retrieval Practice
  19. Implications for Teachers 1
      • Most teachers should test more frequently, …with smaller, shorter, low-stakes tests
      • Understand that useful assessment can be short and simple.
  20. Implications for Teachers 2: Does the test format matter?
      • multiple-choice?
      • essay?
      • short answer?
      • oral?
      • demonstration?
      • …etc.?
      Not so much.
  21. Implications for Teachers 3
      • Tests provide feedback to teachers about what works and what does not.
      • Just as students can learn by testing each other, teachers can help each other by reviewing each other’s tests.
  22. Cognitive Psychology experiments were conducted with “formative” tests in schools and classrooms
  23. What about systemwide, large-scale tests? First priority: do no harm to the formative testing programs in schools and classrooms
  24. The effect of testing on student learning
      • 12-year study, read >3,000 documents
      • analyzed close to 700 separate studies, and more than 1,600 separate effects
      • 2,000 other studies were reviewed and found incomplete or inappropriate
      • hundreds of other studies remain to be reviewed
  25. The effect of testing on student learning
      • 245 Qualitative studies
      • 813 Surveys or Polls
      • 640 Quantitative Studies:
        Experiments: School- and classroom-level
        Multivariate studies: Large-scale testing programs
  26. Meta-analysis: A method for summarizing a large research literature with a single, comparable measure. (0.5 effect size ≈ 1 grade level of learning)
  27. Findings from Phelps (2012):
      • Survey study effect sizes average >1.0
      • Over 90% of qualitative studies positive
      • For quantitative studies, univariate effect sizes positive and stronger when:
        – Testing more frequently
        – Testing with feedback
        – Testing with stakes
  28. Findings from Phelps & Silva (2015): For quantitative studies, effect sizes vary between 0.55 and 0.88:
      +++ testing more frequently
      ++ testing with stakes
      + testing with feedback
  29. Effect of scale on testing benefits
      • size of study population: small +0.34 over large
      • scale of test administration: small-scale +0.14 over large-scale
      • responsible level of government: local tests +0.29 over state tests
  30. Large-scale test, tight security
  31. Large-scale test, lax security
  32. Besides, systemwide tests are needed for other purposes, such as…
      …selection to programs with limited number of places
      …monitoring and system diagnosis
      …workforce planning
      …accountability
      …credentialing
      That’s enough!
  33. Some large-scale test advantages
      • On per-student basis, inexpensive
      • Cognitive laboratory pre-testing possible
      • Standardization offers comparisons across schools and regions.
      • May produce high-quality items that schools and teachers can use.
      • MOST IMPORTANT: provides reliable, comparative information to all those not involved in a particular school
  34. The more systemwide decision points, the better? Figure 1: Average TIMSS Score and Number of Quality Control Measures Used, by Country. [Axes: Number of Quality Control Measures Used; Average Percent Correct (grades 7 & 8). Series: Top-Performing Countries, Bottom-Performing Countries.] SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001
  35. Quality control has proportionally greater effect in poorer countries. Figure 2: Average TIMSS Score and Number of Quality Control Measures Used (each adjusted for GDP/capita), by Country. [Axes: Number of Quality Control Measures Used (per GDP/capita); Average Percent Correct (grades 7 & 8) (per GDP/capita).] SOURCE: Phelps, Benchmarking to the best in mathematics, Evaluation Review, 2001
  36. IEA: TIMSS, PIRLS, CIVED, SITES, ICILS, PPP, ECES, TEDS
      OECD PISA: PISA, PISA for schools, PISA for development
      World Bank: READ, SABER, …provides funding for PISA
  37. The effect of international testing programs. [Figure labels: Freedom to design your testing; school tests; international tests; state and national tests]
  38. OECD and World Bank are run by economists. How well do economists understand PSYCH-ometrics? Some interesting examples:
      • Chile’s national testing program, funded by the World Bank
      • OECD’s “Synergies for Better Learning” project
  39. Some interesting oddities:
      • World Bank educational assessment chiefs are always Irish nationals affiliated with Boston College in the USA.
      • PISA is universally interpreted as an achievement test, even by the OECD. In reality, it has been an unvalidated aptitude test.
  40. Designing an Assessment System. richard {at} nonpartisaneducation {dot} org
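
The meta-analysis slide (slide 26) describes a single, comparable measure and gives the rule of thumb that a 0.5 effect size is roughly one grade level of learning. The sketch below shows one common way such a measure can be computed, a standardized mean difference with a pooled standard deviation (Cohen's d), and how the slide's rule of thumb converts it to grade levels. It is only an illustration: the slides do not say which effect-size formula the underlying studies used, the function names are hypothetical, and the scores are made-up numbers, not data from the talk.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using a pooled sample standard deviation.

    Assumption: this is one common effect-size definition, not necessarily
    the one used in the studies summarized in the slides.
    """
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd


def grade_levels(effect_size, effect_per_grade=0.5):
    """Convert an effect size to grade levels using the slide's rule of thumb
    (0.5 effect size is about 1 grade level)."""
    return effect_size / effect_per_grade


# Hypothetical scores for a frequently tested group and a comparison group.
tested = [2, 4, 6]
comparison = [1, 3, 5]

d = cohens_d(tested, comparison)
print(f"effect size: {d:.2f}, about {grade_levels(d):.1f} grade level(s)")
```

Under the same rule of thumb, the range reported on slide 28 (0.55 to 0.88) would correspond to `grade_levels(0.55)` and `grade_levels(0.88)`, that is, a little more than one to a little less than two grade levels.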
