What makes a good adaptive testing program?
Advantage of an IRT-based CAT
On a 40-question exam with dichotomous scoring (wrong or right), the total number of questions you might need to develop is 2^40 ≈ 1.1 × 10^12. On a well-designed IRT-based CAT, the total number of questions you might need to develop is approximately 400.
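A quick check of that first figure, assuming it is meant as the number of distinct right/wrong response patterns across 40 items (each additional item doubles the count):

$$2^{40} = 1{,}099{,}511{,}627{,}776 \approx 1.1 \times 10^{12}$$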
Who is your market?
Students – computer-adaptive teaching and learning? Blended learning with MOOCs?
Schools – adaptive homework? Computer-based in-school exams? Formative assessment?
Exam boards – professional organizations? Corporations? Government organizations?
Background
What are you trying to measure?
How is it manifested?
What questions test this?
Is this question valid?
Is this question reliable?
Is this question fair?
General Characteristics of any IRT CAT
Item bank development
Pre-test items
Prioritizing
Publishing the CAT
Maintaining the CAT
Qualification specification
Reliability & validity
Differential Item Functioning (DIF) analysis
Content balancing
Item selection
Percentile ranking
Item calibration
Communicating results
Standard exam conditions
Exam security (exposure)
Validity testing
Termination criteria
Better than the existing solution?
Differential Item Functioning (DIF): Guessing (𝑥)
[Figure: Item Response Curves — likelihood of getting a question correct (0-100%) versus ability of the test taker (z-score, -4 to 4), with curves for the Standard Normal item, Item 1, and Item 2.]
$$P_{ij} = P(u = 1 \mid x, \alpha, \theta, \delta) = x + (1 - x)\,\frac{e^{1.702\,\alpha(\theta - \delta)}}{1 + e^{1.702\,\alpha(\theta - \delta)}}$$

           α    x    δ
Normal     1    0    0
Item 1     1    0    0.2
Item 2     1    0    0.07

where i represents the ith item and j represents the jth test taker.
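As a rough illustration (not from the slides), here is a minimal Python sketch of this three-parameter logistic model; the function name p_correct and its argument names are invented, 1.702 is the usual normal-ogive scaling constant, and the guessing parameter x sets the lower asymptote of the curve:

```python
import math

D = 1.702  # scaling constant so the logistic closely approximates the normal ogive

def p_correct(theta, alpha=1.0, delta=0.0, x=0.0):
    """Probability of a correct response under the slide's 3PL-style model.

    theta: test-taker ability (z-score); alpha: discrimination;
    delta: difficulty; x: guessing parameter (lower asymptote).
    """
    logistic = 1.0 / (1.0 + math.exp(-D * alpha * (theta - delta)))
    return x + (1.0 - x) * logistic

# Guessing raises the floor of the curve: even a very low-ability test taker
# answers correctly with probability at least x.
print(round(p_correct(-4.0, x=0.0), 3))  # ~0.001
print(round(p_correct(-4.0, x=0.2), 3))  # ~0.201
```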
Differential Item Functioning (DIF): Difficulty (𝛿)
$$P_{ij} = P(u = 1 \mid x, \alpha, \theta, \delta) = x + (1 - x)\,\frac{e^{1.702\,\alpha(\theta - \delta)}}{1 + e^{1.702\,\alpha(\theta - \delta)}}$$

           α    x    δ
Item 1     1    0    0
Item 2     1    0   -1
Item 3     1    0    1

where i represents the ith item and j represents the jth test taker.
[Figure: Item Response Curves — likelihood of getting a question correct (percent) versus ability of the test taker (z-score, -4 to 4), with curves for Item 1, Item 2, and Item 3.]
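Continuing the hypothetical p_correct sketch from the previous slide, the difficulty parameter shifts the whole curve along the ability axis; at θ = 0 the easier item (δ = -1) is the most likely to be answered correctly:

```python
# Probability of a correct response at theta = 0 for the three items above.
for name, delta in [("Item 1", 0.0), ("Item 2", -1.0), ("Item 3", 1.0)]:
    print(name, round(p_correct(0.0, alpha=1.0, delta=delta, x=0.0), 3))
# Item 1 0.5    theta equals delta, so the curve is at its midpoint
# Item 2 0.846  easier item (delta = -1)
# Item 3 0.154  harder item (delta = +1)
```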
Differential Item Functioning (DIF): Discrimination (𝛼)
$$P_{ij} = P(u = 1 \mid x, \alpha, \theta, \delta) = x + (1 - x)\,\frac{e^{1.702\,\alpha(\theta - \delta)}}{1 + e^{1.702\,\alpha(\theta - \delta)}}$$

           α     x    δ
Item 1     0.4   0    0
Item 2     0.8   0   -1
Item 3     1.2   0    1

where i represents the ith item and j represents the jth test taker.
[Figure: Item Response Curves — likelihood of getting a question correct (percent) versus ability of the test taker (z-score, -4 to 4), with curves for Item 1, Item 2, and Item 3.]
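Discrimination is what makes an item informative at a given ability, and that is what a CAT typically exploits when choosing the next question. Below is a sketch, reusing the hypothetical p_correct and D defined earlier and the standard 3PL Fisher information formula, of simple maximum-information item selection; this is one common rule only, and operational CATs layer on content balancing and exposure control as listed on the "General Characteristics" slide:

```python
def item_information(theta, alpha, delta, x=0.0):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, alpha, delta, x)
    q = 1.0 - p
    return (D * alpha) ** 2 * (q / p) * ((p - x) / (1.0 - x)) ** 2

# Maximum-information selection: administer the item that is most informative
# at the current ability estimate (a hypothetical theta_hat of 0.5 here).
bank = [
    {"name": "Item 1", "alpha": 0.4, "delta": 0.0},
    {"name": "Item 2", "alpha": 0.8, "delta": -1.0},
    {"name": "Item 3", "alpha": 1.2, "delta": 1.0},
]
theta_hat = 0.5
best = max(bank, key=lambda item: item_information(theta_hat, item["alpha"], item["delta"]))
print(best["name"])  # Item 3: high discrimination near the current ability estimate
```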
