Trend measurement

  Eveline Gebhardt
Overview

• Test equating

• Booklet design

• Selection of trend items
Three common methods
TEST EQUATING
Test equating
• Putting two tests on one scale so that student
  abilities and item difficulties can be compared
  between tests
   – For example, to compare mean performance at
     time 1 with mean performance at time 2 (trends)
• A group of common items (or common
  students) links the tests: some of the items
  used in test 1 are also used in test 2
Some equating methods
• The average item difficulty of the set of
  common items needs to be equal in both tests
• Three common methods:
  – Shift method (trends)
  – Joint scaling (booklets)
  – Anchoring item difficulties
Shift method
                  Items A     Items B    Items C
    Test 1          X           X
    Test 2                      X           X

•   (Items B are the common items)
•   Scale test 1 and test 2 separately
•   Compute average difficulty of items B in test 1 and test 2
•   Compute difference between averages (test 1 – test 2)
•   Shift the student abilities of test 2 by the difference
•   Method often used for equating tests over time (trends)
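
The steps above can be sketched numerically. All difficulty and ability values below are made-up illustrative numbers, not estimates from any real scaling:

```python
import numpy as np

# Illustrative difficulty estimates for the common items (items B),
# obtained from two separate scalings of test 1 and test 2.
diff_b_test1 = np.array([-0.50, 0.10, 0.80, 1.20])
diff_b_test2 = np.array([-0.70, -0.15, 0.55, 0.90])

# Difference between the average difficulties (test 1 - test 2).
shift = diff_b_test1.mean() - diff_b_test2.mean()   # 0.40 - 0.15 = 0.25

# Shift the test-2 student abilities onto the test-1 scale.
abilities_test2 = np.array([0.3, -0.1, 1.0])
abilities_test2_equated = abilities_test2 + shift    # [0.55, 0.15, 1.25]
```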
Joint scaling
• Data of test 1 and 2 are joined in one data set
• Test 1 and 2 are scaled together
• Difficulties of items B are estimated only once
• Difficulties of items B are identical for test 1
  and 2
• Tests are on the same scale
• Often used for equating booklets
• (Appropriate when the population variances
  are assumed to be equal)
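
A minimal sketch of how the joint data set is built, assuming scored 0/1 responses; the item names and responses are invented. Items not administered to a group are left missing:

```python
import numpy as np

# Hypothetical scored responses (1 = correct, 0 = wrong).
# Test 1 covers items A1, A2 (unique) and B1, B2 (common);
# test 2 covers B1, B2 and C1, C2 (unique).
items = ["A1", "A2", "B1", "B2", "C1", "C2"]

test1 = np.array([[1, 0, 1, 1],
                  [0, 1, 0, 1]])      # columns: A1 A2 B1 B2
test2 = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 1]])      # columns: B1 B2 C1 C2

n1, n2 = test1.shape[0], test2.shape[0]
joint = np.full((n1 + n2, len(items)), np.nan)
joint[:n1, 0:4] = test1               # test-1 students: A and B items
joint[n1:, 2:6] = test2               # test-2 students: B and C items
# The B columns now hold responses from both groups, so a single
# calibration estimates one difficulty per B item, putting both
# tests on the same scale.
```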
Anchoring
• Scale test 1 (items A and B)
• Take the estimated difficulties of items B
• Scale test 2 (items B and C) with items B
  anchored to the same values as in test 1
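
As an illustration of the anchoring idea, the sketch below estimates one test-2 student's ability under the Rasch model with the item B difficulties fixed at their test-1 values. All numbers are made up, and in practice the difficulties of items C would be estimated at the same time rather than assumed known:

```python
import numpy as np

# Anchored difficulties for items B (carried over from the test-1
# scaling) followed by difficulties for new items C -- illustrative.
b = np.array([-0.5, 0.1, 0.8, 1.2, 0.0, 0.6])

# One test-2 student's scored responses to those six items.
x = np.array([1, 1, 0, 0, 1, 1])

# Maximum-likelihood ability under the Rasch model, via Newton-Raphson.
theta = 0.0
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))   # P(correct) per item
    grad = np.sum(x - p)                      # d log-likelihood / d theta
    hess = -np.sum(p * (1 - p))               # second derivative
    theta -= grad / hess
# Because items B kept their test-1 difficulties, theta is expressed
# on the test-1 scale.
```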
An effect of item positioning on trend estimation

BOOKLET DESIGN
Booklet design
• A unit consists of one stimulus and
  multiple items
• Several units assigned to clusters
• Clusters rotated across booklets
• Test consists of multiple booklets
Fully rotated booklet design
            Position 1   Position 2   Position 3
Booklet 1       A            B            C
Booklet 2       B            D            E
Booklet 3       D            C            F
Booklet 4       C            E            G
Booklet 5       E            F            H
Booklet 6       F            G            I
Booklet 7       G            H            A
Booklet 8       H            I            B
Booklet 9       I            A            D
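
The design in the table can be reproduced from a single cluster sequence by cyclic shifts: booklet i takes the clusters at offsets 0, 1 and 3 of the sequence. This is a reconstruction of the table's pattern, not a prescribed algorithm:

```python
# Cluster sequence read down position 1 of the table above.
seq = ["A", "B", "D", "C", "E", "F", "G", "H", "I"]
offsets = (0, 1, 3)

# Booklet i takes the clusters at positions i, i+1 and i+3 (mod 9).
booklets = [[seq[(i + k) % len(seq)] for k in offsets]
            for i in range(len(seq))]

for n, row in enumerate(booklets, start=1):
    print(f"Booklet {n}: {' '.join(row)}")
# Booklet 1: A B C ... Booklet 9: I A D, matching the table.
```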
Experiment 1
            Cluster 1   Cluster 2   Cluster 3
Booklet 1       A           B           C
Booklet 2       B           D           E
Booklet 3       D           C           F
Booklet 4       C           E           G
Booklet 5       E           F           H
Booklet 6       F           G           I
Booklet 7       G           H           A
Booklet 8       H           I           B
Booklet 9       I           A           D
Positioning effect
• Full model: mean abilities of the groups
  taking booklets 2 and 5 are equal
• Common items in cluster E
  – at the end of booklet 2
  – at the start of booklet 5
Imagine
•   booklet 2 is a full test at time 1 and
•   booklet 5 is a full test at time 2
•   the items in cluster E are the trend items
•   Equate two tests using common items
    from cluster E (joint scaling method)

    (remember that the average ability of the two groups
    is equal when scaling all booklets)
Results experiment 1
                   Time 1      Time 2
  Mean              0.46        0.70


• The change is not caused by an increase in
  ability over time, but by a change in booklet
  design: the trend items moved forward in the
  booklet and became easier
• Examples: PISA reading 2003 and science
  2009
Effects of item characteristics on trend estimation

SELECTION OF TREND ITEMS
Trend items &
  Differential Item Functioning
• Assumption of the Rasch model:
 all students with the same ability have the
 same probability of responding correctly to an
 item, regardless of the subgroup a student
 belongs to
• The violation of this assumption is
  called Differential Item Functioning
  (DIF)
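
A crude illustration of screening one item for sex DIF, using invented responses. Operational DIF analyses condition on ability (e.g. Mantel-Haenszel) rather than comparing raw proportions:

```python
import numpy as np

# Scored responses to one item (1 = correct), with a group label per
# student -- all numbers illustrative.
correct = np.array([1, 1, 1, 0, 1, 1, 0, 0, 1, 0])
group = np.array(["boy", "boy", "boy", "boy", "boy",
                  "girl", "girl", "girl", "girl", "girl"])

# Compare proportion correct by subgroup.
p_boys = correct[group == "boy"].mean()    # 0.8
p_girls = correct[group == "girl"].mean()  # 0.4
print(p_boys - p_girls)  # positive -> item favours boys in this sample
```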
Example: sex DIF
Experiment 2
• Item pool of 105 items for assessment
  at time 1
• Selection of 55 trend items all favouring
  boys
• Scale two sets of items on the same set
  of student responses
Results experiment 2
[Bar chart: abilities by subgroup (Boys, Girls). Scaled with all
items: Boys 0.44, Girls 0.60. Scaled with the boy-favouring trend
items: Boys 0.50, Girls 0.44.]
Conclusion experiment 2
• Selecting trend items that on average
  favour a subgroup of students changes
  the gap in performance between
  subgroups
• Example: PISA reading 2003
Trend items &
             Item discrimination

• Good items discriminate between stronger and
  weaker students
• Some items discriminate more than others
• Average ability of the students choosing each
  answer option:
               Item A     Item B
  Answer A         1.00       0.42
  Answer B        -0.22       0.41
  Answer C        -0.15       0.81
  Answer D        -0.02       0.33
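
A table like the one above can be computed directly; the abilities and answer choices below are invented:

```python
import numpy as np

# Hypothetical data: each student's estimated ability and the answer
# option they chose on one multiple-choice item.
ability = np.array([1.2, 0.8, -0.3, -0.1, 0.2, -0.4, 1.0, -0.2])
choice = np.array(["A", "A", "B", "C", "B", "D", "A", "C"])

# Average ability per answer option: for a discriminating item, the
# mean for the key (here assumed to be option A) stands well above
# the means for the distractors, as for Item A in the table above.
for opt in ["A", "B", "C", "D"]:
    print(opt, ability[choice == opt].mean().round(2))
```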
Slopes
• Level of discrimination is reflected by
  the slope of the ICC
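
The slope can be illustrated with the two-parameter logistic ICC, P(correct) = 1 / (1 + exp(−a(θ − b))), where a is the discrimination (slope) and b the difficulty. The parameter values below are arbitrary:

```python
import numpy as np

def icc(theta, a, b):
    """2PL item characteristic curve: P(correct | ability theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.array([-2.0, 0.0, 2.0])
# Same difficulty (b = 0), different discriminations:
flat = icc(theta, a=0.5, b=0.0)    # shallow slope
steep = icc(theta, a=2.0, b=0.0)   # steep slope
# The steep item separates low and high abilities more sharply:
print(flat.round(3))    # [0.269 0.5   0.731]
print(steep.round(3))   # [0.018 0.5   0.982]
```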
Assumption
• Assumption of the Rasch model:
 slopes are equal across items
• However, in practice slopes always vary
  a little within a test
• The expected slope is the average
  slope of all items in a test
• The population variance is a reflection of
  the average slope
Experiment 3
• Item pool of 105 items to assess
  students at time 1
• A set of 53 more discriminating items
  selected as trend items
• Scale each set of items on the same
  student responses
Results experiment 3
[Figure: estimated population ability distributions (scale −6 to 6)
when scaling with all items versus with only the high-discrimination
items; the high-discrimination scaling produces the wider
distribution.]
Conclusion experiment 3
• Selecting more discriminating items as
  trend items increases the average slope
  and therefore the variance of
  performance in the student population
• Happens in practice because items with
  high discrimination are often regarded
  as better items and are therefore kept
  for future testing
Trend items &
            Sub-domains
• The equating shift should be based on a set
  of items that is representative of the
  whole test
• Equating shifts can be slightly different
  for different sub-domains
• It is best practice to have equal proportions
  of sub-domains in the trend items and in the
  total item pool
Trend items &
              Item types
• Equating shifts can be slightly different
  for multiple-choice items than for open-
  ended items
• It is best practice to have equal proportions
  of item types in the trend items and in the
  total item pool
Test developers

RECOMMENDATIONS
Recommendations
• After the field trial, drop items with high DIF on the
  student background characteristics of most interest
• After the field trial, drop items with low discrimination
• Keep as many trend items as possible
• Check whether the proportions of important item
  characteristics (including DIF, discrimination,
  sub-domain, item type, item difficulty) are roughly
  equal between the trend items and the total item pool
  of both the old and the new test
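
The last recommendation can be sketched as a simple check. The item identifiers, sub-domain labels and the 10-percentage-point threshold are all illustrative:

```python
from collections import Counter

# Illustrative item pool (item id -> sub-domain) and trend-item set.
pool = {"i01": "reading", "i02": "reading", "i03": "maths",
        "i04": "maths", "i05": "science", "i06": "science"}
trend = ["i01", "i03", "i05"]

def proportions(labels):
    """Share of each category among the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

pool_props = proportions(pool.values())
trend_props = proportions(pool[i] for i in trend)

# Flag sub-domains whose share differs by more than 10 percentage
# points between the trend items and the total pool.
for dom in pool_props:
    gap = abs(pool_props[dom] - trend_props.get(dom, 0.0))
    if gap > 0.10:
        print(f"{dom}: share differs by {gap:.0%}")
```

The same check applies to any item characteristic (DIF, discrimination, item type, difficulty band) by swapping the labels.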
