4. Test equating
• Putting two tests on one scale so that student
abilities and item difficulties can be compared
between tests
– For example, to compare mean performance at
time 1 with mean performance at time 2 (trends)
• A group of common items (or common
students) links the tests: some of the items
used in test 1 are also used in test 2
5. Some equating methods
• Average item difficulty of set of common
items needs to be equal in both tests
• Three common methods:
– Shift method (trends)
– Joint scaling (booklets)
– Anchoring item difficulties
6. Shift method
          Items A   Items B   Items C
Test 1       X         X
Test 2                 X         X
• (Items B are the common items)
• Scale test 1 and test 2 separately
• Compute average difficulty of items B in test 1 and test 2
• Compute difference between averages (test 1 – test 2)
• Shift the student abilities of test 2 by the difference
• Method often used for equating tests over time (trends)
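The steps above can be sketched as follows. The difficulties and abilities are made-up values; in practice they come from two separate Rasch calibrations:

```python
# Shift method sketch: item difficulties from two separately scaled
# tests; the common items B determine the shift (values hypothetical).

def shift_constant(diff_test1, diff_test2, common_items):
    """Shift to add to test-2 abilities: mean difficulty of the
    common items in test 1 minus their mean difficulty in test 2."""
    m1 = sum(diff_test1[i] for i in common_items) / len(common_items)
    m2 = sum(diff_test2[i] for i in common_items) / len(common_items)
    return m1 - m2

# Hypothetical difficulties of the common items B1..B3 in each test:
diff_test1 = {"B1": 0.30, "B2": -0.10, "B3": 0.40}
diff_test2 = {"B1": 0.10, "B2": -0.30, "B3": 0.20}

shift = shift_constant(diff_test1, diff_test2, ["B1", "B2", "B3"])
abilities_test2 = [-0.5, 0.0, 0.8]
equated = [theta + shift for theta in abilities_test2]
print(shift, equated)  # shift of 0.2 puts test 2 on the test-1 scale
```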
7. Joint scaling
• Data of test 1 and 2 are joined in one data set
• Test 1 and 2 are scaled together
• Difficulties of items B are estimated only once
• Difficulties of items B are identical for test 1
and 2
• Tests are on the same scale
• Often used for equating booklets
• (And when population variances are assumed
to be equal)
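Joining the two data sets can be pictured as one response matrix in which items a student never saw are missing by design. A minimal sketch with invented responses:

```python
# Joint scaling sketch: items B are shared, items A appear only in
# test 1 and items C only in test 2. Missing-by-design responses are
# None. All data are made up for illustration.

items = ["A1", "A2", "B1", "B2", "C1", "C2"]

test1 = [  # students who took items A and B
    {"A1": 1, "A2": 0, "B1": 1, "B2": 1},
    {"A1": 0, "A2": 0, "B1": 0, "B2": 1},
]
test2 = [  # students who took items B and C
    {"B1": 1, "B2": 0, "C1": 1, "C2": 0},
]

joint = [[resp.get(item) for item in items] for resp in test1 + test2]
# Calibrating `joint` with a Rasch model estimates each B item once,
# so both tests end up on the same scale.
print(joint)
```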
8. Anchoring
• Scale test 1 (items A and B)
• Select difficulties of items B
• Scale test 2 (items B and C) with items B
anchored to the same values as in test 1
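In code, anchoring amounts to passing the test-1 difficulties of items B as fixed values into the test-2 calibration. A sketch, with hypothetical values and a hypothetical `estimate` routine:

```python
# Anchoring sketch: items B keep their test-1 difficulties; only the
# new items C are estimated when test 2 is scaled. Values hypothetical.

test1_difficulties = {"B1": 0.20, "B2": -0.15}  # from scaling A + B

def build_anchor(difficulties, anchor_items):
    """Anchor file: item -> fixed difficulty for the next calibration."""
    return {i: difficulties[i] for i in anchor_items}

anchors = build_anchor(test1_difficulties, ["B1", "B2"])
# A test-2 calibration would then hold these fixed, e.g.:
#   estimate(test2_data, fixed=anchors)   # `estimate` is hypothetical
print(anchors)
```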
9. An effect of item positioning on trend estimation
BOOKLET DESIGN
10. Booklet design
• A unit consists of one stimulus and
multiple items
• Several units assigned to clusters
• Clusters rotated across booklets
• Test consists of multiple booklets
11. Fully rotated booklet design
Position 1 Position 2 Position 3
Booklet 1 A B C
Booklet 2 B D E
Booklet 3 D C F
Booklet 4 C E G
Booklet 5 E F H
Booklet 6 F G I
Booklet 7 G H A
Booklet 8 H I B
Booklet 9 I A D
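The design in the table above can be checked programmatically: in a fully rotated design, each cluster appears exactly once in every position. A small verification over the nine booklets:

```python
from collections import Counter

# Booklets from the table above, positions 1-3 left to right.
booklets = ["ABC", "BDE", "DCF", "CEG", "EFH", "FGI", "GHA", "HIB", "IAD"]

counts = Counter()  # (cluster, position) occurrences
for booklet in booklets:
    for position, cluster in enumerate(booklet, start=1):
        counts[(cluster, position)] += 1

balanced = all(counts[(c, p)] == 1
               for c in "ABCDEFGHI" for p in (1, 2, 3))
print(balanced)  # True: every cluster occupies every position once
```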
12. Experiment 1
Cluster 1 Cluster 2 Cluster 3
Booklet 1 A B C
Booklet 2 B D E
Booklet 3 D C F
Booklet 4 C E G
Booklet 5 E F H
Booklet 6 F G I
Booklet 7 G H A
Booklet 8 H I B
Booklet 9 I A D
13. Positioning effect
• Full model: when all booklets are scaled
jointly, the mean abilities of the booklet 2
and booklet 5 groups are equal
• Common items in cluster E
– at the end of booklet 2
– at the start of booklet 5
14. Imagine
• booklet 2 is a full test at time 1 and
• booklet 5 is a full test at time 2
• cluster E are trend items
• Equate two tests using common items
from cluster E (joint scaling method)
(remember that the average ability of the two groups
is equal when scaling all booklets)
15. Results experiment 1
         Time 1   Time 2
Mean      0.46     0.70
• The change is not caused by an increase in
ability over time,
• but by a change in booklet design: the trend
items moved forward in the booklet and
became easier
• Examples: PISA reading 2003 and PISA
science 2009
16. Effects of item characteristics on trend estimation
SELECTION OF TREND ITEMS
17. Trend items &
Differential Item Functioning
• Assumption of the Rasch model:
all students with the same ability have the
same probability of answering an item
correctly, regardless of the subgroup a
student belongs to
• The violation of this assumption is
called Differential Item Functioning
(DIF)
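A simple way to screen for DIF, short of a full Rasch analysis, is to compare the proportion correct of two subgroups within the same ability stratum. A minimal sketch with invented data (the stratification and records are hypothetical):

```python
# Minimal DIF screen: within each ability stratum, compare proportion
# correct of boys vs. girls; average the gaps over strata.

def dif_gap(records, item):
    """Average over ability strata of (p_correct boys - p_correct girls)."""
    strata = {}
    for r in records:
        strata.setdefault(r["stratum"], []).append(r)
    gaps = []
    for members in strata.values():
        boys = [r[item] for r in members if r["group"] == "boy"]
        girls = [r[item] for r in members if r["group"] == "girl"]
        if boys and girls:
            gaps.append(sum(boys) / len(boys) - sum(girls) / len(girls))
    return sum(gaps) / len(gaps)

records = [  # invented scored responses (1 = correct)
    {"group": "boy",  "stratum": "low",  "item1": 1},
    {"group": "boy",  "stratum": "low",  "item1": 1},
    {"group": "girl", "stratum": "low",  "item1": 0},
    {"group": "girl", "stratum": "low",  "item1": 1},
    {"group": "boy",  "stratum": "high", "item1": 1},
    {"group": "girl", "stratum": "high", "item1": 1},
]
print(dif_gap(records, "item1"))  # positive gap: item favours boys
```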
19. Experiment 2
• Item pool of 105 items for assessment
at time 1
• Selection of 55 trend items all favouring
boys
• Scale both sets of items on the same set
of student responses
20. Results experiment 2
Abilities by subgroup:
              Boys    Girls
All items     0.44    0.44
Boys items    0.60    0.50
21. Conclusion experiment 2
• Selecting trend items that on average
favour a subgroup of students changes
the gap in performance between
subgroups
• Example PISA reading 2003
22. Trend items &
Item discrimination
• Good items discriminate between high-
and low-ability students
• Some items discriminate more than others
• Average abilities of students per answer option:
            Item A   Item B
  Answer A   1.00     0.42
  Answer B  -0.22     0.41
  Answer C  -0.15     0.81
  Answer D  -0.02     0.33
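A table like the one above can be produced by averaging the abilities of the students who chose each answer option. A minimal sketch with invented abilities and responses:

```python
# Per-option ability averages for one item (all data hypothetical).

def mean_ability_by_option(responses, abilities):
    """responses[s] = option chosen by student s; abilities[s] = theta."""
    totals, counts = {}, {}
    for option, theta in zip(responses, abilities):
        totals[option] = totals.get(option, 0.0) + theta
        counts[option] = counts.get(option, 0) + 1
    return {o: totals[o] / counts[o] for o in totals}

abilities = [1.2, 0.8, -0.3, -0.1, 0.0]
responses = ["A", "A", "B", "C", "D"]  # option picked on one item

means = mean_ability_by_option(responses, abilities)
print(means)  # a clearly higher mean for one option signals good
              # discrimination (the strong students found the key)
```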
23. Slopes
• Level of discrimination is reflected by
the slope of the ICC
24. Assumption
• Assumption of the Rasch model:
slopes are equal across items
• However, in practice slopes always vary
a little within a test
• The expected slope is the average
slope of all items in a test
• The estimated population variance is a
reflection of the average slope
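The slope can be made concrete with the two-parameter logistic (2PL) item characteristic curve, p = 1 / (1 + exp(-a(θ - b))); the Rasch model fixes the discrimination a = 1 for every item. At the item's difficulty the ICC slope equals a/4, which the sketch below checks numerically:

```python
import math

def icc(theta, b, a=1.0):
    """2PL probability of a correct response; a=1 gives the Rasch ICC."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def slope_at_difficulty(b, a=1.0, eps=1e-6):
    """Numerical slope of the ICC at theta = b (should be a / 4)."""
    return (icc(b + eps, b, a) - icc(b - eps, b, a)) / (2 * eps)

print(round(slope_at_difficulty(0.0, a=1.0), 3))  # Rasch item
print(round(slope_at_difficulty(0.0, a=2.0), 3))  # steeper 2PL item
```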
25. Experiment 3
• Item pool of 105 items to assess
students at time 1
• A set of 53 more discriminating items
selected as trend items
• Scale each set of items on the same
student responses
26. Results experiment 3
[Figure: estimated population ability distributions (ability scale -6 to 6) for all items vs. the high-discrimination trend items]
27. Conclusion experiment 3
• Selecting more discriminating items as
trend items increases the average slope
and therefore the variance of
performance in the student population
• Happens in practice because items with
high discrimination are often regarded
as better items and are therefore kept
for future testing
28. Trend items &
Sub-domains
• Equating shift should be based on a set
of items that is representative of the
whole test
• Equating shifts can be slightly different
for different sub-domains
• Best practice is to have equal proportions
of sub-domains in the trend items and in the
total item pool
29. Trend items &
Item types
• Equating shifts can differ slightly between
multiple-choice items and open-ended
items
• Best practice is to have equal proportions
of item types in the trend items and in the
total item pool
31. Recommendations
• After the field trial, drop items with high DIF on the
student background characteristics of most interest
• After the field trial, drop items with low discrimination
• Keep as many trend items as possible
• Check if proportions of important item characteristics
(including DIF, discrimination, sub-domain, item type,
item difficulty) are roughly equal between trend items
and the total item pool of both the old and the new
test
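The last recommendation can be checked mechanically: compute the proportions of a characteristic (here sub-domain) in the trend-item set and in the full pool and compare them. Item labels and domains are made up:

```python
from collections import Counter

def proportions(items, key):
    """Proportion of items per value of a characteristic (e.g. domain)."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

pool = [  # hypothetical item pool
    {"id": 1, "domain": "algebra"},  {"id": 2, "domain": "algebra"},
    {"id": 3, "domain": "geometry"}, {"id": 4, "domain": "geometry"},
]
trend = [pool[0], pool[2]]  # candidate trend items

print(proportions(pool, "domain"))   # pool composition
print(proportions(trend, "domain"))  # should match the pool roughly
```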