4. Test equating
• Putting two tests on one scale so that student
abilities and item difficulties can be compared
between tests
– For example, to compare mean performance at
time 1 with mean performance at time 2 (trends)
• A group of common items (or common
students) links the tests: some of the items
used in test 1 are also used in test 2
5. Some equating methods
• Average item difficulty of set of common
items needs to be equal in both tests
• Three common methods:
– Shift method (trends)
– Joint scaling (booklets)
– Anchoring item difficulties
6. Shift method
          Items A   Items B   Items C
Test 1       X         X
Test 2                 X         X
• (Items B are the common items)
• Scale test 1 and test 2 separately
• Compute average difficulty of items B in test 1 and test 2
• Compute difference between averages (test 1 – test 2)
• Shift the student abilities of test 2 by the difference
• Method often used for equating tests over time (trends)
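The steps above can be sketched as follows. The difficulties and abilities are made-up values; in practice they come from two separate Rasch calibrations:

```python
# Shift method sketch: item difficulties from two separately scaled
# tests; the common items B determine the shift (values hypothetical).

def shift_constant(diff_test1, diff_test2, common_items):
    """Shift to add to test-2 abilities: mean difficulty of the
    common items in test 1 minus their mean difficulty in test 2."""
    m1 = sum(diff_test1[i] for i in common_items) / len(common_items)
    m2 = sum(diff_test2[i] for i in common_items) / len(common_items)
    return m1 - m2

# Hypothetical difficulties of the common items B1..B3 in each test:
diff_test1 = {"B1": 0.30, "B2": -0.10, "B3": 0.40}
diff_test2 = {"B1": 0.10, "B2": -0.30, "B3": 0.20}

shift = shift_constant(diff_test1, diff_test2, ["B1", "B2", "B3"])
abilities_test2 = [-0.5, 0.0, 0.8]
equated = [theta + shift for theta in abilities_test2]
print(shift, equated)  # shift of 0.2 puts test 2 on the test-1 scale
```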
7. Joint scaling
• Data of test 1 and 2 are joined in one data set
• Test 1 and 2 are scaled together
• Difficulties of items B are estimated only once
• Difficulties of items B are identical for test 1
and 2
• Tests are on the same scale
• Often used for equating booklets
• (And when population variances are assumed
to be equal)
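Joining the two data sets can be pictured as one response matrix in which items a student never saw are missing by design. A minimal sketch with invented responses:

```python
# Joint scaling sketch: items B are shared, items A appear only in
# test 1 and items C only in test 2. Missing-by-design responses are
# None. All data are made up for illustration.

items = ["A1", "A2", "B1", "B2", "C1", "C2"]

test1 = [  # students who took items A and B
    {"A1": 1, "A2": 0, "B1": 1, "B2": 1},
    {"A1": 0, "A2": 0, "B1": 0, "B2": 1},
]
test2 = [  # students who took items B and C
    {"B1": 1, "B2": 0, "C1": 1, "C2": 0},
]

joint = [[resp.get(item) for item in items] for resp in test1 + test2]
# Calibrating `joint` with a Rasch model estimates each B item once,
# so both tests end up on the same scale.
print(joint)
```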
8. Anchoring
• Scale test 1 (items A and B)
• Select difficulties of items B
• Scale test 2 (items B and C) with items B
anchored to the same values as in test 1
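In code, anchoring amounts to passing the test-1 difficulties of items B as fixed values into the test-2 calibration. A sketch, with hypothetical values and a hypothetical `estimate` routine:

```python
# Anchoring sketch: items B keep their test-1 difficulties; only the
# new items C are estimated when test 2 is scaled. Values hypothetical.

test1_difficulties = {"B1": 0.20, "B2": -0.15}  # from scaling A + B

def build_anchor(difficulties, anchor_items):
    """Anchor file: item -> fixed difficulty for the next calibration."""
    return {i: difficulties[i] for i in anchor_items}

anchors = build_anchor(test1_difficulties, ["B1", "B2"])
# A test-2 calibration would then hold these fixed, e.g.:
#   estimate(test2_data, fixed=anchors)   # `estimate` is hypothetical
print(anchors)
```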
9. An effect of item positioning on trend estimation
BOOKLET DESIGN
10. Booklet design
• A unit consists of one stimulus and
multiple items
• Several units assigned to clusters
• Clusters rotated across booklets
• Test consists of multiple booklets
11. Fully rotated booklet design
Position 1 Position 2 Position 3
Booklet 1 A B C
Booklet 2 B D E
Booklet 3 D C F
Booklet 4 C E G
Booklet 5 E F H
Booklet 6 F G I
Booklet 7 G H A
Booklet 8 H I B
Booklet 9 I A D
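The design in the table above can be checked programmatically: in a fully rotated design, each cluster appears exactly once in every position. A small verification over the nine booklets:

```python
from collections import Counter

# Booklets from the table above, positions 1-3 left to right.
booklets = ["ABC", "BDE", "DCF", "CEG", "EFH", "FGI", "GHA", "HIB", "IAD"]

counts = Counter()  # (cluster, position) occurrences
for booklet in booklets:
    for position, cluster in enumerate(booklet, start=1):
        counts[(cluster, position)] += 1

balanced = all(counts[(c, p)] == 1
               for c in "ABCDEFGHI" for p in (1, 2, 3))
print(balanced)  # True: every cluster occupies every position once
```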
12. Experiment 1
Cluster 1 Cluster 2 Cluster 3
Booklet 1 A B C
Booklet 2 B D E
Booklet 3 D C F
Booklet 4 C E G
Booklet 5 E F H
Booklet 6 F G I
Booklet 7 G H A
Booklet 8 H I B
Booklet 9 I A D
13. Positioning effect
• Full model: when all booklets are scaled
jointly, the mean abilities of the booklet 2
and booklet 5 groups are equal
• Common items in cluster E
– at the end of booklet 2
– at the start of booklet 5
14. Imagine
• booklet 2 is a full test at time 1 and
• booklet 5 is a full test at time 2
• cluster E are trend items
• Equate two tests using common items
from cluster E (joint scaling method)
(remember that the average ability of the two groups
is equal when scaling all booklets)
15. Results experiment 1
         Time 1   Time 2
Mean      0.46     0.70
• The change is not caused by an increase in
ability over time,
• but by a change in booklet design: the trend
items moved forward in the booklet and
became easier
• Examples: PISA reading 2003 and PISA
science 2009
16. Effects of item characteristics on trend estimation
SELECTION OF TREND ITEMS
17. Trend items &
Differential Item Functioning
• Assumption of the Rasch model:
all students with the same ability have the
same probability of answering an item
correctly, regardless of the subgroup a
student belongs to
• The violation of this assumption is
called Differential Item Functioning
(DIF)
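A simple way to screen for DIF, short of a full Rasch analysis, is to compare the proportion correct of two subgroups within the same ability stratum. A minimal sketch with invented data (the stratification and records are hypothetical):

```python
# Minimal DIF screen: within each ability stratum, compare proportion
# correct of boys vs. girls; average the gaps over strata.

def dif_gap(records, item):
    """Average over ability strata of (p_correct boys - p_correct girls)."""
    strata = {}
    for r in records:
        strata.setdefault(r["stratum"], []).append(r)
    gaps = []
    for members in strata.values():
        boys = [r[item] for r in members if r["group"] == "boy"]
        girls = [r[item] for r in members if r["group"] == "girl"]
        if boys and girls:
            gaps.append(sum(boys) / len(boys) - sum(girls) / len(girls))
    return sum(gaps) / len(gaps)

records = [  # invented scored responses (1 = correct)
    {"group": "boy",  "stratum": "low",  "item1": 1},
    {"group": "boy",  "stratum": "low",  "item1": 1},
    {"group": "girl", "stratum": "low",  "item1": 0},
    {"group": "girl", "stratum": "low",  "item1": 1},
    {"group": "boy",  "stratum": "high", "item1": 1},
    {"group": "girl", "stratum": "high", "item1": 1},
]
print(dif_gap(records, "item1"))  # positive gap: item favours boys
```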
19. Experiment 2
• Item pool of 105 items for assessment
at time 1
• Selection of 55 trend items all favouring
boys
• Scale both sets of items on the same set
of student responses
20. Results experiment 2
Abilities by subgroup:
              Boys    Girls
All items     0.44    0.44
Boys items    0.60    0.50
21. Conclusion experiment 2
• Selecting trend items that on average
favour a subgroup of students changes
the gap in performance between
subgroups
• Example PISA reading 2003
22. Trend items &
Item discrimination
• Good items discriminate between high-
and low-ability students
• Some items discriminate more than others
• Average abilities of students per answer option:
            Item A   Item B
  Answer A   1.00     0.42
  Answer B  -0.22     0.41
  Answer C  -0.15     0.81
  Answer D  -0.02     0.33
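A table like the one above can be produced by averaging the abilities of the students who chose each answer option. A minimal sketch with invented abilities and responses:

```python
# Per-option ability averages for one item (all data hypothetical).

def mean_ability_by_option(responses, abilities):
    """responses[s] = option chosen by student s; abilities[s] = theta."""
    totals, counts = {}, {}
    for option, theta in zip(responses, abilities):
        totals[option] = totals.get(option, 0.0) + theta
        counts[option] = counts.get(option, 0) + 1
    return {o: totals[o] / counts[o] for o in totals}

abilities = [1.2, 0.8, -0.3, -0.1, 0.0]
responses = ["A", "A", "B", "C", "D"]  # option picked on one item

means = mean_ability_by_option(responses, abilities)
print(means)  # a clearly higher mean for one option signals good
              # discrimination (the strong students found the key)
```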
23. Slopes
• Level of discrimination is reflected by
the slope of the ICC
24. Assumption
• Assumption of the Rasch model:
slopes are equal across items
• However, in practice slopes always vary
a little within a test
• The expected slope is the average
slope of all items in a test
• The estimated population variance is a
reflection of the average slope
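The slope can be made concrete with the two-parameter logistic (2PL) item characteristic curve, p = 1 / (1 + exp(-a(θ - b))); the Rasch model fixes the discrimination a = 1 for every item. At the item's difficulty the ICC slope equals a/4, which the sketch below checks numerically:

```python
import math

def icc(theta, b, a=1.0):
    """2PL probability of a correct response; a=1 gives the Rasch ICC."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def slope_at_difficulty(b, a=1.0, eps=1e-6):
    """Numerical slope of the ICC at theta = b (should be a / 4)."""
    return (icc(b + eps, b, a) - icc(b - eps, b, a)) / (2 * eps)

print(round(slope_at_difficulty(0.0, a=1.0), 3))  # Rasch item
print(round(slope_at_difficulty(0.0, a=2.0), 3))  # steeper 2PL item
```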
25. Experiment 3
• Item pool of 105 items to assess
students at time 1
• A set of 53 more discriminating items
selected as trend items
• Scale each set of items on the same
student responses
26. Results experiment 3
[Figure: estimated population ability distributions (ability scale -6 to 6) for all items vs. the high-discrimination trend items]
27. Conclusion experiment 3
• Selecting more discriminating items as
trend items increases the average slope
and therefore the variance of
performance in the student population
• Happens in practice because items with
high discrimination are often regarded
as better items and are therefore kept
for future testing
28. Trend items &
Sub-domains
• Equating shift should be based on a set
of items that is representative of the
whole test
• Equating shifts can be slightly different
for different sub-domains
• Best practice is to have equal proportions
of sub-domains in the trend items and in the
total item pool
29. Trend items &
Item types
• Equating shifts can differ slightly between
multiple-choice items and open-ended
items
• Best practice is to have equal proportions
of item types in the trend items and in the
total item pool
31. Recommendations
• After the field trial, drop items with high DIF on the
student background characteristics of most interest
• After the field trial, drop items with low discrimination
• Keep as many trend items as possible
• Check if proportions of important item characteristics
(including DIF, discrimination, sub-domain, item type,
item difficulty) are roughly equal between trend items
and the total item pool of both the old and the new
test
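The last recommendation can be checked mechanically: compute the proportions of a characteristic (here sub-domain) in the trend-item set and in the full pool and compare them. Item labels and domains are made up:

```python
from collections import Counter

def proportions(items, key):
    """Proportion of items per value of a characteristic (e.g. domain)."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

pool = [  # hypothetical item pool
    {"id": 1, "domain": "algebra"},  {"id": 2, "domain": "algebra"},
    {"id": 3, "domain": "geometry"}, {"id": 4, "domain": "geometry"},
]
trend = [pool[0], pool[2]]  # candidate trend items

print(proportions(pool, "domain"))   # pool composition
print(proportions(trend, "domain"))  # should match the pool roughly
```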