This is the third in a series of PowerPoint presentations from a CAT/IRT workshop held at the University of Brasilia in 2012. It provides an introduction to item response theory (IRT), covering advanced topics such as linking and equating, scaling, differential item functioning, polytomous models, and dimensionality. Learn more at www.assess.com.
Implementing Item Response Theory
1. Day 2 AM: Advanced IRT topics
Linking and Equating
DIF
Polytomous IRT
IRT Software overview
Dimensionality
2. Part 1
Linking, equating, and scaling
Linking = setting different sets of item parameters onto the
same scale
Equating = setting different sets of students/scores on the
same scale
3. Linking and equating
Why important? This is necessary for
a stable scale
If we equate scores on this year’s
exam forms to last year’s, we know
that a score of X still means the
exact same thing
If we don’t, a score of X could mean
different things
4. Linking and equating
Two approaches
◦ Base form: Completely map Form B on to
the scale of Form A (the base form)
Appropriate for maintaining continuity across
time… Form A is base for a long time
◦ Merged scale: Combine data and scales
for Forms A and B (“super matrix”)
OK for multiple forms at same time, but not
across time
5. Linking and equating
In IRT, items and people are on the
same scale, so linking/equating are
equivalent theoretically (although
they can be conducted separately)
In CTT, this is not the case
Linking doesn’t really exist in CTT –
but there is extensive research on
equating because it is so important
6. IRT equating
Many issues in CTT equating are
reduced with IRT because of the
property of invariance
Item parameters are invariant across
calibration groups, except for a
linear transformation
All we have to do is find it
7. IRT equating
Why? Each scale is defined by its
calibration sample, which is scaled to
N(0,1)
Another sample may have a slightly
different distribution of true ability,
yet it is also calibrated to its own,
theoretically different, N(0,1)
So we find how to map the two scales
to each other
8. Prerequisites
To accomplish linking/equating:
1. The two tests/forms must measure
the same thing (otherwise only a
concordance is possible)
2. The two tests/forms must have
equal reliability (classically)
3. The equating transformation must
be invertible
◦ (A to B and B to A)
9. Common items or people?
To do an effective linking between
two data sets, you need something in
common
◦ People – some of the students are the
same as last year (but they must be
unchanged in ability, so this is probably
not a good idea in education!)
◦ Items – Rule of thumb is 20% or 20 items
11. Common item linking
Suppose there were 100 items on a
test in 2010
…and the first 20 were anchors back
to 2009
Then we need to pick 20 out of the
last 80 to be anchors in 2011
80 additional items would be
selected as “new” (not necessarily
brand new)
12. Common item linking
2009 total average = 65
2010 total average = 67
2009 anchor average = 11
2010 anchor average = 12
Because the anchor items are identical
across years, the difference in anchor
averages estimates how much the groups
truly differ; whatever remains of the
total-score difference reflects form
difficulty, which equating removes
13. Common item linking
Items should specifically be selected
to be the anchors
Difficulty: spread similar to the test
as a whole
Discrimination: higher is better, but
not so much that it is
unrepresentative of the test
Not previous anchors
14. IRT linking analyses
There are two paradigms for linking:
◦ Concurrent calibration linking
Full Group (merges scale!!!)
Target Group (fix parameters)
◦ Conversion linking
Parameter transformation (mean/mean,
mean/sigma)
TRF methods (Stocking & Lord, Haebara)
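As a rough illustration of the parameter-transformation (mean/sigma) approach just listed, here is a minimal Python sketch; the anchor b values are made up purely for illustration.

# Mean/sigma linking: estimate the linear conversion that puts Form B's
# scale onto Form A's scale, using the b parameters of the anchor items.
# Anchor values below are hypothetical.
import statistics

b_on_form_A = [-1.20, -0.40, 0.15, 0.70, 1.35]  # anchors as calibrated with Form A
b_on_form_B = [-1.05, -0.30, 0.25, 0.85, 1.50]  # same anchors, Form B calibration

A = statistics.stdev(b_on_form_A) / statistics.stdev(b_on_form_B)    # slope
B = statistics.mean(b_on_form_A) - A * statistics.mean(b_on_form_B)  # intercept
print(f"Slope A = {A:.3f}, intercept B = {B:.3f}")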
15. IRT linking analyses
I recommend either target group
calibration or S&L conversion
Xcalibre does concurrent calibration
methods
Conversion methods are an additional
post-hoc analysis, so they require
separate software: IRTEQ
16. Linking software - IRTEQ
User-friendly conversion linking
◦ Kyung (Chris) T. Han
Now at GMAC
◦ Windows GUI
◦ Does all major conversion methods, and
compares them
◦ Interfaces with Parscale
17. Linking software - IRTEQ
Purpose of conversion methods:
estimate the linear conversion
between two IRT scales (two
different forms)
Kind of like regression
Since the conversion is linear, “no
change” means a slope (A) of 1 and an
intercept (B) of 0
Five different methods of estimating
these
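Once A and B have been estimated (by whichever method), the standard rescaling formulas put the new form's parameters and thetas onto the base scale. A small sketch with hypothetical values of A and B:

# Standard IRT rescaling with an estimated slope A and intercept B:
#   theta* = A*theta + B,  b* = A*b + B,  a* = a / A,  c unchanged.
# A and B here are hypothetical.
A, B = 1.05, -0.12

def rescale_item(a, b, c):
    return a / A, A * b + B, c

def rescale_theta(theta):
    return A * theta + B

print(rescale_item(a=1.2, b=0.5, c=0.20))  # item parameters on the base scale
print(rescale_theta(0.0))                  # score on the base scale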
23. What is DIF?
Differential item functioning
The item functions differently for
different groups, and may therefore be unfair
One group is more likely to get an item
correct when ability is held constant
24. What is DIF?
Two ways to operationally define:
Directly evaluate probability of response
for ability level slices (Mantel-Haenszel)
Compare item parameters or statistics
for each group
◦ Basically, analyze the data for each group
separately, then compare
25. DIF Groups
Reference group – the main (usually
majority) group
Focal group – the group being
examined to see whether it differs from
the reference group (usually a minority)
DIF analyses assume that both are on
same scale
26. Types of DIF
Non-Crossing DIF = the group difference is
the same across all ability levels
◦ Females do better than males, regardless of
ability
◦ “Bias”
◦ aka Uniform DIF
Crossing DIF = the group difference changes
across ability levels
◦ Females do better than males at above
average ability, but the same for low ability
◦ aka Non-Uniform DIF
29. Quantifying DIF
Mantel-Haenszel (in Xcalibre)
Make N ability level slices
At each, 2 x 2 table of reference/
focal and correct/incorrect
“Ability” can be classical or IRT
scores
Show Xcalibre – SMKING with P/L
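A minimal sketch of the Mantel-Haenszel computation (not Xcalibre's code; the counts per slice are hypothetical): build a 2 x 2 table at each ability slice and combine them into a common odds ratio.

# Mantel-Haenszel DIF: combine 2x2 tables (group x correct/incorrect)
# across ability slices. Each tuple below is
# (ref correct, ref incorrect, focal correct, focal incorrect); made up.
import math

slices = [
    (40, 60, 18, 42),
    (55, 45, 30, 30),
    (70, 30, 44, 16),
]

num = den = 0.0
for r_c, r_i, f_c, f_i in slices:
    n = r_c + r_i + f_c + f_i
    num += r_c * f_i / n
    den += r_i * f_c / n

alpha_mh = num / den                   # common odds ratio; 1.0 means no DIF
mh_d_dif = -2.35 * math.log(alpha_mh)  # ETS delta metric; negative favors reference
print(f"MH odds ratio = {alpha_mh:.3f}, MH D-DIF = {mh_d_dif:.2f}")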
30. Quantifying DIF
There are two IRT-only approaches to
quantifying DIF
◦ Difference between item parameters
bR = bF?
Parscale uses this
◦ Difference between IRFs
More advanced and recent (1995)
Special program needed: DFIT
ASC sells Windows version; DOS is free
31. DIF in Parscale
Parscale gives several indices (bR = bF)
◦ 1. Absolute difference in parameter
◦ 2. Standardized difference (absolute/SE)
◦ 3. Chi-square = (StanDiff)²
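A small sketch of these three indices with hypothetical values; here the standard error of the difference is taken as the two calibration SEs combined, which is one common (assumed) choice.

# Indices for comparing b parameters calibrated separately by group.
import math

b_ref, se_ref = 0.45, 0.08   # reference group b and its SE (hypothetical)
b_foc, se_foc = 0.72, 0.11   # focal group b and its SE (hypothetical)

abs_diff = abs(b_ref - b_foc)               # 1. absolute difference
se_diff = math.sqrt(se_ref**2 + se_foc**2)  # SE of the difference
stan_diff = abs_diff / se_diff              # 2. standardized difference
chi_square = stan_diff**2                   # 3. chi-square with 1 df
print(f"|bR - bF| = {abs_diff:.2f}, z = {stan_diff:.2f}, chi2 = {chi_square:.2f}")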
36. Compensatory DIF
Another thing to keep an eye out for
DIF in one item can be offset by DIF
in another
So a few items favoring one group can
be offset by items favoring the other
The total test then shows no DTF
(differential test functioning)
37. Compensatory DIF
And you’re not likely to have a test
with no DIF items at all; some flags
happen simply by chance (Type I error)
DIF analysis can only flag items for
you
You then need to closely evaluate the
content and decide whether there is a
real issue or whether to proceed
39. Polytomous IRT
For data scored in three or more
categories (remember, multiple choice
collapses to two)
As mentioned previously, there are
two main families of polytomous
models
◦ Rating Scale
◦ Partial Credit
Rasch and non-Rasch (“Generalized”)
40. Rating scale approach
Designed for Likert-type questions…
◦ Rate on a scale of 1 to 5 whether the
adjective applies to you:
Adjective 1 2 3 4 5
Trustworthy
Outgoing
Diligent
Conscientious
41. Rating scale approach
We assume that the process or
mental structure behind the 1-5 scale
is the same for every item
But items might differ in “difficulty”
42. Partial credit approach
We assume that the response process
is different for every item
The difference between 2 and 4
points might be wider/narrower
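To make the two approaches concrete, here is a minimal sketch (not from the workshop) of category probabilities under a generalized partial credit formulation; with the rating scale constraint every item shares one set of thresholds offset by its difficulty, while partial credit lets each item have its own steps.

# Category probabilities for one polytomous item (generalized partial credit).
# steps[j] is the j-th step parameter; a is discrimination. Values are made up.
import math

def category_probs(theta, a, steps):
    sums, total = [0.0], 0.0          # category 0 has an exponent of 0
    for step in steps:
        total += a * (theta - step)
        sums.append(total)
    exp_sums = [math.exp(s) for s in sums]
    denom = sum(exp_sums)
    return [e / denom for e in exp_sums]

# A 0-3 point item at theta = 0.5
print(category_probs(theta=0.5, a=1.1, steps=[-1.0, 0.2, 1.4]))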
47. IRT Software
There are a number of programs out
there, reflecting:
◦ Types of approaches (Rasch, 3PL, Poly)
◦ Cost (from free to hundreds of dollars)
◦ Special topics like fit, linking, and form
assembly
◦ Usability vs. flexibility
48. Some IRT calibration programs
Xcalibre 4 – easy to use, complete
reports
Parscale – extremely flexible, does
most models, but difficult to use
Bilog – most powerful dichotomous
program, difficult to use
ConQuest – advanced things like
facets models and multidimensional models
Winsteps – most common Rasch
program
49. Some IRT calibration programs
PARAM3PL – free; only 3PL
ICL – free; lots of stuff, but difficult
to use and no support
R – free; some routines there, but
slow, and inferior output
OPLM – free; from Cito
50. Other IRT programs
ASC’s Form Building Tool to build new
forms using calibrated items
DIF Tool for DIF graphing
DFIT8 – DFIT framework for DIF
ScoreAll – scores examinees
CATSim – CAT simulations
IRTLRDIF2
Most organizations build their own
tools for specific purposes
51. What to do with the results?
Often a good idea to import scores
and item parameters into Excel (ASC
does CSV directly)
You can manipulate and further
analyze (frequency graphs, etc.)
Also helpful for further importing –
scores into a database and item
parameters into the item banker
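For example, a small sketch of reading an exported CSV in Python; the file name and column headers here are hypothetical, so adjust them to match what your software actually writes out.

# Read exported item parameters and tabulate difficulty.
# "item_parameters.csv" and its column names are assumptions, not a real format.
import csv
from collections import Counter

b_values = []
with open("item_parameters.csv", newline="") as f:
    for row in csv.DictReader(f):
        b_values.append(float(row["b"]))

bins = Counter(round(b * 2) / 2 for b in b_values)   # bin b to the nearest 0.5
for value in sorted(bins):
    print(f"b near {value:+.1f}: {bins[value]} items")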
53. IRT assumptions
Basic Assumptions
1. A stable unidimensional trait
Item responses are independent of each
other (local independence) once the
trait/ability they measure is accounted for
2. A specific form of the relationship
between trait level and probability of a
response (the response function, or IRT
model)
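For example, the three-parameter logistic (3PL) function is one common choice for that response function:

# 3PL model: assumed relationship between trait level and probability correct.
# a = discrimination, b = difficulty, c = pseudo-guessing;
# 1.7 is the usual scaling constant.
import math

def p_correct_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

print(p_correct_3pl(theta=0.0, a=1.0, b=0.5, c=0.20))  # average examinee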
54. IRT assumptions
Unidimensionality and local
independence are actually equivalent
◦ If items are interdependent, then the
probability of a response is due to two
things: your trait level, and whether you
saw the “tripping” item first
◦ This makes it two-dimensional
55. IRT assumptions
Two other common violations:
◦ Speededness
◦ Actual multidimensional test (medical
knowledge vs. clinical ability, language
vs. math)
56. How to check
So there are two important things to
check:
◦ Unidimensionality
Factor Analysis
Bejar’s method
DIMTEST
◦ Whether our IRT model was a good
choice
Model fit
57. Checking unidimensionality
Factor analysis
Used often in research investigating
dimensionality
But it is not recommended to use
“normal” factor analysis, which uses
Pearson correlations
◦ This is used in typical software packages
like SPSS
58. Checking unidimensionality
Item-level data of tests are
dichotomous, unlike total scores,
which are continuous
Special software does factor analysis
with tetrachoric correlations instead
MicroFact (from ASC)
TESTFACT (from SSI)
59. Checking unidimensionality
Output is still similar to regular
factor analysis
Eigenvalue plot to examine number
of factors
Factor loading matrix to examine
“sorting” of items
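A rough sketch of the eigenvalue check; the tetrachoric matrix itself would come from a program like MicroFact or TESTFACT, so a small made-up correlation matrix stands in here.

# Eigenvalues of an (already computed) tetrachoric correlation matrix.
# One dominant eigenvalue is consistent with unidimensionality.
import numpy as np

R = np.array([
    [1.00, 0.48, 0.52, 0.45],
    [0.48, 1.00, 0.50, 0.42],
    [0.52, 0.50, 1.00, 0.47],
    [0.45, 0.42, 0.47, 1.00],
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
print("Eigenvalues:", np.round(eigenvalues, 2))
print("First/second ratio:", round(eigenvalues[0] / eigenvalues[1], 2))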
61. Checking unidimensionality
See output files…
If unidimensional, factor loadings will
pattern similar to IRT a parameters
Item a Loading
1 .72 .42
2 .81 .44
3 .96 .54
4 .83 .25
5 .47 .11
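One way to see why the two columns should pattern together: under the normal-ogive model a loading converts approximately to an a parameter (an illustrative sketch, not part of the slides).

# Approximate conversion from a unidimensional factor loading to an
# IRT discrimination parameter (normal-ogive metric): a = L / sqrt(1 - L^2).
import math

def loading_to_a(loading):
    return loading / math.sqrt(1.0 - loading ** 2)

for loading in (0.42, 0.44, 0.54, 0.25, 0.11):   # loadings from the table above
    print(f"loading {loading:.2f} -> a about {loading_to_a(loading):.2f}")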
62. Checking unidimensionality
Bejar’s Method
Useful in situations where you know
your test has different content areas
Examples:
◦ Cognitive test with fluid and
crystallized intelligence
◦ Math test with story problems and
number-only problems
◦ Language test with writing and reading
63. Checking unidimensionality
It is possible that these tests are not
completely unidimensional, and we
have a good reason to check
64. Checking unidimensionality
Bejar’s method:
◦ 1. Do an IRT calibration of the entire test
◦ 2. Do an IRT calibration of each area
separately
◦ 3. Compare the item parameters
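A minimal sketch of step 3 with hypothetical b values: if the test is essentially unidimensional, the parameters from the separate calibration should line up closely with those from the whole-test calibration.

# Bejar's method, step 3: compare item difficulties from the whole-test
# calibration with those from a single content-area calibration.
# The b values below are hypothetical.
import numpy as np

b_whole   = [-1.3, -0.6, -0.1, 0.4, 0.9, 1.5]   # entire test calibrated together
b_subarea = [-1.2, -0.7,  0.0, 0.5, 0.8, 1.6]   # one content area calibrated alone

r = np.corrcoef(b_whole, b_subarea)[0, 1]
print(f"Correlation between calibrations: {r:.3f}")
# Near-linear agreement supports unidimensionality; systematic
# scatter suggests the content area behaves as a separate dimension.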