This is the third in a series of PowerPoint presentations from a CAT/IRT workshop held at the University of Brasilia in 2012. It provides an introduction to item response theory (IRT), covering advanced topics such as linking and equating, scaling, differential item functioning, polytomous models, and dimensionality. Learn more at www.assess.com.
Implementing Item Response Theory
1. Day 2 AM: Advanced IRT topics
Linking and Equating
DIF
Polytomous IRT
IRT Software overview
Dimensionality
2. Part 1
Linking, equating, and scaling
Linking = setting different sets of item parameters onto the
same scale
Equating = setting different sets of students/scores on the
same scale
3. Linking and equating
Why important? This is necessary for
a stable scale
If we equate scores on this year’s
exam forms to last year’s, we know
that a score of X still means the
exact same thing
If we don’t, a score of X could mean
different things
4. Linking and equating
Two approaches
◦ Base form: Completely map Form B on to
the scale of Form A (the base form)
Appropriate for maintaining continuity across
time… Form A is base for a long time
◦ Merged scale: Combine data and scales
for Forms A and B (“super matrix”)
OK for multiple forms at same time, but not
across time
5. Linking and equating
In IRT, items and people are on the
same scale, so linking/equating are
equivalent theoretically (although
they can be conducted separately)
In CTT, this is not the case
Linking doesn’t really exist in CTT –
but there is extensive research on
equating because it is so important
6. IRT equating
Many issues in CTT equating are
reduced with IRT because of the
property of invariance
Item parameters are invariant across
calibration groups, except for a
linear transformation
All we have to do is find it
7. IRT equating
Why? Each scale is defined by its
calibration sample, which is scaled to
N(0,1)
Another sample may have a slightly
different distribution of true ability,
yet it is also calibrated to its own,
theoretically different, N(0,1)
So we find how to map the two scales
to each other
8. Prerequisites
To accomplish linking/equating:
1. The two tests/forms must measure
the same thing (otherwise only a
concordance is possible)
2. The two tests/forms must have
equal reliability (classically)
3. The equating transformation must
be invertible
◦ (A to B and B to A)
9. Common items or people?
To do an effective linking between
two data sets, you need something in
common
◦ People – some of the students are the
same as last year (but they must be
unchanged in ability, so this is probably
not a good idea in education!)
◦ Items – Rule of thumb is 20% or 20 items
11. Common item linking
Suppose there were 100 items on a
test in 2010
…and the first 20 were anchors back
to 2009
Then we need to pick 20 out of the
last 80 to be anchors in 2011
80 additional items would be
selected as “new” (not necessarily
brand new)
12. Common item linking
2009 total average = 65
2010 total average = 67
2009 anchor average = 11
2010 anchor average = 12
Because the anchor items are identical
across years, the difference in anchor
averages estimates how much the groups
truly differ; whatever remains of the
total-score difference reflects form
difficulty, which equating removes
13. Common item linking
Items should specifically be selected
to be the anchors
Difficulty: spread similar to the test
as a whole
Discrimination: higher is better, but
not so much that it is
unrepresentative of the test
Not previous anchors
14. IRT linking analyses
There are two paradigms for linking:
◦ Concurrent calibration linking
Full Group (merges scale!!!)
Target Group (fix parameters)
◦ Conversion linking
Parameter transformation (mean/mean,
mean/sigma)
TRF methods (Stocking & Lord, Haebara)
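As a rough illustration of the parameter-transformation (mean/sigma) approach just listed, here is a minimal Python sketch; the anchor b values are made up purely for illustration.

# Mean/sigma linking: estimate the linear conversion that puts Form B's
# scale onto Form A's scale, using the b parameters of the anchor items.
# Anchor values below are hypothetical.
import statistics

b_on_form_A = [-1.20, -0.40, 0.15, 0.70, 1.35]  # anchors as calibrated with Form A
b_on_form_B = [-1.05, -0.30, 0.25, 0.85, 1.50]  # same anchors, Form B calibration

A = statistics.stdev(b_on_form_A) / statistics.stdev(b_on_form_B)    # slope
B = statistics.mean(b_on_form_A) - A * statistics.mean(b_on_form_B)  # intercept
print(f"Slope A = {A:.3f}, intercept B = {B:.3f}")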
15. IRT linking analyses
I recommend either target group
calibration or S&L conversion
Xcalibre does concurrent calibration
methods
Conversion methods are an additional
post-hoc analysis, so they require
separate software: IRTEQ
16. Linking software - IRTEQ
User-friendly conversion linking
◦ Kyung (Chris) T. Han
Now at GMAC
◦ Windows GUI
◦ Does all major conversion methods, and
compares them
◦ Interfaces with Parscale
17. Linking software - IRTEQ
Purpose of conversion methods:
estimate the linear conversion
between two IRT scales (two
different forms)
Kind of like regression
Since the conversion is linear, “no
change” means a slope (A) of 1 and an
intercept (B) of 0
Five different methods of estimating
these
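Once A and B have been estimated (by whichever method), the standard rescaling formulas put the new form's parameters and thetas onto the base scale. A small sketch with hypothetical values of A and B:

# Standard IRT rescaling with an estimated slope A and intercept B:
#   theta* = A*theta + B,  b* = A*b + B,  a* = a / A,  c unchanged.
# A and B here are hypothetical.
A, B = 1.05, -0.12

def rescale_item(a, b, c):
    return a / A, A * b + B, c

def rescale_theta(theta):
    return A * theta + B

print(rescale_item(a=1.2, b=0.5, c=0.20))  # item parameters on the base scale
print(rescale_theta(0.0))                  # score on the base scale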
23. What is DIF?
Differential item functioning
The item functions differently for
different groups, and may therefore be unfair
One group is more likely to get an item
correct when ability is held constant
24. What is DIF?
Two ways to operationally define:
Directly evaluate probability of response
for ability level slices (Mantel-Haenszel)
Compare item parameters or statistics
for each group
◦ Basically, analyze the data for each group
separately, then compare
25. DIF Groups
Reference group – the main (usually
majority) group
Focal group – the group being
examined to see whether it differs from
the reference group (usually a minority)
DIF analyses assume that both are on
same scale
26. Types of DIF
Non-Crossing DIF = the group difference is
the same across all ability levels
◦ Females do better than males, regardless of
ability
◦ “Bias”
◦ aka Uniform DIF
Crossing DIF = the group difference changes
across ability levels
◦ Females do better than males at above
average ability, but the same for low ability
◦ aka Non-Uniform DIF
29. Quantifying DIF
Mantel-Haenszel (in Xcalibre)
Make N ability level slices
At each, 2 x 2 table of reference/
focal and correct/incorrect
“Ability” can be classical or IRT
scores
Show Xcalibre – SMKING with P/L
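A minimal sketch of the Mantel-Haenszel computation (not Xcalibre's code; the counts per slice are hypothetical): build a 2 x 2 table at each ability slice and combine them into a common odds ratio.

# Mantel-Haenszel DIF: combine 2x2 tables (group x correct/incorrect)
# across ability slices. Each tuple below is
# (ref correct, ref incorrect, focal correct, focal incorrect); made up.
import math

slices = [
    (40, 60, 18, 42),
    (55, 45, 30, 30),
    (70, 30, 44, 16),
]

num = den = 0.0
for r_c, r_i, f_c, f_i in slices:
    n = r_c + r_i + f_c + f_i
    num += r_c * f_i / n
    den += r_i * f_c / n

alpha_mh = num / den                   # common odds ratio; 1.0 means no DIF
mh_d_dif = -2.35 * math.log(alpha_mh)  # ETS delta metric; negative favors reference
print(f"MH odds ratio = {alpha_mh:.3f}, MH D-DIF = {mh_d_dif:.2f}")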
30. Quantifying DIF
There are two IRT-only approaches to
quantifying DIF
◦ Difference between item parameters
bR = bF?
Parscale uses this
◦ Difference between IRFs
More advanced and recent (1995)
Special program needed: DFIT
ASC sells Windows version; DOS is free
31. DIF in Parscale
Parscale gives several indices (bR = bF)
◦ 1. Absolute difference in parameter
◦ 2. Standardized difference (absolute/SE)
◦ 3. Chi-square = (StanDiff)²
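A small sketch of these three indices with hypothetical values; here the standard error of the difference is taken as the two calibration SEs combined, which is one common (assumed) choice.

# Indices for comparing b parameters calibrated separately by group.
import math

b_ref, se_ref = 0.45, 0.08   # reference group b and its SE (hypothetical)
b_foc, se_foc = 0.72, 0.11   # focal group b and its SE (hypothetical)

abs_diff = abs(b_ref - b_foc)               # 1. absolute difference
se_diff = math.sqrt(se_ref**2 + se_foc**2)  # SE of the difference
stan_diff = abs_diff / se_diff              # 2. standardized difference
chi_square = stan_diff**2                   # 3. chi-square with 1 df
print(f"|bR - bF| = {abs_diff:.2f}, z = {stan_diff:.2f}, chi2 = {chi_square:.2f}")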
36. Compensatory DIF
Another thing to keep an eye out for
DIF in one item can be offset by DIF
in another
So a few items favoring one group can
be offset by items favoring the other
The total test then shows no DTF
(differential test functioning)
37. Compensatory DIF
And you’re not likely to have a test
with no DIF items at all; some flags
happen simply by chance (Type I error)
DIF analysis can only flag items for
you
You then need to closely evaluate the
content and decide whether there is a
real issue or whether to proceed
39. Polytomous IRT
For data scored in three or more
categories (remember, multiple choice
collapses to two)
As mentioned previously, there are
two main families of polytomous
models
◦ Rating Scale
◦ Partial Credit
Rasch and non-Rasch (“Generalized”)
40. Rating scale approach
Designed for Likert-type questions…
◦ Rate on a scale of 1 to 5 whether the
adjective applies to you:
Adjective 1 2 3 4 5
Trustworthy
Outgoing
Diligent
Conscientious
41. Rating scale approach
We assume that the process or
mental structure behind the 1-5 scale
is the same for every item
But items might differ in “difficulty”
42. Partial credit approach
We assume that the response process
is different for every item
The difference between 2 and 4
points might be wider/narrower
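To make the two approaches concrete, here is a minimal sketch (not from the workshop) of category probabilities under a generalized partial credit formulation; with the rating scale constraint every item shares one set of thresholds offset by its difficulty, while partial credit lets each item have its own steps.

# Category probabilities for one polytomous item (generalized partial credit).
# steps[j] is the j-th step parameter; a is discrimination. Values are made up.
import math

def category_probs(theta, a, steps):
    sums, total = [0.0], 0.0          # category 0 has an exponent of 0
    for step in steps:
        total += a * (theta - step)
        sums.append(total)
    exp_sums = [math.exp(s) for s in sums]
    denom = sum(exp_sums)
    return [e / denom for e in exp_sums]

# A 0-3 point item at theta = 0.5
print(category_probs(theta=0.5, a=1.1, steps=[-1.0, 0.2, 1.4]))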
47. IRT Software
There are a number of programs out
there, reflecting:
◦ Types of approaches (Rasch, 3PL, Poly)
◦ Cost (from free to hundreds of dollars)
◦ Special topics like fit, linking, and form
assembly
◦ Usability vs. flexibility
48. Some IRT calibration programs
Xcalibre 4 – easy to use, complete
reports
Parscale – extremely flexible, does
most models, but difficult to use
Bilog – most powerful dichotomous
program, difficult to use
ConQuest – advanced things like
facets models and multidimensional models
Winsteps – most common Rasch
program
49. Some IRT calibration programs
PARAM3PL – free; only 3PL
ICL – free; lots of stuff, but difficult
to use and no support
R – free; some routines there, but
slow, and inferior output
OPLM – free; from Cito
50. Other IRT programs
ASC’s Form Building Tool to build new
forms using calibrated items
DIF Tool for DIF graphing
DFIT8 – DFIT framework for DIF
ScoreAll – scores examinees
CATSim – CAT simulations
IRTLRDIF2
Most organizations build their own
tools for specific purposes
51. What to do with the results?
Often a good idea to import scores
and item parameters into Excel (ASC
does CSV directly)
You can manipulate and further
analyze (frequency graphs, etc.)
Also helpful for further importing –
scores into a database and item
parameters into the item banker
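For example, a small sketch of reading an exported CSV in Python; the file name and column headers here are hypothetical, so adjust them to match what your software actually writes out.

# Read exported item parameters and tabulate difficulty.
# "item_parameters.csv" and its column names are assumptions, not a real format.
import csv
from collections import Counter

b_values = []
with open("item_parameters.csv", newline="") as f:
    for row in csv.DictReader(f):
        b_values.append(float(row["b"]))

bins = Counter(round(b * 2) / 2 for b in b_values)   # bin b to the nearest 0.5
for value in sorted(bins):
    print(f"b near {value:+.1f}: {bins[value]} items")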
53. IRT assumptions
Basic Assumptions
1. A stable unidimensional trait
Item responses are independent of each
other (local independence) once the
trait/ability they measure is accounted for
2. A specific form of the relationship
between trait level and probability of a
response (the response function, or IRT
model)
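For example, the three-parameter logistic (3PL) function is one common choice for that response function:

# 3PL model: assumed relationship between trait level and probability correct.
# a = discrimination, b = difficulty, c = pseudo-guessing;
# 1.7 is the usual scaling constant.
import math

def p_correct_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

print(p_correct_3pl(theta=0.0, a=1.0, b=0.5, c=0.20))  # average examinee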
54. IRT assumptions
Unidimensionality and local
independence are actually equivalent
◦ If items are interdependent, then the
probability of a response is due to two
things: your trait level, and whether you
saw the “tripping” item first
◦ This makes it two-dimensional
55. IRT assumptions
Two other common violations:
◦ Speededness
◦ Actual multidimensional test (medical
knowledge vs. clinical ability, language
vs. math)
56. How to check
So there are two important things to
check:
◦ Unidimensionality
Factor Analysis
Bejar’s method
DIMTEST
◦ Whether our IRT model was a good
choice
Model fit
57. Checking unidimensionality
Factor analysis
Used often in research investigating
dimensionality
But it is not recommended to use
“normal” factor analysis, which uses
Pearson correlations
◦ This is used in typical software packages
like SPSS
58. Checking unidimensionality
Item-level data of tests are
dichotomous, unlike total scores,
which are continuous
Special software does factor analysis
with tetrachoric correlations instead
MicroFact (from ASC)
TESTFACT (from SSI)
59. Checking unidimensionality
Output is still similar to regular
factor analysis
Eigenvalue plot to examine number
of factors
Factor loading matrix to examine
“sorting” of items
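A rough sketch of the eigenvalue check; the tetrachoric matrix itself would come from a program like MicroFact or TESTFACT, so a small made-up correlation matrix stands in here.

# Eigenvalues of an (already computed) tetrachoric correlation matrix.
# One dominant eigenvalue is consistent with unidimensionality.
import numpy as np

R = np.array([
    [1.00, 0.48, 0.52, 0.45],
    [0.48, 1.00, 0.50, 0.42],
    [0.52, 0.50, 1.00, 0.47],
    [0.45, 0.42, 0.47, 1.00],
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
print("Eigenvalues:", np.round(eigenvalues, 2))
print("First/second ratio:", round(eigenvalues[0] / eigenvalues[1], 2))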
61. Checking unidimensionality
See output files…
If unidimensional, factor loadings will
pattern similar to IRT a parameters
Item a Loading
1 .72 .42
2 .81 .44
3 .96 .54
4 .83 .25
5 .47 .11
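One way to see why the two columns should pattern together: under the normal-ogive model a loading converts approximately to an a parameter (an illustrative sketch, not part of the slides).

# Approximate conversion from a unidimensional factor loading to an
# IRT discrimination parameter (normal-ogive metric): a = L / sqrt(1 - L^2).
import math

def loading_to_a(loading):
    return loading / math.sqrt(1.0 - loading ** 2)

for loading in (0.42, 0.44, 0.54, 0.25, 0.11):   # loadings from the table above
    print(f"loading {loading:.2f} -> a about {loading_to_a(loading):.2f}")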
62. Checking unidimensionality
Bejar’s Method
Useful in situations where you know
your test has different content areas
Examples:
◦ Cognitive test with fluid and
crystallized intelligence
◦ Math test with story problems and
number-only problems
◦ Language test with writing and reading
63. Checking unidimensionality
It is possible that these tests are not
completely unidimensional, and we
have a good reason to check
64. Checking unidimensionality
Bejar’s method:
◦ 1. Do an IRT calibration of the entire test
◦ 2. Do an IRT calibration of each area
separately
◦ 3. Compare the item parameters
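A minimal sketch of step 3 with hypothetical b values: if the test is essentially unidimensional, the parameters from the separate calibration should line up closely with those from the whole-test calibration.

# Bejar's method, step 3: compare item difficulties from the whole-test
# calibration with those from a single content-area calibration.
# The b values below are hypothetical.
import numpy as np

b_whole   = [-1.3, -0.6, -0.1, 0.4, 0.9, 1.5]   # entire test calibrated together
b_subarea = [-1.2, -0.7,  0.0, 0.5, 0.8, 1.6]   # one content area calibrated alone

r = np.corrcoef(b_whole, b_subarea)[0, 1]
print(f"Correlation between calibrations: {r:.3f}")
# Near-linear agreement supports unidimensionality; systematic
# scatter suggests the content area behaves as a separate dimension.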