This is the first of a series of powerpoints presented at a CAT/IRT workshop at the University of Brasilia in 2012. It provides an introduction to item response theory (IRT), tying it to classical test theory and describing some of the major IRT models. Learn more at www.assess.com.
Introduction to Item Response Theory
1. Day 1 AM: An Introduction to
Item Response Theory
Nathan A. Thompson
Vice President, Assessment Systems Corporation
Adjunct faculty, University of Cincinnati
nthompson@assess.com
2. Welcome!
Thank you for attending!
Introductions and important info now
Software… download or USB
Please ask questions
◦ Also, slow me down or ask for translation!
Goal: provide an intro on IRT/CAT to
those who are new
◦ For those with some experience, to
provide new viewpoints and more
resources/recommendations
3. Where I’m from, professionally
PhD, University of Minnesota
◦ CAT for classifications
Test development manager for
ophthalmology certifications
Psychometrician at Prometric (many
certifications)
VP at ASC
7. Introduce yourselves
Name
Employer/organization
Types of tests you do and/or why you
are interested in IRT/CAT
(There might be someone with similar
interests here)
8. Another announcement
Newly formed: International
Association for Computerized Adaptive
Testing (IACAT)
◦ www.iacat.org
◦ Free membership
◦ Growing resources
◦ Next conference: August 2012, Sydney
9. Welcome!
This workshop is on two highly related
topics: IRT and CAT
IRT is the modern paradigm for
developing, analyzing, scoring, and
linking tests
CAT is a next-generation method of
delivering tests
CAT is not feasible without IRT, so we
do IRT first
10. IRT – where are we going?
IRT, as many of you know, provides a
way of analyzing items
However, it has drawbacks (no
distractor analysis), so the main
reasons to use IRT are at the test level
It solves certain issues with classical
test theory (CTT)
But the two should always be used
together
11. IRT – where are we going?
Advantages
◦ Better error characterization
◦ More precise scores
◦ Better linking
◦ Model-based
◦ Items and people on same scale (CAT)
◦ Sample-independence
◦ Powerful test assembly
12. IRT – where are we going?
Keyword: paradigm or approach
◦ Not just another statistical analysis
◦ It is a different way of thinking about how
tests should work, and how we can
approach specific problems (scaling,
equating, test assembly) from that
viewpoint
13. Day 1
There will be four parts this morning,
covering the theory behind IRT:
◦ Rationale: A graphical introduction to IRT
◦ Models (dichotomous and polytomous) and
their response functions
◦ IRT scoring (θ estimation)
◦ Item parameter estimation and model fit
15. What is IRT?
Basic Assumptions
1. Unidimensionality
A unidimensional latent trait (1 at a time)
Item responses are independent of each
other (local independence), except for the
trait/ability that they measure
2. A specific form of the relationship
between trait level and probability of a
response
The response function, or IRT model
There are a growing number of models
16. What is IRT?
A theory of mathematical functions that
model the responses of examinees to
test items/questions
These functions are item response
functions (IRFs)
Historically, it has also been known as
latent trait theory and item
characteristic curve theory
The IRFs are best described by showing
how the concept is derived from classical
analysis…
17. Classical item statistics
CTT statistics are typically calculated
for each option
Option N Prop Rpbis Mean
1 307 0.860 0.221 91.876
2 25 0.070 -0.142 85.600
3 14 0.039 -0.137 83.929
4 11 0.031 -0.081 86.273
18. Classical item statistics
The proportions are often translated
to a figure like this, where examinees
are split into groups
19. Classical item statistics
The general idea of IRT is to split the
previous graph up into more groups,
and then find a mathematical model
for the blue line
This is what makes the item response
function (IRF)
21. The item response function
Reflects the probability of a given
response as a function of the latent
trait (z-score)
Example:
22. The IRF
For dichotomously scored items, it is
the probability of a correct or keyed
response
Also called Item Characteristic Curve (ICC) or
Trace Line
Only one curve (correct response), and all
other responses are grouped as (1-IRF)
For polytomous items (partial credit,
etc.), it is the probability of each
response
23. The IRF
How do we know exactly what the
IRF for an item is?
We estimate parameters for an
equation that draws the curve
For dichotomous IRT, there are three
relevant parameters: a, b, and c
24. The IRF
a: The discrimination parameter;
represents how well the item
differentiates examinees; slope of the
curve at its center
b: The difficulty parameter; represents
how easy or hard the item is with
respect to examinees; location of the
curve (left to right)
c: The pseudoguessing parameter;
represents the ‘base probability’ of
answering the question; lower asymptote
26. The IRF…
is the “basic building block” of IRT
will differ from item to item
can be one of several different
models (now)
can be used to evaluate items (now)
is used for IRT scoring (next)
leads to “information” used for test
design (after that)
is the basis of CAT (tomorrow)
28. IRT models
Several families of models
◦ Dichotomous
◦ Polytomous
◦ Multidimensional
◦ Facets (scenarios vs raters)
◦ Mixed (additional parameters)
◦ Cognitive diagnostic
◦ We will focus on first two
29. Dichotomous IRT models
There are 3 main models in use, as
mentioned earlier: 1PL, 2PL, 3PL
The “L” refers to “logistic”: which is
the type of equation
IRT was originally developed decades
ago with a cumulative normal curve
This means that calculus needed to
be used
30. The logistic function
An approximation was developed: the
logistic curve
No calculus needed
There are two formats based on D
If D = 1.702, then diff < 0.01
If D = 1.0, a little more difference;
called the true logistic form
Does not really matter, as long as you
are consistent
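The claim that D = 1.702 makes the logistic curve differ from the cumulative normal by less than 0.01 can be checked numerically. A minimal sketch (function names are my own):

```python
import math

def normal_ogive(x: float) -> float:
    """Cumulative standard normal -- the curve IRT was originally built on."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x: float, D: float = 1.702) -> float:
    """Logistic approximation; with D = 1.702 it tracks the normal ogive closely."""
    return 1.0 / (1.0 + math.exp(-D * x))

# Largest gap between the two curves over a dense grid of z-scores.
max_diff = max(abs(normal_ogive(x) - logistic(x))
               for x in [i / 1000.0 for i in range(-4000, 4001)])
print(max_diff)  # less than 0.01, as the slide states
```

With D = 1.0 (the "true logistic form") the gap is noticeably larger, which is why the choice only matters if you mix the two conventions.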
32. Item parameters
We add parameters to slightly modify
the shape to get it to match our data
For example, a 4-option multiple
choice item has a 25% chance of
being guessed correctly
So we add a c parameter as a lower
asymptote, which means that the
curve is “squished” so it never goes
below 0.25 (next)
34. Item parameters
We can also add a parameter (a) that
modifies the slope
And a b parameter that slides the
entire curve left or right
◦ Tells us the person z-score for which
the item is appropriate
Items can be evaluated based on these
just like with CTT statistics
A little more next…
35. Item parameters: a
The a parameter ranges from 0.0 to
about 2.0 in practice (theoretically to
infinity)
Higher means better discriminating
For achievement testing, 0.7 or 0.8 is
good, aptitude testing is higher
Helps you: Remove items with a<0.4?
Identify a>1.0 as great items?
36. Item parameters: b
For what person z-score is the item
appropriate? (non-Rasch)
Should be between -3 and 3
◦ 99.9% of students are in that range
0.0 is average person
1.0 is difficult (85th percentile)
-1.0 is easy (15th percentile)
2.0 is super difficult (98%)
-2.0 is super easy (2%)
37. Item parameters: b
If item difficulties are normally
distributed, where does this fall?
(Rasch)
0.0 is average item (NOT PERSON)
38. Item parameters: c
The c parameter should be about
1/k, where k is the number of options
If higher, this indicates that options
are not attractive
For example, suppose c = 0.5
This means there is a 50/50 chance
That implies that even the lowest
students are able to ignore two
options and guess between the other
two options
40. The (3PL) logistic function
Here is the equation for the 3PL, so you
can see where the parameters are
inserted
Item i, person j
Equivalent formulations can be seen in
the literature, like moving the (1-c)
above the line
P(Xij = 1 | θj) = ci + (1 − ci) · 1 / (1 + e^(−D·ai·(θj − bi)))
41. The (3PL) logistic function
ai is the item discrimination
parameter for item i,
bi is the item difficulty or location
parameter for item i,
ci is the lower asymptote, or
pseudoguessing parameter for item i,
D is the scaling constant equal to
1.702 or 1.0.
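The 3PL equation above translates directly into code. A minimal sketch (the example parameter values are hypothetical):

```python
import math

def irf_3pl(theta: float, a: float, b: float, c: float, D: float = 1.702) -> float:
    """3PL IRF: lower asymptote c, slope set by a, location set by b."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta = b the curve sits exactly halfway between c and 1.0:
p = irf_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
print(round(p, 2))  # 0.6

# Far below b the curve flattens onto the guessing floor c:
print(round(irf_3pl(-4.0, 1.0, 0.0, 0.2), 2))  # 0.2
```

This makes the next slide's point concrete: shifting theta relative to b moves P far more than tweaking a or c does.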
42. The (3PL) logistic function
The P is due primarily to (θ − b)
The effect due to a and c is not as
strong
That is, your probability of getting
the item correct is mostly due to
whether it is easy/difficult for you
◦ This leads to the idea of adaptive testing
43. 3PL
IRT has 3 dichotomous models
I’ll now go through the models with
more detail, from 3PL down to 1PL
The 3PL is appropriate for knowledge
or ability testing, where guessing is
relevant
Each item will have an a, b, and c
parameter
44. IRT models
Three 3PL IRFs, c = 0, 0.1, 0.2,
(b = -1, 0, 1; a = 1, 1, 1)
[Figure: three 3PL IRFs; probability vs. theta, −3 to 3]
45. 2PL
The 2PL assumes that there is no
guessing (c = 0.0)
Items can still differ in discrimination
This is appropriate for attitude or
psychological type data with
dichotomous responses
◦ I like recess time at school (T/F)
◦ My favorite subject is math (T/F)
46. IRT models
Three 2PL IRFs, a = 0.75, 1.5, 0.3,
b = -1.0, 0.0, 1.0
[Figure: three 2PL IRFs; probability vs. theta, −3 to 3]
47. 1PL
The 1PL assumes that all items are of
equal discrimination
Items only differ in terms of difficulty
The raw score is now a sufficient
statistic for the IRT score
Not the case with 2PL or 3PL; it’s not
just how many items you get right,
but which ones
10 hard items vs. 10 easy items
48. 1PL
The 1PL is also appropriate for
attitude or psychological type data,
but where there is no reason to
believe items differ substantially in
terms of discrimination
This is rarely the case
Still used: see Rasch discussion later
49. 1PL
Three 1PL IRFs: b = -1, 0, 1
[Figure: three 1PL IRFs; probability vs. theta, −3 to 3]
50. How to choose?
Characteristics of the items
Check with the data! (fit)
Sample size:
◦ 1PL = 100 minimum
◦ 2PL = 300 minimum
◦ 3PL = 500 minimum
Score report considerations
(sufficient statistics)
51. The Rasch Perspective
Another argument in choice
There is a group of psychometricians
(mostly from Australia and Chicago)
who believe that the 1PL is THE model
Everything else is just noise
Data should be “cleaned” to reflect
this
52. The Rasch Perspective
How to clean? A big target is to
eliminate guessing
But how do you know?
Slumdog Millionaire Effect
53. The Rasch Perspective
This group is very strong in their
belief
Why? They believe it is “objective”
measurement
Score scale centered on items, not
people, so “person-free”
Software and journals devoted just to
the Rasch idea
54. The Rasch Perspective
Should you use it?
I was trained to never use Rasch
◦ Equal discrimination assumption is
completely unrealistic… we all know
some items are better than others
◦ We all know guessing should not be
ignored
◦ Data should probably not be doctored
◦ Instead, data should drive the model
55. The Rasch Perspective
However, while some researchers
hate the Rasch model, I don’t
◦ It is very simple
◦ It works better with tiny samples
◦ It is easier to describe
◦ Score reports and sufficient statistics
◦ Discussion points from you?
◦ Nevertheless, I recommend IRT
56. Polytomous models
Polytomous models are for items that
are not scored correct/incorrect,
yes/no, etc.
Two types:
◦ Rating scale or Likert: “Rate on a scale of
1 to 5”
◦ Partial credit – very useful in
constructed-response educational items
My experience as a scorer
57. Polytomous models
Partial credit example with rubric:
◦ Open response question to “2+3(4+5)=“
0: no answer
1: 2, 3, 4, or 5 (picks one)
2: 14 (adds all)
3: 45 (does (2+3) x (4+5) )
4: 27 (everything but add 2)
5: 29 (correct)
61. Scoring
First: throw out your idea of a
“score” as the number of items
correct
We actually want something more
accurate: the precise z-score
Because the z-scores axis is called θ
in IRT, the scoring is called θ
estimation
62. Scoring
IRT utilizes the IRFs in scoring
examinees
If an examinee gets a question right,
they “get” the item’s IRF
If they get the question wrong, they
“get” the (1-IRF)
These curves are multiplied for all
items to get a final curve called the
likelihood function
66. Scoring - MLE
The score is the point on the x-axis
where the highest likelihood is
This is the maximum likelihood
estimate
In the example, 0.0 (average ability)
This obtains precise estimates on the
scale
67. Maximum likelihood
The LF is technically defined as:
Where u is a response vector of 1s
and 0s
Note what this does to the exponents
L(uj | θj) = ∏(i=1..n) Pij^(uij) · Qij^(1 − uij)
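The scoring procedure just described can be sketched directly: multiply the curves (equivalently, sum their logs) and take the peak. The five-item test and response vector below are hypothetical:

```python
import math

def irf_3pl(theta, a, b, c, D=1.702):
    """3PL item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log of the likelihood function: each correct answer contributes
    ln(IRF), each incorrect answer contributes ln(1 - IRF)."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf_3pl(theta, a, b, c)
        total += math.log(p) if u else math.log(1.0 - p)
    return total

# Hypothetical five-item test (a, b, c) and one examinee's responses.
items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.0, 1.0, 0.2)]
responses = [1, 1, 1, 0, 0]

# Brute-force MLE: evaluate the curve on a grid and keep the peak.
grid = [i / 100.0 for i in range(-400, 401)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, items, responses))
print(round(theta_mle, 2))
```

Getting the three easiest items right and the two hardest wrong puts the peak a little above 0, as expected.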
68. Scoring - SEM
A quantification of just how precise
the score is can also be calculated,
called the standard error of
measurement (SEM)
This is assumed to be the same for
everyone in classical test theory, but
in IRT depends on the items, the
responses, and the level of θ
69. Scoring - SEM
Here’s a new LF – blue has the same
MLE but is less spread out
Both are two items, blue with a = 2
70. Scoring - SEM
The first LF had an SEM ~ 1.0
The second LF had an SEM ~ 0.5
We have more certainty about the
second person’s score
This shows how much high-quality
items aid in measurement
◦ Same items and responses, except a
higher a
71. Scoring - SEM
SEM is usually used to stop CATs
General interpretation: confidence
interval
Plus or minus 1.96 (about 2) is 95%
So if the SEM in the example is 0.5,
we are 95% sure that the student’s
true ability is somewhere between
-1.0 and +1.0
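The confidence-interval arithmetic on this slide is just theta plus or minus z times the SEM; a one-line sketch:

```python
def score_interval(theta: float, sem: float, z: float = 1.96):
    """Confidence interval around an IRT score: theta +/- z * SEM."""
    return theta - z * sem, theta + z * sem

lo, hi = score_interval(theta=0.0, sem=0.5)
print(lo, hi)  # -0.98 0.98 -- the slide rounds 1.96 up to 2, giving -1 to +1
```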
72. Scoring - SEM
If a student gives aberrant responses
(cheating, not paying attention, etc.)
they will have a larger SEM
This is not enough to accuse of
cheating (they could have just dozed
off), but it can provide useful
information for research
73. Scoring - SEM
SEM CI is also used to make decisions
◦ Pass if 2 SEMs above a cutoff
74. Details on IRT Scores
Student scores are on the θ scale,
which is analogous to the standard
normal z scale – same interpretations!
There are four methods of scoring
◦ Maximum Likelihood (MLE)
◦ Bayesian Modal (or MAP, for maximum a
posteriori)
◦ Bayesian EAP (expectation a posteriori)
◦ Weighted MLE (less common)
77. Bayesian modal
Addresses that problem by always
multiplying the LF by a bell-shaped
curve, which forces it to have a
maximum somewhere
Still find the highest point
78. Bayesian EAP
Argues that the curve is not
symmetrical, and we should not
ignore everything except the
maximum
So it takes the “average” of the
curve by splitting it into many slices
and finding the weighted average
The slices are called quadrature
points or nodes
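The slicing-and-weighted-average idea can be sketched with an evenly spaced set of nodes and an (unnormalized) standard normal prior; real software uses more careful quadrature schemes, and the three-item test here is hypothetical:

```python
import math

def irf_2pl(theta, a, b, D=1.702):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def eap_estimate(items, responses, n_nodes=81):
    """EAP scoring: evaluate prior x likelihood at quadrature nodes,
    then return the weighted average (the 'slices' on the slide)."""
    nodes = [-4.0 + 8.0 * k / (n_nodes - 1) for k in range(n_nodes)]
    posterior = []
    for t in nodes:
        prior = math.exp(-0.5 * t * t)  # standard normal shape (unnormalized)
        lik = 1.0
        for (a, b), u in zip(items, responses):
            p = irf_2pl(t, a, b)
            lik *= p if u else (1.0 - p)
        posterior.append(prior * lik)
    total = sum(posterior)
    return sum(t * w for t, w in zip(nodes, posterior)) / total

# Hypothetical three-item test; note the all-correct vector still gets
# a finite score -- the prior reins it in (no infinite-MLE problem).
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
print(round(eap_estimate(items, [1, 1, 1]), 2))
print(round(eap_estimate(items, [0, 0, 0]), 2))
```

The two scores mirror each other around zero because the items and the prior are symmetric.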
81. Bayesian
Why Bayesian?
◦ Nonmixed response vectors
◦ Asymmetric LF
Why not Bayesian?
◦ Biased inward – if you find the
θ estimates of 1000 students, the SD would
be smaller with the Bayesian estimates,
maybe 0.95
82. Newton-Raphson
Most IRT software actually uses a
somewhat different approach to MLE
and Bayesian Modal
The straightforward way is to
calculate the value of the LF at each
point in θ, within reason
For example, -4 to 4 at 0.001
That’s 8,000 calculations! Too much
for 1970s computers…
83. Newton-Raphson
Newton-Raphson is a shortcut method
that searches the curve iteratively
for its maximum
Why? Same 0.001 level of accuracy in
only 5 to 20 iterations
Across thousands of students, that is
a huge amount of calculations saved
But certain issues (local maxima or
minima)… maybe time to abandon?
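The iterative search can be sketched for the 2PL case, where the derivatives of the log-likelihood have a simple closed form (first derivative: Σ D·a·(u − P); second: −Σ D²·a²·P(1 − P)); the item parameters and responses are hypothetical:

```python
import math

def irf_2pl(theta, a, b, D=1.702):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def theta_mle_newton(items, responses, theta=0.0, tol=1e-6, max_iter=20, D=1.702):
    """Newton-Raphson search for the peak of the 2PL log-likelihood."""
    for _ in range(max_iter):
        d1 = d2 = 0.0
        for (a, b), u in zip(items, responses):
            p = irf_2pl(theta, a, b, D)
            d1 += D * a * (u - p)                 # slope of the log-likelihood
            d2 -= (D * a) ** 2 * p * (1.0 - p)    # curvature (always negative)
        step = d1 / d2
        theta -= step                             # Newton update
        if abs(step) < tol:
            break
    return theta

items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
est = theta_mle_newton(items, [1, 1, 0])  # mixed vector, so the MLE is finite
print(round(est, 3))
```

Convergence here takes only a handful of iterations instead of thousands of grid evaluations, which is the slide's point.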
86. The estimation problem
Estimating student θ given a set of
known item parameters is easy
because we have something
established
But what about the first time a test is
given?
All items are new, and there are no
established student scores
87. The estimation problem
Which came first, the chicken or the
egg?
Since we don’t know, we go back and
forth, trying one and then the other
◦ Fix “temporary” z-scores
◦ Estimate item parameters
◦ Fix the new item parameters
◦ Estimate scores
◦ Do it again until we’re satisfied
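The back-and-forth loop above can be sketched for the 1PL (Rasch) case with simulated data. This is an illustration of the joint (JML) idea using crude grid search, not how production software works (which uses MML, as the next slide notes); all data here are hypothetical:

```python
import math
import random

def p_rasch(theta, b):
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Simulated data set: 300 examinees x 5 items with known difficulties.
random.seed(1)
true_b = [-1.0, -0.5, 0.0, 0.5, 1.0]
true_theta = [random.gauss(0.0, 1.0) for _ in range(300)]
data = [[1 if random.random() < p_rasch(t, b) else 0 for b in true_b]
        for t in true_theta]

grid = [i / 20.0 for i in range(-60, 61)]  # candidate values, -3.0 to 3.0

def ll(p, u):
    return math.log(p) if u else math.log(1.0 - p)

# The chicken-and-egg loop: fix thetas, estimate b; fix b, re-estimate thetas.
theta_est = [0.0] * len(data)
b_est = [0.0] * len(true_b)
for _ in range(5):
    b_est = [max(grid, key=lambda b: sum(ll(p_rasch(t, b), row[i])
                                         for t, row in zip(theta_est, data)))
             for i in range(len(true_b))]
    theta_est = [max(grid, key=lambda t: sum(ll(p_rasch(t, b), u)
                                             for b, u in zip(b_est, row)))
                 for row in data]

print([round(b, 2) for b in b_est])  # recovers the ordering of true_b
```

Even this crude version recovers the rank order of the item difficulties, which is what the alternating scheme is supposed to do.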
88. Calibration algorithms
There are two calibration algorithms
◦ Joint maximum likelihood (JML) – older
◦ Marginal maximum likelihood (MML) –
newer, and works better with smaller
samples… the standard
◦ Also conditional maximum likelihood, but
it only works with 1PL, so rarer
◦ New in research, but not in standard
software: Markov chain Monte Carlo (MCMC)
89. Calibration algorithms
The term maximum likelihood is used
here because we are maximizing the
likelihood of the entire data set, for
all items i and persons j
X is the data set of responses xij
b is the set of item parameters bi
θ is the set of examinee θj's
90. Calibration algorithms
This means we want to find the b and
θ that make that number the largest
So we set θ, find a good b, use it to
score students and find a new θ, find
a better b, etc…
◦ Marginal ML uses marginal distributions
not exact points, hence it being faster
and working better with smaller samples
of people/items
91. Calibration algorithms
Note: rather than examine the LF
(which gets incredibly small),
software examines -2*ln(LF)
IRT software tracks these iterations
because they provide information on
model fit
See output
93. Checking fit
One assumption of IRT (#2) is that our
data even follows the idea of IRT!
This is true at both the item and the
test level
Also true about examinees: they
should be getting items wrong that are
above their θ and getting items
correct that are below their θ
94. Model-data fit
Whenever fitting any mathematical
model to empirical data (not just IRT),
it is important to assess fit
Fit refers to whether the model
adequately represents the data
Alternatively, if the data is far away
from the model
95. Model-data fit
There are two types of fit important
in IRT
◦ Item (and test) - compares observed data
to the IRF
◦ Person – evaluates whether individual
students are responding according to the
model
Easy items correct, hard items incorrect
99. Model-data fit
Note that if we drew an IRF in each
of those graphs, it would be about
the same
But it is obviously less appropriate in
Graph #3 (“even worse”)
Fit analyses provide a way of
quantifying this
100. Item fit
Most basic approach is to subtract
observed frequency correct from the
expected value for each slice (g) of θ
This is then summarized in a chi-
square statistic
Bigger = bad fit
103. Item fit
The slices are called quadrature points
Also used for item parameter
estimation
The number of slices for chi-square
need not be the same as for
estimation, but it helps interpretation
104. Item fit
Chi-square is oversensitive to sample
size
A better way is to compute
standardized residuals
Divide a chi-square by its df = G-m
where m is the number of item
parameters
This is more interpretable because of
the well-known scale
0 is OK, examine items > 2
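The slice-and-compare idea behind the chi-square statistic can be sketched as follows; the grouping scheme is deliberately crude, the simulated data are hypothetical, and real software (e.g., Xcalibre) handles the quadrature details more carefully:

```python
import math
import random

def irf_2pl(theta, a, b, D=1.702):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def item_chi_square(thetas, responses, a, b, n_groups=5):
    """Chi-square item fit: slice examinees into theta groups, then compare
    observed proportion correct per slice with the IRF at the slice mean."""
    paired = sorted(zip(thetas, responses))
    size = len(paired) // n_groups
    chi_sq = 0.0
    for g in range(n_groups):
        chunk = paired[g * size:] if g == n_groups - 1 else paired[g * size:(g + 1) * size]
        n = len(chunk)
        obs = sum(u for _, u in chunk) / n
        exp_p = irf_2pl(sum(t for t, _ in chunk) / n, a, b)
        chi_sq += n * (obs - exp_p) ** 2 / (exp_p * (1.0 - exp_p))
    return chi_sq

# Simulate 500 examinees answering one item that truly follows the model.
random.seed(7)
thetas = [random.gauss(0.0, 1.0) for _ in range(500)]
responses = [1 if random.random() < irf_2pl(t, 1.0, 0.0) else 0 for t in thetas]

fit = item_chi_square(thetas, responses, 1.0, 0.0)        # IRF matches the data
misfit = item_chi_square(thetas, responses, 1.0, 1.5)     # wrong b: bigger = bad
print(fit < misfit)  # True
```

Comparing the same responses against a deliberately wrong IRF shows the "bigger = bad fit" behavior directly.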
105. Item fit
For broad analysis of fit, use quantile
plots (Xcalibre, Iteman, or Lertap)
◦ 3 to 7 groups
◦ Can find hidden issues (My example:
social desirability in Likert #2)
See Xcalibre output
◦ Fit statistics
◦ Fit graphs (many more groups, and IRF)
106. Person fit
Is an examinee responding oddly?
Most basic measure: take the log of
the LF at the max (θ estimate)
A higher number means we are more
sure of the estimate
But this is dependent on the level of
θ, so we need it standardized: lz
l0 = Σ(i=1..n) ln[ Pi(θ̂)^(ui) · Qi(θ̂)^(1 − ui) ]
107. Person fit
lz is like a z-score for fit: z = (x-μ)/s
Less than -2 means bad fit
lz = (l0 − E(l0)) / √Var(l0)
E(l0) = Σ(i=1..n) [ Pi(θ̂)·ln Pi(θ̂) + (1 − Pi(θ̂))·ln(1 − Pi(θ̂)) ]
Var(l0) = Σ(i=1..n) Pi(θ̂)·(1 − Pi(θ̂))·[ ln( Pi(θ̂) / (1 − Pi(θ̂)) ) ]²
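The lz statistic is straightforward to compute from the IRF values at the estimated θ. A sketch for the 2PL, with a hypothetical five-item test contrasting a well-fitting response pattern against a Guttman-reversed (aberrant) one:

```python
import math

def irf_2pl(theta, a, b, D=1.702):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def lz_statistic(theta_hat, items, responses):
    """Standardized person fit: (observed log-likelihood minus its
    expectation) divided by its standard deviation."""
    l0 = e = v = 0.0
    for (a, b), u in zip(items, responses):
        p = irf_2pl(theta_hat, a, b)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)   # observed
        e += p * math.log(p) + q * math.log(q)          # expected
        v += p * q * (math.log(p / q)) ** 2             # variance
    return (l0 - e) / math.sqrt(v)

items = [(1.0, bb) for bb in (-2.0, -1.0, 0.0, 1.0, 2.0)]
expected = [1, 1, 1, 0, 0]   # easy items right, hard items wrong: good fit
aberrant = [0, 0, 1, 1, 1]   # easy items wrong, hard items right: bad fit

print(round(lz_statistic(0.0, items, expected), 2))  # 0.61
print(round(lz_statistic(0.0, items, aberrant), 2))  # -7.79
```

The aberrant examinee lands far below the −2 flag from the previous slide, even though both patterns have the same raw score.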
108. Person fit
lz is sensitive to the distribution of
item difficulties
Works best when there is a range of
difficulty
That is, if there are no items for
high-ability examinees, none of them
will have a good estimate!
Best to evaluate groups, not
individuals
109. How is fit useful?
Throw out items?
Throw out people?
Change model used?
Bad fit can flag other possible issues
◦ Speededness: fit (and N) gets worse at
end of test
◦ Multidimensionality: certain areas
110. How is fit useful?
Note that this fits in with the
estimation process
IRT calibration is not “one-click”
Review results, then make
adjustments
◦ Remove items/people
◦ Modify par distributions
◦ Modify quadrature points
◦ Etc.
111. Summary
That was a basic intro to the
rationale of IRT
Now start talking about some
applications and uses
Also examine IRT software and output