Day 1 AM: An Introduction to
Item Response Theory
Nathan A. Thompson
Vice President, Assessment Systems Corporation
Adjunct faculty, University of Cincinnati
 Thank you for attending!
 Introductions and important info now
 Software… download or USB
 Please ask questions
◦ Also, slow me down or ask for translation!
 Goal: provide an intro on IRT/CAT to
those who are new
◦ For those with some experience, to
provide new viewpoints and more
Where I’m from, professionally
 PhD, University of Minnesota
◦ CAT for classifications
 Test development manager for
ophthalmology certifications
 Psychometrician at Prometric (many
 VP at ASC
Where I’m from, geographically
Except now things look like…
We do odd things in winter
Introduce yourselves
 Name
 Employer/organization
 Types of tests you do and/or why you
are interested in IRT/CAT
 (There might be someone with similar
interests here)
Another announcement
 Newly formed: International
Association for Computerized Adaptive
Testing (IACAT)
◦ Free membership
◦ Growing resources
◦ Next conference: August 2012, Sydney
 This workshop is on two highly related
topics: IRT and CAT
 IRT is the modern paradigm for
developing, analyzing, scoring, and
linking tests
 CAT is a next-generation method of
delivering tests
 CAT is not feasible without IRT, so we
do IRT first
IRT – where are we going?
 IRT, as many of you know, provides a
way of analyzing items
 However, it has drawbacks (no
distractor analysis), so the main
reasons to use IRT are at the test level
 It solves certain issues with classical
test theory (CTT)
 But the two should always be used together
IRT – where are we going?
 Advantages
◦ Better error characterization
◦ More precise scores
◦ Better linking
◦ Model-based
◦ Items and people on same scale (CAT)
◦ Sample-independence
◦ Powerful test assembly
IRT – where are we going?
 Keyword: paradigm or approach
◦ Not just another statistical analysis
◦ It is a different way of thinking about how
tests should work, and how we can
approach specific problems (scaling,
equating, test assembly) from that viewpoint
Day 1
 There will be four parts this morning,
covering the theory behind IRT:
◦ Rationale: A graphical introduction to IRT
◦ Models (dichotomous and polytomous) and
their response functions
◦ IRT scoring (θ estimation)
◦ Item parameter estimation and model fit
Part 1
A graphical introduction to IRT
What is IRT?
 Basic Assumptions
1. Unidimensionality
 A unidimensional latent trait (1 at a time)
 Item responses are independent of each
other (local independence), except for the
trait/ability that they measure
2. A specific form of the relationship
between trait level and probability of a response
 The response function, or IRT model
 There are a growing number of models
What is IRT?
 A theory of mathematical functions that
model the responses of examinees to
test items/questions
 These functions are item response
functions (IRFs)
 Historically, it has also been known as
latent trait theory and item
characteristic curve theory
 The IRFs are best described by showing
how the concept is derived from classical item statistics
Classical item statistics
 CTT statistics are typically calculated
for each option
Option     N     Prop    Rpbis     Mean
1        307    0.860    0.221   91.876
2         25    0.070   -0.142   85.600
3         14    0.039   -0.137   83.929
4         11    0.031   -0.081   86.273
Classical item statistics
 The proportions are often translated
to a figure like this, where examinees
are split into groups
Classical item statistics
 The general idea of IRT is to split the
previous graph up into more groups,
and then find a mathematical model
for the blue line
 This is what makes the item response
function (IRF)
Classical item statistics
 Example with 10 groups
The item response function
 Reflects the probability of a given
response as a function of the latent
trait (z-score)
 Example:
 For dichotomously scored items, it is
the probability of a correct or keyed response
 Also called Item Characteristic Curve (ICC) or
Trace Line
 Only one curve (correct response), and all
other responses are grouped as (1-IRF)
 For polytomous items (partial credit,
etc.), it is the probability of each response
 How do we know exactly what the
IRF for an item is?
 We estimate parameters for an
equation that draws the curve
 For dichotomous IRT, there are three
relevant parameters: a, b, and c
 a: The discrimination parameter;
represents how well the item
differentiates examinees; slope of the
curve at its center
 b: The difficulty parameter; represents
how easy or hard the item is with
respect to examinees; location of the
curve (left to right)
 c: The pseudoguessing parameter;
represents the ‘base probability’ of
answering the question; lower asymptote
 a=1, b=0, c=0.25
The IRF…
 is the “basic building block” of IRT
 will differ from item to item
 can be one of several different
models (now)
 can be used to evaluate items (now)
 is used for IRT scoring (next)
 leads to “information” used for test
design (after that)
 is the basis of CAT (tomorrow)
Part 2
IRT models
IRT models
 Several families of models
◦ Dichotomous
◦ Polytomous
◦ Multidimensional
◦ Facets (scenarios vs raters)
◦ Mixed (additional parameters)
◦ Cognitive diagnostic
◦ We will focus on first two
Dichotomous IRT models
 There are 3 main models in use, as
mentioned earlier: 1PL, 2PL, 3PL
 The “L” refers to “logistic”: which is
the type of equation
 IRT was originally developed decades
ago with a cumulative normal curve
 This meant that calculus (integration)
had to be used
The logistic function
 An approximation was developed: the
logistic curve
 No calculus needed
 There are two formats based on D
 If D = 1.702, the logistic curve differs from
the cumulative normal curve by less than 0.01
 If D = 1.0, there is a little more difference;
this is called the true logistic form
 Does not really matter, as long as you
are consistent
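The claim about D can be checked numerically. The sketch below is illustrative only: the function names and the grid from -4 to 4 are assumptions, not something from the slides. It compares the logistic curve to the cumulative normal curve and reports the largest gap for each choice of D.

```python
import math

def normal_ogive(z):
    """Cumulative standard normal curve, the original IRT response function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z, D):
    """Logistic curve with scaling constant D."""
    return 1.0 / (1.0 + math.exp(-D * z))

def max_gap(D, lo=-4.0, hi=4.0, step=0.001):
    """Largest absolute gap between the two curves over a grid of z values."""
    n_steps = int(round((hi - lo) / step))
    grid = (lo + k * step for k in range(n_steps + 1))
    return max(abs(logistic(z, D) - normal_ogive(z)) for z in grid)

gap_scaled = max_gap(1.702)  # logistic rescaled to mimic the normal curve
gap_true = max_gap(1.0)      # "true" logistic form
print(f"D = 1.702: largest gap = {gap_scaled:.4f}")
print(f"D = 1.0:   largest gap = {gap_true:.4f}")
```

Either format works for calibration and scoring; the a parameters simply come out on a different scale, which is why consistency matters.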
The logistic function
 The basic form of the curve
Item parameters
 We add parameters to slightly modify
the shape to get it to match our data
 For example, a 4-option multiple
choice item has a 25% chance of
being guessed correctly
 So we add a c parameter as a lower
asymptote, which means that the
curve is “squished” so it never goes
below 0.25 (next)
Item parameters
 Sample IRF to show c
Item parameters
 We can also add a parameter (a) that
modifies the slope
 And a b parameter that slides the
entire curve left or right
◦ Tells us the person z-score for which the
item is appropriate
 Items can be evaluated based on these
just like with CTT statistics
 A little more next…
Item parameters: a
 The a parameter ranges from 0.0 to
about 2.0 in practice (theoretically to
infinity)
 Higher means better discriminating
 For achievement testing, 0.7 or 0.8 is
good; for aptitude testing, values run higher
 Helps you: Remove items with a<0.4?
Identify a>1.0 as great items?
Item parameters: b
 For what person z-score is the item
appropriate? (non-Rasch)
 Should be between -3 and 3
◦ 99.7% of students are in that range
 0.0 is average person
 1.0 is difficult (84th percentile)
 -1.0 is easy (16th percentile)
 2.0 is super difficult (98%)
 -2.0 is super easy (2%)
Item parameters: b
 If item difficulties are normally
distributed, where does this fall?
 0.0 is average item (NOT PERSON)
Item parameters: c
 The c parameter should be about
1/k, where k is the number of options
 If higher, this indicates that some options
are not attractive
 For example, suppose c = 0.5
 This means there is a 50/50 chance of guessing correctly
 That implies that even the lowest
students are able to ignore two
options and guess between the other
two options
Item parameters
 Extreme example:
◦ What is 23+25?
 A. 48
 B. 47
 C. 3.14159…
 D. 1,256,457
The (3PL) logistic function
 Here is the equation for the 3PL, so you
can see where the parameters are
 Item i, person j
 Equivalent formulations can be seen in
the literature, like moving the (1-c)
above the line
$$P(X_i = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{e^{D a_i (\theta_j - b_i)}}{1 + e^{D a_i (\theta_j - b_i)}}$$
The (3PL) logistic function
 a_i is the item discrimination
parameter for item i,
 b_i is the item difficulty or location
parameter for item i,
 c_i is the lower asymptote, or
pseudoguessing parameter for item i,
 D is the scaling constant equal to
1.702 or 1.0.
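A minimal sketch of the 3PL response function with the parameters just listed. The function name `irf_3pl` is an assumption for illustration, and the logistic term is written in the algebraically equivalent form 1 / (1 + e^(-x)). It tabulates the example item used earlier (a = 1, b = 0, c = 0.25).

```python
import math

def irf_3pl(theta, a, b, c, D=1.702):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Example item from the earlier slide: a = 1, b = 0, c = 0.25
for theta in (-3, -2, -1, 0, 1, 2, 3):
    p = irf_3pl(theta, a=1.0, b=0.0, c=0.25)
    print(f"theta = {theta:+d}  P(correct) = {p:.3f}")
```

At theta = b the curve sits halfway between c and 1, and far to the left it flattens out at c. Setting c = 0 gives the 2PL.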
The (3PL) logistic function
 The P is due primarily to (θ - b)
 The effect due to a and c is not as strong
 That is, your probability of getting
the item correct is mostly due to
whether it is easy/difficult for you
◦ This leads to the idea of adaptive testing
 IRT has 3 dichotomous models
 I’ll now go through the models with
more detail, from 3PL down to 1PL
 The 3PL is appropriate for knowledge
or ability testing, where guessing is possible
 Each item will have an a, b, and c
IRT models
 Three 3PL IRFs: c = 0, 0.1, 0.2
(b = -1, 0, 1; a = 1, 1, 1), plotted for θ from -3 to 3
 The 2PL assumes that there is no
guessing (c = 0.0)
 Items can still differ in discrimination
 This is appropriate for attitude or
psychological type data with
dichotomous responses
◦ I like recess time at school (T/F)
◦ My favorite subject is math (T/F)
IRT models
 Three 2PL IRFs: a = 0.75, 1.5, 0.3;
b = -1.0, 0.0, 1.0, plotted for θ from -3 to 3
 The 1PL assumes that all items are of
equal discrimination
 Items only differ in terms of difficulty
 The raw score is now a sufficient
statistic for the IRT score
 Not the case with 2PL or 3PL; it’s not
just how many items you get right,
but which ones
 10 hard items vs. 10 easy items
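The sufficient-statistic point can be illustrated with a small sketch. Everything here is hypothetical: four made-up items, D = 1, and a simple grid search for the maximum likelihood score (scoring itself is covered in Part 3). Two examinees both get 2 of 4 items right, but on different items.

```python
import math

def irf_2pl(theta, a, b):
    """2PL response function with D = 1 (the 1PL is the special case of equal a)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, a_params, b_params):
    total = 0.0
    for u, a, b in zip(responses, a_params, b_params):
        p = irf_2pl(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def grid_mle(responses, a_params, b_params, lo=-4.0, hi=4.0, step=0.01):
    """Theta on the grid with the highest (log) likelihood."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n_steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, a_params, b_params))

b_params = [-1.5, -0.5, 0.5, 1.5]   # hypothetical difficulties
pattern_a = [1, 1, 0, 0]            # raw score 2: the two easy items correct
pattern_b = [0, 0, 1, 1]            # raw score 2: the two hard items correct

equal_a = [1.0, 1.0, 1.0, 1.0]      # 1PL: equal discrimination
unequal_a = [0.5, 0.5, 2.0, 2.0]    # 2PL: hypothetical unequal discriminations

theta_1pl_a = grid_mle(pattern_a, equal_a, b_params)
theta_1pl_b = grid_mle(pattern_b, equal_a, b_params)
theta_2pl_a = grid_mle(pattern_a, unequal_a, b_params)
theta_2pl_b = grid_mle(pattern_b, unequal_a, b_params)

print(f"1PL: pattern A = {theta_1pl_a:.2f}, pattern B = {theta_1pl_b:.2f}")
print(f"2PL: pattern A = {theta_2pl_a:.2f}, pattern B = {theta_2pl_b:.2f}")
```

Under the 1PL both patterns land on the same score, because only the raw score matters. Under the 2PL the examinee who answered the more discriminating items correctly gets the higher score.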
 The 1PL is also appropriate for
attitude or psychological type data,
but where there is no reason to
believe items differ substantially in
terms of discrimination
 This is rarely the case
 Still used: see Rasch discussion later
 Three 1PL IRFs: b = -1, 0, 1, plotted for θ from -3 to 3
How to choose?
 Characteristics of the items
 Check with the data! (fit)
 Sample size:
◦ 1PL = 100 minimum
◦ 2PL = 300 minimum
◦ 3PL = 500 minimum
 Score report considerations
(sufficient statistics)
The Rasch Perspective
 Another argument in choice
 There is a group of psychometricians
(mostly from Australia and Chicago)
who believe that the 1PL is THE model
 Everything else is just noise
 Data should be “cleaned” to reflect
The Rasch Perspective
 How to clean? A big target is to
eliminate guessing
 But how do you know?
 Slumdog Millionaire Effect
The Rasch Perspective
 This group is very strong in their beliefs
 Why? They believe it is “objective”
 Score scale centered on items, not
people, so “person-free”
 Software and journals devoted just to
the Rasch idea
The Rasch Perspective
 Should you use it?
 I was trained to never use Rasch
◦ Equal discrimination assumption is
completely unrealistic… we all know
some items are better than others
◦ We all know guessing should not be ignored
◦ Data should probably not be doctored
◦ Instead, data should drive the model
The Rasch Perspective
 However, while some researchers
hate the Rasch model, I don’t
◦ It is very simple
◦ It works better with tiny samples
◦ It is easier to describe
◦ Score reports and sufficient statistics
◦ Discussion points from you?
◦ Nevertheless, I recommend IRT
Polytomous models
 Polytomous models are for items that
are not scored correct/incorrect,
yes/no, etc.
 Two types:
◦ Rating scale or Likert: “Rate on a scale of
1 to 5”
◦ Partial credit – very useful in
constructed-response educational items
 My experience as a scorer
Polytomous models
 Partial credit example with rubric:
◦ Open response question to “2+3(4+5)=”
 0: no answer
 1: 2, 3, 4, or 5 (picks one)
 2: 14 (adds all)
 3: 45 (does (2+3) x (4+5) )
 4: 27 (everything but add 2)
 5: 29 (correct)
 Polytomous example (CRFs):
Comparison table
Model   Item Disc.        Step        Step        Option Disc.
RSM     Fixed             Fixed       Fixed       Fixed
PCM     Fixed             Variable    Variable    Fixed
GRSM    Variable          Fixed       Fixed       Fixed
GRM     Variable          Variable    Fixed       Fixed
GPCM    Variable          Variable    Variable    Fixed
NRM     Variable (each    Variable    Variable    Variable
Fixed/Variable between items… more later, if time
Part 3
Ability (θ) estimation
(IRT Scoring)
 First: throw out your idea of a
“score” as the number of items correct
 We actually want something more
accurate: the precise z-score
 Because the z-score axis is called θ
in IRT, the scoring is called θ estimation
 IRT utilizes the IRFs in scoring
 If an examinee gets a question right,
they “get” the item’s IRF
 If they get the question wrong, they
“get” the (1-IRF)
 These curves are multiplied for all
items to get a final curve called the
likelihood function
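The multiplication step can be sketched in a few lines. The items and responses below are hypothetical (two identical items with a = 1, b = 0, c = 0, one answered correctly and one incorrectly), and the function names are assumptions for illustration.

```python
import math

def irf(theta, a, b, c=0.0, D=1.702):
    """3PL item response function (c = 0 gives the 2PL)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, responses, items):
    """Multiply the IRF (correct) or 1 - IRF (incorrect) across all items."""
    value = 1.0
    for u, (a, b, c) in zip(responses, items):
        p = irf(theta, a, b, c)
        value *= p if u == 1 else (1.0 - p)
    return value

# Hypothetical example: two items with a = 1, b = 0, c = 0; one right, one wrong
items = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
responses = [1, 0]
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}  likelihood = {likelihood(theta, responses, items):.4f}")
```

With one item right and one wrong on identical items, the resulting curve is symmetric and peaks at θ = 0.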
 Here’s an example IRF; a =1, b=0, c = 0
 A “1-IRF”
 We multiply those to get a curve like this
Scoring - MLE
 The score is the point on the x-axis
where the likelihood is highest
 This is the maximum likelihood estimate
 In the example, 0.0 (average ability)
 This obtains precise estimates on the
θ scale
Maximum likelihood
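 The LF is technically defined as:
$$L(\mathbf{u}_j \mid \theta_j) = \prod_{i=1}^{n} P_{ij}^{u_{ij}} \, Q_{ij}^{1 - u_{ij}}$$
 Where u is a response vector of 1s
and 0s, P is the IRF value, and Q = 1 - P
 Note what this does to the exponents: a
correct response (u = 1) keeps P and turns Q
into Q^0 = 1; an incorrect response (u = 0)
keeps only Q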
Scoring - SEM
 A quantification of just how precise
the score is can also be calculated, called the
standard error of measurement
 This is assumed to be the same for
everyone in classical test theory, but
in IRT depends on the items and the
responses, and the level of θ
Scoring - SEM
 Here’s a new LF – blue has the same
MLE but is less spread out
 Both are two items, blue with a = 2
Scoring - SEM
 The first LF had an SEM ~ 1.0
 The second LF had an SEM ~ 0.5
 We have more certainty about the
second person’s score
 This shows how much high-quality
items aid in measurement
◦ Same items and responses, except a
higher a
Scoring - SEM
 SEM is usually used to stop CATs
 General interpretation: confidence interval
 Plus or minus 1.96 (about 2) SEMs is 95%
 So if the SEM in the example is 0.5
(with an estimate of 0.0), we are 95% sure
that the student’s true ability is somewhere
between -1.0 and +1.0
Scoring - SEM
 If a student gives aberrant responses
(cheating, not paying attention, etc.)
they will have a larger SEM
 This is not enough to accuse them of
cheating (they could have just dozed
off), but it can provide useful
information for research
Scoring - SEM
 SEM CI is also used to make decisions
◦ Pass if 2 SEMs above a cutoff
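The interval and the pass rule are simple arithmetic, sketched below. The function names, the "fail" rule (2 SEMs below the cutoff), and the "undetermined" outcome are assumptions added for illustration; the slides only state the pass rule.

```python
def confidence_interval(theta_hat, sem, z=1.96):
    """Interval of plus or minus z SEMs around the theta estimate."""
    return theta_hat - z * sem, theta_hat + z * sem

def classify(theta_hat, sem, cutoff, n_sems=2.0):
    """Pass only if the estimate is n_sems SEMs above the cutoff.

    The mirror-image fail rule and the 'undetermined' label are assumptions.
    """
    if theta_hat - n_sems * sem > cutoff:
        return "pass"
    if theta_hat + n_sems * sem < cutoff:
        return "fail"
    return "undetermined"

low, high = confidence_interval(theta_hat=0.0, sem=0.5)
print(f"95% interval: {low:.2f} to {high:.2f}")
print(classify(theta_hat=1.2, sem=0.5, cutoff=0.0))
print(classify(theta_hat=0.6, sem=0.5, cutoff=0.0))
```

The exact interval for an estimate of 0.0 with SEM 0.5 is -0.98 to +0.98, which the slide rounds to -1.0 and +1.0.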
Details on IRT Scores
 Student scores are on the θ scale,
which is analogous to the standard
normal z scale – same interpretations!
 There are four methods of scoring
◦ Maximum Likelihood (MLE)
◦ Bayesian Modal (or MAP, for maximum a
posteriori)
◦ Bayesian EAP (expected a posteriori)
◦ Weighted MLE (less common)
Maximum likelihood
 Take the likelihood function “as is”
and find the highest point
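A minimal sketch of finding the highest point by brute force. The three items are hypothetical, the function names are assumptions, and the search works on the log of the likelihood, which peaks at the same θ as the likelihood itself.

```python
import math

def irf(theta, a, b, c=0.0, D=1.702):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = irf(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def mle_grid(responses, items, lo=-4.0, hi=4.0, step=0.001):
    """Evaluate the log-likelihood at every grid point and keep the best one."""
    n_steps = int(round((hi - lo) / step))
    best_theta, best_ll = lo, -math.inf
    for k in range(n_steps + 1):
        theta = lo + k * step
        ll = log_likelihood(theta, responses, items)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Hypothetical items as (a, b, c)
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_mixed = mle_grid([1, 1, 0], items)
theta_all_correct = mle_grid([1, 1, 1], items)
print(f"mixed responses: theta = {theta_mixed:.3f}")
print(f"all correct:     theta = {theta_all_correct:.3f}")
```

The all-correct vector simply runs to the edge of the grid, because its likelihood never turns back down.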
Maximum likelihood
 Problem: with all incorrect or all correct
responses, the LF keeps rising toward one end
of the scale and has no maximum
Bayesian modal
 Addresses that problem by always
multiplying the LF by a bell-shaped
curve, which forces it to have a
maximum somewhere
 Still find the highest point
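A sketch of the Bayesian modal idea, assuming a standard normal curve as the bell-shaped prior and three hypothetical items. Working in logs, multiplying the LF by the prior becomes adding the log of the prior; the function names are illustrative.

```python
import math

def irf(theta, a, b, c=0.0, D=1.702):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = irf(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def log_prior(theta, mean=0.0, sd=1.0):
    """Log of a normal curve, dropping the constant that does not affect the peak."""
    return -0.5 * ((theta - mean) / sd) ** 2

def map_grid(responses, items, lo=-4.0, hi=4.0, step=0.001):
    """Highest point of likelihood times prior, found on a grid."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n_steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items) + log_prior(t))

# Hypothetical items as (a, b, c)
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_map_all_correct = map_grid([1, 1, 1], items)
theta_map_mixed = map_grid([1, 1, 0], items)
print(f"all correct: MAP theta = {theta_map_all_correct:.3f}")
print(f"mixed:       MAP theta = {theta_map_mixed:.3f}")
```

The all-correct vector now gets a finite score, at the cost of pulling every estimate somewhat toward the prior mean of 0.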
Bayesian EAP
 Argues that the curve is not
symmetrical, and we should not
ignore everything except the maximum
 So it takes the “average” of the
curve by splitting it into many slices
and finding the weighted average
 The slices are called quadrature
points or nodes
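The slicing can be sketched as follows, assuming a standard normal prior, 81 equally spaced quadrature points from -4 to 4, and hypothetical items; none of these settings come from the slides. The posterior standard deviation is returned as well, as one possible measure of precision.

```python
import math

def irf(theta, a, b, c=0.0, D=1.702):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, responses, items):
    value = 1.0
    for u, (a, b, c) in zip(responses, items):
        p = irf(theta, a, b, c)
        value *= p if u == 1 else (1.0 - p)
    return value

def eap(responses, items, lo=-4.0, hi=4.0, n_nodes=81):
    """Weighted average of the quadrature points, weighted by likelihood x prior."""
    step = (hi - lo) / (n_nodes - 1)
    nodes = [lo + k * step for k in range(n_nodes)]
    weights = [likelihood(t, responses, items) * math.exp(-0.5 * t * t) for t in nodes]
    total = sum(weights)
    estimate = sum(t * w for t, w in zip(nodes, weights)) / total
    variance = sum((t - estimate) ** 2 * w for t, w in zip(nodes, weights)) / total
    return estimate, math.sqrt(variance)

# Hypothetical 3PL items as (a, b, c); guessing makes the curve asymmetric
items = [(1.0, -1.0, 0.25), (1.0, 0.0, 0.25), (1.0, 1.0, 0.25)]
theta_eap, posterior_sd = eap([1, 0, 0], items)
print(f"EAP theta = {theta_eap:.3f}, posterior SD = {posterior_sd:.3f}")
```

Because every slice contributes, a long tail on one side of the curve shifts the estimate, which a highest-point method would not register.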
Bayesian EAP
 Example: see 3PL tail
Bayesian EAP
 Simple EAP overlay: θ ~ -0.50
 Why Bayesian?
◦ Nonmixed response vectors
◦ Asymmetric LF
 Why not Bayesian?
◦ Biased inward – if you find the θ
estimates of 1000 students, the SD would
be smaller with the Bayesian estimates,
maybe 0.95
 Most IRT software actually uses a
somewhat different approach to MLE
and Bayesian Modal
 The straightforward way is to
calculate the value of the LF at each
point in θ, within reason
 For example, -4 to 4 in steps of 0.001
 That’s 8,000 calculations! Too much
for 1970s computers…
 Newton-Raphson is a shortcut method
that searches the curve iteratively
for its maximum
 Why? Same 0.001 level of accuracy in
only 5 to 20 iterations
 Across thousands of students, that is
a huge amount of calculations saved
 But certain issues (local maxima or
minima)… maybe time to abandon?
 See IRT Scoring and Graphing Tool
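A sketch of the Newton-Raphson search, under stated assumptions: it uses the 2PL (c = 0) so the derivatives stay simple, starts at θ = 0, and stops when the step is smaller than 0.001. The items are hypothetical, and this is not a description of what any particular IRT program does.

```python
import math

D = 1.702  # scaling constant

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def newton_raphson_mle(responses, items, start=0.0, tol=0.001, max_iter=50):
    """Iteratively move theta toward the peak of the log-likelihood."""
    theta = start
    for iteration in range(1, max_iter + 1):
        first = 0.0   # first derivative (slope) of the log-likelihood
        second = 0.0  # second derivative (curvature), always negative here
        for u, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            first += D * a * (u - p)
            second -= (D * a) ** 2 * p * (1.0 - p)
        step = first / second
        theta -= step
        if abs(step) < tol:
            return theta, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical items as (a, b)
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
theta_hat, n_iter = newton_raphson_mle([1, 1, 0], items)
print(f"theta = {theta_hat:.3f} after {n_iter} iterations")
```

With an all-correct or all-incorrect vector the slope never reaches zero, so this search would not settle; that is one of the issues a grid search sidesteps.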
Part 4
Item parameter estimation
How do we get a, b, and c?
The estimation problem
 Estimating student θ given a set of
known item parameters is easy
because we have something
 But what about the first time a test is given?
 All items are new, and there are no
established student scores
The estimation problem
 Which came first, the chicken or the egg?
 Since we don’t know, we go back and
forth, trying one and then the other
◦ Fix “temporary” z-scores
◦ Estimate item parameters
◦ Fix the new item parameters
◦ Estimate scores
◦ Do it again until we’re satisfied
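The back-and-forth can be sketched for the simplest case. The assumptions are loud ones: a Rasch/1PL model with D = 1, a tiny made-up response matrix with no perfect or zero scores, one Newton step per parameter per cycle, and item difficulties centered at 0 to fix the scale. It illustrates the alternation only and is not how production software is implemented.

```python
import math

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def alternate_estimates(X, n_cycles=50):
    """Alternate between item difficulties (b) and person scores (theta)."""
    n_persons, n_items = len(X), len(X[0])
    theta = [0.0] * n_persons   # "temporary" scores to start from
    b = [0.0] * n_items
    for _ in range(n_cycles):
        # Hold theta fixed, update each item difficulty
        for i in range(n_items):
            probs = [p_rasch(t, b[i]) for t in theta]
            slope = sum(p - X[j][i] for j, p in enumerate(probs))
            info = sum(p * (1.0 - p) for p in probs)
            b[i] += slope / info
        mean_b = sum(b) / n_items
        b = [value - mean_b for value in b]   # center difficulties at 0
        # Hold the new item parameters fixed, update each person score
        for j in range(n_persons):
            probs = [p_rasch(theta[j], b_i) for b_i in b]
            slope = sum(X[j][i] - probs[i] for i in range(n_items))
            info = sum(p * (1.0 - p) for p in probs)
            theta[j] += slope / info
    return theta, b

# Hypothetical data: 6 examinees (rows) by 4 items (columns)
X = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
theta, b = alternate_estimates(X)
print("item difficulties:", [round(v, 2) for v in b])
print("person scores:    ", [round(v, 2) for v in theta])
```

Items answered correctly by more examinees come out easier, and examinees with the same raw score get the same θ, as expected under the 1PL.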
Calibration algorithms
 There are two main calibration algorithms
◦ Joint maximum likelihood (JML) – older
◦ Marginal maximum likelihood (MML) –
newer, and works better with smaller
samples… the standard
◦ Also conditional maximum likelihood, but
it only works with 1PL, so rarer
◦ New in research, but not in standard
software: Markov chain Monte Carlo (MCMC)
Calibration algorithms
 The term maximum likelihood is used
here because we are maximizing the
likelihood of the entire data set, for
all items i and persons j
$$L(\mathbf{X} \mid \mathbf{b}, \boldsymbol{\theta}) = \prod_{j} \prod_{i} P_{ij}^{x_{ij}} \, Q_{ij}^{1 - x_{ij}}$$
 X is the data set of responses x_ij
 b is the set of item parameters b_i
 θ is the set of examinee θ_j values
Calibration algorithms
 This means we want to find the b and
θ that make that number the largest
 So we set θ, find a good b, use it to
score students and find a new θ, find
a better b, etc…
◦ Marginal ML uses marginal distributions
not exact points, hence it being faster
and working better with smaller samples
of people/items
Calibration algorithms
 Note: rather than examine the LF
(which gets incredibly small),
software examines -2*ln(LF)
 IRT software tracks these iterations
because they provide information on
model fit
 See output
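Why the log is used can be shown with a small sketch. The data are hypothetical: 2,000 observed responses, each with a model probability of 0.5. The function name is an assumption.

```python
import math

def neg2_log_likelihood(probabilities):
    """-2 * ln(LF), given the model probability of each observed response."""
    return -2.0 * sum(math.log(p) for p in probabilities)

# Hypothetical data set: 2,000 responses, each with model probability 0.5
probs = [0.5] * 2000
raw_likelihood = math.prod(probs)          # product of 2,000 small numbers
neg2ll = neg2_log_likelihood(probs)        # sum of logs stays well-behaved

print(f"raw likelihood: {raw_likelihood}")
print(f"-2 ln(LF):      {neg2ll:.2f}")
```

The raw product is too small for the computer to represent, while the log version stays an ordinary number that can be compared from one iteration to the next (smaller is better).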
Part 4 (cont.)
Assumptions of IRT: Model-data fit
Checking fit
 One assumption of IRT (#2) is that our
data even follows the idea of IRT!
 This is true at both the item and the
test level
 Also true about examinees: they
should be getting items wrong that are
above their θ and getting items
correct that are below their θ
Model-data fit
 Whenever fitting any mathematical
model to empirical data (not just IRT),
it is important to assess fit
 Fit refers to whether the model
adequately represents the data
 Put the other way: is the data far away
from the model?
Model-data fit
 There are two types of fit important
in IRT
◦ Item (and test) - compares observed data
to the IRF
◦ Person – evaluates whether individual
students are responding according to the model
 Easy items correct, hard items incorrect
Model-data fit
 Remember the 10-group empirical IRF
that I drew? This is great!
Model-data fit
 You’re more likely to see something
like this:
Model-data fit
 Or even worse…
Model-data fit
 Note that if we drew an IRF in each
of those graphs, it would be about
the same
 But it is obviously less appropriate in
Graph #3 (“even worse”)
 Fit analyses provide a way of
quantifying this
Item fit
 Most basic approach is to subtract
observed frequency correct from the
expected value for each slice (g) of θ
 This is then summarized in a chi-square
statistic
 Bigger = bad fit
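One common way to write such a statistic is sketched below. The specific form (squared difference between observed and expected proportion correct, weighted by group size and divided by the expected binomial variance) is an assumption; software packages differ in the details. The groups and proportions are made up.

```python
import math

def irf(theta, a, b, c=0.0, D=1.702):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_chi_square(group_thetas, group_sizes, observed_props, a, b, c=0.0):
    """Sum over slices of N_g * (observed - expected)^2 / (expected * (1 - expected))."""
    chi_sq = 0.0
    for theta_g, n_g, observed in zip(group_thetas, group_sizes, observed_props):
        expected = irf(theta_g, a, b, c)
        chi_sq += n_g * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi_sq

# Hypothetical item (a = 1, b = 0, c = 0) and four slices of theta
group_thetas = [-1.5, -0.5, 0.5, 1.5]
group_sizes = [50, 100, 100, 50]
observed_good = [0.08, 0.30, 0.70, 0.92]   # close to the IRF
observed_bad = [0.40, 0.45, 0.55, 0.60]    # much flatter than the IRF

chi_good = item_chi_square(group_thetas, group_sizes, observed_good, a=1.0, b=0.0)
chi_bad = item_chi_square(group_thetas, group_sizes, observed_bad, a=1.0, b=0.0)
print(f"good fit: chi-square = {chi_good:.2f}")
print(f"bad fit:  chi-square = {chi_bad:.2f}")
```

The flat observed pattern produces a far larger value, matching the "bigger = bad fit" reading.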
Item fit
 Graphical depiction:
Item fit
 Better fit
Item fit
 The slices are called quadrature points
 Also used for item parameter estimation
 The number of slices for chi-square
need not be the same as for
estimation, but it helps interpretation
Item fit
 Chi-square is oversensitive to sample size
 A better way is to compute
standardized residuals
 Divide a chi-square by its df = G-m,
where G is the number of slices and m is
the number of item parameters
 This is more interpretable because of
the well-known scale
 0 is OK, examine items > 2
Item fit
 For broad analysis of fit, use quantile
plots (Xcalibre, Iteman, or Lertap)
◦ 3 to 7 groups
◦ Can find hidden issues (My example:
social desirability in Likert #2)
 See Xcalibre output
◦ Fit statistics
◦ Fit graphs (many more groups, and IRF)
Person fit
 Is an examinee responding oddly?
 Most basic measure: take the log of
the LF at the max (θ estimate)
$$l_0 = \ln L(\mathbf{u}_j \mid \hat{\theta}_j) = \sum_{i} \left[ u_{ij} \ln P_{ij} + (1 - u_{ij}) \ln Q_{ij} \right]$$
 A higher number means we are more
sure of the estimate
 But this is dependent on the level of
θ, so we need it standardized: lz
Person fit
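 lz is like a z-score for fit: z = (x-μ)/s
 Less than -2 means bad fit
$$l_z = \frac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}}$$
where l_0 is the log of the LF at the θ estimate, and
$$E(l_0) = \sum_{i} \left[ P_i \ln P_i + Q_i \ln Q_i \right], \qquad \mathrm{Var}(l_0) = \sum_{i} P_i Q_i \left[ \ln \frac{P_i}{Q_i} \right]^2$$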
Person fit
 lz is sensitive to the distribution of
item difficulties
 Works best when there is a range of difficulty
 That is, if there are no items for
high-ability examinees, none of them
will have a good estimate!
 Best to evaluate groups, not individuals
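A sketch of the lz calculation, using the standard formulation (log-likelihood minus its expected value, divided by its standard deviation) with a 2PL model. The five items, the fixed θ of 0, and the two response patterns are hypothetical.

```python
import math

def p_2pl(theta, a, b, D=1.702):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def lz(theta, responses, items):
    """Standardized log-likelihood person-fit index at a given theta."""
    l0 = expected = variance = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)

# Hypothetical items as (a, b), spread from easy to hard
items = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
expected_pattern = [1, 1, 1, 0, 0]   # easy items right, hard items wrong
aberrant_pattern = [0, 0, 1, 1, 1]   # easy items wrong, hard items right

lz_expected = lz(0.0, expected_pattern, items)
lz_aberrant = lz(0.0, aberrant_pattern, items)
print(f"expected pattern: lz = {lz_expected:.2f}")
print(f"aberrant pattern: lz = {lz_aberrant:.2f}")
```

The aberrant pattern falls far below -2, while the expected pattern stays above 0.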
How is fit useful?
 Throw out items?
 Throw out people?
 Change model used?
 Bad fit can flag other possible issues
◦ Speededness: fit (and N) gets worse at
end of test
◦ Multidimensionality: certain areas
How is fit useful?
 Note that this fits in with the
estimation process
 IRT calibration is not “one-click”
 Review results, then make adjustments
◦ Remove items/people
◦ Modify par distributions
◦ Modify quadrature points
◦ Etc.
 That was a basic intro to the
rationale of IRT
 Now start talking about some
applications and uses
 Also examine IRT software and output

Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Marina Santini
An Adaptive Evaluation System to Test Student Caliber using Item Response Theory
An Adaptive Evaluation System to Test Student Caliber using Item Response TheoryAn Adaptive Evaluation System to Test Student Caliber using Item Response Theory
An Adaptive Evaluation System to Test Student Caliber using Item Response Theory
Machine Learning - Deep Learning
Machine Learning - Deep LearningMachine Learning - Deep Learning
Machine Learning - Deep Learning
Adetimehin Oluwasegun Matthew
Introduction to unidimensional item response model
Introduction to unidimensional item response modelIntroduction to unidimensional item response model
Introduction to unidimensional item response model
Sumit Das

Similar to Introduction to Item Response Theory (20)

A visual guide to item response theory
A visual guide to item response theoryA visual guide to item response theory
A visual guide to item response theory
Slide Psikologi.docx
Slide Psikologi.docxSlide Psikologi.docx
Slide Psikologi.docx
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
Pentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BIPentaho Meeting 2008 - Statistics & BI
Pentaho Meeting 2008 - Statistics & BI
Artificial intyelligence and machine learning introduction.pptx
Artificial intyelligence and machine learning introduction.pptxArtificial intyelligence and machine learning introduction.pptx
Artificial intyelligence and machine learning introduction.pptx
Item Analysis: Classical and Beyond
Item Analysis: Classical and BeyondItem Analysis: Classical and Beyond
Item Analysis: Classical and Beyond
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf
Module 4 data analysis
Module 4 data analysisModule 4 data analysis
Module 4 data analysis
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptx
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithms
Mb0050 research methodology
Mb0050   research methodologyMb0050   research methodology
Mb0050 research methodology
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
Chemistry Lab Manual
Chemistry Lab ManualChemistry Lab Manual
Chemistry Lab Manual
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
An Adaptive Evaluation System to Test Student Caliber using Item Response Theory
An Adaptive Evaluation System to Test Student Caliber using Item Response TheoryAn Adaptive Evaluation System to Test Student Caliber using Item Response Theory
An Adaptive Evaluation System to Test Student Caliber using Item Response Theory
Machine Learning - Deep Learning
Machine Learning - Deep LearningMachine Learning - Deep Learning
Machine Learning - Deep Learning
Introduction to unidimensional item response model
Introduction to unidimensional item response modelIntroduction to unidimensional item response model
Introduction to unidimensional item response model

Recently uploaded

Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
Nguyen Thanh Tu Collection
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
Wasim Ak
JEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questionsJEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questions
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
Mohammed Sikander
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia

Recently uploaded (20)

Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
Normal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of LabourNormal Labour/ Stages of Labour/ Mechanism of Labour
Normal Labour/ Stages of Labour/ Mechanism of Labour
JEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questionsJEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questions
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Multithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race conditionMultithreading_in_C++ - std::thread, race condition
Multithreading_in_C++ - std::thread, race condition
Best Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDABest Digital Marketing Institute In NOIDA
Best Digital Marketing Institute In NOIDA
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx

Introduction to Item Response Theory

  • 1. Day 1 AM: An Introduction to Item Response Theory Nathan A. Thompson Vice President, Assessment Systems Corporation Adjunct faculty, University of Cincinnati
  • 2. Welcome!  Thank you for attending!  Introductions and important info now  Software… download or USB  Please ask questions ◦ Also, slow me down or ask for translation!  Goal: provide an intro on IRT/CAT to those who are new ◦ For those with some experience, to provide new viewpoints and more resources/recommendations
  • 3. Where I’m from, professionally  PhD, University of Minnesota ◦ CAT for classifications  Test development manager for ophthalmology certifications  Psychometrician at Prometric (many certifications)  VP at ASC
  • 4. Where I’m from, geographically
  • 5. Except now things look like…
  • 6. We do odd things in winter
  • 7. Introduce yourselves  Name  Employer/organization  Types of tests you do and/or why you are interested in IRT/CAT  (There might be someone with similar interests here)
  • 8. Another announcement  Newly formed: International Association for Computerized Adaptive Testing (IACAT) ◦ ◦ Free membership ◦ Growing resources ◦ Next conference: August 2012, Sydney
  • 9. Welcome!  This workshop is on two highly related topics: IRT and CAT  IRT is the modern paradigm for developing, analyzing, scoring, and linking tests  CAT is a next-generation method of delivering tests  CAT is not feasible without IRT, so we do IRT first
  • 10. IRT – where are we going?  IRT, as many of you know, provides a way of analyzing items  However, it has drawbacks (no distractor analysis), so the main reasons to use IRT are at the test level  It solves certain issues with classical test theory (CTT)  But the two should always be used together
  • 11. IRT – where are we going?  Advantages ◦ Better error characterization ◦ More precise scores ◦ Better linking ◦ Model-based ◦ Items and people on same scale (CAT) ◦ Sample-independence ◦ Powerful test assembly
  • 12. IRT – where are we going?  Keyword: paradigm or approach ◦ Not just another statistical analysis ◦ It is a different way of thinking about how tests should work, and how we can approach specific problems (scaling, equating, test assembly) from that viewpoint
  • 13. Day 1  There will be four parts this morning, covering the theory behind IRT: ◦ Rationale: A graphical introduction to IRT ◦ Models (dichotomous and polytomous) and their response functions ◦ IRT scoring (θ estimation) ◦ Item parameter estimation and model fit
  • 14. Part 1 A graphical introduction to IRT
  • 15. What is IRT?  Basic Assumptions 1. Unidimensionality  A unidimensional latent trait (1 at a time)  Item responses are independent of each other (local independence), except for the trait/ability that they measure 2. A specific form of the relationship between trait level and probability of a response  The response function, or IRT model  There are a growing number of models
  • 16. What is IRT?  A theory of mathematical functions that model the responses of examinees to test items/questions  These functions are item response functions (IRFs)  Historically, it has also been known as latent trait theory and item characteristic curve theory  The IRFs are best described by showing how the concept is derived from classical analysis…
  • 17. Classical item statistics  CTT statistics are typically calculated for each option
        Option    N     Prop    Rpbis    Mean
        1         307   0.860    0.221   91.876
        2          25   0.070   -0.142   85.600
        3          14   0.039   -0.137   83.929
        4          11   0.031   -0.081   86.273
  • 18. Classical item statistics  The proportions are often translated to a figure like this, where examinees are split into groups
  • 19. Classical item statistics  The general idea of IRT is to split the previous graph up into more groups, and then find a mathematical model for the blue line  This is what makes the item response function (IRF)
  • 20. Classical item statistics  Example with 10 groups
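The grouping idea on these slides can be sketched in a few lines of Python. This is only a minimal illustration: the function name `empirical_irf` and the data are hypothetical, and examinees are ordered by total score as a stand-in for the latent trait.

```python
def empirical_irf(total_scores, item_correct, n_groups=10):
    """Proportion correct on one item within score-ordered groups of examinees."""
    # Sort examinees by total score, carrying the item response (1/0) along.
    paired = sorted(zip(total_scores, item_correct))
    n = len(paired)
    proportions = []
    for g in range(n_groups):
        # Boundaries chosen so the groups are as equal in size as possible.
        lo = g * n // n_groups
        hi = (g + 1) * n // n_groups
        group = paired[lo:hi]
        proportions.append(sum(resp for _, resp in group) / len(group))
    return proportions

# Hypothetical data: 20 examinees; higher scorers answer correctly more often.
scores = list(range(20))
responses = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print(empirical_irf(scores, responses, n_groups=4))
```

Plotting these proportions against the group midpoints gives the jagged "blue line" that an IRT model then smooths. The sketch assumes there are at least as many examinees as groups.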
  • 21. The item response function  Reflects the probability of a given response as a function of the latent trait (z-score)  Example:
  • 22. The IRF  For dichotomously scored items, it is the probability of a correct or keyed response  Also called Item Characteristic Curve (ICC) or Trace Line  Only one curve (correct response), and all other responses are grouped as (1-IRF)  For polytomous items (partial credit, etc.), it is the probability of each response
  • 23. The IRF  How do we know exactly what the IRF for an item is?  We estimate parameters for an equation that draws the curve  For dichotomous IRT, there are three relevant parameters: a, b, and c
  • 24. The IRF  a: The discrimination parameter; represents how well the item differentiates examinees; slope of the curve at its center  b: The difficulty parameter; represents how easy or hard the item is with respect to examinees; location of the curve (left to right)  c: The pseudoguessing parameter; represents the ‘base probability’ of answering the question; lower asymptote
  • 25. The IRF  a=1, b=0, c=0.25
  • 26. The IRF…  is the “basic building block” of IRT  will differ from item to item  can be one of several different models (now)  can be used to evaluate items (now)  is used for IRT scoring (next)  leads to “information” used for test design (after that)  is the basis of CAT (tomorrow)
  • 28. IRT models  Several families of models ◦ Dichotomous ◦ Polytomous ◦ Multidimensional ◦ Facets (scenarios vs raters) ◦ Mixed (additional parameters) ◦ Cognitive diagnostic ◦ We will focus on first two
  • 29. Dichotomous IRT models  There are 3 main models in use, as mentioned earlier: 1PL, 2PL, 3PL  The “L” refers to “logistic”, which is the type of equation  IRT was originally developed decades ago with a cumulative normal curve  This meant that calculus (integrating the normal curve) was needed to compute probabilities
  • 30. The logistic function  An approximation was developed: the logistic curve  No calculus needed  There are two formats based on the scaling constant D  If D = 1.702, the logistic curve differs from the cumulative normal curve by less than 0.01 in probability at every point  If D = 1.0, the difference is a little larger; this is called the true logistic form  It does not really matter which you use, as long as you are consistent
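The claim about D can be checked numerically. The following is a small sketch, not part of the original slides; the function names are illustrative, and the cumulative normal is written with the standard library's error function.

```python
import math

def normal_ogive(z):
    """Cumulative standard normal, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z, D=1.702):
    """Logistic curve with scaling constant D."""
    return 1.0 / (1.0 + math.exp(-D * z))

def max_gap(D, lo=-4.0, hi=4.0, step=0.001):
    """Largest absolute difference between the two curves on a grid of z values."""
    n = int(round((hi - lo) / step))
    return max(abs(normal_ogive(lo + k * step) - logistic(lo + k * step, D))
               for k in range(n + 1))

print(round(max_gap(1.702), 4))  # stays below 0.01
print(round(max_gap(1.0), 4))    # noticeably larger
```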
  • 31. The logistic function  The basic form of the curve
  • 32. Item parameters  We add parameters to slightly modify the shape to get it to match our data  For example, a 4-option multiple choice item has a 25% chance of being guessed correctly  So we add a c parameter as a lower asymptote, which means that the curve is “squished” so it never goes below 0.25 (next)
  • 33. Item parameters  Sample IRF to show c
  • 34. Item parameters  We can also add a parameter (a) that modifies the slope  And a b parameter that slides the entire curve left or right ◦ Tells us which person z-score for which the item is appropriate  Items can be evaluated based on these just like with CTT statistics  A little more next…
  • 35. Item parameters: a  The a parameter ranges from 0.0 to about 2.0 in practice (theoretically to infinity)  Higher means better discriminating  For achievement testing, 0.7 or 0.8 is good, aptitude testing is higher  Helps you: Remove items with a<0.4? Identify a>1.0 as great items?
  • 36. Item parameters: b  For what person z-score is the item appropriate? (non-Rasch)  Should be between -3 and 3 ◦ 99.7% of students are in that range  0.0 is average person  1.0 is difficult (84th percentile)  -1.0 is easy (16th percentile)  2.0 is super difficult (98%)  -2.0 is super easy (2%)
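The percentile equivalents of b come straight from the standard normal distribution, assuming person scores are normally distributed. A short sketch (the helper name `percentile` is illustrative):

```python
import math

def percentile(z):
    """Percentage of a standard normal population falling below z."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for b in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"b = {b:+.1f} -> about the {percentile(b):.1f}th percentile")
```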
  • 37. Item parameters: b  If item difficulties are normally distributed, where does this fall? (Rasch)  0.0 is average item (NOT PERSON)
  • 38. Item parameters: c  The c parameter should be about 1/k, where k is the number of options  If higher, this indicates that options are not attractive  For example, suppose c = 0.5  This means there is a 50/50 chance  That implies that even the lowest students are able to ignore two options and guess between the other two options
  • 39. Item parameters  Extreme example: ◦ What is 23+25?  A. 48  B. 47  C. 3.141529…  D. 1,256,457
  • 40. The (3PL) logistic function  Here is the equation for the 3PL, so you can see where the parameters are inserted  Item i, person j  Equivalent formulations can be seen in the literature, like moving the (1-c) above the line
        $$P(X_i = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta_j - b_i)}}$$
  • 41. The (3PL) logistic function  $a_i$ is the item discrimination parameter for item i,  $b_i$ is the item difficulty or location parameter for item i,  $c_i$ is the lower asymptote, or pseudoguessing parameter for item i,  D is the scaling constant equal to 1.702 or 1.0.
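As a minimal sketch of how these parameters enter the calculation, the 3PL can be written as a single Python function. The name `irf_3pl` is illustrative, and D defaults to 1.702 here only as an assumption; setting c = 0 gives the 2PL.

```python
import math

def irf_3pl(theta, a, b, c, D=1.702):
    """P(X = 1 | theta) under the 3PL: c + (1 - c) / (1 + exp(-D a (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# The example item from the earlier slide: a = 1, b = 0, c = 0.25.
for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(theta, round(irf_3pl(theta, a=1.0, b=0.0, c=0.25), 3))
```

At theta = b the probability is halfway between c and 1, which is 0.625 for this item; far below b it approaches the lower asymptote c.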
  • 42. The (3PL) logistic function  The P is due primarily to (θ - b)  The effect due to a and c is not as strong  That is, your probability of getting the item correct is mostly due to whether it is easy/difficult for you ◦ This leads to the idea of adaptive testing
  • 43. 3PL  IRT has 3 dichotomous models  I’ll now go through the models with more detail, from 3PL down to 1PL  The 3PL is appropriate for knowledge or ability testing, where guessing is relevant  Each item will have an a, b, and c parameter
  • 44. IRT models  Three 3PL IRFs, c = 0, 0.1, 0.2, (b = -1, 0, 1; a = 1, 1, 1) [Figure: probability plotted against theta from -3 to 3]
  • 45. 2PL  The 2PL assumes that there is no guessing (c = 0.0)  Items can still differ in discrimination  This is appropriate for attitude or psychological type data with dichotomous responses ◦ I like recess time at school (T/F) ◦ My favorite subject is math (T/F)
  • 46. IRT models  Three 2PL IRFs, a = 0.75, 1.5, 0.3, b = -1.0, 0.0, 1.0 [Figure: probability plotted against theta from -3 to 3]
  • 47. 1PL  The 1PL assumes that all items are of equal discrimination  Items only differ in terms of difficulty  The raw score is now a sufficient statistic for the IRT score  Not the case with 2PL or 3PL; it’s not just how many items you get right, but which ones  10 hard items vs. 10 easy items
  • 48. 1PL  The 1PL is also appropriate for attitude or psychological type data, but where there is no reason to believe items differ substantially in terms of discrimination  This is rarely the case  Still used: see Rasch discussion later
  • 49. 1PL  Three 1PL IRFs: b = -1, 0, 1 [Figure: probability plotted against theta from -3 to 3]
  • 50. How to choose?  Characteristics of the items  Check with the data! (fit)  Sample size: ◦ 1PL = 100 minimum ◦ 2PL = 300 minimum ◦ 3PL = 500 minimum  Score report considerations (sufficient statistics)
  • 51. The Rasch Perspective  Another argument in choice  There is a group of psychometricians (mostly from Australia and Chicago) who believe that the 1PL is THE model  Everything else is just noise  Data should be “cleaned” to reflect this
  • 52. The Rasch Perspective  How to clean? A big target is to eliminate guessing  But how do you know?  Slumdog Millionaire Effect
  • 53. The Rasch Perspective  This group is very strong in their belief  Why? They believe it is “objective” measurement  Score scale centered on items, not people, so “person-free”  Software and journals devoted just to the Rasch idea
  • 54. The Rasch Perspective  Should you use it?  I was trained to never use Rasch ◦ Equal discrimination assumption is completely unrealistic… we all know some items are better than others ◦ We all know guessing should not be ignored ◦ Data should probably not be doctored ◦ Instead, data should drive the model
  • 55. The Rasch Perspective  However, while some researchers hate the Rasch model, I don’t ◦ It is very simple ◦ It works better with tiny samples ◦ It is easier to describe ◦ Score reports and sufficient statistics ◦ Discussion points from you? ◦ Nevertheless, I recommend IRT
  • 56. Polytomous models  Polytomous models are for items that are not scored correct/incorrect, yes/no, etc.  Two types: ◦ Rating scale or Likert: “Rate on a scale of 1 to 5” ◦ Partial credit – very useful in constructed-response educational items  My experience as a scorer
  • 57. Polytomous models  Partial credit example with rubric: ◦ Open response question to “2+3(4+5)=”  0: no answer  1: 2, 3, 4, or 5 (picks one)  2: 14 (adds all)  3: 45 (does (2+3) x (4+5) )  4: 27 (everything but add 2)  5: 29 (correct)
  • 58. The IRF  Polytomous example (CRFs):
  • 59. Comparison table
        Model   Item Disc.               Step Spacing   Step Ordering   Option Disc.
        RSM     Fixed                    Fixed          Fixed           Fixed
        PCM     Fixed                    Variable       Variable        Fixed
        GRSM    Variable                 Fixed          Fixed           Fixed
        GRM     Variable                 Variable       Fixed           Fixed
        GPCM    Variable                 Variable       Variable        Fixed
        NRM     Variable (each option)   Variable       Variable        Variable
        Fixed/Variable between items… more later, if time
  • 60. Part 3 Ability (θ) estimation (IRT Scoring)
  • 61. Scoring  First: throw out your idea of a “score” as the number of items correct  We actually want something more accurate: the precise z-score  Because the z-scores axis is called θ in IRT, the scoring is called θ estimation
  • 62. Scoring  IRT utilizes the IRFs in scoring examinees  If an examinee gets a question right, they “get” the item’s IRF  If they get the question wrong, they “get” the (1-IRF)  These curves are multiplied for all items to get a final curve called the likelihood function
  • 63. Scoring  Here’s an example IRF; a =1, b=0, c = 0
  • 65. Scoring  We multiply those to get a curve like this…
  • 66. Scoring - MLE  The score is the point on the x-axis where the highest likelihood is  This is the maximum likelihood estimate  In the example, 0.0 (average ability)  This obtains precise estimates on the θ scale
  • 67. Maximum likelihood  The LF is technically defined as:
        $$L(\mathbf{u}_j \mid \theta_j) = \prod_{i=1}^{n} P_{ij}^{u_{ij}}\, Q_{ij}^{1 - u_{ij}}$$
     Where u is a response vector of 1s and 0s  Note what this does to the exponents
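The multiplication of curves can be sketched directly: for each item, take P if the response is 1 and Q = 1 - P if it is 0, multiply across items, and look for the θ with the highest product. This is a brute-force illustration with hypothetical function names and a hypothetical two-item test, not the method used by any particular IRT program.

```python
import math

def p_correct(theta, a, b, c=0.0, D=1.702):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, items, responses):
    """Product over items of P**u * Q**(1 - u) at a single theta value."""
    L = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        L *= p if u == 1 else (1.0 - p)
    return L

def mle_by_grid(items, responses, lo=-4.0, hi=4.0, step=0.001):
    """Evaluate the likelihood on a grid and return the theta with the highest value."""
    n = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n + 1)]
    return max(grid, key=lambda t: likelihood(t, items, responses))

# Hypothetical two-item test: identical items (a=1, b=0, c=0), one right, one wrong.
items = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
theta_hat = mle_by_grid(items, [1, 0])
print(round(theta_hat, 3))
```

With one correct and one incorrect response to identical items centered at 0, the likelihood is symmetric and peaks at θ = 0, the average-ability case used in the scoring example.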
  • 68. Scoring - SEM  A quantification of just how precise can also be calculated, called the standard error of measurement  This is assumed to be the same for everyone in classical test theory, but in IRT depends on the items and the responses, and the level of θ
  • 69. Scoring - SEM  Here’s a new LF – blue has the same MLE but is less spread out  Both are two items, blue with a = 2
  • 70. Scoring - SEM  The first LF had an SEM ~ 1.0  The second LF had an SEM ~ 0.5  We have more certainty about the second person’s score  This shows how much high-quality items aid in measurement ◦ Same items and responses, except a higher a
  • 71. Scoring - SEM  SEM is usually used to stop CATs  General interpretation: confidence interval  Plus or minus 1.96 (about 2) is 95%  So if the θ estimate is 0.0 and the SEM is 0.5, as in the example, we are 95% sure that the student’s true ability is somewhere between about -1.0 and +1.0
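One common way to compute the SEM, assumed here because the slides do not give a formula, is from the test information function: SEM = 1 / sqrt(sum of item information). The sketch below uses the standard 3PL item information formula and hypothetical two-item tests. The exact values depend on D and will not match the approximate figures read off the plotted curves, but it shows that doubling a halves the SEM.

```python
import math

def item_information(theta, a, b, c=0.0, D=1.702):
    """3PL item information; reduces to D^2 a^2 P Q when c = 0."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def sem(theta, items):
    """Asymptotic standard error: 1 / sqrt(sum of item information)."""
    return 1.0 / math.sqrt(sum(item_information(theta, *it) for it in items))

def confidence_interval(theta_hat, sem_value, z=1.96):
    """Symmetric interval theta_hat +/- z * SEM (z = 1.96 for 95%)."""
    return theta_hat - z * sem_value, theta_hat + z * sem_value

low_a = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]    # two items with a = 1
high_a = [(2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]   # same items with a = 2
print(round(sem(0.0, low_a), 3), round(sem(0.0, high_a), 3))
print(confidence_interval(0.0, 0.5))
```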
  • 72. Scoring - SEM  If a student gives aberrant responses (cheating, not paying attention, etc.) they will have a larger SEM  This is not enough to accuse of cheating (they could have just dozed off), but it can provide useful information for research
  • 73. Scoring - SEM  SEM CI is also used to make decisions ◦ Pass if 2 SEMs above a cutoff
  • 74. Details on IRT Scores  Student scores are on the θ scale, which is analogous to the standard normal z scale – same interpretations!  There are four methods of scoring ◦ Maximum Likelihood (MLE) ◦ Bayesian Modal (or MAP, for maximum a posteriori) ◦ Bayesian EAP (expected a posteriori) ◦ Weighted MLE (less common)
  • 75. Maximum likelihood  Take the likelihood function “as is” and find the highest point
  • 76. Maximum likelihood  Problem: all incorrect or all correct answers
  • 77. Bayesian modal  Addresses that problem by always multiplying the LF by a bell-shaped curve, which forces it to have a maximum somewhere  Still find the highest point
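A minimal sketch of Bayesian modal scoring, assuming a standard normal prior and hypothetical items: the log of the bell-shaped curve is added to the log-likelihood (equivalent to multiplying the curves), and the highest point is found on a grid. With an all-correct response vector the plain likelihood keeps rising toward the top of the scale, but the posterior has a finite maximum.

```python
import math

def p_correct(theta, a, b, c=0.0, D=1.702):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_posterior(theta, items, responses, prior_mean=0.0, prior_sd=1.0):
    """Log-likelihood plus the log of a normal prior (constants dropped)."""
    total = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def map_by_grid(items, responses, lo=-4.0, hi=4.0, step=0.001):
    """Return the grid value of theta with the highest posterior."""
    n = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n + 1)]
    return max(grid, key=lambda t: log_posterior(t, items, responses))

# Hypothetical three-item test, all answered correctly.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_map = map_by_grid(items, [1, 1, 1])
print(round(theta_map, 3))  # finite and positive, pulled toward the prior mean
```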
  • 78. Bayesian EAP  Argues that the curve is not symmetrical, and we should not ignore everything except the maximum  So it takes the “average” of the curve by splitting it into many slices and finding the weighted average  The slices are called quadrature points or nodes
  • 80. Bayesian EAP  Simple EAP overlay: θ ~ -0.50
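The "weighted average over slices" can be sketched as follows. This assumes a standard normal prior and equally spaced quadrature points; the number of points, the range, and the items are all illustrative choices.

```python
import math

def p_correct(theta, a, b, c=0.0, D=1.702):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def eap(items, responses, n_points=61, lo=-4.0, hi=4.0):
    """Posterior mean and SD of theta using equally spaced quadrature points."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    weights = []
    for t in nodes:
        w = math.exp(-0.5 * t * t)          # standard normal prior (unnormalized)
        for (a, b, c), u in zip(items, responses):
            p = p_correct(t, a, b, c)
            w *= p if u == 1 else (1.0 - p)  # multiply in the likelihood
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(nodes, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical three-item test: only the easiest item answered correctly.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_eap, posterior_sd = eap(items, [1, 0, 0])
print(round(theta_eap, 3), round(posterior_sd, 3))
```

The posterior SD returned alongside the mean plays the role of the SEM for an EAP score.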
  • 81. Bayesian  Why Bayesian? ◦ Nonmixed response vectors ◦ Asymmetric LF  Why not Bayesian? ◦ Biased inward – if you find the θ estimates of 1000 students, the SD would be smaller with the Bayesian estimates, maybe 0.95
  • 82. Newton-Raphson  Most IRT software actually uses a somewhat different approach to MLE and Bayesian Modal  The straightforward way is to calculate the value of the LF at each point in θ, within reason  For example, -4 to 4 at 0.001  That’s 8,000 calculations! Too much for 1970s computers…
  • 83. Newton-Raphson  Newton-Raphson is a shortcut method that searches the curve iteratively for its maximum  Why? Same 0.001 level of accuracy in only 5 to 20 iterations  Across thousands of students, that is a huge amount of calculations saved  But certain issues (local maxima or minima)… maybe time to abandon?
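A minimal sketch of the Newton-Raphson search, restricted to 2PL items (c = 0) so the derivatives stay simple. The function name, starting value, and items are assumptions for illustration. Each iteration moves θ by the ratio of the first to the second derivative of the log-likelihood and stops when the step is smaller than the desired accuracy.

```python
import math

def newton_raphson_mle(items, responses, start=0.0, tol=0.001, max_iter=50, D=1.702):
    """MLE of theta for 2PL items (a, b) by Newton-Raphson on the log-likelihood."""
    theta = start
    for iteration in range(1, max_iter + 1):
        d1 = 0.0  # first derivative of the log-likelihood
        d2 = 0.0  # second derivative (always negative for the 2PL)
        for (a, b), u in zip(items, responses):
            p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
            d1 += D * a * (u - p)
            d2 -= (D * a) ** 2 * p * (1.0 - p)
        step = d1 / d2
        theta -= step
        if abs(step) < tol:
            return theta, iteration
    raise RuntimeError("did not converge (e.g. all-correct or all-incorrect responses)")

# Hypothetical five-item test with a mixed response pattern.
items = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.0), (1.0, 0.5), (1.5, 1.0)]
responses = [1, 1, 1, 0, 0]
theta_hat, n_iter = newton_raphson_mle(items, responses)
print(round(theta_hat, 3), n_iter)
```

An all-correct or all-incorrect vector has no finite maximum, so this sketch raises an error in that case rather than returning a boundary value.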
  • 84. Examples  See IRT Scoring and Graphing Tool
  • 85. Part 4 Item parameter estimation How do we get a, b, and c?
  • 86. The estimation problem  Estimating student θ given a set of known item parameters is easy because we have something established  But what about the first time a test is given?  All items are new, and there are no established student scores
  • 87. The estimation problem  Which came first, the chicken or the egg?  Since we don’t know, we go back and forth, trying one and then the other ◦ Fix “temporary” z-scores ◦ Estimate item parameters ◦ Fix the new item parameters ◦ Estimate scores ◦ Do it again until we’re satisfied
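The back-and-forth procedure can be sketched for the simplest case, the 1PL with D = 1.0, using simulated data. Everything here is an assumption for illustration: the function names, the single Newton-Raphson step per cycle, the centering of difficulties at zero to anchor the scale, and the dropping of all-correct and all-incorrect examinees. It is a joint-ML style sketch, not how production IRT software is implemented.

```python
import math
import random

def p(theta, b):
    """Rasch/1PL probability of a correct response (D = 1.0 here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def newton_update(value, others, observed, sign):
    """One Newton-Raphson step for a single theta (sign=+1) or b (sign=-1)."""
    probs = [p(value, o) if sign > 0 else p(o, value) for o in others]
    d1 = sign * sum(x - pr for x, pr in zip(observed, probs))
    d2 = -sum(pr * (1.0 - pr) for pr in probs)
    return max(-6.0, min(6.0, value - d1 / d2))   # keep estimates in a sane range

def calibrate(data, n_cycles=20):
    """Alternate between item and person estimation."""
    n_items = len(data[0])
    # Temporary thetas: logit of each examinee's proportion correct.
    thetas = [math.log(sum(row) / (n_items - sum(row))) for row in data]
    bs = [0.0] * n_items
    for _ in range(n_cycles):
        # Hold thetas fixed, update every item difficulty.
        for i in range(n_items):
            column = [row[i] for row in data]
            bs[i] = newton_update(bs[i], thetas, column, sign=-1)
        # Anchor the scale by centering the difficulties at zero.
        mean_b = sum(bs) / n_items
        bs = [b - mean_b for b in bs]
        # Hold item parameters fixed, update every examinee's theta.
        thetas = [newton_update(t, bs, row, sign=+1) for t, row in zip(thetas, data)]
    return bs, thetas

# Simulated responses from 500 hypothetical examinees to 8 items.
random.seed(1)
true_b = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]
data = []
for _ in range(500):
    theta = random.gauss(0.0, 1.0)
    row = [1 if random.random() < p(theta, b) else 0 for b in true_b]
    if 0 < sum(row) < len(row):      # drop all-correct / all-incorrect rows
        data.append(row)

est_b, est_theta = calibrate(data)
print([round(b, 2) for b in est_b])
```

The recovered difficulties should follow the same ordering as the generating values, although joint estimation with so few items tends to stretch them outward somewhat.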
  • 88. Calibration algorithms  There are two main calibration algorithms ◦ Joint maximum likelihood (JML) – older ◦ Marginal maximum likelihood (MML) – newer, and works better with smaller samples… the standard ◦ Also conditional maximum likelihood, but it only works with 1PL, so rarer ◦ New in research, but not in standard software: Markov chain Monte Carlo
  • 89. Calibration algorithms  The term maximum likelihood is used here because we are maximizing the likelihood of the entire data set, for all items i and persons j  X is the data set of responses x_ij  b is the set of item parameters b_i  θ is the set of examinee θ_j values
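The formula for this slide did not survive in the transcript. Under local independence, a standard way to write the likelihood of the whole data set, using the same P and Q = 1 - P as in the scoring section, is the following (N, the number of examinees, and n, the number of items, are symbols assumed here):

```latex
L(\mathbf{X} \mid \mathbf{b}, \boldsymbol{\theta})
  = \prod_{j=1}^{N} \prod_{i=1}^{n} P_{ij}^{x_{ij}} \, Q_{ij}^{1 - x_{ij}}
```

This is simply the scoring likelihood for one examinee multiplied across all examinees.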
  • 90. Calibration algorithms  This means we want to find the b and θ that make that number the largest  So we set θ, find a good b, use it to score students and find a new θ, find a better b, etc… ◦ Marginal ML uses marginal distributions rather than exact points, which is why it is faster and works better with smaller samples of people/items
  • 91. Calibration algorithms  Note: rather than examine the LF (which gets incredibly small), software examines -2*ln(LF)  IRT software tracks these iterations because they provide information on model fit  See output
  • 92. Part 4 (cont.) Assumptions of IRT: Model-data fit
  • 93. Checking fit  One assumption of IRT (#2) is that our data even follows the idea of IRT!  This is true at both the item and the test level  Also true about examinees: they should be getting items wrong that are above their θ and getting items correct that are below their θ
  • 94. Model-data fit  Whenever fitting any mathematical model to empirical data (not just IRT), it is important to assess fit  Fit refers to whether the model adequately represents the data  Alternatively, it asks whether the data are far away from the model
  • 95. Model-data fit  There are two types of fit important in IRT ◦ Item (and test) - compares observed data to the IRF ◦ Person – evaluates whether individual students are responding according to the model  Easy items correct, hard items incorrect
  • 96. Model-data fit  Remember the 10-group empirical IRF that I drew? This is great!
  • 97. Model-data fit  You’re more likely to see something like this:
  • 98. Model-data fit  Or even worse…
  • 99. Model-data fit  Note that if we drew an IRF in each of those graphs, it would be about the same  But it is obviously less appropriate in Graph #3 (“even worse”)  Fit analyses provide a way of quantifying this
  • 100. Item fit  Most basic approach is to subtract observed frequency correct from the expected value for each slice (g) of θ  This is then summarized in a chi-square statistic  Bigger = bad fit
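One common way to turn those observed-minus-expected differences into a chi-square is sketched below. The exact formula used by any particular program may differ; the group sizes and proportions are invented for illustration.

```python
def item_fit_chi_square(n_per_group, observed_prop, expected_prop):
    """Sum over theta slices of N_g * (O_g - E_g)^2 / (E_g * (1 - E_g)).

    O_g is the observed proportion correct in slice g and E_g is the
    proportion the IRF predicts for that slice (assumed inputs).
    """
    chi2 = 0.0
    for n, o, e in zip(n_per_group, observed_prop, expected_prop):
        chi2 += n * (o - e) ** 2 / (e * (1.0 - e))
    return chi2

# Hypothetical item with five theta slices of 100 examinees each.
n_per_group = [100, 100, 100, 100, 100]
observed = [0.20, 0.35, 0.50, 0.70, 0.90]
expected = [0.22, 0.34, 0.52, 0.69, 0.88]
chi2 = item_fit_chi_square(n_per_group, observed, expected)
print(chi2)
```

If the observed proportions matched the IRF exactly in every slice, the statistic would be 0; it grows as the empirical points drift away from the curve.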
  • 101. Item fit  Graphical depiction:
  • 103. Item fit  The slices are called quadrature points  Also used for item parameter estimation  The number of slices for chi-square need not be the same as for estimation, but it helps interpretation
  • 104. Item fit  Chi-square is oversensitive to sample size  A better way is to compute standardized residuals  Divide a chi-square by its df = G-m where m is the number of item parameters  This is more interpretable because of the well-known scale  0 is OK, examine items > 2
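Following the slide's description literally, the standardized value is the chi-square divided by df = G - m. The sketch below assumes G is the number of θ slices and m is the number of item parameters (3 for the 3PL); the function name and numbers are hypothetical, and specific software may standardize differently.

```python
def standardized_fit(chi2, n_groups, n_item_params):
    """Divide a chi-square by its degrees of freedom, df = G - m."""
    df = n_groups - n_item_params
    if df <= 0:
        raise ValueError("need more groups than item parameters")
    return chi2 / df

# Hypothetical: chi-square of 21.0 from 10 slices under the 3PL (m = 3).
value = standardized_fit(21.0, 10, 3)
flagged = value > 2  # the slide's rule of thumb: examine items above 2
print(value, flagged)
```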
  • 105. Item fit  For broad analysis of fit, use quantile plots (Xcalibre, Iteman, or Lertap) ◦ 3 to 7 groups ◦ Can find hidden issues (My example: social desirability in Likert #2)  See Xcalibre output ◦ Fit statistics ◦ Fit graphs (many more groups, and IRF)
  • 106. Person fit  Is an examinee responding oddly?  Most basic measure: take the log of the LF at the max (θ estimate)  A higher number means we are more sure of the estimate  But this is dependent on the level of θ, so we need it standardized: lz  $l_0 = \ln \prod_{i=1}^{n} \hat{P}_i^{u_i} \hat{Q}_i^{1-u_i}$
  • 107. Person fit  lz is like a z-score for fit: z = (x-μ)/s  Less than -2 means bad fit  $l_z = \dfrac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}}$  $E(l_0) = \sum_{i=1}^{n} \left[ \hat{P}_i \ln \hat{P}_i + (1 - \hat{P}_i) \ln (1 - \hat{P}_i) \right]$  $\mathrm{Var}(l_0) = \sum_{i=1}^{n} \hat{P}_i (1 - \hat{P}_i) \left[ \ln \dfrac{\hat{P}_i}{1 - \hat{P}_i} \right]^2$
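The lz statistic can be computed directly from the model probabilities at the examinee's θ estimate. The sketch below uses the standard definition: the log-likelihood minus its expected value, divided by its standard deviation. The probabilities and response patterns are made up.

```python
import math

def lz_person_fit(probs, responses):
    """Standardized log-likelihood person-fit statistic (lz).

    probs[i] is the model probability of a correct response to item i at the
    examinee's theta estimate; responses[i] is the observed 0/1 answer.
    Undefined if every probability is exactly 0.5 (zero variance).
    """
    l0 = expected = variance = 0.0
    for p, u in zip(probs, responses):
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)

# Hypothetical probabilities, ordered from easiest to hardest item.
probs = [0.9, 0.8, 0.7, 0.5, 0.3, 0.2, 0.1]
consistent = [1, 1, 1, 1, 0, 0, 0]  # easy items correct, hard items incorrect
aberrant = [0, 0, 0, 0, 1, 1, 1]    # the reverse pattern
print(lz_person_fit(probs, consistent))
print(lz_person_fit(probs, aberrant))
```

The consistent pattern gives a positive lz, while the reversed pattern falls well below the -2 cutoff mentioned on the slide.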
  • 108. Person fit  lz is sensitive to the distribution of item difficulties  Works best when there is a range of difficulty  That is, if there are no items for high-ability examinees, none of them will have a good estimate!  Best to evaluate groups, not individuals
  • 109. How is fit useful?  Throw out items?  Throw out people?  Change model used?  Bad fit can flag other possible issues ◦ Speededness: fit (and N) gets worse at end of test ◦ Multidimensionality: certain areas
  • 110. How is fit useful?  Note that this fits in with the estimation process  IRT calibration is not “one-click”  Review results, then make adjustments ◦ Remove items/people ◦ Modify par distributions ◦ Modify quadrature points ◦ Etc.
  • 111. Summary  That was a basic intro to the rationale of IRT  Now start talking about some applications and uses  Also examine IRT software and output