Quantitative Studies Group - Item Response Theory, Spring 2014
1.
2. Thoughts on Item Response Theory
...an incomplete story
Quinn N Lathrop
University of Notre Dame
February 20, 2014
3.
4. Overview
- 3PL IRT model
- Measurement in IRT
- Guessing parameter
- Nonparametric IRT
- The Digital Ocean
- A new method, simulation, results
5. The Default Model: 3PL
Prob(Y_pi = 1) = c_i + (1 − c_i) × logit⁻¹[a_i(θ_p − b_i)]
i = 1, 2, ..., I for items and p = 1, 2, ..., P for persons.
Y_pi ∼ Bernoulli(Prob(Y_pi = 1))
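A minimal sketch of this model in Python (NumPy only); the parameter values are illustrative and the helper name p_3pl is hypothetical, not from the talk.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response: c + (1 - c) * logit^{-1}(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Illustrative values: one person, one item
rng = np.random.default_rng(0)
theta, a, b, c = 0.5, 1.2, -0.3, 0.2
prob = p_3pl(theta, a, b, c)
y = rng.binomial(1, prob)   # Y_pi ~ Bernoulli(prob)
print(prob, y)
```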
“a three-parameter hammer”
6. What about the 2PL?
Each item has its own discrimination parameter ai .
[Figure: 2P-IRT model, Prob(Y = 1) vs. ability]
Discrimination is usually estimated.
Similar to logistic regression, except x’s are latent.
7. Tangent #1: What is Measurement?
Usual definition of measurement: the assignment of numbers to objects or
events according to rules.
“In the physical sciences, the practical
consequence of performing arithmetic
computations on a numeric scale with
ordinal properties relative to one with
ratio or interval properties is
significant.
Is it less so in the social sciences?”
-Derek Briggs 2013 JEM
8. What is measurement in psychology?
“Measurement as a metaphor” is not the only option.
Scientific definition of measurement: the estimation of the ratio of a
magnitude to a unit.
Psychology is a pathological science because we assume our latent traits are quantitative without empirically testing that assumption, or even acknowledging that we made that assumption (Michell, 1991).
“By avoiding tests of the assumption of a quantitative structure of
psychological attributes, psychologists have yet failed to make progress ... in
regard to their most fundamental assumptions” (Heene, 2013).
9. Additive Conjoint Measurement
Newton’s 2nd law
log(Force) = log(Mass) + log(Acceleration)
If the addition of X and Y satisfies certain axioms, then X, Y, and X + Y = Z all have quantitative structure.
If the data fit, the Rasch or 1PL IRT model measures a quantitative
(interval-level) latent trait.
logit[Prob(correct)] = Ability − Difficulty
2PL and 3PL IRT models (and Structural Equation Models) generally do not
support additive conjoint measurement.
10. The Unit in IRT
logit[Prob(Y_pi = 1)] = a_i(θ_p − b_i)
The discrimination parameter a_i is the ratio between the item's unit and the latent unit.
Estimating the units while simultaneously making a measurement in the unit.
11. 3PL Model
In addition to estimating a_i, the 3PL model also estimates the lower bound of Prob(Y_pi = 1):
Prob(Y_pi = 1) = c_i + (1 − c_i) × logit⁻¹[a_i(θ_p − b_i)]
where
c_i is the guessing parameter (lower asymptote): Prob(Y_pi = 1 | θ_p = −∞) = c_i
12. But...
...c cannot be estimated
- Only possible to estimate two parameters per item consistently (Holland, 1990)
- Requires a large sample size and medium to high difficulty (Lord, 1974; Wood, Wingersky, & Lord, 1976; Thissen & Wainer, 1982)
- Convergence rates can be below 20% (Han, 2012)
- The null hypothesis that c = 0 is on the boundary of the parameter space
- There are no examinees near θ = −∞
The solution?
- Estimate it anyway
- Use strong prior distributions
- Don't estimate it; fix it at 1/(number of options) (Han, 2012)
13. Reality rears its ugly head
The response form is often not included in the response data
Item:
“What is the probability of rolling a 4 on a 6-sided die?”
Response: /
- Free response, so (1 / number of options) doesn't apply
- The response form may allow for some guessing
14. Demonstration Data
- 540 3PL item parameters from a retired test bank
- 5000 person parameters from N(0, 1)
- No missing data
- Fit 3PL with BILOG (MMLE, E-M)
- Default priors vs. no priors
[Figure: BILOG-MG prior on c, Beta(5, 17) density]
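For reference, a quick look at where that Beta(5, 17) prior concentrates (a sketch using scipy, not part of the talk):

```python
from scipy import stats

prior = stats.beta(5, 17)        # BILOG-MG default prior on c
print(prior.mean())              # ~0.227
print((5 - 1) / (5 + 17 - 2))    # mode = 0.2
```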
15. With this much data, things seem ok
[Figure: estimated c vs. true c, with prior (left panel) and without prior (right panel)]
16. Standard Errors
[Figure: standard errors of the c estimates vs. estimated c, with prior (left panel) and without prior (right panel)]
17. Summary of demonstration data c
With a large sample size of 5000 persons, using a strong prior on c
- pulls the estimates away from zero
- reduces the standard errors of the estimates
But do we really want all the items to have nonzero and different c’s?
18. We like parameters...
but the 3PL parameters cannot be used separately across items.
- Guessing is c_i = P_i(−∞)
- Difficulty is P_i(b_i) = (1 + c_i)/2
- Discrimination is a_i = 4 × P′_i(b_i) / (1 − c_i)
Difficulty and discrimination are not comparable across items unless the c's are equal.
Can use graphical curves instead (called ICCs or IRFs).
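As a quick numeric sanity check of these identities (a sketch with illustrative parameter values, not from the talk):

```python
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

a, b, c = 1.5, 0.4, 0.2
# Lower asymptote: P(-inf) -> c
print(p_3pl(-50.0, a, b, c))               # ~0.2
# Height at the difficulty: P(b) = (1 + c) / 2
print(p_3pl(b, a, b, c), (1 + c) / 2)      # both 0.6
# Slope at b recovers a: a = 4 * P'(b) / (1 - c)
eps = 1e-5
slope = (p_3pl(b + eps, a, b, c) - p_3pl(b - eps, a, b, c)) / (2 * eps)
print(4 * slope / (1 - c))                 # ~1.5
```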
19. “The best way to prevent undesirable consequences of such misuse of the item
parameters with 3PL simply would be not to use or not to interpret the item
parameters. Graphical analyses on IRFs could be used instead. Preventing the
3PL item parameters from being interpreted, however, would substantially
limit the utility of 3PL” (Han, 2012).
20. If we don’t need parameters, let’s use nonparametric IRT
P̂_i(t) = [ Σ_p K((θ̂_p − t)/h) Y_pi ] / [ Σ_p K((θ̂_p − t)/h) ]
- θ̂_p - person ability estimate
- Y_pi - response
- t - evaluation point
- K - kernel
- h - bandwidth parameter
[Figure: kernel-smoothed ICC, Prob. Correct vs. Ability Estimate]
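A sketch of this kernel (Nadaraya-Watson) estimate in Python with a Gaussian kernel; the ability estimates, responses, grid, and bandwidth below are illustrative, not the talk's data.

```python
import numpy as np

def np_icc(theta_hat, y_i, t_grid, h=0.3):
    """Kernel-smoothed ICC: at each evaluation point t, a kernel-weighted
    average of the 0/1 responses y_i of persons with ability estimates theta_hat."""
    # Gaussian kernel K((theta_hat - t) / h), one row per evaluation point
    w = np.exp(-0.5 * ((theta_hat[None, :] - t_grid[:, None]) / h) ** 2)
    return (w * y_i[None, :]).sum(axis=1) / w.sum(axis=1)

# Illustrative use with fake responses to one item
rng = np.random.default_rng(1)
theta_hat = rng.normal(size=2000)
y_i = rng.binomial(1, 1 / (1 + np.exp(-(theta_hat - 0.2))))
t_grid = np.linspace(-2, 2, 41)
p_hat = np_icc(theta_hat, y_i, t_grid)
```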
21. Measurement in Nonparametric IRT
To get θ̂_p,
- rank persons on total score*
- place the ranks onto N(0, 1)
Note: The latent trait estimates are explicitly ordinal but placed onto N(0, 1) for convenience and familiarity.
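A sketch of that ranking step in Python; the tie handling (average ranks) and the (rank − 0.5)/P plotting positions are assumed details, not specified in the talk.

```python
import numpy as np
from scipy import stats

def ranks_to_normal(total_scores):
    """Rank persons on total score, then place the ranks onto N(0, 1)."""
    ranks = stats.rankdata(total_scores, method="average")   # ties get the average rank
    quantiles = (ranks - 0.5) / len(total_scores)             # map ranks into (0, 1)
    return stats.norm.ppf(quantiles)                           # normal scores

theta_hat = ranks_to_normal(np.array([12, 7, 7, 19, 3]))
```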
23. The Digital Ocean
Data are now a side effect of our interaction with the environment.
A consequence of technology.
The data are never balanced, and we would never expect them to be.
Items have different sample sizes.
Persons have different sample sizes.
Nonparametric IRT fails with missing data:
- Can't rank subjects on total score
- Can't rank subjects on proportion correct
How else can we rank individuals?
24. Matrix Decomposition
Factorization of a matrix into a product of matrices.
The P by I response matrix Y can be decomposed into a matrix corresponding to the rows of Y and a matrix corresponding to the columns of Y:
Y = RC′
Similar in concept to a 1PL IRT model where
Ỹ_(P×I) = R_(P×1) C′_(1×I)
or a Singular Value Decomposition of rank 1, or one Principal Component.
25. Alternating Least Squares: SVD with missing data
Alternating Least Squares is simple and fast.
Start by randomly filling C
R = YC(C′C)⁻¹
C′ = (R′R)⁻¹R′Y
Can simply skip over missing data in Y when calculating these equations.
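A minimal NumPy sketch of these updates for the rank-1 case, skipping missing cells; the NaN coding of missing responses and the fixed iteration count are assumptions, not part of the talk.

```python
import numpy as np

def als_rank1(Y, n_iter=50, seed=0):
    """Rank-1 ALS: Y (P x I, with np.nan for missing) ~ R C', skipping missing cells."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(Y)                 # mask of observed responses
    Y0 = np.where(obs, Y, 0.0)         # zeros contribute nothing where obs is False
    C = rng.normal(size=Y.shape[1])    # start by randomly filling C
    for _ in range(n_iter):
        # R update: per-person least squares using only that person's observed items
        R = (Y0 * C).sum(axis=1) / (obs * C**2).sum(axis=1)
        # C update: per-item least squares using only that item's observed persons
        C = (Y0.T * R).sum(axis=1) / (obs.T * R**2).sum(axis=1)
    return R, C
```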
26. How does ALS-SVD compare to total score?
[Figure: Alternating Least Squares ranking (values in R vs. true ability) and true-score ranking (total scores vs. true ability)]
Three item test, no missing data.
28. Unbalanced data
[Figure: Alternating Least Squares ranking (values in R vs. true ability) and true-score ranking (total scores vs. true ability)]
Five item test; persons with θ < 0 answer items 1, 2, 3 and persons with θ > 0 answer items 3, 4, 5.
29. More unbalanced data
11 items (ordered in difficulty)
Persons interact with different numbers of items
Persons interact with different difficulties of items
30. More unbalanced data
[Figure: Alternating Least Squares ranking (values in R vs. true ability, colored by total score 0-7) and proportion-correct ranking vs. true ability]
32. Summary of ALS-SVD
Now, methods based on total score can be applied to more challenging data
structures.
- Nonparametric smoothing for graphical purposes
- Popular item fit statistics (S−χ²)
- Nonparametric DIF
- Nonparametric classification accuracy/consistency estimation
But, still a ton of work to do
- Consequences of binary data? Constraints?
- Assumptions about missing data?
- Other assumptions?
- Connection to latent trait? 2PL model?
- Interpreting the values of R and C
33. If we don’t need parameters, let’s use nonparametric IRT
[Figure: kernel-smoothed ICC, Prob. Correct vs. Ability Estimate]
P̂_i(t) = [ Σ_p K((θ̂_p − t)/h) Y_pi ] / [ Σ_p K((θ̂_p − t)/h) ]
“But I want parameters” -Everybody
34. Nonparametric Parameters
- The nonparametric curve converges to the true ICC (Douglas, 1997)
- The best fitting parametric ICC to the nonparametric curve (Wells & Bolt, 2008)
So we can fit the 3PL to the smoothed data (pairs of t and P̂_i(t)) instead of to Y_pi (just a vector of 0's and 1's).
[Figure: Item 8 from the more unbalanced data - ALS-SVD nonparametric curve and best fitting 3PL]
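A sketch of that fitting step in Python: ordinary least squares of a 3PL curve against the (t, P̂_i(t)) pairs using scipy.optimize.curve_fit; the starting values and bounds are illustrative assumptions, not the talk's settings.

```python
import numpy as np
from scipy.optimize import curve_fit

def p_3pl(t, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (t - b)))

def fit_3pl_to_curve(t_grid, p_hat):
    """Find the closest 3PL curve to the smoothed ICC (pairs of t and P_hat(t))."""
    (a, b, c), _ = curve_fit(
        p_3pl, t_grid, p_hat,
        p0=[1.0, 0.0, 0.1],                        # starting values for a, b, c
        bounds=([0.05, -4.0, 0.0], [5.0, 4.0, 0.5]),
    )
    return a, b, c
```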
35. Summary of proposed method
To get 3PL parameter estimates
1. Given the P by I matrix Y, find the P by 1 R by SVD
2. Rank persons based on R
3. Use kernel regression with the ranks to estimate ICC
4. Find the closest 3PL curve to the nonparametric ICC
Benefits
- barely iterative
- fast
- works for large and odd data structures
- easy to parallelize
But how does it compare?
37. More challenging data
11,000 persons and 600 3PL items
Persons respond to 10, 25, or 50 items
Items are responded to about 100, 500, or 1,000 times
In total 314,500 person/item interactions, so matrix is 96% NA.
38. How does SVD-NP perform?
ALS-SVD took 20 seconds and 8 iterations.
Kernel-smoothing and 3PL parameter fitting took 11 seconds.
BILOG would not converge; the non-converged results were far off.
R package ltm took close to 20 hours.
                 SVD-NP                        MMLE (ltm)
          Bias    RMSE   Cor(T, E)      Bias    RMSE   Cor(T, E)
Disc a   -0.124   0.567    0.505        0.214   1.023    0.511
Diff b   -0.015   0.515    0.905       -0.049   0.727    0.869
Guess c  -0.042   0.143    0.315       -0.029   0.119    0.408
39. [Figure: parameter recovery plots: bias in a vs. true a, bias in b vs. true b, and estimated c vs. true c, shown for each method]
40. Summary
The SVD-NP:
- Extends NP IRT methods to complex data structures
- Recovers parametric parameters as well as traditional implementations
- Very fast (0.04% of the time of ltm)
Lots to do:
- Stronger foundation
41.
42. [Figure: ALS-SVD and Principal Components - ALS-SVD scores vs. principal component scores]