This presentation covers the intricacies of Item Response Theory. I made this presentation to explain the concepts of IRT to my lab research group at the University of Minnesota. I have taken the contents from various sources, so apologies for the poor design of the presentation.
2. Different Measurement Theories
Classical Test Theory (CTT), or Classical True Score (CTS) theory
Generalizability Theory (G-Theory)
Item Response Theory (IRT)
3. Problems with CTT
True score and error score are theoretical, unobservable constructs
Sample dependence (of both test and testee statistics)
A single, undifferentiated error variance
No account of the interaction of error variances
A single SEM across all ability levels
4. Generalizability Theory (An Extension of CTT)
G-Theory advantages: sources and interaction of variances are accounted for
G-Theory problems: still sample dependent, with a single SEM
5. IRT or Latent Trait Theory
Item response theory (IRT) is an approach used to estimate how much of a latent trait an individual possesses. The theory aims to link individuals’ observed performances to a location on an underlying continuum of the unobservable trait. Because the trait is unobservable, IRT is also referred to as latent trait theory.
IRT can be used to link observable performances to various types of underlying traits.
6. Latent variables, constructs, or underlying traits
second language listening ability
English reading ability
test anxiety
7. Four Advantages of IRT:
1. Because ability estimates are drawn from the population of interest, they are group independent. This means that ability estimates are not dependent on the particular group of test takers that complete the assessment.
2. IRT can be used to aid in designing instruments that target specific ability levels based on the TIF. Using IRT item difficulty parameters makes it possible to design items with difficulty levels near the desired cut-score, which increases the accuracy of decisions at this crucial ability location.
8. Advantages of IRT:
3. IRT provides information about various aspects of the assessment process, including items, raters, and test takers, which can be useful for test development. For instance, raters who have inconsistent rating patterns or who are too lenient can be identified. These raters can then be given specific feedback on how to improve their rating behavior.
4. Test takers do not need to take the same items to be meaningfully compared on the construct of interest (fairness).
9. Lack of widespread use is likely due to practical and technical disadvantages of IRT when compared to CTT:
1. The necessary assumptions underlying IRT may not hold with many language assessment data sets.
2. Lack of agreement on an appropriate algorithm to represent IRT-based test scores (to users) leads to distrust of IRT techniques.
3. The somewhat technical math underlying IRT models is intimidating to many.
10. Lack of widespread use is likely due to practical and technical disadvantages of IRT when compared to CTT:
4. The relatively large sample sizes required for parameter estimation are not available for many assessment projects.
5. Although IRT software packages continue to become more user friendly, most have steep learning curves, which can discourage fledgling test developers and researchers.
11. History:
Testing stretches from “ancient Babylon, to the Greek philosophers, to the adventurers of the Renaissance.”
Current IRT practices can be traced back to two separate lines of development:
1) A method of scaling psychological and educational tests provided the “intimations” of IRT for one line of development.
Frederic Lord (1952) provided the foundations of IRT as a measurement theory by outlining assumptions and providing detailed models.
12. History:
Lord and Novick’s (1968) monumental textbook, Statistical Theories of Mental Test Scores, outlined the principles of IRT.
2) Georg Rasch (1960), a Danish mathematician, focused on the use of probability to separate test taker ability and item difficulty.
Wright and his graduate students are credited with many of the developments of the family of Rasch models.
13. The two development lines:
They have led to quite similar practices, with one major difference:
Rasch models are prescriptive: if data do not fit the model, the data must be edited or discarded.
The other approach (derived from Lord’s work) promotes a descriptive philosophy. Under this view, a model is built that best describes the characteristics of the data. If the model does not fit the data, the model is adapted until it can account for the data.
14. History:
The first article in the journal Language Testing, by Grant Henning (1984), discussed the “advantages of latent trait measurement in language testing.”
About a decade after IRT appeared in the journal Language Testing, an influential book on the subject was written by Tim McNamara (1996), Measuring Second Language Performance. It provided an introduction to the many-facet Rasch model and the FACETS software used for estimating ability on performance-based assessments.
Studies which used MFRM began to appear in the language testing literature soon after McNamara’s publication.
15. Assumptions underlying IRT models
1. Local independence:
Each item should be assessed independently of all other items. The assumption of local independence could be violated on a reading test when the question or answer options for one item provide information that may be helpful for correctly answering another item about the same passage.
16. Assumptions underlying IRT models
2. Unidimensionality:
In a unidimensional data set, a single ability can account for the differences in scores. For example, a second language listening test would need to be constructed so that only listening ability underlies test takers’ responses to the test items. A violation of this assumption would be the inclusion of an item that measured both the targeted ability of listening and reading ability not required for listening comprehension.
17. Assumptions underlying IRT models
3. Motivation, sometimes referred to as certainty of response:
Test takers make an effort to demonstrate the level of ability that they possess when they complete the assessment (Osterlind, 2010). Test takers must try to answer all questions correctly because the probability of a correct response in IRT is directly related to their ability. This assumption is often violated when researchers recruit test takers for a study and there is little or no incentive for the test takers to offer their best effort.
18. Assumptions underlying IRT models
It is important to bear in mind that almost all data will violate one or more of the IRT assumptions to some extent. It is the degree to which such violations occur that determines how meaningful the resulting analysis is (de Ayala, 2009).
19. How to assess assumptions:
Sample size:
In general, smaller samples provide less accurate parameter estimates, and models with more parameters require larger samples for accurate estimates. A minimum of about 100 cases is required for most testing contexts when the simplest model, the 1PL Rasch model, is used (McNamara, 1996). As a general rule, de Ayala (2009) recommends that the starting point for determining sample size should be a few hundred.
21. IRT Parameters
“Parameter” is used in IRT to indicate a characteristic of a test’s stimuli.
1. Item Parameters
a) Item Characteristic Curve (ICC): Difficulty (b), Discrimination (a), Guessing Factor (c)
b) Item Information Function (IIF)
2. Test Parameter
a) Test Information Function (TIF)
3. Ability Parameter (θ)
22. A test taker with an ability of 0 logits would have a 50% chance of correctly answering an item with a difficulty level of 0 logits.
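The 50% claim above follows directly from the logistic form of the Rasch model; a minimal sketch (the ability and difficulty values are illustrative):

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch (1PL) model:
    P = 1 / (1 + exp(-(theta - b))), with theta and b in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A test taker at 0 logits facing an item of difficulty 0 logits:
print(rasch_probability(0.0, 0.0))            # 0.5

# Ability one logit above the item's difficulty raises the probability:
print(round(rasch_probability(1.0, 0.0), 3))  # 0.731
```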
23. ICC
The probability of a test taker correctly responding to an item is presented on the vertical axis. This scale ranges from zero probability at the bottom to absolute probability at the top.
The horizontal axis displays the estimated ability level of test takers in relation to item difficulties, with the least at the far left and the most at the far right. The measurement unit of the scale is the logit, and it is set to have a center point of 0.
24. ICC
ICCs express the relationship between the probability of a test taker correctly answering each item and a test taker’s ability. As a test taker’s ability level increases, moving from left to right along the horizontal axis, the probability of correctly answering each item increases, moving from the bottom to the top of the vertical axis.
25. ICC
The ICCs are somewhat S-shaped, meaning the probability of a correct response changes considerably over a small ability level range.
Test takers with abilities ranging from -3 to -1 have less than a 0.2 probability of answering the item correctly.
For test takers with ability levels in the middle of the scale, between roughly -1 and +1, the probability of correctly responding to that item changes from quite low, about 0.1, to quite high, about 0.9.
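The S-shape can be traced numerically. This sketch uses a two-parameter logistic curve with illustrative values (a = 1.5, b = 0); the exact probabilities depend on the item's parameters:

```python
import math

def icc(theta, a=1.5, b=0.0):
    """Two-parameter logistic ICC; a=1.5 and b=0 are illustrative values."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Little change in the tails, rapid change near the item difficulty b:
for theta in (-3, -1, 0, 1, 3):
    print(theta, round(icc(theta), 3))
```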
27. All ICCs have the same shape (discrimination) but different location indices:
Left ICC: easy item
Right ICC: hard item
At the ability level matching an item’s difficulty, test takers respond correctly roughly half of the time and incorrectly the other half, so they have about a 0.5 probability of answering these items successfully. By capitalizing on these probabilities, the test taker’s ability can be defined by the items that are at this level of difficulty for the test taker.
29. Figure 3
All curves have the same level of difficulty but different levels of discrimination:
Upper curve: highest discrimination; a short distance to the left or right of the middle produces a much different probability (the curve is steep).
Middle curve: a moderate level of discrimination.
Lower curve: a very small slope; the probability changes only slightly with movement to the left or right of the 0.5 point.
30. Some issues about the ICC
When a is less than moderate, the ICC is nearly linear and flat.
When a is more than moderate, the ICC is likely to be steep in the middle section.
a and b are independent of each other.
A horizontal line as an ICC means no discrimination and undefined difficulty.
The probability of 0.5 corresponds to b: in easy items it occurs at a low ability level, and in hard ones it occurs at a high ability level.
31. Some issues about the ICC
When the item is hard, most of the ICC has a probability of correct response less than 0.5.
When the item is easy, most of the ICC has a probability of correct response larger than 0.5.
32. Bear in mind
The figures show a range of ability from -3 to +3.
The theoretical range of ability is from negative infinity to positive infinity.
All ICCs become asymptotic to a probability of zero at one tail and one at the other tail.
The limited range is necessary to fit the curves on the computer screen.
36. A perfectly discriminating item’s ICC is a vertical line along the ability scale (here at 1.5).
It is ideal for distinguishing between examinees with abilities above and below 1.5, but it offers no discrimination among examinees who are all below, or all above, 1.5.
37. Different IRT Models

Model: 1-Parameter Logistic (1PL) Model / Rasch Model
Item Format: Dichotomous
Features: Discrimination power equal across all items; difficulty varies across items

Model: 2-Parameter Logistic (2PL) Model
Item Format: Dichotomous
Features: Discrimination and difficulty parameters vary across items

Model: 3-Parameter Logistic (3PL) Model
Item Format: Dichotomous
Features: Also includes a pseudo-guessing parameter
38. ICC models
A model is a mathematical equation in which independent variables are combined to optimally predict dependent variables.
Each of these models has a particular mathematical equation and is used to estimate individuals’ underlying traits on language ability constructs.
The standard mathematical model for the ICC is the cumulative form of the logistic function.
It was first derived in 1844 and has been widely used in the biological sciences to model the growth of plants and animals from birth to maturity.
It was first used for the ICC in the late 1950s because of its simplicity.
39. The parameter a is multiplied by 1.70 to obtain the corresponding logistic value.
The logit is L = a(θ - b).
The discrimination parameter is proportional to the slope of the ICC.
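The 1.70 constant exists because a logistic curve scaled by D ≈ 1.702 closely tracks the normal ogive, the original ICC model. A quick numerical check of that approximation:

```python
import math

def normal_ogive(x):
    """Standard normal CDF, the original ICC model."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def scaled_logistic(x, D=1.702):
    """Logistic function with the scaling constant D."""
    return 1.0 / (1.0 + math.exp(-D * x))

# Across the ability scale the two curves never differ by more than ~0.01:
max_gap = max(abs(normal_ogive(x / 10) - scaled_logistic(x / 10))
              for x in range(-40, 41))
print(round(max_gap, 4))
```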
40. The most fundamental IRT model: the Rasch, or 1-parameter logistic (1PL), model
Relating test taker ability to the difficulty of items makes it possible to mathematically model the probability that a test taker will respond correctly to an item.
43. It was first published by the Danish mathematician Georg Rasch.
Under this model, the discrimination parameter of the two-parameter logistic model is fixed at a value of a = 1.0 for all items; only the difficulty parameter can take on different values. Because of this, the Rasch model is often referred to as the one-parameter logistic model.
45. The probability of a correct response includes a small component that is due to guessing.
Neither of the two previous item characteristic curve models took the guessing phenomenon into consideration. Birnbaum (1968) modified the two-parameter logistic model to include a parameter that represents the contribution of guessing to the probability of correct response. Unfortunately, in so doing, some of the nice mathematical properties of the logistic function were lost. Nevertheless, the resulting model has become known as the three-parameter logistic model, even though it technically is no longer a logistic model. The equation for the three-parameter model is:
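In standard notation, Birnbaum’s three-parameter logistic model is:

```latex
P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}
```

Here a is the discrimination, b the difficulty, c the pseudo-guessing (lower-asymptote) parameter, and θ the examinee’s ability. Because c > 0, even the lowest-ability examinees retain a nonzero chance of a correct response.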
48. Range of parameters:
-3 < a < +3
-2.80 < b < +2.80
0 < c < 1; values above 0.35 are not acceptable
Item parameters are not dependent upon the ability level of examinees; they are group invariant. The parameters are properties of the items, not of the group.
50. Positive and Negative Discrimination
Positive: the probability of a correct response increases as the ability level increases.
Negative: the probability of a correct response decreases as the ability level increases from low to high.
51. Items with negative
discrimination occur in two
ways:
First, the incorrect response to a two-choice
item will always have a negative
discrimination parameter if the correct
response has a positive value.
Second, something may be wrong with the
item: either it is poorly written or some
misinformation is prevalent among the
high-ability students.
52. AN ITEM INFORMATION FUNCTION (IIF)
GIVING MAXIMUM INFORMATION FOR
AVERAGE ABILITY LEVEL
55. TIF
Information about all of the items on a test
is often combined and presented in test
information function (TIF) plots.
The TIF is the sum of the item information
at each ability level. The TIF can be used to
help test developers locate areas on the
ability continuum where there are few
items; items can then be written to target
these ability levels.
56. Steps in running IRT analysis
Data entry
Model selection through scale and fit
analyses
Estimating and inspecting
1. ICC
2. IIF
3. DIF (if needed)
4. TIF
57. Many-facet Rasch measurement
model
The many-facet Rasch measurement (MFRM)
model has been used in the language testing
field to model and adjust for various assessment
characteristics on performance-based tests.
Facets such as:
1. test taker ability
2. item difficulty
3. raters
4. scales
58. Many-facet Rasch measurement
model
The scores may be affected by factors like
rater severity, the difficulty of the prompt, or
the time of day that the test is administered.
MFRM can be used to identify such effects
and adjust the scores to compensate for
them.
59. The difference between the MFRM and the
1PL Rasch model for items scored as correct
or incorrect is that the MFRM adds facets such as:
Rater severity:
how strict a rater is in
assigning scores to test takers.
Rating step difficulty:
how much ability is required to move from one
step on a rating scale to the next.
For example, on a five-point writing scale with 1
indicating least proficient and 5 most proficient, the
level of ability required to move from a rating of 1 to a
rating of 2, or between any two adjacent points, is the
rating step difficulty.
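Rating step difficulty can be sketched with the Rasch rating scale model, a common building block of MFRM: the probability of each category depends on the cumulative steps theta − b − tau_j. The step values below are illustrative assumptions, not calibrated estimates:

```python
import math

def category_probabilities(theta, b, taus):
    """Rasch rating scale model sketch: taus are the step difficulties
    between adjacent categories of the rating scale."""
    # Unnormalised log-odds of reaching each successive category
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - b - tau))
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Five categories (ratings 1-5) require four step difficulties:
probs = category_probabilities(theta=0.0, b=0.0, taus=[-1.5, -0.5, 0.5, 1.5])
assert abs(sum(probs) - 1.0) < 1e-9
# With symmetric steps, an average examinee most likely receives a middle rating:
assert max(probs) == probs[2]
```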
60. A test taker with an ability level of 0 would
have virtually no probability of a rating of 1
or 5, a little above a 0.2 probability of a
rating of 2, and about a 0.7 probability of a
rating of 3.
61. CRC
Category response curves (CRCs) are analogous to
ICCs: each curve shows the probability of being
assigned a particular rating on the scale (here, a
five-point scale).
The plot indicates that a score of 2 is the most
commonly assigned, since its curve extends the
furthest along the horizontal axis.
Ideally, rating categories should be highly
peaked and equivalent in size and shape to each
other.
Test developers can use the information in the
CRCs to revise rating scales.
62. Uses of MFRM:
Investigating task characteristics and their effects
on various types of performance-based
assessments.
Investigating the effects of rater bias, rater severity,
rater training, rater feedback, task difficulty, and
rating scale reliability.
63. IRT Applications
Item banking and calibration
Adaptive tests (CAT/IBAT)
Differential item functioning
(DIF) studies
Test equating
64. CAT
Applications of IRT to computer adaptive testing (CAT)
are not commonly reported in the language
assessment literature, likely because of the large
number of items and test takers required for its
feasibility. However, it is used in some large-scale
language assessments and is considered one of the
most promising applications of IRT.
A computer is programmed to deliver items
increasingly closer to the test takers’ ability levels. In its
simplest form, if a test taker answers an item correctly,
the IRT-based algorithm assigns the test taker a more
difficult item, whereas, if the test taker answers an
item incorrectly, the next item will be easier. The test is
complete when a predetermined level of precision of
locating the test taker’s ability level has been achieved.
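The simplest form described above can be sketched as a toy loop. This is a bisection-style caricature, not a real maximum-likelihood CAT engine, and the item bank and stopping rule here are invented for illustration:

```python
import math

def simple_cat(responder, bank, start_theta=0.0, step=1.0, n_items=6):
    """Toy CAT: move the ability estimate up after a correct answer and
    down after an incorrect one, halving the step each time, and always
    administering the unused item whose difficulty is closest to the
    current estimate."""
    theta, unused = start_theta, list(bank)
    for _ in range(n_items):
        item = min(unused, key=lambda b: abs(b - theta))
        unused.remove(item)
        correct = responder(item)
        theta += step if correct else -step
        step /= 2.0
    return theta

# Hypothetical bank of item difficulties:
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
# A responder who always answers correctly is pushed toward the hard end:
assert simple_cat(lambda b: True, bank) > 0
# One who always answers incorrectly is pushed toward the easy end:
assert simple_cat(lambda b: False, bank) < 0
```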
65. Differential Item Functioning
(DIF)
Differential Item Functioning is said
to occur when the probability of
answering an item correctly is not
the same for examinees who are on
the same ability level but belong to
different groups.
66. Differential Item Functioning
(DIF)
Language testers also use IRT techniques to
identify and understand possible differences in
the way items function for different groups of
test takers. Differential item functioning (DIF),
which can be an indicator of biased test items,
exists if test takers from different groups with
equal ability do not have the same chance of
answering an item correctly. IRT DIF methods
compare ICCs for the same item in the two
groups of interest.
67. Differential Item Functioning
(DIF)
DIF is an extremely useful and rigorous method
for studying group differences:
Sex Differences
Race/Ethnic Differences
Academic background differences
Socioeconomic status differences
Cross-cultural and Cross-national studies
It helps determine whether differences are an
artifact of measurement or reflect something
different about the construct in each population.
68. Bias & DIF
The logical first step in detecting bias is to find
items where one group performs much better
than the other group: such items function
differently for the two groups and this is known
as Differential Item Functioning (DIF).
DIF is a necessary but not sufficient condition for
bias: bias only exists if the difference is
illegitimate, i.e., if both groups should be
performing equally well on the item.
69. Bias & DIF (Continued)
An item may show DIF but not be biased if the
difference is due to actual differences in the groups'
ability needed to answer the item, e.g., if one group
is high proficiency and the other low proficiency: the
low proficiency group would necessarily score much
lower.
Only where the difference is caused by construct-
irrelevant factors can DIF be viewed as bias. In such
cases, the item measures another construct, in
addition to the one it is supposed to measure.
Bias is usually a characteristic of a whole test,
whereas DIF is a characteristic of an individual item.
70. An example of an item that displays
uniform DIF
The item favors males at every ability level;
only the difficulty parameters differ across the groups.
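Uniform DIF can be sketched with two 2PL ICCs that share a discrimination parameter but differ in difficulty: the curves never cross, so one group is favored at every ability level. The parameter values here are hypothetical:

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Uniform DIF: same discrimination, different difficulty across groups.
a = 1.2
b_male, b_female = 0.0, 0.6   # hypothetical group calibrations

# The item is easier for the male group at every ability level checked:
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    assert p_2pl(theta, a, b_male) > p_2pl(theta, a, b_female)
```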
71. Comparison of CTT and IRT
(Embretson & Reise, 2000)
1. SEM: CTT assumes a single SEM across all ability levels; IRT allows the SEM to vary across ability levels.
2. Test length: under CTT, longer tests are more reliable; under IRT, shorter tests can be equally or even more reliable (TIF).
3. Score comparison: CTT is optimal across parallel forms; IRT is optimal when test difficulty varies between persons.
4. Sampling: CTT requires a representative sample for unbiased estimates; IRT works even with an unrepresentative sample.
72. Continued…
5. Score meaning: CTT scores are meaningful against a norm; IRT scores are meaningful against the distance from items.
6. Interval-scale properties: achieved through a normal distribution in CTT; achieved by applying a justifiable measurement model in IRT.
7. Mixed item formats: lead to imbalance in CTT; no problem in IRT.
8. Change scores: not comparable under CTT when initial scores differ; no problem in IRT.
73. Continued…
9. Factor analysis: produces artifacts under CTT; produces full-information FA under IRT.
10. Item stimulus features: not important compared to psychometric properties in CTT; directly related to psychometric properties in IRT.
11. Graphic displays of item and test parameters: none in CTT; available in IRT.
* All in all, CTT is better and more practical for class-based, low-stakes tests, while IRT is much more advantageous and preferable for high-stakes, large-sample tests.
* IRT is the only choice for adaptive tests.
74. Future research:
Techniques such as item bundling (to meet
the assumption of local independence)
The development of techniques that require
fewer cases for accurate parameter
estimation
Guidance on using IRT (written resources
specific to the needs of language testers)
Computer-friendly programs, so that the use
of IRT techniques becomes more prevalent
in the field
76. References:
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Baker, F. B. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.
Fulcher, G., & Davidson, F. (2012). The Routledge handbook of language testing. New York: Routledge.