ITEM ANALYSIS 2023.pptx uses for exam development especially national examination development.

Educational Assessment and Examinations
Service (EAES)
August, 2023
Dembel View Hotel, Adama

Outline
• Introduction
• Basic concepts of CTT
• Basic concepts of IRT
• Reliability and Validity
• Differential Item Functioning
• Item analysis using software

Introduction
• Testing is concerned with turning performance into numbers (Baxten, 1998).
• About 13% of students who fail in class do so because of faulty test items (World Watch, 2005).
• Masters et al. (2001), in the US, identified 2,233 minor and major violations of item-writing guidelines.
• An estimated 90% of test items are of poor quality (Wilen, W.W, 1992).
• Teachers have difficulty developing plausible distractors in MCQs, and only 52% of all distractors were functioning effectively (Tarrant, Ware & Mohammed, 2009).
• What about the quality of tests in Ethiopia?
• Thus, item analysis is very important for maintaining the quality of a test.

Cont.…
Item Analysis
• A method used to evaluate test items, typically for the purpose of test construction and revision.
• A process that examines students' responses to individual test items and to the test as a whole.
• Useful for improving items and eliminating ambiguous or misleading items.
• Helps to identify specific areas of subject content that need greater emphasis or clarity.
• Suggests ways of improving the measurement quality of a test.
• Valuable for increasing skill in test construction.

Cont.…
Purpose of Item Analysis
• Helps to bring a match between what is taught and what is assessed. Care should be taken over the sampling of items and their difficulty level.
• Helps to understand and make decisions about poorly performing items.
• Helps to improve test items and identify unfair or biased items.
• Helps to carefully align instruction with the grade-level expectations from which standardized test items are derived.

Item Analysis Methods
Qualitative
• A non-numerical method for analyzing test items that does not use students' responses.
• Instead, it considers test objectives, content validity, and technical item quality, such as:
  - Matching items and objectives
  - Editing poorly written items
  - Improving the content validity of the test
  - Evaluating the items against the table of specification and item-writing guidelines
Quantitative
• A numerical method for analyzing test items based on students' responses.
• It includes:
  - Difficulty/b-parameter
  - Discrimination/a-parameter
  - Option analysis
  - Reliability
  - Differential item functioning, etc.

Cont.…
Item Analysis Methods
Quantitative Qualitative
Difficulty/b-parameter
Discrimination/a-parameter
Option Analysis
Reliability
Validity
Differential Item Functioning/DIF

Basic concepts of Classical Test Theory and Item Response Theory
• CTT and IRT are the two primary psychometric paradigms.
• Both are mathematical approaches to analyzing test items.
• They differ substantially in substance and complexity, even though they nominally do the same thing.
• There is no single best answer to the question of whether to use CTT or IRT.
• In many cases both are necessary, and the choice depends on the purpose of the item analysis.
• However, CTT and IRT differ in several respects.

Comparison of CTT vs. IRT
Feature | CTT | IRT
Ability-item relationship | Linear | Logistic curve
Invariance of item & person statistics | No | Yes
Difficulty | P-value | b-parameter
Discrimination | D (item-total) | a-parameter
Adaptive testing | Rare | Suitable
Reliability | Depends on test length | Does not depend on test length
Equating | Complicated | Automatic
Item-model fit | No | Yes
Sample size needed | Small | Large
Option analysis | Preferable | Rare

Classical Test Theory and Item Response Theory

Classical Test Theory (CTT)
• CTT analyses are the easiest and most widely used form of analysis.
• The statistics can be computed by readily available statistical packages (or even by hand).
• They are performed on the test as a whole rather than on individual items.
• Although item statistics can be generated, they apply only to that group of students on that collection of items.
• CTT assumes that each person has a true score (T), that would be
obtained if there were no errors (E) in measurement.
• Unfortunately, test users never observe a person's true score, only an
observed score (X)
• Thus, CTT is based on the true score model: $X = T + E$
• In CTT we assume that the error (E):
  - is normally distributed
  - is uncorrelated with the true score
  - has a mean of zero

Item Difficulty and Discrimination in CTT
Item Difficulty Level (P)
• The percentage of students who answered the item correctly
• Calculation:
  $P = \frac{\text{number of students answering correctly}}{\text{total number of students}}$
  or, when only the upper and lower scoring groups are analyzed,
  $P = \frac{\text{correct in upper group} + \text{correct in lower group}}{\text{total in both groups}}$
• The range is between 0% and 100% (0.0 to 1.00)
• The higher the value, the easier the item; the lower the value, the harder the item
• An item with a p value of 0.0 or 1.0 does not contribute to measuring individual differences
• The ideal value of item difficulty is 0.50
• A small number of easy or difficult items may be included to motivate or differentiate the test takers
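The difficulty calculation can be sketched in a few lines of Python. This is only an illustration: the function names and the 30-student data set are hypothetical, and items are assumed to be scored 0 (wrong) or 1 (right).

```python
def item_difficulty(scores):
    """CTT p-value: proportion of examinees who answered the item correctly.

    `scores` is a list of 0/1 item scores, one per examinee.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


def item_difficulty_from_groups(upper_correct, lower_correct, total_in_groups):
    """p-value from upper/lower group counts: (U + L) / total in both groups."""
    if total_in_groups <= 0:
        raise ValueError("total_in_groups must be positive")
    return (upper_correct + lower_correct) / total_in_groups


# Hypothetical data: 30 examinees, 18 of whom answered the item correctly.
scores = [1] * 18 + [0] * 12
print(item_difficulty(scores))                  # 0.6
print(item_difficulty_from_groups(11, 7, 30))   # 0.6
```

Both forms give the same value when the upper and lower groups together make up the whole class.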

The distribution of items by difficulty level in a test
Author | Type of items | Percentage
Sugianto (2020) | Very easy | 10%
 | Easy | 20%
 | Moderate | 40%
 | Difficult | 20%
 | Challenging | 10%
Arifin (2009): three options, based on the purpose of the test
Option | Difficult | Medium | Easy
1 | 25% | 50% | 25%
2 | 20% | 60% | 20%
3 | 15% | 70% | 15%

Interpretation of difficulty index (p-value).
Author Difficulty index Interpretation
Uddin et al. (2020) >80% Easy
30–80% Moderate
<30% Difficult
Kaur, Singla et al. (2016) >80 Easy
40–80 Moderate
<39 Difficult
Sugianto (2020) 90% Easy
50% Moderate
10% Difficult
Jaipurkar et al. (2021) >70% Too easy
50–60% Excellent/ideal
30–70% Good/acceptable/average
Obon and Rey (2019) > 0.76 Easy
0.26–0.75 Right difficult (Retain)
0–0.25 Difficult (Revise/Discard)
Bhat and Prasad (2021) >70% Easy
30–70% Good
<30% Difficult

Item Discrimination Power (D)
• Ability of items to elicit different responses from students with
different abilities/skills.
• The computed difference between the percentage of high achievers
and the percentage of low achievers who got the item right.
• The maximum range of the Discrimination Index is from -1.0 to +1.0
• The higher the value of D, the more adequately the item discriminates
(highest value is 1.0)
• Values close to 0 mean that most students performed the same on an item
  $D = \frac{\text{correct in upper group} - \text{correct in lower group}}{\frac{1}{2} \times \text{total number of students}}$
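A minimal Python sketch of the discrimination index, assuming the class has been split into equal upper and lower halves by total test score. The function name and the counts used are illustrative only.

```python
def discrimination_index(upper_correct, lower_correct, total_students):
    """D = (correct in upper group - correct in lower group) / (total / 2)."""
    if total_students <= 0:
        raise ValueError("total_students must be positive")
    return (upper_correct - lower_correct) / (total_students / 2)


# Illustrative counts: 30 students, 11 correct in the upper half, 7 in the lower half.
d = discrimination_index(11, 7, 30)
print(round(d, 2))  # 0.27
```

A negative result means the lower group outperformed the upper group on the item.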

Discrimination Index (D)
Positive Discrimination Index
• Those who did well on the overall test chose the correct answer for a particular item more often than those who did poorly on the overall test.
Negative Discrimination Index
• Those who did poorly on the overall test chose the correct answer for a particular item more often than those who did well on the overall test.
Zero Discrimination Index
• Those who did well and those who did poorly on the overall test chose the correct answer for a particular item with equal frequency.
Summary
• Negative discriminators (-): this is never what we want
• Non-discriminators (0): this may or may not be what we want
• Positive discriminators (+): this is usually what we want

Interpretation of Discrimination power (D)
Author Discrimination power Interpretation
Elfaki, Bahamdan et
al. [2015]
• ≥0.35 Excellent
• 0.25–0.34 Good
• 0.21–0.24 Acceptable
• ≤ 0.20 Poor
Obon and Rey [2019]
• ≥ 0.50 Very Good
• 0.40–0.49 Good (Very Usable)
• 0.30–0.39 Fair Quality (Usable Item)
• 0.20–0.29 Poor (Revised)
• ≤ 0.20 Very Poor (Critically Revised/Discard)
Bhat and Prasad
[2021]
• > 0.35 Excellent
• 0.2–0.35 Good
• < 0.2 Poor
Sugianto [2020]
• >0.40 Very good
• 0.30–0.39 Good
• 0.20–0.29 Marginal & need improvement
• <0.19 Poor, rejected/ improved by revision

Interpretation of Discrimination power(D)
Author Discrimination power Interpretation
Aljehani, Pullishery et al. [2020] and Sharma [2021]
• ≥ 0.40 Very good item (Keep)
• 0.30–0.39 Good item (Keep)
• 0.20–0.29 Moderate & fair (Keep)
• < 0.20 Marginal, Revise/Discard
• Negative Worst & Definitely Discard
Ramzan, Imran et al.
[2020]
• > 0.30 Excellent
• 0.20–0.29 Good
• 0–0.19 Poor
• <0 Defective & Discard
Uddin et al. [2020]
• ≥ 0.35 Excellent
• 0.25–0.34 Good
• 0.21–0.24 Acceptable
• < 0.20 Poor

Item Analysis for Partial Credit Items
Partial credit items (short answer, essay) can be analyzed using the following formulas:
$P = \frac{UGSP + LGSP}{WI \times T} \times 100$
$D = \frac{UGSP - LGSP}{WI \times \frac{1}{2} T}$
Where:
• UGSP: upper group sum of points on the item
• LGSP: lower group sum of points on the item
• WI: weight of the item
• T: total number of students

Example
• The maximum score on a short-answer item was 3. In the upper group, 6 students scored 3 points and 4 students scored 2 points. In the lower group, 4 students scored 2 points and 6 students scored 1 point. Find P and D for the item.
• $P = \frac{UGSP + LGSP}{WI \times T} = \frac{(6 \times 3 + 4 \times 2) + (4 \times 2 + 6 \times 1)}{3 \times 20} = \frac{40}{60} = 0.67$, i.e. 67%
• $D = \frac{UGSP - LGSP}{WI \times \frac{1}{2} T} = \frac{26 - 14}{3 \times 10} = \frac{12}{30} = 0.40$
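The same arithmetic can be checked with a short Python sketch. The function name is hypothetical; it assumes equal-sized upper and lower groups and returns P as a proportion rather than a percentage. With the scores in the example, UGSP = 26 and LGSP = 14.

```python
def partial_credit_stats(upper_points, lower_points, item_weight):
    """Return (P, D) for a partial-credit item.

    upper_points / lower_points: points earned on the item by each student
    in the upper and lower groups; item_weight: maximum points (WI).
    """
    ugsp = sum(upper_points)                       # upper group sum of points
    lgsp = sum(lower_points)                       # lower group sum of points
    total = len(upper_points) + len(lower_points)  # T
    p = (ugsp + lgsp) / (item_weight * total)
    d = (ugsp - lgsp) / (item_weight * total / 2)
    return p, d


upper = [3] * 6 + [2] * 4   # 6 students scored 3, 4 students scored 2
lower = [2] * 4 + [1] * 6   # 4 students scored 2, 6 students scored 1
p, d = partial_credit_stats(upper, lower, item_weight=3)
print(round(p, 2), round(d, 2))  # 0.67 0.4
```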

Option Analysis
• Analysis of how well the high and low groups respond to the item's options.
• Compare the performance of the highest- and lowest-scoring students on the distractor options.
• Fewer of the top performers than of the bottom performers should choose each of the distractors as their answer.
• A good distractor attracts more students from the lower group than from the upper group.
• It is not desirable to have one of the distractors chosen more often than the correct answer.
• If so, this distractor may be too similar to the correct answer and/or there may be something in either the stem or the alternatives that is misleading.
• At the key, the difference between upper and lower performers is expected to be positive, while at the distractors it is expected to be negative.

Activity 1
Consider the case below.
Suppose your students chose the following options on a four-alternative multiple-choice item. Let C be the correct answer.
Item X
A | B | C* | D
3 | 0 | 18 | 9
Questions
• How does this information help us?
• Is the item too difficult/easy for the students?
• What is the difficulty level value?
• What is the discrimination index value?
• Are the distractors of the item effective?
• Should this item be eliminated?

Solving the difficulty index for Item X
Item X
A | B | C* | D
3 | 0 | 18 | 9
$p = \frac{\text{number selecting the correct answer}}{\text{total number taking the test}} = \frac{18}{30} = 0.60$
• Thus, the difficulty level of the item is 0.60 (60%); the item is moderate.
According to Bhat and Prasad (2021):
• If p > 0.70, the item is considered relatively easy.
• If p < 0.30, the item is considered relatively difficult.

Solving the discrimination index for Item X, if the numbers of correct answers in the upper and lower groups are 11 and 7 respectively
Item X
A | B | C* | D
3 | 0 | 18 | 9
$D = \frac{\text{correct in upper group} - \text{correct in lower group}}{\frac{1}{2} \times \text{total}} = \frac{11 - 7}{15} = \frac{4}{15} = 0.27$
• The discrimination power of item X is 0.27 and positive.
• More students who did well on the overall test answered the item correctly than students who did poorly on the overall test.
• Thus, it has good discrimination power.
According to Bhat and Prasad (2021):
• If the D index is 0.20 to 0.35, the item has good discrimination power.

Implication
Item X
A | B | C* | D
3 | 0 | 18 | 9
Difficulty Level (p) = 0.60
Discrimination Index (D) = 0.27
1. Should item X be eliminated?
NO. Item X is considered a moderately difficult item that has positive (desirable) discrimination ability.
2. Should any distractor(s) be modified?
YES.
• Option B ought to be modified or replaced, as no one chose it.
• Option A also needs revision.
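A rough way to screen distractors from option counts alone is sketched below in Python. The 5% cut-off for a "non-functioning" distractor is an assumption made for this illustration, not a rule stated in these slides, and the function name is hypothetical. A fuller option analysis would also compare upper and lower group choices for each option.

```python
def option_analysis(counts, key, min_share=0.05):
    """Report the share of examinees choosing each option.

    counts: mapping of option label -> number of examinees choosing it.
    key: label of the correct option.
    min_share: assumed cut-off below which a distractor is flagged.
    """
    total = sum(counts.values())
    report = {}
    for option, n in counts.items():
        share = n / total
        if option == key:
            status = "key"
        elif share < min_share:
            status = "non-functioning distractor"
        else:
            status = "functioning distractor"
        report[option] = (share, status)
    return report


item_x = {"A": 3, "B": 0, "C": 18, "D": 9}
report = option_analysis(item_x, key="C")
for option, (share, status) in report.items():
    print(option, f"{share:.2f}", status)
```

With these counts option B is flagged because nobody chose it. Option A passes the assumed 5% screen, so whether it needs revision remains a judgment call.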

Item Response Theory (IRT)
• IRT – refers to a family of latent trait models used to establish
psychometric properties of items and scales
• Sometimes known as modern psychometrics because in large-scale
assessment, testing programs and professional testing IRT has
almost completely replaced CTT
• IRT has many advantages over CTT that have brought IRT into
more frequent use
• Three basic components of IRT are:
  - Item Response Function (IRF): a mathematical function that relates the latent trait to the probability of endorsing an item
  - Item Information Function: an indication of item quality; an item's ability to differentiate among respondents
  - Invariance: position on the latent trait can be estimated from any items with known IRFs, and item characteristics are population independent within a linear transformation

Cont.…
Item Response Function (IRF)
• It characterizes the relation between a latent variable/ability and
the probability of endorsing an item.
• The IRF models the relationship between examinee trait level,
item properties and the probability of endorsing the item.
• Examinee trait level is signified by the Greek letter theta ($\theta$) and typically has a mean of 0 and a standard deviation of 1
• IRFs can be drawn as Item Characteristic Curves (ICCs), which are graphs of the probability of endorsing the item as a function of the respondent's ability

IRF- Item Parameters in IRT
Location – b-parameter
• An item’s location is defined as the amount of the latent trait
needed to have a .5 probability of endorsing the item.
• The higher the “b” parameter the higher on the trait level a
respondent needs to be in order to endorse the item
• It is analogous to difficulty level in CTT
• Like z-scores, the values of b typically range from -3 to +3
Discrimination/Slope – a-parameter
• Indicates the steepness of the IRF at the item's location
• It indicates how strongly related the item is to the latent trait, like loadings in a factor analysis
• Items with high discrimination are better at differentiating respondents around the location point; the reverse holds for items with low discrimination
• It typically ranges from 0 to 2 and should never be negative.

Cont.…
Pseudo-guessing –c - parameter
• The inclusion of a “c” parameter suggests that respondents very low
on the trait may still choose the correct answer.
• In other words respondents with low trait levels may still have a
small probability of endorsing an item
• This is mostly used with multiple choice testing and the value should
not vary excessively from the reciprocal of the number of choices.
• In general, it is the probability of getting the item correct by guessing
alone and varies from 0 to 1. For instance, c = 0.20 means that at all
ability levels, the probability of getting the item correct by guessing
alone is 0.20
Upper asymptote –d-parameter
• The inclusion of a “d” parameter suggests that respondents very high
on the latent trait are not guaranteed (i.e. have less than 1
probability) to endorse the item
• This often applies to an item that is difficult to endorse

IRT - Logistic models
The 4-parameter logistic model:
$P(X = 1 \mid \theta, a, b, c, d) = c + (d - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$
Where
• $\theta$ represents examinee trait level
• b is the item difficulty that determines the location of the IRF
• a is the item's discrimination that determines the steepness of the IRF
• c is a lower asymptote parameter for the IRF
• d is an upper asymptote parameter for the IRF

Cont.…
The 3-parameter logistic model
• If the upper asymptote parameter is set to 1.0, then the model is
termed a 3PLM.
• In this model, individuals at low trait levels have a non-zero
probability of endorsing the item.
$P(X = 1 \mid \theta, a, b, c) = c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$

Cont.…
• If the lower asymptote parameter is constrained to zero, then the
model is termed a 2PLM.
• In the 2PLM, IRFs vary both in their discrimination and
difficulty (i.e., location) parameters.
$P(X = 1 \mid \theta, a, b) = \frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}$

Cont.…
• If the item discrimination is set to 1.0 the result is a 1PLM
• A 1PLM assumes that all scale items relate to the latent trait equally
and items vary only in difficulty (equivalent to having equal factor
loadings across items).
• Mathematically, the 1PLM, the most basic IRT model, is identical to the Rasch model; however, the two approaches differ in some respects
• In the Rasch approach the model takes precedence, and data that do not fit the model are discarded
• Rasch does not permit abilities to be estimated for extreme items and persons
$P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}$

Activity
• Which item do you think is the most difficult?
• Which item do you think best differentiates the learners?

Cont.….
In IRT:
• If the ability of the student > the difficulty of the item, what do you think P(X = 1) is?
• If the ability of the student < the difficulty of the item, what do you think P(X = 1) is?
• If the ability of the student = the difficulty of the item, what do you think P(X = 1) is?
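These questions can be explored numerically. The sketch below codes the logistic item response function in Python, written in the algebraically equivalent form 1 / (1 + e^(-a(θ - b))). The function name and the parameter values are illustrative assumptions. With the default arguments (a = 1, c = 0, d = 1) it is the 1PLM, and setting a, c or d gives the 2PLM, 3PLM and 4PLM.

```python
import math


def irf(theta, b, a=1.0, c=0.0, d=1.0):
    """Probability of a correct response under the 4PL model.

    theta: examinee trait level; b: difficulty; a: discrimination;
    c: lower asymptote (pseudo-guessing); d: upper asymptote.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (d - c) * logistic


b = 0.5  # hypothetical item difficulty
for theta in (1.5, 0.5, -0.5):  # ability above, equal to, and below b
    print(theta, round(irf(theta, b), 3))
```

When there is no guessing (c = 0) and d = 1, the probability is above 0.5 if ability exceeds difficulty, exactly 0.5 if they are equal, and below 0.5 otherwise. With c > 0 the probability never falls below c.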

Test Characteristic Curve (TCC)
• The test characteristic curve (TCC) is the sum of the item response functions (ICCs).
• A TCC relates the latent trait to the expected number of items answered correctly.
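A short Python sketch of a TCC as the sum of item curves follows. The three items and their (a, b, c) values are hypothetical, and a 3PL form is assumed for each item.

```python
import math


def icc(theta, a, b, c=0.0):
    """3PL item characteristic curve: probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Expected number-correct score at theta: the sum of the item curves."""
    return sum(icc(theta, a, b, c) for a, b, c in items)


# Hypothetical (a, b, c) parameters for a three-item test.
items = [(1.0, -1.0, 0.0), (1.2, 0.0, 0.0), (0.8, 1.0, 0.0)]
for theta in (-3, 0, 3):
    print(theta, round(tcc(theta, items), 2))
```

The curve rises with ability and is bounded by 0 and the number of items.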

Differential Item Functioning (DIF)
What is DIF?
• DIF is a statistical characteristic of an item that shows the extent to
which the item might be measuring different abilities for members
of separate subgroups (gender, location, language, ethnicity etc.)
• DIF occurs when one group of examinees has a different expected
item score than comparable examinees from another group.
• An item is considered free of differential functioning (DIF) if the
item response function is the same across groups (Zwick, 1990).
• DIF means that either the item performs differently or measures
something different. If the item shows DIF it means that the item is
less valid for one subgroup (Steinberg & Thissen, 2006).
• A fundamental aspect of all DIF is the matching of students in the
reference and focal groups on some measure of ability (Clauser &
Mazor, 1998).
• The focal group is the one of interest and usually represents a
minority group, while a reference group represents a larger group.

Types of DIF
Uniform DIF
• DIF is in the same direction across the entire ability spectrum; the item response curves for the two groups do not cross
• DIF involves the location (b) parameters
• DIF is a significant main (group) effect in regression analyses predicting item response
Non-uniform DIF
• An item favors one group at certain ability levels, and other groups at other levels
• DIF involves the discrimination (a) parameters
• DIF is a significant group-by-ability interaction in regressions predicting item response

[Figure panels: Uniform DIF, Non-uniform DIF]

ETS: DIF Classification Levels
ETS rules for classifying the magnitude of DIF, stated in terms of the common odds ratio, are as follows:
• "A" items have:
  (a) a CMH p-value greater than 0.05, or
  (b) a common odds ratio strictly between 0.65 and 1.53.
• "B" items are neither "A" nor "C" items.
• "C" items have:
  (a) a common odds ratio less than 0.53, and the upper bound of the 95% confidence interval for the common odds ratio is less than 0.65, or
  (b) a common odds ratio greater than 1.89, and the lower bound of the 95% confidence interval is greater than 1.53.
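The A/B/C rules above translate directly into a small decision function. The Python sketch below only encodes the thresholds listed on this slide. The function name and the example values are hypothetical, and the odds ratio, confidence interval and p-value are assumed to come from a separate Mantel-Haenszel analysis.

```python
def ets_dif_category(odds_ratio, ci_lower, ci_upper, p_value):
    """Classify DIF magnitude as 'A', 'B' or 'C' from the common odds ratio.

    ci_lower / ci_upper: bounds of the 95% confidence interval for the
    common odds ratio; p_value: CMH test p-value.
    """
    # "A": not statistically significant, or odds ratio close to 1.
    if p_value > 0.05 or 0.65 < odds_ratio < 1.53:
        return "A"
    # "C": large DIF in either direction, with the interval clear of the A band.
    if odds_ratio < 0.53 and ci_upper < 0.65:
        return "C"
    if odds_ratio > 1.89 and ci_lower > 1.53:
        return "C"
    # Everything else is "B".
    return "B"


print(ets_dif_category(1.10, 0.90, 1.30, 0.40))   # A
print(ets_dif_category(1.70, 1.20, 2.40, 0.01))   # B
print(ets_dif_category(2.20, 1.70, 2.90, 0.001))  # C
```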

Reliability
• A reliable test produces results that are accurate and consistent
• The degree to which scores are free of "measurement error" (higher reliability = less measurement error)
• Reliability coefficients range from .00 to 1.00.
• The ideal reliability is > 0.80, and it should be at least 0.70
• Types of reliability
  - Test-retest
  - Parallel forms
  - Split-half
  - Internal consistency
• Can be calculated by:
  - Split half
  - KR20 (Kuder-Richardson Formula 20)
  - KR21 (Kuder-Richardson Formula 21)
  - Cronbach's alpha
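As an illustration of one of these coefficients, KR20 can be computed as k/(k-1) × (1 − Σpq / variance of total scores). The Python sketch below assumes dichotomous 0/1 item scores and uses the population variance of total scores (some texts and programs use the sample variance instead). The function name and the small data set are hypothetical.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    responses: list of rows, one row per examinee, one 0/1 score per item.
    """
    n = len(responses)         # number of examinees
    k = len(responses[0])      # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of total scores (assumption; see lead-in).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n   # item difficulty
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses of 5 examinees to 4 items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(data), 2))  # 0.8
```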

Interpretation for Reliability
Author Interpretation of Cronbach’s alpha (KR20)
Robinson,
Shaver et al
(1999)
- ≥0.80 Exemplary - 0.70–0.79 Extensive
- 0.60–0.69 Moderate - <0.60 Minimal
Cicchetti (1994) - >0 .90 Excellent - 0.80–0.90 Good
- 0.70–0.80 Fair - <0.70 Unacceptable
Axelson and
Kreiter (2019)
• >0.90 is needed for very high stakes tests
• 0.80–0.89 is acceptable for moderate stakes tests
• 0.70–0.79 acceptable for lower stakes assessments
• <0.70 might be useful as component of overall composite score
Obon and Rey
(2019)
• >0.90 Excellent reliability
• 0.80–0.90 Very good for a classroom test
• 0.70–0.80 good for a classroom test
• 0.60–0.70 Somewhat low
• 0.50–0.60 Suggests need for revision of test.
• < 0.50 Questionable reliability.
Hassan and Hod
(2017)
- > 0.7 is excellent - < 0.5 is unacceptable
- 0.6–0.7 is acceptable - < 0.30 is unreliable
- 0.5-0.6 is poor

Validity
• The extent to which measures indicate what they are intended to measure.
• Establishes that the measure covers the full range of the concept's meaning, i.e., covers all dimensions of a concept
• Internal statistical validity can be applied to check unidimensionality
• It depends more on "good" expert judgment

Validity Vs. Reliability
Neither Valid nor Reliable Reliable but not Valid
Valid & Reliable
• Reliability is a necessary condition for validity but not sufficient
• Reliability is a prerequisite for measurement validity
• One needs reliability, but it’s not enough for validity

Test Item Analysis Using Software
• There are software products that compute both CTT and IRT statistics.
• There are also those that perform calculations solely for CTT or solely for IRT.
• For example:
  - CITAS, ITEMAN, Lertap, and TAP are widely used packages that can be used only for CTT.
  - BILOG-MG, flexMIRT, ICL, MULTILOG, PARSCALE, PARAM-3PL, Winsteps, Xcalibre, IRT PRO, NOHARM, and TESTFACT are packages used only for IRT.
  - jMetrik, R, Mplus and IATA can compute statistics for both CTT and IRT.
• Among the software packages that compute IRT statistics, only jMetrik, PARAM-3PL, NOHARM and R are free of charge.

Hands on Practice
The commonly used, user-friendly open source test analysis programs to be discussed in this session are:
• TAP: Test Analysis Program (CTT)
• IATA: Item And Test Analysis (CTT and IRT)
• jMetrik (CTT and IRT)

Data Preparation
• Capture the data in Excel, Access, SPSS or a text file
• Design a test map that contains the answer key, content domain, cognitive domain, etc.
• Insert, import or open the student response data in the software (TAP, IATA, jMetrik, etc.)
• Fill in all the information the software requires
• Analyze the data and save or print the results
• Identify any exam items that may require revision.
• For each identified item, list your observation and a hypothesis of the nature of the problem.

TAP: Test Analysis Program …
• The Test Analysis Program (TAP) is designed as a powerful, easy-to-
use (and free!) test analysis package.
• TAP is a classical test and item analysis program:- performs test
analyses and item analyses based on CTT.
TAP output provides:
• Examinee Analysis, including percentage correct, letter grade and
confidence intervals for each student and aggregate descriptive
statistics for the group.
• It generates report for each student indicating his/her score,
responses to each item and the correct answers to items missed.
• Item and Test Analysis, including item difficulty, point biserial,
discrimination index and various statistics if item deleted (KR20,
scale mean and standard deviation, etc.).
• Options Analysis, including high group and low group item
difficulty for correct answer and distracters.

• TAP uses a text file.
• Any text file can be inserted into the TAP data editor window. But for it to work with TAP, the text file must be formatted as follows:
  - Every case must be on a single row.
  - The EXAMINEE ID LABEL must be at the far left of each row.
  - The ITEM data must be numeric and must be in single columns with NO SPACES between the item scores.
  - The first line must be the ANSWER KEY, formatted like the data (but with no numbers until the first correct answer). After you insert your data, any text in the ANSWER KEY must be deleted.
  - You will also need to fill in the appropriate data editor fields: # examinees, # items, ID label length, # options for each item, and INCLUDE for each item.
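The layout rules above can be sketched as a small Python helper that builds the lines of such a text file. This is an illustration of the rules as listed here, not TAP's own code. The function name, the example IDs and responses are hypothetical, and padding the answer-key line with spaces to the ID label length is an assumption based on "no numbers until the first correct answer".

```python
def build_tap_lines(answer_key, examinees, id_length):
    """Return text lines in the layout described above (answer key first).

    answer_key: string of digits, one per item, e.g. "31422".
    examinees: list of (id_label, responses) pairs; responses is a digit string.
    id_length: fixed width reserved for the ID label at the far left.
    """
    # Assumption: the key line is blank under the ID columns.
    lines = [" " * id_length + answer_key]
    for label, responses in examinees:
        if len(label) > id_length:
            raise ValueError(f"ID label too long: {label!r}")
        if len(responses) != len(answer_key):
            raise ValueError(f"wrong number of item responses for {label!r}")
        if not responses.isdigit():
            raise ValueError(f"item data must be numeric for {label!r}")
        # One case per row: ID label, then item scores with no spaces between them.
        lines.append(label.ljust(id_length) + responses)
    return lines


examinees = [("ABEBE", "31422"), ("SARA", "31412"), ("TOLA", "21422")]
lines = build_tap_lines("31422", examinees, id_length=6)
print("\n".join(lines))
```

The strict digit check is a simplification: a file that uses a non-numeric missing-data symbol would need that symbol allowed as well.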

Steps to Run TAP
1. TAP Software Installation and Setup
• Search for "Test Analysis Program (TAP)" in a search engine, then open the website https://people.ohio.edu/brooksg/
2. Under Programs Available (click on the program name to go to the description and download link for the program), click the link TAP: Test Analysis Program (last updated December 2018).
3. The program will be downloaded, ready to launch.
4. Click Run to open TAP.
5. Entering test data directly into TAP
• Entering new data: go to the data editor.
• Importing test data: TAP also provides the option of entering data from an existing text (.txt) file.

Steps to Run TAP …
In the data editor screen, enter the following information:
• descriptive information in the Title and Comments
sections.
• Input the number of examinees, number of items,
missing data symbol and ID label (student name)
length in the appropriate fields.
• In the Answer Key field, enter the numbers
corresponding to the correct answers as a string with
no delimiters.
• In the # Options field, enter the number of options
corresponding to each question.
• The Item Included field allows the user to eliminate
items from the analysis or set alternative correct
answers. Say Y to include or N to exclude.
• In the Data screen, enter the student identification
information and scores. Align the score data with the
guide above the Data screen.
• When all information is entered, click on either Save
File or Close and Analyze at the bottom of the Data
Entry screen.

Steps to Run TAP …
6. Saving Data Files
• Test data entered by the user or created by TAP's random data generator can be saved as TAP files for archiving, future modification or analysis.
• Data files can be saved in TAP's data editor window.
i. Choose Save TAP file under the File menu.
ii. Select the location and save the TAP file.
iii. To open the file later, choose Open TAP file under the File menu in the Data Editor screen.
7. Analyzing Tests with TAP
• Once you have a set of test scores in the Data Editor, either entered directly or imported from your own test, you can run the analysis by clicking Analyze (F9).
• To retrieve the full analysis, click on the View Full Results box.

• Response data and key for TAP analysis

IATA: Item And Test Analysis
• The item and test analysis software (IATA) is intended to help
national assessment practitioners, researchers, and others analyze
test item data as well as build effective assessment tools.
• IATA was designed to offer a user-friendly way to address many
statistical considerations related to national assessments.
• It targets specifically those who are interested in analyzing test
data, creating a new test from an item bank, or comparing or
scaling test items between different samples.
• The overarching goal of IATA is to increase the usability and
interpretability of test scores.
• The primary goal of test development from a statistical perspective
is to reduce the error of measurement. To reduce error of
measurement, IATA identifies problematic items that contribute to
error so that they may be revised, replaced, or removed altogether.
• The second goal is to establish meaningful and consistent scales on
which to report test scores.

IATA: Item And Test Analysis …
• IATA can read and write a variety of common data table formats (
Access, Excel, SPSS, delimited text files) if they are formatted
correctly.
• If the data are not formatted with the correct structure, IATA will
not be able to carry out the analyses.
• Database-compatible formats such as Access or SPSS already take care of most data formatting issues.
• However, if the data are stored in a less restrictive format, such as Excel or a text file, the following conventions should be followed:
  - The names of variables should appear in the cell at the top of each column (header).
  - The name of each variable must be distinct from the names of other variables in a data file.
  - The names of variables must begin with a letter and should not contain any spaces.
  - The data range must not contain any empty rows or columns. The data range is the rectangle of cells that contain data, beginning with the variable name of the first variable to appear in the data file and ending with the value of the last variable in the bottom-most row.
  - The data range must begin at the first cell in the spreadsheet or file. In Excel, this cell is labelled "A1." In text files, this is the top-left cursor position in the text file.

• There are two main types of data produced by and used in the
analysis of assessments: response data and item data.
1. Response data are produced by the individual learners as they
answer questions on a test.
• It includes the response of each student to each test item, recorded as codes representing the options endorsed by each student (e.g., A, B, C, D, or numbers).
• It may also include other useful demographic information on variables for analyzing test results, such as age, grade, gender, school, and region.
• Each record should have a unique response identifier (ID); otherwise IATA will produce one automatically.
• In order to score the response data, for most analyses, an answer
key must be loaded into IATA.

• Treatment of Missing and Omitted Data:
When a student does not provide a response to a test item, rather than leaving the data field blank, a missing value code is used to record why the response is missing. There are two types of missing responses: missing and omitted. Common conventions use specific values for the different types of non-response data. See Greaney and Kellaghan (2012) for information on response codes. Common values used are:
  - 9 for missing responses, where students have not responded at all to an item,
  - 8 for multiple responses, which typically occur in multiple-choice tests when students provide more than one response and in open-ended items when student responses are illegible, and
  - 7 for omitted or not-presented items, which might be used in a rotated booklet design.
Regardless of the specific codes used, you must specify how IATA is to treat each non-response code, as either missing or omitted.
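The idea of assigning a treatment to each non-response code can be sketched in Python as follows. This is not IATA's code. It assumes, for illustration only, that codes treated as "missing" are scored as incorrect and codes treated as "omitted" are left out of scoring; the actual behaviour should be checked in the IATA documentation. The function name and the mapping are hypothetical.

```python
# Assumed treatment of the common non-response codes described above.
TREATMENT = {"9": "missing", "8": "missing", "7": "omitted"}


def score_response(response, key, treatment=TREATMENT):
    """Score one response: 1 correct, 0 incorrect, None if not scored.

    Assumption: 'missing' codes count as incorrect and 'omitted' codes
    are excluded from scoring.
    """
    kind = treatment.get(response)
    if kind == "omitted":
        return None
    if kind == "missing":
        return 0
    return 1 if response == key else 0


responses = ["B", "A", "9", "8", "7"]   # hypothetical responses to one item
scores = [score_response(r, key="B") for r in responses]
print(scores)  # [1, 0, 0, 0, None]
```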

• Item naming:
• It is important to assign a unique name to each item in a national assessment program (see Anderson and Morgan 2008; Greaney and Kellaghan 2012) for purposes such as linking and future use in an item bank (IB), e.g. MA04M17001.
• Variables reserved by IATA
• During the analysis of response data, IATA calculates a variety of working variables. Their names are reserved and should not be used as names of test items or questionnaire variables.
• These are: Xweight, Missing, PercentScore, PercentError, Percentile, RawZScore, Zscore, IRTscore, IRTerror, IRTskew, IRTkurt, TrueScore, Level and the "@" symbol.
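Before loading a file, variable names can be screened against these rules. The Python sketch below is a hypothetical helper, not part of IATA. It checks the conventions stated in these slides (names begin with a letter, contain no spaces, are distinct, and avoid the reserved names and the "@" symbol), and it compares reserved names case-insensitively as a cautious assumption.

```python
RESERVED = {
    "Xweight", "Missing", "PercentScore", "PercentError", "Percentile",
    "RawZScore", "Zscore", "IRTscore", "IRTerror", "IRTskew", "IRTkurt",
    "TrueScore", "Level",
}
_RESERVED_LOWER = {name.lower() for name in RESERVED}


def check_variable_names(names):
    """Return a list of (name, problem) pairs for names that break the rules."""
    problems = []
    seen = set()
    for name in names:
        if not name or not name[0].isalpha():
            problems.append((name, "must begin with a letter"))
        if " " in name:
            problems.append((name, "must not contain spaces"))
        if "@" in name:
            problems.append((name, "must not contain '@'"))
        if name.lower() in _RESERVED_LOWER:   # case-insensitive: an assumption
            problems.append((name, "reserved by IATA"))
        if name in seen:
            problems.append((name, "duplicate name"))
        seen.add(name)
    return problems


names = ["MA04M17001", "Zscore", "item 2", "MA04M17001", "4thItem"]
for name, problem in check_variable_names(names):
    print(f"{name}: {problem}")
```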

2. Item Data:
• A test is a specific collection of questions that evaluate a
common domain of proficiency or knowledge. Individual
questions on a test are referred to as items.
• IATA produces and uses item data files with a specific format.
• An item data file contains all the information required to perform
statistical analysis of items and may contain the parameters used
to describe the statistical properties of items.
• It is simply a bank file that describe the item.

• Required Variables in an item data file are the following.
• Example

IATA Result interpretations
• Uses traffic symbols:
Symbol Meaning
Green circles indicate no problems.
A yellow diamond indicates that the results are less than optimal. This
indicator is used to suggest that modifications may be required to either
the analysis specifications or the items themselves. However, the item is
not introducing any significant error into the analysis results.
A red warning triangle appears beside any potentially problematic items.
This indicator is used either to indicate items that could not be included
in the analysis due to problems with the data or specifications, or to
recommend a more detailed examination of the specifications or
underlying data and test item. When this indicator appears, it does not
necessarily mean that there is a problem, but it does suggest that the
overall analysis results may be more accurate if the indicated test item
were removed or if the analysis were re-specified.

IATA analysis workflows and interfaces (Steps)
Double click or right click to open IATA.
Click main menu.

IATA analysis workflows and interfaces (Steps) …
There are five workflows
available in IATA:
1. Response data analysis,
2. Response data analysis
with linking,
3. Linking item data,
4. Selecting optimal test
items, and
5. Developing and
assigning performance
standards

• There are 10 different tasks in IATA and the workflows in which
they are used
Task
Workflow:
A. Response data analysis
B. Response data analysis with linking
C. Linking item data
D. Selecting optimal test items
E. Developing and assigning performance
standards.
A B C D E
1. Loading data ● ● ● ● ●
2. Setting analysis specifications ● ●
3. Analyzing test items ● ●
4. Analysing test dimensionality ● ●
5. Analyzing differential item functioning ● ●
6. Linking ● ●
7. Scaling test results ● ●
8. Selecting optimal test items ● ● ●
9. Informing development of performance
standards
● ● ●
10. Saving results ● ● ● ● ●

• Response data and key for IATA analysis

JMetrik
• JMetrik software is one of the open source programs that can be
used in the context of IRT and CTT.
Download and Install jMetrik
• jMetrik is a free application that runs on any Windows, MacOS, or
Linux computer that has Java installed.
• jMetrik is available from this url:
https://itemanalysis.com/jmetrik-download/
• Make sure that you updated Java or you may have problems getting
jMetrik to run properly. You can download Java from this URL:
https://www.java.com/en/download/

JMetrik
[Figure: jMetrik software main screen]
• Response data and key for jMetrik analysis
