Why should the bright ones get the wrong answer while the poor ones get the right answer?
Item analysis and validation
Mark Leonard Tan
The teacher normally prepares a draft of
the test. Such a draft is subjected to item
analysis and validation to ensure that the final
version of the test would be useful.
Phases of preparing a test
Item analysis phase
Item revision phase
There are two important characteristics of an
item that will be of interest to the teacher:
Item Difficulty or the difficulty of an item is
defined as the number of students who are able
to answer the item correctly divided by the total
number of students. Thus:
Item difficulty = (number of students with the correct answer) / (total number of students)
The item difficulty is usually expressed as a percentage.
What is the item difficulty index of an item if 25
students are unable to answer it correctly while 75
answered it correctly?
Here the total number of students is 100, hence,
the item difficulty index is 75/100 or 75%.
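The computation above can be sketched in a few lines of Python. This is only an illustration of the formula; the function name `item_difficulty` is an assumption made here and is not part of the original material.

```python
def item_difficulty(num_correct: int, num_students: int) -> float:
    """Proportion of students who answered the item correctly."""
    if num_students <= 0:
        raise ValueError("num_students must be positive")
    return num_correct / num_students


# Worked example from the text: 75 students correct, 25 incorrect.
difficulty = item_difficulty(75, 75 + 25)
print(f"{difficulty:.0%}")  # 75%
```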
One problem with this type of difficulty
index is that it may not actually indicate
that the item is difficult or easy. A student
who does not know the subject matter will
naturally be unable to answer the item
correctly even if the question is easy. How
do we decide on the basis of this index
whether the item is too difficult or too easy?
Range of difficulty index    Interpretation      Action
0 – 0.25                     Difficult           Revise or discard
0.26 – 0.75                  Right difficulty    Retain
0.76 and above               Easy                Revise or discard
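As a rough sketch, the rule of thumb above can be written as a lookup function. The name `classify_difficulty` is hypothetical, and the handling of values that fall between the listed bands (for example 0.255) is an assumption: they are assigned to the lower band.

```python
def classify_difficulty(p: float) -> tuple[str, str]:
    """Return (interpretation, action) for a difficulty index p in [0, 1].

    Assumption: values between the printed bands (e.g. 0.255) fall
    into the lower band, since the table does not say where they go.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("difficulty index must be between 0 and 1")
    if p <= 0.25:
        return ("Difficult", "Revise or discard")
    if p <= 0.75:
        return ("Right difficulty", "Retain")
    return ("Easy", "Revise or discard")


print(classify_difficulty(0.75))  # ('Right difficulty', 'Retain')
```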
Difficult items tend to discriminate between
those who know and those who do not know
the answer. Easy items cannot discriminate between
these two groups of students.
We are therefore interested in deriving a
measure that will tell us whether an item can
discriminate between these two groups of
students. Such a measure is called an index of discrimination.
An easy way to derive such a measure is to
measure how difficult an item is with
respect to those in the upper 25% of the
class and how difficult it is with respect to
those in the lower 25% of the class. If the
upper 25% of the class found the item easy
yet the lower 25% found it difficult, then
the item can discriminate properly
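between these two groups. Thus:
Index of discrimination = DU – DL
where DU is the difficulty index for the upper 25% of the class
and DL is the difficulty index for the lower 25%.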
Example: Obtain the index of discrimination of an
item if the upper 25% of the class had a difficulty
index of 0.60 (i.e. 60% of the upper 25% got the
correct answer) while the lower 25% of the class
had a difficulty index of 0.20.
DU = 0.60 while DL = 0.20, thus index of
discrimination = .60 - .20 = .40.
Theoretically, the index of discrimination can
range from -1.0 (when DU = 0 and DL = 1) to 1.0
(when DU = 1 and DL = 0).
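A minimal sketch of this calculation, including the two theoretical extremes, might look as follows. The function name `discrimination_index` is an assumption made for illustration.

```python
def discrimination_index(du: float, dl: float) -> float:
    """Index of discrimination: DU minus DL.

    du: difficulty index (proportion correct) in the upper 25%
    dl: difficulty index (proportion correct) in the lower 25%
    """
    for name, value in (("du", du), ("dl", dl)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return du - dl


# Worked example from the text: DU = 0.60, DL = 0.20.
index = discrimination_index(0.60, 0.20)
print(round(index, 2))  # 0.4

# Theoretical extremes of the index.
print(discrimination_index(0.0, 1.0))  # -1.0
print(discrimination_index(1.0, 0.0))  # 1.0
```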
When the index of discrimination is equal to -1,
then this means that all of the lower 25% of the
students got the correct answer while all of the
upper 25% got the wrong answer. In a sense,
such an index discriminates correctly between
the two groups, but the item itself is highly questionable.
On the other hand, if the index of
discrimination is 1.0, then this means that
all of the lower 25% failed to get the correct
answer while all of the upper 25% got the
correct answer. This is a perfectly
discriminating item and is the ideal item
that should be included in the test.
As in the case of the difficulty index, we have
the following rule of thumb:
Index Range      Interpretation                                   Action
-1.0 to -.50     Can discriminate but the item is questionable
-.55 to .45      Non-discriminating                               Revise
.46 to 1.0       Discriminating item                              Include
Example: Consider a multiple-choice test item
for which the following data were obtained:
Option       A    B*   C    D
Total        0    40   20   20
Upper 25%    0    15   5    0
Lower 25%    0    5    10   5
The correct response is B. Let us compute the
difficulty index and the index of discrimination:
Difficulty index = (no. of students getting the correct answer) / (total no. of students)
= 40/80
= .50 or 50%, within the range of a “good item”
The discrimination index can similarly be computed:
DU = (no. of students in the upper 25% with correct response) / (no. of students in the upper 25%)
= 15/20 = .75 or 75%
DL = (no. of students in the lower 25% with correct response) / (no. of students in the lower 25%)
= 5/20 = .25 or 25%
Discrimination index = DU – DL
= .75 - .25
= .50 or 50%
Thus, the item also has a “good discriminating power”.
It is also instructive to note that the distracter A
is not an effective distracter since this was never
selected by the students. Distracters C and D
appear to have a good appeal as distracters.
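The discrimination and distracter checks in this example can be reproduced with a short script. This is a sketch only: the dictionary layout and variable names are assumptions made for illustration, and the counts are copied from the table above.

```python
# Response counts per option, copied from the example table.
total = {"A": 0, "B": 40, "C": 20, "D": 20}  # whole class
upper = {"A": 0, "B": 15, "C": 5, "D": 0}    # upper 25% of the class
lower = {"A": 0, "B": 5, "C": 10, "D": 5}    # lower 25% of the class
key = "B"                                    # the correct response

# Difficulty index within each group, then their difference.
du = upper[key] / sum(upper.values())  # 15/20
dl = lower[key] / sum(lower.values())  # 5/20
discrimination = du - dl

# A distracter that nobody selected is not doing any work.
unused = [opt for opt, n in total.items() if opt != key and n == 0]

print(du, dl, discrimination, unused)  # 0.75 0.25 0.5 ['A']
```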
Basic Item Analysis
The Michigan State University
Measurement and Evaluation Department
reports a number of item statistics which aid in
evaluating the effectiveness of an item.
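Index of Difficulty – the proportion of the
total group who got the item wrong. Thus a
high index indicates a difficult item and a low
index indicates an easy item. Note that this
reverses the earlier definition, in which the index
is the proportion who answered correctly.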
Index of Discrimination – is the difference
between the proportion of the upper group who
got an item right and the proportion of the lower
group who got the item right.
Item Discrimination refers to the ability of an
item to differentiate among students on the
basis of how well they know the material.
A good item is one that has good
discriminating ability and has a sufficient
level of difficulty (not too difficult nor too
easy).
At the end of the item analysis report, test items
are listed according to their degrees of difficulty
(easy, medium, hard) and discrimination (good,
fair, poor). These distributions provide a quick
overview of the test and can be used to identify
items which are not performing well and which
can perhaps be improved or discarded.
The Item-Analysis Procedure for Norm-Referenced Tests
provides the following information:
1. The difficulty of an item
2. The discriminating power of an item
3. The effectiveness of each alternative
Benefits derived from Item Analysis
1. It provides useful information for class
discussion of the test.
2. It provides data which helps students improve.
3. It provides insights and skills that lead to the
preparation of better tests in the future.
The discriminating power of an item is reported as
a decimal fraction; maximum discriminating power
is indicated by an index of 1.00.
Maximum discrimination is usually found at the 50
per cent level of difficulty.
0.00 – 0.20 = very difficult
0.21 – 0.80 = moderately difficult
0.81 – 1.00 = very easy
After performing the item analysis and
revising the items which need revision, the
next step is to validate the instrument.
The purpose of validation is to determine the
characteristics of the whole test itself,
namely, the validity and reliability of the test.
Validation is the process of collecting and
analysing evidence to support the
meaningfulness and usefulness of the test.
Validity is the extent to which a test measures what it
purports to measure; it also refers to the
appropriateness, correctness, meaningfulness,
and usefulness of the specific decisions a
teacher makes based on the test results.
There are three main types of
evidence that may be collected:
1. Content-related evidence of validity
2. Criterion-related evidence of validity
3. Construct-related evidence of validity
Content-related evidence of validity
refers to the content and format of the instrument.
How appropriate is the content?
Does it logically get at the intended variable?
How adequately does the sample of items or
questions represent the content to be assessed?
Criterion-related evidence of validity
refers to the relationship between scores
obtained using the instrument and scores
obtained using one or more other tests (often
called the criterion).
How strong is this relationship?
How well do such scores estimate present or
predict future performance of a certain type?
Construct-related evidence of validity
refers to the nature of the psychological
construct or characteristic being measured by
the test.
How well does a measure of the construct explain
differences in the behaviour of the individuals or
their performance on a certain task?
Usual procedure for
determining content validity
The teacher writes out the objectives based on the TOS (table of specifications).
The teacher gives the objectives and the TOS to two experts
along with a description of the test takers.
The experts look at the objectives, read over
the items in the test and place a check mark
in front of each question or item that they
feel does NOT measure one or more objectives.
They also place a check mark in front of each
objective NOT assessed by any item in the test.
The teacher then rewrites any item so
checked and resubmits to experts and/or
writes new items to cover those objectives
not heretofore covered by the existing test.
This continues until the experts approve all
items and agree that
all of the objectives are sufficiently covered
by the test.
Obtaining Evidence for Criterion-Related Validity
The teacher usually compares scores on the
test in question with the scores on some
other independent criterion test which
presumably already has high validity.
Another type of validity is called
predictive validity, wherein the test scores on
the instrument are correlated with scores on
the later performance of the students.
Gronlund's Expectancy Table
              Grade Point Average
Test Score    Very Good    Good    Needs Improvement
High          20           10      5
Average       10           25      5
Low           1            10      14
The expectancy table shows that there were
20 students getting high test scores and
subsequently rated very good in terms of their
grade point average.
And finally, 14 students obtained low test
scores and were later graded as needing
improvement.
The evidence for this particular test tends to
indicate that students getting high scores on it
would be rated very good; students getting average scores
on it would be rated good later; and students
getting low scores on the test would be
graded as needing improvement later.
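One way to read an expectancy table is to convert each row to percentages, which shows how likely each later rating was for a given test score level. The sketch below does this for the counts in the table above; the function name `row_percentages` and the dictionary layout are assumptions made for illustration.

```python
# Counts copied from the expectancy table (rows: test score level,
# columns: later grade point average rating).
expectancy = {
    "High":    {"Very Good": 20, "Good": 10, "Needs Improvement": 5},
    "Average": {"Very Good": 10, "Good": 25, "Needs Improvement": 5},
    "Low":     {"Very Good": 1,  "Good": 10, "Needs Improvement": 14},
}


def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for score_level, counts in table.items():
        row_total = sum(counts.values())
        result[score_level] = {
            rating: round(100 * n / row_total, 1)
            for rating, n in counts.items()
        }
    return result


percentages = row_percentages(expectancy)
for score_level, row in percentages.items():
    print(score_level, row)
```

In each row the largest percentage falls on the rating that matches the score level, which is the pattern the text describes.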
Reliability refers to the consistency of the scores
obtained – how consistent they are for each
individual from one administration of an
instrument to another and from one set of
items to another.
We already have the formulas for computing
the reliability of a test; for internal
consistency, for instance, we could use the
split-half method or the Kuder-Richardson
formulas (KR-20 or KR-21).
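For reference, the two Kuder-Richardson formulas can be sketched as below for items scored 0 or 1. The response matrix is hypothetical data invented for illustration, the function names are assumptions, and the population variance of total scores is used here; some textbooks use the sample variance instead, which gives slightly different values.

```python
from statistics import mean, pvariance


def kr20(responses):
    """KR-20 for a list of per-student lists of 0/1 item scores."""
    k = len(responses[0])                      # number of items
    totals = [sum(student) for student in responses]
    variance = pvariance(totals)               # assumption: population variance
    if variance == 0:
        raise ValueError("total scores have zero variance")
    pq_sum = 0.0
    for item in range(k):
        p = mean(student[item] for student in responses)  # proportion correct
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / variance)


def kr21(responses):
    """KR-21, which uses only the mean and variance of total scores."""
    k = len(responses[0])
    totals = [sum(student) for student in responses]
    m = mean(totals)
    variance = pvariance(totals)
    if variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - m * (k - m) / (k * variance))


# Hypothetical data: 5 students (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))  # 0.8
print(round(kr21(responses), 2))  # 0.67
```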
Reliability and validity are related concepts. If
an instrument is unreliable, it cannot be valid.
As reliability improves, validity may
improve (or may not).
However, if an instrument is shown
scientifically to be valid then it is almost
certain that it is also reliable.
The following table is a standard followed almost
universally in educational tests and measurement:
Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80 - .90        Very good for a classroom test.
.70 - .80        Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 - .70        Somewhat low. This test should be supplemented by other measures (e.g., more tests) for grading.
.50 - .60        Suggests need for revision of test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
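The bands in this table can be expressed as a simple lookup. This is a sketch under a stated assumption: the printed bands share their endpoints, so here each boundary value is assigned to the higher band, except .50, which is treated as "or below". The function name `interpret_reliability` is hypothetical.

```python
def interpret_reliability(r: float) -> str:
    """Map a reliability coefficient to the short label from the table.

    Assumption: boundary values (.60, .70, .80, .90) go to the higher
    band; .50 itself is treated as ".50 or below".
    """
    if r >= 0.90:
        return "Excellent reliability"
    if r >= 0.80:
        return "Very good for a classroom test"
    if r >= 0.70:
        return "Good for a classroom test"
    if r >= 0.60:
        return "Somewhat low"
    if r > 0.50:
        return "Suggests need for revision of test"
    return "Questionable reliability"


print(interpret_reliability(0.85))  # Very good for a classroom test
```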