2. How do we evaluate
our test?
A glimpse of our practices
When do we evaluate
our test?
3. What is a test?
An objective measure of a sample of
behavior or psychological object
(Anastasi & Urbina, 1997).
A systematic procedure for measuring a
sample of behavior by posing a set of
questions in a uniform manner
(Gronlund & Linn, 2000).
Are only tools…
4. QUALITIES OF A GOOD TEST
To be a “GOOD” test, a
test ought to have
validity, reliability, and
accuracy.
5. The Error Components of a Test
Random Error
◦ Sources: Fatigue, Cheating, Guessing,
etc.
Systematic Error
◦ Sources: Item bias, Technical errors,
Contextual clues, etc.
Observed Score = True Score + Error (the error may be positive or negative)
6. Important Notes from the Classical Test Theory
Systematic errors lead to poor
test reliability and validity.
Interpretations of test results
may be distorted by too many
errors.
Random errors are more
difficult to control than
systematic errors.
Systematic errors can be
controlled by systematic
assessment (Item Analysis).
7. How do we evaluate our tests?
◦ QUALITATIVE EVALUATION
Systematic inspection of test plan, tasks,
and format
◦ QUANTITATIVE
EVALUATION
Psychometric techniques in item analysis
8. A Systematic Inspection of Tests
Adequacy of Assessment or
Test Plan
Adequacy of Assessment
Task
Adequacy of Test Format
and Directions
10. Psychometric Techniques for Item Analysis
Test / Item Statistics - These are
psychometric techniques generally
based on a norm-referenced
perspective (Method of extreme
groups).
◦ Item Difficulty
◦ Item Discrimination Power
◦ Effectiveness of Distracters
11. The Method of Extreme Groups
Selecting criterion groups
(Upper & Lower Criterion
Groups)
◦ If the sample has fewer than 50
examinees: Use 50% groupings
◦ If the sample has more than 50
examinees, and for a more refined
analysis: Use upper & lower 27%
groupings
◦ Set aside the papers which will
not be used in the analysis.
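The grouping rule above can be sketched in Python. This is only an illustration: the function name `select_criterion_groups`, the use of a dict of total scores, and rounding the group size down are assumptions, and the slide does not say what to do with exactly 50 papers (the sketch uses 27% in that case).

```python
def select_criterion_groups(scores, fraction=None):
    """Split examinees into upper and lower criterion groups by total score.

    scores:   dict mapping a student id to a total test score (hypothetical input).
    fraction: share of examinees placed in each group. If None, the rule from
              the slide is applied: 50% for fewer than 50 papers, else 27%.
    """
    n = len(scores)
    if fraction is None:
        # Assumption: exactly 50 papers is treated like "more than 50".
        fraction = 0.5 if n < 50 else 0.27
    # Rank students from the highest to the lowest total score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = int(n * fraction)      # group size, rounded down (an assumption)
    upper = ranked[:k]         # upper criterion group
    lower = ranked[n - k:]     # lower criterion group
    middle = ranked[k:n - k]   # papers set aside, not used in the analysis
    return upper, lower, middle


# Ten made-up papers: fewer than 50, so each group takes 50% of them.
scores = {f"s{i}": i for i in range(10)}
upper, lower, middle = select_criterion_groups(scores)
```

With 100 papers the same call would place 27 students in each group and set aside the remaining 46.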
12. The Method of Extreme Groups
Determine item statistics
◦ Item Difficulty
◦ Item Discrimination
◦ Effectiveness of distracters
Determine test statistics
◦ Solve for the mean (average) of the
difficulty and discrimination indices
of all the items in the test.
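As a rough sketch of the test-statistics step, the item indices can simply be averaged. The function name `summarize_test` and the sample index values are made up for illustration; the indices themselves are assumed to have been computed already for each item.

```python
from statistics import mean


def summarize_test(difficulties, discriminations):
    """Return the mean difficulty and mean discrimination index of a test.

    difficulties:    list of p indices, one per item.
    discriminations: list of discrimination indices, one per item.
    """
    return {
        "mean_difficulty": mean(difficulties),
        "mean_discrimination": mean(discriminations),
    }


# Made-up indices for a three-item test.
summary = summarize_test([0.75, 0.5, 0.25], [0.5, 0.3, 0.1])
```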
13. What is item difficulty?
Item difficulty is simply the percentage of
students taking the test who answered the
item correctly. It is represented by the p index.
Range: 0.0 to 1.0 or 0% to 100%
It can be computed using the formula below:
$$p = \frac{R_U + R_L}{N}$$
14. ◦ Some important notations
R_U = No. of students from the UPPER
criterion group who answered the
item correctly or chose the
particular option under analysis.
R_L = No. of students from the LOWER
criterion group who answered the
item correctly or chose the
particular option under analysis.
N = No. of students who tried to
answer the item
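A minimal Python sketch of the p index, using the notation above. The function and parameter names are illustrative, and the counts in the example are invented.

```python
def item_difficulty(upper_correct, lower_correct, n_attempted):
    """Compute the p index: (R_U + R_L) / N.

    upper_correct: R_U, correct answers in the upper criterion group.
    lower_correct: R_L, correct answers in the lower criterion group.
    n_attempted:   N, number of students who tried to answer the item.
    """
    if n_attempted <= 0:
        raise ValueError("n_attempted must be positive")
    return (upper_correct + lower_correct) / n_attempted


# Invented counts: 20 upper-group and 10 lower-group students answered
# correctly out of 40 who attempted the item.
p = item_difficulty(20, 10, 40)  # 30 / 40 = 0.75
```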
15. Rule of Thumb: p-index
The nearer the p value is to 0.0,
or 0%, the more DIFFICULT the
item becomes.
The nearer the p value is to 1.0,
or 100%, the EASIER the item
becomes.
16. How difficult should our test/item be?
For an objective item test, the ideal
difficulty would be halfway between the
percentage expected from pure guessing and 100%.
(Thompson & Levitov, 1985)
◦ Ex. p ≈ 0.63 (halfway between 0.25 and 1.00)
for a multiple-choice test with 4 options.
Eclectic distribution of difficult, average
and easy items; with extremely limited
use of items having p = 0.9 or more
(Frary, 1995)
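The "halfway between pure guess and 100%" rule can be worked out in a few lines. This sketch assumes that the chance of a pure guess is 1 divided by the number of options; the function name is illustrative.

```python
def ideal_difficulty(n_options):
    """Ideal p value: halfway between the pure-guess rate and 1.0.

    Assumes every option is equally likely under pure guessing, so the
    chance level is 1 / n_options.
    """
    if n_options < 2:
        raise ValueError("an objective item needs at least two options")
    chance = 1 / n_options
    return chance + (1 - chance) / 2


four_option = ideal_difficulty(4)  # 0.25 + 0.75 / 2 = 0.625, about 0.63
true_false = ideal_difficulty(2)   # 0.5 + 0.5 / 2 = 0.75
```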
17. How important is item difficulty?
An item having p = 0.0 or 1.0 does not
in any way contribute to measuring
individual differences, and seriously
affects test validity.
Item difficulty has a profound effect on
the variability of test scores and the
precision with which the test
discriminates between achievement
groups. (Thorndike et al., 1991)
18. What is item discrimination?
It is the ability of an item to
discriminate between students with
high or low achievement. It is
represented by Di.
Range: -1.0 to 1.0 or -100% to 100%
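It can be computed using the
formula below:
$$D_i = \frac{R_U - R_L}{0.5N}$$
where R_U and R_L are the numbers of students in the
upper and lower criterion groups who answered the item
correctly, and N is the number of students who tried to
answer the item.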
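A minimal Python sketch of the discrimination index, taking the difference between the upper-group and lower-group counts over half the number of students who attempted the item. The function and parameter names are illustrative, and the counts are invented.

```python
def item_discrimination(upper_correct, lower_correct, n_attempted):
    """Compute Di: (R_U - R_L) / (0.5 * N).

    upper_correct: correct answers in the upper criterion group.
    lower_correct: correct answers in the lower criterion group.
    n_attempted:   N, number of students who tried to answer the item.
    """
    if n_attempted <= 0:
        raise ValueError("n_attempted must be positive")
    return (upper_correct - lower_correct) / (0.5 * n_attempted)


# Invented counts: 20 upper-group and 10 lower-group students answered
# correctly out of 40 who attempted the item.
d = item_discrimination(20, 10, 40)  # 10 / 20 = 0.5
```

A negative result means that more lower-group than upper-group students answered the item correctly.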
19. Rule of Thumb: Di-index
The higher the discrimination index,
the better the item becomes. This is
so because a high positive value
indicates that the item favors the
upper achievement group, whom we
expect to answer more of the items
in the test correctly.
20. Why should our items discriminate?
Items that do not discriminate
can seriously affect the validity of
the test.
Negatively discriminating items
are useless and tend to decrease
the validity of the test. (Wood,
1960)
21. How do we evaluate the effectiveness of distracters?
Solve for Di of each wrong
option (applicable only for
multiple choice items)
Rule of thumb: The more
negative the Di index, the
more effective the
distracter.
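The distracter check above can be sketched by computing Di for every option of a multiple-choice item. The function name, the input format (one chosen option letter per student), and the sample responses are assumptions for illustration; the sketch also assumes every student in both groups attempted the item.

```python
from collections import Counter


def option_discrimination(upper_choices, lower_choices):
    """Compute Di for each option of one multiple-choice item.

    upper_choices: options chosen by the upper criterion group.
    lower_choices: options chosen by the lower criterion group.
    Assumes all listed students attempted the item, so N is the total
    number of responses in both groups.
    """
    n = len(upper_choices) + len(lower_choices)
    if n == 0:
        raise ValueError("no responses given")
    upper_counts = Counter(upper_choices)
    lower_counts = Counter(lower_choices)
    options = sorted(set(upper_counts) | set(lower_counts))
    # Counter returns 0 for an option nobody in that group chose.
    return {
        opt: (upper_counts[opt] - lower_counts[opt]) / (0.5 * n)
        for opt in options
    }


# Invented responses; "A" is taken to be the keyed (correct) answer.
upper = list("AAAAAAAABC")
lower = list("AAAABBBCCD")
indices = option_discrimination(upper, lower)
# A: 0.4 (the key), B: -0.2, C: -0.1, D: -0.1 (the distracters)
```

In this made-up item, B has the most negative index, so it would be read as the most effective distracter.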
22. Workshop 2
Systematic inspection of
test items, tasks and format
Item analysis using
Microsoft Excel®
and
STATISTICA 6.0 or SPSS
23. Some points of caution in interpreting item/test statistics
A low index of discriminating
power does NOT necessarily
indicate a defective item.
◦ Non-technical factors which
contribute to item discriminating
power
Emphasis given to a domain or
content covered by a test.
Homogeneity and characteristics
of student groups
Difficulty of an item
24. Some points of caution in interpreting item/test statistics
Item analysis data from small
samples are highly tentative and
tend to fluctuate because of their
norm-referenced perspective.
A test that has undergone
psychometric analysis is NOT
necessarily a STANDARDIZED test.
25. What should we do after item analysis?
Further analysis with
other psychometric
methods
◦ Ex. Item bias analysis,
Point-biserial and biserial
correlations
Item Calibration
◦ IRT and Rasch Scaling
Item banking
26. Practical Implications of Evaluating Tests
It helps prevent waste of the
time and effort that went into
test and assessment
preparation.
It provides a basis for the
general improvement of
classroom instruction.
It provides a venue for
teachers to develop their test
construction skills.
Editor's Notes
The behavior or psychological object here is the competency that is learned.
Random error is hard to assess. Systematic error can be assessed through item analysis. If the students failed because of these two kinds of error, the test was not prepared properly.
A poor test is the teacher's responsibility. We give assessments in order to find out which objectives the children need help with.