This webinar discusses using item analysis and statistics to evaluate assessment questions and better understand assessment data. It covers common item analysis statistics such as item difficulty, the discrimination index, and point-biserial values, and provides guidelines for desired statistical ranges. The webinar emphasizes that statistics alone cannot determine whether a question is good or bad, and lists factors to consider, such as author intent, content delivery method, and historical analysis. Best practices such as strengthening the item review process and reusing questions are recommended to improve the accuracy and usefulness of the statistics.
EAC 2018 | Fort Lauderdale | June 27-29
Psychometrics 101: Know What Your Assessment Data Is Telling You
Eric Ermie, Vice President of Sales, ExamSoft WorldWide, Inc.
Formerly Program Manager for Evaluation and Assessment at The Ohio State University College of Medicine
AGENDA
• Overview
• Types of stats
• Interpreting the item analysis report
• Examples
• General statistical guidelines
OVERVIEW

"Where do I start?"
"Is this a good or bad question? Can statistics even tell me that?"
"How can I reconcile what I know about my assessment's past with what the data is telling me?"

Item analysis is not a foolproof answer to these questions.
But… YOU HAVE TO START SOMEWHERE.
TYPES OF STATS

Common Stats:
• Item Difficulty/p Value: a decimal representation of difficulty, equal to the proportion of students who answered the item correctly. The lower the value, the harder the item.
• Upper 27%: of the top 27% of scorers only, the percentage who answered the item correctly.
• Lower 27%: of the bottom 27% of scorers only, the percentage who answered the item correctly.
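These three statistics are straightforward to compute. A minimal sketch with hypothetical data (the class roster, scores, and function names are illustrative, not ExamSoft's implementation):

```python
# Illustrative only: compute the p value and Upper/Lower 27% for one item.
# `scores` are total exam scores; `correct` flags whether each student
# answered this particular item correctly (hypothetical data).

def item_difficulty(correct):
    """p value: proportion of all students answering the item correctly."""
    return sum(correct) / len(correct)

def upper_lower_27(scores, correct):
    """Percent correct within the top and bottom 27% of total scorers."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = max(1, round(0.27 * len(scores)))          # students per tail group
    lower_group, upper_group = order[:k], order[-k:]
    pct = lambda grp: sum(correct[i] for i in grp) / len(grp)
    return pct(upper_group), pct(lower_group)

scores  = [95, 90, 88, 84, 80, 75, 70, 65, 55, 40]   # hypothetical class
correct = [1,  1,  1,  1,  1,  0,  1,  0,  0,  0]

p = item_difficulty(correct)                     # 0.6: moderately hard item
upper, lower = upper_lower_27(scores, correct)   # 1.0 and 0.0 here
```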
TYPES OF STATS

Common Stats:
• Discrimination Index: calculated by subtracting the percentage of the bottom-27% group that answered the item correctly from the percentage of the top-27% group that did. It measures whether the item discriminates between the highest and lowest performers.
• Point-Biserial: a discrimination statistic that indicates whether doing well on that specific item correlates with doing well on the exam overall; in other words, whether the item was a good or bad predictor of overall performance on the exam.
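Both statistics can likewise be sketched in a few lines (again with hypothetical data; the point-biserial shown is the standard textbook formula, not necessarily the exact variant any particular software uses):

```python
import statistics

def discrimination_index(upper_pct, lower_pct):
    """Upper-27% percent correct minus Lower-27% percent correct."""
    return upper_pct - lower_pct

def point_biserial(scores, correct):
    """Correlation between item correctness (0/1) and total exam score."""
    right = [s for s, c in zip(scores, correct) if c == 1]
    wrong = [s for s, c in zip(scores, correct) if c == 0]
    p = len(right) / len(scores)                 # item difficulty
    sd = statistics.pstdev(scores)               # population SD of totals
    return (statistics.mean(right) - statistics.mean(wrong)) / sd \
        * (p * (1 - p)) ** 0.5

scores  = [95, 90, 88, 84, 80, 75, 70, 65, 55, 40]   # hypothetical totals
correct = [1,  1,  1,  1,  1,  0,  1,  0,  0,  0]

disc = discrimination_index(1.0, 0.0)    # 1.0: maximal discrimination
rpb  = point_biserial(scores, correct)   # ~0.78: item tracks overall score
```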
ITEM ANALYSIS REPORT

• Item Difficulty: range 0.0 to 1.0
• Discrimination Index: range -1.0 to 1.0
• Point-Biserial: range -1.0 to 1.0
But with any statistic it is important to remember: context matters!
EXTRANEOUS FACTORS

"Stats alone cannot tell the whole story."

Six factors to always consider when evaluating item performance:
1. Cheating
2. Return on investment
3. Conflicting content/faculty
4. "Six degrees of Kevin Bacon"
5. Author intent
6. Content delivery method
ITEM ANALYSIS EXAMPLES

Item #1 (correct answer: E)
Diff(p) 0.98 | Upper 27%: 100.00% | Lower 27%: 96.15% | Disc. Index 0.04 | Point Biserial 0.10

Response frequencies (* indicates correct answer):

                 A      B      C      D      E
  # selected     0      1      1      1   *178
  % selected   0.00   0.55   0.55   0.55  98.34
  Point Bis.   0.00   0.02   0.00  -0.10   0.10
  Disc. Index  0.00   0.00   0.00  -0.02   0.02
  Upper 27%    0.00   0.00   0.00   0.00   1.00
  Lower 27%    0.00   0.00   0.00   0.02   0.98
Item #7 (correct answer: D)
Diff(p) 0.66 | Upper 27%: 82.00% | Lower 27%: 46.15% | Disc. Index 0.36 | Point Biserial 0.28

Response frequencies (* indicates correct answer):

                 A      B      C      D      E
  # selected     7     17     28   *120      9
  % selected   3.87   9.39  15.47  66.30   4.97
  Point Bis.  -0.11  -0.19  -0.12   0.28  -0.07
  Disc. Index -0.04  -0.19  -0.09   0.36  -0.04
  Upper 27%    0.00   0.00   0.12   0.82   0.06
  Lower 27%    0.04   0.19   0.21   0.46   0.10
Item #22 (correct answer: D)
Diff(p) 0.36 | Upper 27%: 52.00% | Lower 27%: 26.92% | Disc. Index 0.25 | Point Biserial 0.22

Response frequencies (* indicates correct answer):

                 A      B      C      D      E
  # selected    35     34     21    *66     25
  % selected  19.34  18.78  11.60  36.46  13.81
  Point Bis.  -0.09   0.04  -0.20   0.22  -0.06
  Disc. Index -0.15   0.07  -0.15   0.25  -0.02
  Upper 27%    0.10   0.24   0.04   0.52   0.10
  Lower 27%    0.25   0.17   0.19   0.27   0.12
Item #24 (correct answer: C)
Diff(p) 0.52 | Upper 27%: 64.00% | Lower 27%: 42.31% | Disc. Index 0.22 | Point Biserial 0.18

Response frequencies (* indicates correct answer):

                 A      B      C      D      E
  # selected    61     21    *94      5      0
  % selected  33.70  11.60  51.93   2.76   0.00
  Point Bis.  -0.10  -0.19   0.18   0.12   0.00
  Disc. Index -0.12  -0.13   0.22   0.04   0.00
  Upper 27%    0.26   0.04   0.64   0.06   0.00
  Lower 27%    0.38   0.17   0.42   0.02   0.00
Item #34 (correct answer: B)
Diff(p) 0.71 | Upper 27%: 90.00% | Lower 27%: 55.77% | Disc. Index 0.34 | Point Biserial 0.31

Response frequencies (* indicates correct answer):

                 A      B      C      D      E
  # selected     0   *129      1     30     21
  % selected   0.00  71.27   0.55  16.57  11.60
  Point Bis.   0.00   0.31  -0.16  -0.25  -0.11
  Disc. Index  0.00   0.34  -0.02  -0.23  -0.09
  Upper 27%    0.00   0.90   0.00   0.06   0.04
  Lower 27%    0.00   0.56   0.02   0.29   0.13
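One useful sanity check when reading reports like these: the discrimination index should equal Upper 27% minus Lower 27%, and the item-level discrimination index and point biserial should match those of the correct option. A quick check against four of the example items (values as read from the tables, to two decimals):

```python
# (p, upper, lower, disc) for four of the example items above
examples = {
    7:  (0.66, 0.82, 0.46, 0.36),
    22: (0.36, 0.52, 0.27, 0.25),
    24: (0.52, 0.64, 0.42, 0.22),
    34: (0.71, 0.90, 0.56, 0.34),
}
for item, (p, upper, lower, disc) in examples.items():
    # allow a little slack for two-decimal rounding in the report
    assert abs((upper - lower) - disc) < 0.01, f"item {item} inconsistent"
```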
GENERAL GUIDELINES

Desired statistical ranges: opinions differ, but the most commonly used are:
• Item Difficulty/p Value: acceptable difficulty is not a set number; it depends on the question's intention. If the item is meant to be a mastery item, you want the difficulty as close to 1.00 as possible. If you want a discriminating question, significantly lower values are acceptable.
• Upper 27%: if fewer than 60% of your top performers answer a question correctly, further analysis is needed to see whether there are issues with the question. Likewise, if fewer of your upper 27% answer correctly than your lower 27%, there could be an issue.
• Lower 27%: generally this should never be higher than the Upper 27%. A value as low as 0% can be acceptable, and a value as high as 100% can be acceptable if it is a mastery question.
GENERAL GUIDELINES

Desired statistical ranges: opinions differ, but the most commonly used are:
• Discrimination Index: some set specific cutoffs for acceptable and unacceptable values; a more accurate guide is that the lower the p value, the higher the discrimination index needs to be. Generally, at 0.2 an item is considered to have discriminated, below 0.2 it is considered non-discriminating, and 0.3 or greater is considered highly discriminating.
• Point-Biserial: as with the discrimination index, some set specific cutoffs for acceptable and unacceptable values. Generally, 0.2 and above is considered to discriminate and to have a positive association with overall performance on the assessment; lower values are acceptable for mastery questions, and 0.3+ is desired for discriminating questions.
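These guidelines lend themselves to a simple screening rule. A hypothetical flagger (the thresholds are the commonly cited ones above, not hard rules; the `mastery` parameter relaxes the discrimination checks, since mastery items are expected to discriminate little):

```python
def review_flags(p, upper, lower, disc, rpb, mastery=False):
    """Return reasons an item deserves further review, per the guidelines."""
    flags = []
    if upper < 0.60:
        flags.append("fewer than 60% of top performers answered correctly")
    if lower > upper:
        flags.append("lower 27% outperformed upper 27%")
    if not mastery and disc < 0.20:
        flags.append("discrimination index below 0.20")
    if not mastery and rpb < 0.20:
        flags.append("point-biserial below 0.20")
    return flags

# A hard but still-discriminating item: only the Upper-27% rule fires
flags = review_flags(p=0.36, upper=0.52, lower=0.27, disc=0.25, rpb=0.22)
```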
KR-20
• Used as an overall measure of reliability for the assessment.
• Measured on a scale from 0.0 to 1.0 with 0.0 being very poor and 1.0 being excellent.
• Quick notes:
1. Heavily influenced by the number of questions in the assessment
2. Heavily influenced by the number of students taking the assessment
3. The combination can FREQUENTLY lead to false positive and false negative KR-20 values.
GENERAL GUIDELINES
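For reference, KR-20 for dichotomously scored items is k/(k-1) * (1 - sum(p_i * q_i) / var), where k is the number of items, p_i the proportion correct on item i, q_i = 1 - p_i, and var the variance of total scores. A minimal sketch over a small hypothetical score matrix, which, per the notes above, is exactly the situation where the resulting value is least trustworthy:

```python
def kr20(matrix):
    """KR-20 reliability; rows are students, columns are 0/1 item scores."""
    n_items = len(matrix[0])
    totals = [sum(row) for row in matrix]
    mean_t = sum(totals) / len(totals)
    var_t = sum((t - mean_t) ** 2 for t in totals) / len(totals)
    pq = 0.0
    for j in range(n_items):
        p_j = sum(row[j] for row in matrix) / len(matrix)  # item difficulty
        pq += p_j * (1 - p_j)
    return (n_items / (n_items - 1)) * (1 - pq / var_t)

# Tiny hypothetical matrix: 5 students x 4 items
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(answers)   # 0.8 for this toy data
```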
BEST PRACTICES

Ways to increase the accuracy/usefulness of your stats:
• Item review process
  – Format
  – Level of difficulty
  – Alternative correct options
• Historical item analysis
  – Across assessments
  – Across versions
• Reuse/Recycle
NEW PORTAL
PHONE: +1.954.429.8889
EMAIL: info@examsoft.com
WEBSITE: learn.examsoft.com