MCQ test item analysis



Key Features of Student Assessment Methods:
Content and construct Validity
MCQ Test Item Analysis:
Difficulty index (p-value)
Discrimination index (DI)=Point-Biserial correlation (PBS)
Distractor efficiency (DE)
Internal Consistency Reliability
Writing a technical report (including remedial actions & recommendations)

Published in: Education

    1. MCQ Test Item Analysis. Presented by: Dr. Soha Rashed, Prof. of Community Medicine, Executive Director of Medical Education Department, Alexandria Faculty of Medicine, Egypt. 10 March 2013.
    2. Content outline
       • Why are we here (purpose of this session)?
       • What's next (needed future tasks)?
       • Key features of student assessment methods:
         - Content and construct validity
         - Reliability
         - Objectivity
       • MCQ test item analysis:
         - Difficulty index (p-value)
         - Discrimination index (DI) = point-biserial correlation (PBS)
         - Distractor efficiency (DE)
         - Internal consistency reliability
         - Writing a technical report (including remedial actions & recommendations)
       • MCQs evaluation checklist
    3. Why are we here (purpose of this session)? What's next (needed future tasks)?
    4. What do we assess? Achievement of the course intended learning outcomes (ILOs): knowledge, skills, and attitudes.
    5. ILOs: 5 domains
       1. Knowledge (recall) and understanding
       2. Intellectual skills
       3. Professional skills (practical, procedural, and clinical)
       4. General and transferable skills
       5. Professional attitudes and ethics
    6. Problem solving
    7. Written exams
       • Objective written exams: MCQ, matching, extended matching, TF, and short-answer Qs.
       • Essay Qs (long, short, and modified essay Qs).
    8. Key features of student assessment methods (quality standards)
       • Validity: the ability of the test to measure what it is supposed to measure.
       • Reliability: the consistency of the test scores over time, under different testing conditions, and with different raters.
       • Objectivity: the degree to which examiners agree on the correct answer (the question is scored accurately and fairly, free of examiner bias).
       • Practicability/feasibility: the overall ease of construction, administration, scoring, and reporting of an assessment instrument.
       • Acceptability: the responsiveness of faculty and students to the assessment.
       • Value/educational impact: the utility of the test results in producing meaningful conclusions (usable information) about the educational process.
    9. Validity. Validity refers to the extent to which an assessment instrument or a test measures what it intends to measure. Two types are covered here: content validity and construct validity.
    10. I. Content validity. Content validity ensures that the knowledge and skills covered by the test items are representative of the larger domain of knowledge and skills covered in the course.
    11. Test blueprint (number of items per content/subject area and learning objective to be tested)

        Content/subject area | Recall of facts | Understanding | Application | Problem solving* | Total | % weight
        ….                   | 3               | 3             | --          | --               | 6     | 6%
        ….                   | 2               | 4             | 2           | 2                | 10    | 10%
        ….                   | 4               | 3             | 4           | 4                | 15    | 15%
        ….                   | 5               | 4             | 4           | 4                | 17    | 17%
        ….                   | 4               | 10            | 8           | 8                | 30    | 30%
        ….                   | 3               | 7             | 5           | 7                | 22    | 22%
        Total                | 21              | 31            | 23          | 25               | 100   | 100%
        % weight             | 21%             | 31%           | 23%         | 25%              |       | 100%

        * Problem solving covers analysis, synthesis, and evaluation.
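The blueprint arithmetic can be checked with a short script. This is only a sketch: the item counts are the ones shown on the slide, while the area names (`area_1` to `area_6`) and the function name are placeholders, because the slide elides the actual content areas.

```python
# Cognitive levels used as blueprint columns (from the slide).
LEVELS = ["recall", "understanding", "application", "problem_solving"]

# Item counts per content area; the area names are hypothetical placeholders.
blueprint = {
    "area_1": [3, 3, 0, 0],
    "area_2": [2, 4, 2, 2],
    "area_3": [4, 3, 4, 4],
    "area_4": [5, 4, 4, 4],
    "area_5": [4, 10, 8, 8],
    "area_6": [3, 7, 5, 7],
}


def blueprint_weights(bp):
    """Return total item count, % weight per content area, and % weight per level."""
    total = sum(sum(row) for row in bp.values())
    row_weights = {area: 100 * sum(row) / total for area, row in bp.items()}
    col_weights = {
        level: 100 * sum(row[i] for row in bp.values()) / total
        for i, level in enumerate(LEVELS)
    }
    return total, row_weights, col_weights


total, row_weights, col_weights = blueprint_weights(blueprint)
print(total, row_weights, col_weights)
```

Comparing the computed weights with the intended ones is a quick way to spot a content area or cognitive level that is over- or under-sampled before the exam is assembled.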
    12. II. Construct validity. This refers to the compatibility/congruence between the learning objective (LO) to be assessed and the type of assessment. In other words, construct validity emphasizes that assessment techniques should be based on the nature of the LOs that they are supposed to measure.
    13. Construct validity

        Learning objective to be assessed | Assessment instrument
        Knowledge & understanding         | MCQ, TF, matching, SAQ, completion; short essay Q; long essay Q; oral exam
        Application & problem solving     | Clinical scenario-based MCQ; extended matching Q; modified essay Q; case study (patient management problem); oral exam
        Practical skills                  | OSPE
        Clinical skills                   | OSCE (real or simulated patients); short case; long case
        Procedural skills                 | OSCE (anatomical models)
    14. To increase test validity:
        • Use the test blueprint.
        • Focus on the important content areas.
        • Sample widely across the domains and across the content areas (% weight).
        To increase construct validity:
        • Use items that have high discriminative value (those testing higher cognitive/thinking abilities such as comprehension, application, and problem solving, e.g., applied, clinical scenario-based Qs).
        • Use multiple methods to obtain a valid, comprehensive assessment.
    15. 15. Reliability Refers to consistency or repeatability of test scores. In practice, a reliable assessment should yield the same result: - When given to the same student at two different times (Test-retest reliability ), or - By different examiners (Inter-rater reliability), - While keeping all the other variables (timing, length, content or other contextual features) as consistent as possible.
    16. 16. - Internal consistency (intra-exam, inter-item reliability): Coherence of the test items, or the extent to which the test questions are interrelated. Cronbach’s alpha
    17. 17. MCQs are highly reliableThe results of the test are unlikely to beinfluenced by:  when the test is administered,  when the test is scored, or by  who does the scoring.Hence the term “objective” is often usedwhen referring to these kinds ofassessments.
    18. 18. On the other hand, reliability is an importantconcern when grading essay questions, ratingclinical skills or scoring other assessmentsrequiring judgment or interpretation.In these situations, clear scoring criteria areneeded to attain a high level of reliability, regardlessof whether one or multiple people will be involved ingrading the responses.
    19. 19. How to improve reliability of the test items? Writing clear unambiguous questions and test instructions improve reliability by generating consistent patterns of response from the students. Use of structured predefined marking scheme: An answer key for MCQs and essay Qs, standardized checklists (in OSCEs/OSPEs) with clear scoring criteria. A longer test with multiple items is more likely to have better reliability than a shorter test with a limited number of items as the former evens out possible inconsistencies of individual items.
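The effect of test length on reliability is commonly estimated with the Spearman-Brown prediction formula. The slides do not name this formula, so the sketch below is an illustration rather than the presenter's method, and it assumes that the added items are of comparable quality to the existing ones. The function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability after changing test length by `length_factor`.

    `reliability` is the current reliability coefficient (e.g. Cronbach's alpha);
    `length_factor` is new length / old length (2.0 means twice as many items).
    Assumes the added items behave like the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Doubling a test whose reliability is 0.60:
print(spearman_brown(0.60, 2.0))
```

Under these assumptions, doubling a test with a reliability of 0.60 gives a predicted reliability of about 0.75, which matches the slide's point that more items even out the inconsistencies of individual items.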
    20. Desirable features of valid and reliable assessments
        • There is a clearly specified set of learning outcomes.
        • Assessment tasks are matched to the stated learning outcomes.
        • Assessment tasks are a representative sample of the stated learning outcomes.
        • Assessment tasks are at the appropriate level of difficulty.
    21. Desirable features of valid and reliable assessments (continued)
        • Assessment tasks effectively distinguish (discriminate) between achievers and non-achievers.
        • Clear instructions are given for the administration, scoring, and interpretation of the assessment results.
    22. MCQ test item analysis
    23. Remark Classic OMR (Optical Mark Recognition) software
    24. Parameters commonly assessed in MCQ test item analysis
        • Item analysis:
          - Difficulty index (p-value)
          - Discrimination index (DI) = point-biserial correlation (PBS)
          - Distractor efficiency (DE)
        • Internal consistency reliability
    25. Do the final grades attained by students actually reflect their competences? Do they produce meaningful conclusions about their performance?
    26. Difficulty and discrimination indices
    27. Difficulty index (p-value)
        • Calculated as the percentage of students who answered the item correctly.
        • The range is from 0% to 100%, more typically written as a proportion from 0.00 to 1.00 (p-value).
        • The higher the value, the easier the item.
        Difficulty level:
        • d ≥ 75% = very easy
        • 70% ≤ d < 75% = easy
        • 30% ≤ d < 70% = moderately difficult to moderately easy (recommended)
        • 25% ≤ d < 30% = difficult
        • d < 25% = very difficult
        Action thresholds:
        • P-values above 0.90 indicate very easy items that should not be reused in subsequent tests. If almost all of the students get the item correct, the concept is probably not worth testing.
        • P-values below 0.20 indicate very difficult items that should be reviewed for possibly confusing language, removed from subsequent tests, and/or flagged as an area for re-instruction. If almost all of the students get the item wrong, either there is a problem with the item or the students did not grasp the concept.
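A minimal sketch of the difficulty index calculation follows. It assumes each item is scored 1 (correct) or 0 (incorrect), and it reads the slide's difficulty bands as non-overlapping ranges (75% and above very easy, 70% to under 75% easy, 30% to under 70% recommended, 25% to under 30% difficult, below 25% very difficult). The function names are illustrative, not part of any OMR software.

```python
def difficulty_index(item_scores):
    """Proportion of students who answered the item correctly (1 = correct, 0 = wrong)."""
    if not item_scores:
        raise ValueError("item_scores must not be empty")
    return sum(item_scores) / len(item_scores)


def difficulty_label(p):
    """Map a p-value (0.0 to 1.0) to the difficulty bands listed on the slide."""
    if p >= 0.75:
        return "very easy"
    if p >= 0.70:
        return "easy"
    if p >= 0.30:
        return "moderate (recommended)"
    if p >= 0.25:
        return "difficult"
    return "very difficult"


# Ten students, seven of whom answered the item correctly.
scores = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
p = difficulty_index(scores)
print(p, difficulty_label(p))
```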
    28. Discrimination index (DI) = point-biserial correlation (PBS)
        • It describes the ability of an item to distinguish between high and low scorers (the upper and lower 27% of students after ordering them by total score in descending order).
        • The range is from -1.00 to +1.00; acceptable items have positive values. The higher the value, the more discriminating the item.
        • A highly discriminating item is one that students with high test scores got correct and students with low test scores got incorrect.
        • Items with discrimination values near or below zero should be removed from the test. A negative value indicates that students who did poorly on the test overall did better on that item than students who did well overall; the item may be confusing for the better-scoring students in some way.
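The slide uses "discrimination index" and "point-biserial correlation" interchangeably. Strictly, these are two different statistics that are read in the same way: the upper-lower index compares the top and bottom 27% of students, while the point-biserial correlates the item score with the total score across all students. The sketch below computes both under stated assumptions: items are scored 0/1, the total score includes the item itself, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def upper_lower_di(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination index using the top and bottom `fraction` of students."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    k = max(1, round(len(order) * fraction))  # group size, at least one student
    upper, lower = order[:k], order[-k:]
    correct_upper = sum(item_scores[i] for i in upper)
    correct_lower = sum(item_scores[i] for i in lower)
    return (correct_upper - correct_lower) / k


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and the total test score."""
    p = mean(item_scores)
    sd = pstdev(total_scores)
    if p in (0, 1) or sd == 0:
        return 0.0  # statistic is undefined here; returning 0.0 is a simplifying assumption
    mean_correct = mean(t for x, t in zip(item_scores, total_scores) if x == 1)
    mean_wrong = mean(t for x, t in zip(item_scores, total_scores) if x == 0)
    return (mean_correct - mean_wrong) / sd * math.sqrt(p * (1 - p))


item = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
totals = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(upper_lower_di(item, totals), point_biserial(item, totals))
```

Both values can be negative, which happens when low scorers answer the item correctly more often than high scorers.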
    29. Interpreting the discrimination index
        • 0.40 or higher = very good discrimination
        • 0.30 to 0.39 = reasonably good discrimination, but possibly subject to improvement
        • 0.20 to 0.29 = marginal/acceptable discrimination (subject to improvement)
        • 0.00 to 0.19 = poor discrimination (to be rejected or improved by revision)
        • Negative DI = low-performing students selected the correct answer more often than high scorers (to be rejected)
    30. Use items that have high discrimination values in the test (those testing higher cognitive/thinking abilities such as comprehension, application, and problem solving). Link questions to case scenarios. Ask the question in the context of a clinical situation, diagram, graph, image, radiologic image, histopathological section, laboratory findings, etc.
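The interpretation bands can be expressed as a small lookup. This is a sketch: the slide gives the bands to two decimals, so the code assumes that a value between two bands (for example 0.195) falls into the lower one. The function name is hypothetical.

```python
def discrimination_label(di):
    """Map a discrimination index to the interpretation bands listed on the slide."""
    if di < 0:
        return "negative (reject)"
    if di >= 0.40:
        return "very good"
    if di >= 0.30:
        return "reasonably good"
    if di >= 0.20:
        return "marginal/acceptable"
    return "poor (reject or revise)"


for value in (0.45, 0.33, 0.25, 0.10, -0.15):
    print(value, discrimination_label(value))
```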
    31. Distractor efficiency
        • The distractors are important components of an item, as they show a relationship between the total test score and the distractor chosen by the student.
        • Distractor efficiency is one tool that tells whether the item was well constructed or failed to perform its purpose.
        • The quality of the distractors influences student performance on a test item. Ideally, low-scoring students, who have not mastered the subject, should choose the distractors more often, whereas high scorers should discard them more frequently while choosing the correct option.
        • Any distractor that has been selected by less than 5% of the students is considered a non-functioning distractor (NF-D).
        • Reviewing the options can reveal potential errors of judgment and inadequate performance of distractors. These poor distractors can be revised, replaced, or removed.
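The 5% rule for non-functioning distractors can be sketched as follows. The slide does not give a formula for distractor efficiency itself; the code assumes one common convention, the percentage of an item's distractors that are functioning. The function name and the sample responses are hypothetical.

```python
from collections import Counter


def distractor_report(responses, options, key, threshold=0.05):
    """Return (non-functioning distractors, distractor efficiency in %).

    `responses` holds the option chosen by each student, `options` the option
    labels of the item, and `key` the correct option. A distractor chosen by
    fewer than `threshold` of the students counts as non-functioning.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    n = len(responses)
    distractors = [opt for opt in options if opt != key]
    non_functioning = [opt for opt in distractors if counts[opt] / n < threshold]
    # Assumed convention: efficiency = share of distractors that are functioning.
    efficiency = 100 * (len(distractors) - len(non_functioning)) / len(distractors)
    return non_functioning, efficiency


# 40 students: 30 chose the key (A), 6 chose B, 3 chose C, 1 chose D.
responses = ["A"] * 30 + ["B"] * 6 + ["C"] * 3 + ["D"] * 1
print(distractor_report(responses, ["A", "B", "C", "D"], key="A"))
```

In this made-up example option D is chosen by 2.5% of the students, so it is flagged as non-functioning and two of the three distractors remain functioning.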
    32. Internal consistency reliability
        • Internal consistency reliability indicates how well the items are correlated with one another. It measures whether multiple items within an instrument yield similar results.
        • Cronbach's alpha is used as the coefficient of internal consistency.
        Interpreting Cronbach's alpha:
        • The range is normally from 0.0 to 1.0, with 0.7 generally accepted as the threshold for acceptable reliability.
        • High reliability indicates that the items are all measuring the same thing, or general construct.
        • The higher the value, the more reliable the overall test score.
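Cronbach's alpha can be computed from a matrix of item scores with the standard formula α = k/(k-1) × (1 - Σ item variances / variance of total scores), where k is the number of items. The sketch below assumes one row per student and one score per item, with no missing responses; the function name and the tiny data set are illustrative only.

```python
from statistics import pvariance


def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a list of rows (students) of per-item scores."""
    k = len(score_matrix[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Variance of each item across students.
    item_variances = [pvariance([row[j] for row in score_matrix]) for j in range(k)]
    # Variance of the students' total scores.
    total_variance = pvariance([sum(row) for row in score_matrix])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Four students, three items scored 0/1 (made-up data).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(cronbach_alpha(scores))
```

Population variance is used for both the items and the totals; using the sample variance for both would give the same ratio. For items scored 0/1, this calculation is equivalent to the KR-20 coefficient.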
    33. Interpreting Cronbach's alpha

        Cronbach's alpha | Internal consistency
        α ≥ 0.9          | Excellent
        0.8 ≤ α < 0.9    | Very good
        0.7 ≤ α < 0.8    | Good (there are probably a few items which could be improved)
        0.6 ≤ α < 0.7    | Somewhat low (there are probably some items which could be improved)
        0.5 ≤ α < 0.6    | Poor (suggests need for revision of the test)
        α < 0.5          | Questionable/unacceptable (this test should not contribute heavily to the course grade, and it needs revision)
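For reporting, the interpretation bands can be applied programmatically. This is a minimal sketch with a hypothetical function name; the cutoffs and labels are taken from the slide.

```python
def alpha_label(alpha):
    """Map a Cronbach's alpha value to the interpretation bands listed on the slide."""
    bands = [
        (0.9, "excellent"),
        (0.8, "very good"),
        (0.7, "good"),
        (0.6, "somewhat low"),
        (0.5, "poor"),
    ]
    for cutoff, label in bands:
        if alpha >= cutoff:
            return label
    return "questionable/unacceptable"


for value in (0.93, 0.75, 0.42):
    print(value, alpha_label(value))
```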
    34. Practice exercises
        • Interpreting Remark Classic OMR (Optical Mark Recognition) software outputs
        • Writing a technical report on MCQ test item analysis (including remedial actions & recommendations)
        • Use of the MCQs evaluation checklist