2. Michael T Brannick ¹, H Tugba Erol-Korkmaz ² & Matthew Prewett ¹
¹ Department of Psychology, College of Arts and Sciences, University of South Florida, Tampa, Florida, US
² Department of Psychology, Middle East Technical University, Ankara, Turkey
Impact Factor 2016: 4.005
A SYSTEMATIC REVIEW OF THE RELIABILITY OF
OBJECTIVE STRUCTURED CLINICAL
EXAMINATION (OSCE) SCORES
Blackwell Publishing Ltd 2011.
Medical Education
2011: 45: 1181–1189
Tesneem Layas, 5th-year medical student
3. OUTLINES OF PRESENTATION
• Introduction:
– OSCE Definition
– Important definitions related to the study
– Comparison of OSCE with other forms of examination
• Questions to be answered by the study
• Study design and method
• Results
• Discussion
• Limitations of the study
• Conclusion
4. Objective Structured Clinical Examination
(OSCE)
Assessment of clinical competence and of skills in the diagnosis &
treatment of patients, carried out in a structured, well-planned way
with attention paid to objectivity.
5. Important definitions related to the study
• Reliability (reproducibility):
The degree to which an assessment tool produces stable &
consistent results.
• Alpha coefficient (Cronbach’s):
It measures reliability by examining how closely related a set of items
are as a group.
• Generalisability coefficient:
It is used to determine reliability, particularly for descriptive
comparisons.
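Cronbach's alpha, as defined above, can be computed directly from a candidates-by-stations score matrix. A minimal sketch in Python (the score data here are invented purely for illustration):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (candidates x items-or-stations) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items (or stations)
    item_var = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of candidates' total scores
    return k / (k - 1) * (1 - item_var.sum() / total_var)

# Hypothetical data: 5 candidates scored at 4 stations
scores = [[7, 6, 8, 7],
          [5, 5, 6, 5],
          [9, 8, 9, 8],
          [4, 5, 5, 4],
          [6, 6, 7, 6]]
print(round(cronbach_alpha(scores), 2))  # → 0.98
```

Higher values indicate that the stations rank candidates consistently; the invented data above are deliberately highly consistent, hence the high alpha.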
6. OSCE vs. other exams
• More realistic context, content and procedures.
(vs. paper & pencil exam)
• Patients are standardized across examinees so scores are
comparable.
(vs. assessments that use real patients)
7. Questions to be answered by the study
1. What reliability should we expect on average when developing
an OSCE?
2. What is the likely range of such values?
3. What factors appear to influence the expected reliability?
8. STUDY DESIGN & METHOD -1
• PsycINFO & PubMed were searched for 'OSCE' + 'Reliability', supplemented by reference lists.
• 98 journal articles yielded 64 studies (studies reporting no empirical reliability were eliminated).
• These studies contributed 457 reliability values (coefficient alpha "Cronbach's", generalizability coefficients and other reliability values), roughly 7-8 reliability values per study.
• 188 alpha values came from 39 samples:
– 100 'alpha across stations'
– 53 'alpha across items'
– 35 not indicated
• These formed 2 SEPARATE META-ANALYSES.
9. STUDY DESIGN & METHOD-2
• The six parameters analyzed ("Across-stations" / "Across-items") were:
1) Content (Communication scale/ Clinical scale)
2) Number of raters (1 or 2)
3) Scale format (Checklist/ Likert scale)
4) Context (High stakes exams/ Research study OSCE)
5) Examiner type¹ (Faculty member judge/ “SP” Standardized Patient judge)
6) Examiner type² (Content expert judge/ "SP")
10. RESULTS-1
Reliability was better than average with:
1. A greater number of stations.
2. A higher number of examiners per station.
• Communication (interpersonal skills) scales were evaluated less
reliably across stations, and more reliably within stations, than
clinical skills scales.
11. Results-2:
Across-stations reliability estimates differed significantly by:
• content (clinical vs. communication)
• number of raters (2 raters vs. 1 rater)
Alpha across stations was 0.66
(95% confidence interval [CI] 0.62–0.70)
Parameter               K    Mean   95% CI
Content
  Communication scale   16   0.55   0.45–0.63
  Clinical scale        67   0.69   0.66–0.73
Number of raters
  1 rater               90   0.65   0.61–0.68
  2 raters               8   0.81   0.73–0.88
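The pooled estimates above combine alpha values from many studies. The published meta-analysis used formal weighting methods; purely as an illustration of the idea, an unweighted mean of study-level alphas with a normal-approximation 95% CI (the alpha values below are invented) might look like:

```python
import math

def mean_with_ci(values, z=1.96):
    """Unweighted mean of reliability estimates with a normal-approximation
    95% CI. Illustrative sketch only; real meta-analyses weight studies."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                               # standard error of the mean
    return mean, mean - z * se, mean + z * se

# Hypothetical alpha estimates from several OSCE studies
alphas = [0.61, 0.70, 0.66, 0.72, 0.59, 0.68]
mean, lo, hi = mean_with_ci(alphas)
```

The width of the interval shrinks as more studies contribute estimates, which is why the pooled CIs in the tables are much narrower for well-represented categories (large K) than for sparse ones.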
12. Results-3:
Across-items reliability estimates differed significantly by:
• scale format (Likert scale vs. checklist)
• examiner type (content expert vs. SP, standardized patient)
Alpha across items was 0.78
(95% CI 0.73–0.82).
Parameter                K    Mean   95% CI
Scale format
  Likert scale           21   0.88   0.85–0.91
  Checklist scale        28   0.67   0.62–0.72
Examiner type²
  Content expert rater    4   0.61   0.29–0.81
  SP rater               18   0.77   0.68–0.83
13. Discussion
• Large variability in reliability at any given number of stations.
• Increasing the number of stations is expensive, but some authors have
argued that adding stations is a better use of resources than adding raters.
• Communication evaluations were subjective, so they were less reliable than
measures of clinical skills.
• Increasing the number of items on a communication scale increased the
reliability estimate without gaining any real precision in measurement.
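The trade-off behind adding stations can be explored with the classical Spearman–Brown prophecy formula (a standard psychometric tool, not a method specific to this paper). Starting from the pooled across-stations alpha of 0.66:

```python
def spearman_brown(r, factor):
    """Projected reliability after lengthening a test by `factor` times."""
    return factor * r / (1 + (factor - 1) * r)

def lengthening_needed(r, target):
    """Factor by which the test must be lengthened to reach `target` reliability."""
    return target * (1 - r) / (r * (1 - target))

print(round(spearman_brown(0.66, 2), 2))         # reliability with double the stations
print(round(lengthening_needed(0.66, 0.80), 2))  # lengthening factor to reach 0.80
```

Under these assumptions, doubling the number of stations lifts a reliability of 0.66 to about 0.80, which illustrates why adding stations (rather than raters) is often argued to be the better use of resources.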
14. Limitations of the study
• The number of features that could be coded and analysed was limited
by the number of studies of each specific kind and by their content.
• Too few studies were available to examine other forms of
reliability.
• Detailed information needed for coding was rarely provided in the
studies.
15. Conclusion-1
Using two examiners and a large number of stations improves reliability.
It is more difficult to reliably assess communication skills than clinical
skills.
Overall scores on the OSCE are often not very reliable.
16. Conclusion-2
The large variability in estimates suggests that non-analysed
features might be influential determinants.
Some of these limiting factors can be addressed by more
complete reporting of information.
Others cannot be addressed without changing the design of
the OSCE.