Advances in Health Sciences Education (2005) 10: 105–113
© Springer 2005
Item Analysis to Improve Reliability
for an Internal Medicine Undergraduate OSCE
CHIRAYU AUEWARAKUL1,2,4,*, STEVEN M. DOWNING3,
RUNGNIRAND PRADITSUWAN1, and UAPONG JATURATAMRONG2
1Department of Medicine, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand; 2Office of Medical Education, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand; 3Department of Medical Education, University of Illinois at Chicago, Chicago, USA; 4Director of Medical Education Research Unit, Office of Medical Education, Faculty of Medicine Siriraj Hospital, 2 Prannok Road, Bangkoknoi, Bangkok 10700, Thailand (*Corresponding author: Phone: 662-419-7000 ext. 4448; Fax: 662-418-1602; E-mail:
Received 13 January 2004; accepted 16 February 2005
Abstract. Utilization of objective structured clinical examinations (OSCEs) for ﬁnal assessment
of medical students in Internal Medicine requires a representative sample of OSCE stations. The
reliability and generalizability of OSCE scores provide validity evidence for OSCE scores and support their contribution to the final clinical grade of medical students. The objective of this
study was to perform item analysis using OSCE stations as the unit of analysis and evaluate the
extent to which OSCE score reliability can be improved using item analysis data. OSCE scores
from eight cohorts of fourth-year medical students (n = 435) in a 6-year undergraduate program
were analyzed. Generalizability (G) coeﬃcients of OSCE scores were computed for each cohort.
Item analysis was performed by considering each OSCE station as an item and computing the
corrected item-total correlation. OSCE stations which negatively impacted the reliability were
deleted and the G-coeﬃcient was recalculated. The G-coeﬃcients of OSCE scores from the eight
cohorts ranged from 0.48 to 0.80 (median 0.62). The median number of OSCE stations that
negatively impacted the G-coefficient was 3.5 (out of a median of 25 total stations). When the "problem stations" were deleted, the median G-coefficient across the eight cohorts increased from 0.62 to 0.72. In conclusion, item analysis of OSCE stations is useful and should be performed to
improve the reliability of total OSCE scores. Problem stations can then be identiﬁed and improved.
Key words: clinical competence, generalizability, item analysis, OSCE, performance assessment,
reliability, undergraduate medical education
Introduction

Assessment of the clinical competence of medical students should consist
of multiple testing methods (Wass et al., 2001). Traditionally, written tests
including short-essays and multiple-choice questions are the primary
methods used to assess the knowledge of medical students. Medical
knowledge alone, however, does not always predict clinical competence of
the students (Miller, 1990). The Faculty of Medicine Siriraj Hospital, Mahidol University, the oldest medical school in Thailand, has traditionally relied on summative oral examinations, in the form of short cases and long cases at the end of clinical rotations, as a means of assessing clinical performance. This type of assessment, however, has been shown to be unreliable because the pass/fail decision depends on a single judgment of a student by one faculty member on one or two real patient cases
(van der Vleuten, 2000). The objective structured clinical examination
(OSCE) was developed in 1975 as a means of assessing clinical competence (Harden et al., 1975), and this performance-based method was then introduced in Thailand around 1985. The OSCE has gradually replaced, or been used in conjunction with, the short- and long-case oral examinations and presently contributes meaningfully to the final clinical grade of
medical students in several clinical departments at the Faculty of Medicine
Siriraj Hospital, especially in the Department of Medicine.
The undergraduate MD curriculum at the Faculty of Medicine Siriraj
Hospital is a 6-year program. The clinical years span years 4–6, during which students rotate through all major specialties and electives; students are required to rotate through Internal Medicine in every clinical year. At the end of the fourth-year Internal Medicine rotation, students are required to take an OSCE as part of the final evaluation. The final grade depends on a composite score, 25% of which is derived from the OSCE
scores. Other components include problem-solving and general medical
knowledge multiple-choice questions (MCQ) (25%) and ward, outpatient
clinic, and preceptor evaluation (50%). Since the OSCE's introduction in the Department of Medicine, its validity evidence and reliability have never been evaluated; whether it tests what it purports to test is not
known. The reliability and generalizability of OSCE scores should provide
some validity evidence for OSCE scores and their contribution to the ﬁnal
grade of fourth-year medical students (Crossley et al., 2002; Downing, 2003).
A representative sample of OSCE stations needs to be demonstrated in order
to justify the continued use of OSCE for ﬁnal grade evaluation (Colliver and
Williams, 1993; van der Vleuten, 2000).
The purpose of this study was to evaluate the eﬀect on reliability of removing
poorly performing OSCE stations. Although item analysis has routinely been
performed for most knowledge-based MCQ examinations, the utilization of
item analysis for undergraduate Internal Medicine OSCE is less frequently
practiced and reported in the literature (Kassam, 2003; Newble and Swanson,
1988). In this study, we undertook a Generalizability (G) study of 8 cohorts
of fourth-year Internal Medicine OSCEs, followed by an item analysis using
OSCE stations as the unit of analysis.
Methods

Subjects and Instruments
In each academic year, 4 cohorts of fourth-year medical students in a 6-year
MD program rotate through Internal Medicine wards. At the end of a
9-week rotation, students are required to take OSCEs as part of their ﬁnal
evaluation in Internal Medicine. In this study, data from 8 cohorts of fourth-year medical students in academic years 2001 and 2002 were collected. The total number of fourth-year medical students studied was 435.
The OSCE is conducted during the last week of the students’ 9-week rotation.
Junior faculty members in the Department of Medicine serve as raters. The 20 to 25 four-minute OSCE stations are designed to test history-taking and
physical examination skills, as well as interpretation of laboratory tests and
procedural skills. Some stations speciﬁcally test communication and counseling skills but in all stations, students must demonstrate appropriate skills
and attitudes in approaching the patients. Real volunteer patients and simulated patients are routinely used. All patients are instructed to act in a
standardized manner by the team of faculty members who developed the
questions. During the 2-hour examination, each student meets with the same
rater for each station.
OSCE questions are drawn from a pool of previously used as well as newly
developed questions. Faculty members in each discipline of Internal Medicine
write questions, which are reviewed by the Internal Medicine Undergraduate
Committee. Each OSCE is developed from a blueprint that corresponds to the
expected clinical knowledge and performance of fourth-year students.
Checklists are used at each station, with a total score computed for each checklist. OSCEs account for 25% of the final composite scores.
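As a minimal illustration of this weighting (the function and score names below are hypothetical; only the 25/25/50 split comes from the description above), the composite could be computed as:

```python
# A minimal sketch of the composite grade weighting described above.
# Component names are hypothetical; the 25/25/50 weights are from the text.
def composite_score(osce: float, mcq: float, ward: float) -> float:
    """All components on a 0-100 scale; returns the final composite."""
    return 0.25 * osce + 0.25 * mcq + 0.50 * ward

# e.g. composite_score(osce=60.2, mcq=70.0, ward=75.0) -> 70.05
```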
Item Analysis of OSCE
Each OSCE consists of questions from all subspecialties of Internal Medicine. OSCE questions were designed so that each station assessed the fourth-year medical students through a multidisciplinary approach. A single OSCE station may assess 3–4 subspecialties simultaneously; for example, a station on "history-taking of an elderly patient who presents with weight
loss" has checklist items related to oncology, hematology, endocrinology, gastroenterology and socioeconomic problems. On average, each
OSCE consists of 6 history-taking stations, 7 physical-examination stations,
11 laboratory-tests and procedural-skills stations and 1 counseling station.
G-Studies of OSCE
A G-study was performed for each student cohort using a random-model p × i (persons crossed with items) design (Brennan, 2001; Colliver et al., 1989; Cronbach et al., 1972). Item analysis was performed using each OSCE station as an item and computing the corrected item-total correlation for each station. OSCE stations that negatively impacted reliability were deleted and the generalizability coefficient was recalculated. The content and nature of each problem station were also reviewed to determine the possible cause of the problem.
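As an illustrative sketch of these two computations (assuming a complete students-by-stations score matrix; the function names and NumPy-based estimation are our own, not the software used in this study), the variance components can be solved from the two-way ANOVA expected mean squares, and the G-coefficient for a mean over n_i stations is var_p / (var_p + var_pi,e / n_i):

```python
# A minimal sketch, assuming a complete persons x stations score matrix
# (rows = students, columns = OSCE stations); illustrative only.
import numpy as np

def g_study(scores: np.ndarray):
    """Variance components and G-coefficient for a random p x i design."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    # Sums of squares for the two-way crossed design, one observation per cell
    ss_p = n_i * ((scores.mean(axis=1) - grand) ** 2).sum()   # persons
    ss_i = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()   # stations
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_i      # pi,e residual
    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    # Expected-mean-square solutions (Brennan, 2001)
    var_pi_e = ms_res
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    # Relative G-coefficient for the mean over the n_i stations taken
    g = var_p / (var_p + var_pi_e / n_i)
    return var_p, var_pi_e, g

def corrected_item_total(scores: np.ndarray) -> np.ndarray:
    """Correlate each station with the total of the remaining stations."""
    totals = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, j], totals - scores[:, j])[0, 1]
        for j in range(scores.shape[1])
    ])

# Usage, mirroring the procedure described above: drop stations with
# negative corrected item-total correlations and recompute G per cohort.
# r = corrected_item_total(scores)
# _, _, g_after = g_study(scores[:, r >= 0])
```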
Results

The mean OSCE scores, standard deviations (SD) and G-coefficients of
the eight OSCE cohorts are shown in Table I. The median number of stations
and students per cohort was 25 and 54.5, respectively. The mean OSCE
scores ranged from 58.57% to 61.27% with a median of 60.18%. The
G-coefficients ranged from 0.48 to 0.80 with a median of 0.62. Five of the 8
cohorts had G-coeﬃcients over 0.60.
Identiﬁcation of ‘‘Problem Stations’’
Table II shows the number of stations per cohort that had a negative impact on the G-coefficients. The median number of "problem stations" per cohort
was 3.5 out of 25 stations. The recalculated G-coefficients ranged from 0.60 to 0.83 with a median of 0.72, compared to the median G-coefficient of 0.62 before deletion of the problem stations.
Table I. G-coefficients of Internal Medicine undergraduate OSCE scores (per-cohort number of stations, number of students, mean, SD and G-coefficient; values not reproduced)
Table II. Number of stations deleted and G-coefficient changes in each cohort (number of stations deleted per total stations, with G-coefficients before and after deletion; values not reproduced)
Table III. Problem stations categorized according to specific skills tested (number of problem stations in each category, including laboratory and procedural skills and physical examination skills; values not reproduced)
Upon review of each cohort, the problem stations were identiﬁed as shown
in Table III. Of the 33 problem stations, 70% were laboratory-test interpretation stations (Table IV). The subspecialty with the most frequent negative
impact on generalizability was infectious disease and the stations frequently
found to be problematic were the ones that test the students’ ability to
diagnose infectious organisms under the microscope. The G-coefficients of the stations with laboratory questions were lower than those of the non-laboratory stations (medians of 0.40 and 0.64, respectively). The students' mean scores on laboratory stations were also lower than on the non-laboratory stations (medians of 5.27 vs. 6.40, respectively). (The medians were computed across cohorts to simplify the presentation of data.)
Discussion

Since they were developed in 1975, OSCEs have been extensively utilized for
various purposes in medical education (van der Vleuten and Swanson, 1990).
Table IV. OSCE station content with negative impact on G-coeﬃcients
Lab (urine exam)
Procedural skill (oxygen administration)
PE (skin exam)
Lab (synovial ﬂuid exam)
PE (weight loss)
PE (lung exam)
Lab (sputum exam)
Lab (blood smear)
Lab (urine exam)
Lab (stool parasite)
Lab (urine exam)
Lab (genetic disease)
Lab (spinal ﬂuid exam)
History-taking (social problem in the elderly)
Lab (urine exam)
Procedural skill (central venous pressure evaluation)
Lab (stool parasite)
Lab (sputum exam)
Lab (genetic disease)
Lab (blood gases)
PE (weight loss)
Lab (pus exam)
PE (visual ﬁeld testing)
Lab (sputum exam)
Counseling (diabetic patient)
Lab (blood smear)
Abbreviations: PE (Physical Examination), Lab (Laboratory Investigations).
In the teaching and learning arena, OSCEs play an important role in the clinical learning process by exposing medical students to standardized or real patients in various clinically relevant situations designed by medical school faculty. OSCEs have also been utilized in
many medical schools around the world for formative and summative assessment: to give feedback to students at the end of clinical rotations, as one component of the final composite scores or grades, or as a prerequisite for graduation from medical school (Collins and Gamble, 1996; van der Vleuten, 2000).
In order to use OSCEs as the ﬁnal summative assessment, the reliability
and generalizability of OSCEs in each local program should be evaluated
(Boulet et al., 2003). Reliability data provide one major source of validity
evidence according to Messick’s unitary concept of construct validity
(Downing, 2003; Messick, 1989). In this study of 8 cohorts of fourth-year
medical students, the G-coefficients were on average above 0.60, with a range of 0.48–0.80. These results were comparable to other reports of
reliability achieved in locally developed OSCEs (A-Latif, 1992; Kassam,
2003; Matsell et al., 1991; Petrusa et al., 1990; Regehr et al., 1998; Verhoeven et al., 2000). Our OSCE blueprint, however, was quite different from those of some Western medical schools, since we placed major emphasis not only on data gathering, physical examination and communication skills, but also on our students' ability to interpret laboratory tests. Approximately 40% of our OSCE stations were thus related to laboratory tests. Laboratory skills are very important because our graduates must work in rural hospitals with limited facilities, where they must be able to supervise medical technicians or even perform the laboratory tests themselves. For example, the laboratory tests in our
OSCE include blood smears for the diagnosis of malaria, dengue hemorrhagic fever and thalassemia; stool examination for parasites; urine examination for urinary tract infection and nephrotic syndrome; spinal fluid examination for meningitis; and chest x-rays for tuberculosis and pneumonia.
Using item analysis for the Internal Medicine OSCE has revealed
important and useful information. When we deleted the problem stations, the G-coefficients improved considerably, so that all 8 cohorts had G-coefficients greater than 0.60. This suggested that the problem stations
might have tested skills that students did not acquire or were not taught
in the clinical rotation. We examined the content of the problem stations
and found that the majority of these stations identiﬁed by item analysis
were laboratory skills related. The improvement in the G-coeﬃcients by
eliminating these stations is likely due to the nature of skills tested in the
laboratory-skills stations, since laboratory skills are obviously diﬀerent
from data-gathering and physical examination skills. The overall lower
mean scores on the laboratory stations as compared to the non-laboratory
stations could also indicate that the students did not adequately acquire
these particular laboratory skills during the Internal Medicine rotation.
Students who perform well in communication skills may not do as well in
the interpretation of blood smears, urine analysis, and so on. Achieving
competence in one area is thus not necessarily a good predictor of competence in another (van der Vleuten, 2000; Wass et al., 2001).
We subsequently computed the G-coefficients for the laboratory-skills stations alone and found them to be low, with a median of 0.40 (range 0.04–0.63), suggesting "case specificity" in students' laboratory skills. The reliability of the non-laboratory-skills stations was better than that of the laboratory-skills stations, as shown by G-coefficients of 0.56–0.87
(median of 0.64). The lower reliability of laboratory stations could be
accounted for by the smaller number of laboratory stations (mean of 9
stations) as compared to non-laboratory stations (mean of 15 stations).
However, we also observed that the inter-station correlations were highly variable among laboratory stations across all cohorts. These results could be due to the fact that students' laboratory skills are case-specific and the stations independently tested unique and distinct skills, or due to problems inherent in the station scoring process that need to be investigated further.
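To illustrate how much the smaller number of laboratory stations alone could account for, a simple decision-study projection under the same p × i model follows; the variance components below are hypothetical, chosen only to reproduce a G-coefficient of about 0.40 at 9 stations:

```python
# Illustrative D-study projection: with the person and residual variance
# components held fixed, the G-coefficient grows with the number of stations.
def projected_g(var_p: float, var_pi_e: float, n_stations: int) -> float:
    """Relative G-coefficient for a mean over n_stations stations."""
    return var_p / (var_p + var_pi_e / n_stations)

# Hypothetical components consistent with G = 0.40 at 9 stations:
var_p, var_pi_e = 1.0, 13.5              # 1 / (1 + 13.5/9) = 0.40
print(projected_g(var_p, var_pi_e, 15))  # ~0.53 at 15 stations
```

Under these assumed components, 15 laboratory stations would project to a G-coefficient of only about 0.53, still below the non-laboratory median of 0.64, consistent with case specificity or scoring problems contributing as well.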
Since students are required to perform and interpret laboratory tests in
order to meet the requirements of our department and the National Medical Council Standards, the problem stations cannot simply be eliminated, because this would reduce the validity evidence for the OSCE. Rather, the stations should be remediated and the curriculum further changed to facilitate students' learning of these skills. Validity should not be sacrificed for the sake of reliability (Downing, 2003; Norman et al., 1991; van der Vleuten, 2000).
In conclusion, this Internal Medicine OSCE has been shown to be reliable
and acceptable for local use. Item analysis of OSCE stations is useful and
should be performed to improve the reliability of total OSCE scores. Careful structuring of OSCE questions and remediation of problem stations are crucial to support the continued use of OSCEs for final clinical performance assessment.
Acknowledgements

We would like to thank the Foundation for Advancement of International Medical Education and Research (FAIMER) for financial support of Dr. Chirayu Auewarakul through the International Fellowship in Medical Education (IFME)-2002 award. The faculty development awards from the Anandamahidol Foundation and the Siriraj Chalermprakiat Fund, Thailand, are also appreciated. We thank Ms. Jaree Prasarnkul, Ms. Pichavadee Sae-ung and Mr. Somkuan Sriyounglek at the Office of Undergraduate Medical Education, Department of Medicine, for their excellent work with OSCE administration, score processing and data collection.
References

A-Latif, A. (1992). An examination of the examinations: The reliability of the objective structured clinical examination. Medical Teacher 14: 179–183.
Boulet, J.R., McKinley, D.W., Whelan, G.P. & Hambleton, R.K. (2003). Quality assurance
methods for performance-based assessments. Advances in Health Sciences Education 8: 27–47.
Brennan, R.L. (2001). Generalizability Theory. New York, NY: Springer-Verlag.
Collins, J.P. & Gamble, G.D. (1996). A multi-format interdisciplinary ﬁnal examination. Medical
Education 30: 259–265.
Colliver, J.A. & Williams, R.G. (1993). Technical issues: Test application. Academic Medicine 68:
Colliver, J.A., Verhulst, S.J., Williams, R.G. & Norcini, J.J. (1989). Reliability of performance
on standardized patient cases: A comparison of consistency measures based on generalizability theory. Teaching and Learning in Medicine 1: 31–37.
Cronbach, L.J., Gleser, G.C., Nanda, H. & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley and Sons.
Crossley, J., Davies, H., Humphries, G. & Jolly, B. (2002). Generalisability: A key to unlock
professional assessment. Medical Education 36: 972–978.
Downing, S.M. (2003). Validity: On the meaningful interpretation of assessment data. Medical
Education 37: 830–837.
Harden, R., Stevenson, M., Downie, W. & Wilson, G. (1975). Assessment of clinical competence
using objective structured examinations. British Medical Journal 1: 447–451.
Kassam, N. (2003). Some validity evidence of an undergraduate Internal Medicine OSCE.
Masters of Health Professions Education (MHPE) Thesis, University of Illinois at Chicago,
Department of Medical Education, Chicago.
Matsell, D.G., Wolﬁsh, N.M. & Hsu, E. (1991). Reliability and validity of the objective structured clinical examination in pediatrics. Medical Education 25: 293–299.
Messick, S.J. (1989). Validity. In R.L. Linn (ed), Educational Measurement (3rd ed), pp. 13–104.
New York: American Council on Education and Macmillan.
Miller, G.E. (1990). The assessment of clinical skills/competence/performance. Academic Medicine 65: S63–S67.
Newble, D.I. & Swanson, D.B. (1988). Psychometric characteristics of the objective structured
clinical examination. Medical Education 22: 325–334.
Norman, G.R., van der Vleuten, C.P.M. & De Graaff, E. (1991). Pitfalls in the pursuit of objectivity: Issues of validity, efficiency and acceptability. Medical Education 25: 119–126.
Petrusa, E.R., Blackwell, T.A. & Ainsworth, M.A. (1990). Reliability and validity of an objective
structured clinical examination for assessing the clinical performance of residents. Archives of Internal Medicine 150: 573–577.
Regehr, G., MacRae, H., Reznick, R.K. & Szalay, D. (1998). Comparing the psychometric
properties of checklists and global rating scales for assessing performance on an OSCE
format examination. Academic Medicine 73: 993–997.
Van der Vleuten, C. (2000). Validity of ﬁnal examinations in undergraduate medical training.
British Medical Journal 321: 1217–1219.
Van der Vleuten, C.P.M. & Swanson, D.B. (1990). Assessment of clinical skills with standardized
patients: State of the art. Teaching and Learning in Medicine 2: 58–76.
Verhoeven, B.H., Hamers, J., Scherpbier, A.J., Hoogenboom, R.J. & van der Vleuten, C.P. (2000).
The eﬀect on reliability of adding a separate written assessment component to an objective
structured clinical examination. Medical Education 34: 525–529.
Wass, V., van der Vleuten, C., Shatzer, J. & Jones, R. (2001). Assessment of clinical competence.
Lancet 357: 945–949.