Osce item-analysis-to-improve-reliability-for-internal-medicine-auewarakul-downing-praditswuan-jaturatamrong
Published in: Education
Transcript

  • 1. Advances in Health Sciences Education (2005) 10:105–113
DOI 10.1007/s10459-005-2315-3
© Springer 2005

Item Analysis to Improve Reliability for an Internal Medicine Undergraduate OSCE

CHIRAYU AUEWARAKUL (1,2,4,*), STEVEN M. DOWNING (3), RUNGNIRAND PRADITSUWAN (1), and UAPONG JATURATAMRONG (2)

(1) Department of Medicine, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
(2) Office of Medical Education, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
(3) Department of Medical Education, University of Illinois at Chicago, Chicago, USA
(4) Director of Medical Education Research Unit, Office of Medical Education, Faculty of Medicine Siriraj Hospital, 2 Prannok Road, Bangkoknoi, Bangkok, 10700, Thailand
(*) Corresponding author: Phone: 662-419-7000 ext. 4448; Fax: 662-418-1602; E-mail: chirayuaue@yahoo.com; sicaw@mahidol.ac.th

Received 13 January 2004; accepted 16 February 2005

Abstract. Utilization of objective structured clinical examinations (OSCEs) for final assessment of medical students in Internal Medicine requires a representative sample of OSCE stations. The reliability and generalizability of OSCE scores provide validity evidence for those scores and support their contribution to the final clinical grade of medical students. The objective of this study was to perform item analysis using OSCE stations as the unit of analysis and to evaluate the extent to which OSCE score reliability can be improved using item analysis data. OSCE scores from eight cohorts of fourth-year medical students (n = 435) in a 6-year undergraduate program were analyzed. Generalizability (G) coefficients of OSCE scores were computed for each cohort. Item analysis was performed by considering each OSCE station as an item and computing the corrected item-total correlation. OSCE stations that negatively impacted the reliability were deleted and the G-coefficient was recalculated.
The G-coefficients of OSCE scores from the eight cohorts ranged from 0.48 to 0.80 (median 0.62). The median number of OSCE stations that negatively impacted the G-coefficient was 3.5 (out of a median of 25 total stations). When the “problem stations” were deleted, the median G-coefficient across the eight cohorts increased from 0.62 to 0.72. In conclusion, item analysis of OSCE stations is useful and should be performed to improve the reliability of total OSCE scores. Problem stations can then be identified and improved.

Key words: clinical competence, generalizability, item analysis, OSCE, performance assessment, reliability, undergraduate medical education

Background

Assessment of the clinical competence of medical students should consist of multiple testing methods (Wass et al., 2001). Traditionally, written tests including short essays and multiple-choice questions are the primary methods used to assess the knowledge of medical students. Medical
  • 2. knowledge alone, however, does not always predict clinical competence of the students (Miller, 1990). The Faculty of Medicine Siriraj Hospital at Mahidol University, the oldest medical school in Thailand, has traditionally relied on summative oral examinations in the form of short cases and long cases at the end of the clinical rotations to assess clinical performance. This type of assessment, however, was shown to be unreliable because the pass/fail decision depends on a single judgment of a student by one faculty member on one or two real patient cases (van der Vleuten, 2000).

The objective structured clinical examination (OSCE) was developed in 1975 as a means of clinical competence assessment (Harden et al., 1975), and this new performance method was introduced in Thailand around 1985. The OSCE has gradually replaced, or been used in conjunction with, the short- and long-case oral examinations and presently contributes meaningfully to the final clinical grade of medical students in several clinical departments at the Faculty of Medicine Siriraj Hospital, especially in the Department of Medicine.

The undergraduate MD curriculum at the Faculty of Medicine Siriraj Hospital is a 6-year program. The clinical years are years 4–6, during which students rotate through all major specialties and electives. Students are required to rotate through Internal Medicine in every clinical year, from year 4 to year 6. At the end of the fourth-year Internal Medicine rotation, students are required to take an OSCE as part of the final evaluation. The final grade depends on a composite score, 25% of which is derived from the OSCE scores. Other components include problem-solving and general medical knowledge multiple-choice questions (MCQ) (25%) and ward, outpatient clinic, and preceptor evaluation (50%).
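The composite weighting described above can be illustrated with a short Python sketch. It assumes that the three components are expressed on a common 0–100 scale and combined linearly, which the paper does not state explicitly; the function name and the example marks are hypothetical.

```python
# Weights as reported in the text: OSCE 25%, MCQ 25%,
# ward/outpatient/preceptor evaluation 50%.
WEIGHTS = {"osce": 0.25, "mcq": 0.25, "ward": 0.50}


def composite_score(osce, mcq, ward):
    """Weighted final composite (assumption: all inputs are percentages)."""
    return (WEIGHTS["osce"] * osce
            + WEIGHTS["mcq"] * mcq
            + WEIGHTS["ward"] * ward)


# Hypothetical student: 60% on the OSCE, 70% on the MCQ, 80% on ward work.
example = composite_score(60.0, 70.0, 80.0)
print(example)
```

Because the OSCE carries a quarter of the composite, a 10-point change in the OSCE score moves the final composite by 2.5 points under this linear assumption.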
Since the OSCE was introduced in the Department of Medicine, its validity evidence and reliability have never been evaluated. Whether it tests what it purports to test is not known. The reliability and generalizability of OSCE scores should provide some validity evidence for OSCE scores and their contribution to the final grade of fourth-year medical students (Crossley et al., 2002; Downing, 2003). A representative sample of OSCE stations needs to be demonstrated in order to justify the continued use of the OSCE for final grade evaluation (Colliver and Williams, 1993; van der Vleuten, 2000).

Objective

The purpose of this study was to evaluate the effect on reliability of removing poorly performing OSCE stations. Although item analysis has routinely been performed for most knowledge-based MCQ examinations, item analysis for undergraduate Internal Medicine OSCEs is less frequently practiced and reported in the literature (Kassam, 2003; Newble and Swanson, 1988). In this study, we undertook a Generalizability (G) study of 8 cohorts
  • 3. of fourth-year Internal Medicine OSCEs, followed by an item analysis using OSCE stations as the unit of analysis.

Subjects and Instruments

Students

In each academic year, 4 cohorts of fourth-year medical students in a 6-year MD program rotate through Internal Medicine wards. At the end of a 9-week rotation, students are required to take OSCEs as part of their final evaluation in Internal Medicine. In this study, data from 8 cohorts of fourth-year medical students in the academic years 2001 and 2002 were collected. The total number of fourth-year medical students studied was 435.

OSCE Structure

The OSCE is conducted during the last week of the students’ 9-week rotation. Junior faculty members in the Department of Medicine are raters. The 20 to 25 four-minute OSCE stations are designed to test history-taking and physical examination skills, as well as interpretation of laboratory tests and procedural skills. Some stations specifically test communication and counseling skills, but in all stations students must demonstrate appropriate skills and attitudes in approaching the patients. Real volunteer patients and simulated patients are routinely used. All patients are instructed to act in a standardized manner by the team of faculty members who developed the questions. During the 2-hour examination, each station has the same rater for every student.

OSCE questions are drawn from a pool of previously used as well as newly developed questions. Faculty members in each discipline of Internal Medicine write questions, which are reviewed by the Internal Medicine Undergraduate Committee. Each OSCE is developed from a blueprint that corresponds to the expected clinical knowledge and performance of fourth-year students. Checklists are used for each station, with a total score computed for each checklist. OSCEs account for 25% of the final composite scores.
Item Analysis of OSCE

OSCE Content

Each OSCE consists of questions from all subspecialties in Internal Medicine. The OSCE questions were designed so that each station assessed the fourth-year medical students through a multidisciplinary approach. A single OSCE station may assess 3–4 subspecialties simultaneously; for example, a station on “history-taking of an elderly patient who presents with weight
  • 4. loss” has items on the checklist that are related to oncology, hematology, endocrine, gastroenterology and socioeconomic problems. On average, each OSCE consists of 6 history-taking stations, 7 physical-examination stations, 11 laboratory-test and procedural-skill stations and 1 counseling station.

G-Studies of OSCE

A G-study was performed for each student cohort using a random model p × i design (Brennan, 2001; Colliver et al., 1989; Cronbach et al., 1972). Item analysis was performed using each OSCE station as an item and computing the corrected item-total correlation for each station. OSCE stations that negatively impacted reliability were deleted and the generalizability coefficient was recalculated. The content and nature of each problem station were also reviewed to determine the possible cause of the problem.

The mean OSCE scores, standard deviations (SD) and G-coefficients of the eight OSCE cohorts are shown in Table I. The median number of stations and students per cohort was 25 and 54.5, respectively. The mean OSCE scores ranged from 58.57% to 61.27% with a median of 60.18%. The G-coefficients ranged from 0.45 to 0.80 with a median of 0.62. Five of the 8 cohorts had G-coefficients over 0.60.

Identification of “Problem Stations”

Table II shows the number of stations per cohort that had a negative impact on the G-coefficients. The median number of “problem stations” per cohort was 3.5 of 25 stations. The recalculated G-coefficients ranged from 0.60 to 0.83 with a median of 0.72, compared to the median G-coefficient of 0.62 before deletion of the problem stations.

Table I.
G-coefficients of Internal Medicine undergraduate OSCE scores

OSCE cohort   Number of stations   Number of students   Mean (%)   SD     G-coefficient
1             25                   52                   59.99      3.07   0.55
2             21                   51                   60.95      6.68   0.80
3             22                   55                   60.36      5.00   0.48
4             25                   54                   58.59      5.8    0.63
5             24                   57                   61.27      4.66   0.45
6             25                   56                   58.57      6.55   0.70
7             25                   56                   61.08      5.28   0.60
8             25                   54                   58.64      6.04   0.68
Median        25                   54.5                 60.18      5.54   0.62
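The G-study and station-level item analysis described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' procedure or software, neither of which is named in the paper. It assumes a fully crossed students × stations score matrix with no missing data, estimates variance components from the two-way ANOVA mean squares, and treats a station as a "problem station" when deleting it raises the G-coefficient (the paper does not spell out its exact deletion rule). All function names and the six-student toy data are hypothetical.

```python
import math


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def g_coefficient(scores):
    """Relative G-coefficient for a crossed p x i design.

    scores: list of rows (students), each a list of station scores.
    G = var_p / (var_p + var_residual / n_i)
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = _mean(v for row in scores for v in row)
    row_means = [_mean(row) for row in scores]
    col_means = [_mean(row[j] for row in scores) for j in range(n_i)]
    ss_p = n_i * sum((m - grand) ** 2 for m in row_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ms_p = ss_p / (n_p - 1)
    ms_res = (ss_total - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))
    var_p = max((ms_p - ms_res) / n_i, 0.0)  # negative estimate set to zero
    denom = var_p + ms_res / n_i
    return var_p / denom if denom > 0 else 0.0


def pearson(x, y):
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(scores):
    """Correlation of each station with the total of the OTHER stations."""
    n_i = len(scores[0])
    result = []
    for j in range(n_i):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]
        result.append(pearson(item, rest))
    return result


def drop_station(scores, j):
    return [[v for k, v in enumerate(row) if k != j] for row in scores]


def problem_stations(scores):
    """Stations whose deletion raises G (assumed criterion)."""
    base = g_coefficient(scores)
    return [j for j in range(len(scores[0]))
            if g_coefficient(drop_station(scores, j)) > base]


# Hypothetical toy data: stations 0-2 rank students consistently,
# station 3 ranks them in reverse order.
scores = [
    [1, 2, 1, 6],
    [2, 2, 3, 5],
    [3, 4, 3, 4],
    [4, 4, 5, 3],
    [5, 6, 5, 2],
    [6, 6, 6, 1],
]
flagged = problem_stations(scores)
print(flagged, g_coefficient(scores), g_coefficient(drop_station(scores, 3)))
```

In this toy matrix only the reversed station (index 3) is flagged, and removing it raises the G-coefficient from zero to above 0.9. Real OSCE data would show far smaller shifts, such as the changes of roughly 0.03 to 0.17 reported in Table II.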
  • 5. Table II. Number of stations deleted and G-coefficient changes in each cohort

Cohort   G-coefficient, before deletions   G-coefficient, after deletions   Number of stations deleted per total stations
1        0.55                              0.64                             6/25
2        0.80                              0.83                             2/21
3        0.48                              0.60                             6/22
4        0.63                              0.73                             4/25
5        0.45                              0.62                             5/24
6        0.70                              0.76                             3/25
7        0.60                              0.70                             3/25
8        0.68                              0.74                             4/25
Median   0.62                              0.72                             3.5/25

Table III. Problem stations categorized according to specific skills tested

Clinical skills                    Number of problem stations in each category
Laboratory and procedural skills   23/33 (70%)
Physical examination skills        6/33 (18.2%)
History-taking skills              3/33 (9.1%)
Counseling skills                  1/33 (3%)

Upon review of each cohort, the problem stations were identified as shown in Table III. Of 33 problem stations, 70% were stations on interpretation of laboratory tests (Table IV). The subspecialty with the most frequent negative impact on generalizability was infectious disease, and the stations frequently found to be problematic were the ones that test the students’ ability to diagnose infectious organisms under the microscope. The G-coefficients of the stations with laboratory questions were lower than those of the non-laboratory stations (median of 0.40 and 0.64, respectively). The students’ mean scores on laboratory stations were also lower than on the non-laboratory stations (median 5.27 vs 6.40, respectively). (The medians were computed across cohorts to simplify the presentation of data.)

Discussion

Since they were developed in 1975, OSCEs have been extensively utilized for various purposes in medical education (van der Vleuten and Swanson, 1990).
  • 6. Table IV. OSCE station content with negative impact on G-coefficients

Cohort   Specialty            Skill assessed*
1        General medicine     Lab (urine exam)
         Critical care        Procedural skill (oxygen administration)
         General medicine     PE (skin exam)
         Rheumatology         Lab (synovial fluid exam)
         Endocrine            PE (weight loss)
         Infectious disease   Lab (malaria)
2        Infectious disease   Lab (malaria)
         General medicine     History-taking (hypertension)
3        Infectious disease   Lab (malaria)
         Pulmonary disease    PE (lung exam)
         General medicine     Lab (sputum exam)
         Hematology           Lab (blood smear)
         Nephrology           Lab (urine exam)
         Infectious disease   Lab (stool parasite)
4        Infectious disease   Lab (malaria)
         Rheumatology         History-taking (arthritis)
         Nephrology           Lab (urine exam)
         Genetics             Lab (genetic disease)
5        Neurology            Lab (spinal fluid exam)
         Geriatrics           History-taking (social problem in the elderly)
         Nephrology           Lab (urine exam)
         Critical care        Procedural skill (central venous pressure evaluation)
         Infectious disease   Lab (stool parasite)
6        Infectious disease   Lab (sputum exam)
         Genetics             Lab (genetic disease)
         Critical care        Lab (blood gases)
7        Cardiology           PE (precordium)
         Endocrine            PE (weight loss)
         Infectious disease   Lab (pus exam)
8        Neurology            PE (visual field testing)
         Infectious disease   Lab (sputum exam)
         Endocrine            Counseling (diabetic patient)
         Hematology           Lab (blood smear)

*Abbreviations: PE (Physical Examination), Lab (Laboratory Investigations).

In the teaching and learning arena, OSCEs have an important role in the clinical learning process by providing exposure of medical students to standardized patients or real patients in various clinically relevant situations designed by the medical school faculties. OSCEs have also been utilized in
  • 7. many medical schools around the world for formative and summative assessment, either to give feedback to students at the end of clinical rotations, as one component of the final composite scores or grades, or as a prerequisite for graduation from medical school (Collins and Gamble, 1996; van der Vleuten, 2000). In order to use OSCEs as the final summative assessment, the reliability and generalizability of OSCEs in each local program should be evaluated (Boulet et al., 2003). Reliability data provide one major source of validity evidence according to Messick’s unitary concept of construct validity (Downing, 2003; Messick, 1989).

In this study of 8 cohorts of fourth-year medical students, the G-coefficients were on average over 0.60, with a range of 0.48–0.80. These results were comparable to other reports of reliability achieved in locally developed OSCEs (A-Latif, 1992; Kassam, 2003; Matsell et al., 1991; Petrusa et al., 1990; Regehr et al., 1998; Verhoeven et al., 2000). Our OSCE blueprint, however, was quite different from those in some western medical schools, since we put a major emphasis not only on data gathering, physical examination and communication skills, but also on the ability of our students to interpret laboratory tests. Approximately 40% of our OSCE stations were thus related to laboratory tests. Since our graduates must work in rural hospitals with limited facilities, where they must be able to supervise the medical technicians or even perform the laboratory tests themselves, laboratory skills are very important. For example, the laboratory tests in our OSCE include blood smears for diagnosis of malaria, dengue hemorrhagic fever, and thalassemia; stool examination for parasites; urine examination for urinary tract infection and nephrotic syndrome; spinal fluid examination for meningitis; and chest x-rays for tuberculosis and pneumonia.
Using item analysis for the Internal Medicine OSCE has revealed important and useful information. When we deleted the problem stations, the G-coefficients improved considerably, so that all 8 cohorts had G-coefficients of 0.60 or greater. This suggested that the problem stations might have tested skills that students did not acquire or were not taught in the clinical rotation. We examined the content of the problem stations and found that the majority of the stations identified by item analysis were related to laboratory skills. The improvement in the G-coefficients from eliminating these stations is likely due to the nature of the skills tested in the laboratory-skills stations, since laboratory skills are obviously different from data-gathering and physical examination skills. The overall lower mean scores on the laboratory stations as compared to the non-laboratory stations could also indicate that the students did not adequately acquire these particular laboratory skills during the Internal Medicine rotation. Students who perform well in communication skills may not do as well in the interpretation of blood smears, urine analysis, and so on. Achieving
  • 8. competence in one area is thus not necessarily a good predictor of competence in another (van der Vleuten, 2000; Wass et al., 2001).

We subsequently computed the G-coefficients for laboratory skills stations alone and found these G-coefficients to be low, with a median of 0.40 (0.04–0.63), suggesting “case specificity” for students’ laboratory skills. The reliability of non-laboratory skills stations was better than that of the laboratory skills stations, as shown by G-coefficients of 0.56–0.87 (median of 0.64). The lower reliability of laboratory stations could be accounted for by the smaller number of laboratory stations (mean of 9 stations) as compared to non-laboratory stations (mean of 15 stations). However, we also observed that the inter-station correlations were very variable among laboratory stations across all cohorts. These results could be due to the fact that the students’ laboratory skills are case-specific and the stations independently tested unique and distinct skills, or due to a problem inherent in the station scoring process that needs to be studied further.

Since students are required to perform and interpret laboratory tests in order to meet the requirements of our department and the National Medical Council Standards, the problem stations cannot be eliminated, because this would reduce the validity evidence for the OSCE. Rather, the stations should be remediated and the curriculum changed to help students learn these skills. Validity should not be sacrificed for the sake of reliability (Downing, 2003; Norman et al., 1991; van der Vleuten, 2000).

In conclusion, this Internal Medicine OSCE has been shown to be reliable and acceptable for local use. Item analysis of OSCE stations is useful and should be performed to improve the reliability of total OSCE scores.
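The station-count explanation raised in the Discussion above (a mean of 9 laboratory stations versus 15 non-laboratory stations) can be examined with the Spearman-Brown prophecy formula. The authors do not report this calculation; the sketch below is an illustration only. It assumes that added laboratory stations would behave like the existing ones (parallel stations) and it applies the formula to the reported median coefficients rather than to any single cohort. The function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))


# Reported medians: 0.40 for laboratory stations (mean of 9 stations),
# 0.64 for non-laboratory stations (mean of 15 stations).
projected_lab = spearman_brown(0.40, 15 / 9)
print(round(projected_lab, 2))
```

Under these assumptions the projected laboratory-station coefficient at 15 stations remains below the non-laboratory median of 0.64, which is in line with the authors' remark that the number of stations may not fully explain the gap.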
Careful structuring of OSCE questions and remediation of OSCE problem stations are crucial to support the continued use of OSCEs for final clinical performance assessments.

Acknowledgements

We would like to thank the Foundation for Advancement in Medical Education Research (FAIMER) for financial support of Dr. Chirayu Auewarakul through the International Fellowship of Medical Education (IFME)-2002 award. The faculty development awards from the Anandamahidol Foundation and the Siriraj Chalermprakiat Fund, Thailand, are also appreciated. We thank Ms. Jaree Prasarnkul, Ms. Pichavadee Sae-ung and Mr. Somkuan Sriyounglek at the Office of Undergraduate Medical Education, Department of Medicine, for their excellent work with OSCE administration, score processing and data collection.
  • 9. References

A-Latif, A. (1992). An examination of the examinations: The reliability of the objective structured clinical examination. Medical Teacher 14: 179–183.
Boulet, J.R., McKinley, D.W., Whelan, G.P. & Hambleton, R.K. (2003). Quality assurance methods for performance-based assessments. Advances in Health Sciences Education 8: 27–47.
Brennan, R.L. (2001). Generalizability Theory. New York, NY: Springer-Verlag.
Collins, J.P. & Gamble, G.D. (1996). A multi-format interdisciplinary final examination. Medical Education 30: 259–265.
Colliver, J.A. & Williams, R.G. (1993). Technical issues: Test application. Academic Medicine 68: 454–458.
Colliver, J.A., Verhulst, S.J., Williams, R.G. & Norcini, J.J. (1989). Reliability of performance on standardized patient cases: A comparison of consistency measures based on generalizability theory. Teaching and Learning in Medicine 1: 31–37.
Cronbach, L.J., Gleser, G.C., Nanda, H. & Rajaratnam, N. (1972). The dependability of behavioral measurements: Generalizability for scores and profiles. New York: John Wiley and Sons.
Crossley, J., Davies, H., Humphries, G. & Jolly, B. (2002). Generalisability: A key to unlock professional assessment. Medical Education 36: 972–978.
Downing, S.M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education 37: 830–837.
Harden, R., Stevenson, M., Downie, W. & Wilson, G. (1975). Assessment of clinical competence using objective structured examinations. British Medical Journal 1: 447–451.
Kassam, N. (2003). Some validity evidence of an undergraduate Internal Medicine OSCE. Masters of Health Professions Education (MHPE) Thesis, University of Illinois at Chicago, Department of Medical Education, Chicago.
Matsell, D.G., Wolfish, N.M. & Hsu, E. (1991). Reliability and validity of the objective structured clinical examination in pediatrics. Medical Education 25: 293–299.
Messick, S.J. (1989). Validity. In R.L.
Linn (ed), Educational Measurement (3rd ed), pp. 13–104. New York: American Council on Education and Macmillan.
Miller, G.E. (1990). The assessment of clinical skills/competence/performance. Academic Medicine 65: S63–S67.
Newble, D.I. & Swanson, D.B. (1988). Psychometric characteristics of the objective structured clinical examination. Medical Education 22: 325–334.
Norman, G.R., Van der Vleuten, C.P.M. & De Graaff, E. (1991). Pitfalls in the pursuit of objectivity: issues of validity, efficiency and acceptability. Medical Education 25: 119–126.
Petrusa, E.R., Blackwell, T.A. & Ainsworth, M.A. (1990). Reliability and validity of an objective structured clinical examination for assessing the clinical performance of residents. Archives of Internal Medicine 150: 573–577.
Regehr, G., MacRae, H., Reznick, R.K. & Szalay, D. (1998). Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE format examination. Academic Medicine 73: 993–997.
Van der Vleuten, C. (2000). Validity of final examinations in undergraduate medical training. British Medical Journal 321: 1217–1219.
Van der Vleuten, C.P.M. & Swanson, D.B. (1990). Assessment of clinical skills with standardized patients: State of the art. Teaching and Learning in Medicine 2: 58–76.
Verhoeven, B.H., Hamers, J., Scherpbier, A.J., Hoogenboom, R.J. & Vleuten, C.P. van der (2000). The effect on reliability of adding a separate written assessment component to an objective structured clinical examination. Medical Education 34: 525–529.
Wass, V., Vleuten, C. van der, Shatzer, J. & Jones, R. (2001). Assessment of clinical competence. Lancet 357: 945–949.
