Validity and reliability in assessment.



Describes the essential components of reliability and validity of assessment methods, with special emphasis on medical education.

Published in: Education, Technology, Business


  1. Validity and Reliability in Assessment. This work is a summary of previous efforts by great educators. A humble presentation by Dr Tarek Tawfik Amin.
  2. Measurement experts (and many educators) believe that every measurement device should possess certain qualities. The two most common technical concepts in measurement are reliability and validity.
  3. Reliability: definition (consistency). The degree of consistency between two measures of the same thing (Mehrens and Lehman, 1987). The measure of how stable, dependable, trustworthy, and consistent a test is in measuring the same thing each time (Worthen et al., 1993).
  4. Validity: definition (accuracy). Truthfulness: does the test measure what it purports to measure? The extent to which certain inferences can be made from test scores or other measurement (Mehrens and Lehman, 1987). The degree to which tests accomplish the purpose for which they are being used (Worthen et al., 1993).
  5. The term "validity" refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are "well-grounded or justifiable; being at once relevant and meaningful" (Messick S, 1995). The usual concepts of validity:
     - Content: related to objectives and their sampling.
     - Construct: referring to the theory underlying the target.
     - Criterion: related to concrete criteria in the real world; it can be concurrent or predictive.
     - Concurrent: correlating highly with another measure that has already been validated.
     - Predictive: capable of anticipating some later measure.
     - Face: related to the test's overall appearance.
  6. Sources of validity in assessment (old concept).
  7. Sources of validity in assessment (usual concepts of validity).
  8. All assessments in medical education require evidence of validity to be interpreted meaningfully. In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity, but has multiple facets (Downing S, 2003).
  9. Construct (concepts, ideas, and notions)
     - Nearly all assessments in medical education deal with constructs: intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory.
     - Educational achievement is a construct, inferred from performance on assessments: written tests over a domain of knowledge, oral examinations over specific problems or cases in medicine, or OSCEs assessing history-taking or communication skills.
     - Educational ability or aptitude is another example of a construct, one that may be even more intangible and abstract than achievement. (Downing, 2003)
  10. Sources of validity in assessment:
     - Content: do instrument items completely represent the construct?
     - Response process: the relationship between the intended construct and the thought processes of subjects or observers.
     - Internal structure: acceptable reliability and factor structure.
     - Relations to other variables: correlation with scores from another instrument assessing the same construct.
     - Consequences: do scores really make a difference? (Downing 2003; Cook 2007)
  11. Sources of validity in assessment (Downing 2003):
     Content:
     - Examination blueprint
     - Representativeness of test blueprint to achievement domain
     - Test specifications
     - Match of item content to test specifications
     - Representativeness of items to domain
     - Logical/empirical relationship of content tested to domain
     - Quality of test questions
     - Item writer qualifications
     - Sensitivity review
     Response process:
     - Student familiarity with the format
     - Quality control of electronic scanning/scoring
     - Key validation of preliminary scores
     - Accuracy in combining scores from different formats
     - Quality control/accuracy of final scores/marks/grades
     - Subscore/subscale analyses
     - Accuracy of applying pass-fail decision rules to scores
     - Quality control of score reporting
     Internal structure:
     - Item analysis data: item difficulty/discrimination, item/test characteristic curves, inter-item correlations, item-total correlations (point-biserial)
     - Score scale reliability
     - Standard errors of measurement (SEM)
     - Generalizability
     - Item factor analysis
     - Differential Item Functioning (DIF)
     Relationship to other variables:
     - Correlation with other relevant variables (exams)
     - Convergent correlations (internal/external): similar tests
     - Divergent correlations (internal/external): dissimilar measures
     - Test-criterion correlations
     - Generalizability of evidence
     Consequences:
     - Impact of test scores/results on learners and future learning
     - Consequences on students and society
     - Reasonableness of the method of establishing the pass-fail (cut) score
     - Pass-fail consequences: decision reliability/accuracy, conditional standard error of measurement, false positives/negatives
  12. Sources of validity: internal structure. Statistical evidence of the hypothesized relationship between test item scores and the construct:
     1. Reliability (internal consistency): test scale reliability, rater reliability, generalizability.
     2. Item analysis data: item difficulty and discrimination, MCQ option function analysis, inter-item correlations.
     3. Scale factor structure.
     4. Dimensionality studies.
     5. Differential item functioning (DIF) studies.
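The item analysis data mentioned above (difficulty, discrimination) can be computed directly from a scored response matrix. A minimal sketch in Python; the 0/1 response data are hypothetical, and discrimination is estimated here as an uncorrected item-total point-biserial correlation:

```python
# Item analysis for dichotomously scored (0/1) test items.
# Each row is one examinee, each column one item (hypothetical data).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

n_items = len(responses[0])
totals = [sum(row) for row in responses]  # each examinee's total score

def difficulty(item):
    """Proportion of examinees answering the item correctly (p-value)."""
    scores = [row[item] for row in responses]
    return sum(scores) / len(scores)

def point_biserial(item):
    """Correlation between the item score and the total score (discrimination)."""
    item_scores = [row[item] for row in responses]
    n = len(item_scores)
    mi, mt = sum(item_scores) / n, sum(totals) / n
    cov = sum((x - mi) * (y - mt) for x, y in zip(item_scores, totals)) / n
    sd_i = (sum((x - mi) ** 2 for x in item_scores) / n) ** 0.5
    sd_t = (sum((y - mt) ** 2 for y in totals) / n) ** 0.5
    return cov / (sd_i * sd_t) if sd_i and sd_t else 0.0

for i in range(n_items):
    print(f"item {i}: difficulty={difficulty(i):.2f}, discrimination={point_biserial(i):.2f}")
```

In practice, very easy items (difficulty near 1.0) and items with low or negative discrimination are flagged for review.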
  13. Sources of validity: relationship to other variables. Statistical evidence of the hypothesized relationship between test scores and the construct:
     - Criterion-related validity studies.
     - Correlations between test scores/subscores and other measures.
     - Convergent-divergent studies.
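Criterion-related evidence of this kind usually comes down to a correlation coefficient between the new assessment and an established measure. A minimal Python sketch of Pearson's r; both score lists are hypothetical:

```python
def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical: scores on a new OSCE vs. an already-validated exam
osce = [62, 75, 81, 58, 90, 70]
criterion = [60, 72, 85, 55, 88, 74]
print(round(pearson_r(osce, criterion), 3))
```

A high correlation with a similar measure is convergent evidence; a low correlation with a dissimilar measure is divergent evidence.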
  14. Keys of reliability assessment:
     - Stability: related to consistency over time.
     - Internal: related to the instruments.
     - Inter-rater: related to the examiners' criteria.
     - Intra-rater: related to a single examiner's criterion.
     Validity and reliability are closely related. A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable and not necessarily valid.
  16. Sources of reliability in assessment: internal consistency. Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.) We would expect high correlation between scores on items measuring a single construct. Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument; and because instrument halves can be considered "alternate forms," it can also be viewed as an estimate of parallel-forms reliability. Common measures:
     - Split-half reliability: the correlation between scores on the first and second halves of a given instrument. Rarely used, because the "effective" instrument is only half as long as the actual instrument; the Spearman-Brown formula can adjust for this.
     - Kuder-Richardson 20: a similar concept to split-half, but it accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
     - Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
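Cronbach's alpha (which reduces to Kuder-Richardson 20 for dichotomous items) is computed from the item variances and the total-score variance. A minimal sketch using hypothetical 0/1 data:

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(responses):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
    For dichotomous (0/1) items this equals Kuder-Richardson 20."""
    k = len(responses[0])                 # number of items
    items = list(zip(*responses))         # transpose: one tuple of scores per item
    item_var = sum(variance(list(col)) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical 0/1 scores: 5 examinees x 4 items
data = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 0]]
print(round(cronbach_alpha(data), 3))   # → 0.79
```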
  17. Sources of reliability in assessment:
     - Temporal stability: does the instrument produce similar results when administered a second time? Measured by test-retest reliability: administer the instrument to the same person at different times; usually quantified using correlation (e.g., Pearson's r).
     - Parallel forms: do different versions of the "same" instrument produce similar results? Measured by alternate-forms reliability: administer different versions of the instrument to the same individual at the same or different times; usually quantified using correlation (e.g., Pearson's r).
     - Agreement (inter-rater reliability): when using raters, does it matter who does the rating? Is one rater's score similar to another's? Measures include:
       - Percent agreement: the proportion of identical responses; does not account for agreement that would occur by chance.
       - Phi: a simple correlation; does not account for chance.
       - Kappa: agreement corrected for chance.
       - Kendall's tau: agreement on ranked data.
       - Intraclass correlation coefficient: uses ANOVA to estimate how well ratings from different raters coincide.
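The difference between raw percent agreement and chance-corrected kappa can be seen in a few lines. A sketch with two hypothetical raters giving dichotomous pass/fail ratings:

```python
def percent_agreement(r1, r2):
    """Proportion of cases where the two raters gave identical ratings."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(r1)
    po = percent_agreement(r1, r2)   # observed agreement
    cats = set(r1) | set(r2)
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)  # chance agreement
    return (po - pe) / (1 - pe)

rater1 = [1, 1, 0, 1, 0, 1, 1, 0]   # hypothetical pass(1)/fail(0) ratings
rater2 = [1, 1, 0, 0, 0, 1, 1, 1]
print(percent_agreement(rater1, rater2))          # → 0.75
print(round(cohens_kappa(rater1, rater2), 3))     # → 0.467
```

Here agreement looks high (75%), but kappa is only moderate once chance agreement is removed, which is exactly the caveat noted against percent agreement.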
  18. Sources of reliability in assessment: generalizability theory. How much of the error in measurement is the result of each factor (e.g., item, item grouping, subject, rater, day of administration) involved in the measurement process? Measured by the generalizability coefficient, a complex model that allows estimation of multiple sources of error. As the name implies, this elegant method is "generalizable" to virtually any setting in which reliability is assessed; for example, it can determine the relative contributions of internal consistency and inter-rater reliability to the overall reliability of a given instrument.
     *"Items" are the individual questions on the instrument. The "construct" is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area. The Spearman-Brown "prophecy" formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased). (Cook and Beckman, Validity and Reliability of Psychometric Instruments, 2007)
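The Spearman-Brown "prophecy" formula mentioned in the footnote is simple to apply: for current reliability r and a length factor n, the predicted reliability is nr / (1 + (n - 1)r). A sketch; the example reliabilities are hypothetical:

```python
def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by factor n,
    given current reliability r. Use n < 1 to model shortening the test."""
    return (n * r) / (1 + (n - 1) * r)

# Hypothetical: a 20-item quiz with reliability 0.60
print(round(spearman_brown(0.60, 2), 3))    # doubled to 40 items → 0.75
print(round(spearman_brown(0.60, 0.5), 3))  # halved to 10 items → 0.429
```

This is why split-half correlations understate reliability (each half is only half as long) and why adding items is a standard way to raise reliability.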
  19. Keys of reliability assessment.
  20. Keys of reliability assessment. Different types of assessments require different kinds of reliability:
     - Written MCQs: scale reliability, internal consistency.
     - Oral exams: rater reliability, generalizability theory.
     - Written essays: rater reliability, inter-rater agreement, generalizability theory.
     - Observational assessments: inter-rater agreement, generalizability theory.
     - Performance exams (OSCEs): rater reliability, generalizability theory.
  21. Keys of reliability assessment. How high should reliability be?
     - Very high stakes: >0.90 (licensure tests).
     - Moderate stakes: at least ~0.75 (OSCE).
     - Lower stakes: >0.60 (quiz).
  22. Keys of reliability assessment. How to increase reliability?
     For written tests:
     - Use objectively scored formats.
     - Use at least 35-40 MCQs.
     - Use MCQs that differentiate between high- and low-performing students.
     For performance exams:
     - Use at least 7-12 cases.
     - Use well-trained standardized patients (SPs).
     - Apply monitoring and quality control.
     For observational exams:
     - Use many independent raters (7-11).
     - Use standard checklists/rating scales.
     - Ensure timely ratings.
  23. Conclusion. Validity = meaning:
     - Evidence to aid the interpretation of assessment data.
     - The higher the test stakes, the more evidence is needed.
     - Multiple sources or methods.
     - Ongoing research studies.
     Reliability:
     - Consistency of the measurement.
     - One aspect of validity evidence.
     - Higher reliability is always better than lower.
  24. References
     - National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Produced by the Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at:
     - Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
     - Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688-695.
     - Merriam-Webster Online. Available at:
     - Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
     - Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
     - Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
     - Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
     - Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA. 2003;290:953-958.
  25. References (continued)
     - Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med. 2004;141:1-8.
     - Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
     - Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess. 1995;7:248-260.
     - American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
     - Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ. 2002;324:606-607.
     - Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830-837.
     - 2005 Certification Examination in Internal Medicine Information Booklet. Produced by the American Board of Internal Medicine. Available at: pdf.
     - Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527-535.
     - Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741-749.
     - Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319-342.
     - American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
     - Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ. 2004;38:327-333.
     - Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess. 1995;7:238-247.
     - Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
     - Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006-1012.
     - Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol
  26. Resources
     - For an excellent resource on item analysis: report/itemanalysis.php
     - For a more extensive list of item-writing tips: %20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf; c_tips.pdf
     - For a discussion about writing higher-level multiple choice items: odford.pdf