Even when a test is constructed on the basis of a specific criterion, it may ultimately be judged to have greater construct validity than the criterion. We start with a vague concept which we associate with certain observations. We then discover empirically that these observations covary with some other observation which possesses greater reliability, or is more intimately correlated with relevant experimental changes, than is the original measure.
1. Development of health measurement scales
"If you cannot express in numbers something that you are describing, you probably have little knowledge about it."
2. Reliability and Validity
Dr. Priyamadhaba Behera, Junior Resident, AIIMS
15/03/2013
3. Why you need to worry about reliability and validity
• What happens with low reliability and validity?
• What is the relationship between reliability and validity?
• Do you need validity always? Or reliability always? Or both?
• What is the minimum reliability that is needed for a scale?
4. Validity & Reliability
No matter how well the objectives are written, or how clever the items, the quality and usefulness of an examination is predicated on validity and reliability.
5. Validity & Reliability
We don't say "an exam is valid and reliable".
We do say "the exam score is reliable and valid for a specified purpose" – KEY ELEMENT!
6. Reliability vs. Validity
7. Validity
• Two steps to determine the usefulness of a scale
– Reliability – necessary but not sufficient
– Validity – the next step
• Validity – is the test measuring what it is meant to measure?
• Two important issues
– The nature of what is being measured
– The relationship of that variable to its purported cause
• Serum creatinine is a measure of kidney function because we know it is regulated by the kidneys
• But will students who do volunteer work become better doctors?
• Since our understanding of human behaviour is far from perfect, such predictions have to be validated against actual performance
8. Types of validity
• Three Cs (conventionally)
– Content
– Criterion
• Concurrent
• Predictive
– Construct
• Convergent, discriminant, trait, etc.
• Others (face validity)
All types of validity address the same issue: the degree of confidence we can place in the inferences we can draw from the scales.
9. Differing perspectives
• Previously, validity was seen as demonstrating the properties of the scale
• Current thinking – what inferences can be made about the people who have given rise to the scores on these scales?
– Thus validation is a process of hypothesis testing (someone with a given score on test A will do worse on test B, and will differ from people who do better on tests C and D)
– Researchers are limited only by their imagination in devising experiments to test such hypotheses
10. Validity & Reliability
11. • Face validity
– On the face of it, the tool appears to be measuring what it is supposed to measure
– A subjective judgment by one or more experts, rarely by any empirical means
• Content validity
– Measures whether the tool includes all relevant domains or not
– Closely related to face validity
– a.k.a. 'validity by assumption' – because an expert says so
• There are certain situations where these may not be desired
12. Content validity
• Example – a cardiology exam:
– Assume it contains all aspects of the circulatory system (physiology, anatomy, pathology, pharmacology, etc.)
– If a person scores high on this test, we can 'infer' that he knows much about the subject (i.e., our inferences about the person will be right across various situations)
– In contrast, if the exam did not contain anything about circulation, the inferences we make about a high scorer may be wrong most of the time, and vice versa
13. • Generally, a measure that includes a more representative sample of the target behaviour will have more content validity and hence lead to more accurate inferences
• Reliability places an upper limit on validity (the maximum validity is the square root of the reliability coefficient); the higher the reliability, the higher the maximum possible validity
– One exception is the trade-off between internal consistency and validity (it is better to sacrifice internal consistency for content validity)
– The ultimate aim of a scale is inferential, which depends more on content validity than on internal consistency
14. Criterion validity
• Correlation of a scale with an accepted 'gold standard'
• Two types
– Concurrent (both the new scale and the standard scale are given at the same time)
– Predictive – the gold-standard results will be available some time in the future (e.g., an entrance test for college admission, to assess whether a person will graduate or not)
• Why develop a new scale when we already have a criterion scale?
– Diagnostic utility/substitutability (the criterion may be expensive, invasive, dangerous, or time-consuming)
– Predictive utility (no decision can be made on the basis of the new scale alone)
• Criterion contamination
– If the result of the gold standard is in part determined in some way by the results of the new test, it may lead to an artificially high correlation
15. Construct validity
• Height, weight – readily observable
• Psychological attributes – anxiety, pain, intelligence – are abstract variables and can't be directly observed
• For example, anxiety – we say that a person has anxiety if he has sweaty palms, tachycardia, pacing back and forth, difficulty in concentrating, etc. (i.e., we hypothesize that these symptoms are the result of anxiety)
• Such proposed underlying factors are called hypothetical constructs, or constructs (e.g., anxiety, illness behaviour)
• Such constructs arise from larger theories or clinical observations
• Most psychological instruments tap some aspect of a construct
16. Establishing construct validity
• IBS is a construct rather than a disease – it is a diagnosis of exclusion
• A large vocabulary, wide knowledge and problem-solving skills – what is the underlying construct?
• Many clinical syndromes are constructs rather than actual entities (schizophrenia, SLE)
17. • Initial scales for IBS – ruling out other organic diseases, plus some physical signs and symptoms
– These scales were inadequate because they led to many missed and wrong diagnoses
– New scales were developed incorporating demographic features and personality features
• Now, how to assess the validity of this new scale?
– Based on theory, high scorers on this scale should have:
• Symptoms which will not clear with conventional therapy
• A lower prevalence of organic bowel disease on autopsy
18. Differences from other types
1. Content and criterion validity can be established in one or two studies, but there is no single experiment that can prove a construct
• Construct validation is an ongoing process – learning more about the construct, making new predictions and then testing them
• Each supportive study strengthens the construct, but one well-designed negative study can question the entire construct
2. We are assessing the theory as well as the measure at the same time
19. IBS example
• We had predicted that IBS patients will not respond to conventional therapy
• Assume that we gave the test to a sample of patients with GI symptoms and treated them with conventional therapy
• If high-scoring patients responded in the same proportion as low scorers, then there are 3 possibilities:
– Our scale is good but the theory is wrong
– Our theory is good but the scale is bad
– Both the scale and the theory are bad
• We can identify the reason only from further studies
20. • If an experimental design is used to test the construct, then in addition to the above possibilities, our experiment may be flawed
• Ultimately, construct validity doesn't differ conceptually from other types of validity
– "All validity is at its base some form of construct validity… it is the basic meaning of validity" – (Guion)
22. Extreme groups
• Two groups – as decided by clinicians
– One IBS and the other some other GI disease
– Equivocal diagnoses eliminated
• Two problems
– That we are able to separate two extreme groups implies that we already have a tool which meets our needs (however, we can do bootstrapping)
– This is not sufficient; the real use of a scale is making much finer discriminations. But such studies can be a first step – if the scale fails this, it will probably be useless in practical situations
23. Multitrait–multimethod matrix
• Two unrelated traits/constructs, each measured by two different methods
• E.g., two traits – anxiety, intelligence; two methods – a rater, an exam
– Diagonal – reliabilities of the four instruments (should be highest)
– Same trait, different method – convergent validity
– Different trait, same method – divergent validity
– Different trait, different method – should be lowest
• A very powerful method, but it is very difficult to get such a combination

                        Anxiety           Intelligence
                        Rater    Exam     Rater    Exam
Anxiety       Rater     0.53
              Exam      0.42     0.79
Intelligence  Rater     0.18     0.17     0.58
              Exam      0.15     0.23     0.49     0.88
24. • Convergent validity – if there are two measures for the same construct, then they should correlate with each other, but should not correlate too highly. E.g., an index of anxiety and an ANS awareness index
• Divergent validity – the measure should not correlate with a measure of a different construct, e.g., an anxiety index and an intelligence index
25. Biases in validity assessment
• Restriction in range
– May be in the new scale (MAO level)
– May be in the criterion (depression score)
– A third variable correlated with both (severity)
• E.g., a high correlation was found between MAO levels and depression scores in a community-based study, but on replicating the study in hospital, the correlation was low
26. Validity & Reliability
Content/Action + Error
– Content/Action – the information we seek and our best hope for obtaining it
– Error – our human frailty and inability to write effective questions
27. The maximum validity of a test is the square root of its reliability coefficient. Reliability places an upper limit on validity, so the higher the reliability, the higher the maximum possible validity.
28. Variance = Σ (individual value − mean value)² / (number of values)
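The variance formula above translates directly into a few lines of code. This is a minimal sketch (population variance, dividing by the number of values as on the slide); the function name is illustrative.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Scores 2, 4, 6: mean is 4, squared deviations are 4 + 0 + 4,
# so the variance is 8/3.
print(variance([2, 4, 6]))
```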
29. Reliability
• Whether our tool is measuring the attribute in a reproducible fashion or not
• A way to show the amount of error (random and systematic) in any measurement
• Sources of error – observers, instruments, instability of the attribute
• Day-to-day encounters
– Weighing machine, watch, thermometer
30. Assessing Reliability
• Internal consistency
– The average correlation among all the items in the tool
• Item–total correlation
• Split-half reliability
• Kuder–Richardson 20 & Cronbach's alpha
• Multifactor inventories
• Stability
– Reproducibility of a measure on different occasions
• Inter-observer reliability
• Test–retest reliability (intra-observer reliability)
31. Internal consistency
• All items in a scale tap different aspects of the same attribute, not different traits
• Items should be moderately correlated with each other, and each item with the total
• Two schools of thought
– If the aim is to describe a trait/behaviour/disorder
– If the aim is to discriminate people with the trait from those without
• The trend is towards scales that are more internally consistent
• Internal consistency doesn't apply to multidimensional scales
32. Item–total correlation
• The oldest method, still used
• Correlation of each item with the total score computed without that item
• For k items, we have to calculate k correlations – laborious
• An item should be discarded if r < 0.20 (Kline, 1986)
• Best is Pearson's r; for dichotomous items, the point-biserial correlation
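The corrected item–total procedure described above can be sketched as follows. This is a minimal illustration with hypothetical data and function names; each item is correlated with the total of the *remaining* items, and Kline's r < 0.20 cut-off flags candidates for removal.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def corrected_item_total(items):
    """For each item, correlate it with the total of the other items.

    items: k lists of item scores, one per item, all over the same respondents.
    """
    n = len(items[0])
    totals = [sum(item[p] for item in items) for p in range(n)]
    return [pearson_r(item, [totals[p] - item[p] for p in range(n)])
            for item in items]

# Hypothetical 3-item scale, 5 respondents; the third item runs against the rest
items = [[1, 2, 3, 4, 5],
         [2, 2, 3, 5, 4],
         [5, 1, 4, 2, 3]]
flagged = [r < 0.20 for r in corrected_item_total(items)]  # Kline's cut-off
```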
33. Split-half reliability
• Divide the items into two halves and calculate the correlation between them
• Underestimates the true reliability, because we are reducing the length of the scale by half (r is directly related to the number of items)
– Corrected by the Spearman–Brown formula
• Should not be used with chained items
• Difficulties
– Many possible ways to divide a test
– Doesn't point out which item is contributing to poor reliability
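A common way to implement this is an odd–even split followed by the Spearman–Brown step-up; the sketch below assumes that convention (the slide does not prescribe a particular split, and the function names are illustrative).

```python
def pearson_r(x, y):
    """Pearson correlation between two lists of half-test totals."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def split_half_reliability(items):
    """Odd-even split; Spearman-Brown corrects for the halved scale length.

    items: k lists of item scores over the same respondents.
    """
    n = len(items[0])
    half1 = [sum(item[p] for item in items[0::2]) for p in range(n)]
    half2 = [sum(item[p] for item in items[1::2]) for p in range(n)]
    r_half = pearson_r(half1, half2)      # reliability of a half-length scale
    return 2 * r_half / (1 + r_half)      # stepped up to the full length

# Four perfectly parallel items -> halves correlate 1.0, reliability 1.0
r = split_half_reliability([[1, 2, 3, 4]] * 4)
```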
34. KR-20/Cronbach's alpha
• KR-20 for dichotomous responses
• Cronbach's alpha for more than two responses
• They give the average of all possible split-half reliabilities of a scale
• If removing an item increases the coefficient, it should be discarded
• Problems
– Depends on the number of items
– A scale with two different sub-scales will probably still yield a high alpha
– A very high alpha denotes redundancy (asking the same question in slightly different ways)
– Thus alpha should be more than 0.70 but not more than 0.90
35. • Cronbach's basic equation for alpha

α = (n / (n − 1)) × (1 − ΣVi / Vtest)

– n = number of questions
– Vi = variance of scores on each question
– Vtest = total variance of overall scores on the entire test
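The equation above can be computed directly from the item-score lists. A minimal sketch (population variances; the data and function names are illustrative):

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """alpha = n/(n-1) * (1 - sum(Vi) / Vtest), as in the slide's equation.

    items: n lists of question scores over the same respondents.
    """
    n = len(items)
    respondents = len(items[0])
    totals = [sum(item[p] for item in items) for p in range(respondents)]
    v_items = sum(variance(item) for item in items)   # sum of Vi
    v_test = variance(totals)                         # Vtest
    return n / (n - 1) * (1 - v_items / v_test)

# Two perfectly parallel questions -> alpha = 1.0
a = cronbach_alpha([[1, 2, 3], [1, 2, 3]])
```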
36. Multifactor inventories
• More sophisticated techniques
• Item–total procedure – each item should correlate with the total of its scale and the total of all the scales
• Factor analysis
– Determining the underlying factors
– E.g., if there are five tests
• Vocabulary, fluency, phonetics, reasoning and arithmetic
• We can theorize that the first three would be correlated under a factor called a 'verbal factor' and the last two under a 'logic factor'
37. Stability/measuring error
• A weighing machine shows weight in the range of, say, 40–80 kg, and thus an error of ±1 kg is meaningful
• In reality we calculate the ratio
– variability between subjects / total variability
– (total variability includes subject variability and measurement error)
• So that a ratio of
– 1 indicates no measurement error/perfect reliability
– 0 indicates otherwise
38. • Reliability = subject variability / (subject variability + measurement error)
• Statistically, 'variance' is the measure of variability, so:
• Reliability = SD²subjects / (SD²subjects + SD²error)
• Thus reliability is the proportion of the total variance that is due to the 'true' differences between the subjects
• Reliability has meaning only when applied to specific populations
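The variance-ratio definition above also shows why reliability is population-specific: the same instrument error looks better in a more spread-out population. A small sketch with hypothetical SD values:

```python
def reliability(sd_subjects, sd_error):
    """Proportion of total variance due to true between-subject differences."""
    v_subj, v_err = sd_subjects ** 2, sd_error ** 2
    return v_subj / (v_subj + v_err)

# The same instrument (error SD = 2) applied to two populations:
r_heterogeneous = reliability(10, 2)   # spread-out subjects: 100/104 ≈ 0.96
r_homogeneous = reliability(3, 2)      # similar subjects:      9/13  ≈ 0.69
```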
39. Calculation of reliability
• The statistical technique used is ANOVA, and since we have repeated measurements in reliability, the method is
– repeated-measures ANOVA
41. • The classical definition of reliability
• The interpretation is that 88% of the variance is due to the true variance among patients (a.k.a. the intraclass correlation coefficient, ICC)
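To make the ANOVA-based computation concrete, here is a minimal one-way random-effects ICC sketch built from between- and within-subject mean squares. The slides use repeated-measures ANOVA with observers as a factor; the simpler one-way version below (hypothetical data and function name) still shows how the variance components yield the coefficient.

```python
def icc_oneway(scores):
    """One-way random-effects ICC: (BMS - WMS) / (BMS + (k - 1) * WMS).

    scores: one list per subject, each holding k repeated measurements.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    means = [sum(row) / k for row in scores]
    # Between-subjects and within-subjects mean squares
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum((x - m) ** 2
              for row, m in zip(scores, means) for x in row) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

# Identical repeat measurements -> all variance is between subjects, ICC = 1
perfect = icc_oneway([[1, 1], [2, 2], [3, 3]])
# Add measurement noise and the ICC drops below 1
noisy = icc_oneway([[1, 2], [2, 3], [5, 4]])
```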
42. Fixed/random factors
• What happened to the variance due to the observers?
• Are these same observers going to be used, or are they a random sample?
• Another situation where observations may be treated as fixed is subjects answering the same items on a scale
43. Other types of reliability
• We have only examined the effect of different observers on the same behaviour
• But there can be error due to day-to-day differences; if we measure the same behaviour a week or two apart, we can calculate the 'intra-observer reliability coefficient'
• If there are no observers (self-rated tests), we can still calculate 'test–retest reliability'
44. • Usually high inter-observer reliability is sufficient, but if it is low then we may have to calculate intra-observer reliability to determine the source of unreliability
• Mostly, measures of internal consistency are reported as 'reliability', because they are easily computed in a single sitting
– Hence caution is required, as they may not capture variability due to day-to-day differences
45. Different forms of reliability coefficient
• So far we have seen forms of the ICC
• Others
– Pearson product–moment correlation
– Cohen's kappa
– Bland–Altman analysis
46. Pearson's correlation
• Based on regression – the extent to which the relation between two variables can be described by a straight line
47. Limitations of Pearson's r
• A perfect fit of 1.0 may be obtained even if the intercept is non-zero and the slope is not equal to one, unlike with the ICC
• So Pearson's r will be higher than the truth, but in practice it is usually close to the ICC, as the predominant source of error is random variation
• If there are multiple observers, multiple pairwise rs are required, unlike the single ICC
• E.g., with 10 observers there will be 45 Pearson's rs, whereas only one ICC
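The first limitation can be demonstrated numerically: a constant offset between two observers leaves Pearson's r at exactly 1.0, while an agreement-based ICC is penalized. A small sketch with hypothetical ratings (the one-way ICC form is used here for simplicity):

```python
def pearson_r(x, y):
    """Pearson correlation - blind to intercept and slope differences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / ((sum((a - mx) ** 2 for a in x)
                   * sum((b - my) ** 2 for b in y)) ** 0.5)

def icc_oneway(scores):
    """One-way ICC: treats any disagreement, including offsets, as error."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    means = [sum(row) / k for row in scores]
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum((x - m) ** 2
              for row, m in zip(scores, means) for x in row) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

a = [10, 20, 30, 40]     # observer A
b = [15, 25, 35, 45]     # observer B reads 5 units high on every subject
r = pearson_r(a, b)                   # 1.0 - the constant offset is invisible
icc = icc_oneway(list(zip(a, b)))     # < 1 - the disagreement is penalized
```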
48. Kappa coefficient
• Used when responses are dichotomous/categorical
• When the frequency of positive results is very low or very high, chance agreement is high, and kappa can be misleadingly low despite high raw agreement
• Weighted kappa focuses on disagreement; cells are weighted according to their distance from the diagonal of agreement
• Weighting can be arbitrary, or can use quadratic weights (based on the square of the amount of discrepancy)
• The quadratic scheme of weighted kappa is equivalent to the ICC
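The chance-correction at the heart of kappa is easy to compute from an agreement table. This minimal sketch (hypothetical counts and function name) also shows the prevalence effect: with 94% of subjects negative, raw agreement of 92% shrinks to a kappa near 0.29 because chance agreement is already about 89%.

```python
def cohens_kappa(table):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement).

    table[i][j]: number of subjects rated category i by rater 1, j by rater 2.
    """
    total = sum(sum(row) for row in table)
    k = len(table)
    p_obs = sum(table[i][i] for i in range(k)) / total
    row_p = [sum(table[i]) / total for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    p_chance = sum(r * c for r, c in zip(row_p, col_p))
    return (p_obs - p_chance) / (1 - p_chance)

# Rare positives: 92% raw agreement, but kappa is only about 0.29
kappa = cohens_kappa([[90, 4],
                      [4, 2]])
```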
49. Kappa coefficient
50. Bland and Altman method
• A plot of the difference between two observations against the mean of the two observations
51. • Agreement is expressed as the 'limits of agreement'. The presentation of the 95% limits of agreement allows visual judgement of how well two methods of measurement agree. The smaller the range between these two limits, the better the agreement.
• How small is small depends on the clinical context: would a difference between measurement methods as extreme as that described by the 95% limits of agreement meaningfully affect the interpretation of the results?
• Limitation – the onus is placed on the reader to juxtapose the calculated error against some implicit notion of true variability
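The limits of agreement described above are just the mean difference plus or minus 1.96 standard deviations of the differences. A minimal sketch with hypothetical paired measurements:

```python
def limits_of_agreement(m1, m2):
    """Bland-Altman 95% limits: mean difference +/- 1.96 SD of the differences."""
    diffs = [a - b for a, b in zip(m1, m2)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = (sum((d - bias) ** 2 for d in diffs) / (n - 1)) ** 0.5  # sample SD
    return bias - 1.96 * sd, bias, bias + 1.96 * sd

# Hypothetical paired readings from two measurement methods
method_a = [100, 102, 98, 105, 101]
method_b = [101, 100, 99, 103, 100]
low, bias, high = limits_of_agreement(method_a, method_b)
# The plot would show each (pair mean, pair difference) point between the limits
```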
52. Standards for the magnitude of reliability coefficients
• How much reliability is good? Kelly (0.94), Stewart (0.85)
• A test for individual judgment should have higher reliability than one for research on groups
• Research purposes
– The mean score and the sample size will reduce the error
– Conclusions are usually made after a series of studies
– Acceptable reliability depends on the sample size in research (a lower reliability may be acceptable in a sample of 1000 than in a sample of 10)
53. Reliability and probability of misclassification
• Depends on the property of the instrument and the decision of the cut point
• Relation between reliability and the likelihood of misclassification
– E.g., in a sample of 100, one person ranked 25th and another 50th
– If R is 0, there is a 50% chance that the two will reverse order on retesting
– If R is 0.5, a 37% chance; with R = 0.8, a 2.2% chance
• Hence an R of 0.75 is the minimum requirement for a useful instrument
54. Improving reliability
• Increase the subject variance relative to the error variance (by legitimate means and otherwise)
• Reducing error variance
– Observer/rater training
– Removing consistently extreme observers
– Designing better scales
• Increasing true variance
– In case of a 'floor' or 'ceiling' effect, introduce items that will bring the performance to the middle of the scale (thus increasing true variance)
• E.g., fair – good – very good – excellent
55. • Ways that are not legitimate
– Testing the scale in a heterogeneous population (normals and bedridden arthritics)
– A scale developed in a homogeneous population will show a larger reliability when used in a heterogeneous population
• Correct for attenuation
56. • The simplest way to increase R is to increase the number of items (statistical theory)
• True variance increases as the square of the number of items, whereas error variance increases only as the number of items
• If the length of the test is tripled
– R(Spearman–Brown) = 3R / (1 + 2R)
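The tripling formula above is the k = 3 case of the general Spearman–Brown prophecy formula, kR / (1 + (k − 1)R), which can also be inverted to ask how much longer a test must be to reach a target reliability. A minimal sketch (function names are illustrative):

```python
def spearman_brown(r, factor):
    """Predicted reliability when the number of items is multiplied by factor."""
    return factor * r / (1 + (factor - 1) * r)

def length_factor_needed(r, target):
    """Inverse: the length multiplier required to reach a target reliability."""
    return target * (1 - r) / (r * (1 - target))

tripled = spearman_brown(0.6, 3)          # 3R / (1 + 2R) = 1.8 / 2.2 ≈ 0.82
factor = length_factor_needed(0.6, 0.9)   # a 6x longer test for R = 0.9
```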
57. • In reality the equation overestimates the new reliability
• We can also use this equation to determine the length of a test needed to achieve a pre-decided reliability
• To improve test–retest reliability – shorten the interval between the tests
• An ideal approach is to examine all the sources of variation and try to reduce the larger ones (generalizability theory)
58. Summary for reliability
• Pearson's r is theoretically incorrect, but in practice fairly close
• The Bland and Altman method is analogous to the error variance of the ICC, but doesn't relate this to the range of observations
• Quadratically weighted kappa and the ICC are identical, and these are the most appropriate