How does health psychology measure up?

A critical look at measurement in Health Psychology.

Published in: Career, Technology, Business
  1. 1. How does health psychology measure up?<br />A critical look at measurement in health psychology<br />Matthew Hankins<br />16th September 2011<br />
  2. 2. The empirical basis of Health Psychology<br />Why do Health Psychologists collect data?<br />Theory generation, esp. identifying constructs<br />Theory corroboration <br />Measuring outcomes (trials etc.)<br />The value of such activities is therefore critically dependent on the quality of the data <br />2<br />
  3. 3. Questionnaire measures<br />Majority of data collected by Health Psychologists is generated by questionnaire measures (‘scales’)<br />Questionnaires vary in the quality of data that they generate<br />Validity: extent to which the questionnaire measures what is intended<br />Reliability: extent to which variance in data reflects variance in construct measured<br />Index of measurement error <br />3<br />
  4. 4. Pragmatic approach<br />Validity<br />Unidimensionality (factor analysis)<br />Associations between measures<br />Discrimination between known groups<br />Reliability<br />Estimated by Cronbach’s Alpha<br />Or test-retest correlation <br />4<br />
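As an illustration of the reliability estimate named on this slide, here is a minimal sketch of Cronbach's alpha using the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of the summed score). The function name `cronbach_alpha` and the response data are made up for illustration and do not come from the slides.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across respondents
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the summed scale score
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 5 people x 4 items, each scored 0-3
example = [
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 2, 2, 1],
    [2, 2, 3, 2],
    [3, 3, 3, 2],
]
print(round(cronbach_alpha(example), 2))
```

Note that alpha only summarises how strongly the items covary; by itself it says nothing about whether the summed score orders people correctly.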
  5. 5. Scale development<br />Combination of these approaches is derived from ‘Classical Test Theory’ (CTT)<br />Originated with Spearman (1904)<br />Landmark text: Guilford 2nd ed. (1954) <br />Fully developed by Lord & Novick (1968)<br />Further developments: ‘item-response theory’ (IRT)<br />E.g. Rasch model (1960)<br />CTT implicit in most empirical Health Psychology research<br />5<br />
  6. 6. What is a scale?<br />A scale orders people on the construct of interest<br />Both CTT & IRT agree that a person’s position on the dimension can be estimated from the item scores<br />Strength of IRT is that it does not assume that a set of correlated items forms a scale<br />Implicit in CTT: if items load on same factor, we automatically assume that they form a scale<br />6<br />[Figure: Persons A, B, C and D ordered from low to high on the construct]<br />
  7. 7. Scaling problem<br />Whether a set of items forms a scale is a hypothesis (Guttman 1950)<br />Formally tested whether items formed ‘Guttman scales’<br />“In contemporary psychometric practice, it is the rule rather than the exception that two people having the same score on a test will have [endorsed] different items… Such scores are crude empirical devices known to have some predictive efficiency, but they cannot be called measurements in any strict sense” (Loevinger 1948)<br />Additionally, there is no rational basis for adding up a set of ordinal Likert scores unless they have been shown to scale<br />7<br />
  8. 8. Example: PHQ-9<br />Feeling tired + Little interest in doing things + Poor appetite several days in last 2 weeks<br />Scale score = +3<br />Thoughts of hurting yourself in some way nearly every day in last 2 weeks<br />Scale score = +3<br />Are these responses really equivalent?<br />8<br />
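The arithmetic behind this slide can be sketched as follows. The snippet assumes the usual PHQ-9 response coding (0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day); the item labels and the two response patterns are illustrative names, and items not listed are treated as scoring 0.

```python
# Assumed PHQ-9 response coding
RESPONSE_SCORES = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}

def sum_score(responses):
    """Simple summed scale score; unlisted items are treated as 0."""
    return sum(RESPONSE_SCORES[answer] for answer in responses.values())

# Three low-intensity symptoms, each on several days
person_a = {
    "feeling_tired": "several days",
    "little_interest": "several days",
    "poor_appetite": "several days",
}
# One high-intensity symptom, nearly every day
person_b = {"thoughts_of_hurting_yourself": "nearly every day"}

print(sum_score(person_a), sum_score(person_b))
```

Both patterns receive the same summed score even though they describe very different presentations, which is the point the slide is making.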
  9. 9. Implications<br />If a set of items is merely assumed to form a scale, then we cannot be sure that the scale score accurately ranks people on the construct of interest<br />People with different positions may be assigned the same score<br />People with the same position may be assigned different scores<br />Unless we test this hypothesis, assessing reliability & validity is pointless<br />9<br />
  10. 10. Rejecting the hypothesis of a scale<br />Scales are very rarely ‘rejected’ in health psychology<br />Reliability is usually reported as ‘acceptable’ or ‘good’<br />Based on arbitrary cut-off around 0.7 (0.6, 0.5…)<br />“Test-retest reliability was acceptable (r=0.43)”<br />Criteria for validity are usually not specified in advance<br />Any factor structure can be accommodated<br />Any association can be cited as ‘validating’ scale<br />Formal testing of ‘scalability’ of items rare<br />10<br />
  11. 11. Disordered categories<br />What we would like: interval scales<br />What we might have: ordinal scales<br />What we probably have: disordered categories<br />A scale that cannot rank-order people is not a scale<br />11<br />
  12. 12. Item ‘difficulty’ (intensity)<br />The problem arises because CTT does not account for item difficulty or intensity<br />Some items are endorsed at low levels of the construct<br />‘Low intensity item’<br />Endorsement may indicate low or high level of construct<br />Some items are endorsed at high levels of the construct<br />‘High intensity item’<br />Endorsement indicates high level of construct<br />12<br />
  13. 13. Example: PHQ-9<br />Feeling tired on several days is a low intensity item<br />Endorsed at low level of depression<br />But may also be endorsed at higher levels of depression <br />13<br />[Figure: item endorsed (‘Yes’) at all four points shown along the depression continuum, from low to high]<br />
  14. 14. Example: PHQ-9<br />Thoughts of hurting yourself in some way nearly every day in last 2 weeks is a high intensity item<br />Endorsed at high level of depression<br />But not endorsed at lower levels of depression <br />14<br />[Figure: item not endorsed (‘No’) at the three lower points and endorsed (‘Yes’) only at the highest point along the depression continuum]<br />
  15. 15. How CTT fails to deal with item intensity<br />Factor analysis groups items of similar intensity<br />Factor analysis of a unidimensional construct will produce more than one ‘factor’<br />These ‘factors’ are simply sets of items with similar intensities<br />15<br />
  16. 16. Example: GHQ-12<br />Many studies report 2- or 3-factor solutions<br />‘Factors’ simply group items by intensity (Hankins 2008)<br />16<br />[Figure: GHQ-12 items placed along the psychiatric morbidity continuum from low to high, in three rows: items 7, 4, 5, 2, 6, 10, 11; items 1, 12, 9; items 8, 3]<br />
  17. 17. How CTT fails to deal with item intensity<br />Selecting items on basis of factor analysis exacerbates problem, but simultaneously conceals it<br />Items are selected on basis of similar intensities, creating scales with limited range but high reliability<br />17<br />[Figure: GHQ-12 items placed along the psychiatric morbidity continuum from low to high, in three rows: items 7, 4, 5, 2, 6, 10, 11; items 1, 12, 9; items 8, 3]<br />[Figure: a second copy of the continuum showing only items 7, 4; items 1, 12; items 8, 3]<br />
  18. 18. Why Rasch modelling is not the answer<br />Rasch modelling (RM) explicitly takes into account item intensities<br />Stochastic Guttman scale<br />Tests the hypothesis that items form a scale<br />Additionally claims to produce interval scaling & ‘objective’ measurement<br />Increasingly popular in Health Psychology<br />18<br />
  19. 19. CTT vs. IRT<br />Argument tends to be that IRT is superior to CTT & IRT is ‘objective’ measurement<br />Differences more apparent than real:<br />Large correlations between CTT data & IRT data<br />If data treated as ordinal, perfect correlation between CTT & Rasch data<br />19<br />From Embretson & Reise (2000) <br />
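One way to see why the ordinal correlation is perfect: under the Rasch model, with item difficulties treated as known, the maximum-likelihood person estimate depends on the responses only through the raw sum score, and it rises monotonically with that score. The sketch below solves the likelihood equation (expected score = observed sum score) by bisection. The item difficulties are hypothetical values chosen for illustration, not estimates from any data in the slides.

```python
import math

def expected_score(theta, difficulties):
    """Expected sum score at ability theta under the Rasch model."""
    return sum(1 / (1 + math.exp(-(theta - b))) for b in difficulties)

def rasch_theta_for_sum_score(r, difficulties, lo=-10.0, hi=10.0, tol=1e-9):
    """ML person estimate for raw score r, treating item difficulties as known.

    Only defined for 0 < r < number of items (extreme scores have no finite estimate).
    """
    if not 0 < r < len(difficulties):
        raise ValueError("r must be strictly between 0 and the number of items")
    # expected_score is strictly increasing in theta, so bisection is safe
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_score(mid, difficulties) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical item difficulties (logits)
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
for r in range(1, len(difficulties)):
    print(r, round(rasch_theta_for_sum_score(r, difficulties), 3))
```

Every respondent with the same sum score gets the same Rasch estimate, and higher sum scores always map to higher estimates, so the two scorings rank people identically.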
  20. 20. GHQ-12: CTT scoring vs. RM scoring<br />20<br />
  21. 21. Problems<br />Rasch models require very large samples to allow estimation of person and item parameters<br />Very strong assumptions, e.g. logistic item-response curve<br />Why should all items have the same form of response?<br />The data must fit the model, not the other way round<br />Discards potentially useful data to fit arbitrary assumptions<br />Interval scaling is questionable gain if psychological constructs are not quantitative in the first place<br />21<br />
  22. 22. Ontological diversion<br />In general, psychologists seem to believe that attributes are either categorical or quantitative<br />A ‘cat’ is different from a ‘tree’: different categories, difference is qualitative<br />30cm is different from 60cm: different quantities, difference is quantitative<br />Having made this distinction, quantitative attributes may be measured as categorical, ordinal, interval<br />Ordinal attributes cannot exist in their own right<br />Just a way of collecting data on a quantitative attribute<br />22<br />
  23. 23. Ontological diversion<br />Russell (1896): the difference between two quantities is itself a quantity<br />The difference between two lengths is itself a length<br />For psychological attributes to be quantitative, the difference between two ‘levels’ of that attribute must itself be a ‘level’ of that attribute<br />Is the difference between two pleasures itself a pleasure?<br />Is the difference between two levels of depression itself a level of depression?<br />If not, are psychological states then merely categorical?<br />But what then do we mean by ‘severity’ of depression?<br />23<br />
  24. 24. Ontological diversion<br />Is it possible for psychological attributes to be ordinal?<br />Can something exist in degree but not quantity?<br />Michell (2009) argues that we cannot assume quantity from degree<br />shows that they are logically separable: “It is possible that an ordered attribute is non-quantitative”<br />Collingwood (1933) argues that some concepts exist only in degree<br />24<br />
  25. 25. Ontological diversion<br />Are we comfortable talking about degree, rather than quantity?<br />Implicit in our descriptions and experiences of psychological attributes<br />But does not require the assumption that the attributes are quantitative <br />25<br />
  26. 26. The degrees of the lie<br />JAQUES<br />Can you nominate in order now the degrees of the lie?<br />TOUCHSTONE<br />O sir, we quarrel in print, by the book; as you have books for good manners: I will name you the degrees. The first, the Retort Courteous; the second, the Quip Modest; the third, the Reply Churlish; the fourth, the Reproof Valiant; the fifth, the Countercheque Quarrelsome; the sixth, the Lie with Circumstance; the seventh, the Lie Direct.<br />As You Like It, Act 5 Scene 4<br />26<br />
  27. 27. Summary<br />Measurement methods in health psychology are suboptimal<br />In particular, the fundamental assumption that correlated items form a scale is not routinely tested<br />IRT models such as the Rasch model assume that interval scaling is meaningful<br />Psychological attributes may not exist as quantities<br />Is there a method for constructing purely ordinal scales?<br />27<br />
  28. 28. Non-parametric IRT (NPIRT)<br />E.g. Mokken (1971)<br />Takes into account item intensities<br />Stochastic Guttman scale<br />Claims only to rank order people<br />Very weak assumptions<br />Retains data<br />Complements CTT<br />Uses simple scale score<br />28<br />
  29. 29. Examples of NPIRT analysis<br />
  30. 30. Mokken (1971) proposed two models<br />Monotone homogeneity model (MH)<br />Doubly monotone model (DM)<br />Scales fitting the MH model rank order people on the attribute of interest<br />Corollary is that scales not fitting the MH model do not rank order people on the attribute of interest <br />
  31. 31. Select items for the scale based on homogeneity<br />Assess whether the resulting scale fits the MH model<br />Scaling procedure and the MH model based on the following minimal assumptions: <br />For all items, if person A has a higher degree of X than person B, A’s probability of endorsing an item will be equal to or higher than B’s<br />Local independence: item scores are uncorrelated for the same degree of attribute<br />
  32. 32. If the purpose of the scale is to rank order people on a given attribute then the scale must be monotone homogeneous<br />Probability of item being endorsed must be monotone nondecreasing against attribute<br />i.e. probability of item endorsement does not decrease with an increase in the measured attribute<br />* - as estimated from the remaining items of the scale<br />
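A rough way to inspect this requirement is to estimate the attribute from the remaining items (the ‘rest score’) and check that the proportion endorsing the item never falls as the rest score rises. The sketch below is a simplified illustration only: it does not merge small rest-score groups or test violations for significance, and the function names and data are made up.

```python
from collections import defaultdict

def endorsement_by_rest_score(data, item):
    """Proportion endorsing `item` (0/1) at each rest score (sum of the other items)."""
    groups = defaultdict(list)
    for row in data:
        rest = sum(row) - row[item]
        groups[rest].append(row[item])
    return {rest: sum(v) / len(v) for rest, v in sorted(groups.items())}

def monotonicity_violations(data, item):
    """Adjacent rest-score groups where the endorsement proportion decreases."""
    props = list(endorsement_by_rest_score(data, item).values())
    return [(a, b) for a, b in zip(props, props[1:]) if b < a]

# Hypothetical binary responses: item 2 is endorsed less often at high rest scores
responses = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 0],
]
print(endorsement_by_rest_score(responses, 2))
print(monotonicity_violations(responses, 2))
```

An empty list of violations is consistent with monotone homogeneity for that item; any decrease flags an item whose endorsement probability falls as the attribute rises.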
  33. 33. For this GHQ-12 item the probability of endorsement reaches 50% at a low level of psychological distress.<br />It is therefore a low intensity item: people endorsing this item are signalling a low level of distress. <br />
  34. 34. For this GHQ-12 item the probability of endorsement reaches 50% at a high level of psychological distress.<br />It is therefore a high intensity item: people endorsing this item are signalling a high level of distress. <br />
  35. 35. If two items belong to a unidimensional scale, then:<br />Endorsing the more intense item entails that the less intense item also be endorsed<br />Endorsing the less intense item does not entail that the more intense item be endorsed<br />For a Guttman scale, these are deterministic statements<br />For a Mokken scale, these are probabilistic statements<br />
  36. 36. Less intense item<br />More intense item<br />A Guttman error occurs when the more intense item is endorsed but not the less intense item<br />Too many Guttman errors imply that items are not measuring the same attribute<br />
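The definition of a Guttman error can be expressed directly in code. This is a minimal sketch for binary (0/1) item scores; the function name, column indices and data are illustrative assumptions.

```python
def guttman_errors(data, less_intense, more_intense):
    """Rows where the more intense item is endorsed but the less intense item is not."""
    return [
        row for row in data
        if row[more_intense] == 1 and row[less_intense] == 0
    ]

# Hypothetical responses: column 0 = less intense item, column 1 = more intense item
responses = [
    [1, 0],  # consistent: only the less intense item endorsed
    [1, 1],  # consistent: both endorsed
    [0, 1],  # Guttman error
    [0, 0],  # consistent: neither endorsed
]
print(len(guttman_errors(responses, less_intense=0, more_intense=1)))
```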
  37. 37. This asymmetrical relationship between item pairs can be summarised with Loevinger’s H <br />H is the coefficient of homogeneity between two items i and j<br />Ranges from 0.0 to 1.0 for non-negatively associated items<br />0.0 indicates no association between items<br />1.0 indicates perfect association, given the differences in item intensity<br />1.0 also indicates no Guttman errors<br />Mokken (1971) developed H for scale development<br />Hij: Homogeneity of pair of items<br />Hi : Homogeneity of item i with all items<br />H : Homogeneity of scale<br />
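For a pair of binary items, Hij can be computed as 1 minus the ratio of observed Guttman errors to the number expected if the items were independent. The sketch below assumes 0/1 item scores and treats the more frequently endorsed item as the less intense one; the function name and data are illustrative. In sample data the value can fall below 0 when items are negatively associated.

```python
def loevinger_h_pair(data, i, j):
    """Loevinger's H for binary items i and j: 1 - observed / expected Guttman errors."""
    n = len(data)
    p_i = sum(row[i] for row in data) / n
    p_j = sum(row[j] for row in data) / n
    # The more frequently endorsed item is treated as the less intense ("easy") one
    easy, hard = (i, j) if p_i >= p_j else (j, i)
    p_easy, p_hard = max(p_i, p_j), min(p_i, p_j)
    observed = sum(1 for row in data if row[hard] == 1 and row[easy] == 0)
    expected = n * p_hard * (1 - p_easy)  # errors expected under independence
    if expected == 0:
        raise ValueError("H is undefined when an item is endorsed by everyone or no one")
    return 1 - observed / expected

# Hypothetical data: a perfect Guttman pattern, then two unrelated items
perfect = [[0, 0], [1, 0], [1, 1], [1, 1]]
unrelated = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(loevinger_h_pair(perfect, 0, 1), loevinger_h_pair(unrelated, 0, 1))
```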
  38. 38. All Hij > 0<br />Start with item pair with highest Hij<br />Select third item to maximise scale H<br />Proceed until H reaches threshold value c<br />Produces a unidimensional scale<br />c = 0.3; weak scale<br />c = 0.4; medium scale<br />c = 0.5; strong scale<br />c = 1.0; perfect Guttman scale<br />
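The selection steps above can be sketched as a greedy loop. This is a simplified illustration, not Mokken's full procedure: it uses only the scale H and the Hij > 0 requirement, omits the significance tests used in practice, assumes binary items that are each endorsed by some but not all respondents, and uses made-up function names and data.

```python
from itertools import combinations

def pair_errors(data, i, j):
    """Observed and expected (under independence) Guttman errors for items i and j."""
    n = len(data)
    p_i = sum(row[i] for row in data) / n
    p_j = sum(row[j] for row in data) / n
    easy, hard = (i, j) if p_i >= p_j else (j, i)
    observed = sum(1 for row in data if row[hard] == 1 and row[easy] == 0)
    expected = n * min(p_i, p_j) * (1 - max(p_i, p_j))
    return observed, expected

def scale_h(data, items):
    """Scale H: 1 - total observed / total expected Guttman errors over all item pairs."""
    pairs = [pair_errors(data, i, j) for i, j in combinations(items, 2)]
    return 1 - sum(o for o, _ in pairs) / sum(e for _, e in pairs)

def select_items(data, c=0.3):
    """Greedy item selection: best pair first, then add items while scale H stays >= c."""
    k = len(data[0])
    scale = list(max(combinations(range(k), 2), key=lambda p: scale_h(data, p)))
    if scale_h(data, scale) < c:
        return []
    remaining = [m for m in range(k) if m not in scale]
    while remaining:
        # Candidate must have Hij > 0 with every item already selected
        candidates = [m for m in remaining
                      if all(scale_h(data, (m, s)) > 0 for s in scale)]
        if not candidates:
            break
        best = max(candidates, key=lambda m: scale_h(data, scale + [m]))
        if scale_h(data, scale + [best]) < c:
            break
        scale.append(best)
        remaining.remove(best)
    return scale

# Hypothetical data: items 0-2 form a Guttman pattern, item 3 is unrelated to them
responses = [
    [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0],
    [1, 1, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1],
]
print(select_items(responses, c=0.3))
```

In this made-up example the unrelated item is excluded by the Hij > 0 requirement even though the scale H for all four items would still exceed 0.3.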
  39. 39. Results for GHQ-12<br />Step Item Scale H<br />1 p6d 0.79<br />1 n4d 0.79<br />2 n6d 0.73<br />3 n5d 0.68<br />4 n2d 0.64<br />5 n3d 0.61<br />6 p5d 0.59<br />7 p3d 0.57<br />8 p4d 0.55<br />9 n1d 0.53<br />10 p2d 0.51<br />11 p1d 0.50<br />=> the items of the GHQ-12 form a strong unidimensional scale <br />
  40. 40. Monotone homogeneity model: GHQ-12<br />Item H #vi maxvi zmax #zsig<br />p1d 0.44 0 0.00 0.00 0<br />n1d 0.45 0 0.00 0.00 0<br />p2d 0.43 1 0.06 0.99 0<br />p3d 0.50 0 0.00 0.00 0<br />n2d 0.55 0 0.00 0.00 0<br />n3d 0.51 0 0.00 0.00 0<br />p4d 0.47 0 0.00 0.00 0<br />p5d 0.50 1 0.05 0.90 0<br />n4d 0.56 0 0.00 0.00 0<br />n5d 0.50 0 0.00 0.00 0<br />n6d 0.56 1 0.05 0.93 0<br />p6d 0.53 1 0.04 0.68 0<br />Small deviations from MH model but none significant<br />
  41. 41.
  42. 42.
  43. 43. Conclusion<br />The GHQ-12 is a strongly homogeneous unidimensional scale<br />Small deviations from monotone homogeneity, none significant<br />The GHQ-12 summed score can rank order people by the measured attribute<br />i.e. it can serve as an ordinal measure of severity of psychiatric impairment<br />Compare to results of EFA/CFA studies<br />
  44. 44. Example: Northwick Park dependency scale<br />Item selection from pool of 16 items<br />Item Scale H<br />Q8 0.93<br />Q5 0.93<br />Q9 0.93<br />Q2 0.91<br />Q1 0.88<br />Q13 0.87<br />Q7 0.84<br />Q12 0.82<br />Q6 0.79<br />Q14 0.76<br />Q4 0.74<br />Q3 0.70<br />Q11 0.67<br />Q15 0.62<br />14 items form unidimensional scale<br />
  45. 45. Two items with serious violations of monotone homogeneity<br />Item H #vi maxvi zmax #zsig<br />Q3 0.45 6 0.25 2.88 4<br />Q11 0.32 5 0.28 3.43 2<br />Q3: help required using toilet (urination)<br />Q11: help required with drinking<br />
  46. 46.
  47. 47. Some items decrease in probability as attribute increases<br />With extreme dependency, patients require less help with drinking and emptying bladder<br />Because at this extreme, they are more likely to be tube-fed and catheterised <br />Hence, for these items, probability of endorsement decreases as dependency increases<br />Scale is not monotone homogeneous<br />The summed score will not rank order people on the measured attribute<br />
  48. 48. Summary<br />The credibility of Health Psychology research & practice rests on its empirical evidence base<br />This evidence base relies on the quality of questionnaire data<br />The quality of questionnaire data may be compromised by the use of inappropriate methods<br />We should stop relying on factor analysis & reliability coefficients & test the hypothesis that a set of items constitutes a scale<br />48<br />
