How does health psychology measure up?

A critical look at measurement in Health Psychology.

Published in: Career, Technology, Business

  1. 1. How does health psychology measure up?<br />A critical look at measurement in health psychology<br />Matthew Hankins<br />16th September 2011 <br />
  2. 2. The empirical basis of Health Psychology<br />Why do Health Psychologists collect data?<br />Theory generation, esp. identifying constructs<br />Theory corroboration <br />Measuring outcomes (trials etc.)<br />The value of such activities is therefore critically dependent on the quality of the data <br />2<br />
  3. 3. Questionnaire measures<br />Majority of data collected by Health Psychologists is generated by questionnaire measures (‘scales’)<br />Questionnaires vary in the quality of data that they generate<br />Validity: extent to which the questionnaire measures what is intended<br />Reliability: extent to which variance in data reflects variance in construct measured<br />Index of measurement error <br />3<br />
  4. 4. Pragmatic approach<br />Validity<br />Unidimensionality (factor analysis)<br />Associations between measures<br />Discrimination between known groups<br />Reliability<br />Estimated by Cronbach’s Alpha<br />Or test-retest correlation <br />4<br />
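As an illustration of the reliability estimate named on the slide above, here is a minimal sketch of Cronbach's Alpha computed from raw item scores. The function name and the toy data are hypothetical and not part of the talk; the formula is the standard one, alpha = k/(k-1) × (1 − sum of item variances / variance of the summed score).

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's Alpha for a list of respondents, each a list of item scores.

    Raises ZeroDivisionError if the summed score has zero variance.
    """
    k = len(scores[0])                                   # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])   # variance of the scale score
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: three items that move together perfectly.
identical_items = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
# Hypothetical data: two items with zero covariance.
unrelated_items = [[1, 0], [0, 1], [1, 1], [0, 0]]

alpha_identical = cronbach_alpha(identical_items)
alpha_unrelated = cronbach_alpha(unrelated_items)
```

Note that alpha depends only on the inter-item covariances, so a high value says nothing about whether the items can rank order people, which is the point developed in the later slides.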
  5. 5. Scale development<br />Combination of these approaches is derived from ‘Classical Test Theory’ (CTT)<br />Originated with Spearman (1904)<br />Landmark text: Guilford 2nd ed. (1954) <br />Fully developed by Lord & Novick (1968)<br />Further developments: ‘item-response theory’ (IRT)<br />E.g. Rasch model (1960)<br />CTT implicit in most empirical Health Psychology research<br />5<br />
  6. 6. What is a scale?<br />A scale orders people on the construct of interest<br />Both CTT & IRT agree that a person’s position on the dimension can be estimated from the item scores<br />Strength of IRT is that it does not assume that a set of correlated items forms a scale<br />Implicit in CTT: if items load on same factor, we automatically assume that they form a scale<br />6<br />[Figure: Persons A, B, C and D ordered from Low to High on the Construct]<br />
  7. 7. Scaling problem<br />Whether a set of items forms a scale is a hypothesis (Guttman 1950)<br />Formally tested whether items formed ‘Guttman scales’<br />“In contemporary psychometric practice, it is the rule rather than the exception that two people having the same score on a test will have [endorsed] different items… Such scores are crude empirical devices known to have some predictive efficiency, but they cannot be called measurements in any strict sense” (Loevinger 1948)<br />Additionally, there is no rational basis for adding up a set of ordinal Likert scores unless they have been shown to scale<br />7<br />
  8. 8. Example: PHQ-9<br />Feeling tired + Little interest in doing things + Poor appetite several days in last 2 weeks<br />Scale score = +3<br />Thoughts of hurting yourself in some way nearly every day in last 2 weeks<br />Scale score = +3<br />Are these responses really equivalent?<br />8<br />
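The arithmetic behind the PHQ-9 slide above can be sketched as follows. The 0 to 3 scoring of the response options is the standard PHQ-9 scoring; the dictionary keys, the function name and the convention that unlisted items count as "not at all" are assumptions made for this illustration.

```python
# Standard PHQ-9 response scoring (0-3 per item).
RESPONSE_SCORES = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}

def sum_score(responses):
    """Summed scale score; items not listed are treated as 'not at all'."""
    return sum(RESPONSE_SCORES[answer] for answer in responses.values())

# Three low-intensity symptoms, each on several days.
person_a = {
    "feeling tired": "several days",
    "little interest in doing things": "several days",
    "poor appetite": "several days",
}
# One high-intensity symptom, nearly every day.
person_b = {"thoughts of hurting yourself": "nearly every day"}

score_a = sum_score(person_a)
score_b = sum_score(person_b)
```

Both response patterns receive the same summed score, although they describe very different clinical pictures.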
  9. 9. Implications<br />If a set of items is merely assumed to form a scale, then we cannot be sure that the scale score accurately ranks people on the construct of interest<br />People with different positions may be assigned the same score<br />People with the same position may be assigned different scores<br />Unless we test this hypothesis, assessing reliability & validity is pointless<br />9<br />
  10. 10. Rejecting the hypothesis of a scale<br />Scales are very rarely ‘rejected’ in health psychology<br />Reliability is usually reported as ‘acceptable’ or ‘good’<br />Based on arbitrary cut-off around 0.7 (0.6, 0.5…)<br />“Test-retest reliability was acceptable (r=0.43)”<br />Criteria for validity are usually not specified in advance<br />Any factor structure can be accommodated<br />Any association can be cited as ‘validating’ scale<br />Formal testing of ‘scalability’ of items rare<br />10<br />
  11. 11. 11<br />Disordered categories<br />What we would like: interval scales<br />What we might have: ordinal scales<br />What we probably have: disordered categories<br />A scale that cannot rank-order people is not a scale<br />
  12. 12. Item ‘difficulty’ (intensity)<br />The problem arises because CTT does not account for item difficulty or intensity<br />Some items are endorsed at low levels of the construct<br />‘Low intensity item’<br />Endorsement may indicate low or high level of construct<br />Some items are endorsed at high levels of the construct<br />‘High intensity item’<br />Endorsement indicates high level of construct<br />12<br />
  13. 13. Example: PHQ-9<br />Feeling tired on several days is a low intensity item<br />Endorsed at low level of depression<br />But may also be endorsed at higher levels of depression <br />13<br />[Figure: Depression from Low to High; item endorsed at every level shown (Yes, Yes, Yes, Yes)]<br />
  14. 14. Example: PHQ-9<br />Thoughts of hurting yourself in some way nearly every day in last 2 weeks is a high intensity item<br />Endorsed at high level of depression<br />But not endorsed at lower levels of depression <br />14<br />[Figure: Depression from Low to High; item endorsed only at the highest level shown (No, No, No, Yes)]<br />
  15. 15. How CTT fails to deal with item intensity<br />Factor analysis groups items of similar intensity<br />Factor analysis of a unidimensional construct will produce more than one ‘factor’<br />These ‘factors’ are simply sets of items with similar intensities<br />15<br />
  16. 16. Example: GHQ-12<br />Many studies report 2- or 3-factor solutions<br />‘Factors’ simply group items by intensity (Hankins 2008)<br />16<br />[Figure: GHQ-12 item numbers positioned between Low and High psychiatric morbidity; item groups shown: 7 4 5 2 6 10 11; 1 12 9; 8 3]<br />
  17. 17. How CTT fails to deal with item intensity<br />Selecting items on basis of factor analysis exacerbates problem, but simultaneously conceals it<br />Items are selected on basis of similar intensities, creating scales with limited range but high reliability<br />17<br />[Figure: two Low-to-High psychiatric morbidity axes; the first shows all GHQ-12 item groups (7 4 5 2 6 10 11; 1 12 9; 8 3), the second a selected subset (7 4; 1 12; 8 3)]<br />
  18. 18. Why Rasch modelling is not the answer<br />Rasch modelling (RM) explicitly takes into account item intensities<br />Stochastic Guttman scale<br />Tests the hypothesis that items form a scale<br />Additionally claims to produce interval scaling & ‘objective’ measurement<br />Increasingly popular in Health Psychology<br />18<br />
  19. 19. CTT vs. IRT<br />Argument tends to be that IRT is superior to CTT & IRT is ‘objective’ measurement<br />Differences more apparent than real:<br />Large correlations between CTT data & IRT data<br />If data treated as ordinal, perfect correlation between CTT & Rasch data<br />19<br />From Embretson & Reise (2000) <br />
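The claim of a perfect ordinal correlation follows from a known property of the Rasch model: the raw summed score is a sufficient statistic for the person parameter, so the maximum-likelihood person estimate is a strictly increasing function of the summed score. A minimal sketch, assuming hypothetical item difficulties that are treated as known (none of these values come from the talk):

```python
import math

def expected_score(theta, difficulties):
    """Expected summed score under the Rasch model for ability theta."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def rasch_theta(raw_score, difficulties, low=-10.0, high=10.0, tol=1e-8):
    """ML person estimate for a non-extreme raw score, found by bisection.

    Solves expected_score(theta) == raw_score, which is the likelihood
    equation when item difficulties are known.
    """
    while high - low > tol:
        mid = (low + high) / 2.0
        if expected_score(mid, difficulties) < raw_score:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Hypothetical difficulties for a five-item scale.
difficulties = [-1.5, -0.5, 0.0, 0.8, 1.7]

# Person estimates for raw scores 1..4 (scores of 0 and 5 have no finite estimate).
thetas = [rasch_theta(r, difficulties) for r in range(1, len(difficulties))]
rank_order_preserved = all(a < b for a, b in zip(thetas, thetas[1:]))
```

Because the estimates increase with the raw score, ranking people by Rasch estimate and ranking them by summed score give the same order; the two scorings differ only in the spacing between adjacent scores.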
  20. 20. GHQ-12: CTT scoring vs. RM scoring<br />20<br />
  21. 21. Problems<br />Rasch models require very large samples to allow estimation of person and item parameters<br />Very strong assumptions, e.g. logistic item-response curve<br />Why should all items have the same form of response?<br />The data must fit the model, not the other way round<br />Discards potentially useful data to fit arbitrary assumptions<br />Interval scaling is questionable gain if psychological constructs are not quantitative in the first place<br />21<br />
  22. 22. Ontological diversion<br />In general, psychologists seem to believe that attributes are either categorical or quantitative<br />A ‘cat’ is different from a ‘tree’: different categories, difference is qualitative<br />30cm is different from 60cm: different quantities, difference is quantitative<br />Having made this distinction, quantitative attributes may be measured as categorical, ordinal, interval<br />Ordinal attributes cannot exist in their own right<br />Just a way of collecting data on a quantitative attribute<br />22<br />
  23. 23. Ontological diversion<br />Russell (1896): the difference between two quantities is itself a quantity<br />The difference between two lengths is itself a length<br />For psychological attributes to be quantitative, the difference between two ‘levels’ of that attribute must itself be a ‘level’ of that attribute<br />Is the difference between two pleasures itself a pleasure?<br />Is the difference between two levels of depression itself a level of depression?<br />If not, are psychological states then merely categorical?<br />But what then do we mean by ‘severity’ of depression?<br />23<br />
  24. 24. Ontological diversion<br />Is it possible for psychological attributes to be ordinal?<br />Can something exist in degree but not quantity?<br />Michell (2009) argues that we cannot assume quantity from degree<br />shows that they are logically separable: “It is possible that an ordered attribute is non-quantitative”<br />Collingwood (1933) argues that some concepts exist only in degree<br />24<br />
  25. 25. Ontological diversion<br />Are we comfortable talking about degree, rather than quantity?<br />Implicit in our descriptions and experiences of psychological attributes<br />But does not require the assumption that the attributes are quantitative <br />25<br />
  26. 26. The degrees of the lie<br />JAQUES<br />Can you nominate in order now the degrees of the lie?<br />TOUCHSTONE<br />O sir, we quarrel in print, by the book; as you have books for good manners: I will name you the degrees. The first, the Retort Courteous; the second, the Quip Modest; the third, the Reply Churlish; the fourth, the Reproof Valiant; the fifth, the Countercheque Quarrelsome; the sixth, the Lie with Circumstance; the seventh, the Lie Direct.<br />As You Like It, Act 5 Scene 4<br />26<br />
  27. 27. Summary<br />Measurement methods in health psychology are suboptimal<br />In particular, the fundamental assumption that correlated items form a scale is not routinely tested<br />IRT models such as the Rasch model assume that interval scaling is meaningful<br />Psychological attributes may not exist as quantities<br />Is there a method for constructing purely ordinal scales?<br />27<br />
  28. 28. Non-parametric IRT (NPIRT)<br />E.g. Mokken (1971)<br />Takes into account item intensities<br />Stochastic Guttman scale<br />Claims only to rank order people<br />Very weak assumptions<br />Retains data<br />Complements CTT<br />Uses simple scale score<br />28<br />
  29. 29. Examples of NPIRT analysis<br />
  30. 30. Mokken (1971) proposed two models<br />Monotone homogeneity model (MH)<br />Doubly monotone model (DM)<br />Scales fitting the MH model rank order people on the attribute of interest<br />Corollary is that scales not fitting the MH model do not rank order people on the attribute of interest <br />
  31. 31. Select items for the scale based on homogeneity<br />Assess whether the resulting scale fits the MH model<br />Scaling procedure and the MH model based on the following minimal assumptions: <br />For all items, if person A has a higher degree of X than person B, A’s probability of endorsing an item will be equal to or higher than B’s<br />Local independence: item scores are uncorrelated for the same degree of attribute<br />
  32. 32. If the purpose of the scale is to rank order people on a given attribute then the scale must be monotone homogeneous<br />Probability of item being endorsed must be monotone nondecreasing against attribute*<br />i.e. probability of item endorsement does not decrease with an increase in the measured attribute<br />* - as estimated from the remaining items of the scale<br />
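The monotonicity requirement above can be checked descriptively by grouping respondents on their rest score (the summed score on the remaining items) and looking at the proportion endorsing the item in each group. This is a simplified sketch with hypothetical function names and toy data; dedicated Mokken software additionally merges small rest-score groups and tests whether any decrease is statistically significant.

```python
from collections import defaultdict

def rest_score_proportions(data, item):
    """Proportion endorsing `item` (0/1 data) within each rest-score group."""
    groups = defaultdict(list)
    for row in data:
        rest = sum(row) - row[item]      # score on the remaining items
        groups[rest].append(row[item])
    return {rest: sum(v) / len(v) for rest, v in sorted(groups.items())}

def monotonicity_violations(proportions):
    """Adjacent rest-score pairs where the endorsement proportion decreases."""
    keys = sorted(proportions)
    return [(lower, upper) for lower, upper in zip(keys, keys[1:])
            if proportions[upper] < proportions[lower]]

# Hypothetical responses to three dichotomous items.
data = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
]
props_item_2 = rest_score_proportions(data, 2)
violations_item_2 = monotonicity_violations(props_item_2)
```

An empty list of violations is consistent with monotone homogeneity for that item; a non-empty list flags rest-score groups that deserve a closer look.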
  33. 33. For this GHQ-12 item the probability of endorsement reaches 50% at a low level of psychological distress.<br />It is therefore a low intensity item: people endorsing this item are signalling a low level of distress. <br />
  34. 34. For this GHQ-12 item the probability of endorsement reaches 50% at a high level of psychological distress.<br />It is therefore a high intensity item: people endorsing this item are signalling a high level of distress. <br />
  35. 35. If two items belong to a unidimensional scale, then:<br />Endorsing the more intense item entails that the less intense item also be endorsed<br />Endorsing the less intense item does not entail that the more intense item be endorsed<br />For a Guttman scale, these are deterministic statements<br />For a Mokken scale, these are probabilistic statements<br />
  36. 36. Less intense item<br />More intense item<br />A Guttman error occurs when the more intense item is endorsed but not the less intense item<br />Too many Guttman errors imply that items are not measuring the same attribute<br />
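Counting Guttman errors for one item pair follows directly from the definition above. A minimal sketch for dichotomous (0/1) responses; the function name, the column indices and the toy data are hypothetical.

```python
def guttman_errors(data, less_intense, more_intense):
    """Number of respondents endorsing the more intense item but not the less intense one."""
    return sum(
        1 for row in data
        if row[more_intense] == 1 and row[less_intense] == 0
    )

# Columns: [less intense item, more intense item]
responses = [
    [1, 0],  # consistent: only the less intense item endorsed
    [1, 1],  # consistent: both endorsed
    [0, 0],  # consistent: neither endorsed
    [0, 1],  # Guttman error
]
n_errors = guttman_errors(responses, less_intense=0, more_intense=1)
```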
  37. 37. This asymmetrical relationship between item pairs can be summarised with Loevinger’s H <br />H is the coefficient of homogeneity between two items i and j<br />Ranges from 0.0 to 1.0<br />0.0 indicates no association between items<br />1.0 indicates perfect association, given the differences in item intensity<br />1.0 also indicates no Guttman errors<br />Mokken (1971) developed H for scale development<br />Hij: Homogeneity of pair of items<br />Hi : Homogeneity of item i with all items<br />H : Homogeneity of scale<br />
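One common way to write Loevinger's H is as one minus the ratio of observed Guttman errors to the number expected if the items were independent. The sketch below applies this to dichotomous data; function names and data are hypothetical, and the less intense item of a pair is taken to be the one endorsed more often.

```python
from itertools import combinations

def pair_errors(data, i, j):
    """Observed and expected Guttman errors for items i and j (0/1 data)."""
    n = len(data)
    n_i = sum(row[i] for row in data)
    n_j = sum(row[j] for row in data)
    # The more frequently endorsed item is treated as the less intense one.
    easy, hard = (i, j) if n_i >= n_j else (j, i)
    n_easy, n_hard = max(n_i, n_j), min(n_i, n_j)
    observed = sum(1 for row in data if row[hard] == 1 and row[easy] == 0)
    expected = n_hard * (n - n_easy) / n   # under independence of the two items
    return observed, expected

def h_pair(data, i, j):
    """Hij; undefined (ZeroDivisionError) if an item is endorsed by everyone or no one."""
    observed, expected = pair_errors(data, i, j)
    return 1 - observed / expected

def h_item(data, i):
    """Hi: homogeneity of item i with all other items."""
    pairs = [pair_errors(data, i, j) for j in range(len(data[0])) if j != i]
    return 1 - sum(f for f, _ in pairs) / sum(e for _, e in pairs)

def h_scale(data):
    """H: homogeneity of the whole scale."""
    pairs = [pair_errors(data, i, j) for i, j in combinations(range(len(data[0])), 2)]
    return 1 - sum(f for f, _ in pairs) / sum(e for _, e in pairs)

# Hypothetical data: a perfect Guttman pattern, then the same data plus one inconsistent respondent.
perfect = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
noisy = perfect + [[0, 1, 0]]
```

With the perfect pattern there are no Guttman errors and H is 1.0; adding a single inconsistent respondent lowers the pair and scale coefficients.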
  38. 38. All Hij > 0<br />Start with item pair with highest Hij<br />Select third item to maximise scale H<br />Proceed until H reaches threshold value c<br />Produces a unidimensional scale<br />c = 0.3; weak scale<br />c = 0.4; medium scale<br />c = 0.5; strong scale<br />c = 1.0; perfect Guttman scale<br />
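The selection steps on the slide above can be sketched as a greedy search. This is a simplified reading, not Mokken's full procedure: it starts from the pair with the highest Hij, adds the item that keeps scale H highest, requires positive Hij with every item already selected, and stops when scale H would drop below the threshold c. Published implementations also apply item-level (Hi) criteria and significance tests, which are omitted here. All names and data are hypothetical.

```python
from itertools import combinations

def pair_errors(data, i, j):
    """Observed and expected Guttman errors for items i and j (0/1 data)."""
    n = len(data)
    n_i = sum(row[i] for row in data)
    n_j = sum(row[j] for row in data)
    easy, hard = (i, j) if n_i >= n_j else (j, i)
    n_easy, n_hard = max(n_i, n_j), min(n_i, n_j)
    observed = sum(1 for row in data if row[hard] == 1 and row[easy] == 0)
    return observed, n_hard * (n - n_easy) / n

def h_of(data, items):
    """Scale H for the given subset of item indices."""
    pairs = [pair_errors(data, i, j) for i, j in combinations(items, 2)]
    return 1 - sum(f for f, _ in pairs) / sum(e for _, e in pairs)

def select_items(data, c=0.3):
    """Greedy item selection; returns the indices of the selected items."""
    k = len(data[0])
    start = max(combinations(range(k), 2), key=lambda pair: h_of(data, pair))
    if h_of(data, start) < c:
        return []
    selected = list(start)
    remaining = [i for i in range(k) if i not in selected]
    while remaining:
        # Candidates must have positive Hij with every item already selected.
        candidates = [i for i in remaining
                      if all(h_of(data, (i, j)) > 0 for j in selected)]
        if not candidates:
            break
        best = max(candidates, key=lambda i: h_of(data, selected + [i]))
        if h_of(data, selected + [best]) < c:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: items 0-2 form a Guttman pattern; item 3 is the reverse of item 0.
data = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
chosen = select_items(data, c=0.3)
```

In this toy data the reversed item has negative Hij with every other item, so it is never admitted to the scale.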
  39. 39. Results for GHQ-12<br />Step Item Scale H<br />1 p6d 0.79<br />1 n4d 0.79<br />2 n6d 0.73<br />3 n5d 0.68<br />4 n2d 0.64<br />5 n3d 0.61<br />6 p5d 0.59<br />7 p3d 0.57<br />8 p4d 0.55<br />9 n1d 0.53<br />10 p2d 0.51<br />11 p1d 0.50<br />=> the items of the GHQ-12 form a strong unidimensional scale <br />
  40. 40. Monotone homogeneity model: GHQ-12<br />Item H #vi maxvi zmax #zsig<br />p1d 0.44 0 0.00 0.00 0<br />n1d 0.45 0 0.00 0.00 0<br />p2d 0.43 1 0.06 0.99 0<br />p3d 0.50 0 0.00 0.00 0<br />n2d 0.55 0 0.00 0.00 0<br />n3d 0.51 0 0.00 0.00 0<br />p4d 0.47 0 0.00 0.00 0<br />p5d 0.50 1 0.05 0.90 0<br />n4d 0.56 0 0.00 0.00 0<br />n5d 0.50 0 0.00 0.00 0<br />n6d 0.56 1 0.05 0.93 0<br />p6d 0.53 1 0.04 0.68 0<br />Small deviations from MH model but none significant<br />
  41. 41.
  42. 42.
  43. 43. Conclusion<br />The GHQ-12 is a strongly homogeneous unidimensional scale<br />Small deviations from monotone homogeneity, none significant<br />The GHQ-12 summed score can rank order people by the measured attribute<br />i.e. it can serve as an ordinal measure of severity of psychiatric impairment<br />Compare to results of EFA/CFA studies<br />
  44. 44. Example: Northwick Park dependency scale<br />Item selection from pool of 16 items<br />Item Scale H<br />Q8 0.93<br />Q5 0.93<br />Q9 0.93<br />Q2 0.91<br />Q1 0.88<br />Q13 0.87<br />Q7 0.84<br />Q12 0.82<br />Q6 0.79<br />Q14 0.76<br />Q4 0.74<br />Q3 0.70<br />Q11 0.67<br />Q15 0.62<br />14 items form unidimensional scale<br />
  45. 45. Two items with serious violations of monotone homogeneity<br />Item H #vi maxvi zmax #zsig<br />Q3 0.45 6 0.25 2.88 4<br />Q11 0.32 5 0.28 3.43 2<br />Q3: help required using toilet (urination)<br />Q11: help required with drinking<br />
  46. 46.
  47. 47. Some items decrease in probability as attribute increases<br />With extreme dependency, patients require less help with drinking and emptying bladder<br />Because at this extreme, they are more likely to be tube-fed and catheterised <br />Hence, for these items, probability of endorsement decreases as dependency increases<br />Scale is not monotone homogeneous<br />The summed score will not rank order people on the measured attribute<br />
  48. 48. Summary<br />The credibility of Health Psychology research & practice rests on its empirical evidence base<br />This evidence base relies on the quality of questionnaire data<br />The quality of questionnaire data may be compromised by the use of inappropriate methods<br />We should stop relying on factor analysis & reliability coefficients & test the hypothesis that a set of items constitutes a scale<br />48<br />
