How does health psychology measure up


Slides from a 20 minute presentation given at the Division of Health Psychology Annual Conference (Southampton 2011).

Published in: Health & Medicine
  1. 1. How does health psychology measure up?<br />A critical look at measurement in health psychology<br />Matthew Hankins<br />16th September 2011<br />
  2. 2. The empirical basis of Health Psychology<br />Why do Health Psychologists collect data?<br />Theory generation, esp. identifying constructs<br />Theory corroboration<br />Measuring outcomes (trials etc.)<br />The value of such activities is therefore critically dependent on the quality of the data<br />
  3. 3. Questionnaire measures<br />Majority of data collected by Health Psychologists is generated by questionnaire measures (‘scales’)<br />Questionnaires vary in the quality of data that they generate<br />Validity: extent to which the questionnaire measures what is intended<br />Reliability: extent to which variance in data reflects variance in construct measured<br />Index of measurement error<br />
  4. 4. Pragmatic approach<br />Validity<br />Unidimensionality (factor analysis)<br />Associations between measures<br />Discrimination between known groups<br />Reliability<br />Estimated by Cronbach’s Alpha<br />Or test-retest correlation<br />
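Cronbach's Alpha, named above as the usual reliability estimate, can be sketched in a few lines. This is a minimal illustration, assuming complete data with one row per respondent; the function name `cronbach_alpha` and the toy responses are hypothetical and not taken from the slides.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    Assumes at least two items and a total score that varies across respondents.
    """
    k = len(rows[0])
    item_variances = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical toy data: three respondents, two items that move together.
toy_rows = [[0, 0], [1, 1], [2, 2]]
print(cronbach_alpha(toy_rows))
```

A high value here only says that the items covary; it does not test whether the summed score rank-orders people, which is the point the later slides develop.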
  5. 5. Scale development<br />Combination of these approaches is derived from ‘Classical Test Theory’ (CTT)<br />Originated with Spearman (1904)<br />Landmark text: Guilford 2nd ed. (1954)<br />Fully developed by Lord & Novick (1968)<br />Further developments: ‘item-response theory’ (IRT)<br />E.g. Rasch model (1960)<br />CTT implicit in most empirical Health Psychology research<br />
  6. 6. CTT vs. IRT<br />Argument tends to be that IRT is superior to CTT<br />In particular, it is argued that IRT is ‘objective’ measurement<br />For large samples, differences more apparent than real:<br />Strong correlations between CTT data & IRT data<br />And differences tend to be smaller than the margin of error<br />If data treated as ordinal, perfect correlation between CTT & Rasch data<br />
  7. 7. What is a scale?<br />A scale orders people on the construct of interest<br />Both CTT & IRT agree that a person’s position on the dimension can be estimated from the item scores<br />Strength of IRT is that it does not assume that a set of correlated items forms a scale<br />Implicit in CTT: if items load on same factor, we automatically assume that they form a scale<br />[Figure: Persons A, B, C and D placed in order from Low to High on the construct]<br />
  8. 8. Scaling problem<br />Whether a set of items forms a scale is a hypothesis (Guttman 1950)<br />Formally tested whether items formed ‘Guttman scales’<br />“In contemporary psychometric practice, it is the rule rather than the exception that two people having the same score on a test will have [endorsed] different items… Such scores are crude empirical devices known to have some predictive efficiency, but they cannot be called measurements in any strict sense” (Loevinger 1948)<br />Additionally, there is no rational basis for adding up a set of ordinal Likert scores unless they have been shown to scale<br />
  9. 9. Example: PHQ-9<br />Feeling tired + Little interest in doing things + Poor appetite several days in last 2 weeks<br />Scale score = +3<br />Thoughts of hurting yourself in some way nearly every day in last 2 weeks<br />Scale score = +3<br />Are these responses really equivalent?<br />
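The arithmetic behind this example can be made explicit. The sketch below assumes the standard PHQ-9 response coding of 0 to 3; the shortened item labels and the two hypothetical respondents are illustrative only, and unlisted items are treated as "not at all".

```python
# PHQ-9 response options and their scores (0-3).
SCORES = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}

# Hypothetical respondent A: three low-intensity symptoms on several days.
person_a = {
    "feeling tired": "several days",
    "little interest in doing things": "several days",
    "poor appetite": "several days",
}

# Hypothetical respondent B: one high-intensity symptom nearly every day.
person_b = {
    "thoughts of hurting yourself": "nearly every day",
}


def scale_score(responses):
    """Simple summed score over the items a respondent endorsed."""
    return sum(SCORES[answer] for answer in responses.values())


print(scale_score(person_a), scale_score(person_b))
```

Both respondents receive the same summed score even though they endorsed entirely different items, which is the situation Loevinger's quotation describes.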
  10. 10. Implications<br />If a set of items is merely assumed to form a scale, then we cannot be sure that the scale score accurately ranks people on the construct of interest<br />People with different positions may be assigned the same score<br />People with the same position may be assigned different scores<br />Unless we test the scaling hypothesis, assessing reliability & validity is pointless<br />
  11. 11. Disordered categories<br />What we would like: interval scales<br />What we think we have: ordinal scales<br />What we probably have: disordered categories<br />A scale that cannot rank-order people is not a scale<br />
  12. 12. Item ‘difficulty’ (intensity)<br />The problem arises because CTT does not account for item difficulty or intensity<br />Some items are endorsed at low levels of the construct<br />‘Low intensity item’<br />Endorsement may indicate low or high level of construct<br />Some items are endorsed at high levels of the construct<br />‘High intensity item’<br />Endorsement indicates high level of construct<br />
  13. 13. Example: PHQ-9<br />Feeling tired on several days is a low intensity item<br />Endorsed at low level of depression<br />But may also be endorsed at higher levels of depression<br />[Figure: item endorsed (Yes, Yes, Yes, Yes) at every point from Low to High depression]<br />
  14. 14. Example: PHQ-9<br />Thoughts of hurting yourself in some way nearly every day in last 2 weeks is a high intensity item<br />Endorsed at high level of depression<br />But not endorsed at lower levels of depression<br />[Figure: item not endorsed (No, No, No) at lower levels and endorsed (Yes) only at the High end of depression]<br />
  15. 15. How CTT fails to deal with item intensity<br />Factor analysis groups items of similar intensity<br />Factor analysis of a unidimensional construct will produce more than one ‘factor’<br />These ‘factors’ are simply sets of items with similar intensities<br />
  16. 16. Example: GHQ-12<br />Many studies report 2- or 3-factor solutions<br />‘Factors’ simply group items by intensity<br />[Figure: GHQ-12 items placed along a Low to High psychiatric morbidity axis in three rows: 7, 4, 5, 2, 6, 10, 11; 1, 12, 9; 8, 3]<br />
  17. 17. How CTT fails to deal with item intensity<br />Selecting items on basis of factor analysis exacerbates problem, but simultaneously conceals it<br />Items are selected on basis of similar intensities, creating scales with limited range but high reliability<br />[Figure: the full GHQ-12 item layout (rows 7, 4, 5, 2, 6, 10, 11; 1, 12, 9; 8, 3) followed by a reduced item set (rows 7, 4; 1, 12; 8, 3), each on a Low to High psychiatric morbidity axis]<br />
  18. 18. Why Rasch modelling is not the answer<br />Rasch modelling explicitly takes into account item intensities<br />Stochastic Guttman scale<br />Additionally claims to produce interval scaling & ‘objective’ measurement<br />Increasingly popular in Health Psychology<br />
  19. 19. Problems<br />Rasch models require very large samples to allow estimation of person and item parameters<br />Very strong assumptions, e.g. logistic item-response curve<br />The data must fit the model, not the other way round<br />Discards useful data to fit arbitrary assumptions<br />Interval scaling is a questionable gain if psychological constructs are not quantitative in the first place<br />
  20. 20. Non-parametric IRT (NPIRT)<br />E.g. Mokken (1971)<br />Takes into account item intensities<br />Stochastic Guttman scale<br />Claims only to rank order people<br />Very weak assumptions<br />Retains data<br />Complements CTT<br />Uses simple scale score<br />
  22. 22. PROMIS project<br />NIH funded project since 2004 ($100m)<br />Establish a domain framework and develop candidate items for adult and paediatric Patient Reported Outcome Measures<br />Questionnaires developed using published methodology<br />Scaling methods include NPIRT and Graded Response Model (GRM)<br />
  23. 23. Summary<br />The credibility of Health Psychology research & practice rests on its empirical evidence base<br />This evidence base relies on the quality of questionnaire data<br />The quality of questionnaire data may be compromised by the use of inappropriate methods<br />We should stop relying on factor analysis & reliability coefficients & test the hypothesis that a set of items constitutes a scale<br />
  24. 24. Examples of NPIRT<br />
  25. 25. Mokken (1971) proposed two models<br />Monotone homogeneity model (MH)<br />Doubly monotone model (DM)<br />Scales fitting the MH model rank order people on the attribute of interest<br />Corollary is that scales not fitting the MH model do not rank order people on the attribute of interest <br />
  26. 26. Select items for the scale based on homogeneity<br />Assess whether the resulting scale fits the MH model<br />Scaling procedure and the MH model based on the following minimal assumptions: <br />For all items, if person A has a higher degree of X than person B, A’s probability of endorsing an item will be equal to or higher than B’s<br />Local independence: item scores are uncorrelated for the same degree of attribute<br />
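The two minimal assumptions above can be written compactly. The notation is an assumption for illustration and does not appear on the slides: $\theta$ is a person's level of the attribute X, $X_i \in \{0, 1\}$ is the response to item $i$, and $P_i(\theta)$ is the probability of endorsing item $i$ at level $\theta$.

```latex
% Monotonicity: a higher attribute level never lowers the endorsement probability
\theta_A \ge \theta_B \;\Longrightarrow\; P_i(\theta_A) \ge P_i(\theta_B)
\quad \text{for every item } i

% Local independence: at a fixed attribute level, item responses are independent
P(X_1 = x_1, \dots, X_k = x_k \mid \theta) \;=\; \prod_{i=1}^{k} P(X_i = x_i \mid \theta)
```

Local independence in this form is slightly stronger than the slide's wording of uncorrelated item scores at a fixed attribute level, since independence implies zero correlation but not the reverse.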
  27. 27. If the purpose of the scale is to rank order people on a given attribute then the scale must be monotone homogeneous<br />Probability of item being endorsed must be monotone nondecreasing against attribute*<br />i.e. probability of item endorsement does not decrease with an increase in the measured attribute<br />* as estimated from the remaining items of the scale<br />
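A check of this kind can be sketched as follows. This is a simplified illustration, not the procedure used for the results in these slides: it groups respondents by rest score (the sum of the remaining items), computes the proportion endorsing the item in each group, and flags any pair of groups where the proportion falls as the rest score rises. The function name, the `min_decrease` tolerance of 0.03 and the toy data are assumptions; published procedures also merge small rest-score groups and test violations for significance, which is omitted here.

```python
from collections import defaultdict


def monotonicity_violations(rows, item, min_decrease=0.03):
    """Return (lower_rest_score, higher_rest_score) pairs where the proportion
    endorsing `item` drops by more than `min_decrease` as rest score increases.

    rows: list of respondents, each a list of 0/1 item scores.
    item: index of the item being checked.
    """
    groups = defaultdict(list)
    for row in rows:
        rest_score = sum(row) - row[item]  # attribute estimated from other items
        groups[rest_score].append(row[item])

    proportions = [
        (rest, sum(values) / len(values)) for rest, values in sorted(groups.items())
    ]

    violations = []
    for low in range(len(proportions)):
        for high in range(low + 1, len(proportions)):
            if proportions[low][1] - proportions[high][1] > min_decrease:
                violations.append((proportions[low][0], proportions[high][0]))
    return violations


# Hypothetical data: item 2 is endorsed at middling rest scores but not at the top.
toy_rows = [[0, 0, 0], [1, 0, 1], [1, 0, 1], [1, 1, 0], [1, 1, 0]]
print(monotonicity_violations(toy_rows, item=2))
```

An empty list means no decrease larger than the tolerance was found; a non-empty list identifies the rest-score groups between which endorsement fell.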
  28. 28. For this GHQ-12 item the probability of endorsement reaches 50% at a low level of psychological distress<br />It is therefore a low intensity item: people endorsing this item are signalling a low level of distress<br />Note that probability (Y-axis) increases with increase in class score (X-axis)<br />
  29. 29. For this GHQ-12 item the probability of endorsement reaches 50% at a high level of psychological distress<br />It is therefore a high intensity item: people endorsing this item are signalling a high level of distress<br />Note that probability (Y-axis) also increases with increase in class score (X-axis), but curves:<br />Do not have the same slope<br />Are not required to have the same shape<br />
  30. 30. If two items belong to a unidimensional scale, then:<br />Endorsing the more intense item entails that the less intense item also be endorsed<br />Endorsing the less intense item does not entail that the more intense item be endorsed<br />For a Guttman scale, these are deterministic statements<br />For a Mokken scale, these are probabilistic statements<br />
  31. 31. [Figure labels: less intense item, more intense item]<br />A Guttman error occurs when the more intense item is endorsed but not the less intense item<br />Too many Guttman errors imply that items are not measuring the same attribute<br />
  32. 32. This asymmetrical relationship between item pairs can be summarised with Loevinger’s H <br />H is the coefficient of homogeneity between two items i and j<br />Ranges from 0.0 to 1.0<br />0.0 indicates no association between items<br />1.0 indicates perfect association, given the differences in item intensity<br />1.0 also indicates no Guttman errors<br />Mokken (1971) developed H for scale development<br />Hij: Homogeneity of pair of items<br />Hi : Homogeneity of item i with all items<br />H : Homogeneity of scale<br />
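For a pair of dichotomous items, the coefficient can be computed as one minus the ratio of observed Guttman errors to the number expected if the items were independent. The sketch below assumes 0/1 scoring and takes the item endorsed less often as the more intense one; the function name `pair_h` and the toy data are hypothetical.

```python
def pair_h(x, y):
    """Loevinger's H for two 0/1 item vectors of equal length.

    H = 1 - observed Guttman errors / errors expected under independence.
    Undefined (division by zero) if the less intense item is endorsed by
    everyone or the more intense item by no one.
    """
    n = len(x)
    # The item endorsed less often is treated as the more intense item.
    easy, hard = (x, y) if sum(x) >= sum(y) else (y, x)
    observed = sum(1 for e, h in zip(easy, hard) if h == 1 and e == 0)
    expected = n * (sum(hard) / n) * (1 - sum(easy) / n)
    return 1 - observed / expected


# Hypothetical data: a perfect Guttman pattern and an unrelated pair.
print(pair_h([1, 1, 1, 0], [1, 0, 0, 0]))
print(pair_h([1, 1, 0, 0], [1, 0, 1, 0]))
```

With real data, sampling variation can push the observed errors above the expected count, giving a negative sample value, so the stated 0.0 to 1.0 range is best read as the range that holds when the items do form a scale.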
  33. 33. All Hij > 0<br />Start with item pair with highest Hij<br />Select third item to maximise scale H<br />Proceed until H reaches threshold value c<br />Produces a unidimensional scale<br />c = 0.3; weak scale<br />c = 0.4; medium scale<br />c = 0.5; strong scale<br />c = 1.0; perfect Guttman scale<br />
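The selection steps above can be sketched as a greedy search. This is a simplified illustration under stated assumptions, not the exact algorithm behind the results that follow: items are scored 0/1, scale H is computed as one minus the ratio of summed observed to summed expected Guttman errors over all item pairs, and the significance tests and item-level H checks of a full procedure are omitted. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def error_counts(x, y):
    """Observed and expected Guttman errors for two 0/1 item vectors."""
    n = len(x)
    easy, hard = (x, y) if sum(x) >= sum(y) else (y, x)
    observed = sum(1 for e, h in zip(easy, hard) if h == 1 and e == 0)
    expected = n * (sum(hard) / n) * (1 - sum(easy) / n)
    return observed, expected


def scale_h(items, names):
    """H for the set of items in `names` (a pair gives Hij)."""
    observed = expected = 0.0
    for a, b in combinations(names, 2):
        o, e = error_counts(items[a], items[b])
        observed += o
        expected += e
    return 1 - observed / expected  # assumes expected > 0


def select_items(items, threshold=0.3):
    """Start from the best pair, then keep adding the item that maximises
    scale H, requiring Hij > 0 with every selected item and H >= threshold."""
    best_pair = max(combinations(items, 2), key=lambda pair: scale_h(items, pair))
    if scale_h(items, best_pair) < threshold:
        return []
    selected = list(best_pair)
    remaining = [name for name in items if name not in selected]
    while remaining:
        candidates = [
            name for name in remaining
            if all(scale_h(items, (name, s)) > 0 for s in selected)
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda name: scale_h(items, selected + [name]))
        if scale_h(items, selected + [best]) < threshold:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: three items in a Guttman pattern plus one unrelated item.
toy_items = {
    "a": [1, 1, 1, 1, 1, 0, 0, 0],
    "b": [1, 1, 1, 0, 0, 0, 0, 0],
    "c": [1, 0, 0, 0, 0, 0, 0, 0],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(select_items(toy_items, threshold=0.3))
```

Because scale H typically falls as items are added, the order of entry matters, which is why the results tables list items by selection step with the running scale H.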
  34. 34. Results for GHQ-12<br />Step Item Scale H<br />1 p6d 0.79<br />1 n4d 0.79<br />2 n6d 0.73<br />3 n5d 0.68<br />4 n2d 0.64<br />5 n3d 0.61<br />6 p5d 0.59<br />7 p3d 0.57<br />8 p4d 0.55<br />9 n1d 0.53<br />10 p2d 0.51<br />11 p1d 0.50<br />=> the items of the GHQ-12 form a strong unidimensional scale <br />
  35. 35. Monotone homogeneity model: GHQ-12<br />Item H #vi maxvi zmax #zsig<br />p1d 0.44 0 0.00 0.00 0<br />n1d 0.45 0 0.00 0.00 0<br />p2d 0.43 1 0.06 0.99 0<br />p3d 0.50 0 0.00 0.00 0<br />n2d 0.55 0 0.00 0.00 0<br />n3d 0.51 0 0.00 0.00 0<br />p4d 0.47 0 0.00 0.00 0<br />p5d 0.50 1 0.05 0.90 0<br />n4d 0.56 0 0.00 0.00 0<br />n5d 0.50 0 0.00 0.00 0<br />n6d 0.56 1 0.05 0.93 0<br />p6d 0.53 1 0.04 0.68 0<br />Small deviations from MH model but none significant<br />
  38. 38. Conclusion<br />The GHQ-12 is a strongly homogeneous unidimensional scale<br />Small deviations from monotone homogeneity, none significant<br />The GHQ-12 summed score can rank order people by the measured attribute<br />i.e. it can serve as an ordinal measure of severity of psychiatric impairment<br />Compare to results of EFA/CFA studies<br />
  39. 39. Example: Northwick Park dependency scale<br />Item selection from pool of 16 items<br />Item Scale H<br />Q8 0.93<br />Q5 0.93<br />Q9 0.93<br />Q2 0.91<br />Q1 0.88<br />Q13 0.87<br />Q7 0.84<br />Q12 0.82<br />Q6 0.79<br />Q14 0.76<br />Q4 0.74<br />Q3 0.70<br />Q11 0.67<br />Q15 0.62<br />14 items form unidimensional scale<br />
  40. 40. Two items with serious violations of monotone homogeneity<br />Item H #vi maxvi zmax #zsig<br />Q3 0.45 6 0.25 2.88 4<br />Q11 0.32 5 0.28 3.43 2<br />Q3: help required using toilet (urination)<br />Q11: help required with drinking<br />
  42. 42. These items decrease in probability at the top end of the scale<br />With extreme dependency, patients require less help with drinking and emptying bladder<br />Because at this extreme, they are more likely to be tube-fed and catheterised<br />Hence, for these items, probability of endorsement decreases as dependency increases<br />Scale is not monotone homogeneous<br />The summed score will not rank order people on the measured attribute<br />
