How does health psychology measure up

How does health psychology measure up? A critical look at measurement in health psychology
Matthew Hankins, 16th September 2011

The empirical basis of Health Psychology
Why do Health Psychologists collect data?
- Theory generation, esp. identifying constructs
- Theory corroboration
- Measuring outcomes (trials etc.)
The value of such activities is therefore critically dependent on the quality of the data

Questionnaire measures
- Majority of data collected by Health Psychologists is generated by questionnaire measures (‘scales’)
- Questionnaires vary in the quality of data that they generate
- Validity: extent to which the questionnaire measures what is intended
- Reliability: extent to which variance in data reflects variance in construct measured
  - Index of measurement error

Pragmatic approach
Validity
- Unidimensionality (factor analysis)
- Associations between measures
- Discrimination between known groups
Reliability
- Estimated by Cronbach’s Alpha
- Or test-retest correlation
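
As a rough illustration of the reliability estimate named above, the sketch below computes Cronbach's Alpha from the item variances and the variance of the summed score. The function name `cronbach_alpha` and the five-person data set are hypothetical and are not taken from the talk.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per person, one score per item)."""
    n_items = len(responses[0])
    # Variance of each item across respondents
    item_variances = [
        pvariance([row[i] for row in responses]) for i in range(n_items)
    ]
    # Variance of the summed scale score
    total_variance = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 people answering 3 items scored 0-3
example = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 2, 1],
    [3, 2, 3],
    [3, 3, 3],
]
alpha = cronbach_alpha(example)
print(round(alpha, 2))
```

Note that a high value here says nothing about whether the summed score ranks people correctly; it only reflects how strongly the items covary.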

Scale development
- Combination of these approaches is derived from ‘Classical Test Theory’ (CTT)
  - Originated with Spearman (1904)
  - Landmark text: Guilford 2nd ed. (1954)
  - Fully developed by Lord & Novick (1968)
- Further developments: ‘item-response theory’ (IRT)
  - E.g. Rasch model (1960)
- CTT implicit in most empirical Health Psychology research

CTT vs. IRT
- Argument tends to be that IRT is superior to CTT
- In particular, it is argued that IRT is ‘objective’ measurement
- For large samples, differences more apparent than real:
  - Strong correlations between CTT data & IRT data
  - And differences tend to be smaller than the margin of error
- If data treated as ordinal, perfect correlation between CTT & Rasch data

What is a scale?
- A scale orders people on the construct of interest
- Both CTT & IRT agree that a person’s position on the dimension can be estimated from the item scores
- Strength of IRT is that it does not assume that a set of correlated items forms a scale
- Implicit in CTT: if items load on same factor, we automatically assume that they form a scale
[Figure: Persons A, B, C and D placed along the construct from Low to High]

Scaling problem
- Whether a set of items forms a scale is a hypothesis (Guttman 1950)
  - Formally tested whether items formed ‘Guttman scales’
- “In contemporary psychometric practice, it is the rule rather than the exception that two people having the same score on a test will have [endorsed] different items… Such scores are crude empirical devices known to have some predictive efficiency, but they cannot be called measurements in any strict sense” (Loevinger 1948)
- Additionally, there is no rational basis for adding up a set of ordinal Likert scores unless they have been shown to scale

Example: PHQ-9
- Feeling tired + Little interest in doing things + Poor appetite, each on several days in last 2 weeks: scale score = +3
- Thoughts of hurting yourself in some way nearly every day in last 2 weeks: scale score = +3
Are these responses really equivalent?
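
The arithmetic behind this example can be spelled out in a few lines. The sketch assumes the usual PHQ-9 response coding (0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day); the names `RESPONSE_CODES`, `scale_score`, `person_a` and `person_b` are illustrative only.

```python
# Usual PHQ-9 response coding (0-3 per item)
RESPONSE_CODES = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}


def scale_score(responses):
    """Sum the coded responses; items left out of the dict contribute nothing."""
    return sum(RESPONSE_CODES[answer] for answer in responses.values())


# Three low intensity symptoms, each on several days
person_a = {
    "feeling tired": "several days",
    "little interest in doing things": "several days",
    "poor appetite": "several days",
}
# One high intensity symptom, nearly every day
person_b = {
    "thoughts of hurting yourself": "nearly every day",
}

score_a = scale_score(person_a)
score_b = scale_score(person_b)
print(score_a, score_b)  # 3 3
```

The summed score cannot tell these two response patterns apart, which is the point of the question above.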

Implications
- If a set of items is merely assumed to form a scale, then we cannot be sure that the scale score accurately ranks people on the construct of interest
  - People with different positions may be assigned the same score
  - People with the same position may be assigned different scores
- Unless we test the hypothesis, assessing reliability & validity is pointless

Disordered categories
- What we would like: interval scales
- What we think we have: ordinal scales
- What we probably have: disordered categories
A scale that cannot rank-order people is not a scale

Item ‘difficulty’ (intensity)
- The problem arises because CTT does not account for item difficulty or intensity
- Some items are endorsed at low levels of the construct
  - ‘Low intensity item’
  - Endorsement may indicate low or high level of construct
- Some items are endorsed at high levels of the construct
  - ‘High intensity item’
  - Endorsement indicates high level of construct

Example: PHQ-9
- Feeling tired on several days is a low intensity item
- Endorsed at low level of depression
- But may also be endorsed at higher levels of depression
[Figure: item endorsed (Yes, Yes, Yes, Yes) at every point from Low to High depression]

Example: PHQ-9
- Thoughts of hurting yourself in some way nearly every day in last 2 weeks is a high intensity item
- Endorsed at high level of depression
- But not endorsed at lower levels of depression
[Figure: item not endorsed (No, No, No) at lower levels and endorsed (Yes) only at High depression]

How CTT fails to deal with item intensity
- Factor analysis groups items of similar intensity
- Factor analysis of a unidimensional construct will produce more than one ‘factor’
- These ‘factors’ are simply sets of items with similar intensities

Example: GHQ-12
- Many studies report 2- or 3-factor solutions
- ‘Factors’ simply group items by intensity
[Figure: the twelve GHQ-12 items (7, 4, 5, 2, 6, 10, 11, 1, 12, 9, 8, 3) placed along a Low-to-High psychiatric morbidity axis]

How CTT fails to deal with item intensity
- Selecting items on basis of factor analysis exacerbates problem, but simultaneously conceals it
- Items are selected on basis of similar intensities, creating scales with limited range but high reliability
[Figure: the twelve GHQ-12 items along a Low-to-High psychiatric morbidity axis, and a selected subset (7, 4, 1, 12, 8, 3) on the same axis]

Why Rasch modelling is not the answer
- Rasch modelling explicitly takes into account item intensities
  - Stochastic Guttman scale
- Additionally claims to produce interval scaling & ‘objective’ measurement
- Increasingly popular in Health Psychology

Problems
- Rasch models require very large samples to allow estimation of person and item parameters
- Very strong assumptions, e.g. logistic item-response curve
- The data must fit the model, not the other way round
  - Discards useful data to fit arbitrary assumptions
- Interval scaling is a questionable gain if psychological constructs are not quantitative in the first place

Non-parametric IRT (NPIRT)
- E.g. Mokken (1971)
- Takes into account item intensities
- Stochastic Guttman scale
- Claims only to rank order people
- Very weak assumptions
- Retains data
- Complements CTT
- Uses simple scale score

PROMIS project
- NIH funded project since 2004 ($100m)
- Establish a domain framework and develop candidate items for adult and paediatric Patient Reported Outcome Measures
- Questionnaires developed using published methodology
- Scaling methods include NPIRT and Graded Response Model (GRM)

Summary
- The credibility of Health Psychology research & practice rests on its empirical evidence base
- This evidence base relies on the quality of questionnaire data
- The quality of questionnaire data may be compromised by the use of inappropriate methods
- We should stop relying on factor analysis & reliability coefficients & test the hypothesis that a set of items constitutes a scale

Mokken (1971) proposed two models
- Monotone homogeneity model (MH)
- Doubly monotone model (DM)
Scales fitting the MH model rank order people on the attribute of interest
Corollary is that scales not fitting the MH model do not rank order people on the attribute of interest

- Select items for the scale based on homogeneity
- Assess whether the resulting scale fits the MH model
Scaling procedure and the MH model based on the following minimal assumptions:
- For all items, if person A has a higher degree of X than person B, A’s probability of endorsing an item will be equal to or higher than B’s
- Local independence: item scores are uncorrelated for the same degree of attribute

- If the purpose of the scale is to rank order people on a given attribute then the scale must be monotone homogeneous
- Probability of item being endorsed must be monotone nondecreasing against the attribute (as estimated from the remaining items of the scale)
  - i.e. probability of item endorsement does not decrease with an increase in the measured attribute
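
One simple way to inspect this requirement is to estimate the attribute by the rest score (the sum of the remaining items) and check that the proportion endorsing the item never falls as the rest score rises. The sketch below assumes dichotomous (0/1) items; `rest_score_curve`, `monotonicity_violations` and both data sets are hypothetical. Dedicated Mokken scaling software typically also pools small rest-score groups and tests violations for significance, which this sketch does not do.

```python
from collections import defaultdict


def rest_score_curve(data, item):
    """Proportion endorsing `item` at each rest score.

    data: list of rows of 0/1 item scores, one row per person.
    The rest score is the sum of all other items in the row.
    """
    counts = defaultdict(lambda: [0, 0])  # rest score -> [endorsed, total]
    for row in data:
        rest = sum(row) - row[item]
        counts[rest][0] += row[item]
        counts[rest][1] += 1
    return [(rest, e / n) for rest, (e, n) in sorted(counts.items())]


def monotonicity_violations(curve):
    """Count adjacent rest-score groups where the proportion decreases."""
    props = [p for _, p in curve]
    return sum(1 for lo, hi in zip(props, props[1:]) if hi < lo)


# Hypothetical data in which every item behaves monotonically
guttman_data = [
    [0, 0, 0], [0, 0, 0],
    [1, 0, 0], [1, 0, 0],
    [1, 1, 0], [1, 1, 0],
    [1, 1, 1], [1, 1, 1],
]
# Hypothetical data in which the last item is dropped again at the top end
bad_data = [
    [0, 0, 0], [0, 0, 0],
    [1, 0, 1], [1, 0, 1],
    [1, 1, 0], [1, 1, 0],
]

print(rest_score_curve(guttman_data, 0))  # [(0, 0.5), (1, 1.0), (2, 1.0)]
print(monotonicity_violations(rest_score_curve(bad_data, 2)))  # 1
```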

- For this GHQ-12 item the probability of endorsement reaches 50% at a low level of psychological distress
- It is therefore a low intensity item: people endorsing this item are signalling a low level of distress
- Note that probability (Y-axis) increases with increase in class score (X-axis)

- For this GHQ-12 item the probability of endorsement reaches 50% at a high level of psychological distress
- It is therefore a high intensity item: people endorsing this item are signalling a high level of distress
- Note that probability (Y-axis) also increases with increase in class score (X-axis), but curves:
  - Do not have the same slope
  - Are not required to have the same shape

If two items belong to a unidimensional scale, then:
- Endorsing the more intense item entails that the less intense item also be endorsed
- Endorsing the less intense item does not entail that the more intense item be endorsed
For a Guttman scale, these are deterministic statements
For a Mokken scale, these are probabilistic statements

[Figure: less intense item set against more intense item]
- A Guttman error occurs when the more intense item is endorsed but not the less intense item
- Too many Guttman errors imply that items are not measuring the same attribute
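
Counting Guttman errors for a pair of dichotomous items is straightforward once the items are ordered by intensity. A minimal sketch follows; the function name `guttman_errors` and the six responses are made up for illustration.

```python
def guttman_errors(easy, hard):
    """Count people who endorse the more intense item but not the less intense one.

    easy, hard: equal-length lists of 0/1 scores for the less intense
    and the more intense item respectively.
    """
    return sum(1 for e, h in zip(easy, hard) if h == 1 and e == 0)


# Hypothetical responses from 6 people
less_intense = [1, 1, 1, 1, 0, 0]
more_intense = [1, 1, 0, 0, 1, 0]

errors = guttman_errors(less_intense, more_intense)
print(errors)  # 1 (the fifth person)
```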

This asymmetrical relationship between item pairs can be summarised with Loevinger’s H
- H is the coefficient of homogeneity between two items i and j
- For positively associated items, ranges from 0.0 to 1.0 (negative values arise only when items are negatively associated)
  - 0.0 indicates no association between items
  - 1.0 indicates perfect association, given the differences in item intensity
  - 1.0 also indicates no Guttman errors
- Mokken (1971) developed H for scale development
  - Hij: Homogeneity of pair of items i and j
  - Hi: Homogeneity of item i with all other items
  - H: Homogeneity of scale
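
For dichotomous items, H can be written as one minus the ratio of observed Guttman errors to the number expected if the items were independent. The sketch below uses that formulation for Hij, Hi and H. All function names and the eight-person data set are hypothetical, and the code assumes each item is endorsed by some but not all respondents (otherwise the expected error count is zero).

```python
from itertools import combinations


def pair_errors(data, i, j):
    """Observed and expected Guttman errors for items i and j (0/1 scores)."""
    n = len(data)
    p_i = sum(row[i] for row in data) / n
    p_j = sum(row[j] for row in data) / n
    # The item endorsed more often is treated as the less intense one
    easy, hard = (i, j) if p_i >= p_j else (j, i)
    observed = sum(1 for row in data if row[hard] == 1 and row[easy] == 0)
    # Errors expected if the two items were statistically independent
    expected = n * min(p_i, p_j) * (1 - max(p_i, p_j))
    return observed, expected


def h_pair(data, i, j):
    """Hij: homogeneity of one item pair."""
    observed, expected = pair_errors(data, i, j)
    return 1 - observed / expected


def h_item(data, i):
    """Hi: homogeneity of item i with all other items."""
    pairs = [pair_errors(data, i, j) for j in range(len(data[0])) if j != i]
    return 1 - sum(o for o, _ in pairs) / sum(e for _, e in pairs)


def h_scale(data):
    """H: homogeneity of the whole set of items."""
    n_items = len(data[0])
    pairs = [pair_errors(data, i, j) for i, j in combinations(range(n_items), 2)]
    return 1 - sum(o for o, _ in pairs) / sum(e for _, e in pairs)


# Hypothetical data: 8 people, 3 items, one Guttman error (last row)
example = [
    [0, 0, 0],
    [1, 0, 0], [1, 0, 0],
    [1, 1, 0], [1, 1, 0],
    [1, 1, 1], [1, 1, 1],
    [0, 1, 0],
]
print(round(h_pair(example, 0, 1), 2))  # 0.2
print(round(h_scale(example), 2))       # 0.6
```

The single error affects only the pair (0, 1); the other two pairs have no errors and so have Hij = 1.0.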

- All Hij > 0
- Start with item pair with highest Hij
- Select third item to maximise scale H
- Proceed until H reaches threshold value c
- Produces a unidimensional scale
  - c = 0.3: weak scale
  - c = 0.4: medium scale
  - c = 0.5: strong scale
  - c = 1.0: perfect Guttman scale
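
The steps above can be sketched as a greedy search. This is a simplified illustration for dichotomous items, not the procedure as implemented in any particular Mokken package: it omits the significance tests and item-level checks that such software usually applies. The names `select_items`, `scale_h` and the data set are hypothetical.

```python
from itertools import combinations


def _errors(data, i, j):
    """Observed and expected Guttman errors for a pair of 0/1 items."""
    n = len(data)
    p_i = sum(r[i] for r in data) / n
    p_j = sum(r[j] for r in data) / n
    easy, hard = (i, j) if p_i >= p_j else (j, i)
    observed = sum(1 for r in data if r[hard] == 1 and r[easy] == 0)
    expected = n * min(p_i, p_j) * (1 - max(p_i, p_j))
    return observed, expected


def scale_h(data, items):
    """H for the given item indices (Hij when two indices are passed)."""
    pairs = [_errors(data, i, j) for i, j in combinations(items, 2)]
    return 1 - sum(o for o, _ in pairs) / sum(e for _, e in pairs)


def select_items(data, c=0.3):
    """Greedy selection: best pair first, then keep adding the item that gives the highest H."""
    all_items = range(len(data[0]))
    # Start with the item pair that has the highest Hij
    best_pair = max(combinations(all_items, 2), key=lambda p: scale_h(data, p))
    if scale_h(data, best_pair) < c:
        return []
    selected = list(best_pair)
    while True:
        candidates = []
        for k in all_items:
            if k in selected:
                continue
            # Every Hij with the already selected items must be positive
            if any(scale_h(data, (k, s)) <= 0 for s in selected):
                continue
            h = scale_h(data, selected + [k])
            if h >= c:  # adding k must not push scale H below the threshold
                candidates.append((h, k))
        if not candidates:
            break
        selected.append(max(candidates)[1])
    return selected


# Hypothetical data: items 0-2 form a perfect Guttman pattern, item 3 is unrelated
data = [
    [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0],
    [1, 1, 0, 1], [1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0],
]
chosen = select_items(data, c=0.3)
print(sorted(chosen), scale_h(data, chosen))  # [0, 1, 2] 1.0
```

Item 3 is rejected because its Hij with every selected item is zero, so it fails the first condition in the list.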

Results for GHQ-12

Step  Item  Scale H
1     p6d   0.79
1     n4d   0.79
2     n6d   0.73
3     n5d   0.68
4     n2d   0.64
5     n3d   0.61
6     p5d   0.59
7     p3d   0.57
8     p4d   0.55
9     n1d   0.53
10    p2d   0.51
11    p1d   0.50

=> the items of the GHQ-12 form a strong unidimensional scale

Monotone homogeneity model: GHQ-12

Item  H     #vi  maxvi  zmax  #zsig
p1d   0.44  0    0.00   0.00  0
n1d   0.45  0    0.00   0.00  0
p2d   0.43  1    0.06   0.99  0
p3d   0.50  0    0.00   0.00  0
n2d   0.55  0    0.00   0.00  0
n3d   0.51  0    0.00   0.00  0
p4d   0.47  0    0.00   0.00  0
p5d   0.50  1    0.05   0.90  0
n4d   0.56  0    0.00   0.00  0
n5d   0.50  0    0.00   0.00  0
n6d   0.56  1    0.05   0.93  0
p6d   0.53  1    0.04   0.68  0

Small deviations from MH model but none significant

Conclusion
- The GHQ-12 is a strongly homogeneous unidimensional scale
- Small deviations from monotone homogeneity, none significant
- The GHQ-12 summed score can rank order people by the measured attribute
  - i.e. it can serve as an ordinal measure of severity of psychiatric impairment
- Compare to results of EFA/CFA studies

Example: Northwick Park dependency scale
Item selection from pool of 16 items

Item  Scale H
Q8    0.93
Q5    0.93
Q9    0.93
Q2    0.91
Q1    0.88
Q13   0.87
Q7    0.84
Q12   0.82
Q6    0.79
Q14   0.76
Q4    0.74
Q3    0.70
Q11   0.67
Q15   0.62

14 items form unidimensional scale

Two items with serious violations of monotone homogeneity

Item  H     #vi  maxvi  zmax  #zsig
Q3    0.45  6    0.25   2.88  4
Q11   0.32  5    0.28   3.43  2

Q3: help required using toilet (urination)
Q11: help required with drinking

- These items decrease in probability at the top end of the scale
- With extreme dependency, patients require less help with drinking and emptying bladder
  - Because at this extreme, they are more likely to be tube-fed and catheterised
- Hence, for these items, probability of endorsement decreases as dependency increases
- Scale is not monotone homogeneous
- The summed score will not rank order people on the measured attribute
