# 4. Instrumentation

SFU, FHS, HSCI 432

Published in: Education

### 4. Instrumentation

1. 1. INSTRUMENTATION Simon Fraser University - HSCI 340 – Dr. Kiffer G. Card
2. 2. NOMOLOGICAL NETWORK • A set of relationships between constructs and between consequent measures. [Figure: nomological network spanning the Theoretical, Conceptual, and Empirical Domains, with labels Theory, Prediction, Operationalization, and four items.]
3. 3. WARM-UP ACTIVITY • Imagine you want to examine the relationship between “depression” and “physical health.” • Discuss in your groups how you would measure these two outcomes.
4. 4. VALIDITY • Validity has been described as 'the agreement between a test score or measure and the quality it is believed to measure.' • In other words, it measures the gap between what a test actually measures and what it is intended to measure. • This gap can be caused by two particular circumstances: • the design of the test is insufficient for the intended purpose, and • the test is used in a context or fashion which was not intended in the design.
5. 5. Validity • Construct Validity: Translation (Face, Content); Criterion (Predictive, Concurrent, Convergent, Discriminant) • Inference Validity: Internal; External
6. 6. [Figure: diagram relating Data, Reality, and Construct through Statistical Theory (Inference) and Measurement Theory (Interpretation).]
7. 7. VALIDITY | STATISTICAL THEORY • Provides the basis by which researchers can make inferences (i.e., conclusions reached on the basis of evidence) from data collected for a specific purpose. • When statistical assumptions have been met for a given test and when results have been interpreted correctly, researchers can be confident that their inferences are valid.
8. 8. VALIDITY | INFERENCE • Inference validity refers to the validity of a research design as a whole. • It refers to whether you can trust the conclusions of a study. • Generally the issue concerns causality. • Statistical measures show relationships, but it is the theory and the study design that affect what kinds of claims to causality you can reasonably make.
9. 9. VALIDITY | INFERENCE | INTERNAL • Refers to whether conclusions, especially relating to causality, are consistent with research results (e.g., statistical results) and research design (e.g., presence of appropriate control variables, use of appropriate methodology).
10. 10. Interpretation of Effect Sizes • Effect sizes are often confused or misinterpreted, or they are judged against arbitrary cut-offs, leading to internal consistency issues. • Linear Slope (Estimate): Used to measure linear correlation between two variables. Often, interpretations are ignored altogether and only the statistical “significance” of the p-value is discussed. • Pearson’s r: Used to measure linear correlation between two variables. ± 0.3 = ‘weak’; ± 0.7 = ‘moderate’; ± 1.0 = ‘strong’. • R²: Used to measure linear correlation between two variables. 0.2 = ‘weak’; 0.4 = ‘moderate’; 0.7 = ‘substantial’.
11. 11. Interpretation of Effect Sizes • Effect sizes are often confused or misinterpreted, or they are judged against arbitrary cut-offs, leading to internal consistency issues. • Linear Slope (Estimate): Tells you the unit increase in Y for each unit increase in X. • Pearson’s r: Tells you the strength and direction of an association. • R²: Tells you how well the data fit the line.
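
As a rough illustration of how the three quantities differ, the sketch below computes them by hand for a small made-up dataset. The data and the function name `slope_r_r2` are assumptions for illustration and do not come from the slides. Rescaling Y changes the slope but leaves r and R² unchanged, which is one reason cut-offs meant for r cannot be applied to a slope.

```python
from statistics import mean

def slope_r_r2(x, y):
    """Return (slope, Pearson r, R^2) for a simple linear fit of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation
    sxx = sum((a - mx) ** 2 for a in x)                   # variation in x
    syy = sum((b - my) ** 2 for b in y)                   # variation in y
    slope = sxy / sxx                # unit change in y per unit change in x
    r = sxy / (sxx * syy) ** 0.5     # strength and direction, unit-free
    return slope, r, r ** 2          # in simple regression, R^2 = r^2

# Hypothetical data; the second outcome is the first in units 100x smaller.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_rescaled = [v * 100 for v in y]

slope_a, r_a, r2_a = slope_r_r2(x, y)
slope_b, r_b, r2_b = slope_r_r2(x, y_rescaled)
print(slope_a, r_a, r2_a)
print(slope_b, r_b, r2_b)
```
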
12. 12. Interpretation of Effect Sizes • Effect sizes are often misinterpreted or they use arbitrary cut-offs, leading to internal consistency issues. • Odds Ratio: Used to compare the relative effect of one group to the “reference” group. >1.00 = positive association; <1.00 = negative association; 1.00 = no association. • Relative Risk: Used to compare the relative effect of one group to the “reference” group. >1.00 = positive association; <1.00 = negative association; 1.00 = no association. • Cohen’s d: Used to compare two means by indicating the number of standard deviations that the means differ by. 0.2 = ‘small’ effect; 0.5 = ‘medium’ effect; 0.8 = ‘large’ effect; 1.3 = ‘very large’ effect.
13. 13. Interpretation of Effect Sizes • Effect sizes are often misinterpreted or they use arbitrary cut-offs, leading to internal consistency issues. • Odds Ratio: Ratio of the odds of an event in one group to the odds of the event in another group. >1.00 = positive association; <1.00 = negative association; 1.00 = no association. • Relative Risk: Ratio of the incidence rate in one group to the incidence rate in another group. >1.00 = positive association; <1.00 = negative association; 1.00 = no association. • Cohen’s d: Used to compare two means by indicating the number of standard deviations that the means differ by. 0.2 = ‘small’ effect; 0.5 = ‘medium’ effect; 0.8 = ‘large’ effect; 1.3 = ‘very large’ effect.
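
A minimal sketch of Cohen's d, assuming the common pooled-standard-deviation form (the slides do not say which variant is intended). The two groups are made-up numbers for illustration.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference in means expressed in pooled standard deviations."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical scores for two groups.
treated = [5, 6, 7, 8, 9]
control = [4, 5, 6, 7, 8]
d = cohens_d(treated, control)
print(round(d, 2))
```

With these numbers the means differ by 1 point and the pooled standard deviation is about 1.58, so d falls between the 'medium' and 'large' labels listed above.
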
14. 14. Odds vs. Risk • Let us look at the hypothetical example of a randomized trial comparing endoscopic sclerotherapy (n = 65) to band ligation (n = 64) for the treatment of bleeding esophageal varices. • Group-specific risk of death: Ligation 18 / 64 = 0.28; Sclerotherapy 29 / 65 = 0.45. • Group-specific odds of death: Ligation 18 / 46 = 0.39; Sclerotherapy 29 / 36 = 0.81. • Overall risk of death: 47 / 129 = 0.36. • Overall odds of death: 47 / 82 = 0.57. • Relative risk (sclerotherapy vs. ligation): 0.446 / 0.281 = 1.59. • Odds ratio (sclerotherapy vs. ligation): 0.806 / 0.391 = 2.06.
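
The arithmetic on this slide can be sketched as follows, using the death counts and group sizes given above. The helper names `risk` and `odds` are chosen here for illustration.

```python
def risk(events, total):
    """Risk: events divided by everyone at risk."""
    return events / total

def odds(events, total):
    """Odds: events divided by non-events."""
    return events / (total - events)

# Counts from the slide: deaths / group size.
ligation = (18, 64)
sclerotherapy = (29, 65)

relative_risk = risk(*sclerotherapy) / risk(*ligation)
odds_ratio = odds(*sclerotherapy) / odds(*ligation)

print(f"risk:  {risk(*ligation):.3f} vs {risk(*sclerotherapy):.3f}")
print(f"odds:  {odds(*ligation):.3f} vs {odds(*sclerotherapy):.3f}")
print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}")
```

Working from the unrounded counts gives RR ≈ 1.59 and OR ≈ 2.06. Ratios computed from inputs already rounded to two decimals can differ in the second decimal place. The odds ratio lies further from 1 than the relative risk because death is a common outcome in both groups.
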
15. 15. Odds vs. Risk • Calculation of risk requires the use of “people at risk” as the denominator. • In retrospective (case-control) studies, where the total number of exposed people is not available, RR cannot be calculated and OR is used as a measure of the strength of association between exposure and outcome. • By contrast, in prospective studies (cohort studies), where the number at risk (number exposed) is available, either RR or OR can be calculated.
16. 16. P-VALUES AND INTERNAL CONSISTENCY • P-values are often misinterpreted, leading to internal consistency issues: • The p-value is not the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false. • The p-value is not the probability that the observed effects were produced by random chance alone. • The p-value does not indicate the size or importance of the observed effect. A very “significant” p-value can accompany a minuscule effect. • The 0.05 significance level is merely a convention and is nearly indistinguishable from 0.04 or 0.06; yet statements regarding “significance” are often made solely on the basis of this threshold. This is termed the p-value fallacy (Goodman, 1999).
17. 17. P-VALUES AND INTERNAL CONSISTENCY • Confidence Intervals are often misinterpreted, leading to internal consistency issues: • A common misunderstanding about CIs is that with a 95% CI, (A – B), there is a 95% probability that the true population mean lies between A and B. • This is an incorrect interpretation of 95% CI because the true population mean is a fixed unknown value that is either inside or outside the CI with 100% certainty. • In other words, the inclusion of a true population mean is not a probabilistic occurrence. • The choice of whether to use a 90% or 95% CI is somewhat arbitrary, and depends on the level of “confidence” that the investigator wishes to convey in his or her estimate.
18. 18. CORRECTLY INTERPRETING CONFIDENCE • P-Value: The probability of obtaining data as extreme, or more extreme, than those observed if the null hypothesis is correct. • Confidence Intervals: A 95% CI simply means that if the study is conducted multiple times (multiple sampling from the same population) with corresponding 95% CI for the mean constructed, we expect 95% of these CIs to contain the true population mean. [Figure: samples S1 to S5 drawn from a population distribution, showing the sampled units and the sample means.]
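
The repeated-sampling reading of a confidence interval can be sketched with a small simulation. This is an illustrative setup only: it assumes a normal population with a known standard deviation (so a z-interval with 1.96 applies), and the population mean, sample size, and seed are arbitrary choices made here.

```python
import random

def ci_coverage(true_mean=50.0, sigma=10.0, n=30, repeats=2000, seed=1):
    """Share of 95% z-intervals that contain the true population mean."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    half_width = 1.96 * sigma / n ** 0.5    # known-sigma 95% interval
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        # Each interval either contains the fixed true mean or it does not.
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / repeats

coverage = ci_coverage()
print(f"{coverage:.3f} of the intervals contain the true mean")
```

The proportion should land close to 0.95. The 95% describes the procedure across repeated studies, not the probability that any single computed interval contains the mean.
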
19. 19. MULTIPLE TESTING • The family-wise error rate (FWER) is the probability of obtaining at least one false positive across a family of tests whose null hypotheses are true. • It increases with the number of tests performed. [Figure: tests of true null hypotheses, labeled 2 / 40 = 0.05.]
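
A short sketch of how quickly the FWER grows, assuming the tests are independent and every null hypothesis is true (the slide does not state these assumptions). Under them, FWER = 1 − (1 − α)^m for m tests at level α.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) for independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

for m in (1, 5, 10, 40):
    print(m, round(family_wise_error_rate(0.05, m), 3))
```

With 40 tests at α = 0.05, the chance of at least one false positive is roughly 87%, even though only about 2 of the 40 tests (40 × 0.05) are expected to be false positives.
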
20. 20. VALIDITY | INFERENCE | EXTERNAL • Refers to whether the results of a study can be applied, or generalized, to the real world. • Three strategies for strengthening external validity: • Sampling. Select cases from a known population via a probability sample, then claim the results apply to the population as a whole. • Representativeness. Show the similarities between the cases you studied and the population you wish your results to be applied to. • Replication. Repeat the study in multiple settings. Use meta-analytic statistics to evaluate the results across studies. Although journal reviewers might not agree, consistent results across many settings with small samples may be just as good as (or better than) a large sample from a single setting.
21. 21. VALIDITY | MEASUREMENT THEORY • Measurement theory is the theory that characteristics of an individual can be categorized and represented numerically.
22. 22. VALIDITY | MEASUREMENT THEORY | PROPERTIES • Scales of measurement are distinguished by which properties of a subject are preserved. • The scale used depends on what properties are “meaningful.” • Equality properties allow you to compare equalities between objects. • e.g., if object A is 560 kelvin (K) and object B is 280 K, then value A will not equal value B. • Ordinality properties allow you to compare the order or ranking of objects. • e.g., if object A is 560 K and object B is 280 K, then value A will be greater than value B. • Interval properties allow you to compare the intervals between objects. • e.g., if object A is 560 K, object B is 280 K, and object C is 140 K, then the interval between value A and value B will be twice as great as the interval between value B and value C (i.e., equal differences in value correspond to equal differences in temperature). • Ratio properties maintain the correspondence between the ratios of the measured values and the ratios of the actual properties being measured. • e.g., if object A is twice as hot as object B, then value A will be twice as high as value B. [Figure: hierarchy of scales: Absolute, Ratio, Interval, Ordinal, Nominal.]
23. 23. VALIDITY | MEASUREMENT THEORY | PROPERTIES • 560 K = 286.85°C = 548.33°F • 280 K = 6.85°C = 44.33°F • Ratio scales maintain equality, ordinality, interval, and ratio properties. • Interval scales maintain equality, ordinality, and interval properties. • Ordinal scales maintain equality and ordinality properties. • Nominal scales maintain only equality properties. • Which properties does each measurement scale (i.e., Kelvin, Celsius, and Fahrenheit) possess?
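
One way to explore the question is to convert the example temperatures and compare ratios of values with ratios of intervals. The sketch below uses the standard conversion formulas; the third temperature (140 K) is taken from the interval example on the previous slide, and the function names are chosen here for illustration.

```python
def kelvin_to_celsius(k):
    return k - 273.15

def kelvin_to_fahrenheit(k):
    return (k - 273.15) * 9 / 5 + 32

scales = {
    "Kelvin": lambda k: k,
    "Celsius": kelvin_to_celsius,
    "Fahrenheit": kelvin_to_fahrenheit,
}

a_k, b_k, c_k = 560.0, 280.0, 140.0   # object A is twice as hot as object B
value_ratio = {}
interval_ratio = {}
for name, convert in scales.items():
    a, b, c = convert(a_k), convert(b_k), convert(c_k)
    value_ratio[name] = a / b                  # does "twice as hot" survive?
    interval_ratio[name] = (a - b) / (b - c)   # does "twice the gap" survive?
    print(f"{name}: A/B = {value_ratio[name]:.2f}, "
          f"(A-B)/(B-C) = {interval_ratio[name]:.2f}")
```

The ratio of intervals is 2 on all three scales, but the ratio of values is 2 only on the Kelvin scale, whose zero point is not arbitrary.
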
24. 24. VALIDITY | CONSTRUCT VALIDITY • Construct validity refers to the validity of a variable being measured. • Validity for constructs can be defined based on • One’s subjective evaluation of whether a measure matches the construct it is meant to measure (i.e. Translational validity). • How well the measure relates to other measures and characteristics (i.e. Criterion validity).
25. 25. VALIDITY | CONSTRUCT | TRANSLATIONAL • FACE VALIDITY: On “face value,” does the measure describe the construct well? • CONTENT VALIDITY: Does the measure represent all the facets of a given construct?
26. 26. ASSESSING TRANSLATION VALIDITY • Translational validity is often evaluated by so-called “experts.” However, you can involve participants in this process using participant interviews, focus groups, and even survey questions: • Face Validity: “What does the question below measure?” “Is there a better way to measure _________?” • Content Validity: “What additional questions do you think are needed to capture ___________?” “Is there any element of ___________ missing from the questions above?”
27. 27. WARNING ABOUT FACE VALIDITY • A test can appear to be invalid but actually be perfectly valid. • This may be due to strong correlation between the construct being measured and the items used to measure it. • A test that does not have face validity may confuse participants. • Other researchers may not be willing to use a test if it does not have face validity.
28. 28. WARNING ABOUT CONTENT VALIDITY • In practice, few measures capture every dimension of a construct. Researchers therefore often rely on a limited subset of variables that capture each dimension “well enough.” • Content under-representation occurs when important areas are missed. • Construct-irrelevant variation occurs when irrelevant factors contaminate the test.
29. 29. PREDICTIVE AND CONCURRENT VALIDITY • Concurrent validity is assessed by comparing two related measures completed at the same time. • Predictive validity is assessed by examining the ability of a test to predict some event that occurs in the real world.
30. 30. CONVERGENT AND DISCRIMINANT VALIDITY • Convergent validity occurs where measures of constructs that are expected to relate do so. • Discriminant validity occurs where constructs that are expected not to relate do not, such that it is possible to discriminate between these constructs. • Convergence and discrimination are often demonstrated by correlation of the measures used within constructs.
31. 31. NOMOLOGICAL NETWORK • A set of relationships between constructs and between consequent measures. [Figure: nomological network spanning the Theoretical, Conceptual, and Empirical Domains, with labels Theory, Prediction, Operationalization, Causation, Construct Validity, Correlation, and four items.]
32. 32. COMMON THREATS TO VALIDITY • Inappropriate selection of constructs or measures. • Insufficient data collected to make valid conclusions. • Measurement done in too few contexts. • Measurement done with too few measurement variables. • Too great a variation in data (can't see the wood for the trees). • Inadequate selection of target subjects. • Complex interaction across constructs. • Subjects giving biased answers or trying to guess what they should say. • Experimental method not valid. • Operation of experiment not rigorous.
33. 33. RELIABILITY • If a test is unreliable, then although the results for one use may actually be valid, for another they may be invalid. Reliability is thus a measure of how much you can trust the results of a test.
34. 34. Reliability • Stability: Test-Retest (repetition of an identical measure on a second occasion produces similar results); Parallel Form (asking questions multiple ways in the same survey produces congruent responses). • Consistency: Internal (questions, often part of a scale, all correlate with one another); Inter-rater (similar results are obtained across raters or observers).
35. 35. STABILITY • Stability is a measure of the repeatability of a test over time, that is, whether it gives the same results whenever it is used (within defined constraints, of course).
36. 36. TEST-RETEST RELIABILITY • We estimate test-retest reliability when we administer the same test to the same sample on two different occasions. • This approach assumes that there is no substantial change in the construct being measured between the two occasions. • The amount of time allowed between measures is critical. [Figure: timeline with the test at t1 and the re-test at t2.]
37. 37. LIMITATIONS OF TEST-RETEST • There is an assumption with stability that what is being measured does not change. Variation should be due to the test, not to any other factor. • Several factors may lead to poor measures of stability. • Carry-over effect: people remembering answers from last time. • Practice effect: repeated taking of test improves score (typical with classic IQ tests). • Attrition: People not being present for re-tests.
38. 38. PARALLEL FORM RELIABILITY • In parallel forms reliability you first have to create two parallel forms. • One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. • You administer both instruments to the same sample of people. • The correlation between the two parallel forms is the estimate of reliability. [Figure: Form 1 and Form 2 administered at the same time, t1.]
39. 39. CONSISTENCY • Consistency is a measure of reliability through similarity within the test, with individual questions giving predictable answers every time.
40. 40. INTER-RATER RELIABILITY • There are two major ways to actually estimate inter-rater reliability. • If your measurement consists of categories (the raters are checking off which category each observation falls in), you can calculate the percent of agreement between the raters. • The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. [Figure: Observer 1 and Observer 2 each rating observations 1 to 4.]
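
Both estimates described above can be sketched in a few lines. The ratings below are made up for illustration, and the function names are assumptions rather than part of the slides.

```python
from statistics import mean

def percent_agreement(rater_1, rater_2):
    """Share of observations that both raters put in the same category."""
    matches = sum(a == b for a, b in zip(rater_1, rater_2))
    return matches / len(rater_1)

def pearson_r(x, y):
    """Pearson correlation between two lists of continuous ratings."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical categorical ratings of five observations.
categories_1 = ["A", "B", "A", "C", "B"]
categories_2 = ["A", "B", "C", "C", "B"]
agreement = percent_agreement(categories_1, categories_2)

# Hypothetical continuous ratings of four observations.
scores_1 = [3.0, 4.5, 2.0, 5.0]
scores_2 = [3.5, 4.0, 2.5, 5.0]
correlation = pearson_r(scores_1, scores_2)

print(agreement, round(correlation, 3))
```

Percent agreement does not correct for agreement expected by chance, so it tends to look optimistic when there are few categories.
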
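41. 41. INTERNAL CONSISTENCY • The average inter-item correlation tells you the average of all correlations. Average = 0.89

    |        | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 |
    |--------|--------|--------|--------|--------|--------|--------|--------|
    | Item 1 | 1.00   |        |        |        |        |        |        |
    | Item 2 | .89    | 1.00   |        |        |        |        |        |
    | Item 3 | .91    | .84    | 1.00   |        |        |        |        |
    | Item 4 | .88    | .80    | .87    | 1.00   |        |        |        |
    | Item 5 | .84    | .81    | .90    | .94    | 1.00   |        |        |
    | Item 6 | .83    | .86    | .99    | .93    | .87    | 1.00   |        |
    | Item 7 | .92    | .80    | .95    | .99    | .99    | .98    | 1.00   |

42. 42. INTERNAL CONSISTENCY • The average item-total correlation tells you the correlation between each item and the overall scale score. Average = .88

    |        | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 |
    |--------|--------|--------|--------|--------|--------|--------|--------|
    | Item 1 | 1.00   |        |        |        |        |        |        |
    | Item 2 | .89    | 1.00   |        |        |        |        |        |
    | Item 3 | .91    | .84    | 1.00   |        |        |        |        |
    | Item 4 | .88    | .80    | .87    | 1.00   |        |        |        |
    | Item 5 | .84    | .81    | .90    | .94    | 1.00   |        |        |
    | Item 6 | .83    | .86    | .99    | .93    | .87    | 1.00   |        |
    | Item 7 | .92    | .80    | .95    | .99    | .99    | .98    | 1.00   |
    | Total  | .89    | .79    | .90    | .91    | .95    | .80    | .89    |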
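
The two averages quoted on these slides can be reproduced directly from the tabulated correlations. The sketch below copies the off-diagonal values of the matrix and the "Total" row; the variable names are chosen here for illustration.

```python
from statistics import mean

# Lower triangle of the item correlation matrix (diagonal 1.00s excluded),
# one inner list per row, Item 2 through Item 7.
inter_item = [
    [.89],
    [.91, .84],
    [.88, .80, .87],
    [.84, .81, .90, .94],
    [.83, .86, .99, .93, .87],
    [.92, .80, .95, .99, .99, .98],
]
# "Total" row: correlation of each item with the overall scale score.
item_total = [.89, .79, .90, .91, .95, .80, .89]

avg_inter_item = mean(r for row in inter_item for r in row)  # 21 pairs
avg_item_total = mean(item_total)                            # 7 items

print(round(avg_inter_item, 2), round(avg_item_total, 2))
```
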
43. 43. INTERNAL CONSISTENCY • In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate is simply the correlation between these two total scores. [Figure: items randomly split into two halves, with the correlation taken between the two half totals.]
44. 44. INTERNAL CONSISTENCY • Cronbach’s α is a lower bound estimate for the reliability of a scale calculated as a function of • The number of items in a test, • The average covariance between item pairs, and • The variance of the total score. • Generally, alpha increases as the intercorrelations between items increase, and thus it is used to measure internal consistency. • It is mathematically equivalent to the average of all possible split-half estimates. • Cronbach’s α: ≥ 0.9 Excellent; ≥ 0.8 Good; ≥ 0.7 Acceptable; ≥ 0.6 Questionable; ≥ 0.5 Poor; < 0.5 Unacceptable.
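
A minimal sketch of Cronbach's α, assuming the usual variance form α = k / (k − 1) × (1 − Σ item variances / variance of total score), which is algebraically tied to the three ingredients listed above. The response matrix is made up for illustration.

```python
from statistics import variance

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items for all)."""
    k = len(rows[0])                          # number of items
    items = list(zip(*rows))                  # transpose to per-item columns
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])  # variance of total score
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical responses: 4 respondents x 3 items.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 5, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

For these made-up responses α comes out at about 0.96, which would be labeled 'Excellent' under the cut-offs listed above. With only three items and four respondents the estimate is purely illustrative.
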
45. 45. INTERNAL CONSISTENCY • Kuder-Richardson is a measure of internal consistency reliability for measures with dichotomous choices. • It is a special case of Cronbach’s α. • Values can range from 0.00 to 1.00 (sometimes expressed as 0 to 100), with high values indicating that the examination is likely to correlate with alternate forms (a desirable characteristic). • Internal consistency estimates may be affected by difficulty of the test, the spread in scores, and the length of the examination.
46. 46. INTRA-CLASS CORRELATIONS • The intraclass correlation coefficient (ICC) is a widely used reliability index in test-retest, intrarater, and interrater reliability analyses. • There are 10 forms of ICC. Each form involves distinct assumptions in its calculation and leads to different interpretations. • Based on the 95% confidence interval of the ICC estimate: • <0.5 = poor • 0.5 to 0.75 = moderate • 0.75 to 0.9 = good • > 0.90 = excellent
47. 47. VALIDITY AND RELIABILITY

    |              | Valid | Not Valid |
    |--------------|-------|-----------|
    | Reliable     | You are measuring what you think you are, using a measure that will produce stable and consistent results. | You have a reliable measure of something, just not what you think it is. |
    | Not Reliable | The average measurement is right on, but each individual measurement has error and is unusable by itself. | If you are measuring something, it's not what you want, and it's not a reliable way of measuring whatever it is measuring. |
48. 48. “After a century of theory and research on psychological test scores, for most test scores we still have no idea whether they really measure something, or are no more than relatively arbitrary summations of item responses” --- Borsboom, 2005
49. 49. REVIEW ACTIVITY • Search out the diagnostic criteria for depression from either: • The Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5) • The International Classification of Diseases, 10th Revision (ICD-10) • Now, based on these construct definitions, take a look at one of the following scales and discuss what changes you might make to improve its validity or reliability. • Hospital Anxiety and Depression Scale (HADS) • Center for Epidemiologic Studies Depression Scale (CES-D) • Beck Depression Inventory (BDI) • The Major Depression Inventory • Hamilton Depression Rating Scale (HDRS) • Montgomery-Åsberg Depression Rating Scale • Patient Health Questionnaire (PHQ) • Primary Care Evaluation of Mental Disorders (PRIME-MD)