# Chapter 3

### Chapter 3

1. 1. Test Worthiness Chapter 3
2. 2. Test Worthiness <ul><li>Four cornerstones to test worthiness: </li></ul><ul><ul><li>Validity </li></ul></ul><ul><ul><li>Reliability </li></ul></ul><ul><ul><li>Cross-cultural Fairness </li></ul></ul><ul><ul><li>Practicality </li></ul></ul><ul><li>But first, we must learn one statistical concept: </li></ul><ul><ul><li>Correlation Coefficient </li></ul></ul>
3. 3. Correlation Coefficient <ul><li>Correlation: Statistical expression of the relationship between two sets of scores (or variables) </li></ul><ul><ul><li>Positive correlation: Increase in one variable accompanied by increase in other </li></ul></ul><ul><ul><ul><li>“Direct” relationship </li></ul></ul></ul><ul><ul><li>Negative correlation: Increase in one variable accompanied by decrease in other </li></ul></ul><ul><ul><ul><li>“Inverse” relationship </li></ul></ul></ul>
4. 4. Correlation Coefficient (Cont’d) <ul><li>What is the relationship between? </li></ul><ul><ul><li>Gasoline prices and grocery prices? </li></ul></ul><ul><ul><li>Grocery prices and good weather? </li></ul></ul><ul><ul><li>Stress and depression? </li></ul></ul><ul><ul><li>Depression and job productivity? </li></ul></ul><ul><ul><li>Partying and grades? </li></ul></ul><ul><ul><li>Study time and grades? </li></ul></ul>
5. 5. Correlation Coefficient (Cont’d) <ul><li>Correlation coefficient ( r ) </li></ul><ul><ul><li>A number between -1 and +1 that indicates direction and strength of the relationship </li></ul></ul><ul><ul><ul><li>As “r” approaches +1, strength increases in a direct and positive way </li></ul></ul></ul><ul><ul><ul><li>As “r” approaches -1, strength increases in an inverse and negative way </li></ul></ul></ul><ul><ul><ul><li>As “r” approaches 0, the relationship is weak or nonexistent (at zero) </li></ul></ul></ul>
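
The description of r above can be made concrete with a short calculation. The following is a minimal sketch of the Pearson correlation coefficient in plain Python; the function name `pearson_r` and the study, grade, and partying numbers are hypothetical and are not taken from the slides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each score from its own mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable has no variance")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for five students (illustration only)
study_hours = [1, 2, 3, 4, 5]
party_hours = [10, 8, 7, 4, 1]
grades = [55, 60, 70, 75, 90]

print(pearson_r(study_hours, grades))  # positive: a direct relationship
print(pearson_r(party_hours, grades))  # negative: an inverse relationship
```

With these made-up numbers, study time and grades rise together, so r is close to +1, while partying and grades move in opposite directions, so r is close to -1.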
6. 6. Correlation (cont’d) <ul><li>Correlation coefficient “r” </li></ul><ul><ul><li>Direct: 0 to +.3 = weak; +.4 to +.6 = medium; +.7 to +1.0 = strong </li></ul></ul><ul><ul><li>Inverse: 0 to -.3 = weak; -.4 to -.6 = medium; -.7 to -1.0 = strong </li></ul></ul>[Figure: number line from -1 (strong inverse) through 0 (weak) to +1 (strong direct)]
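
The strength bands above translate into a small lookup. This sketch is one possible reading: the slide leaves gaps between bands (for example between .3 and .4), so the hypothetical helper `describe_r` rounds the magnitude to one decimal place first. That rounding rule is an assumption, not something the slide states.

```python
def describe_r(r):
    """Return (direction, strength) for a correlation coefficient.

    Bands follow the slide: up to .3 weak, .4 to .6 medium, .7 and above strong.
    Rounding |r| to one decimal to close the gaps is an assumption.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    magnitude = round(abs(r), 1)
    if magnitude <= 0.3:
        strength = "weak"
    elif magnitude <= 0.6:
        strength = "medium"
    else:
        strength = "strong"
    if r > 0:
        direction = "direct"
    elif r < 0:
        direction = "inverse"
    else:
        direction = "none"
    return direction, strength


print(describe_r(0.2))
print(describe_r(-0.5))
print(describe_r(0.85))
```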
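7. 7. Correlation Examples <table><caption>SAT score and college GPA (r = .35)</caption><tr><th>SAT score</th><th>Coll. GPA</th></tr><tr><td>930</td><td>3.0</td></tr><tr><td>750</td><td>2.9</td></tr><tr><td>1110</td><td>3.8</td></tr><tr><td>625</td><td>2.1</td></tr><tr><td>885</td><td>3.3</td></tr><tr><td>950</td><td>2.6</td></tr><tr><td>605</td><td>2.8</td></tr><tr><td>810</td><td>3.2</td></tr><tr><td>1045</td><td>3.0</td></tr><tr><td>910</td><td>3.5</td></tr></table><table><caption>Missed classes and college GPA (r = -.67)</caption><tr><th>Missed Classes</th><th>Coll. GPA</th></tr><tr><td>3</td><td>3.0</td></tr><tr><td>5</td><td>2.9</td></tr><tr><td>2</td><td>3.8</td></tr><tr><td>8</td><td>2.1</td></tr><tr><td>1</td><td>3.3</td></tr><tr><td>6</td><td>2.6</td></tr><tr><td>3</td><td>2.8</td></tr><tr><td>1</td><td>3.2</td></tr><tr><td>3</td><td>3.0</td></tr><tr><td>0</td><td>3.5</td></tr></table>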
8. 8. Correlation Scatterplots <ul><li>Plot the two sets of scores from the previous examples on a graph: </li></ul><ul><ul><li>Place person A’s SAT score on the x-axis, and his/her GPA on the y-axis </li></ul></ul><ul><ul><li>Continue this for persons B, C, D, etc. </li></ul></ul><ul><li>This process forms a “Scatterplot” </li></ul>
9. 9. Examples of Scatterplots
10. 10. Scatterplots (cont’d) <ul><li>What correlation ( r ) do you think this graph has? </li></ul><ul><li>How about this correlation? </li></ul>
11. 11. More Scatterplots <ul><li>What might this correlation be? </li></ul><ul><li>This correlation? </li></ul>
12. 12. More Scatterplots <ul><li>This correlation? </li></ul><ul><li>Last one </li></ul>
13. 13. Coefficient of Determination (Shared Variance) <ul><li>The square of the correlation ( r = .80, r² = .64) </li></ul><ul><li>A statement about factors that underlie the variables that account for their relationship. </li></ul><ul><li>Correlation between depression and anxiety = .85. </li></ul><ul><ul><li>Shared variance = .72. </li></ul></ul><ul><ul><li>What factors might underlie both depression and anxiety? </li></ul></ul>[Figure: Depression and Anxiety, with their shared trait variance]
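
The arithmetic on this slide is a single squaring step. As a quick check, this sketch (the function names are hypothetical) reproduces both slide examples and also reports the proportion of variance that is not shared.

```python
def shared_variance(r):
    """Coefficient of determination: the square of the correlation."""
    return r ** 2


def unshared_variance(r):
    """Proportion of variance the two measures do not have in common."""
    return 1 - r ** 2


print(round(shared_variance(0.80), 2))  # slide example: r = .80
print(round(shared_variance(0.85), 2))  # depression and anxiety: r = .85
```

Rounded to two decimals, r = .80 gives .64 and r = .85 gives .72, matching the figures on the slide.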
14. 14. Validity <ul><li>What is validity? </li></ul><ul><ul><li>The degree to which all accumulated evidence supports the intended interpretation of test scores for the intended purpose </li></ul></ul><ul><ul><li>Lay definition: Does a test measure what it is supposed to measure? </li></ul></ul><ul><ul><li>It is a unitary concept; however, there are 3 general types of validity evidence </li></ul></ul><ul><ul><ul><li>Content Validity </li></ul></ul></ul><ul><ul><ul><li>Criterion-Related Validity </li></ul></ul></ul><ul><ul><ul><li>Construct Validity </li></ul></ul></ul>
15. 15. Content Validity <ul><li>Is the content of the test valid for the kind of test it is? </li></ul><ul><ul><li>Developers must show evidence that the domain was systematically analyzed and concepts are covered in correct proportion </li></ul></ul><ul><ul><li>Four-step process: </li></ul></ul><ul><ul><ul><li>Step 1 - Survey the domain </li></ul></ul></ul><ul><ul><ul><li>Step 2 - Content of the test matches the above domain </li></ul></ul></ul><ul><ul><ul><li>Step 3 - Specific test items match the content </li></ul></ul></ul><ul><ul><ul><li>Step 4 - Analyze relative importance of each objective (weight) </li></ul></ul></ul>
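16. 16. Content Validity (cont’d) <ul><li>Step 1: Survey the domain </li></ul><ul><li>Step 2: Content matches domain </li></ul><ul><li>Step 3: Test items reflect content </li></ul><ul><li>Step 4: Adjusted for relative importance </li></ul><ul><ul><li>Item 1 x 3 </li></ul></ul><ul><ul><li>Item 2 x 2 </li></ul></ul><ul><ul><li>Item 3 x 1 </li></ul></ul><ul><ul><li>Item 4 x 2 </li></ul></ul><ul><ul><li>Item 5 x 2.5 </li></ul></ul>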
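
Step 4 weights each item by its relative importance. One way to read the multipliers on this slide (3, 2, 1, 2, 2.5) is as weights applied to item scores before they are summed. The sketch below assumes that reading; the helper names and the example item scores are hypothetical.

```python
# Item weights as listed on the slide
ITEM_WEIGHTS = {"Item 1": 3, "Item 2": 2, "Item 3": 1, "Item 4": 2, "Item 5": 2.5}


def weighted_total(item_scores, weights=ITEM_WEIGHTS):
    """Sum of item scores, each multiplied by its importance weight."""
    missing = set(weights) - set(item_scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(item_scores[name] * weight for name, weight in weights.items())


def weight_shares(weights=ITEM_WEIGHTS):
    """Proportion of the total weight carried by each item."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical examinee who answers every item correctly (1 point each)
all_correct = {name: 1 for name in ITEM_WEIGHTS}
print(weighted_total(all_correct))
print(weight_shares())
```

The shares make the "correct proportion" idea explicit: under this reading, Item 1 carries about 29% of the total weight and Item 3 under 10%.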
17. 17. Content Validity (cont’d) <ul><li>Face Validity </li></ul><ul><ul><li>Not a real type of content validity </li></ul></ul><ul><ul><li>A quick look at “face” value of questions </li></ul></ul><ul><ul><li>Sometimes, questions may not “seem” to measure the content, but do (e.g., panic disorder example in book, p. 48) </li></ul></ul><ul><li>How might you show content validity for an instrument that measures depression? </li></ul>
18. 18. Criterion-Related Validity: Concurrent and Predictive Validity <ul><li>Criterion-Related Validity </li></ul><ul><ul><li>The relationship between the test and a criterion the test should be related to </li></ul></ul><ul><li>Two types: </li></ul><ul><ul><li>Concurrent Validity – Does the instrument relate to another criterion “now” (in the present)? </li></ul></ul><ul><ul><li>Predictive Validity – Does the instrument relate to another criterion in the future? </li></ul></ul>
19. 19. Criterion-Related Validity: Concurrent Validity <ul><li>Example 1 </li></ul><ul><ul><li>100 clients take the BDI </li></ul></ul><ul><ul><li>Correlate their scores with clinicians’ ratings of depression of the same group of clients. </li></ul></ul><ul><li>Example 2 </li></ul><ul><ul><li>500 people take test of alcoholism tendency </li></ul></ul><ul><ul><li>Correlate their scores with how significant others rate the amount of alcohol they drink. </li></ul></ul>
20. 20. Criterion-Related Validity: Predictive Validity <ul><li>Examples: </li></ul><ul><ul><li>SAT scores correlated with how well students do in college. </li></ul></ul><ul><ul><li>ASVAB scores correlated with success at jobs. </li></ul></ul><ul><ul><li>GREs correlated with success in graduate school. (See Table 3.1, p. 49) </li></ul></ul><ul><ul><li>Do Exercise 3.2, p. 50 </li></ul></ul>
21. 21. Concepts Related to Predictive Validity <ul><li>Standard Error of the Estimate </li></ul><ul><ul><li>Using a known value of one variable to predict a potential range of scores on a second variable (e.g., GRE score → GPA range) </li></ul></ul><ul><li>False Positive </li></ul><ul><ul><li>Instrument predicts an attribute that does not exist </li></ul></ul><ul><li>False Negative </li></ul><ul><ul><li>Instrument forecasts no attribute but in fact it exists </li></ul></ul>
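
The three concepts above can be sketched numerically. The slide gives no formula, so this example assumes the standard textbook expression for the standard error of the estimate, SD of the criterion times the square root of (1 - r²). All function names and numbers are hypothetical.

```python
from math import sqrt


def standard_error_of_estimate(sd_criterion, r):
    """Standard error of the estimate: SD_y * sqrt(1 - r^2) (assumed formula)."""
    return sd_criterion * sqrt(1 - r ** 2)


def prediction_band(predicted, see, n_se=1.0):
    """Range of likely criterion scores around a predicted value."""
    return predicted - n_se * see, predicted + n_se * see


def count_errors(predicted_flags, actual_flags):
    """Return (false_positives, false_negatives) for paired True/False lists."""
    pairs = list(zip(predicted_flags, actual_flags))
    false_pos = sum(1 for pred, actual in pairs if pred and not actual)
    false_neg = sum(1 for pred, actual in pairs if not pred and actual)
    return false_pos, false_neg


# Hypothetical: GPA has SD = 0.5, and the test correlates .60 with GPA
see = standard_error_of_estimate(0.5, 0.6)
print(see)
print(prediction_band(3.2, see))

# Hypothetical screening results: what the test predicted vs. what was true
print(count_errors([True, True, False, False], [True, False, True, False]))
```

A higher correlation between test and criterion shrinks the band, which is why predictive validity coefficients matter when a score is used to forecast a range.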
22. 22. Construct Validity <ul><li>Construct Validity </li></ul><ul><ul><li>Extent to which the instrument measures a theoretical or hypothetical trait </li></ul></ul><ul><ul><li>Many counseling and psychological constructs are complex, ambiguous and not easily agreed upon: </li></ul></ul><ul><ul><ul><li>Intelligence </li></ul></ul></ul><ul><ul><ul><li>Self-esteem </li></ul></ul></ul><ul><ul><ul><li>Empathy </li></ul></ul></ul><ul><ul><ul><li>Other personality characteristics </li></ul></ul></ul>
23. 23. Construct Validity (cont’d) <ul><li>Four methods of gathering evidence for construct validity: </li></ul><ul><ul><li>experimental design </li></ul></ul><ul><ul><li>factor analysis </li></ul></ul><ul><ul><li>convergence with other instruments </li></ul></ul><ul><ul><li>discrimination with other measures. </li></ul></ul>
24. 24. Construct Validity: Experimental Design <ul><ul><li>Creating hypotheses and research studies that show the instrument captures the correct concept </li></ul></ul><ul><ul><ul><li>Example: </li></ul></ul></ul><ul><ul><ul><ul><li>Hypothesis: The “Blank” depression test will discriminate between clinically depressed clients and “normals.” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Method: </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Identify 100 clinically depressed clients </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Identify 100 “normals” </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Show statistical analysis </li></ul></ul></ul></ul></ul><ul><ul><ul><li>Second Example: Clinicians measure their depressed clients before, then after, 6 months of treatment </li></ul></ul></ul>
25. 25. Construct Validity: Factor Analysis <ul><li>Factor analysis: </li></ul><ul><ul><li>Statistical relationship between subscales of test </li></ul></ul><ul><ul><li>How similar or different are the sub-scales? </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Develop a depression test that has three subscales: self-esteem, suicidal ideation, hopelessness. </li></ul></ul></ul><ul><ul><ul><li>Correlate the subscales: </li></ul></ul></ul><ul><ul><ul><ul><li>Self-esteem and suicidal ideation: .35 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Self-esteem and hopelessness: .25 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Hopelessness and suicidal ideation: .82 </li></ul></ul></ul></ul><ul><ul><ul><li>What implications might the above scores have for this test? </li></ul></ul></ul>
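
One way to approach the question on this slide is to look for subscale pairs that overlap heavily. The sketch below is not a factor analysis. It only screens the three correlations given on the slide against a cutoff, and the .70 cutoff (borrowed from the "strong" band earlier in the chapter) and the function name are assumptions.

```python
# Subscale correlations as given on the slide
SUBSCALE_R = {
    ("self-esteem", "suicidal ideation"): 0.35,
    ("self-esteem", "hopelessness"): 0.25,
    ("hopelessness", "suicidal ideation"): 0.82,
}


def overlapping_pairs(correlations, threshold=0.70):
    """Return (pair, shared variance) for pairs whose |r| meets the threshold."""
    return [
        (pair, r ** 2)
        for pair, r in correlations.items()
        if abs(r) >= threshold
    ]


for pair, shared in overlapping_pairs(SUBSCALE_R):
    print(pair, round(shared, 2))
```

Under this cutoff only hopelessness and suicidal ideation are flagged, sharing roughly two thirds of their variance. That could suggest the two subscales tap much of the same factor, while self-esteem looks comparatively distinct.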
26. 26. Construct Validity: Convergent Validity <ul><li>Convergence Evidence – </li></ul><ul><ul><li>Comparing test scores to other, well-established tests </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Correlate new depression test against the BDI. </li></ul></ul></ul><ul><ul><ul><li>Is there a good correlation between the two? </li></ul></ul></ul><ul><ul><ul><li>Implications if correlation is extremely high? (e.g., .96) </li></ul></ul></ul><ul><ul><ul><li>Implications if correlation is extremely low? (e.g., .21) </li></ul></ul></ul>
27. 27. Construct Validity: Discriminant Validity <ul><li>Discriminant Evidence – </li></ul><ul><ul><ul><li>Correlate test scores with other tests that are different </li></ul></ul></ul><ul><ul><ul><li>Hope to find a meager correlation </li></ul></ul></ul><ul><ul><ul><li>Example: </li></ul></ul></ul><ul><ul><ul><ul><li>Compare new depression test with an anxiety test. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Implications if correlation is extremely high? (e.g., .96) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Implications if correlation is extremely low? (e.g., .21) </li></ul></ul></ul></ul>
28. 28. Validity Recap <ul><li>Three types of validity </li></ul><ul><ul><li>Content </li></ul></ul><ul><ul><li>Criterion </li></ul></ul><ul><ul><ul><li>Concurrent </li></ul></ul></ul><ul><ul><ul><li>Predictive </li></ul></ul></ul><ul><ul><li>Construct validity </li></ul></ul><ul><ul><ul><li>Experimental </li></ul></ul></ul><ul><ul><ul><li>Factor Analysis </li></ul></ul></ul><ul><ul><ul><li>Convergent </li></ul></ul></ul><ul><ul><ul><li>Discriminant </li></ul></ul></ul>
29. 29. Reliability <ul><li>Accuracy or consistency of test scores. </li></ul><ul><li>Would you score the same if you took the test over, and over, and over again? </li></ul><ul><li>Reported as a reliability (correlation) coefficient. </li></ul><ul><li>The closer to r = 1.0, the less error in the test. </li></ul>
30. 30. Three Ways of Determining Reliability <ul><li>Test-Retest </li></ul><ul><li>Alternate, Parallel, or Equivalent Forms </li></ul><ul><li>Internal Consistency </li></ul><ul><ul><li>a. Coefficient Alpha </li></ul></ul><ul><ul><li>b. Kuder-Richardson </li></ul></ul><ul><ul><li>c. Split-half or Odd Even </li></ul></ul>
31. 31. Test-Retest Reliability <ul><li>Give the test twice to the same group of people. </li></ul><ul><ul><li>E.g., take the first test in this class and, very soon after, take it again. Are the scores about the same? </li></ul></ul><table><tr><th></th><th>Person 1</th><th>Person 2</th><th>Person 3</th><th>Person 4</th><th>Person 5</th><th>Others…</th></tr><tr><th>1st test</th><td>35</td><td>42</td><td>43</td><td>34</td><td>38</td><td></td></tr><tr><th>2nd test</th><td>36</td><td>44</td><td>41</td><td>34</td><td>37</td><td></td></tr></table><ul><ul><li>Graphic: test A given twice, separated by time </li></ul></ul><ul><ul><li>Problem: A person can look up answers between the first and second testing </li></ul></ul>
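
Test-retest reliability is the correlation between the two administrations. As a minimal sketch, the five pairs of scores shown on this slide can be correlated directly; the function name `pearson_r` is hypothetical, and five people is far too few for a real reliability study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


# Scores for persons 1 to 5 from the slide
first_test = [35, 42, 43, 34, 38]
second_test = [36, 44, 41, 34, 37]

test_retest_reliability = pearson_r(first_test, second_test)
print(round(test_retest_reliability, 2))
```

For these five people the coefficient comes out at about .92, so the rank order of scores is largely preserved across the two sittings.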
32. 32. Alternate, Parallel, or Equivalent Forms Reliability <ul><li>Have two forms of the same test </li></ul><ul><li>Give students the two forms at the same time </li></ul><ul><li>Correlate scores on first form with scores on second form. </li></ul><ul><li>Graphic: form A and form B </li></ul><ul><li>Problem: Are two “equivalent” forms ever really equivalent? </li></ul>
33. 33. Internal Consistency Reliability <ul><li>How do individual items relate to each other and the test as a whole? </li></ul><ul><li>Internal consistency reliability looks “within” the test rather than using multiple administrations </li></ul><ul><li>High-speed computers and the need for only one test administration have made internal consistency popular </li></ul><ul><li>Three types: </li></ul><ul><ul><li>Split-Half or Odd-Even </li></ul></ul><ul><ul><li>Cronbach’s Coefficient Alpha </li></ul></ul><ul><ul><li>Kuder-Richardson </li></ul></ul>
34. 34. Split-half or Odd-even Reliability <ul><li>Correlate one half of test with other half for all who took the test </li></ul><ul><li>Example: </li></ul><ul><ul><li>Person 1 scores 16 on first half of test and 16 on second half </li></ul></ul><ul><ul><li>Person 2 scores 14 on first half and 18 on second half </li></ul></ul><ul><ul><li>Also get scores for persons 3, 4, 5, etc. </li></ul></ul><ul><ul><li>Correlate all persons’ scores on first half with their scores on second half </li></ul></ul><ul><ul><li>The correlation = the reliability estimate </li></ul></ul><ul><li>Use the Spearman-Brown formula to correct for the shortened length of each half </li></ul>
35. 35. Split-half or Odd-even Reliability Internal Consistency <ul><li>Example Continued: </li></ul><table><tr><th>Person</th><th>Score on 1st half</th><th>Score on 2nd half</th></tr><tr><td>1</td><td>16</td><td>16</td></tr><tr><td>2</td><td>14</td><td>18</td></tr><tr><td>3</td><td>12</td><td>20</td></tr><tr><td>4</td><td>15</td><td>17</td></tr><tr><td colspan="3">And so forth…</td></tr></table><ul><ul><li>Problem: Are any two halves really equivalent? </li></ul></ul><ul><ul><li>Graphic: a single test form (A) </li></ul></ul>
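
The split-half procedure can be sketched end to end: split the items into odd and even halves, correlate the half scores, then step the result up with the Spearman-Brown formula, 2r / (1 + r). The response matrix below is hypothetical (five people, six items scored 0 or 1), and the function names are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


def odd_even_scores(responses):
    """Split each person's item scores into odd-item and even-item totals."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return odd, even


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


# Hypothetical responses: one row per person, one column per item
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
]

odd, even = odd_even_scores(responses)
r_half = pearson_r(odd, even)
print(round(r_half, 2), round(spearman_brown(r_half), 2))
```

The stepped-up estimate is always larger than a positive half-test correlation, reflecting that each half is only half as long as the full test.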
36. 36. Cronbach’s Alpha and Kuder-Richardson Internal Consistency <ul><li>Other types of Internal Consistency: </li></ul><ul><ul><li>Average correlation of all of the possible split-half reliabilities </li></ul></ul><ul><ul><li>Two popular types: </li></ul></ul><ul><ul><ul><li>Cronbach’s Alpha </li></ul></ul></ul><ul><ul><ul><li>Kuder-Richardson (KR-20, KR-21) </li></ul></ul></ul><ul><ul><li>Graphic: a single test form (A) </li></ul></ul>
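
In practice alpha is usually computed from item variances rather than by averaging split halves. The sketch below uses the common variance form, alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), which the slides do not spell out. For items scored 0 or 1 this form coincides with KR-20 when the same variance definition is used throughout. The response data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha from a persons-by-items matrix of item scores."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across persons (population variance throughout)
    item_variances = [
        pvariance([row[i] for row in responses]) for i in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: one row per person, one column per item (0 or 1)
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Because it uses every item at once, this estimate does not depend on which particular split of the test is chosen.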
37. 37. Item Response Theory: Another Way of Looking at Reliability <ul><li>Extension of classical test theory which looks at the amount of error in the total test </li></ul><ul><li>IRT looks at the probability that individuals will answer each item correctly (or match the quality being assessed) </li></ul><ul><li>Or, each item is being assessed for its ability to measure the trait being examined </li></ul>
38. 38. Item Response Theory: Another Way of Looking at Reliability <ul><li>Figure 3.6 shows the S curve (p. 57): </li></ul><ul><ul><li>Individuals with lower ability have a lower probability of getting certain items correct. </li></ul></ul><ul><ul><li>Individuals with higher ability have a higher probability of getting more items correct. </li></ul></ul><ul><ul><li>Each item is examined for its ability to discriminate based on the trait being measured (in Figure 3.6, p. 57, discrimination is based on ability) </li></ul></ul><ul><ul><li>The better a test can discriminate, the more reliable it is. </li></ul></ul>
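
The S curve described above is commonly modeled with a logistic function. The slides do not name a specific model, so this sketch assumes a two-parameter logistic item characteristic curve. The parameter names `difficulty` and `discrimination` and all values are illustrative.

```python
from math import exp


def prob_correct(theta, difficulty=0.0, discrimination=1.0):
    """Probability of a correct response under an assumed 2PL logistic model.

    theta: the person's ability (trait level)
    difficulty: ability level at which the probability is 0.5
    discrimination: steepness of the S curve around the difficulty point
    """
    return 1.0 / (1.0 + exp(-discrimination * (theta - difficulty)))


# Trace the S curve for one hypothetical item across a range of abilities
for theta in [-3, -2, -1, 0, 1, 2, 3]:
    print(theta, round(prob_correct(theta), 2))
```

A larger `discrimination` value makes the curve steeper near the difficulty point, so the item separates people just below that ability level from people just above it more sharply.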
39. 39. Cross-Cultural Fairness <ul><li>Bias in testing did not get much attention until the civil rights movement of the 1960s. </li></ul><ul><li>A series of court decisions established that it was unfair to use tests to track students in schools. </li></ul><ul><ul><li>Black and Hispanic students were being unfairly compared to whites, who were not their norm group. </li></ul></ul><ul><li>Griggs v. Duke Power Company </li></ul><ul><ul><li>Tests for hiring and advancement must show ability to predict job performance. </li></ul></ul><ul><ul><li>Example: Can’t give a test to measure intelligence for those who want to get a job as a road worker. </li></ul></ul>
40. 40. Cross-Cultural Fairness <ul><li>Americans with Disabilities Act: </li></ul><ul><ul><li>Accommodations for individuals taking tests for employment must be made </li></ul></ul><ul><ul><li>Tests must be shown to be relevant to the job in question. </li></ul></ul><ul><li>Family Educational Rights and Privacy Act (FERPA) </li></ul><ul><ul><li>Right to access school records, including test records. </li></ul></ul><ul><ul><li>Parents have the right to access their child’s records </li></ul></ul>
41. 41. Cross-Cultural Fairness <ul><li>Carl Perkins Act: </li></ul><ul><ul><li>Individuals with a disability have the right to vocational assessment, counseling and placement. </li></ul></ul><ul><li>Civil Rights Acts: </li></ul><ul><ul><li>Series of laws concerned with tests used in employment and promotion. </li></ul></ul><ul><li>Freedom of Information Act: </li></ul><ul><ul><li>Assures access to federal records, including test records. </li></ul></ul><ul><ul><li>Most states have expanded this law so that it also applies to state records. </li></ul></ul>
42. 42. Cross-Cultural Fairness <ul><li>IDEIA and PL 94-142: </li></ul><ul><ul><li>Assures rights of students (age 3 – 21) suspected of having a learning disability to be tested at the school’s expense. </li></ul></ul><ul><ul><li>Child Study Teams and IEP set up when necessary </li></ul></ul><ul><li>Section 504 of the Rehabilitation Act: </li></ul><ul><ul><li>Relative to assessment, any instrument used to measure appropriateness for a program or service must measure the individual’s ability, not be a reflection of his or her disability. </li></ul></ul>
43. 43. Cross-Cultural Fairness <ul><li>The Use of Intelligence Tests with Minorities: Confusion and Bedlam </li></ul><ul><ul><li>(See Box 3.1, p. 59) </li></ul></ul>
44. 44. Disparities in Ability <ul><li>Cognitive differences between people exist; however, they are clouded by issues of SES, prejudice, stereotyping, etc. Are there real differences? </li></ul><ul><li>Why do differences exist and what can be done to eliminate these differences? </li></ul><ul><li>Often seen as environmental: No Child Left Behind </li></ul><ul><li>Exercise 3.4, p. 59: Why might there be differences among cultural groups on their ability scores? </li></ul>
45. 45. Practicality <ul><li>Several practical concerns : </li></ul><ul><ul><li>Time </li></ul></ul><ul><ul><li>Cost </li></ul></ul><ul><ul><li>Format (clarity of print, print size, sequencing of questions and types of questions) </li></ul></ul><ul><ul><li>Readability </li></ul></ul><ul><ul><li>Ease of Administration, Scoring, and Interpretation </li></ul></ul>
46. 46. Selecting & Administering Tests <ul><li>Five Steps: </li></ul><ul><ul><li>Determine your client’s goals </li></ul></ul><ul><ul><li>Choose instruments to reach client goals. </li></ul></ul><ul><ul><li>Access information about possible instruments: </li></ul></ul><ul><ul><ul><ul><li>Source books (e.g., Buros Mental Measurements Yearbook and Tests in Print) (see Box 3.2, p. 62) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Publisher resource catalogs </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Journals in the field </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Books on testing </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Experts </li></ul></ul></ul></ul><ul><ul><ul><ul><li>The internet </li></ul></ul></ul></ul><ul><ul><li>Examine validity, reliability, cross-cultural fairness, and practicality. </li></ul></ul><ul><ul><li>Make a wise choice. </li></ul></ul>