# Chapter 3

### Transcript

• 1. Test Worthiness Chapter 3
• 2. Test Worthiness
• Four cornerstones to test worthiness:
• Validity
• Reliability
• Cross-cultural Fairness
• Practicality
• But first, we must learn one statistical concept:
• Correlation Coefficient
• 3. Correlation Coefficient
• Correlation: Statistical expression of the relationship between two sets of scores (or variables)
• Positive correlation: Increase in one variable accompanied by increase in other
• “Direct” relationship
• Negative correlation: Increase in one variable accompanied by decrease in other
• “Inverse” relationship
• 4. Correlation Coefficient (Cont’d)
• What is the relationship between:
• Gasoline prices and grocery prices?
• Grocery prices and good weather?
• Stress and depression?
• Depression and job productivity?
• 5. Correlation Coefficient (Cont’d)
• Correlation coefficient (r)
• A number between -1 and +1 that indicates direction and strength of the relationship
• As “r” approaches +1, strength increases in a direct and positive way
• As “r” approaches -1, strength increases in an inverse and negative way
• As “r” approaches 0, the relationship is weak or nonexistent (at zero)
• 6. Correlation (cont’d)
• Correlation coefficient “r”
• 0 to +.3 = weak; +.4 to +.6 = medium; +.7 to +1.0 = strong
• 0 to -.3 = weak; -.4 to -.6 = medium; -.7 to -1.0 = strong
• Graphic: number line from -1 to +1; strong and inverse toward -1, weak near 0, strong and direct toward +1
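
The bands above can be written as a small lookup. This is only a sketch: the function name `describe_r` is made up for illustration, and because the listed bands leave gaps (for example between .3 and .4), it assumes the size of r is rounded to one decimal before it is labeled.

```python
def describe_r(r: float) -> str:
    """Label a correlation coefficient using the rough bands from the slide."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no relationship"
    # Assumption: round |r| to one decimal so values such as .35 fall into a band
    size = round(abs(r), 1)
    if size <= 0.3:
        strength = "weak"
    elif size <= 0.6:
        strength = "medium"
    else:
        strength = "strong"
    direction = "direct" if r > 0 else "inverse"
    return f"{strength}, {direction}"

for r in (0.2, -0.5, 0.85):
    print(f"r = {r:+.2f}: {describe_r(r)}")
```
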
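• 7. Correlation Examples

SAT score and college GPA (r = .35):

| SAT score | Coll. GPA |
|---|---|
| 930 | 3.0 |
| 750 | 2.9 |
| 1110 | 3.8 |
| 625 | 2.1 |
| 885 | 3.3 |
| 950 | 2.6 |
| 605 | 2.8 |
| 810 | 3.2 |
| 1045 | 3.0 |
| 910 | 3.5 |

Missed classes and college GPA (r = -.67):

| Missed Classes | Coll. GPA |
|---|---|
| 3 | 3.0 |
| 5 | 2.9 |
| 2 | 3.8 |
| 8 | 2.1 |
| 1 | 3.3 |
| 6 | 2.6 |
| 3 | 2.8 |
| 1 | 3.2 |
| 3 | 3.0 |
| 0 | 3.5 |
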
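
For readers who want to see how r is obtained from paired scores, here is a minimal sketch of the Pearson correlation, applied to the ten pairs listed on this slide. The function name `pearson_r` is an illustrative choice, not something from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of each score's deviation from its mean
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

sat = [930, 750, 1110, 625, 885, 950, 605, 810, 1045, 910]
missed = [3, 5, 2, 8, 1, 6, 3, 1, 3, 0]
gpa = [3.0, 2.9, 3.8, 2.1, 3.3, 2.6, 2.8, 3.2, 3.0, 3.5]

r_sat = pearson_r(sat, gpa)
r_missed = pearson_r(missed, gpa)
print(f"SAT vs. GPA: r = {r_sat:.2f}")
print(f"Missed classes vs. GPA: r = {r_missed:.2f}")
```

The signs agree with the slide (positive for SAT, negative for missed classes). The magnitudes computed from only these ten pairs need not equal the reported r = .35 and r = -.67, which may rest on data not shown in the transcript.
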
• 8. Correlation Scatterplots
• Plotting two sets of scores from the previous examples on a graph:
• Place person A’s SAT score on the x-axis, and his/her GPA on the y-axis
• Continue this for persons B, C, D, etc.
• This process forms a “Scatterplot”
• 9. Examples of Scatterplots
• 10. Scatterplots (cont’d)
• What correlation (r) do you think this graph has?
• 11. More Scatterplots
• What might this correlation be?
• This correlation?
• 12. More Scatterplots
• This correlation?
• Last one
• 13. Coefficient of Determination (Shared Variance)
• The square of the correlation (r = .80, r² = .64)
• A statement about factors that underlie the variables that account for their relationship.
• Correlation between depression and anxiety = .85.
• Shared variance = .72.
• What factors might underlie both depression and anxiety?
• Graphic: Depression and Anxiety, with their shared trait variance
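
The arithmetic on this slide is just squaring r. A minimal sketch, using the two correlations quoted above (the function name `shared_variance` is illustrative):

```python
def shared_variance(r: float) -> float:
    """Coefficient of determination: the square of the correlation."""
    return r ** 2

for r in (0.80, 0.85):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.2f}")
```

Note that .85 squared is .7225, which the slide rounds to .72.
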
• 14. Validity
• What is validity?
• The degree to which all accumulated evidence supports the intended interpretation of test scores for the intended purpose
• Lay definition: Does a test measure what it is supposed to measure?
• It is a unitary concept; however, there are 3 general types of validity evidence
• Content Validity
• Criterion-Related Validity
• Construct Validity
• 15. Content Validity
• Is the content of the test valid for the kind of test it is?
• Developers must show evidence that the domain was systematically analyzed and concepts are covered in correct proportion
• Four-step process:
• Step 1 - Survey the domain
• Step 2 - Content of the test matches the above domain
• Step 3 - Specific test items match the content
• Step 4 - Analyze relative importance of each objective (weight)
• 16. Content Validity (cont’d)
• Graphic: Step 1: Survey the domain → Step 2: Content matches domain → Step 3: Test items reflect content → Step 4: Adjusted for relative importance (Item 1 x 3, Item 2 x 2, Item 3 x 1, Item 4 x 2, Item 5 x 2.5)
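
One way to read Step 4 is that each multiplier is a weight reflecting the item's relative importance. Under that assumption (the slide does not spell it out), the sketch below converts the listed weights into each item's share of the total; `weight_shares` is a hypothetical helper name.

```python
# Item weights as listed on the slide ("Item 1 x 3", and so on)
weights = {"Item 1": 3, "Item 2": 2, "Item 3": 1, "Item 4": 2, "Item 5": 2.5}

def weight_shares(weights):
    """Return each item's share of the total weight (shares sum to 1)."""
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}

shares = weight_shares(weights)
for item, share in shares.items():
    print(f"{item}: {share:.1%} of the total weight")
```
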
• 17. Content Validity (cont’d)
• Face Validity
• Not a real type of content validity
• A quick look at “face” value of questions
• Sometimes, questions may not “seem” to measure the content, but do (e.g., the panic disorder example in the book, p. 48)
• How might you show content validity for an instrument that measures depression?
• 18. Criterion-Related Validity: Concurrent and Predictive Validity
• Criterion-Related Validity
• The relationship between the test and a criterion the test should be related to
• Two types:
• Concurrent Validity – Does the instrument relate to another criterion “now” (in the present)?
• Predictive Validity – Does the instrument relate to another criterion in the future?
• 19. Criterion-Related Validity: Concurrent Validity
• Example 1
• 100 clients take the BDI
• Correlate their scores with clinicians’ ratings of depression of the same group of clients.
• Example 2
• 500 people take test of alcoholism tendency
• Correlate their scores with how significant others rate the amount of alcohol they drink.
• 20. Criterion-Related Validity: Predictive Validity
• Examples:
• SAT scores correlated with how well students do in college.
• ASVAB scores correlated with success at jobs.
• GREs correlated with success in graduate school. (See Table 3.1, p. 49)
• Do Exercise 3.2, p. 50
• 21. Concepts Related to Predictive Validity
• Standard Error of the Estimate
• Using a known value of one variable to predict a potential range of scores on a second variable (e.g., a GRE score → a predicted GPA range)
• False Positive
• Instrument predicts an attribute that does not exist
• False Negative
• Instrument forecasts no attribute but in fact it exists
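
The concepts on this slide can be sketched in a few lines. The standard error of the estimate is commonly computed as SD of the criterion times the square root of (1 - r²); the slide does not give the formula, so treat that as background rather than the deck's own statement. All numbers and function names below are hypothetical.

```python
from math import sqrt

def standard_error_of_estimate(sd_criterion: float, r: float) -> float:
    """SE_est = SD_y * sqrt(1 - r^2), with SD_y the criterion's standard deviation."""
    return sd_criterion * sqrt(1 - r ** 2)

def predicted_range(predicted: float, se_est: float, n_se: float = 1.0):
    """Band of +/- n_se standard errors around a predicted criterion score."""
    return predicted - n_se * se_est, predicted + n_se * se_est

def count_errors(predicted_flags, actual_flags):
    """Count false positives and false negatives in paired yes/no outcomes."""
    false_pos = sum(p and not a for p, a in zip(predicted_flags, actual_flags))
    false_neg = sum(a and not p for p, a in zip(predicted_flags, actual_flags))
    return false_pos, false_neg

# Hypothetical values: GPA standard deviation of 0.5 and a test-criterion r of .60
se = standard_error_of_estimate(sd_criterion=0.5, r=0.6)
low, high = predicted_range(predicted=3.2, se_est=se)
print(f"SE of estimate = {se:.2f}; predicted GPA range = {low:.2f} to {high:.2f}")

# Hypothetical predictions (first list) against what was actually found (second list)
fp, fn = count_errors([True, True, False, False], [True, False, False, True])
print(f"false positives = {fp}, false negatives = {fn}")
```
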
• 22. Construct Validity
• Construct Validity
• Extent to which the instrument measures a theoretical or hypothetical trait
• Many counseling and psychological constructs are complex, ambiguous and not easily agreed upon:
• Intelligence
• Self-esteem
• Empathy
• Other personality characteristics
• 23. Construct Validity (cont’d)
• Four methods of gathering evidence for construct validity:
• experimental design
• factor analysis
• convergence with other instruments
• discrimination with other measures.
• 24. Construct Validity: Experimental Design
• Creating hypotheses and research studies that show the instrument captures the correct concept
• Example:
• Hypothesis: The “Blank” depression test will discriminate between clinically depressed clients and “normals.”
• Method:
• Identify 100 clinically depressed clients
• Identify 100 “normals”
• Show statistical analysis
• Second Example: Clinicians measure their depressed clients before, then after, 6 months of treatment
• 25. Construct Validity: Factor Analysis
• Factor analysis:
• Statistical relationship between subscales of test
• How similar or different are the sub-scales?
• Example:
• Develop a depression test that has three subscales: self-esteem, suicidal ideation, hopelessness.
• Correlate the subscales:
• Self-esteem and suicidal ideation: .35
• Self-esteem and hopelessness: .25
• Hopelessness and suicidal ideation: .82
• What implications might the above scores have for this test?
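
As a starting point for the question above, the three subscale correlations can be ranked by how much variance each pair shares. This is not a factor analysis, only a sketch of inspecting the inter-subscale correlations given on the slide; the helper name `rank_by_overlap` is made up.

```python
# Inter-subscale correlations as given on the slide
correlations = {
    ("self-esteem", "suicidal ideation"): 0.35,
    ("self-esteem", "hopelessness"): 0.25,
    ("hopelessness", "suicidal ideation"): 0.82,
}

def rank_by_overlap(correlations):
    """Sort subscale pairs from most to least shared variance (r squared)."""
    return sorted(
        ((pair, r, r ** 2) for pair, r in correlations.items()),
        key=lambda row: row[2],
        reverse=True,
    )

for (a, b), r, r2 in rank_by_overlap(correlations):
    print(f"{a} / {b}: r = {r:.2f}, shared variance = {r2:.2f}")
```
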
• 26. Construct Validity: Convergent Validity
• Convergence Evidence –
• Comparing test scores to other, well-established tests
• Example:
• Correlate new depression test against the BDI.
• Is there a good correlation between the two?
• Implications if correlation is extremely high? (e.g., .96)
• Implications if correlation is extremely low? (e.g., .21)
• 27. Construct Validity: Discriminant Validity
• Discriminant Evidence –
• Correlate test scores with other tests that are different
• Hope to find a meager correlation
• Example:
• Compare new depression test with an anxiety test.
• Implications if correlation is extremely high? (e.g., .96)
• Implications if correlation is extremely low? (e.g., .21)
• 28. Validity Recap
• Three types of validity
• Content
• Criterion
• Concurrent
• Predictive
• Construct validity
• Experimental
• Factor Analysis
• Convergent
• Discriminant
• 29. Reliability
• Accuracy or consistency of test scores.
• Would you score the same if you took the test over, and over, and over again?
• Reported as a reliability (correlation) coefficient.
• The closer to r = 1.0, the less error in the test.
• 30. Three Ways of Determining Reliability
• Test-Retest
• Alternate, Parallel, or Equivalent Forms
• Internal Consistency
• a. Coefficient Alpha
• b. Kuder-Richardson
• c. Split-half or Odd Even
• 31. Test-Retest Reliability
• Give the test twice to same group of people.
• E.g., take the first test in this class and, very soon after, take it again. Are the scores about the same?

| | Person 1 | Person 2 | Person 3 | Person 4 | Person 5 | Others |
|---|---|---|---|---|---|---|
| 1st test | 35 | 42 | 43 | 34 | 38 | … |
| 2nd test | 36 | 44 | 41 | 34 | 37 | … |

• Graphic: Form A given once, then Form A given again after some time
• Problem: A person can look up answers between the first and second testing
• 32. Alternate, Parallel, or Equivalent Forms Reliability
• Have two forms of same test
• Give students both forms at the same time
• Correlate scores on first form with scores on second form.
• Graphic: Form A paired with Form B
• Problem: Are two “equivalent” forms ever really equivalent?
• 33. Internal Consistency Reliability
• How do individual items relate to each other and the test as a whole?
• Internal Consistency reliability is going “within” the test rather than using multiple administrations
• High-speed computers and the need for only one test administration have made internal consistency popular
• Three types:
• Split-Half or Odd-Even
• Cronbach’s Coefficient Alpha
• Kuder-Richardson
• 34. Split-half or Odd-even Reliability
• Correlate one half of test with other half for all who took the test
• Example:
• Person 1 scores 16 on first half of test and 16 on second half
• Person 2 scores 14 on first half and 18 on second half
• Also get scores for persons 3, 4, 5, etc.
• Correlate all persons’ scores on first half with their scores on second half
• The correlation = the reliability estimate
• Use the Spearman-Brown formula to correct for the shortness of the half-tests
• 35. Split-half or Odd-even Reliability Internal Consistency
• Example Continued:

| Person | Score on 1st half | Score on 2nd half |
|---|---|---|
| 1 | 16 | 16 |
| 2 | 14 | 18 |
| 3 | 12 | 20 |
| 4 | 15 | 17 |
| And so forth… | | |

• Problem: Are any two halves really equivalent?
• Graphic: a single form, A
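
The Spearman-Brown correction mentioned for split-half reliability steps the half-test correlation up to an estimate for the full-length test: full-test r = 2r / (1 + r). The sketch below uses a hypothetical half-test correlation rather than the four illustrative rows above, since each of those pairs sums to 32 and they would correlate at exactly -1 on their own.

```python
def spearman_brown(half_r: float) -> float:
    """Step a half-test correlation up to an estimate for the full-length test."""
    if half_r <= -1.0:
        raise ValueError("correction is undefined for r = -1")
    return 2 * half_r / (1 + half_r)

half_r = 0.70  # hypothetical correlation between the two halves
print(f"half-test r = {half_r:.2f} -> full-test estimate = {spearman_brown(half_r):.2f}")
```
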
• 36. Cronbach’s Alpha and Kuder-Richardson Internal Consistency
• Other types of Internal Consistency:
• Average correlation of all of the possible split-half reliabilities
• Two popular types:
• Cronbach’s Alpha
• Kuder-Richardson (KR-20, KR-21)
• Graphic: a single form, A
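
Coefficient alpha is usually computed from item variances rather than by averaging every split half directly: alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below assumes that standard formula and uses made-up right/wrong responses; with items scored 0 or 1, the same calculation gives KR-20.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows, one row of item scores per person."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    items = list(zip(*scores))  # transpose: one tuple of scores per item
    item_var_sum = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses: 5 people x 4 items, each scored 0 (wrong) or 1 (right)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```
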
• 37. Item Response Theory: Another Way of Looking at Reliability
• Extension of classical test theory which looks at the amount of error in the total test
• IRT looks at the probability that individuals will answer each item correctly (or match the quality being assessed)
• Or, each item is being assessed for its ability to measure the trait being examined
• 38. Item Response Theory: Another Way of Looking at Reliability
• Figure 3.6 shows the S curve (p. 57):
• Individuals with lower ability have a lower probability of getting certain items correct.
• Individuals with higher ability have a higher probability of getting more items correct.
• Each item is examined for its ability to discriminate based on the trait being measured (in Figure 3.6, p. 57, discrimination is based on ability)
• The better a test can discriminate, the more reliable it is.
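
The S curve can be illustrated with a logistic function. The slides do not name a specific model, so this sketch assumes the two-parameter logistic form, one common choice in IRT; the parameter names `difficulty` and `discrimination` are the usual labels for that model, not terms taken from Figure 3.6.

```python
from math import exp

def prob_correct(ability: float, difficulty: float = 0.0, discrimination: float = 1.0) -> float:
    """Two-parameter logistic item curve: P = 1 / (1 + e^(-a * (theta - b)))."""
    return 1.0 / (1.0 + exp(-discrimination * (ability - difficulty)))

# Compare a flatter item with a steeper one across a range of ability levels
for theta in (-2, -1, 0, 1, 2):
    flat = prob_correct(theta, discrimination=0.5)
    steep = prob_correct(theta, discrimination=2.0)
    print(f"ability {theta:+d}: low-discrimination item {flat:.2f}, "
          f"high-discrimination item {steep:.2f}")
```

The steeper curve separates low-ability from high-ability test takers more sharply, which is the sense in which a more discriminating item contributes more to reliability.
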
• 39. Cross-Cultural Fairness
• Bias in testing did not get much attention until the civil rights movement of the 1960s.
• A series of court decisions established that it was unfair to use tests to track students in schools.
• Black and Hispanic students were being unfairly compared to whites, who were not their norm group.
• Griggs v. Duke Power Company
• Tests for hiring and advancement must show ability to predict job performance.
• Example: Can’t give a test to measure intelligence for those who want to get a job as a road worker.
• 40. Cross-Cultural Fairness
• Americans with Disabilities Act:
• Accommodations for individuals taking tests for employment must be made
• Tests must be shown to be relevant to the job in question.
• Family Educational Rights and Privacy Act (FERPA):
• Right to access school records, including test records.
• Parents have the right to access their child’s records.
• 41. Cross-Cultural Fairness
• Carl Perkins Act:
• Individuals with a disability have the right to vocational assessment, counseling and placement.
• Civil Rights Acts:
• Series of laws concerned with tests used in employment and promotion.
• Freedom of Information Act:
• Most states have expanded this law so that it also applies to state records.
• 42. Cross-Cultural Fairness
• IDEIA and PL 94-142:
• Assures rights of students (age 3 – 21) suspected of having a learning disability to be tested at the school’s expense.
• Child study teams and an IEP are set up when necessary.
• Section 504 of the Rehabilitation Act:
• Relative to assessment, any instrument used to measure appropriateness for a program or service must measure the individual’s ability, not be a reflection of his or her disability.
• 43. Cross-Cultural Fairness
• The Use of Intelligence Tests with Minorities: Confusion and Bedlam
• (See Box 3.1, p. 59)
• 44. Disparities in Ability
• Cognitive differences between people exist; however, they are clouded by issues of SES, prejudice, stereotyping, etc. Are there real differences?
• Why do differences exist and what can be done to eliminate these differences?
• Often seen as environmental (e.g., No Child Left Behind)
• Exercise 3.4, p. 59: Why might there be differences among cultural groups on their ability scores?
• 45. Practicality
• Several practical concerns:
• Time
• Cost
• Format (clarity of print, print size, sequencing of questions and types of questions)
• Ease of Administration, Scoring, and Interpretation
• 46. Selecting & Administering Tests
• Five Steps: