Chapter 3

Transcript

  • 1. Test Worthiness (Chapter 3)
  • 2. Test Worthiness
    • Four cornerstones of test worthiness:
      • Validity
      • Reliability
      • Cross-cultural Fairness
      • Practicality
    • But first, we must learn one statistical concept:
      • Correlation Coefficient
  • 3. Correlation Coefficient
    • Correlation: Statistical expression of the relationship between two sets of scores (or variables)
      • Positive correlation: Increase in one variable accompanied by increase in other
        • “Direct” relationship
      • Negative correlation: Increase in one variable accompanied by decrease in other
        • “Inverse” relationship
  • 4. Correlation Coefficient (Cont’d)
    • What is the relationship between?
      • Gasoline prices and grocery prices?
      • Grocery prices and good weather?
      • Stress and depression?
      • Depression and job productivity?
      • Partying and grades?
      • Study time and grades?
  • 5. Correlation Coefficient (Cont’d)
    • Correlation coefficient ( r )
      • A number between -1 and +1 that indicates direction and strength of the relationship
        • As “r” approaches +1, strength increases in a direct and positive way
        • As “r” approaches -1, strength increases in an inverse and negative way
        • As “r” approaches 0, the relationship is weak or nonexistent (at zero)
  • 6. Correlation (cont’d)
    • Correlation coefficient “r”
      • 0 to +.3 = weak; +.4 to +.6 = medium; +.7 to +1.0 = strong
      • 0 to -.3 = weak; -.4 to -.6 = medium; -.7 to -1.0 = strong
    [Number line: r runs from -1 (strong inverse) through 0 (weak or none) to +1 (strong direct)]
  • 7. Correlation Examples
    r = .35                        r = -.67
    SAT score    Coll. GPA         Missed Classes    Coll. GPA
    930          3.0               3                 3.0
    750          2.9               5                 2.9
    1110         3.8               2                 3.8
    625          2.1               8                 2.1
    885          3.3               1                 3.3
    950          2.6               6                 2.6
    605          2.8               3                 2.8
    810          3.2               1                 3.2
    1045         3.0               3                 3.0
    910          3.5               0                 3.5
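A minimal computational sketch of Pearson's r for the two examples above, using NumPy. The data are copied from the table; note that the transcribed numbers may not reproduce the slide's rounded r values exactly.

```python
import numpy as np

# Data transcribed from the correlation-examples table above.
sat    = np.array([930, 750, 1110, 625, 885, 950, 605, 810, 1045, 910])
gpa    = np.array([3.0, 2.9, 3.8, 2.1, 3.3, 2.6, 2.8, 3.2, 3.0, 3.5])
missed = np.array([3, 5, 2, 8, 1, 6, 3, 1, 3, 0])

# np.corrcoef returns a 2x2 correlation matrix; element [0, 1] is r.
r_sat    = np.corrcoef(sat, gpa)[0, 1]     # direct (positive) relationship
r_missed = np.corrcoef(missed, gpa)[0, 1]  # inverse (negative) relationship

print(f"SAT vs. GPA:            r = {r_sat:+.2f}")
print(f"Missed classes vs. GPA: r = {r_missed:+.2f}")
```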
  • 8. Correlation Scatterplots
    • How do we plot two sets of scores from the previous examples on a graph?
      • Place person A’s SAT score on the x-axis and his/her GPA on the y-axis
      • Continue this for persons B, C, D, etc.
    • This process forms a “Scatterplot”
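A minimal matplotlib sketch of this procedure, reusing the hypothetical SAT/GPA pairs from the correlation-examples table:

```python
import numpy as np
import matplotlib.pyplot as plt

sat = np.array([930, 750, 1110, 625, 885, 950, 605, 810, 1045, 910])
gpa = np.array([3.0, 2.9, 3.8, 2.1, 3.3, 2.6, 2.8, 3.2, 3.0, 3.5])

# Each point is one person: SAT score on the x-axis, GPA on the y-axis.
plt.scatter(sat, gpa)
plt.xlabel("SAT score")
plt.ylabel("College GPA")
plt.title("Scatterplot: SAT score vs. college GPA")
plt.show()
```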
  • 9. Examples of Scatterplots
  • 10. Scatterplots (cont’d)
    • What correlation ( r ) do you think this graph has?
    • How about this correlation?
  • 11. More Scatterplots
    • What might this correlation be?
    • This correlation?
  • 12. More Scatterplots
    • This correlation?
    • Last one
  • 13. Coefficient of Determination (Shared Variance)
    • The square of the correlation (e.g., r = .80, r² = .64)
    • Indicates the proportion of variance the two variables share, i.e., the extent to which common underlying factors account for their relationship.
    • Correlation between depression and anxiety = .85.
      • Shared variance = .72.
      • What factors might underlie both depression and anxiety?
    [Diagram: overlapping “Depression” and “Anxiety” circles; the overlap represents the shared trait variance]
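A one-line sketch of the computation, squaring the slide's r = .85:

```python
r = 0.85                  # correlation between depression and anxiety
shared_variance = r ** 2  # coefficient of determination
print(f"r^2 = {shared_variance:.2f}")  # 0.72, i.e., 72% shared variance
```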
  • 14. Validity
    • What is validity?
      • The degree to which all accumulated evidence supports the intended interpretation of test scores for the intended purpose
      • Lay Def’n: Does a test measure what it is supposed to measure?
      • It is a unitary concept; however, there are three general types of validity evidence:
        • Content Validity
        • Criterion-Related Validity
        • Construct Validity
  • 15. Content Validity
    • Is the content of the test valid for the kind of test it is?
      • Developers must show evidence that the domain was systematically analyzed and concepts are covered in correct proportion
      • Four-step process:
        • Step 1 - Survey the domain
        • Step 2 - Content of the test matches the above domain
        • Step 3 - Specific test items match the content
        • Step 4 - Analyze relative importance of each objective (weight)
  • 16. Content Validity (cont’d)
    [Diagram of the four-step process:]
    Step 1: Survey the domain
    Step 2: Content matches the domain
    Step 3: Test items reflect the content
    Step 4: Items adjusted for relative importance (e.g., Item 1 ×3, Item 2 ×2, Item 3 ×1, Item 4 ×2, Item 5 ×2.5)
  • 17. Content Validity (cont’d)
    • Face Validity
      • Not a real type of content validity
      • A quick look at “face” value of questions
      • Sometimes questions may not “seem” to measure the content, but do (e.g., the panic disorder example in the book, p. 48)
    • How might you show content validity for an instrument that measures depression?
  • 18. Criterion-Related Validity: Concurrent and Predictive Validity
    • Criterion-Related Validity
      • The relationship between the test and a criterion the test should be related to
    • Two types:
      • Concurrent Validity – Does the instrument relate to another criterion “now” (in the present)?
      • Predictive Validity – Does the instrument relate to another criterion in the future?
  • 19. Criterion-Related Validity: Concurrent Validity
    • Example 1
      • 100 clients take the BDI
      • Correlate their scores with clinicians’ ratings of depression of the same group of clients.
    • Example 2
      • 500 people take test of alcoholism tendency
      • Correlate their scores with how significant others rate the amount of alcohol they drink.
  • 20. Criterion-Related Validity: Predictive Validity
    • Examples:
      • SAT scores correlated with how well students do in college.
      • ASVAB scores correlated with success at jobs.
      • GREs correlated with success in graduate school. (See Table 3.1, p. 49)
      • Do Exercise 3.2, p. 50
  • 21. Concepts Related to Predictive Validity
    • Standard Error of the Estimate
      • Using a known value of one variable to predict a likely range of scores on a second variable (e.g., using a GRE score to predict a range of GPAs; see the sketch after this list)
    • False Positive
      • Instrument predicts an attribute that does not exist
    • False Negative
      • Instrument predicts the absence of an attribute that does, in fact, exist
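A minimal sketch of the standard error of the estimate under the usual linear-regression formula, SEE = s_y * sqrt(1 - r²). The GRE/GPA numbers here are hypothetical:

```python
import numpy as np

r = 0.40             # hypothetical GRE-GPA correlation
s_y = 0.50           # hypothetical standard deviation of GPAs
predicted_gpa = 3.1  # hypothetical point prediction from a GRE score

# Standard error of the estimate: spread of actual scores around predictions.
see = s_y * np.sqrt(1 - r ** 2)

# Roughly 68% of actual GPAs should fall within +/- 1 SEE of the prediction.
print(f"SEE = {see:.2f}")
print(f"Likely GPA range: {predicted_gpa - see:.2f} to {predicted_gpa + see:.2f}")
```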
  • 22. Construct Validity
    • Construct Validity
      • Extent to which the instrument measures a theoretical or hypothetical trait
      • Many counseling and psychological constructs are complex, ambiguous and not easily agreed upon:
        • Intelligence
        • Self-esteem
        • Empathy
        • Other personality characteristics
  • 23. Construct Validity (cont’d)
    • Four methods of gathering evidence for construct validity:
      • experimental design
      • factor analysis
      • convergence with other instruments
      • discrimination with other measures
  • 24. Construct Validity: Experimental Design
      • Creating hypotheses and research studies that show the instrument captures the correct concept
        • Example:
          • Hypothesis: The “Blank” depression test will discriminate between clinically depressed clients and “normals.”
          • Method:
            • Identify 100 clinically depressed clients
            • Identify 100 “normals”
            • Show statistical analysis
        • Second example: Clinicians measure their depressed clients before and after six months of treatment
  • 25. Construct Validity: Factor Analysis
    • Factor analysis:
      • Statistical relationship between subscales of test
      • How similar or different are the sub-scales?
      • Example:
        • Develop a depression test that has three subscales: self-esteem, suicidal ideation, hopelessness.
        • Correlate the subscales:
          • Self-esteem and suicidal ideation: .35
          • Self-esteem and hopelessness: .25
          • Hopelessness and suicidal ideation: .82
        • What implications might the above scores have for this test?
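A hedged sketch of the underlying computation: correlating subscale scores with one another. The score arrays are hypothetical, and a full factor analysis would go further than pairwise correlations, but the correlation matrix already hints at the kind of structure the slide describes:

```python
import numpy as np

# Hypothetical subscale scores for eight test-takers.
self_esteem       = np.array([12, 15, 9, 14, 11, 16, 10, 13])
suicidal_ideation = np.array([4, 2, 7, 3, 6, 1, 5, 3])
hopelessness      = np.array([5, 2, 8, 3, 6, 1, 6, 4])

scores = np.vstack([self_esteem, suicidal_ideation, hopelessness])
labels = ["self-esteem", "suicidal ideation", "hopelessness"]

# np.corrcoef treats each row as a variable; entry [i, j] is r between
# subscale i and subscale j.
r_matrix = np.corrcoef(scores)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]} vs. {labels[j]}: r = {r_matrix[i, j]:+.2f}")
```

If two subscales correlate very highly (like hopelessness and suicidal ideation at .82 above), they may be measuring largely the same underlying factor.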
  • 26. Construct Validity: Convergent Validity
    • Convergence Evidence –
      • Comparing test scores to other, well-established tests
      • Example:
        • Correlate new depression test against the BDI.
        • Is there a good correlation between the two?
        • Implications if correlation is extremely high? (e.g., .96)
        • Implications if correlation is extremely low? (e.g., .21)
  • 27. Construct Validity: Discriminant Validity
    • Discriminant Evidence –
        • Correlate test scores with other tests that are different
        • Hope to find a meager correlation
        • Example:
          • Compare new depression test with an anxiety test.
          • Implications if correlation is extremely high? (e.g., .96)
          • Implications if correlation is extremely low? (e.g., .21)
  • 28. Validity Recap
    • Three types of validity
      • Content
      • Criterion
        • Concurrent
        • Predictive
      • Construct validity
        • Experimental
        • Factor Analysis
        • Convergent
        • Discriminant
  • 29. Reliability
    • Accuracy or consistency of test scores.
    • Would you score the same if you took the test over, and over, and over again?
    • Reported as a reliability (correlation) coefficient.
    • The closer to r = 1.0, the less error in the test.
  • 30. Three Ways of Determining Reliability
    • Test-Retest
    • Alternate, Parallel, or Equivalent Forms
    • Internal Consistency
      • a. Coefficient Alpha
      • b. Kuder-Richardson
      • c. Split-half or Odd-Even
  • 31. Test-Retest Reliability
    • Give the test twice to same group of people.
      • E.g., take the first test in this class and, very soon after, take it again. Are the scores about the same?
                   Person 1   Person 2   Person 3   Person 4   Person 5   …
        1st test:  35         42         43         34         38
        2nd test:  36         44         41         34         37
      • Problem: A person can look up answers between the first and second testing
    [Graphic: the same form (A) given at two points in time]
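A minimal sketch of the computation using the five scores above; with these numbers the reliability comes out around .92:

```python
import numpy as np

first_test  = np.array([35, 42, 43, 34, 38])
second_test = np.array([36, 44, 41, 34, 37])

# Test-retest reliability is the correlation between the two administrations.
reliability = np.corrcoef(first_test, second_test)[0, 1]
print(f"Test-retest reliability: r = {reliability:.2f}")  # ~0.92 here
```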
  • 32. Alternate, Parallel, or Equivalent Forms Reliability
    • Have two forms of same test
    • Give students the two forms at the same time
    • Correlate scores on first form with scores on second form.
    • Problem: Are two “equivalent” forms ever really equivalent?
    [Graphic: Form A and Form B given at the same time]
  • 33. Internal Consistency Reliability
    • How do individual items relate to each other and the test as a whole?
    • Internal Consistency reliability is going “within” the test rather than using multiple administrations
    • High-speed computers and the need for only one test administration have made internal consistency popular
    • Three types:
      • Split-Half or Odd-Even
      • Cronbach’s Coefficient Alpha
      • Kuder-Richardson
  • 34. Split-half or Odd-even Reliability
    • Correlate one half of test with other half for all who took the test
    • Example:
      • Person 1 scores 16 on first half of test and 16 on second half
      • Person 2 scores 14 on first half and 18 on second half
      • Also get scores for persons 3, 4, 5, etc.
      • Correlate all persons’ scores on the first half with their scores on the second half
      • The correlation = the reliability estimate
    • Use the Spearman-Brown formula to correct for the shortness of each half (see the sketch after the continued example below)
  • 35. Split-half or Odd-even Reliability Internal Consistency
    • Example continued:
        Person   Score on 1st half   Score on 2nd half
        1        16                  16
        2        14                  18
        3        12                  20
        4        15                  17
        … and so forth
      • Problem: Are any two halves really equivalent?
    [Graphic: one form (A), split into two halves]
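A hedged sketch of the procedure with hypothetical half-scores (the slide's toy numbers happen to be perfectly inversely related, so fresh values are used here), including the Spearman-Brown correction:

```python
import numpy as np

# Hypothetical scores on each half of the test for six people.
first_half  = np.array([16, 14, 12, 15, 18, 11])
second_half = np.array([15, 15, 13, 14, 17, 10])

# The half-to-half correlation is the raw split-half reliability estimate.
r_half = np.corrcoef(first_half, second_half)[0, 1]

# Spearman-Brown correction: a half-length test underestimates the
# reliability of the full-length test, so step the estimate back up.
r_full = (2 * r_half) / (1 + r_half)

print(f"Half-test correlation:         r = {r_half:.2f}")
print(f"Spearman-Brown full-test est.: r = {r_full:.2f}")
```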
  • 36. Cronbach’s Alpha and Kuder-Richardson Internal Consistency
    • Other types of Internal Consistency:
      • Average correlation of all of the possible split-half reliabilities
      • Two popular types:
        • Cronbach’s Alpha
        • Kuder-Richardson (KR-20, KR-21)
    [Graphic: one form (A), averaged over every possible split]
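A minimal sketch of Cronbach's alpha from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The item responses are hypothetical:

```python
import numpy as np

# Hypothetical item scores: rows = six people, columns = four items.
items = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
])

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)      # variance of each item
total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores

# Cronbach's alpha: higher when items covary strongly with one another.
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")
```

KR-20 is the special case of this formula for dichotomously scored (right/wrong) items.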
  • 37. Item Response Theory: Another Way of Looking at Reliability
    • An extension of classical test theory, which looks at the amount of error in the total test
    • IRT looks at the probability that individuals will answer each item correctly (or match the quality being assessed)
    • Or, each item is being assessed for its ability to measure the trait being examined
  • 38. Item Response Theory (cont’d)
    • Figure 3.6 shows the S curve (p. 57):
      • Individuals with lower ability have a lower probability of getting certain items correct.
      • Individuals with higher ability have a higher probability of getting more items correct.
      • Each item is examined for its ability to discriminate based on the trait being measured (in Figure 3.6, p. 57, discrimination is based on ability)
      • The better a test can discriminate, the more reliable it is.
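A hedged sketch of the S-shaped item characteristic curve the slide refers to, using the two-parameter logistic (2PL) model, P(correct | theta) = 1 / (1 + e^(-a(theta - b))). The parameter values are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

def icc(theta, a, b):
    """2PL item characteristic curve: probability of answering correctly
    at ability theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 200)  # ability scale

# A steeper curve (larger a) discriminates better around its difficulty b,
# which in IRT terms makes the item more informative (more reliable).
plt.plot(theta, icc(theta, a=1.5, b=0.0), label="a = 1.5 (more discriminating)")
plt.plot(theta, icc(theta, a=0.6, b=0.0), label="a = 0.6 (less discriminating)")
plt.xlabel("Ability (theta)")
plt.ylabel("P(correct)")
plt.title("Item characteristic curves (2PL)")
plt.legend()
plt.show()
```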
  • 39. Cross-Cultural Fairness
    • Bias in testing did not get much attention until the civil rights movement of the 1960s.
    • A series of court decisions established that it was unfair to use tests to track students in schools.
      • Black and Hispanic students were being unfairly compared to whites, who were not their norm group.
    • Griggs v. Duke Power Company
      • Tests used for hiring and advancement must be shown to predict job performance.
      • Example: You can’t give an intelligence test to those who want a job as a road worker.
  • 40. Cross-Cultural Fairness
    • Americans with Disabilities Act :
      • Accommodations for individuals taking tests for employment must be made
      • Tests must be shown to be relevant to the job in question.
    • Family Education Rights and Privacy Act (FERPA)
      • Right to access school records, including test records.
      • Parents have the right to their child’s records
  • 41. Cross-Cultural Fairness
    • Carl Perkins Act:
      • Individuals with a disability have the right to vocational assessment, counseling and placement.
    • Civil Rights Acts:
      • Series of laws concerned with tests used in employment and promotion.
    • Freedom of Information Act:
      • Assures access to federal records, including test records.
      • Most states have expanded this law so that it also applies to state records.
  • 42. Cross-Cultural Fairness
    • IDEIA and PL 94-142:
      • Assures the right of students (ages 3-21) suspected of having a learning disability to be tested at the school’s expense.
      • Child Study Teams and IEP set up when necessary
    • Section 504 of the Rehabilitation Act:
      • Relative to assessment, any instrument used to measure appropriateness for a program or service must measure the individual’s ability, not be a reflection of his or her disability.
  • 43. Cross-Cultural Fairness
    • The Use of Intelligence Tests with Minorities: Confusion and Bedlam
      • (See Box 3.1, p. 59)
  • 44. Disparities in Ability
    • Cognitive differences between people exist; however, they are clouded by issues of SES, prejudice, stereotyping, etc. Are there real differences?
    • Why do differences exist and what can be done to eliminate these differences?
    • Differences are often seen as environmental (a premise behind No Child Left Behind)
    • Exercise 3.4, p. 59: Why might there be differences among cultural groups on their ability scores?
  • 45. Practicality
    • Several practical concerns:
      • Time
      • Cost
      • Format (clarity of print, print size, sequencing of questions and types of questions)
      • Readability
      • Ease of Administration, Scoring, and Interpretation
  • 46. Selecting & Administering Tests
    • Five Steps:
      • Determine your client’s goals
      • Choose instruments to reach client goals.
      • Access information about possible instruments:
          • Source books (e.g., Buros’ Mental Measurements Yearbook and Tests in Print) (see Box 3.2, p. 62)
          • Publisher resource catalogs
          • Journals in the field
          • Books on testing
          • Experts
          • The internet
      • Examine validity, reliability, cross-cultural fairness, and practicality.
      • Make a wise choice.
