RELIABILITY
PSYCHOLOGICAL TESTING & ASSESSMENT
History and Theory of Reliability
Conceptualization of Error
Physical vs. Psychological Measurement
• Physical sciences: precise, direct measurements (e.g., using a ruler).
• Psychology: measures abstract traits (e.g., intelligence), which are harder to quantify.
The “Rubber Yardstick” Metaphor
• Psychological tools may distort results—overestimating or underestimating true values.
• Represents the variability and inaccuracy in measuring human traits.
Challenges in Psychological Measurement
• No fixed units or tools for abstract constructs.
• Instruments must be critically evaluated for reliability.
Importance of Reliability
• Inaccurate tools lead to unreliable conclusions.
• Psychology places strong emphasis on identifying and reducing measurement error.
History and Theory of Reliability
Spearman’s Contribution to Reliability Theory
• Foundations of Reliability Theory
• Sampling Error – Introduced by De Moivre (1733)
• Correlation – Developed by Karl Pearson (1896)
• Charles Spearman (1904)
• Integrated sampling error and correlation into reliability theory
• Published “The Proof and Measurement of Association between Two Things”
• Work influenced E.L. Thorndike and early psychological measurement
• Key Developments After Spearman
• Kuder & Richardson (1937) – Introduced new reliability coefficients
• Cronbach (1989, 1995) – Advanced methods for evaluating multiple sources of error
• Item Response Theory (IRT) – Modern advancement rooted in Spearman’s ideas
• Legacy
• Spearman’s work laid the foundation for modern psychometrics
• Reliability theory continues to evolve with advanced statistical models
History and Theory of Reliability
Basics of Classical Test Score Theory
• Observed Score Equation:
X = T + E
• X: Observed score
• T: True score
• E: Error of measurement
• True Score Concept:
• Hypothetical score with no measurement error
• Remains constant across repeated testing
• Measurement Error:
• Random, not systematic (unlike consistent bias)
• Causes score variability across test administrations
• Rubber Yardstick Analogy:
• Measurement tools "stretch" or "shrink" randomly
• Leads to a distribution of observed scores around the true score
• Standard Error of Measurement (SEM):
• The standard deviation of the distribution of observed scores around the true score
• Smaller SEM = more precise measurement (see the sketch after this list)
• Score Distributions:
• Wide dispersion = high error
• Narrow dispersion = high reliability
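The link between reliability, the spread of observed scores, and the SEM can be made concrete with a short simulation. The sketch below is illustrative only and not from the source slides; the true score, standard deviation, and reliability values are assumed, and it uses the standard formula SEM = SD × √(1 − r).

```python
import numpy as np

# Hypothetical values chosen for illustration (not from the slides)
true_score = 100        # T: the error-free score
sd_observed = 15        # standard deviation of observed scores
reliability = 0.90      # reliability coefficient of the test

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
sem = sd_observed * np.sqrt(1 - reliability)
print(f"SEM = {sem:.2f}")   # smaller SEM -> more precise measurement

# Simulate repeated testing of one person: X = T + E, with E random around zero
rng = np.random.default_rng(0)
errors = rng.normal(loc=0.0, scale=sem, size=10_000)   # E
observed = true_score + errors                          # X = T + E

print(f"Mean observed score: {observed.mean():.2f}  (close to T = {true_score})")
print(f"SD of observed scores around T: {observed.std():.2f}  (close to the SEM)")
```

The simulation shows the rubber-yardstick idea directly: the observed scores form a distribution around the true score, and the width of that distribution is the SEM.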
Domain Sampling Model in Reliability Theory
• Estimates reliability by examining how well a limited sample of items represents a larger domain (e.g., testing spelling with a subset of words).
• True Score Concept:
• True score = score from all possible items in the domain
• We estimate it using a sample, which introduces error
• Source of Error:
• Using limited test items instead of the full domain
• Sampling error causes variability in estimated scores
• Main Assumptions:
• Items are randomly sampled from the domain
• Each test is an unbiased estimate of the true score
• Repeated random sampling → normal distribution of scores
• Reliability Estimate:
• Correlation between scores on parallel forms (tests from the same domain)
• Higher item count → better domain coverage → higher reliability
• Practical Implication:
• Longer tests generally yield more reliable scores, as the Spearman–Brown sketch below illustrates
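One standard way to formalize "more items → higher reliability" is the Spearman–Brown prophecy formula, which predicts the reliability of a test lengthened (or shortened) by a factor k. The sketch below is a minimal illustration with assumed numbers, not a calculation from the source.

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability when a test is lengthened by factor k
    (k > 1 lengthens, k < 1 shortens), assuming the added items are
    drawn from the same domain as the originals."""
    return (k * reliability) / (1 + (k - 1) * reliability)

# Hypothetical example: a test with reliability .70
r_original = 0.70
for k in (0.5, 1, 2, 3):
    print(f"length factor {k}: predicted reliability = {spearman_brown(r_original, k):.3f}")
# Doubling the test (k = 2) raises the estimate to about .82;
# halving it (k = 0.5, as in split-half) lowers it to about .54.
```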
Item Response Theory (IRT)
• What Is IRT?
• A modern alternative to Classical Test Theory (CTT)
• Focuses on individual item performance and ability estimation
• Limitations of Classical Test Theory:
• Same items administered to everyone
• Many items may be too easy or too hard
• Weak alignment with individual ability level → lower reliability
• IRT Approach:
• Adaptive testing using computer algorithms
• Items adjust based on performance (harder items after correct answers, easier items after incorrect answers)
• Focuses on the zone of maximum information (where the person gets some right and some wrong)
• Advantages of IRT:
• More precise and reliable ability estimates
• Can use fewer items for better results
• Challenges:
• Requires a large, calibrated item bank
• Demands complex software and extensive test development
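A common way to see where an item is most informative is the two-parameter logistic (2PL) model, in which the probability of a correct response depends on the gap between a person's ability θ and the item's difficulty b. The sketch below is an illustrative computation with assumed item parameters; it is not from the source slides and does not represent any particular IRT package.

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a correct answer.
    theta: person ability; a: item discrimination; b: item difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item; it peaks where theta is near b,
    i.e., where the person gets some items right and some wrong."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)

# Hypothetical item: discrimination a = 1.5, difficulty b = 0.0
for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+d}: P(correct)={p_correct(theta, 1.5, 0.0):.2f}, "
          f"information={item_information(theta, 1.5, 0.0):.2f}")
# Information is highest at theta = b: adaptive tests pick items whose
# difficulty is close to the current ability estimate, which is the
# "zone of maximum information" described above.
```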
MODELS OF RELIABILITY
1. Test–retest
2. Parallel forms
3. Internal consistency
1. Test–Retest Reliability (Time Sampling Method)
Measures consistency of scores over time by administering the same test on two occasions.
• Appropriate Use:
• For stable traits (e.g., intelligence)
• Not suitable for changing states (e.g., mood, health, Rorschach scores)
• Assumptions:
• Trait remains unchanged between tests
• Differences in scores reflect random error
• Carryover Effects:
• First test influences second (e.g., remembering answers)
• May inflate reliability estimates
• Practice Effects:
• Skills improve due to test experience, not actual trait change
• Affects individuals unequally, adding error
• Timing Matters:
• Short intervals → high risk of carryover/practice effects
• Long intervals → risk of actual change in the trait
• Interpreting Correlations:
• A low correlation does not always mean poor reliability
• Could reflect true trait change over time
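Computationally, a test–retest coefficient is simply the Pearson correlation between the two administrations. A minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores for the same 8 people tested on two occasions
time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])
time2 = np.array([14, 17, 27, 29, 21, 16, 30, 19])

# Test-retest reliability = Pearson correlation between the two occasions
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.3f}")
# A low r may reflect measurement error, but it can also reflect
# true change in the trait between administrations.
```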
2. Parallel Forms Reliability (Item Sampling Method)
• Assesses how much item selection affects test scores by comparing two equivalent forms of a test.
• Definition:
• Two test forms with different items
• Both measure the same attribute
• Items are matched in difficulty and content
• Method:
• Administer both forms to the same group
• Calculate Pearson correlation between scores
• If tested same day: isolates form differences and random error
• If tested at different times: includes time sampling error
• Advantages:
• One of the most rigorous methods for reliability
• Ensures scores aren’t biased by specific item sets
• Limitations:
• Developing parallel forms is time-consuming and resource-heavy
• Rarely used in practice due to logistical constraints
• Alternatives:
• If only one test form is available, use internal consistency methods (e.g., split-half, Cronbach’s alpha)
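In practice the parallel-forms coefficient is again a Pearson correlation, but it is also worth checking that the two forms are matched in difficulty (similar means and spreads). An illustrative sketch with hypothetical data:

```python
import numpy as np

# Hypothetical scores of the same group on two forms given the same day
form_a = np.array([41, 35, 48, 52, 39, 45, 50, 37])
form_b = np.array([43, 34, 47, 50, 40, 46, 52, 36])

# Parallel-forms reliability: correlation between the two forms
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel-forms r = {r:.3f}")

# Rough equivalence check: matched forms should show similar means and SDs
print(f"Form A: mean={form_a.mean():.1f}, SD={form_a.std(ddof=1):.1f}")
print(f"Form B: mean={form_b.mean():.1f}, SD={form_b.std(ddof=1):.1f}")
```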
3. Internal Consistency Methods of Reliability
1. Split-Half Method
• Procedure: Divide test into two halves (e.g., odd-even); correlate scores.
• Limitation: Underestimates reliability (each half is shorter).
• Use When: You can logically divide test into two comparable parts.
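A minimal sketch of the split-half procedure on a small, hypothetical 0/1 item-response matrix: split the items into odd and even halves, correlate the half scores, then apply the Spearman–Brown correction to counter the underestimation caused by halving the test.

```python
import numpy as np

# Hypothetical item responses: rows = 6 examinees, columns = 10 items (1 = correct)
X = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
])

odd_half = X[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, ...
even_half = X[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # correlation of the half scores
r_full = (2 * r_half) / (1 + r_half)              # Spearman-Brown correction
print(f"half-test r = {r_half:.3f}, corrected full-test r = {r_full:.3f}")
```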
2. KR20 (Kuder-Richardson Formula 20)
• Use When: Items are scored dichotomously (right/wrong).
• Strength: Considers all possible splits (unlike simple split-half).
• Limitation: Only for tests with objective scoring (e.g., multiple choice).
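KR-20 can be computed directly from a 0/1 item matrix: KR-20 = [k / (k − 1)] × (1 − Σpq / σₓ²), where p and q are the proportions answering each item correctly and incorrectly. The sketch below uses assumed data purely for illustration.

```python
import numpy as np

def kr20(X: np.ndarray) -> float:
    """KR-20 for dichotomously scored (0/1) items.
    X: examinees x items matrix of right/wrong responses."""
    k = X.shape[1]                          # number of items
    p = X.mean(axis=0)                      # proportion correct per item
    q = 1 - p                               # proportion incorrect per item
    total_var = X.sum(axis=1).var(ddof=0)   # variance of total scores
    # ddof=0 keeps the total-score variance on the same (population)
    # convention as the item variances p * q
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical responses: 5 examinees x 4 items
X = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
])
print(f"KR-20 = {kr20(X):.3f}")
```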
3. Coefficient Alpha (Cronbach’s Alpha)
• Use When: Items are not dichotomous (e.g., Likert scales).
• Formula: α = [k / (k − 1)] × (1 − Σσᵢ² / σₓ²), where k = number of items, σᵢ² = variance of item i, and σₓ² = variance of total test scores
• Advantage: Most general and widely used measure.
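Coefficient alpha generalizes KR-20 to items that are not scored right/wrong (e.g., Likert ratings). The sketch below uses hypothetical ratings; with 0/1 items the same computation reduces to KR-20.

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """Cronbach's alpha = [k/(k-1)] * (1 - sum of item variances / total variance).
    X: examinees x items matrix of item scores (e.g., Likert ratings)."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)        # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 Likert responses: 6 respondents x 4 items
X = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [1, 2, 1, 2],
])
print(f"Cronbach's alpha = {cronbach_alpha(X):.3f}")
```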