History and Theory of Reliability
Conceptualization of Error
Physical vs. Psychological Measurement
• Physical sciences: precise, direct measurements (e.g., using a ruler).
• Psychology: measures abstract traits (e.g., intelligence), which are harder to quantify.
The “Rubber Yardstick” Metaphor
• Psychological tools may distort results—overestimating or underestimating true values.
• Represents the variability and inaccuracy in measuring human traits.
Challenges in Psychological Measurement
• No fixed units or tools for abstract constructs.
• Instruments must be critically evaluated for reliability.
Importance of Reliability
• Inaccurate tools lead to unreliable conclusions.
• Psychology places strong emphasis on identifying and reducing measurement error.
History and Theory of Reliability
Spearman’s Contribution to Reliability Theory
• Foundations of Reliability Theory
• Sampling Error – Introduced by De Moivre (1733)
• Correlation – Developed by Karl Pearson (1896)
• Charles Spearman (1904)
• Integrated sampling error and correlation into reliability theory
• Published “The Proof and Measurement of Association between Two Things”
• Work influenced E.L. Thorndike and early psychological measurement
• Key Developments After Spearman
• Kuder & Richardson (1937) – Introduced new reliability coefficients
• Cronbach (1989, 1995) – Advanced methods for evaluating multiple sources of error
• Item Response Theory (IRT) – Modern advancement rooted in Spearman’s ideas
• Legacy
• Spearman’s work laid the foundation for modern psychometrics
• Reliability theory continues to evolve with advanced statistical models
History and Theory of Reliability
Basics of Classical Test Score Theory
• Observed Score Equation:
X = T + E
• X: Observed score
• T: True score
• E: Error of measurement
• True Score Concept:
• Hypothetical score with no measurement error
• Remains constant across repeated testing
• Measurement Error:
• Random, not systematic (unlike consistent bias)
• Causes score variability across test administrations
• Rubber Yardstick Analogy:
• Measurement tools "stretch" or "shrink" randomly
• Leads to a distribution of observed scores around the true score
• Standard Error of Measurement (SEM):
• Reflects average deviation of observed scores from the true score
• Smaller SEM = more precise measurement
• Score Distributions:
• Wide dispersion = high error
• Narrow dispersion = high reliability
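To make the model concrete, here is a minimal Python sketch (the numbers are made up for illustration): it simulates repeated administrations of the same "rubber yardstick", so each observed score is the fixed true score plus random error, and the spread of the resulting distribution is the standard error of measurement. The last lines use the standard CTT identity SEM = SD * sqrt(1 − reliability).

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 100          # T: hypothetical error-free score, constant across testings
sem = 5.0                 # assumed spread of random measurement error
n_tests = 10_000          # repeated administrations with the "rubber yardstick"

# X = T + E: each administration adds random error to the same true score
errors = rng.normal(0.0, sem, size=n_tests)
observed = true_score + errors

print(round(observed.mean(), 1))       # ~100.0: errors average out around T
print(round(observed.std(ddof=1), 1))  # ~5.0: empirical SEM (spread around T)

# Standard CTT identity: SEM = SD_x * sqrt(1 - reliability),
# so higher reliability means a narrower distribution of observed scores
sd_x, reliability = 15.0, 0.91
print(round(sd_x * (1 - reliability) ** 0.5, 1))  # 4.5
```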
Domain Sampling Model in Reliability Theory
• Estimates reliability by examining how well a limited sample of items represents a larger domain (e.g., testing spelling with a subset of words).
• True Score Concept:
• True score = score from all possible items in the domain
• We estimate it using a sample, which introduces error
• Source of Error:
• Using limited test items instead of the full domain
• Sampling error causes variability in estimated scores
• Main Assumptions:
• Items are randomly sampled from the domain
• Each test is an unbiased estimate of the true score
• Repeated random sampling → normal distribution of scores
• Reliability Estimate:
• Correlation between scores on parallel forms (tests from the same domain)
• Higher item count → better domain coverage → higher reliability
• Practical Implication:
• Longer tests generally yield more reliable scores
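The last point can be quantified with the Spearman–Brown prophecy formula, the standard result that follows from the domain sampling model (it is not named on the slide, so treat this as a supplementary sketch): lengthening a test n-fold with items from the same domain predicts a reliability of nr / (1 + (n − 1)r).

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test is lengthened n-fold with
    items sampled from the same domain (Spearman-Brown prophecy)."""
    return (n * r) / (1 + (n - 1) * r)

r = 0.70  # reliability of the original test (hypothetical)
for n in (1, 2, 4):
    print(f"{n}x items -> predicted reliability {spearman_brown(r, n):.2f}")
# 1x -> 0.70, 2x -> 0.82, 4x -> 0.90: broader domain coverage, higher reliability
```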
Item Response Theory (IRT)
• What Is IRT?
• A modern alternative to Classical Test Theory (CTT)
• Focuses on individual item performance and ability estimation
• Limitations of Classical Test Theory:
• Same items administered to everyone
• Many items may be too easy or too hard
• Weak alignment with individual ability level → lower reliability
• IRT Approach:
• Adaptive testing using computer algorithms
• Items adjust based on performance (easy → harder, or hard → easier)
• Focuses on the zone of maximum information, where the person gets some items right and some wrong (see the sketch after this list)
• Advantages of IRT:
• More precise and reliable ability estimates
• Can use fewer items for better results
• Challenges:
• Requires a large, calibrated item bank
• Demands complex software and extensive test development
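As a rough illustration of these ideas, the sketch below assumes the one-parameter (Rasch) IRT model, which the slides do not specify, and uses a deliberately crude ability update instead of the maximum-likelihood estimation real adaptive tests use. It shows the two ingredients described above: item information peaks where difficulty matches ability, and the adaptive rule keeps administering the most informative remaining item.

```python
import math

def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) item response function: P(correct | ability theta, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def information(theta: float, b: float) -> float:
    """Item information under the 1PL model; highest when b is close to theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

# Hypothetical item bank: difficulties on the same scale as ability
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

theta_hat = 0.0  # provisional ability estimate
for step in range(3):
    # Adaptive rule: administer the item with maximum information at theta_hat
    item = max(bank, key=lambda b: information(theta_hat, b))
    bank.remove(item)
    answered_correctly = True            # stand-in for the examinee's response
    # Crude update for illustration only: move toward harder or easier items
    theta_hat += 0.5 if answered_correctly else -0.5
    print(f"step {step}: item difficulty {item:+.1f}, new theta_hat {theta_hat:+.1f}")
```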
1. Test–Retest Reliability (Time Sampling Method)
Measures consistency of scores over time by administering the same test on two occasions.
• Appropriate Use:
• For stable traits (e.g., intelligence)
• Not suitable for changing states (e.g., mood, health, Rorschach scores)
• Assumptions:
• Trait remains unchanged between tests
• Differences in scores reflect random error
• Carryover Effects:
• First test influences second (e.g., remembering answers)
• May inflate reliability estimates
• Practice Effects:
• Skills improve due to test experience, not actual trait change
• Affects individuals unequally, adding error
• Timing Matters:
• Short intervals → high risk of carryover/practice effects
• Long intervals → risk of actual change in the trait
• Interpreting Correlations:
• Low correlation ≠ always poor reliability
• Could reflect true trait change over time
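Computationally, the test–retest coefficient is simply the Pearson correlation between the two administrations. A minimal sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical scores for the same 8 people tested twice, a few weeks apart
time1 = np.array([12, 15, 9, 20, 14, 17, 11, 18])
time2 = np.array([13, 14, 10, 19, 15, 16, 12, 17])

# Test-retest reliability = Pearson correlation between the two occasions
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r_tt:.2f}")
# A low r could reflect measurement error, but it could also reflect
# true change in the trait between occasions (see the last bullet above).
```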
2. Parallel Forms Reliability (Item Sampling Method)
• Assesses how much item selection affects test scores by comparing two equivalent forms of a test.
• Definition:
• Two test forms with different items
• Both measure the same attribute
• Items are matched in difficulty and content
• Method:
• Administer both forms to the same group
• Calculate Pearson correlation between scores
• If tested same day: isolates form differences and random error
• If tested at different times: includes time sampling error
• Advantages:
• One of the most rigorous methods for reliability
• Ensures scores aren’t biased by specific item sets
• Limitations:
• Developing parallel forms is time-consuming and resource-heavy
• Rarely used in practice due to logistical constraints
• Alternatives:
• If only one test form is available, use internal consistency methods (e.g., split-half, Cronbach’s alpha)
3. Internal Consistency Methods of Reliability
1. Split-Half Method
• Procedure: Divide test into two halves (e.g., odd-even); correlate scores.
• Limitation: Underestimates reliability (each half is shorter).
• Use When: You can logically divide test into two comparable parts.
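A minimal sketch of the odd–even split on a hypothetical score matrix; the Spearman–Brown correction in the last step (not listed on the slide, but the standard remedy) compensates for each half being only half as long as the full test:

```python
import numpy as np

# Hypothetical persons-by-items score matrix: 6 persons x 10 items
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
])

odd_half = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)      # Spearman-Brown correction to full length
print(f"half-test r = {r_half:.2f}, corrected full-test reliability = {r_full:.2f}")
```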
2. KR20 (Kuder-Richardson Formula 20)
• Use When: Items are scored dichotomously (right/wrong).
• Strength: Considers all possible splits (unlike simple split-half).
• Limitation: Only for tests with objective scoring (e.g., multiple choice).
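A sketch of the KR-20 computation on hypothetical right/wrong data, following the standard formula KR20 = (k / (k − 1)) × (1 − Σ pq / σx²):

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 for a persons-by-items matrix of dichotomous (0/1) scores."""
    k = items.shape[1]                         # number of items
    item_var = (items.mean(axis=0) * (1 - items.mean(axis=0))).sum()  # sum of p*q
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical right/wrong data: 5 persons x 4 items
data = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
])
print(f"KR-20 = {kr20(data):.2f}")
```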
3. Coefficient Alpha (Cronbach’s Alpha)
• Use When: Items are not dichotomous (e.g., Likert scales).
• Formula: α = (k / (k − 1)) × (1 − Σσᵢ² / σₓ²), where k = number of items, σᵢ² = variance of item i, and σₓ² = variance of total scores
• Advantage: Most general and widely used measure.
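A sketch of coefficient alpha applied to hypothetical Likert-type responses, directly implementing the formula above; with 0/1 items it reduces to KR-20:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items score matrix (any item scale)."""
    k = items.shape[1]                           # number of items
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 1-5 Likert responses: 6 persons x 4 items
likert = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 1, 2, 2],
    [4, 4, 3, 4],
    [3, 2, 3, 3],
])
print(f"alpha = {cronbach_alpha(likert):.2f}")
```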