Fundamentals of Classical Test Theory
(CTT)
Dr. Jai Singh
National Accreditation Board for Education and
Training (NABET) -QCI
Objectives-
• To understand constructs and latent traits.
• To know about measuring latent traits.
• To understand the terminology of test construction.
• To know the fundamentals of CTT.
• To know the various assumptions of CTT.
• To critically evaluate the use of CTT and its
limitations.
Constructs and Measures
Constructs are theoretical terms that refer to unobserved, idealized entities.
A construct's height, weight, or depth cannot be measured directly because
constructs are not concrete objects in the visible world.
In psychology, a construct is any complex psychological concept: a skill,
attribute, or ability grounded in one or more established theories. Constructs
exist in the human mind and are not directly observable.
In psychology and cognitive science, constructs include motivation,
intelligence, anxiety, fear, anger, personality, love, attachment, memory,
creativity, learning outcomes, and attention.
Measures are the observations used in science to learn about constructs.
These include things like reaction times, accuracy scores, and response
frequencies.
Latent Traits
Latent traits are a specific kind of construct.
– Relatively stable qualities of individuals that are
changeable, but only over the long term.
• Transient things, such as “attention,” are not traits.
– Latent traits include everyday things like attitudes,
personality, preferences, and dispositions (e.g.,
“talkative”).
– Latent traits also include many kinds of things that
educators are interested in:
• Ability, aptitude, creativity, expertise, learning
outcomes and intelligence.
Measuring Latent Traits
It is important to recognize that no single measure of a
latent trait is ever taken to be a perfectly accurate
measure of that trait.
– Instead, different kinds of “measures” or “tests” are
seen as “tapping into” the latent trait.
– Different measures may “tap into” a latent trait in
different ways, capturing some aspects of the trait
better than others.
– Multiple measures can provide “converging”
evidence.
Psychometric Tests
• Psychometric tests are standardized tests, and
they are designed to assess a particular
variable.
• Psychometric tests are scientific and systematic
ways to assess someone's ability to do a job, or to
measure their personality or a mental ability
(such as math achievement or language learning
outcomes).
• Psychometrics is the study of how such
measurements are developed and evaluated.
Measuring body temperature
• Using temperature to indicate illness.
• Measurement tool: a mercury thermometer – a glass
vacuum tube with a bulb of mercury at one end.
Measuring body temperature
To make an inference from a temperature reading to
illness, we rely on:
– a theory of thermal equilibrium via conduction;
– the proportionality of mercury expansion to a
conceptual temperature scale;
– the relationship between mouth and core body
temperature;
– the relationship between core body temperature and
illness.
At each stage, error may intrude
• Thermal equilibrium may not have been reached (e.g.
thermometer removed too quickly).
• Expansion of mercury is also affected by other things (e.g. air
pressure).
• Mouth temperature may not reflect core body temperature
(e.g. after a hot cup of tea).
• Core body temperature does not vary with all illnesses, and
is not even completely stable in health.
Identify the sources of errors in measuring students’
attributes.
Test developer’s concerns –
• The quality of the test items
• How examinees will respond to the items when the
test is constructed
• A reliable and valid tool
A psychometrician generally uses psychometric
techniques to determine validity and reliability.
Construction of Test based on CTT
Table of Specification (Blue Print) for Class 5th Science

Content Area                  | Knowledge (40%) | Understanding (35%) | Application (25%) | Total
                              | SU MC MT TF     | SU MC MT TF         | SU MC MT TF       |
1) Food and Health            |  1  2  2  1     |  1  2  1  1         |  1  1  1  1       |  15
2) Plant Life                 |  0  2  2  1     |  0  1  2  1         |  1  0  1  1       |  12
3) Animal Life                |  1  1  2  1     |  1  0  1  2         |  0  1  1  1       |  12
4) Force, Work and Energy     |  2  2  1  2     |  1  2  1  1         |  1  1  1  1       |  16
5) Weight, Volume & Density   |  2  2  1  2     |  2  1  2  1         |  2  1  1  1       |  18
6) The Environment            |  2  1  2  1     |  1  1  1  2         |  1  2  1  0       |  15
7) The Rocks and Minerals     |  1  1  2  1     |  1  1  1  1         |  0  1  1  1       |  12
Total                         |  9 11 12  9     |  7  8  9  9         |  6  7  7  6       | 100

SU = Supply Type, MC = Multiple Choice, MT = Matching, TF = True/False
(Entries under each instructional objective are the numbers of items of each type.)
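A quick, hedged way to sanity-check a blueprint like the one above is to total the rows and columns and compare the weight each instructional objective actually receives with its target percentage. The Python sketch below uses the counts from the table; the dictionary layout and variable names are illustrative, not part of the original blueprint.

```python
# Verify the Class 5 Science blueprint: grand total and how closely each
# instructional objective matches its target weight.
blueprint = {
    "Food and Health":          [1, 2, 2, 1,  1, 2, 1, 1,  1, 1, 1, 1],
    "Plant Life":               [0, 2, 2, 1,  0, 1, 2, 1,  1, 0, 1, 1],
    "Animal Life":              [1, 1, 2, 1,  1, 0, 1, 2,  0, 1, 1, 1],
    "Force, Work and Energy":   [2, 2, 1, 2,  1, 2, 1, 1,  1, 1, 1, 1],
    "Weight, Volume & Density": [2, 2, 1, 2,  2, 1, 2, 1,  2, 1, 1, 1],
    "The Environment":          [2, 1, 2, 1,  1, 1, 1, 2,  1, 2, 1, 0],
    "Rocks and Minerals":       [1, 1, 2, 1,  1, 1, 1, 1,  0, 1, 1, 1],
}
targets = {"Knowledge": 40, "Understanding": 35, "Application": 25}

grand_total = sum(sum(items) for items in blueprint.values())
print("Grand total of items:", grand_total)          # 100

# Columns 0-3 = Knowledge, 4-7 = Understanding, 8-11 = Application
for i, (objective, target) in enumerate(targets.items()):
    count = sum(sum(items[4 * i: 4 * i + 4]) for items in blueprint.values())
    print(f"{objective}: {count} items ({count / grand_total:.0%}, target {target}%)")
```

Running this gives Knowledge 41 items, Understanding 33, and Application 26: close to, but not exactly, the 40/35/25 targets, which is expected when item counts must be whole numbers.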
Examples
1) The instrument used to measure earthquakes is
known as –
(a) Seismograph
(b) Quake meter
(c) Barometer
(d) None of the above
2) How many seismograph stations are needed to
locate the epicentre of an earthquake?
(a) 2
(b) 3
(c) 4
(d) 5
3) In which situation do spring tides occur?
(a) The moon, sun, and earth are at a right angle,
with the earth at the apex
(b) The moon is farthest from the earth
(c) The sun is closest to the earth
(d) The moon, sun, and earth are in the same line
4) The Tropic of Cancer passes through
(a) India and Iran
(b) Iran and Pakistan
(c) India and Saudi Arabia
(d) Iran and Iraq
1) The area of a semicircle is
(a) πR²
(b) πR²/2
(c) 2πR
(d) πR
2) In the circle given below, if
AB is the diameter and angle a = 30°,
then the value of angle b
will be
(a) 45°
(b) 60°
(c) 90°
(d) 55°
High Score Achievers and Low Score Achievers
Upper group: the top 27% of examinees by total score
Lower group: the bottom 27% of examinees by total score
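For item analysis, the upper and lower groups are usually formed from the top and bottom 27% of examinees ranked by total test score. A minimal Python sketch of this split is shown below; the scores are hypothetical.

```python
# Split examinees into upper and lower 27% groups by total test score.
scores = {"S1": 85, "S2": 69, "S3": 48, "S4": 82, "S5": 39,
          "S6": 45, "S7": 78, "S8": 60, "S9": 72, "S10": 55}  # hypothetical data

ranked = sorted(scores, key=scores.get, reverse=True)  # highest scorers first
k = max(1, round(0.27 * len(ranked)))                  # size of each group

upper_group = ranked[:k]    # high score achievers
lower_group = ranked[-k:]   # low score achievers
print("Upper 27%:", upper_group)
print("Lower 27%:", lower_group)
```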
Difficulty of an Item
The difficulty of an item is understood as the
proportion of persons who answer a test item
correctly.
– The higher this proportion, the lower the
difficulty (the easier the item).
– The greater the difficulty of an item, the lower its
index.
Index of Difficulty
DI = (RU + RL) / T × 100
RU = the number in the upper group who answered correctly
RL = the number in the lower group who answered correctly
T = the total number who tried the item
Hypothetical Example-
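As a hypothetical worked example (the counts below are assumed, not taken from the slides), the index can be computed directly from the formula above:

```python
def difficulty_index(r_upper, r_lower, total_tried):
    """DI = (RU + RL) / T x 100, the percentage who answered correctly."""
    return (r_upper + r_lower) / total_tried * 100

# Hypothetical item: 20 of the upper group and 8 of the lower group
# answered correctly; 54 examinees tried the item.
di = difficulty_index(r_upper=20, r_lower=8, total_tried=54)
print(f"Difficulty index: {di:.1f}%")   # about 51.9% answered correctly
```

Per the definition above, a higher index means an easier item, because it is simply the proportion of correct answers.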
Discrimination
A good item should discriminate between those who score
high on the test and those who score low.
We would expect that –
- those having a high overall test score would have a high
probability of being able to answer the item.
- those having low test scores would have a low probability of
answering the item correctly.
The higher the discrimination index, the better the item can
determine the difference between those with high test scores and
those with low ones.
Formula for item discriminating power
Item discriminating power = (RU − RL) / (T/2)
Where
RU = students from the upper group who got the answer correct
RL = students from the lower group who got the answer correct
T/2 = half of the total number of pupils included in the item analysis
Hypothetical Example-
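Continuing the same hypothetical item (counts assumed for illustration), the discriminating power can be computed as follows:

```python
def discrimination_index(r_upper, r_lower, total_in_analysis):
    """Item discriminating power = (RU - RL) / (T/2)."""
    return (r_upper - r_lower) / (total_in_analysis / 2)

# Hypothetical item: 20 of the upper group and 8 of the lower group
# answered correctly; 54 pupils were included in the item analysis.
dp = discrimination_index(r_upper=20, r_lower=8, total_in_analysis=54)
print(f"Discrimination index: {dp:.2f}")   # 0.44: item separates high and low scorers
```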
Test Theories/Models –
Test theories/models fall into two families: Classical Test Theory (CTT) and Item Response Theory (IRT).
• Both theories enable us to predict the outcomes of psychological tests by identifying
parameters of item difficulty and the ability of test takers.
• Both are concerned with improving the reliability and validity of psychological
tests, and both approaches provide measures of validity and reliability.
Classical Test Theory
Classical Test Theory is used to predict an
individual’s latent trait based on an observed
total score on an instrument.
Continued-
• In CTT, the true score predicts the level of the
latent variable.
• The random error is normally distributed with
a mean of 0; its standard deviation equals the
standard error of measurement.
• The random errors are uncorrelated with each
other and are also uncorrelated with the true
scores.
Mathematical Model of CTT
Observed test scores (X) are composed of a true score (T) and an error
score (E); the true and the error scores are independent.
Charles Spearman's aim: reduce random error as much as possible, thereby making tests better.
This is illustrated in the formula X = T + E,
where
X = observed (total) score
T = true score
E = error score
The model was established by Spearman (1904) and formalized by Novick (1966).
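A small simulation can make the X = T + E model concrete: generate true scores, add independent random error, and observe that the observed-score variance is approximately the sum of true-score and error variance, with reliability estimated as Var(T)/Var(X). All numbers below are illustrative assumptions, not from the source.

```python
import random

random.seed(1)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Simulate X = T + E for 1000 examinees: true scores centred on 70,
# independent random error with mean 0.
T = [random.gauss(70, 10) for _ in range(1000)]   # true scores
E = [random.gauss(0, 5) for _ in range(1000)]     # random errors
X = [t + e for t, e in zip(T, E)]                 # observed scores

print("Var(X)            :", round(variance(X), 1))
print("Var(T) + Var(E)   :", round(variance(T) + variance(E), 1))   # approximately equal
print("Reliability Var(T)/Var(X):", round(variance(T) / variance(X), 2))
```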
Classical Test Theory
Classical test theory (CTT) in psychometrics
is, above all, about reliability.
• Reliability refers to how consistent a test or measure is.
• In CTT there are three base terms: the test (observed) score, the error,
and the true score.
• Example: you take a math exam and score 85.
• Test (observed) score = 85.
• Error = noise:
- a mistake in the test itself, or
- external or environmental factors that are not fully
controlled but that affect testing.
But psychometrics assumes that everyone has, in theory, a
true score.
- We can estimate this true score with an equation.
Discussion questions: Does the true score reflect
true ability? Why does the score vary
without any intervention?
Standard error of measurement
Sm = S √(1 - r)
where S is the standard deviation of the test scores and r is the test's reliability.
• Sm is the standard deviation of the distribution of random errors for each individual.
• The larger the standard error of measurement, the less certain the accuracy of a score.
• The smaller the standard error of measurement, the higher the accuracy: the individual's
observed score is probably close to the true score.
Use of the standard error of measurement –
• to create confidence intervals around specific observed scores;
• the lower and upper bounds of the confidence interval approximate the value of the
true score.
Discussion questions:
• Will the distribution of random errors be the same for all individuals?
• Why do scores vary over different administrations to the same subjects?
• Is error not also due to item characteristics, administration, environment, and the nature of
the tool?
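The sketch below illustrates the formula and the confidence-interval use described above; the test standard deviation, reliability, and observed score are assumed values, and 1.96 corresponds to a conventional 95% interval.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: Sm = S * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD = 10, reliability r = 0.91, observed score = 85.
s_m = sem(sd=10, reliability=0.91)
observed = 85
lower, upper = observed - 1.96 * s_m, observed + 1.96 * s_m

print(f"SEM = {s_m:.1f}")                               # 3.0
print(f"95% CI for the true score: {lower:.1f} to {upper:.1f}")   # about 79.1 to 90.9
```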
Error Distribution
Student | Obtained Score | True Score | Error
   1    |       85       |     80     |  +5
   2    |       69       |     72     |  -3
   3    |       48       |     45     |  +3
   4    |       82       |     85     |  -3
   5    |       39       |     43     |  -4
   6    |       45       |     41     |  +4
   7    |       78       |     79     |  -1
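As a quick check on the table above, the error column is simply Obtained minus True, and these illustrative errors average close to zero, consistent with the CTT assumption that random error has a mean of 0:

```python
# Obtained and true scores from the illustrative table above.
obtained = [85, 69, 48, 82, 39, 45, 78]
true     = [80, 72, 45, 85, 43, 41, 79]

errors = [o - t for o, t in zip(obtained, true)]   # +5, -3, +3, -3, -4, +4, -1
mean_error = sum(errors) / len(errors)

print("Errors:", errors)
print(f"Mean error: {mean_error:+.2f}")            # close to 0
```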
Assumption of Classical Test Theory
• Varying responses of examinees are due only to
variation in ability of interest.
• All other potential sources of variation existing in
the testing materials such as external conditions
or internal conditions of examinees are assumed
to be constant.
Continued-
• Each individual has a true score which would be
obtained if there were no errors in measurement.
• The difference between the true score and the observed
test score results from measurement error.
• Error is often assumed to be a random variable having a
normal distribution.
• Tests are fallible, imprecise tools. The true score for an
individual will not change with repeated applications of
the same test.
Shortcomings of CTT
• Examinee characteristics and test characteristics cannot be separated:
each can only be interpreted in the context of the other.
• Reliability is defined as "the correlation between test scores on parallel forms of a test".
Because there are differing opinions about what parallel tests are, reliability coefficients
provide either lower-bound estimates of reliability or reliability estimates with
unknown biases.
• Standard error of measurement:
the standard error of measurement is assumed to be the same for all
examinees, regardless of ability.
• Measurement accuracy and attribute level:
a common estimate of measurement precision is assumed to apply equally
to all individuals, irrespective of their attribute levels.
• CTT is test oriented rather than item oriented:
it cannot help us predict how well an individual, or even a group
of examinees, might do on a single test item.
Limitations of CTT
Sample dependence –
The focus of the analysis is:
 the total test score;
 the frequency of correct responses (to indicate question difficulty);
 the frequency of responses (to examine distracters);
 the reliability of the test and the item-total correlation (to evaluate discrimination
at the item level).
• One limitation is that these analyses relate to the sample under scrutiny, so all
the statistics that describe items and questions are sample dependent.
This critique may not be particularly relevant where successive samples are
reasonably representative and do not vary across time, but this needs to
be confirmed, and complex strategies have been proposed to overcome this
limitation.
CTT: Limitations
• Item analysis from the CTT perspective "is
essentially sample-based descriptive statistics".
- This means that, for example, item difficulty (DV) and
discriminating power (DP) values are representative only
of the specific sample of examinees from which they were
calculated,
- so making generalizations across
different groups of examinees, or across
different test formats, may not be possible.
Need for more complex analytic
approaches
More complex assessment situations include:
• measuring test-taker performance at different points in time (pre/post);
• using different test forms;
• using items of different difficulty;
• having different raters assign scores;
• scoring different elements of a performance exam.
CTT VS. IRT
CTT:
• The test is the unit of analysis.
• Measures with more items (longer) are more reliable
than their counterparts.
• Comparing scores from different measures can only be
done when the test forms/measures are parallel.
• Item properties depend on a representative sample.
IRT:
• The item is the unit of analysis.
• Measures with fewer items (shorter) can be more
reliable than their counterparts.
• Item responses from different measures can be compared
as long as they measure the same latent trait.
• Item properties do not depend on a representative
sample.
CTT VS. IRT
CTT:
• A person's position on the latent trait continuum is
derived by comparing the test score with the score of a
reference group.
• All items on the measure must have the same response
categories.
IRT:
• A person's position on the latent trait continuum is
derived by comparing the distances between items on
the ability scale.
• Items on a measure can have different response
categories.
Thanks to All