1. Fundamentals of Classical Test Theory
(CTT)
Dr. Jai Singh
National Accreditation Board for Education and
Training (NABET) -QCI
2. Objectives
• To understand constructs and latent traits.
• To know how latent traits are measured.
• To understand the terminology of test construction.
• To know the fundamentals of CTT.
• To know the assumptions of CTT.
• To critically evaluate the use of CTT and its limitations.
3. Constructs and Measures
Constructs are theoretical terms that refer to unobserved, idealized entities.
A construct's height, weight, or depth cannot be measured, because
constructs are not concrete objects in the visible world.
In psychology, a construct is any complex psychological concept: a skill,
attribute, or ability grounded in one or more established theories.
Constructs exist in the human mind and are not directly observable.
In psychology and cognitive science, constructs include motivation,
intelligence, anxiety, fear, anger, personality, love, attachment, memory,
creativity, learning outcomes, and attention.
Measures are the observations used in science to learn about constructs.
These include things like reaction times, accuracy scores, and response
frequencies.
4. Latent Traits
Latent traits are a specific kind of construct.
– Relatively stable qualities of individuals that are
changeable, but only over the long term.
• Transient things, such as “attention,” are not traits.
– Latent traits include everyday things like attitudes,
personality, preferences, and dispositions (e.g.,
“talkative”).
– Latent traits also include many kinds of things that
educators are interested in:
• Ability, aptitude, creativity, expertise, learning
outcomes and intelligence.
5. Measuring Latent Traits
It is important to recognize that no single measure of a
latent trait is ever taken to be a perfectly accurate
measure of that trait.
– Instead, different kinds of “measures” or “tests” are
seen as “tapping into” the latent trait.
– Different measures may “tap into” a latent trait in
different ways, capturing some aspects of the trait
better than others.
– Multiple measures can provide “converging”
evidence.
6. Psychometric Tests
• Psychometric tests are standardized tests, and
they are designed to assess a particular
variable.
• Psychometric tests- scientific and systematic
ways to test someone's ability to do a job or
measure their personality or some mental
ability (like math achievement, learning
outcomes-language outcomes etc.).
• Psychometrics means the study of developing
measurements.
7. Measuring body temperature
• Using temperature to indicate illness.
• Measurement tool: a mercury thermometer - a glass vacuum tube
with a bulb of mercury at one end.
8. Measuring body temperature
To infer illness from a temperature reading, we rely on:
– a theory of thermal equilibrium via conduction;
– the proportionality of mercury density to a conceptual
temperature scale;
– the relationship between mouth and core body temperature;
– the relationship between core body temperature and illness.
9. At each stage, error may intrude
• Thermal equilibrium may not have been reached (e.g. the
thermometer was removed too quickly).
• Expansion of mercury is also affected by other things (e.g. air
pressure).
• Mouth temperature may not reflect core body temperature
(e.g. after a hot cup of tea).
• Core body temperature does not vary with all illnesses, and
is not even completely stable in health.
Exercise: identify the sources of error in measuring students'
attributes.
10. Test developer's concerns
• The quality of test items.
• How examinees respond to the items when constructing tests.
• Building a reliable and valid tool.
A psychometrician generally uses psychometric techniques to
determine validity and reliability.
11. Construction of a Test based on CTT
Table of Specification (Blueprint) for Class 5 Science

Instructional objectives: Knowledge 40%, Understanding 35%, Application 25%
Item types: SU = Supply Type, MC = Multiple Choice, MT = Matching, TF = True/False

Content Area                  |  Knowledge  | Understanding | Application | Total
                              | SU MC MT TF |  SU MC MT TF  | SU MC MT TF |
1) Food and Health            |  1  2  2  1 |   1  2  1  1  |  1  1  1  1 |  15
2) Plant Life                 |  0  2  2  1 |   0  1  2  1  |  1  0  1  1 |  12
3) Animal Life                |  1  1  2  1 |   1  0  1  2  |  0  1  1  1 |  12
4) Force, Work and Energy     |  2  2  1  2 |   1  2  1  1  |  1  1  1  1 |  16
5) Weight, Volume and Density |  2  2  1  2 |   2  1  2  1  |  2  1  1  1 |  18
6) The Environment            |  2  1  2  1 |   1  1  1  2  |  1  2  1  0 |  15
7) Rocks and Minerals         |  1  1  2  1 |   1  1  1  1  |  0  1  1  1 |  12
Total                         |  9 11 12  9 |   7  8  9  9  |  6  7  7  6 | 100
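A blueprint like the one above should be internally consistent: each row should sum to its stated total, and the cell counts to the grand total. A minimal Python sketch (using the cell values from the table, in the order SU, MC, MT, TF for Knowledge, then Understanding, then Application) to verify this:

```python
# Checking the internal consistency of a table of specification.
# Each row lists 12 cell counts: SU, MC, MT, TF for Knowledge,
# then Understanding, then Application (values from the blueprint above).
blueprint = {
    "Food and Health":            [1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1],
    "Plant Life":                 [0, 2, 2, 1, 0, 1, 2, 1, 1, 0, 1, 1],
    "Animal Life":                [1, 1, 2, 1, 1, 0, 1, 2, 0, 1, 1, 1],
    "Force, Work and Energy":     [2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1],
    "Weight, Volume and Density": [2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1],
    "The Environment":            [2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 0],
    "Rocks and Minerals":         [1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1],
}
row_totals = {topic: sum(cells) for topic, cells in blueprint.items()}
grand_total = sum(row_totals.values())
print(row_totals["Weight, Volume and Density"])  # 18, matching the table
print(grand_total)                               # 100, matching the table
```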
12. Examples
1) The instrument used to measure earthquakes is known as:
(a) Seismograph
(b) Quake meter
(c) Barometer
(d) None of the above
2) How many seismograph stations are needed to locate the
epicentre of an earthquake?
(a) 2
(b) 3
(c) 4
(d) 5
3) In which situation can spring tides occur?
(a) The moon, sun, and earth are at a right angle, with the
earth at the apex
(b) The moon is farthest from the earth
(c) The sun is closest to the earth
(d) The moon, sun, and earth are in the same line
4) The Tropic of Cancer passes through
(a) India and Iran
(b) Iran and Pakistan
(c) India and Saudi Arabia
(d) Iran and Iraq
1) The area of a semicircle is
(a) πR²
(b) πR²/2
(c) 2πR
(d) πR
2) In the circle given below, if AB is a diameter and
angle a = 30°, then the value of angle b will be
(a) 45°
(b) 60°
(c) 90°
(d) 55°
15. Difficulty of an Item
The difficulty of an item is understood as the proportion of
persons who answer the item correctly.
– The higher this proportion, the lower the difficulty.
– The greater the difficulty of an item, the lower its index.
16. Index of Difficulty
DI= RU+RL x 100
T
RU= The number in upper group who answered correctly
RL=The number in lower group who answered correctly
T= The total number who tried the item
Hypothetical Example-
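A minimal Python sketch of the difficulty index, using made-up numbers (the group sizes and counts below are illustrative, not from the source):

```python
def difficulty_index(r_upper, r_lower, total_tried):
    """Index of difficulty: percentage of examinees who answered correctly.

    DI = ((RU + RL) / T) * 100
    """
    return (r_upper + r_lower) / total_tried * 100

# Hypothetical item: 40 examinees tried it; 15 in the upper group
# and 9 in the lower group answered correctly.
print(difficulty_index(15, 9, 40))  # 60.0 (a moderately easy item)
```

Note that a higher DI means an easier item: an item everyone answers correctly scores 100, one nobody answers scores 0.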
17. Discrimination
A good item should discriminate between those who score high on
the test and those who score low. We would expect that:
– those having a high overall test score would have a high
probability of answering the item correctly;
– those having low test scores would have a low probability of
answering the item correctly.
The higher the discrimination index, the better the item can
distinguish between those with high test scores and those with
low ones.
18. Formula for item discriminating power
Item discriminating power =
RU-RL
T/2
Where
RU= Students from upper group who got the answer correct.
RL= Students from lower group who got the answer correct.
T/2 = half of the total number of pupils included in the item analysis.
Hypothetical Example-
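A companion Python sketch for discriminating power, again with made-up numbers (the counts below are illustrative):

```python
def discriminating_power(r_upper, r_lower, total):
    """Item discriminating power: (RU - RL) / (T / 2).

    Positive values mean the upper group outperformed the lower group
    on the item; values near 0 mean the item does not discriminate.
    """
    return (r_upper - r_lower) / (total / 2)

# Hypothetical item analysis with 40 pupils in total:
# 15 correct in the upper group, 9 correct in the lower group.
print(discriminating_power(15, 9, 40))  # 0.3
# A perfectly discriminating item: all of the upper group correct,
# none of the lower group.
print(discriminating_power(20, 0, 40))  # 1.0
```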
19. Test Theories/Models
Classical Test Theory (CTT)
Item Response Theory (IRT)
• Both theories enable us to predict outcomes of psychological tests by
identifying parameters of item difficulty and the ability of test takers.
• Both are concerned with improving the reliability and validity of
psychological tests, and both provide measures of validity and reliability.
20. Classical Test Theory
Classical Test Theory is used to predict an
individual’s latent trait based on an observed
total score on an instrument.
21. Continued-
• In CTT, the true score predicts the level of the
latent variable.
• The random error is normally distributed with
a mean of 0 and a SD of 1.
• The random errors are uncorrelated with each
other and also are uncorrelated to the true
scores.
22. Mathematical Model of CTT
Observed test scores (X) are composed of a true score (T) and an error
score (E)
-the true and the error scores are independent.
Charles Spearman- reduce random error as much as possible, thereby making tests better.
Illustrated in the formula: X = T + E.
Where –
X= Total Score
T=True Score
E=Error Score
The variables are established by Spearman (1904) and Novick (1966)
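The model X = T + E can be illustrated by simulation. In this sketch (the true score, error spread, and number of administrations are all hypothetical), a fixed true score is repeatedly disturbed by independent, zero-mean random error; the observed scores scatter around the true score, and their average converges toward it:

```python
import random

# Simulating X = T + E for one examinee under CTT assumptions:
# T is fixed; each administration adds independent random error E
# drawn from a normal distribution with mean 0.
random.seed(42)                 # for reproducibility
T = 80                          # hypothetical true score
n_administrations = 1000        # hypothetical repeated administrations
errors = [random.gauss(0, 3) for _ in range(n_administrations)]
observed = [T + e for e in errors]

mean_X = sum(observed) / len(observed)
print(min(observed) < T < max(observed))  # True: X scatters around T
print(abs(mean_X - T) < 1)                # True: errors average out
```

This is exactly why the true score is defined, conceptually, as the expected value of the observed score over infinitely many administrations.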
23. Classical Test Theory
Classical test theory (CTT) in psychometrics is all about reliability.
• Reliability refers to how consistent a test or measure is.
• CTT rests on three basic terms: the test (observed) score, error,
and the true score.
• Example: you take a maths exam and get an 85.
• Test (observed) score: 85.
• Error: "noise" - a mistake within the test itself, or factors in the
external environment that are not fully controlled but affect
testing.
But psychometrics assumes everyone has, in theory, a true score,
and we can estimate this true score with an equation.
Discussion: Does the true score reflect true ability? Why does the
true score vary without intervention?
24. Standard error of measurement
Sm = S √1 - r .
The standard deviation of the distribution of random errors for each individual
standard error of measurement-larger- the less certain is the accuracy
standard error of measurement-small- high accuracy- individual score is probably
close to the true score.
Use of Standard error of measurement –
create confidence intervals around specific observed scores
The lower and upper bound of the confidence interval approximate the value of the
true score.
Will distribution of random errors be the same for all individuals - ?
Why score vary over different administration on subjects ?
Is not error due to item characteristics, administration , environment, and nature of
tool ?
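A short Python sketch of the SEM and a confidence interval around an observed score. The numbers are hypothetical (SD = 10, reliability r = 0.91, observed score 85, the score from the earlier maths-exam example); the z value 1.96 gives an approximate 95% interval under a normal error distribution:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: Sm = S * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed, sd, reliability, z=1.96):
    """Approximate 95% confidence band around an observed score."""
    s_m = sem(sd, reliability)
    return observed - z * s_m, observed + z * s_m

# Hypothetical test: SD of scores = 10, reliability r = 0.91.
print(round(sem(10, 0.91), 1))       # 3.0
low, high = confidence_interval(85, 10, 0.91)
print(round(low, 1), round(high, 1))  # 79.1 90.9
```

Note how a more reliable test shrinks the interval: with r = 1 the SEM is 0 and the observed score equals the true score; with r = 0 the SEM equals the full standard deviation of the test.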
26. Assumption of Classical Test Theory
• Varying responses of examinees are due only to
variation in ability of interest.
• All other potential sources of variation existing in
the testing materials such as external conditions
or internal conditions of examinees are assumed
to be constant.
27. Continued-
• Each individual has a true score which would be
obtained if there were no errors in measurement.
• The difference between the true score and the observed
test score results from measurement error.
• Error is often assumed to be a random variable having a
normal distribution.
• Tests are fallible, imprecise tools. The true score for an
individual will not change with repeated applications of
the same test.
28. Shortcomings of CTT
• Examinee characteristics and test characteristics cannot be separated:
each can only be interpreted in the context of the other.
• Reliability is "the correlation between test scores on parallel forms of a
test". Because there are differing opinions of what parallel tests are,
reliability coefficients provide either lower-bound estimates of
reliability or reliability estimates with unknown biases.
• Standard error of measurement: the standard error of measurement is
assumed to be the same for all examinees, whatever their ability.
• Measurement accuracy and attribute level: a common estimate of
measurement precision is assumed to apply equally to all individuals,
irrespective of their attribute levels.
• CTT is test oriented rather than item oriented: it cannot help us predict
how well an individual, or even a group of examinees, might do on a
single test item.
29. Limitations of CTT
Sample Dependent-
The focus of the analysis is –
total test score;
frequency of correct responses (to indicate question difficulty);
frequency of responses (to examine distracters);
reliability of the test and item-total correlation (to evaluate discrimination
at the item level)
• one limitation is that they relate to the sample under scrutiny and thus all
the statistics that describe items and questions are sample dependent
This critique may not be particularly relevant where successive samples are
reasonably representative and do not vary across time, but this will need to
be confirmed and complex strategies have been proposed to overcome this
limitation.
30. CTT: Limitations
• Item analysis from the CTT perspective "is essentially sample-based
descriptive statistics".
– This means that, for example, difficulty (DV) and discriminating
power (DP) values are only representative of the specific sample
of examinees from which they were calculated,
– so that making generalizations across different groups of
examinees, or across different test formats, may not be possible.
31. Need for more complex analytic approaches
More complex assessment situations include:
– measuring test-taker performance at different points in time
(pre/post);
– using different test forms;
– items of different difficulty;
– different raters assigning scores;
– different elements of a performance exam.
32. CTT vs. IRT
CTT:
• The test is the unit of analysis.
• Measures with more items (longer) are more reliable than their
counterparts.
• Comparing scores from different measures can only be done when
the test forms/measures are parallel.
• Item properties depend on a representative sample.
IRT:
• The item is the unit of analysis.
• Measures with fewer items (shorter) can be more reliable than
their counterparts.
• Item responses from different measures can be compared as long
as they measure the same latent trait.
• Item properties do not depend on a representative sample.
33. CTT vs. IRT
CTT:
• Position on the latent trait continuum is derived by comparing the
test score with the score of a reference group.
• All items on the measure must have the same response categories.
IRT:
• Position on the latent trait continuum is derived by comparing the
distance between items on the ability scale.
• Items on a measure can have different response categories.