1. Reliability in Language Testing
2. Introduction
A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure.
We must be concerned about errors of measurement (unreliability) because we know that test performance is affected by factors other than the abilities we want to measure.
When we minimize the effects of these various factors, we minimize measurement error and maximize reliability.
The investigation of reliability is concerned with answering the question: ‘How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?’
3. Introduction
In this chapter, we will discuss:
Measurement error in test scores,
The potential sources of this error,
The different approaches to estimating the relative
effects of these sources of error on test scores, and
The considerations to be made in determining which of
these approaches may be appropriate for a given testing
situation.
4. Factors that affect language test
scores
The examination of reliability depends upon
distinguishing the effects of the abilities we want to
measure from the effects of other factors.
If we wish to estimate how reliable our test scores
are:
we must begin with a set of definitions of the
abilities we want to measure, and of the other
factors that we expect to affect test scores.
6. Factors that affect language test scores
1. Communicative language ability: the specific abilities that determine
how an individual performs on a given test. Example: in a test of
sensitivity to register, the students who perform best and receive the
highest scores would be those with the highest level of sociolinguistic
competence.
2. Test method facets: testing environment, the test rubric, the nature
of input and expected response, the relationship between input and
response
3. Personal attributes: individual characteristics (cognitive style and
knowledge of particular content areas) - group characteristics (sex,
race, and ethnic background)
4. Random factors: unpredictable and largely temporary conditions
(mental alertness or emotional state) - uncontrolled differences in test
method facets (changes in the test environment from one day to the
next or differences in the way different test administrators carry out
their responsibilities)
7. Factors that affect
language test scores
1. The primary interest in using language tests is to make
inferences about one or more components of an
individual’s communicative language ability.
2. Random factors and test method facets are generally
considered to be sources of measurement error, and are
therefore discussed in this chapter (reliability).
3. Personal attributes (i.e. sex, ethnic background,
cognitive style and prior knowledge of content area) are
discussed as sources of test bias, or test invalidity, and
these will therefore be discussed in Chapter 7 (validity)
8. Theories and models of
reliability
Any factors other than the ability being tested that affect
test scores are potential sources of error that decrease
the reliability of scores.
It is essential that we be able to identify these sources of error
and estimate the magnitude of their effect on test scores.
In what follows, we consider how different theories and models
define the various influences on test scores.
9. 1. Classical True Score
Measurement Theory
Classical true score (CTS) measurement theory consists
of a set of assumptions about the relationships between
true or observed test scores and the factors that affect
these scores.
Reliability is defined in the CTS theory in terms of true
score variance.
True score: due to an individual’s level of ability / Error
score: due to factors other than the ability being tested
observed score (actual test score) = true score + error score
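This decomposition can be illustrated with a short simulation (all numbers hypothetical, assuming normally distributed true and error scores): reliability then falls out as the proportion of observed-score variance that is true-score variance.

```python
import random

random.seed(42)

# Simulate 1,000 test takers: observed = true + error,
# with error unrelated to the ability being measured.
true_scores = [random.gauss(50, 10) for _ in range(1000)]
error_scores = [random.gauss(0, 5) for _ in range(1000)]
observed_scores = [t + e for t, e in zip(true_scores, error_scores)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability: proportion of observed-score variance due to true scores.
# With these settings it should land near 100 / (100 + 25) = 0.8.
reliability = variance(true_scores) / variance(observed_scores)
```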
10. 1. Classical True Score
Measurement Theory
Since we can never know the true scores of individuals, we
can never know what the reliability is, but can estimate it
from the observed scores.
The basis for all such estimates in the CTS model is the
correlation between parallel tests.
Parallel tests: in order for two tests to be considered parallel,
they must be measures of the same ability (equivalent, or
alternate, forms).
If the observed scores on two parallel tests are highly
correlated, these tests can be considered reliable
indicators of the ability being measured.
11. 1. Classical True Score
Measurement Theory
Within the CTS model there are 3 approaches to estimating
reliability, each of which addresses different sources of error:
a. Internal consistency estimates are concerned with sources of
error such as differences in test-tasks and item formats,
inconsistencies within and among scorers.
b. Stability estimates indicate how consistent test scores are
over time
c. Equivalence estimates provide an indication of the extent to
which scores on alternate forms of a test are equivalent.
The estimates of reliability that these approaches yield are
called reliability coefficients.
12. 1. Classical True Score
Measurement Theory
Internal consistency is concerned with how consistent test
takers’ performances on different parts of the test are with
each other
Two approaches to estimating internal consistency
-an estimate based on correlation between two halves (the
Spearman-Brown split-half estimate)
-estimates which are based on ratios of the variances of
parts of the test – halves or items – to total test score
variance (the Guttman split-half, the Kuder-Richardson
formulae, and coefficient alpha)
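Both families of estimates can be sketched on a small, made-up item-response matrix (all data hypothetical): the Spearman-Brown split-half corrects the half-test correlation upward, while coefficient alpha works from the ratio of item variances to total score variance.

```python
# Hypothetical item-response matrix: 6 test takers x 4 dichotomous items
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (variance(xs) ** 0.5 * variance(ys) ** 0.5)

# Spearman-Brown split-half: correlate odd and even halves,
# then correct upward to full test length.
odd = [row[0] + row[2] for row in scores]
even = [row[1] + row[3] for row in scores]
r_halves = pearson(odd, even)
split_half = 2 * r_halves / (1 + r_halves)

# Coefficient alpha: based on the ratio of summed item variances
# to total-score variance.
k = len(scores[0])
totals = [sum(row) for row in scores]
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))
```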
13. 1. Classical True Score
Measurement Theory
Rater consistency: In test scores that are obtained
subjectively (ratings of compositions or oral interviews) a
source of error is inconsistency in these ratings.
Intra-rater reliability: In order to examine the reliability of
ratings of a single rater, at least two independent ratings
from this rater are obtained. This is accomplished by rating
the individual samples once and then re-rating them at a later
time in a different, random order.
Inter-rater reliability: involves two different raters. In examining
inter-rater consistency, one rating from each rater is obtained,
and the two sets of ratings are correlated.
14. 1. Classical True Score
Measurement Theory
Stability (test-retest reliability): In this approach, we
administer the test twice to a group of individuals and then
compute the correlation between the two sets of scores.
This correlation can then be interpreted as an indication of
how stable the scores are over time.
Equivalence (parallel forms reliability): In this approach,
we try to estimate the reliability of alternate forms of a
given test by administering both forms to a group of
individuals. The correlation between the two sets of
scores can then be computed.
15. 2. Generalizability theory
(G-theory)
Generalizability theory is an extension of the classical model.
It enables test developers to examine several sources of
variance simultaneously, and to distinguish systematic
from random error.
Firstly, the test developer designs and conducts a study to
investigate the sources of the variances (G-study).
Depending on the outcome of this G-study, the test developer
may revise the test or the procedures for administering it, or if
the results are satisfactory, the test developer proceeds to the
second stage, a decision study (D-study).
16. 2. Generalizability theory
(G-theory)
In a D-study, the test developer administers the test under
operational conditions, in which the test will be used to make
the decisions for which it is designed. Then, the test developer
uses G-theory procedures to estimate the magnitude of the
variance components.
Terms related to generalizability theory
Universe of generalization: the domain of uses or abilities (or
both) to which we want test scores to generalize.
Universe of measures: the types of test scores we would be
willing to accept as indicators of the ability to be measured.
17. 2. Generalizability theory
(G-theory)
Terms related to generalizability theory
Populations of persons: the group about whom we are going
to make decisions or inferences
Universe score: the mean of a person’s scores on all measures
from the universe of possible measures (similar to CTS-theory
true score)
This conceptualization of generalizability reveals that a given
estimate of generalizability is limited to the specific universe
of measures and population of persons within which it is
defined, and that a test score that is ‘True’ for all persons,
times, and places simply does not exist.
18. 2. Generalizability theory
(G-theory)
Generalizability coefficients: The G-theory analog of the
CTS-theory reliability coefficient is the generalizability
coefficient:

generalizability coefficient = universe score variance / observed score variance
Estimation: In order to estimate the relative effect of
different sources of variance on the observed scores, it is
necessary to obtain multiple measures for each person
under the different conditions for each facet
19. 2. Generalizability theory
(G-theory)
Estimation: One statistical procedure that can be used for
estimating the relative effects of different sources of variance on
test scores is analysis of variance (ANOVA).
Example: An oral interview: with different question forms, or sets
of questions, and different interviewer/raters
Using ANOVA, we could obtain estimates for all the variance
components in the design: (1) the main effects for persons,
raters, and forms; (2) the two-way interactions between persons
and raters, persons and forms, and forms and raters; and (3) a
component that contains the three-way interaction among
persons, raters, and forms, as well as the random variance.
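To make the logic concrete, here is a minimal sketch with hypothetical data for a simpler, fully crossed persons x raters design (one facet rather than the two in the interview example): ANOVA mean squares are converted into variance-component estimates, which then yield a generalizability coefficient.

```python
# Hypothetical ratings: 4 persons (rows) rated by 3 raters (columns)
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]
n_p, n_r = len(ratings), len(ratings[0])

grand = sum(sum(row) for row in ratings) / (n_p * n_r)
person_means = [sum(row) / n_r for row in ratings]
rater_means = [sum(ratings[p][r] for p in range(n_p)) / n_p
               for r in range(n_r)]

# Sums of squares for a fully crossed persons x raters design
ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
ss_tot = sum((ratings[p][r] - grand) ** 2
             for p in range(n_p) for r in range(n_r))
ss_res = ss_tot - ss_p - ss_r  # person-by-rater interaction + random error

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# Variance-component estimates from the expected mean squares
var_res = ms_res
var_p = (ms_p - ms_res) / n_r
var_r = (ms_r - ms_res) / n_p

# Generalizability coefficient for the mean of n_r raters
# (relative decisions): person variance over person-plus-error variance
g_coef = var_p / (var_p + var_res / n_r)
```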
20. 3. Standard Error of Measurement
(SEM)
The approaches to estimating reliability that have been developed
within both CTS theory and G-theory are based on group
performance, and provide information for test developers and test
users about how consistent the scores of groups are on a given test.
Reliability and generalizability coefficients provide no direct
information about the accuracy of individual test scores.
We therefore need an indicator of how much we would expect an
individual’s test scores to vary.
The most useful indicator for this purpose is called the standard
error of measurement.
The smaller the standard deviation of errors (the standard error
of measurement, SEM), the more reliable the test.
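A standard CTS estimate of the SEM (not spelled out in the slides) multiplies the standard deviation of observed scores by the square root of one minus the reliability:

```python
def sem(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * (1 - reliability) ** 0.5

# Hypothetical values: with a score standard deviation of 10,
# the more reliable test has the smaller SEM.
high_rel_sem = sem(10, 0.91)  # about 3 score points
low_rel_sem = sem(10, 0.64)   # about 6 score points
```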
21. 4. Item-response theory
Because of the limitations in CTS-theory and G-theory,
psychometricians have developed a number of
mathematical models for relating an individual’s test
performance to that individual’s level of ability.
Item response theory presents a more powerful
approach in that it can provide sample-free estimates of
individuals’ true scores, or ability levels, as well as
sample-free estimates of measurement error at each
ability level.
22. 4. Item-response theory
The unidimensionality assumption: Most IRT models
make the specific assumption that the items in a test
measure a single, unidimensional ability or trait, and
that the items form a unidimensional scale of
measurement.
Item characteristic curve: Each specific IRT model makes
specific assumptions about the relationship between the
test taker’s ability and his or her performance on a given item.
These assumptions are explicitly stated in the mathematical
formula for the item characteristic curve (ICC).
23. 4. Item-response theory
Ability score: Recall that neither CTS theory nor G-theory
provides an estimation of an individual’s level of ability. One
of the advantages of IRT is that it provides estimates of
individual test takers’ levels of ability.
Precision of measurement: Precision of measurement is
addressed in the IRT concept of the item information function,
which refers to the amount of information a given item
provides for estimating an individual’s level of ability. The test
information function, on the other hand, is the sum of
the item information functions, each of which contributes
independently to the total, and is a measure of how much
information a test provides at different ability levels.
24. Reliability of criterion-
referenced test score
Norm-referenced (NR) test scores are most useful in situations in
which comparative decisions are made, such as the selection of
individuals for a program. Criterion-referenced (CR) test scores,
on the other hand, are more useful when making ‘absolute’ decisions
regarding mastery or nonmastery of the ability domain.
The concept of reliability applies to two aspects of criterion-
referenced tests:
- the accuracy of the obtained score as an indicator of a ‘domain’
score (J. D. Brown (1989) has derived a formula)
- the consistency of the decisions that are based on CR test
scores (Threshold loss agreement indices - Squared-error loss
agreement indices)
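Decision consistency can be sketched as the proportion of identical master/nonmaster classifications across two administrations, optionally corrected for chance agreement with Cohen's kappa (the scores and cut-off below are hypothetical; the threshold loss indices in the literature differ in detail):

```python
# Hypothetical scores (out of 20) on two administrations,
# classified as master (1) or nonmaster (0) at a cut-off of 14
cut = 14
admin1 = [16, 12, 18, 9, 15, 13, 17, 11]
admin2 = [15, 13, 17, 10, 16, 14, 18, 10]
c1 = [int(s >= cut) for s in admin1]
c2 = [int(s >= cut) for s in admin2]
n = len(c1)

# Threshold loss: proportion of consistent master/nonmaster decisions
p_agree = sum(a == b for a, b in zip(c1, c2)) / n

# Chance-corrected agreement (Cohen's kappa)
p1, p2 = sum(c1) / n, sum(c2) / n
p_chance = p1 * p2 + (1 - p1) * (1 - p2)
kappa = (p_agree - p_chance) / (1 - p_chance)
```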
25. Factors that affect
reliability estimates
Length of test: long tests are generally more reliable than
short ones
Difficulty of test and test score variance: the greater the
score variance, the more reliable the tests will tend to be
(Norm-referenced tests)
Cut-off score: the greater the differences between the
cut-off score and the mean score, the greater will be the
reliability (Criterion-referenced tests).
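The test-length effect above is usually quantified with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k:

```python
def lengthened_reliability(r, k):
    """Spearman-Brown prophecy: reliability of a test lengthened k times."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test with reliability .60 raises the estimate;
# shortening the test (k < 1) lowers it.
doubled = lengthened_reliability(0.60, 2)   # about .75
halved = lengthened_reliability(0.60, 0.5)  # about .43
```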
26. Systematic measurement
error
Systematic error is different from random error.
For example, if every form of a reading comprehension test
contained passages from the area of “economics”, then
the facet ‘passage content’ would be fixed to one
condition - economics.
To the extent that test scores are influenced by individuals’
familiarity with this particular content area, as opposed
to their reading comprehension ability, this facet will be a
source of error in our measurement of reading
comprehension. This is a kind of systematic error.
27. Systematic measurement
error
The effects of systematic error:
- The general effect of systematic error is constant for all
observations; it affects the scores of all individuals who take the
test.
- The specific effect varies across individuals; it affects different
individuals differentially
The effects of test method
Standardization of test method facets can introduce sources of
systematic variance into the test scores. When a single testing
technique is used (e.g. the cloze test), the test might be a better
indicator of individuals’ ability to take cloze tests than of their
reading comprehension ability.
28. Conclusion
Any factors other than the ability being tested that
affect test scores are potential sources of error that
decrease the reliability of scores.
Therefore, it is essential that we be able to identify
these sources of error and estimate the magnitude
of their effect on test scores.
Editor's Notes
1. A fundamental concern in the development and use of language tests is … 2. We must be concerned about errors of measurement, or unreliability, 4. The investigation of reliability is concerned with answering the question:
1. So, you see the outline. In this chapter, we will discuss …
Let’s start with the factors that affect language test scores. 1. Measurement specialists have long recognized that the examination of reliability depends upon distinguishing the effects (on test scores) of the abilities we want to measure from the effects of other factors. 2. That is, if we wish to estimate how reliable our test scores are, we must begin with a set of definitions of the abilities we want to measure, and of the other factors that we expect to affect test scores (Stanley 1971: 362).
The effects of these various factors on a test score can be illustrated as in this figure.
1. Communicative language ability REFERS TO the specific abilities that determine how an individual performs on a given test. In a test of sensitivity to register, for example, we would expect that the students who perform the best and receive the highest scores would be those with the highest level of sociolinguistic competence. 3. As for personal attributes, we can say that, Attributes of individuals that are not related to language ability include individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background. 4. Random factors on the other hand refer to the unpredictable and largely temporary conditions, such as mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next, or idiosyncratic differences in the way different test administrators carry out their responsibilities.
I would retell that our primary interest in using language tests is to make inferences about one or more components of an individual’s communicative language ability.
1. A fundamental concern in the development and use of language tests is that … 2. Therefore, it is essential that we be able to identify these sources of error and estimate the magnitude of their effect on test scores. 3. So, now, let’s talk about how…
4. So, Observed score, in other words, the actual test score, equals to the sum of true score and error score.
This means that if the observed scores on two tests…
2. Two approaches to estimating internal consistency can be discussed: On the one hand, we have an estimate based on correlation between two halves (the Spearman-Brown split-half estimate), on the other, estimates which are based on ratios of the variances of parts of the test – halves or items – to total test score variance (the Guttman split-half, the Kuder-Richardson formulae, and coefficient alpha).
There are two types of reliability estimates:
Equivalence: several alternate sets of interview questions.
It is called a ‘component’ because with this method we can estimate more than one effect.
Hypothetically…
When the universe score variance is divided by the observed score variance, we get the generalizability coefficient. How do we estimate the relative effect of different sources of variance in this theory?
To illustrate the logic of this procedure, consider our earlier example of an oral interview with different question forms, or sets of questions, and different interviewer/raters
2. however, 3. For this, we need 5. As an assumption, we can say that the smaller standard deviation of errors
Let’s talk about some specific concepts related to item response theory.
Precision: exactness, accuracy, sensitivity…
Until now, we have talked about the reliability of norm-referenced test scores.
Thus far, different sources of error have been discussed as the primary factors that affect the reliability of tests. In addition to these sources of error, there are general characteristics of tests and test scores that influence the size of our estimates of reliability.