2. INTRODUCTION
A fundamental concern in the development and use of language tests is to identify
potential sources of error in a given measure of communicative language ability and to
minimize the effect of these factors on that measure.
We must be concerned about errors of measurement, or unreliability, as we know that
test performance is affected by factors other than the abilities we want to measure.
Poor health, fatigue, lack of interest test method facets
or motivation, and test-wiseness …
Minimizing the effects of these factors Minimizing measurement error
maximizing reliability.
Unsystematic (unpredictable) Systematic
3. INTRODUCTION
A necessary condition for validity: in order for a test score to be valid, it must be
reliable.
Reliability and validity are not two distinct concepts; they are complementary
aspects of a common concern in measurement
Reliability answers the question: ‘How much of an individual’s test performance
is due to measurement error, or to factors 0ther than the language ability we want
to measure?’ and with minimizing the effects of these factors on test scores.
Validity answers the question: ‘How much of an individual’s test performance is
due to the language abilities we want to measure?’ and with maximizing the effects
of these abilities on test scores.
4. INTRODUCTION
The investigation of reliability involves both logical analysis and empirical
research.
We must identify sources of error and estimate the magnitude of their effects on
test scores.
To identify sources of error, we need to distinguish the effects of the language
abilities we want to measure from the effects of other factors, which is a complex
problem due to:
1. The interaction between components of language ability and test method facets
makes it difficult to mark a clear ‘boundary’ between the ability being measured and
the method facets → a particular topic of a conversational interaction an oral
interview.
2. Other characteristics, such as sex, age, cognitive style, and native language.
Estimating the magnitude of the effects of different sources of error, once these
sources have been identified, is a matter of empirical research, and is a major
concern of measurement theory.
5. FACTORS THAT AFFECT LANGUAGE TEST SCORES
Thorndike (1951) and Stanley (1971) begin their treatments of
reliability with
general frameworks for describing the factors that cause test
scores to vary from individual to individual:
general and specific lasting characteristics,
general and specific temporary characteristics,
and systematic and chance factors related to test
administration and scoring.
6. FACTORS THAT AFFECT LANGUAGE TEST SCORES
Factors that affect language test scores (Bachman 1990):
Note: In a ‘path diagram’
rectangles: observed variables,
ovals: unobserved variables,
straight arrows: causal relationships
Systematic Unsystematic
Test Score
Communicative
language ability
Personal
attributes
Test
method
facets
Random
factors
7. FACTORS THAT AFFECT LANGUAGE TEST SCORES
Test method facets are systematic to the extent that they are uniform from
one test administration to the next. That is, if the input format facet is multiple-
choice, this will not vary, whether the test is given in the morning or afternoon.
Attributes of individuals include individual characteristics such as cognitive style
and knowledge of particular content areas, and group characteristics such as
sex, race, and ethnic background. These are also systematic in the sense that they
are likely to affect a given individual’s test performance regularly.
An individual’s test score will be affected to some degree by unsystematic, or
random factors: unpredictable and largely temporary conditions, such as his
mental alertness or emotional state, and uncontrolled differences in test method
facets, such as changes in the test environment from one day to the next, or
idioosyncratic differences in the way different test administrators carry out their
responsibilities.
8. FACTORS THAT AFFECT LANGUAGE TEST SCORES
Random factors and test method facets are generally considered to
be sources of measurement error, and have thus been the primary
concern of approaches to estimating reliability.
Personal attributes that are not considered part of the ability tested,
such as sex, ethnic background, cognitive style and prior knowledge of
content area, on the other hand, have traditionally been discussed as
sources of test bias, or test invalidity.
9. CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY
This theory consists of a set of assumptions about the relationships between
actual, or observed test scores and the factors that affect these scores.
Assumption 1: an observed score on a test comprises two factors or components: a
true score that is due to an individual’s level of ability and an error score, that is due
to factors other than the ability being tested.’
→ x = Xt + Xe where x is the observed score, Xt is the true score, and Xe the error
score.
the variance of a set of test scores consists of two components:
S2
x = S2
t + S2
e
where S2
x is the observed score variance, S2
t is the true score variance component,
and S2
e is the error score variance component.
10. Assumption 2: the relationship between true and error scores: error scores are
unsystematic, or random, and are uncorrelated with true scores.
CTS model’s definition of measurement error:
that variation in a set of test scores that is unsystematic or random
In CTS→ Two sources of variance:
True score
variance due to
differences in
ability
Measurement
error
(unsystematic)
11. PARALLEL TESTS
In order for two tests to be considered parallel, we assume
that they are measures of the same ability, that is, that an individual’s true score on
one test will be the same as his true score on the other.
Two tests are parallel if, for every group of persons taking both tests,
(1) the true score on one test is equal to the true score on the other, and
(2) the error variances for the two tests are equal.
Operational definition: parallel tests are two tests of the same ability that have
the same means and variances and are equally correlated with other tests of that
ability.
Virtually we never have strictly parallel tests, we treat two tests as if they were
parallel if the differences between their means and variances are not statistically
significant. Equivalent tests; alternate forms.
12. RELIABILITY AS THE CORRELATION
BETWEEN PARALLEL TESTS
Since we never know what the true or error scores are, we cannot know the
reliability of the observed scores. To be able to estimate the reliability of observed
scores, then, we must define reliability operationally in a way that depends only on
observed scores.
Thus, if the observed scores on two parallel tests are highly correlated, this
indicates that effects of the error scores are minimal, and that they can be
considered reliable indicators of the ability being measured.
The definition of reliability (the basis for all estimates of reliability within CTS
theory): the correlation between the observed scores on two parallel tests, which
we can symbolize as rxx’.
Assumption: the observed scores on the two tests are experimentally independent.
That is, an individual's performance on the second test should not depend on how
she performs on the first.
14. RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF
OBSERVED SCORE VARIANCE
If an individual's observed score on a test is composed of a
true score and an error score, the greater the proportion of
true score, the less the proportion of error score, and thus the
more reliable the observed score.
Thus, one way of defining reliability is as the proportion of
the observed score variance that is true score variance:
rxx’ = s2
t/ s2
x
Note: reliability refers to the test scores, and not the test
itself.
15. APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS
Internal consistency: concerned primarily with
sources of error from within the test and scoring
procedures,
Stability: how consistent test scores are over time,
• The estimates of reliability that these approaches yield are called
reliability coefficients.
Equivalence: an indication of the extent to which scores on
alternate forms of a test are equivalent.
16. INTERNAL CONSISTENCY
Internal consistency is concerned with how consistent test takers’ performances
on the different parts of the test are with each other.
Inconsistencies in performance on different parts of tests can be caused by a
number of factors, including the test method facets.
SPLIT-HALF RELIABILITY ESTIMATES
One approach to examining the internal consistency of a test is the split-half
method, in which we divide the test into two halves and then determine the extent
to which scores on these two halves are consistent with each other.
In so doing, we are treating the halves as parallel tests, and so we must make
certain assumptions about the equivalence of the two halves, specifically that they
have equal means and variances. In addition, we must also assume that the two
halves are independent of each other.
17. INTERNAL CONSISTENCY
In some cases, where we are not sure that the items are measuring the same ability
or that they are independent of each other, the test-retest and parallel forms
methods, are more appropriate for estimating reliability.
The Spearman-Brown split-half estimate
Once the test has been split into halves, it is rescored, yielding two scores - one for
each half - for each test taker.
In one approach to estimating reliability, we then compute the correlation between
the two sets of scores. This gives us an estimate of how consistent the halves are,
however, and we are interested in the reliability of the whole test.
In general, a long test will be more reliable than a short one, assuming that the
additional items correlate positively with the other items in the test.
18. INTERNAL CONSISTENCY
Two assumptions must be met in order to use this method:
First, since we are in effect treating the two halves as parallel tests, we must assume
that they have equal means and variances (an assumption we
can check).
Second, we must assume that the two halves are experimentally independent of each
other (an assumption that is very difficult to check). That is, that an individual's
performance on one half does not affect how he performs on the other.
19. INTERNAL CONSISTENCY
The Guttman sp!it-bdf estimate
Another approach to estimating reliability from split-halves is that developed by
Guttman (1945), which does not assume equivalence of the halves, and which does
not require computing a correlation between them. This split-half reliability
coefficient is based on the ratio of the sum of the variances of the two halves to the
variance of the whole test:
Since this formula is based on the variance of the total test, it provides a direct
estimate of the reliability of the whole test.
Therefore, unlike Spearman-Brown, the Guttman split-half estimate does not
require an additional correction for length.
20. INTERNAL CONSISTENCY
Reliability estimates based on item variances
1. Kuder-Richardson reliability coefficients
There is a way of estimating the average of all the possible split-half coefficients
on the basis of the statistical characteristics of the test items.
This approach developed by Kuder and Richardson (1937), involves computing
the means and variances of the items that constitute the test.
The mean of a dichotomous item (one that is scored as either right or wrong) is
the proportion, symbolized as p, of individuals who answer the item correctly. The
proportion of individuals who answer the item incorrectly is equal to 1 - p, and is
symbolized as q.
The variance of a dichotomous item is the product of these two proportions, or pq.
21. INTERNAL CONSISTENCY
The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20),
based on the ratio of the sum of the item variances to the total test score variance, is
as follows:
Assumption: the items are of nearly equal difficulty and independent of each other.
If the items are of equal difficulty, the reliability coefficient can be computed by
using Kuder-Richardson formula 21 (KR-21:
This formula will generally yield a reliability coefficient that is lower than that
given by KR-20. The Kuder-Richardson formulae are based on total score variance,
and thus they do not require any correction for length.
22. INTERNAL CONSISTENCY
2. Coefficient alpha
Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate
reliability on the basis of ratios of the variances of test components - halves and
items - to total test score variance.
Cronbach (1951) developed a general formula for estimating internal consistency
which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s
alpha’:
It can thus be shown that all estimates of reliability based on the analysis of
variance components can be derived from this formula and are special cases of
coefficient alpha.
23. INTERNAL CONSISTENCY
In summary, the reliability of a set of test scores can be estimated on the basis of a
single test administration only if certain assumptions about the characteristics of
the parts of the test are satisfied.
Assumptions for internal consistency reliability estimates, and effects of
violating assumptions
24. RATER CONSISTENCY
In test scores that are obtained subjectively, such as ratings of compositions or
oral interviews, a source of error is inconsistency in these ratings.
In the case of a single rater, we need to be concerned about the consistency within
that individual’s ratings, or with intrarater reliability.
When there are several different raters, we want to examine the consistency across
raters, or inter-rater reliability.
In both cases, the primary causes of inconsistency will be either the
application of different rating criteria to different samples or
the inconsistent application of the rating criteria to different samples.
25. RATER CONSISTENCY
Intra-rater reliability
Factors introducing inconsistency:
Sequence of paying attention to different kinds of errors
Sequence of scoring from the 1st to the last person
We need to obtain at least two independent ratings from a rater for each individual
language sample.
This is typically accomplished by rating the individual samples once and then re-
rating them at a later time in a different, random order.
Once the two sets of ratings have been obtained, the reliability between them can
be estimated in two ways.
26. RATER CONSISTENCY
One way is to treat the two sets of ratings as scores from parallel tests and
compute the appropriate correlation coefficient (commonly the Spearman rank-
order coefficient) between the two sets of ratings, interpreting this as an estimate
of reliability.
Another approach to examining the consistency of multiple ratings is to compute a
coefficient alpha, treating the independent ratings as different parts:
27. RATER CONSISTENCY
Inter-rater reliability
Factors introducing inconsistency:
Different criteria for rating
Different interpretations of the same rating criteria
The reliability can be estimated in two ways:
We can compute the correlation between two different raters and interpret this as
an estimate of reliability.
When more than two raters are involved, however, rather than computing
correlations for all different pairs, a preferable approach is that recommended by
Ebel (1979), in which we sum the ratings of the different raters and then estimate
the reliability of these summed ratings by computing a coefficient alpha.
28. STABILITY (TEST-RETEST RELIABILITY)
This approach can be used in three cases:
For tests such as cloze and dictation we cannot appropriately estimate the internal
consistency of the scores because of the interdependence of the parts of the test.
There are also testing situations in which it may be necessary to administer a test
more than once as part of a time-series design.
This might also be the concern of a language program evaluator who is interested
in relating changes in language ability to teaching and learning activities in the
program.
This approach to reliability provides an estimate of the stability of the test scores
over time.
29. STABILITY (TEST-RETEST RELIABILITY)
In this approach, we administer the test twice to a group of individuals and then
compute the correlation between the two sets of scores. This correlation can then
be interpreted as an indication of how stable the scores are over time.
Two sources of inconsistency - differential practice effects and differential
changes in ability -pose a dilemma for the test-retest approach.
That is, we must assume that both practice and learning (or unlearning) effects
are either uniform across individuals or random.
Practice effects may occur if certain individuals remember some of the items or
feel more comfortable with the test method, and consequently perform better on
the second administration of the test.
If, on the other hand, there is a considerable time lapse between test
administrations, some individuals’ language ability may actually improve or
decline more than that of others, causing them to perform differently the second
time.
30. STABILITY (TEST-RETEST RELIABILITY)
Possible Solution:
There is no single length of time between test
administrations that is best for all situations.
In each situation, the test developer or user must
attempt to determine the extent to which practice and
learning are likely to influence test performance, and
choose the length of time between test and retest so as
to optimize reduction in the effects of both.
31. EQUIVALENCE (PARALLEL FORMS RELIABILITY)
Like the test-retest approach, this is an appropriate means of
estimating the reliability of tests for which internal consistency
estimates are either inappropriate or not possible.
It is of particular interest in testing situations where alternate forms of
the test may be actually used, either for security reasons, or to
minimize the practice effect.
Assumption: that the different forms of the test are equivalent,
particularly that they are at the same difficulty level and have similar
standard deviations.
32. EQUIVALENCE (PARALLEL FORMS RELIABILITY)
To estimate the reliability of alternate forms of a given test, the procedure used is
to administer both forms to a group of individuals.
One way to minimize the possibility of an ordering effect is to use a
‘counterbalanced’ design, in which half of the individuals take one form first and
the other half take the other form first.
The means and standard deviations for each of the two forms can then be
computed and compared to determine their equivalence, after which the
correlation between the two sets of scores (Form A and Form B) can be computed.
This correlation is then interpreted as an indicator of the equivalence of the two
tests, or as an estimate of the reliability of either one.
33. Note:
o As a matter of procedure, we generally attempt to estimate the internal
consistency of a test first, since if a test is not reliable in this respect, it is not
likely to be equivalent to other forms or stable over time.
o This is because measurement error is random and therefore not correlated
with anything.
o Thus, the greater the proportion of measurement error in the scores from a
given test, the lower the correlation of those scores with other scores will be.
34. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
1. Different sources of error may interact with each other, even when we carefully
design our reliability study. One problem with the CTS model, then, is that it treats
error variance as homogeneous in origin. Each of the estimates of reliability
addresses one specific source of error, and treats other potential sources either as part
of that source, or as true score.
In the classical model, therefore, different sources of error may be confused, or
confounded with each other and with true score variance, since it is not possible to
examine more than one source of error at a time, even though performance on
any given test may be affected by several different sources of error simultaneously.
35. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
2. A second, related problem is that the CTS model considers all error to be random,
and consequently fails to distinguish systematic error from random error.
Factors other than the ability being measured that regularly affect the performance
of some individuals and not others can be regarded as sources of systematic error
or test bias.
Sources of systematic error:
Test method
Cultural content
Psychological task set
Guessing or what Carroll (1961b) called ‘topastic error’.