RELIABILITY
BASED ON:
FUNDAMENTAL CONSIDERATIONS IN
LANGUAGE TESTING
BACHMAN (1990)
CHAPTER 6
Prepared by: Amirhamid Foroughameri
ahfameri@gmail.com
November 2015
INTRODUCTION
 A fundamental concern in the development and use of language tests is to identify
potential sources of error in a given measure of communicative language ability and to
minimize the effect of these factors on that measure.
 We must be concerned about errors of measurement, or unreliability, as we know that
test performance is affected by factors other than the abilities we want to measure.
Unsystematic (unpredictable) factors: poor health, fatigue, lack of interest or motivation, and test-wiseness …
Systematic factors: test method facets
Minimizing the effects of these factors → minimizing measurement error → maximizing reliability.
INTRODUCTION
 A necessary condition for validity: in order for a test score to be valid, it must be
reliable.
 Reliability and validity are not two distinct concepts; they are complementary
aspects of a common concern in measurement
 Reliability is concerned with answering the question: ‘How much of an individual’s test performance
is due to measurement error, or to factors other than the language ability we want
to measure?’ and with minimizing the effects of these factors on test scores.
 Validity is concerned with answering the question: ‘How much of an individual’s test performance is
due to the language abilities we want to measure?’ and with maximizing the effects
of these abilities on test scores.
INTRODUCTION
 The investigation of reliability involves both logical analysis and empirical
research.
 We must identify sources of error and estimate the magnitude of their effects on
test scores.
 To identify sources of error, we need to distinguish the effects of the language
abilities we want to measure from the effects of other factors, which is a complex
problem due to:
1. The interaction between components of language ability and test method facets
makes it difficult to mark a clear ‘boundary’ between the ability being measured and
the method facets (for example, the particular topic of a conversational interaction in an oral
interview).
2. Other characteristics, such as sex, age, cognitive style, and native language.
 Estimating the magnitude of the effects of different sources of error, once these
sources have been identified, is a matter of empirical research, and is a major
concern of measurement theory.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
Thorndike (1951) and Stanley (1971) begin their treatments of
reliability with
general frameworks for describing the factors that cause test
scores to vary from individual to individual:
general and specific lasting characteristics,
general and specific temporary characteristics,
and systematic and chance factors related to test
administration and scoring.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Factors that affect language test scores (Bachman 1990):
Note: In a ‘path diagram’
rectangles: observed variables,
ovals: unobserved variables,
straight arrows: causal relationships
[Path diagram: the observed variable ‘Test score’ (rectangle) is causally influenced by four unobserved variables (ovals): communicative language ability, personal attributes, and test method facets, which are systematic, and random factors, which are unsystematic.]
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Test method facets are systematic to the extent that they are uniform from
one test administration to the next. That is, if the input format facet is multiple-
choice, this will not vary, whether the test is given in the morning or afternoon.
 Attributes of individuals include individual characteristics such as cognitive style
and knowledge of particular content areas, and group characteristics such as
sex, race, and ethnic background. These are also systematic in the sense that they
are likely to affect a given individual’s test performance regularly.
 An individual’s test score will be affected to some degree by unsystematic, or
random factors: unpredictable and largely temporary conditions, such as his
mental alertness or emotional state, and uncontrolled differences in test method
facets, such as changes in the test environment from one day to the next, or
idiosyncratic differences in the way different test administrators carry out their
responsibilities.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Random factors and test method facets are generally considered to
be sources of measurement error, and have thus been the primary
concern of approaches to estimating reliability.
 Personal attributes that are not considered part of the ability tested,
such as sex, ethnic background, cognitive style and prior knowledge of
content area, on the other hand, have traditionally been discussed as
sources of test bias, or test invalidity.
CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY
 This theory consists of a set of assumptions about the relationships between
actual, or observed test scores and the factors that affect these scores.
Assumption 1: an observed score on a test comprises two factors or components: a
true score that is due to an individual’s level of ability, and an error score that is due
to factors other than the ability being tested.
→ x = xt + xe, where x is the observed score, xt is the true score, and xe is the error score.
 The variance of a set of test scores likewise consists of two components:
s²x = s²t + s²e
 where s²x is the observed score variance, s²t is the true score variance component,
and s²e is the error score variance component.
 Assumption 2: the relationship between true and error scores: error scores are
unsystematic, or random, and are uncorrelated with true scores.
 CTS model’s definition of measurement error:
that variation in a set of test scores that is unsystematic or random
In CTS → two sources of variance:
1. true score variance, due to differences in ability
2. measurement error (unsystematic)
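A minimal simulation sketch (not from Bachman; the variable names and numbers are illustrative) of the CTS decomposition: observed scores are generated as true score plus random error, and the observed score variance then splits into true score and error score components.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                        # hypothetical number of test takers

true = rng.normal(50, 8, n)       # true scores, reflecting differences in ability
error = rng.normal(0, 4, n)       # random (unsystematic) error, uncorrelated with true scores
x = true + error                  # observed score: x = xt + xe

print(x.var())                    # observed score variance ...
print(true.var() + error.var())   # ... approximately equals s²t + s²e
print(true.var() / x.var())       # proportion of observed variance that is true score variance
```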
PARALLEL TESTS
 In order for two tests to be considered parallel, we assume
that they are measures of the same ability, that is, that an individual’s true score on
one test will be the same as his true score on the other.
 Two tests are parallel if, for every group of persons taking both tests,
(1) the true score on one test is equal to the true score on the other, and
(2) the error variances for the two tests are equal.
 Operational definition: parallel tests are two tests of the same ability that have
the same means and variances and are equally correlated with other tests of that
ability.
 We virtually never have strictly parallel tests; in practice we treat two tests as if they were
parallel if the differences between their means and variances are not statistically
significant. Such tests are called equivalent tests, or alternate forms.
RELIABILITY AS THE CORRELATION
BETWEEN PARALLEL TESTS
 Since we never know what the true or error scores are, we cannot know the
reliability of the observed scores. To be able to estimate the reliability of observed
scores, then, we must define reliability operationally in a way that depends only on
observed scores.
 Thus, if the observed scores on two parallel tests are highly correlated, this
indicates that effects of the error scores are minimal, and that they can be
considered reliable indicators of the ability being measured.
 The definition of reliability (the basis for all estimates of reliability within CTS
theory): the correlation between the observed scores on two parallel tests, which
we can symbolize as rxx’.
 Assumption: the observed scores on the two tests are experimentally independent.
That is, an individual's performance on the second test should not depend on how
she performs on the first.
CORRELATIONS BETWEEN TRUE AND OBSERVED SCORES ON
PARALLEL TESTS
[Path diagram: a single ability underlies the true scores xt and x’t on two parallel tests; the observed scores x and x’ also reflect error components xe and x’e. The correlation between the observed scores on the two parallel tests is rxx’.]
RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF
OBSERVED SCORE VARIANCE
If an individual's observed score on a test is composed of a
true score and an error score, the greater the proportion of
true score, the less the proportion of error score, and thus the
more reliable the observed score.
Thus, one way of defining reliability is as the proportion of
the observed score variance that is true score variance:
rxx’ = s²t / s²x
Note: reliability refers to the test scores, and not the test
itself.
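As a hypothetical worked example: if s²t = 80 and s²x = 100, then rxx’ = 80/100 = .80; that is, 80 per cent of the observed score variance is true score variance and the remaining 20 per cent is error variance.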
APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS MODEL
Internal consistency: concerned primarily with sources of error from within the test and scoring procedures.
Stability: how consistent test scores are over time.
Equivalence: an indication of the extent to which scores on alternate forms of a test are equivalent.
• The estimates of reliability that these approaches yield are called reliability coefficients.
INTERNAL CONSISTENCY
 Internal consistency is concerned with how consistent test takers’ performances
on the different parts of the test are with each other.
 Inconsistencies in performance on different parts of tests can be caused by a
number of factors, including the test method facets.
SPLIT-HALF RELIABILITY ESTIMATES
 One approach to examining the internal consistency of a test is the split-half
method, in which we divide the test into two halves and then determine the extent
to which scores on these two halves are consistent with each other.
 In so doing, we are treating the halves as parallel tests, and so we must make
certain assumptions about the equivalence of the two halves, specifically that they
have equal means and variances. In addition, we must also assume that the two
halves are independent of each other.
INTERNAL CONSISTENCY
 In some cases, where we are not sure that the items are measuring the same ability
or that they are independent of each other, the test-retest and parallel forms
methods are more appropriate for estimating reliability.
 The Spearman-Brown split-half estimate
 Once the test has been split into halves, it is rescored, yielding two scores - one for
each half - for each test taker.
 In one approach to estimating reliability, we then compute the correlation between
the two sets of scores. This gives us an estimate of how consistent the two halves are with
each other; however, we are interested in the reliability of the whole test, so this half-test
correlation is adjusted for length with the Spearman-Brown formula.
 In general, a long test will be more reliable than a short one, assuming that the
additional items correlate positively with the other items in the test.
INTERNAL CONSISTENCY
 Two assumptions must be met in order to use this method:
First, since we are in effect treating the two halves as parallel tests, we must assume
that they have equal means and variances (an assumption we
can check).
Second, we must assume that the two halves are experimentally independent of each
other (an assumption that is very difficult to check). That is, that an individual's
performance on one half does not affect how he performs on the other.
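A minimal sketch of the split-half procedure with the Spearman-Brown correction; the odd/even split, the data, and the function name are illustrative assumptions, not Bachman's own example.

```python
import numpy as np

def spearman_brown_split_half(items):
    """Split-half reliability, stepped up to full test length with Spearman-Brown."""
    half1 = items[:, 0::2].sum(axis=1)      # score on the odd-numbered items
    half2 = items[:, 1::2].sum(axis=1)      # score on the even-numbered items
    r_halves = np.corrcoef(half1, half2)[0, 1]
    return 2 * r_halves / (1 + r_halves)    # correction for the length of the whole test

# Illustrative data: 200 hypothetical test takers, 20 dichotomous items.
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, (200, 1))
items = (ability + rng.normal(0, 1, (200, 20)) > 0).astype(int)
print(round(spearman_brown_split_half(items), 2))
```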
INTERNAL CONSISTENCY
 The Guttman split-half estimate
 Another approach to estimating reliability from split-halves is that developed by
Guttman (1945), which does not assume equivalence of the halves, and which does
not require computing a correlation between them. This split-half reliability
coefficient is based on the ratio of the sum of the variances of the two halves to the
variance of the whole test:
 Since this formula is based on the variance of the total test, it provides a direct
estimate of the reliability of the whole test.
 Therefore, unlike Spearman-Brown, the Guttman split-half estimate does not
require an additional correction for length.
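A companion sketch of the Guttman split-half estimate, using the standard Guttman (1945) formula, 2[1 − (s²h1 + s²h2)/s²x]; it can be applied to the same illustrative `items` matrix as in the sketch above.

```python
import numpy as np

def guttman_split_half(items):
    """Guttman split-half estimate: based on the ratio of the sum of the half
    variances to the total test variance, so no correction for length is needed."""
    half1 = items[:, 0::2].sum(axis=1)
    half2 = items[:, 1::2].sum(axis=1)
    total = items.sum(axis=1)
    return 2 * (1 - (half1.var() + half2.var()) / total.var())
```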
INTERNAL CONSISTENCY
 Reliability estimates based on item variances
1. Kuder-Richardson reliability coefficients
 There is a way of estimating the average of all the possible split-half coefficients
on the basis of the statistical characteristics of the test items.
 This approach developed by Kuder and Richardson (1937), involves computing
the means and variances of the items that constitute the test.
 The mean of a dichotomous item (one that is scored as either right or wrong) is
the proportion, symbolized as p, of individuals who answer the item correctly. The
proportion of individuals who answer the item incorrectly is equal to 1 - p, and is
symbolized as q.
 The variance of a dichotomous item is the product of these two proportions, or pq.
INTERNAL CONSISTENCY
 The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20),
based on the ratio of the sum of the item variances to the total test score variance, is
as follows:
 Assumption: the items are of nearly equal difficulty and independent of each other.
 If the items are of equal difficulty, the reliability coefficient can be computed by
using Kuder-Richardson formula 21 (KR-21):
 This formula will generally yield a reliability coefficient that is lower than that
given by KR-20. The Kuder-Richardson formulae are based on total score variance,
and thus they do not require any correction for length.
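A sketch of the two Kuder-Richardson coefficients using their standard formulas, KR-20 = k/(k−1)·(1 − Σpq/s²x) and KR-21 = k/(k−1)·(1 − M(k−M)/(k·s²x)); the function names and data layout are illustrative assumptions.

```python
import numpy as np

def kr20(items):
    """KR-20 for dichotomous (0/1) items, based on the item variances p*q."""
    k = items.shape[1]
    p = items.mean(axis=0)                 # proportion answering each item correctly
    q = 1 - p
    total_var = items.sum(axis=1).var()    # total test score variance
    return k / (k - 1) * (1 - (p * q).sum() / total_var)

def kr21(items):
    """KR-21: assumes items are of equal difficulty; uses only the mean and
    variance of the total scores, and is generally lower than KR-20."""
    k = items.shape[1]
    total = items.sum(axis=1)
    return k / (k - 1) * (1 - total.mean() * (k - total.mean()) / (k * total.var()))
```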
INTERNAL CONSISTENCY
2. Coefficient alpha
 Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate
reliability on the basis of ratios of the variances of test components - halves and
items - to total test score variance.
 Cronbach (1951) developed a general formula for estimating internal consistency
which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s
alpha’:
 It can thus be shown that all estimates of reliability based on the analysis of
variance components can be derived from this formula and are special cases of
coefficient alpha.
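A sketch of coefficient alpha using Cronbach's general formula, α = k/(k−1)·(1 − Σs²i/s²x), where the 'parts' may be items, halves, or independent ratings; the function name and data layout are assumptions for illustration.

```python
import numpy as np

def cronbach_alpha(parts):
    """Coefficient alpha over a (test takers x parts) score matrix."""
    k = parts.shape[1]
    part_variances = parts.var(axis=0).sum()     # sum of the variances of the parts
    total_variance = parts.sum(axis=1).var()     # variance of the total scores
    return k / (k - 1) * (1 - part_variances / total_variance)
```

Applied to dichotomous items this reproduces KR-20, and applied to two halves it gives the Guttman split-half estimate, consistent with the statement above that those coefficients are special cases of alpha.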
INTERNAL CONSISTENCY
 In summary, the reliability of a set of test scores can be estimated on the basis of a
single test administration only if certain assumptions about the characteristics of
the parts of the test are satisfied.
[Table: Assumptions for internal consistency reliability estimates, and effects of violating assumptions]
RATER CONSISTENCY
 In test scores that are obtained subjectively, such as ratings of compositions or
oral interviews, a source of error is inconsistency in these ratings.
 In the case of a single rater, we need to be concerned about the consistency within
that individual’s ratings, or with intrarater reliability.
 When there are several different raters, we want to examine the consistency across
raters, or inter-rater reliability.
 In both cases, the primary causes of inconsistency will be either the
application of different rating criteria to different samples or
the inconsistent application of the rating criteria to different samples.
RATER CONSISTENCY
 Intra-rater reliability
Factors introducing inconsistency:
 The order in which the rater attends to different kinds of errors
 The order in which the samples are scored, from the first person to the last
 We need to obtain at least two independent ratings from a rater for each individual
language sample.
 This is typically accomplished by rating the individual samples once and then re-
rating them at a later time in a different, random order.
 Once the two sets of ratings have been obtained, the reliability between them can
be estimated in two ways.
RATER CONSISTENCY
 One way is to treat the two sets of ratings as scores from parallel tests and
compute the appropriate correlation coefficient (commonly the Spearman rank-
order coefficient) between the two sets of ratings, interpreting this as an estimate
of reliability.
 Another approach to examining the consistency of multiple ratings is to compute a
coefficient alpha, treating the independent ratings as different parts of the test.
RATER CONSISTENCY
 Inter-rater reliability
Factors introducing inconsistency:
 Different criteria for rating
 Different interpretations of the same rating criteria
 The reliability can be estimated in two ways:
 We can compute the correlation between two different raters and interpret this as
an estimate of reliability.
 When more than two raters are involved, however, rather than computing
correlations for all different pairs, a preferable approach is that recommended by
Ebel (1979), in which we sum the ratings of the different raters and then estimate
the reliability of these summed ratings by computing a coefficient alpha.
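A sketch of the rater-consistency estimate: coefficient alpha computed over a small, made-up matrix of ratings, treating each rater (or each rating occasion, for intra-rater reliability) as a separate part, in the spirit of Ebel's summed-ratings approach.

```python
import numpy as np

# Illustrative ratings: one row per composition or oral interview, one column
# per rater (or per rating occasion, for intra-rater reliability).
ratings = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 3],
    [4, 4, 3],
    [1, 2, 2],
])

# Coefficient alpha of the summed ratings, with raters as the 'parts'.
k = ratings.shape[1]
alpha = k / (k - 1) * (1 - ratings.var(axis=0).sum() / ratings.sum(axis=1).var())
print(round(alpha, 2))
```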
STABILITY (TEST-RETEST RELIABILITY)
 This approach can be used in three cases:
 For tests such as cloze and dictation we cannot appropriately estimate the internal
consistency of the scores because of the interdependence of the parts of the test.
 There are also testing situations in which it may be necessary to administer a test
more than once as part of a time-series design.
 This might also be the concern of a language program evaluator who is interested
in relating changes in language ability to teaching and learning activities in the
program.
 This approach to reliability provides an estimate of the stability of the test scores
over time.
STABILITY (TEST-RETEST RELIABILITY)
 In this approach, we administer the test twice to a group of individuals and then
compute the correlation between the two sets of scores. This correlation can then
be interpreted as an indication of how stable the scores are over time.
 Two sources of inconsistency - differential practice effects and differential
changes in ability - pose a dilemma for the test-retest approach.
 That is, we must assume that both practice and learning (or unlearning) effects
are either uniform across individuals or random.
 Practice effects may occur if certain individuals remember some of the items or
feel more comfortable with the test method, and consequently perform better on
the second administration of the test.
 If, on the other hand, there is a considerable time lapse between test
administrations, some individuals’ language ability may actually improve or
decline more than that of others, causing them to perform differently the second
time.
STABILITY (TEST-RETEST RELIABILITY)
Possible Solution:
There is no single length of time between test
administrations that is best for all situations.
In each situation, the test developer or user must
attempt to determine the extent to which practice and
learning are likely to influence test performance, and
choose the length of time between test and retest so as
to optimize reduction in the effects of both.
EQUIVALENCE (PARALLEL FORMS RELIABILITY)
 Like the test-retest approach, this is an appropriate means of
estimating the reliability of tests for which internal consistency
estimates are either inappropriate or not possible.
 It is of particular interest in testing situations where alternate forms of
the test may be actually used, either for security reasons, or to
minimize the practice effect.
 Assumption: that the different forms of the test are equivalent,
particularly that they are at the same difficulty level and have similar
standard deviations.
EQUIVALENCE (PARALLEL FORMS RELIABILITY)
 To estimate the reliability of alternate forms of a given test, the procedure used is
to administer both forms to a group of individuals.
 One way to minimize the possibility of an ordering effect is to use a
‘counterbalanced’ design, in which half of the individuals take one form first and
the other half take the other form first.
 The means and standard deviations for each of the two forms can then be
computed and compared to determine their equivalence, after which the
correlation between the two sets of scores (Form A and Form B) can be computed.
This correlation is then interpreted as an indicator of the equivalence of the two
tests, or as an estimate of the reliability of either one.
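A sketch of the equivalence check for two alternate forms (the scores and form names are invented for illustration): compare the means and standard deviations, then correlate the two sets of scores and interpret the correlation as a parallel forms reliability estimate.

```python
import numpy as np

form_a = np.array([62, 55, 71, 48, 66, 59, 74, 51, 68, 60])   # hypothetical scores, Form A
form_b = np.array([60, 57, 69, 50, 64, 61, 72, 49, 70, 58])   # hypothetical scores, Form B

print(form_a.mean(), form_a.std(ddof=1))     # the two forms should have
print(form_b.mean(), form_b.std(ddof=1))     # similar means and standard deviations
print(np.corrcoef(form_a, form_b)[0, 1])     # parallel forms reliability estimate
```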
Note:
o As a matter of procedure, we generally attempt to estimate the internal
consistency of a test first, since if a test is not reliable in this respect, it is not
likely to be equivalent to other forms or stable over time.
o This is because measurement error is random and therefore not correlated
with anything.
o Thus, the greater the proportion of measurement error in the scores from a
given test, the lower the correlation of those scores with other scores will be.
PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
1. Different sources of error may interact with each other, even when we carefully
design our reliability study. One problem with the CTS model, then, is that it treats
error variance as homogeneous in origin. Each of the estimates of reliability
addresses one specific source of error, and treats other potential sources either as part
of that source, or as true score.
 In the classical model, therefore, different sources of error may be confused, or
confounded with each other and with true score variance, since it is not possible to
examine more than one source of error at a time, even though performance on
any given test may be affected by several different sources of error simultaneously.
PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
2. A second, related problem is that the CTS model considers all error to be random,
and consequently fails to distinguish systematic error from random error.
 Factors other than the ability being measured that regularly affect the performance
of some individuals and not others can be regarded as sources of systematic error
or test bias.
 Sources of systematic error:
 Test method
 Cultural content
 Psychological task set
 Guessing or what Carroll (1961b) called ‘topastic error’.

More Related Content

What's hot

validity and reliability
validity and reliabilityvalidity and reliability
validity and reliability
affera mujahid
 
Monika seminar
Monika seminarMonika seminar
Monika seminar
monika22singh
 
Reliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdfReliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdf
Vartika Verma
 
Reliability for testing and assessment
Reliability for testing and assessmentReliability for testing and assessment
Reliability for testing and assessment
Erlwinmer Mangmang
 
Reliability types
Reliability typesReliability types
Reliability types
JEEVARATHINAM ANTONY
 
Reliability
ReliabilityReliability
Reliability
shaziazamir1
 
Test Reliability and Validity
Test Reliability and ValidityTest Reliability and Validity
Test Reliability and Validity
Brian Ebie
 
Validity (Educational Assessment)
Validity (Educational Assessment)Validity (Educational Assessment)
Validity (Educational Assessment)
HennaAnsari
 
Qualities of a Good Test
Qualities of a Good TestQualities of a Good Test
Qualities of a Good Test
DrSindhuAlmas
 
RELIABILITY AND VALIDITY
RELIABILITY AND VALIDITYRELIABILITY AND VALIDITY
RELIABILITY AND VALIDITY
Joydeep Singh
 
Characteristics of a good test
Characteristics of a good testCharacteristics of a good test
Characteristics of a good testcyrilcoscos
 
Reliability
ReliabilityReliability
ReliabilityRoi Xcel
 
Reliability
ReliabilityReliability
Standard error of measurement
Standard error of measurementStandard error of measurement
Standard error of measurementtlcoffman
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
Carlos Tian Chow Correos
 
Understanding reliability and validity
Understanding reliability and validityUnderstanding reliability and validity
Understanding reliability and validity
Muhammad Faisal
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
shobhitsaxena67
 
Characteristics of a good test
Characteristics  of a good testCharacteristics  of a good test
Characteristics of a good test
Lalima Tripathi
 
Reliability & validity
Reliability & validityReliability & validity
Reliability & validityshefali84
 

What's hot (20)

Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
 
validity and reliability
validity and reliabilityvalidity and reliability
validity and reliability
 
Monika seminar
Monika seminarMonika seminar
Monika seminar
 
Reliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdfReliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdf
 
Reliability for testing and assessment
Reliability for testing and assessmentReliability for testing and assessment
Reliability for testing and assessment
 
Reliability types
Reliability typesReliability types
Reliability types
 
Reliability
ReliabilityReliability
Reliability
 
Test Reliability and Validity
Test Reliability and ValidityTest Reliability and Validity
Test Reliability and Validity
 
Validity (Educational Assessment)
Validity (Educational Assessment)Validity (Educational Assessment)
Validity (Educational Assessment)
 
Qualities of a Good Test
Qualities of a Good TestQualities of a Good Test
Qualities of a Good Test
 
RELIABILITY AND VALIDITY
RELIABILITY AND VALIDITYRELIABILITY AND VALIDITY
RELIABILITY AND VALIDITY
 
Characteristics of a good test
Characteristics of a good testCharacteristics of a good test
Characteristics of a good test
 
Reliability
ReliabilityReliability
Reliability
 
Reliability
ReliabilityReliability
Reliability
 
Standard error of measurement
Standard error of measurementStandard error of measurement
Standard error of measurement
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
 
Understanding reliability and validity
Understanding reliability and validityUnderstanding reliability and validity
Understanding reliability and validity
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
 
Characteristics of a good test
Characteristics  of a good testCharacteristics  of a good test
Characteristics of a good test
 
Reliability & validity
Reliability & validityReliability & validity
Reliability & validity
 

Viewers also liked

Sifakis 2007
Sifakis 2007Sifakis 2007
The role of corrective feedback in second language learning
The role of corrective feedback in second language learningThe role of corrective feedback in second language learning
The role of corrective feedback in second language learning
Amir Hamid Forough Ameri
 
Extroversion introversion
Extroversion introversionExtroversion introversion
Extroversion introversion
Amir Hamid Forough Ameri
 
Exploring culture by ah forough ameri
Exploring culture by ah forough ameriExploring culture by ah forough ameri
Exploring culture by ah forough ameri
Amir Hamid Forough Ameri
 
Maryam Bolouri
Maryam BolouriMaryam Bolouri
Maryam Bolouri
Allame Tabatabaei
 
Thesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriThesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameri
Amir Hamid Forough Ameri
 
Behavioral view of motivation
Behavioral view of motivationBehavioral view of motivation
Behavioral view of motivation
Amir Hamid Forough Ameri
 
Language testing the social dimension
Language testing  the social dimensionLanguage testing  the social dimension
Language testing the social dimension
Amir Hamid Forough Ameri
 
Critical pedagogy in l2 learning and teaching suresh canagarajah
Critical pedagogy in l2 learning and teaching  suresh canagarajahCritical pedagogy in l2 learning and teaching  suresh canagarajah
Critical pedagogy in l2 learning and teaching suresh canagarajah
Amir Hamid Forough Ameri
 
Context culture .... m. wendt
Context culture .... m. wendtContext culture .... m. wendt
Context culture .... m. wendt
Amir Hamid Forough Ameri
 
Integrated syllabus
Integrated syllabusIntegrated syllabus
Integrated syllabus
Amir Hamid Forough Ameri
 
ANxiety bolouri
ANxiety bolouriANxiety bolouri
ANxiety bolouri
Allame Tabatabaei
 
Swan.bolouri
Swan.bolouriSwan.bolouri
Swan.bolouri
Allame Tabatabaei
 
attitide anxiety bolouri
 attitide anxiety bolouri attitide anxiety bolouri
attitide anxiety bolouri
Allame Tabatabaei
 
Critical literacy and second language learning luke and dooley
Critical literacy and second language learning  luke and dooleyCritical literacy and second language learning  luke and dooley
Critical literacy and second language learning luke and dooley
Amir Hamid Forough Ameri
 
Classroom assessment, glenn fulcher
Classroom assessment, glenn fulcherClassroom assessment, glenn fulcher
Classroom assessment, glenn fulcher
Amir Hamid Forough Ameri
 
The task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewoodThe task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewood
Amir Hamid Forough Ameri
 
Notional functional syllabus
Notional functional syllabusNotional functional syllabus
Notional functional syllabus
Amir Hamid Forough Ameri
 
Task based research and language pedagogy ellis
Task based research and language pedagogy ellisTask based research and language pedagogy ellis
Task based research and language pedagogy ellis
Amir Hamid Forough Ameri
 

Viewers also liked (20)

Sifakis 2007
Sifakis 2007Sifakis 2007
Sifakis 2007
 
The role of corrective feedback in second language learning
The role of corrective feedback in second language learningThe role of corrective feedback in second language learning
The role of corrective feedback in second language learning
 
Extroversion introversion
Extroversion introversionExtroversion introversion
Extroversion introversion
 
Exploring culture by ah forough ameri
Exploring culture by ah forough ameriExploring culture by ah forough ameri
Exploring culture by ah forough ameri
 
Maryam Bolouri
Maryam BolouriMaryam Bolouri
Maryam Bolouri
 
Thesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriThesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameri
 
Behavioral view of motivation
Behavioral view of motivationBehavioral view of motivation
Behavioral view of motivation
 
Language testing the social dimension
Language testing  the social dimensionLanguage testing  the social dimension
Language testing the social dimension
 
Critical pedagogy in l2 learning and teaching suresh canagarajah
Critical pedagogy in l2 learning and teaching  suresh canagarajahCritical pedagogy in l2 learning and teaching  suresh canagarajah
Critical pedagogy in l2 learning and teaching suresh canagarajah
 
Context culture .... m. wendt
Context culture .... m. wendtContext culture .... m. wendt
Context culture .... m. wendt
 
Integrated syllabus
Integrated syllabusIntegrated syllabus
Integrated syllabus
 
ANxiety bolouri
ANxiety bolouriANxiety bolouri
ANxiety bolouri
 
Swan.bolouri
Swan.bolouriSwan.bolouri
Swan.bolouri
 
attitide anxiety bolouri
 attitide anxiety bolouri attitide anxiety bolouri
attitide anxiety bolouri
 
Critical literacy and second language learning luke and dooley
Critical literacy and second language learning  luke and dooleyCritical literacy and second language learning  luke and dooley
Critical literacy and second language learning luke and dooley
 
Classroom assessment, glenn fulcher
Classroom assessment, glenn fulcherClassroom assessment, glenn fulcher
Classroom assessment, glenn fulcher
 
Tsui 2011
Tsui 2011Tsui 2011
Tsui 2011
 
The task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewoodThe task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewood
 
Notional functional syllabus
Notional functional syllabusNotional functional syllabus
Notional functional syllabus
 
Task based research and language pedagogy ellis
Task based research and language pedagogy ellisTask based research and language pedagogy ellis
Task based research and language pedagogy ellis
 

Similar to Reliability bachman 1990 chapter 6

Reliability in Language Testing
Reliability in Language Testing Reliability in Language Testing
Reliability in Language Testing
Seray Tanyer
 
Valiadity and reliability- Language testing
Valiadity and reliability- Language testingValiadity and reliability- Language testing
Valiadity and reliability- Language testing
Phuong Tran
 
Characteristics of a good test
Characteristics of a good test Characteristics of a good test
Characteristics of a good test
Arash Yazdani
 
What makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docxWhat makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docx
mecklenburgstrelitzh
 
Chapter 8 compilation
Chapter 8 compilationChapter 8 compilation
Chapter 8 compilationHannan Mahmud
 
Validity & reliability seminar
Validity & reliability seminarValidity & reliability seminar
Validity & reliability seminar
mrikara185
 
Rep
RepRep
Rep
Cedy_28
 
Validity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their TypesValidity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their Types
MohammadRabbani18
 
EM&E.pptx
EM&E.pptxEM&E.pptx
EM&E.pptx
Hafiz20006
 
Meaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptxMeaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptx
sarat68
 
Quantitative analysis
Quantitative analysisQuantitative analysis
Quantitative analysis
Pachica, Gerry B.
 
Louzel Report - Reliability & validity
Louzel Report - Reliability & validity Louzel Report - Reliability & validity
Louzel Report - Reliability & validity Louzel Linejan
 
Reliability of test
Reliability of testReliability of test
Reliability of test
Sarat Rout
 
Testing in language programs (chapter 8)
Testing in language programs (chapter 8)Testing in language programs (chapter 8)
Testing in language programs (chapter 8)
Tahere Bakhshi
 
Topic validity
Topic validityTopic validity
Topic validity
mikki khan
 
Reliability and validity of Research Data
Reliability and validity of Research DataReliability and validity of Research Data
Reliability and validity of Research Data
Aida Arifin
 
Norms[1]
Norms[1]Norms[1]
Norms[1]
Milen Ramos
 
Research methods 2 operationalization & measurement
Research methods 2   operationalization & measurementResearch methods 2   operationalization & measurement
Research methods 2 operationalization & measurement
attique1960
 
VALIDITY
VALIDITYVALIDITY
VALIDITY
ANCYBS
 
Data processing
Data processingData processing
Data processing
marie chris portillas
 

Similar to Reliability bachman 1990 chapter 6 (20)

Reliability in Language Testing
Reliability in Language Testing Reliability in Language Testing
Reliability in Language Testing
 
Valiadity and reliability- Language testing
Valiadity and reliability- Language testingValiadity and reliability- Language testing
Valiadity and reliability- Language testing
 
Characteristics of a good test
Characteristics of a good test Characteristics of a good test
Characteristics of a good test
 
What makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docxWhat makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docx
 
Chapter 8 compilation
Chapter 8 compilationChapter 8 compilation
Chapter 8 compilation
 
Validity & reliability seminar
Validity & reliability seminarValidity & reliability seminar
Validity & reliability seminar
 
Rep
RepRep
Rep
 
Validity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their TypesValidity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their Types
 
EM&E.pptx
EM&E.pptxEM&E.pptx
EM&E.pptx
 
Meaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptxMeaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptx
 
Quantitative analysis
Quantitative analysisQuantitative analysis
Quantitative analysis
 
Louzel Report - Reliability & validity
Louzel Report - Reliability & validity Louzel Report - Reliability & validity
Louzel Report - Reliability & validity
 
Reliability of test
Reliability of testReliability of test
Reliability of test
 
Testing in language programs (chapter 8)
Testing in language programs (chapter 8)Testing in language programs (chapter 8)
Testing in language programs (chapter 8)
 
Topic validity
Topic validityTopic validity
Topic validity
 
Reliability and validity of Research Data
Reliability and validity of Research DataReliability and validity of Research Data
Reliability and validity of Research Data
 
Norms[1]
Norms[1]Norms[1]
Norms[1]
 
Research methods 2 operationalization & measurement
Research methods 2   operationalization & measurementResearch methods 2   operationalization & measurement
Research methods 2 operationalization & measurement
 
VALIDITY
VALIDITYVALIDITY
VALIDITY
 
Data processing
Data processingData processing
Data processing
 

Recently uploaded

ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
rosedainty
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 

Recently uploaded (20)

ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 

Reliability bachman 1990 chapter 6

  • 1. RELIABILITY BASED ON: FUNDAMENTAL CONSIDERATIONS IN LANGUAGE TESTING BACHMAN (1990) CHAPTER 6 Prepared by: Amirhamid Foroughameri ahfameri@gmail.com November 2015
  • 2. INTRODUCTION  A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure.  We must be concerned about errors of measurement, or unreliability, as we know that test performance is affected by factors other than the abilities we want to measure. Poor health, fatigue, lack of interest test method facets or motivation, and test-wiseness … Minimizing the effects of these factors Minimizing measurement error maximizing reliability. Unsystematic (unpredictable) Systematic
  • 3. INTRODUCTION  A necessary condition for validity: in order for a test score to be valid, it must be reliable.  Reliability and validity are not two distinct concepts; they are complementary aspects of a common concern in measurement  Reliability answers the question: ‘How much of an individual’s test performance is due to measurement error, or to factors 0ther than the language ability we want to measure?’ and with minimizing the effects of these factors on test scores.  Validity answers the question: ‘How much of an individual’s test performance is due to the language abilities we want to measure?’ and with maximizing the effects of these abilities on test scores.
  • 4. INTRODUCTION  The investigation of reliability involves both logical analysis and empirical research.  We must identify sources of error and estimate the magnitude of their effects on test scores.  To identify sources of error, we need to distinguish the effects of the language abilities we want to measure from the effects of other factors, which is a complex problem due to: 1. The interaction between components of language ability and test method facets makes it difficult to mark a clear ‘boundary’ between the ability being measured and the method facets → a particular topic of a conversational interaction an oral interview. 2. Other characteristics, such as sex, age, cognitive style, and native language.  Estimating the magnitude of the effects of different sources of error, once these sources have been identified, is a matter of empirical research, and is a major concern of measurement theory.
  • 5. FACTORS THAT AFFECT LANGUAGE TEST SCORES Thorndike (1951) and Stanley (1971) begin their treatments of reliability with general frameworks for describing the factors that cause test scores to vary from individual to individual: general and specific lasting characteristics, general and specific temporary characteristics, and systematic and chance factors related to test administration and scoring.
  • 6. FACTORS THAT AFFECT LANGUAGE TEST SCORES  Factors that affect language test scores (Bachman 1990): Note: In a ‘path diagram’ rectangles: observed variables, ovals: unobserved variables, straight arrows: causal relationships Systematic Unsystematic Test Score Communicative language ability Personal attributes Test method facets Random factors
  • 7. FACTORS THAT AFFECT LANGUAGE TEST SCORES  Test method facets are systematic to the extent that they are uniform from one test administration to the next. That is, if the input format facet is multiple- choice, this will not vary, whether the test is given in the morning or afternoon.  Attributes of individuals include individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background. These are also systematic in the sense that they are likely to affect a given individual’s test performance regularly.  An individual’s test score will be affected to some degree by unsystematic, or random factors: unpredictable and largely temporary conditions, such as his mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next, or idioosyncratic differences in the way different test administrators carry out their responsibilities.
  • 8. FACTORS THAT AFFECT LANGUAGE TEST SCORES  Random factors and test method facets are generally considered to be sources of measurement error, and have thus been the primary concern of approaches to estimating reliability.  Personal attributes that are not considered part of the ability tested, such as sex, ethnic background, cognitive style and prior knowledge of content area, on the other hand, have traditionally been discussed as sources of test bias, or test invalidity.
  • 9. CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY  This theory consists of a set of assumptions about the relationships between actual, or observed test scores and the factors that affect these scores. Assumption 1: an observed score on a test comprises two factors or components: a true score that is due to an individual’s level of ability and an error score, that is due to factors other than the ability being tested.’ → x = Xt + Xe where x is the observed score, Xt is the true score, and Xe the error score.  the variance of a set of test scores consists of two components: S2 x = S2 t + S2 e  where S2 x is the observed score variance, S2 t is the true score variance component, and S2 e is the error score variance component.
  • 10.  Assumption 2: the relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores.  CTS model’s definition of measurement error: that variation in a set of test scores that is unsystematic or random In CTS→ Two sources of variance: True score variance due to differences in ability Measurement error (unsystematic)
  • 11. PARALLEL TESTS  In order for two tests to be considered parallel, we assume that they are measures of the same ability, that is, that an individual’s true score on one test will be the same as his true score on the other.  Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal.  Operational definition: parallel tests are two tests of the same ability that have the same means and variances and are equally correlated with other tests of that ability.  Virtually we never have strictly parallel tests, we treat two tests as if they were parallel if the differences between their means and variances are not statistically significant. Equivalent tests; alternate forms.
  • 12. RELIABILITY AS THE CORRELATION BETWEEN PARALLEL TESTS  Since we never know what the true or error scores are, we cannot know the reliability of the observed scores. To be able to estimate the reliability of observed scores, then, we must define reliability operationally in a way that depends only on observed scores.  Thus, if the observed scores on two parallel tests are highly correlated, this indicates that effects of the error scores are minimal, and that they can be considered reliable indicators of the ability being measured.  The definition of reliability (the basis for all estimates of reliability within CTS theory): the correlation between the observed scores on two parallel tests, which we can symbolize as rxx’.  Assumption: the observed scores on the two tests are experimentally independent. That is, an individual's performance on the second test should not depend on how she performs on the first.
  • 13. CORRELATIONS BETWEEN TRUE AND OBSERVED SCORES ON PARALLEL TESTS = rxx’ Ability x’t x’e xt xe x X’
  • 14. RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF OBSERVED SCORE VARIANCE If an individual's observed score on a test is composed of a true score and an error score, the greater the proportion of true score, the less the proportion of error score, and thus the more reliable the observed score. Thus, one way of defining reliability is as the proportion of the observed score variance that is true score variance: rxx’ = s2 t/ s2 x Note: reliability refers to the test scores, and not the test itself.
  • 15. APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS Internal consistency: concerned primarily with sources of error from within the test and scoring procedures, Stability: how consistent test scores are over time, • The estimates of reliability that these approaches yield are called reliability coefficients. Equivalence: an indication of the extent to which scores on alternate forms of a test are equivalent.
  • 16. INTERNAL CONSISTENCY  Internal consistency is concerned with how consistent test takers’ performances on the different parts of the test are with each other.  Inconsistencies in performance on different parts of tests can be caused by a number of factors, including the test method facets. SPLIT-HALF RELIABILITY ESTIMATES  One approach to examining the internal consistency of a test is the split-half method, in which we divide the test into two halves and then determine the extent to which scores on these two halves are consistent with each other.  In so doing, we are treating the halves as parallel tests, and so we must make certain assumptions about the equivalence of the two halves, specifically that they have equal means and variances. In addition, we must also assume that the two halves are independent of each other.
  • 17. INTERNAL CONSISTENCY  In some cases, where we are not sure that the items are measuring the same ability or that they are independent of each other, the test-retest and parallel forms methods, are more appropriate for estimating reliability.  The Spearman-Brown split-half estimate  Once the test has been split into halves, it is rescored, yielding two scores - one for each half - for each test taker.  In one approach to estimating reliability, we then compute the correlation between the two sets of scores. This gives us an estimate of how consistent the halves are, however, and we are interested in the reliability of the whole test.  In general, a long test will be more reliable than a short one, assuming that the additional items correlate positively with the other items in the test.
  • 18. INTERNAL CONSISTENCY  Two assumptions must be met in order to use this method: First, since we are in effect treating the two halves as parallel tests, we must assume that they have equal means and variances (an assumption we can check). Second, we must assume that the two halves are experimentally independent of each other (an assumption that is very difficult to check). That is, that an individual's performance on one half does not affect how he performs on the other.
  • 19. INTERNAL CONSISTENCY  The Guttman sp!it-bdf estimate  Another approach to estimating reliability from split-halves is that developed by Guttman (1945), which does not assume equivalence of the halves, and which does not require computing a correlation between them. This split-half reliability coefficient is based on the ratio of the sum of the variances of the two halves to the variance of the whole test:  Since this formula is based on the variance of the total test, it provides a direct estimate of the reliability of the whole test.  Therefore, unlike Spearman-Brown, the Guttman split-half estimate does not require an additional correction for length.
• 20. INTERNAL CONSISTENCY  Reliability estimates based on item variances 1. Kuder-Richardson reliability coefficients  There is a way of estimating the average of all the possible split-half coefficients on the basis of the statistical characteristics of the test items.  This approach, developed by Kuder and Richardson (1937), involves computing the means and variances of the items that constitute the test.  The mean of a dichotomous item (one that is scored as either right or wrong) is the proportion, symbolized as p, of individuals who answer the item correctly. The proportion of individuals who answer the item incorrectly is equal to 1 - p, and is symbolized as q.  The variance of a dichotomous item is the product of these two proportions, or pq.
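As a quick arithmetic illustration of these definitions: if 70 per cent of test takers answer an item correctly, then

\[
p = .70, \qquad q = 1 - p = .30, \qquad pq = (.70)(.30) = .21
\]

so item variance is largest (.25) for items of middling difficulty (p = .50) and shrinks toward zero for very easy or very hard items.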
• 21. INTERNAL CONSISTENCY  The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20), based on the ratio of the sum of the item variances to the total test score variance, is given below.  Assumption: the items are of nearly equal difficulty and independent of each other.  If the items are of equal difficulty, the reliability coefficient can be computed by using Kuder-Richardson formula 21 (KR-21), also given below.  This formula will generally yield a reliability coefficient that is lower than that given by KR-20. The Kuder-Richardson formulae are based on total score variance, and thus they do not require any correction for length.
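The two formulae do not reproduce on the slide; their standard forms, with k the number of items, \bar{x} the mean total score, and s_x^2 the total test score variance, are

\[
\text{KR-20:}\quad r = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)
\qquad\qquad
\text{KR-21:}\quad r = \frac{k}{k-1}\left(1 - \frac{\bar{x}\,(k - \bar{x})}{k\,s_x^2}\right)
\]

KR-21 replaces the item-by-item sum of pq with an approximation based only on the mean and variance of total scores, which is why it requires the equal-difficulty assumption and generally yields the lower estimate.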
• 22. INTERNAL CONSISTENCY 2. Coefficient alpha  Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate reliability on the basis of ratios of the variances of test components - halves and items - to total test score variance.  Cronbach (1951) developed a general formula for estimating internal consistency which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s alpha’ (given below).  It can be shown that all estimates of reliability based on the analysis of variance components can be derived from this formula and are special cases of coefficient alpha.
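In its usual form, with k the number of parts (items, halves, or ratings), s_i^2 the variance of part i, and s_x^2 the total score variance:

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)
\]

With dichotomous items, s_i^2 = pq and alpha reduces to KR-20; with the two halves as the parts (k = 2), it reproduces the Guttman split-half estimate.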
• 23. INTERNAL CONSISTENCY  In summary, the reliability of a set of test scores can be estimated on the basis of a single test administration only if certain assumptions about the characteristics of the parts of the test are satisfied. [Table: assumptions for internal consistency reliability estimates, and effects of violating assumptions]
• 24. RATER CONSISTENCY  When test scores are obtained subjectively, as in ratings of compositions or oral interviews, one source of error is inconsistency in these ratings.  In the case of a single rater, we need to be concerned about the consistency within that individual’s ratings, or intra-rater reliability.  When there are several different raters, we want to examine the consistency across raters, or inter-rater reliability.  In both cases, the primary causes of inconsistency are either the application of different rating criteria to different samples or the inconsistent application of the same rating criteria to different samples.
• 25. RATER CONSISTENCY  Intra-rater reliability Factors introducing inconsistency:  Attending to different kinds of errors at different points in the rating process  Changes in severity or leniency as the rater moves from the first sample to the last  We need to obtain at least two independent ratings from a rater for each individual language sample.  This is typically accomplished by rating the individual samples once and then re-rating them at a later time in a different, random order.  Once the two sets of ratings have been obtained, the reliability between them can be estimated in two ways.
• 26. RATER CONSISTENCY  One way is to treat the two sets of ratings as scores from parallel tests and compute the appropriate correlation coefficient (commonly the Spearman rank-order coefficient) between the two sets of ratings, interpreting this as an estimate of reliability.  Another approach to examining the consistency of multiple ratings is to compute a coefficient alpha, treating the independent ratings as different parts:
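A brief sketch of both options with invented ratings (the data and variable names are illustrative only):

```python
import numpy as np
from scipy.stats import spearmanr

# Invented data: six compositions rated twice by the same rater (intra-rater case)
rating_1 = np.array([4, 3, 5, 2, 4, 3])
rating_2 = np.array([4, 2, 5, 3, 4, 3])

# Option 1: treat the two sets of ratings as parallel tests and correlate them
rho, pvalue = spearmanr(rating_1, rating_2)

# Option 2: coefficient alpha, with the independent ratings as the 'parts'
parts = np.column_stack([rating_1, rating_2])
k = parts.shape[1]
alpha = (k / (k - 1)) * (1 - parts.var(axis=0, ddof=1).sum()
                         / parts.sum(axis=1).var(ddof=1))
print(round(rho, 3), round(alpha, 3))
```

The same alpha computation extends directly to more than two raters by adding columns, which is in effect the summed-ratings approach attributed to Ebel on the next slide.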
• 27. RATER CONSISTENCY  Inter-rater reliability Factors introducing inconsistency:  Different criteria for rating  Different interpretations of the same rating criteria  The reliability can be estimated in two ways:  We can compute the correlation between the ratings given by two different raters and interpret this as an estimate of reliability.  When more than two raters are involved, however, rather than computing correlations for all the different pairs, a preferable approach is that recommended by Ebel (1979), in which we sum the ratings of the different raters and then estimate the reliability of these summed ratings by computing a coefficient alpha.
  • 28. STABILITY (TEST-RETEST RELIABILITY)  This approach can be used in three cases:  For tests such as cloze and dictation we cannot appropriately estimate the internal consistency of the scores because of the interdependence of the parts of the test.  There are also testing situations in which it may be necessary to administer a test more than once as part of a time-series design.  This might also be the concern of a language program evaluator who is interested in relating changes in language ability to teaching and learning activities in the program.  This approach to reliability provides an estimate of the stability of the test scores over time.
• 29. STABILITY (TEST-RETEST RELIABILITY)  In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. This correlation can then be interpreted as an indication of how stable the scores are over time.  Two sources of inconsistency - differential practice effects and differential changes in ability - pose a dilemma for the test-retest approach.  That is, we must assume that both practice and learning (or unlearning) effects are either uniform across individuals or random.  Practice effects may occur if certain individuals remember some of the items or feel more comfortable with the test method, and consequently perform better on the second administration of the test.  If, on the other hand, there is a considerable time lapse between test administrations, some individuals’ language ability may actually improve or decline more than that of others, causing them to perform differently the second time.
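A minimal sketch with invented scores; the stability estimate is simply the correlation between the two administrations:

```python
import numpy as np

# Invented total scores for the same six test takers on two administrations
time_1 = np.array([23, 30, 18, 27, 25, 21])
time_2 = np.array([25, 29, 20, 26, 27, 22])

# Test-retest (stability) estimate of reliability
r_stability = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_stability, 3))
```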
• 30. STABILITY (TEST-RETEST RELIABILITY) Possible Solution: There is no single length of time between test administrations that is best for all situations. In each situation, the test developer or user must attempt to determine the extent to which practice and learning are likely to influence test performance, and choose the length of time between test and retest so as to minimize the effects of both.
• 31. EQUIVALENCE (PARALLEL FORMS RELIABILITY)  Like the test-retest approach, this is an appropriate means of estimating the reliability of tests for which internal consistency estimates are either inappropriate or not possible.  It is of particular interest in testing situations where alternate forms of the test may actually be used, either for security reasons or to minimize the practice effect.  Assumption: that the different forms of the test are equivalent, particularly that they are at the same difficulty level and have similar standard deviations.
• 32. EQUIVALENCE (PARALLEL FORMS RELIABILITY)  To estimate the reliability of alternate forms of a given test, the procedure used is to administer both forms to a group of individuals.  One way to minimize the possibility of an ordering effect is to use a ‘counterbalanced’ design, in which half of the individuals take one form first and the other half take the other form first.  The means and standard deviations for each of the two forms can then be computed and compared to determine their equivalence, after which the correlation between the two sets of scores (Form A and Form B) can be computed. This correlation is then interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one.
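A sketch along the same lines with invented scores on two forms; the means and standard deviations are compared for equivalence before the correlation is interpreted as a reliability estimate:

```python
import numpy as np

# Invented scores for the same group on Form A and Form B
# (in a counterbalanced design, half the group would take Form B first)
form_a = np.array([31, 24, 28, 35, 22, 27, 30, 26])
form_b = np.array([30, 25, 27, 33, 24, 26, 31, 27])

# First compare the forms: equivalent forms should have similar means and SDs
print(form_a.mean(), form_b.mean())
print(form_a.std(ddof=1), form_b.std(ddof=1))

# Then the correlation between forms is the parallel-forms (equivalence) estimate
r_equivalence = np.corrcoef(form_a, form_b)[0, 1]
print(round(r_equivalence, 3))
```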
  • 33. Note: o As a matter of procedure, we generally attempt to estimate the internal consistency of a test first, since if a test is not reliable in this respect, it is not likely to be equivalent to other forms or stable over time. o This is because measurement error is random and therefore not correlated with anything. o Thus, the greater the proportion of measurement error in the scores from a given test, the lower the correlation of those scores with other scores will be.
  • 34. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL 1. Different sources of error may interact with each other, even when we carefully design our reliability study. One problem with the CTS model, then, is that it treats error variance as homogeneous in origin. Each of the estimates of reliability addresses one specific source of error, and treats other potential sources either as part of that source, or as true score.  In the classical model, therefore, different sources of error may be confused, or confounded with each other and with true score variance, since it is not possible to examine more than one source of error at a time, even though performance on any given test may be affected by several different sources of error simultaneously.
  • 35. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL 2. A second, related problem is that the CTS model considers all error to be random, and consequently fails to distinguish systematic error from random error.  Factors other than the ability being measured that regularly affect the performance of some individuals and not others can be regarded as sources of systematic error or test bias.  Sources of systematic error:  Test method  Cultural content  Psychological task set  Guessing or what Carroll (1961b) called ‘topastic error’.