SlideShare a Scribd company logo
1 of 35
RELIABILITY
BASED ON:
FUNDAMENTAL CONSIDERATIONS IN
LANGUAGE TESTING
BACHMAN (1990)
CHAPTER 6
Prepared by: Amirhamid Foroughameri
ahfameri@gmail.com
November 2015
INTRODUCTION
 A fundamental concern in the development and use of language tests is to identify
potential sources of error in a given measure of communicative language ability and to
minimize the effect of these factors on that measure.
 We must be concerned about errors of measurement, or unreliability, as we know that
test performance is affected by factors other than the abilities we want to measure.
Poor health, fatigue, lack of interest test method facets
or motivation, and test-wiseness …
Minimizing the effects of these factors Minimizing measurement error
maximizing reliability.
Unsystematic (unpredictable) Systematic
INTRODUCTION
 A necessary condition for validity: in order for a test score to be valid, it must be
reliable.
 Reliability and validity are not two distinct concepts; they are complementary
aspects of a common concern in measurement
 Reliability answers the question: ‘How much of an individual’s test performance
is due to measurement error, or to factors 0ther than the language ability we want
to measure?’ and with minimizing the effects of these factors on test scores.
 Validity answers the question: ‘How much of an individual’s test performance is
due to the language abilities we want to measure?’ and with maximizing the effects
of these abilities on test scores.
INTRODUCTION
 The investigation of reliability involves both logical analysis and empirical
research.
 We must identify sources of error and estimate the magnitude of their effects on
test scores.
 To identify sources of error, we need to distinguish the effects of the language
abilities we want to measure from the effects of other factors, which is a complex
problem due to:
1. The interaction between components of language ability and test method facets
makes it difficult to mark a clear ‘boundary’ between the ability being measured and
the method facets → a particular topic of a conversational interaction an oral
interview.
2. Other characteristics, such as sex, age, cognitive style, and native language.
 Estimating the magnitude of the effects of different sources of error, once these
sources have been identified, is a matter of empirical research, and is a major
concern of measurement theory.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
Thorndike (1951) and Stanley (1971) begin their treatments of
reliability with
general frameworks for describing the factors that cause test
scores to vary from individual to individual:
general and specific lasting characteristics,
general and specific temporary characteristics,
and systematic and chance factors related to test
administration and scoring.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Factors that affect language test scores (Bachman 1990):
Note: In a ‘path diagram’
rectangles: observed variables,
ovals: unobserved variables,
straight arrows: causal relationships
Systematic Unsystematic
Test Score
Communicative
language ability
Personal
attributes
Test
method
facets
Random
factors
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Test method facets are systematic to the extent that they are uniform from
one test administration to the next. That is, if the input format facet is multiple-
choice, this will not vary, whether the test is given in the morning or afternoon.
 Attributes of individuals include individual characteristics such as cognitive style
and knowledge of particular content areas, and group characteristics such as
sex, race, and ethnic background. These are also systematic in the sense that they
are likely to affect a given individual’s test performance regularly.
 An individual’s test score will be affected to some degree by unsystematic, or
random factors: unpredictable and largely temporary conditions, such as his
mental alertness or emotional state, and uncontrolled differences in test method
facets, such as changes in the test environment from one day to the next, or
idioosyncratic differences in the way different test administrators carry out their
responsibilities.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Random factors and test method facets are generally considered to
be sources of measurement error, and have thus been the primary
concern of approaches to estimating reliability.
 Personal attributes that are not considered part of the ability tested,
such as sex, ethnic background, cognitive style and prior knowledge of
content area, on the other hand, have traditionally been discussed as
sources of test bias, or test invalidity.
CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY
 This theory consists of a set of assumptions about the relationships between
actual, or observed test scores and the factors that affect these scores.
Assumption 1: an observed score on a test comprises two factors or components: a
true score that is due to an individual’s level of ability and an error score, that is due
to factors other than the ability being tested.’
→ x = Xt + Xe where x is the observed score, Xt is the true score, and Xe the error
score.
 the variance of a set of test scores consists of two components:
S2
x = S2
t + S2
e
 where S2
x is the observed score variance, S2
t is the true score variance component,
and S2
e is the error score variance component.
 Assumption 2: the relationship between true and error scores: error scores are
unsystematic, or random, and are uncorrelated with true scores.
 CTS model’s definition of measurement error:
that variation in a set of test scores that is unsystematic or random
In CTS→ Two sources of variance:
True score
variance due to
differences in
ability
Measurement
error
(unsystematic)
PARALLEL TESTS
 In order for two tests to be considered parallel, we assume
that they are measures of the same ability, that is, that an individual’s true score on
one test will be the same as his true score on the other.
 Two tests are parallel if, for every group of persons taking both tests,
(1) the true score on one test is equal to the true score on the other, and
(2) the error variances for the two tests are equal.
 Operational definition: parallel tests are two tests of the same ability that have
the same means and variances and are equally correlated with other tests of that
ability.
 Virtually we never have strictly parallel tests, we treat two tests as if they were
parallel if the differences between their means and variances are not statistically
significant. Equivalent tests; alternate forms.
RELIABILITY AS THE CORRELATION
BETWEEN PARALLEL TESTS
 Since we never know what the true or error scores are, we cannot know the
reliability of the observed scores. To be able to estimate the reliability of observed
scores, then, we must define reliability operationally in a way that depends only on
observed scores.
 Thus, if the observed scores on two parallel tests are highly correlated, this
indicates that effects of the error scores are minimal, and that they can be
considered reliable indicators of the ability being measured.
 The definition of reliability (the basis for all estimates of reliability within CTS
theory): the correlation between the observed scores on two parallel tests, which
we can symbolize as rxx’.
 Assumption: the observed scores on the two tests are experimentally independent.
That is, an individual's performance on the second test should not depend on how
she performs on the first.
CORRELATIONS BETWEEN TRUE AND OBSERVED SCORES ON
PARALLEL TESTS
=
rxx’
Ability
x’t
x’e
xt
xe
x X’
RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF
OBSERVED SCORE VARIANCE
If an individual's observed score on a test is composed of a
true score and an error score, the greater the proportion of
true score, the less the proportion of error score, and thus the
more reliable the observed score.
Thus, one way of defining reliability is as the proportion of
the observed score variance that is true score variance:
rxx’ = s2
t/ s2
x
Note: reliability refers to the test scores, and not the test
itself.
APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS
Internal consistency: concerned primarily with
sources of error from within the test and scoring
procedures,
Stability: how consistent test scores are over time,
• The estimates of reliability that these approaches yield are called
reliability coefficients.
Equivalence: an indication of the extent to which scores on
alternate forms of a test are equivalent.
INTERNAL CONSISTENCY
 Internal consistency is concerned with how consistent test takers’ performances
on the different parts of the test are with each other.
 Inconsistencies in performance on different parts of tests can be caused by a
number of factors, including the test method facets.
SPLIT-HALF RELIABILITY ESTIMATES
 One approach to examining the internal consistency of a test is the split-half
method, in which we divide the test into two halves and then determine the extent
to which scores on these two halves are consistent with each other.
 In so doing, we are treating the halves as parallel tests, and so we must make
certain assumptions about the equivalence of the two halves, specifically that they
have equal means and variances. In addition, we must also assume that the two
halves are independent of each other.
INTERNAL CONSISTENCY
 In some cases, where we are not sure that the items are measuring the same ability
or that they are independent of each other, the test-retest and parallel forms
methods, are more appropriate for estimating reliability.
 The Spearman-Brown split-half estimate
 Once the test has been split into halves, it is rescored, yielding two scores - one for
each half - for each test taker.
 In one approach to estimating reliability, we then compute the correlation between
the two sets of scores. This gives us an estimate of how consistent the halves are,
however, and we are interested in the reliability of the whole test.
 In general, a long test will be more reliable than a short one, assuming that the
additional items correlate positively with the other items in the test.
INTERNAL CONSISTENCY
 Two assumptions must be met in order to use this method:
First, since we are in effect treating the two halves as parallel tests, we must assume
that they have equal means and variances (an assumption we
can check).
Second, we must assume that the two halves are experimentally independent of each
other (an assumption that is very difficult to check). That is, that an individual's
performance on one half does not affect how he performs on the other.
INTERNAL CONSISTENCY
 The Guttman sp!it-bdf estimate
 Another approach to estimating reliability from split-halves is that developed by
Guttman (1945), which does not assume equivalence of the halves, and which does
not require computing a correlation between them. This split-half reliability
coefficient is based on the ratio of the sum of the variances of the two halves to the
variance of the whole test:
 Since this formula is based on the variance of the total test, it provides a direct
estimate of the reliability of the whole test.
 Therefore, unlike Spearman-Brown, the Guttman split-half estimate does not
require an additional correction for length.
INTERNAL CONSISTENCY
 Reliability estimates based on item variances
1. Kuder-Richardson reliability coefficients
 There is a way of estimating the average of all the possible split-half coefficients
on the basis of the statistical characteristics of the test items.
 This approach developed by Kuder and Richardson (1937), involves computing
the means and variances of the items that constitute the test.
 The mean of a dichotomous item (one that is scored as either right or wrong) is
the proportion, symbolized as p, of individuals who answer the item correctly. The
 proportion of individuals who answer the item incorrectly is equal to 1 - p, and is
symbolized as q.
 The variance of a dichotomous item is the product of these two proportions, or pq.
INTERNAL CONSISTENCY
 The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20),
based on the ratio of the sum of the item variances to the total test score variance, is
as follows:
 Assumption: the items are of nearly equal difficulty and independent of each other.
 If the items are of equal difficulty, the reliability coefficient can be computed by
using Kuder-Richardson formula 21 (KR-21:
 This formula will generally yield a reliability coefficient that is lower than that
given by KR-20. The Kuder-Richardson formulae are based on total score variance,
and thus they do not require any correction for length.
INTERNAL CONSISTENCY
2. Coefficient alpha
 Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate
reliability on the basis of ratios of the variances of test components - halves and
items - to total test score variance.
 Cronbach (1951) developed a general formula for estimating internal consistency
which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s
alpha’:
 It can thus be shown that all estimates of reliability based on the analysis of
variance components can be derived from this formula and are special cases of
coefficient alpha.
INTERNAL CONSISTENCY
 In summary, the reliability of a set of test scores can be estimated on the basis of a
single test administration only if certain assumptions about the characteristics of
the parts of the test are satisfied.
Assumptions for internal consistency reliability estimates, and effects of
violating assumptions
RATER CONSISTENCY
 In test scores that are obtained subjectively, such as ratings of compositions or
oral interviews, a source of error is inconsistency in these ratings.
 In the case of a single rater, we need to be concerned about the consistency within
that individual’s ratings, or with intrarater reliability.
 When there are several different raters, we want to examine the consistency across
raters, or inter-rater reliability.
 In both cases, the primary causes of inconsistency will be either the
application of different rating criteria to different samples or
the inconsistent application of the rating criteria to different samples.
RATER CONSISTENCY
 Intra-rater reliability
Factors introducing inconsistency:
 Sequence of paying attention to different kinds of errors
 Sequence of scoring from the 1st to the last person
 We need to obtain at least two independent ratings from a rater for each individual
language sample.
 This is typically accomplished by rating the individual samples once and then re-
rating them at a later time in a different, random order.
 Once the two sets of ratings have been obtained, the reliability between them can
be estimated in two ways.
RATER CONSISTENCY
 One way is to treat the two sets of ratings as scores from parallel tests and
compute the appropriate correlation coefficient (commonly the Spearman rank-
order coefficient) between the two sets of ratings, interpreting this as an estimate
of reliability.
 Another approach to examining the consistency of multiple ratings is to compute a
coefficient alpha, treating the independent ratings as different parts:
RATER CONSISTENCY
 Inter-rater reliability
Factors introducing inconsistency:
 Different criteria for rating
 Different interpretations of the same rating criteria
 The reliability can be estimated in two ways:
 We can compute the correlation between two different raters and interpret this as
an estimate of reliability.
 When more than two raters are involved, however, rather than computing
correlations for all different pairs, a preferable approach is that recommended by
Ebel (1979), in which we sum the ratings of the different raters and then estimate
the reliability of these summed ratings by computing a coefficient alpha.
STABILITY (TEST-RETEST RELIABILITY)
 This approach can be used in three cases:
 For tests such as cloze and dictation we cannot appropriately estimate the internal
consistency of the scores because of the interdependence of the parts of the test.
 There are also testing situations in which it may be necessary to administer a test
more than once as part of a time-series design.
 This might also be the concern of a language program evaluator who is interested
in relating changes in language ability to teaching and learning activities in the
program.
 This approach to reliability provides an estimate of the stability of the test scores
over time.
STABILITY (TEST-RETEST RELIABILITY)
 In this approach, we administer the test twice to a group of individuals and then
compute the correlation between the two sets of scores. This correlation can then
be interpreted as an indication of how stable the scores are over time.
 Two sources of inconsistency - differential practice effects and differential
changes in ability -pose a dilemma for the test-retest approach.
 That is, we must assume that both practice and learning (or unlearning) effects
are either uniform across individuals or random.
 Practice effects may occur if certain individuals remember some of the items or
feel more comfortable with the test method, and consequently perform better on
the second administration of the test.
 If, on the other hand, there is a considerable time lapse between test
administrations, some individuals’ language ability may actually improve or
decline more than that of others, causing them to perform differently the second
time.
STABILITY (TEST-RETEST RELIABILITY)
Possible Solution:
There is no single length of time between test
administrations that is best for all situations.
In each situation, the test developer or user must
attempt to determine the extent to which practice and
learning are likely to influence test performance, and
choose the length of time between test and retest so as
to optimize reduction in the effects of both.
EQUIVALENCE (PARALLEL FORMS RELIABILITY)
 Like the test-retest approach, this is an appropriate means of
estimating the reliability of tests for which internal consistency
estimates are either inappropriate or not possible.
 It is of particular interest in testing situations where alternate forms of
the test may be actually used, either for security reasons, or to
minimize the practice effect.
 Assumption: that the different forms of the test are equivalent,
particularly that they are at the same difficulty level and have similar
standard deviations.
EQUIVALENCE (PARALLEL FORMS RELIABILITY)
 To estimate the reliability of alternate forms of a given test, the procedure used is
to administer both forms to a group of individuals.
 One way to minimize the possibility of an ordering effect is to use a
 ‘counterbalanced’ design, in which half of the individuals take one form first and
the other half take the other form first.
 The means and standard deviations for each of the two forms can then be
computed and compared to determine their equivalence, after which the
correlation between the two sets of scores (Form A and Form B) can be computed.
This correlation is then interpreted as an indicator of the equivalence of the two
tests, or as an estimate of the reliability of either one.
Note:
o As a matter of procedure, we generally attempt to estimate the internal
consistency of a test first, since if a test is not reliable in this respect, it is not
likely to be equivalent to other forms or stable over time.
o This is because measurement error is random and therefore not correlated
with anything.
o Thus, the greater the proportion of measurement error in the scores from a
given test, the lower the correlation of those scores with other scores will be.
PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
1. Different sources of error may interact with each other, even when we carefully
design our reliability study. One problem with the CTS model, then, is that it treats
error variance as homogeneous in origin. Each of the estimates of reliability
addresses one specific source of error, and treats other potential sources either as part
of that source, or as true score.
 In the classical model, therefore, different sources of error may be confused, or
confounded with each other and with true score variance, since it is not possible to
examine more than one source of error at a time, even though performance on
any given test may be affected by several different sources of error simultaneously.
PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
2. A second, related problem is that the CTS model considers all error to be random,
and consequently fails to distinguish systematic error from random error.
 Factors other than the ability being measured that regularly affect the performance
of some individuals and not others can be regarded as sources of systematic error
or test bias.
 Sources of systematic error:
 Test method
 Cultural content
 Psychological task set
 Guessing or what Carroll (1961b) called ‘topastic error’.

More Related Content

What's hot

Validity and Reliability
Validity and ReliabilityValidity and Reliability
Validity and ReliabilityMaury Martinez
 
Validity, its types, measurement & factors.
Validity, its types, measurement & factors.Validity, its types, measurement & factors.
Validity, its types, measurement & factors.Maheen Iftikhar
 
Norms and the Meaning of Test Scores
Norms and the Meaning of Test ScoresNorms and the Meaning of Test Scores
Norms and the Meaning of Test ScoresMushfikFRahman
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validityshobhitsaxena67
 
Reliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdfReliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdfVartika Verma
 
Types of Scores & Types of Standard Scores
Types of Scores & Types of Standard ScoresTypes of Scores & Types of Standard Scores
Types of Scores & Types of Standard ScoresCRISALDO CORDURA
 
Reliability of test
Reliability of testReliability of test
Reliability of testSarat Rout
 
Classical Test Theory and Item Response Theory
Classical Test Theory and Item Response TheoryClassical Test Theory and Item Response Theory
Classical Test Theory and Item Response Theorysaira kazim
 
Reliability (assessment of student learning I)
Reliability (assessment of student learning I)Reliability (assessment of student learning I)
Reliability (assessment of student learning I)Rey-ra Mora
 
Presentation Validity & Reliability
Presentation Validity & ReliabilityPresentation Validity & Reliability
Presentation Validity & Reliabilitysongoten77
 
Presentation validity
Presentation validityPresentation validity
Presentation validityAshMusavi
 

What's hot (20)

01 validity and its type
01 validity and its type01 validity and its type
01 validity and its type
 
Validity and Reliability
Validity and ReliabilityValidity and Reliability
Validity and Reliability
 
Validity in Assessment
Validity in AssessmentValidity in Assessment
Validity in Assessment
 
Reliability
ReliabilityReliability
Reliability
 
Validity, its types, measurement & factors.
Validity, its types, measurement & factors.Validity, its types, measurement & factors.
Validity, its types, measurement & factors.
 
Types of tests in measurement and evaluation
Types of tests in measurement and evaluationTypes of tests in measurement and evaluation
Types of tests in measurement and evaluation
 
Reliability types
Reliability typesReliability types
Reliability types
 
Norms and the Meaning of Test Scores
Norms and the Meaning of Test ScoresNorms and the Meaning of Test Scores
Norms and the Meaning of Test Scores
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
 
Reliability
ReliabilityReliability
Reliability
 
Reliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdfReliability by Vartika Verma .pdf
Reliability by Vartika Verma .pdf
 
Types of Scores & Types of Standard Scores
Types of Scores & Types of Standard ScoresTypes of Scores & Types of Standard Scores
Types of Scores & Types of Standard Scores
 
Validity
ValidityValidity
Validity
 
Reliability of test
Reliability of testReliability of test
Reliability of test
 
Classical Test Theory and Item Response Theory
Classical Test Theory and Item Response TheoryClassical Test Theory and Item Response Theory
Classical Test Theory and Item Response Theory
 
Reliability (assessment of student learning I)
Reliability (assessment of student learning I)Reliability (assessment of student learning I)
Reliability (assessment of student learning I)
 
Presentation Validity & Reliability
Presentation Validity & ReliabilityPresentation Validity & Reliability
Presentation Validity & Reliability
 
Norms[1]
Norms[1]Norms[1]
Norms[1]
 
Construction of Test
Construction of TestConstruction of Test
Construction of Test
 
Presentation validity
Presentation validityPresentation validity
Presentation validity
 

Viewers also liked

The role of corrective feedback in second language learning
The role of corrective feedback in second language learningThe role of corrective feedback in second language learning
The role of corrective feedback in second language learningAmir Hamid Forough Ameri
 
Thesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriThesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriAmir Hamid Forough Ameri
 
Critical pedagogy in l2 learning and teaching suresh canagarajah
Critical pedagogy in l2 learning and teaching  suresh canagarajahCritical pedagogy in l2 learning and teaching  suresh canagarajah
Critical pedagogy in l2 learning and teaching suresh canagarajahAmir Hamid Forough Ameri
 
Critical literacy and second language learning luke and dooley
Critical literacy and second language learning  luke and dooleyCritical literacy and second language learning  luke and dooley
Critical literacy and second language learning luke and dooleyAmir Hamid Forough Ameri
 
The task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewoodThe task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewoodAmir Hamid Forough Ameri
 
Task based research and language pedagogy ellis
Task based research and language pedagogy ellisTask based research and language pedagogy ellis
Task based research and language pedagogy ellisAmir Hamid Forough Ameri
 

Viewers also liked (20)

Sifakis 2007
Sifakis 2007Sifakis 2007
Sifakis 2007
 
The role of corrective feedback in second language learning
The role of corrective feedback in second language learningThe role of corrective feedback in second language learning
The role of corrective feedback in second language learning
 
Extroversion introversion
Extroversion introversionExtroversion introversion
Extroversion introversion
 
Exploring culture by ah forough ameri
Exploring culture by ah forough ameriExploring culture by ah forough ameri
Exploring culture by ah forough ameri
 
Maryam Bolouri
Maryam BolouriMaryam Bolouri
Maryam Bolouri
 
Thesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriThesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameri
 
Behavioral view of motivation
Behavioral view of motivationBehavioral view of motivation
Behavioral view of motivation
 
Language testing the social dimension
Language testing  the social dimensionLanguage testing  the social dimension
Language testing the social dimension
 
Critical pedagogy in l2 learning and teaching suresh canagarajah
Critical pedagogy in l2 learning and teaching  suresh canagarajahCritical pedagogy in l2 learning and teaching  suresh canagarajah
Critical pedagogy in l2 learning and teaching suresh canagarajah
 
Context culture .... m. wendt
Context culture .... m. wendtContext culture .... m. wendt
Context culture .... m. wendt
 
Integrated syllabus
Integrated syllabusIntegrated syllabus
Integrated syllabus
 
ANxiety bolouri
ANxiety bolouriANxiety bolouri
ANxiety bolouri
 
Swan.bolouri
Swan.bolouriSwan.bolouri
Swan.bolouri
 
attitide anxiety bolouri
 attitide anxiety bolouri attitide anxiety bolouri
attitide anxiety bolouri
 
Critical literacy and second language learning luke and dooley
Critical literacy and second language learning  luke and dooleyCritical literacy and second language learning  luke and dooley
Critical literacy and second language learning luke and dooley
 
Classroom assessment, glenn fulcher
Classroom assessment, glenn fulcherClassroom assessment, glenn fulcher
Classroom assessment, glenn fulcher
 
Tsui 2011
Tsui 2011Tsui 2011
Tsui 2011
 
The task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewoodThe task based approach some questions and suggestions littlewood
The task based approach some questions and suggestions littlewood
 
Notional functional syllabus
Notional functional syllabusNotional functional syllabus
Notional functional syllabus
 
Task based research and language pedagogy ellis
Task based research and language pedagogy ellisTask based research and language pedagogy ellis
Task based research and language pedagogy ellis
 

Similar to Reliability bachman 1990 chapter 6

Reliability in Language Testing
Reliability in Language Testing Reliability in Language Testing
Reliability in Language Testing Seray Tanyer
 
Valiadity and reliability- Language testing
Valiadity and reliability- Language testingValiadity and reliability- Language testing
Valiadity and reliability- Language testingPhuong Tran
 
Characteristics of a good test
Characteristics of a good test Characteristics of a good test
Characteristics of a good test Arash Yazdani
 
What makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docxWhat makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docxmecklenburgstrelitzh
 
Chapter 8 compilation
Chapter 8 compilationChapter 8 compilation
Chapter 8 compilationHannan Mahmud
 
Validity & reliability seminar
Validity & reliability seminarValidity & reliability seminar
Validity & reliability seminarmrikara185
 
Validity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their TypesValidity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their TypesMohammadRabbani18
 
Meaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptxMeaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptxsarat68
 
Louzel Report - Reliability & validity
Louzel Report - Reliability & validity Louzel Report - Reliability & validity
Louzel Report - Reliability & validity Louzel Linejan
 
Testing in language programs (chapter 8)
Testing in language programs (chapter 8)Testing in language programs (chapter 8)
Testing in language programs (chapter 8)Tahere Bakhshi
 
Topic validity
Topic validityTopic validity
Topic validitymikki khan
 
Reliability and validity of Research Data
Reliability and validity of Research DataReliability and validity of Research Data
Reliability and validity of Research DataAida Arifin
 
Research methods 2 operationalization & measurement
Research methods 2   operationalization & measurementResearch methods 2   operationalization & measurement
Research methods 2 operationalization & measurementattique1960
 
VALIDITY
VALIDITYVALIDITY
VALIDITYANCYBS
 

Similar to Reliability bachman 1990 chapter 6 (20)

Reliability in Language Testing
Reliability in Language Testing Reliability in Language Testing
Reliability in Language Testing
 
Valiadity and reliability- Language testing
Valiadity and reliability- Language testingValiadity and reliability- Language testing
Valiadity and reliability- Language testing
 
Characteristics of a good test
Characteristics of a good test Characteristics of a good test
Characteristics of a good test
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
 
What makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docxWhat makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docx
 
Chapter 8 compilation
Chapter 8 compilationChapter 8 compilation
Chapter 8 compilation
 
Validity & reliability seminar
Validity & reliability seminarValidity & reliability seminar
Validity & reliability seminar
 
Rep
RepRep
Rep
 
Validity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their TypesValidity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their Types
 
EM&E.pptx
EM&E.pptxEM&E.pptx
EM&E.pptx
 
Meaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptxMeaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptx
 
Quantitative analysis
Quantitative analysisQuantitative analysis
Quantitative analysis
 
Louzel Report - Reliability & validity
Louzel Report - Reliability & validity Louzel Report - Reliability & validity
Louzel Report - Reliability & validity
 
Testing in language programs (chapter 8)
Testing in language programs (chapter 8)Testing in language programs (chapter 8)
Testing in language programs (chapter 8)
 
Topic validity
Topic validityTopic validity
Topic validity
 
Reliability and validity of Research Data
Reliability and validity of Research DataReliability and validity of Research Data
Reliability and validity of Research Data
 
Research methods 2 operationalization & measurement
Research methods 2   operationalization & measurementResearch methods 2   operationalization & measurement
Research methods 2 operationalization & measurement
 
VALIDITY
VALIDITYVALIDITY
VALIDITY
 
Data processing
Data processingData processing
Data processing
 
Variables, Theory and Sampling Map
Variables, Theory and Sampling MapVariables, Theory and Sampling Map
Variables, Theory and Sampling Map
 

Recently uploaded

internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 

Recently uploaded (20)

internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 

Reliability bachman 1990 chapter 6

  • 1. RELIABILITY BASED ON: FUNDAMENTAL CONSIDERATIONS IN LANGUAGE TESTING BACHMAN (1990) CHAPTER 6 Prepared by: Amirhamid Foroughameri ahfameri@gmail.com November 2015
  • 2. INTRODUCTION  A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure.  We must be concerned about errors of measurement, or unreliability, as we know that test performance is affected by factors other than the abilities we want to measure. Poor health, fatigue, lack of interest test method facets or motivation, and test-wiseness … Minimizing the effects of these factors Minimizing measurement error maximizing reliability. Unsystematic (unpredictable) Systematic
  • 3. INTRODUCTION  A necessary condition for validity: in order for a test score to be valid, it must be reliable.  Reliability and validity are not two distinct concepts; they are complementary aspects of a common concern in measurement  Reliability answers the question: ‘How much of an individual’s test performance is due to measurement error, or to factors 0ther than the language ability we want to measure?’ and with minimizing the effects of these factors on test scores.  Validity answers the question: ‘How much of an individual’s test performance is due to the language abilities we want to measure?’ and with maximizing the effects of these abilities on test scores.
  • 4. INTRODUCTION  The investigation of reliability involves both logical analysis and empirical research.  We must identify sources of error and estimate the magnitude of their effects on test scores.  To identify sources of error, we need to distinguish the effects of the language abilities we want to measure from the effects of other factors, which is a complex problem due to: 1. The interaction between components of language ability and test method facets makes it difficult to mark a clear ‘boundary’ between the ability being measured and the method facets → a particular topic of a conversational interaction an oral interview. 2. Other characteristics, such as sex, age, cognitive style, and native language.  Estimating the magnitude of the effects of different sources of error, once these sources have been identified, is a matter of empirical research, and is a major concern of measurement theory.
  • 5. FACTORS THAT AFFECT LANGUAGE TEST SCORES Thorndike (1951) and Stanley (1971) begin their treatments of reliability with general frameworks for describing the factors that cause test scores to vary from individual to individual: general and specific lasting characteristics, general and specific temporary characteristics, and systematic and chance factors related to test administration and scoring.
  • 6. FACTORS THAT AFFECT LANGUAGE TEST SCORES  Factors that affect language test scores (Bachman 1990): Note: In a ‘path diagram’ rectangles: observed variables, ovals: unobserved variables, straight arrows: causal relationships Systematic Unsystematic Test Score Communicative language ability Personal attributes Test method facets Random factors
  • 7. FACTORS THAT AFFECT LANGUAGE TEST SCORES  Test method facets are systematic to the extent that they are uniform from one test administration to the next. That is, if the input format facet is multiple- choice, this will not vary, whether the test is given in the morning or afternoon.  Attributes of individuals include individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background. These are also systematic in the sense that they are likely to affect a given individual’s test performance regularly.  An individual’s test score will be affected to some degree by unsystematic, or random factors: unpredictable and largely temporary conditions, such as his mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next, or idioosyncratic differences in the way different test administrators carry out their responsibilities.
  • 8. FACTORS THAT AFFECT LANGUAGE TEST SCORES  Random factors and test method facets are generally considered to be sources of measurement error, and have thus been the primary concern of approaches to estimating reliability.  Personal attributes that are not considered part of the ability tested, such as sex, ethnic background, cognitive style and prior knowledge of content area, on the other hand, have traditionally been discussed as sources of test bias, or test invalidity.
  • 9. CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY  This theory consists of a set of assumptions about the relationships between actual, or observed test scores and the factors that affect these scores. Assumption 1: an observed score on a test comprises two factors or components: a true score that is due to an individual’s level of ability and an error score, that is due to factors other than the ability being tested.’ → x = Xt + Xe where x is the observed score, Xt is the true score, and Xe the error score.  the variance of a set of test scores consists of two components: S2 x = S2 t + S2 e  where S2 x is the observed score variance, S2 t is the true score variance component, and S2 e is the error score variance component.
  • 10.  Assumption 2: the relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores.  CTS model’s definition of measurement error: that variation in a set of test scores that is unsystematic or random In CTS→ Two sources of variance: True score variance due to differences in ability Measurement error (unsystematic)
  • 11. PARALLEL TESTS  In order for two tests to be considered parallel, we assume that they are measures of the same ability, that is, that an individual’s true score on one test will be the same as his true score on the other.  Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal.  Operational definition: parallel tests are two tests of the same ability that have the same means and variances and are equally correlated with other tests of that ability.  Virtually we never have strictly parallel tests, we treat two tests as if they were parallel if the differences between their means and variances are not statistically significant. Equivalent tests; alternate forms.
  • 12. RELIABILITY AS THE CORRELATION BETWEEN PARALLEL TESTS  Since we never know what the true or error scores are, we cannot know the reliability of the observed scores. To be able to estimate the reliability of observed scores, then, we must define reliability operationally in a way that depends only on observed scores.  Thus, if the observed scores on two parallel tests are highly correlated, this indicates that effects of the error scores are minimal, and that they can be considered reliable indicators of the ability being measured.  The definition of reliability (the basis for all estimates of reliability within CTS theory): the correlation between the observed scores on two parallel tests, which we can symbolize as rxx’.  Assumption: the observed scores on the two tests are experimentally independent. That is, an individual's performance on the second test should not depend on how she performs on the first.
  • 13. CORRELATIONS BETWEEN TRUE AND OBSERVED SCORES ON PARALLEL TESTS = rxx’ Ability x’t x’e xt xe x X’
  • 14. RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF OBSERVED SCORE VARIANCE If an individual's observed score on a test is composed of a true score and an error score, the greater the proportion of true score, the less the proportion of error score, and thus the more reliable the observed score. Thus, one way of defining reliability is as the proportion of the observed score variance that is true score variance: rxx’ = s2 t/ s2 x Note: reliability refers to the test scores, and not the test itself.
  • 15. APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS Internal consistency: concerned primarily with sources of error from within the test and scoring procedures, Stability: how consistent test scores are over time, • The estimates of reliability that these approaches yield are called reliability coefficients. Equivalence: an indication of the extent to which scores on alternate forms of a test are equivalent.
  • 16. INTERNAL CONSISTENCY  Internal consistency is concerned with how consistent test takers’ performances on the different parts of the test are with each other.  Inconsistencies in performance on different parts of tests can be caused by a number of factors, including the test method facets. SPLIT-HALF RELIABILITY ESTIMATES  One approach to examining the internal consistency of a test is the split-half method, in which we divide the test into two halves and then determine the extent to which scores on these two halves are consistent with each other.  In so doing, we are treating the halves as parallel tests, and so we must make certain assumptions about the equivalence of the two halves, specifically that they have equal means and variances. In addition, we must also assume that the two halves are independent of each other.
  • 17. INTERNAL CONSISTENCY  In some cases, where we are not sure that the items are measuring the same ability or that they are independent of each other, the test-retest and parallel forms methods, are more appropriate for estimating reliability.  The Spearman-Brown split-half estimate  Once the test has been split into halves, it is rescored, yielding two scores - one for each half - for each test taker.  In one approach to estimating reliability, we then compute the correlation between the two sets of scores. This gives us an estimate of how consistent the halves are, however, and we are interested in the reliability of the whole test.  In general, a long test will be more reliable than a short one, assuming that the additional items correlate positively with the other items in the test.
  • 18. INTERNAL CONSISTENCY  Two assumptions must be met in order to use this method: First, since we are in effect treating the two halves as parallel tests, we must assume that they have equal means and variances (an assumption we can check). Second, we must assume that the two halves are experimentally independent of each other (an assumption that is very difficult to check). That is, that an individual's performance on one half does not affect how he performs on the other.
  • 19. INTERNAL CONSISTENCY  The Guttman sp!it-bdf estimate  Another approach to estimating reliability from split-halves is that developed by Guttman (1945), which does not assume equivalence of the halves, and which does not require computing a correlation between them. This split-half reliability coefficient is based on the ratio of the sum of the variances of the two halves to the variance of the whole test:  Since this formula is based on the variance of the total test, it provides a direct estimate of the reliability of the whole test.  Therefore, unlike Spearman-Brown, the Guttman split-half estimate does not require an additional correction for length.
  • 20. INTERNAL CONSISTENCY  Reliability estimates based on item variances 1. Kuder-Richardson reliability coefficients  There is a way of estimating the average of all the possible split-half coefficients on the basis of the statistical characteristics of the test items.  This approach developed by Kuder and Richardson (1937), involves computing the means and variances of the items that constitute the test.  The mean of a dichotomous item (one that is scored as either right or wrong) is the proportion, symbolized as p, of individuals who answer the item correctly. The  proportion of individuals who answer the item incorrectly is equal to 1 - p, and is symbolized as q.  The variance of a dichotomous item is the product of these two proportions, or pq.
  • 21. INTERNAL CONSISTENCY  The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20), based on the ratio of the sum of the item variances to the total test score variance, is as follows:  Assumption: the items are of nearly equal difficulty and independent of each other.  If the items are of equal difficulty, the reliability coefficient can be computed by using Kuder-Richardson formula 21 (KR-21:  This formula will generally yield a reliability coefficient that is lower than that given by KR-20. The Kuder-Richardson formulae are based on total score variance, and thus they do not require any correction for length.
  • 22. INTERNAL CONSISTENCY 2. Coefficient alpha  Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate reliability on the basis of ratios of the variances of test components - halves and items - to total test score variance.  Cronbach (1951) developed a general formula for estimating internal consistency which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s alpha’:  It can thus be shown that all estimates of reliability based on the analysis of variance components can be derived from this formula and are special cases of coefficient alpha.
  • 23. INTERNAL CONSISTENCY  In summary, the reliability of a set of test scores can be estimated on the basis of a single test administration only if certain assumptions about the characteristics of the parts of the test are satisfied. Assumptions for internal consistency reliability estimates, and effects of violating assumptions
  • 24. RATER CONSISTENCY  In test scores that are obtained subjectively, such as ratings of compositions or oral interviews, a source of error is inconsistency in these ratings.  In the case of a single rater, we need to be concerned about the consistency within that individual’s ratings, or with intrarater reliability.  When there are several different raters, we want to examine the consistency across raters, or inter-rater reliability.  In both cases, the primary causes of inconsistency will be either the application of different rating criteria to different samples or the inconsistent application of the rating criteria to different samples.
  • 25. RATER CONSISTENCY  Intra-rater reliability Factors introducing inconsistency:  Sequence of paying attention to different kinds of errors  Sequence of scoring from the 1st to the last person  We need to obtain at least two independent ratings from a rater for each individual language sample.  This is typically accomplished by rating the individual samples once and then re- rating them at a later time in a different, random order.  Once the two sets of ratings have been obtained, the reliability between them can be estimated in two ways.
  • 26. RATER CONSISTENCY  One way is to treat the two sets of ratings as scores from parallel tests and compute the appropriate correlation coefficient (commonly the Spearman rank- order coefficient) between the two sets of ratings, interpreting this as an estimate of reliability.  Another approach to examining the consistency of multiple ratings is to compute a coefficient alpha, treating the independent ratings as different parts:
  • 27. RATER CONSISTENCY  Inter-rater reliability Factors introducing inconsistency:  Different criteria for rating  Different interpretations of the same rating criteria  The reliability can be estimated in two ways:  We can compute the correlation between two different raters and interpret this as an estimate of reliability.  When more than two raters are involved, however, rather than computing correlations for all different pairs, a preferable approach is that recommended by Ebel (1979), in which we sum the ratings of the different raters and then estimate the reliability of these summed ratings by computing a coefficient alpha.
  • 28. STABILITY (TEST-RETEST RELIABILITY)  This approach can be used in three cases:  For tests such as cloze and dictation we cannot appropriately estimate the internal consistency of the scores because of the interdependence of the parts of the test.  There are also testing situations in which it may be necessary to administer a test more than once as part of a time-series design.  This might also be the concern of a language program evaluator who is interested in relating changes in language ability to teaching and learning activities in the program.  This approach to reliability provides an estimate of the stability of the test scores over time.
  • 29. STABILITY (TEST-RETEST RELIABILITY)  In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. This correlation can then be interpreted as an indication of how stable the scores are over time.  Two sources of inconsistency - differential practice effects and differential changes in ability -pose a dilemma for the test-retest approach.  That is, we must assume that both practice and learning (or unlearning) effects are either uniform across individuals or random.  Practice effects may occur if certain individuals remember some of the items or feel more comfortable with the test method, and consequently perform better on the second administration of the test.  If, on the other hand, there is a considerable time lapse between test administrations, some individuals’ language ability may actually improve or decline more than that of others, causing them to perform differently the second time.
  • 30. STABILITY (TEST-RETEST RELIABILITY) Possible Solution: There is no single length of time between test administrations that is best for all situations. In each situation, the test developer or user must attempt to determine the extent to which practice and learning are likely to influence test performance, and choose the length of time between test and retest so as to optimize reduction in the effects of both.
  • 31. EQUIVALENCE (PARALLEL FORMS RELIABILITY)  Like the test-retest approach, this is an appropriate means of estimating the reliability of tests for which internal consistency estimates are either inappropriate or not possible.  It is of particular interest in testing situations where alternate forms of the test may be actually used, either for security reasons, or to minimize the practice effect.  Assumption: that the different forms of the test are equivalent, particularly that they are at the same difficulty level and have similar standard deviations.
  • 32. EQUIVALENCE (PARALLEL FORMS RELIABILITY)  To estimate the reliability of alternate forms of a given test, the procedure used is to administer both forms to a group of individuals.  One way to minimize the possibility of an ordering effect is to use a  ‘counterbalanced’ design, in which half of the individuals take one form first and the other half take the other form first.  The means and standard deviations for each of the two forms can then be computed and compared to determine their equivalence, after which the correlation between the two sets of scores (Form A and Form B) can be computed. This correlation is then interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one.
  • 33. Note: o As a matter of procedure, we generally attempt to estimate the internal consistency of a test first, since if a test is not reliable in this respect, it is not likely to be equivalent to other forms or stable over time. o This is because measurement error is random and therefore not correlated with anything. o Thus, the greater the proportion of measurement error in the scores from a given test, the lower the correlation of those scores with other scores will be.
  • 34. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL 1. Different sources of error may interact with each other, even when we carefully design our reliability study. One problem with the CTS model, then, is that it treats error variance as homogeneous in origin. Each of the estimates of reliability addresses one specific source of error, and treats other potential sources either as part of that source, or as true score.  In the classical model, therefore, different sources of error may be confused, or confounded with each other and with true score variance, since it is not possible to examine more than one source of error at a time, even though performance on any given test may be affected by several different sources of error simultaneously.
  • 35. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL 2. A second, related problem is that the CTS model considers all error to be random, and consequently fails to distinguish systematic error from random error.  Factors other than the ability being measured that regularly affect the performance of some individuals and not others can be regarded as sources of systematic error or test bias.  Sources of systematic error:  Test method  Cultural content  Psychological task set  Guessing or what Carroll (1961b) called ‘topastic error’.