RELIABILITY
BASED ON:
FUNDAMENTAL CONSIDERATIONS IN
LANGUAGE TESTING
BACHMAN (1990)
CHAPTER 6
Prepared by: Amirhamid Foroughameri
ahfameri@gmail.com
November 2015
INTRODUCTION
 A fundamental concern in the development and use of language tests is to identify
potential sources of error in a given measure of communicative language ability and to
minimize the effect of these factors on that measure.
 We must be concerned about errors of measurement, or unreliability, as we know that
test performance is affected by factors other than the abilities we want to measure.
Unsystematic (unpredictable) factors: poor health, fatigue, lack of interest or motivation, and test-wiseness …
Systematic factors: test method facets
Minimizing the effects of these factors → minimizing measurement error → maximizing reliability.
INTRODUCTION
 A necessary condition for validity: in order for a test score to be valid, it must be
reliable.
 Reliability and validity are not two distinct concepts; they are complementary
aspects of a common concern in measurement
 Reliability is concerned with the question: ‘How much of an individual’s test performance
is due to measurement error, or to factors other than the language ability we want
to measure?’ and with minimizing the effects of these factors on test scores.
 Validity is concerned with the question: ‘How much of an individual’s test performance is
due to the language abilities we want to measure?’ and with maximizing the effects
of these abilities on test scores.
INTRODUCTION
 The investigation of reliability involves both logical analysis and empirical
research.
 We must identify sources of error and estimate the magnitude of their effects on
test scores.
 To identify sources of error, we need to distinguish the effects of the language
abilities we want to measure from the effects of other factors, which is a complex
problem due to:
1. The interaction between components of language ability and test method facets
makes it difficult to mark a clear ‘boundary’ between the ability being measured and
the method facets → e.g., a particular topic of a conversational interaction in an oral
interview.
2. Other characteristics, such as sex, age, cognitive style, and native language.
 Estimating the magnitude of the effects of different sources of error, once these
sources have been identified, is a matter of empirical research, and is a major
concern of measurement theory.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
Thorndike (1951) and Stanley (1971) begin their treatments of
reliability with
general frameworks for describing the factors that cause test
scores to vary from individual to individual:
general and specific lasting characteristics,
general and specific temporary characteristics,
and systematic and chance factors related to test
administration and scoring.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Factors that affect language test scores (Bachman 1990):
Note: In a ‘path diagram’
rectangles: observed variables,
ovals: unobserved variables,
straight arrows: causal relationships
[Figure: path diagram of the factors that affect the test score. Systematic factors: communicative language ability, personal attributes, and test method facets. Unsystematic factors: random factors.]
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Test method facets are systematic to the extent that they are uniform from
one test administration to the next. That is, if the input format facet is multiple-
choice, this will not vary, whether the test is given in the morning or afternoon.
 Attributes of individuals include individual characteristics such as cognitive style
and knowledge of particular content areas, and group characteristics such as
sex, race, and ethnic background. These are also systematic in the sense that they
are likely to affect a given individual’s test performance regularly.
 An individual’s test score will be affected to some degree by unsystematic, or
random factors: unpredictable and largely temporary conditions, such as his
mental alertness or emotional state, and uncontrolled differences in test method
facets, such as changes in the test environment from one day to the next, or
idiosyncratic differences in the way different test administrators carry out their
responsibilities.
FACTORS THAT AFFECT LANGUAGE TEST SCORES
 Random factors and test method facets are generally considered to
be sources of measurement error, and have thus been the primary
concern of approaches to estimating reliability.
 Personal attributes that are not considered part of the ability tested,
such as sex, ethnic background, cognitive style and prior knowledge of
content area, on the other hand, have traditionally been discussed as
sources of test bias, or test invalidity.
CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY
 This theory consists of a set of assumptions about the relationships between
actual, or observed test scores and the factors that affect these scores.
Assumption 1: an observed score on a test comprises two factors or components: a
true score that is due to an individual’s level of ability and an error score that is due
to factors other than the ability being tested.
→ $x = x_t + x_e$, where $x$ is the observed score, $x_t$ is the true score, and $x_e$ is the error
score.
 The variance of a set of test scores consists of two components:
$s_x^2 = s_t^2 + s_e^2$
 where $s_x^2$ is the observed score variance, $s_t^2$ is the true score variance component,
and $s_e^2$ is the error score variance component.
 Assumption 2: the relationship between true and error scores: error scores are
unsystematic, or random, and are uncorrelated with true scores.
 CTS model’s definition of measurement error:
that variation in a set of test scores that is unsystematic or random
In CTS → two sources of variance:
1. True score variance, due to differences in ability
2. Measurement error (unsystematic)
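The variance decomposition in the CTS model can be checked on a small example. The sketch below is only an illustration: the true and error scores are invented (in practice neither is observable), and they are constructed so that the error scores have a mean of zero and are uncorrelated with the true scores, as Assumption 2 requires.

```python
def pvariance(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


# Hypothetical true and error scores for four test takers (illustrative only).
# The error scores average zero and are uncorrelated with the true scores.
true_scores = [10, 10, 20, 20]
error_scores = [1, -1, 1, -1]

# Observed score = true score + error score.
observed = [t + e for t, e in zip(true_scores, error_scores)]

var_true = pvariance(true_scores)
var_error = pvariance(error_scores)
var_observed = pvariance(observed)

# Observed score variance is the sum of the two components.
print(var_true, var_error, var_observed)  # 25.0 1.0 26.0
```

If the error scores were correlated with the true scores, a covariance term would appear and the simple sum would no longer hold.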
PARALLEL TESTS
 In order for two tests to be considered parallel, we assume
that they are measures of the same ability, that is, that an individual’s true score on
one test will be the same as his true score on the other.
 Two tests are parallel if, for every group of persons taking both tests,
(1) the true score on one test is equal to the true score on the other, and
(2) the error variances for the two tests are equal.
 Operational definition: parallel tests are two tests of the same ability that have
the same means and variances and are equally correlated with other tests of that
ability.
 Because we virtually never have strictly parallel tests, we treat two tests as if they were
parallel if the differences between their means and variances are not statistically
significant. Such tests are called equivalent tests, or alternate forms.
RELIABILITY AS THE CORRELATION
BETWEEN PARALLEL TESTS
 Since we never know what the true or error scores are, we cannot know the
reliability of the observed scores. To be able to estimate the reliability of observed
scores, then, we must define reliability operationally in a way that depends only on
observed scores.
 Thus, if the observed scores on two parallel tests are highly correlated, this
indicates that effects of the error scores are minimal, and that they can be
considered reliable indicators of the ability being measured.
 The definition of reliability (the basis for all estimates of reliability within CTS
theory): the correlation between the observed scores on two parallel tests, which
we can symbolize as rxx’.
 Assumption: the observed scores on the two tests are experimentally independent.
That is, an individual's performance on the second test should not depend on how
she performs on the first.
CORRELATIONS BETWEEN TRUE AND OBSERVED SCORES ON
PARALLEL TESTS
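[Figure: path diagram for two parallel tests. A single ability underlies the equal true scores ($x_t = x'_t$); each observed score ($x$, $x'$) is also affected by its own error score ($x_e$, $x'_e$); $r_{xx'}$ is the correlation between the two observed scores.]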
RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF
OBSERVED SCORE VARIANCE
If an individual's observed score on a test is composed of a
true score and an error score, the greater the proportion of
true score, the less the proportion of error score, and thus the
more reliable the observed score.
Thus, one way of defining reliability is as the proportion of
the observed score variance that is true score variance:
$r_{xx'} = s_t^2 / s_x^2$
Note: reliability refers to the test scores, and not the test
itself.
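The two definitions of reliability (the correlation between parallel tests, and the proportion of observed score variance that is true score variance) can be compared on invented data. This is a sketch under idealized assumptions: the true and error scores below are hypothetical, both tests share the same true scores, and the two sets of error scores have equal variance and are uncorrelated with each other and with the true scores.

```python
import math


def pvariance(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def pearson(xs, ys):
    """Pearson product-moment correlation between two lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_products / math.sqrt(ss_x * ss_y)


# Hypothetical scores for eight test takers (illustrative only).
true_scores = [10, 10, 10, 10, 20, 20, 20, 20]
errors_test_1 = [1, -1, 1, -1, 1, -1, 1, -1]
errors_test_2 = [1, 1, -1, -1, 1, 1, -1, -1]

observed_1 = [t + e for t, e in zip(true_scores, errors_test_1)]
observed_2 = [t + e for t, e in zip(true_scores, errors_test_2)]

# Definition 1: correlation between observed scores on the parallel tests.
r_parallel = pearson(observed_1, observed_2)

# Definition 2: proportion of observed variance that is true score variance.
proportion_true = pvariance(true_scores) / pvariance(observed_1)

print(round(r_parallel, 4), round(proportion_true, 4))  # 0.9615 0.9615
```

With real data only the first quantity can be computed, which is why the operational definition depends on observed scores alone.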
APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS
Internal consistency: concerned primarily with sources of error from within the test and scoring procedures.
Stability: how consistent test scores are over time.
Equivalence: an indication of the extent to which scores on alternate forms of a test are equivalent.
• The estimates of reliability that these approaches yield are called reliability coefficients.
INTERNAL CONSISTENCY
 Internal consistency is concerned with how consistent test takers’ performances
on the different parts of the test are with each other.
 Inconsistencies in performance on different parts of tests can be caused by a
number of factors, including the test method facets.
SPLIT-HALF RELIABILITY ESTIMATES
 One approach to examining the internal consistency of a test is the split-half
method, in which we divide the test into two halves and then determine the extent
to which scores on these two halves are consistent with each other.
 In so doing, we are treating the halves as parallel tests, and so we must make
certain assumptions about the equivalence of the two halves, specifically that they
have equal means and variances. In addition, we must also assume that the two
halves are independent of each other.
INTERNAL CONSISTENCY
 In some cases, where we are not sure that the items are measuring the same ability
or that they are independent of each other, the test-retest and parallel forms
methods are more appropriate for estimating reliability.
 The Spearman-Brown split-half estimate
 Once the test has been split into halves, it is rescored, yielding two scores - one for
each half - for each test taker.
 In one approach to estimating reliability, we then compute the correlation between
the two sets of scores. However, this correlation tells us only how consistent the halves
are with each other, whereas we are interested in the reliability of the whole test, so a
correction for length is needed.
 In general, a long test will be more reliable than a short one, assuming that the
additional items correlate positively with the other items in the test.
INTERNAL CONSISTENCY
 Two assumptions must be met in order to use this method:
First, since we are in effect treating the two halves as parallel tests, we must assume
that they have equal means and variances (an assumption we
can check).
Second, we must assume that the two halves are experimentally independent of each
other (an assumption that is very difficult to check). That is, that an individual's
performance on one half does not affect how he performs on the other.
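The correction for length used in the Spearman-Brown split-half estimate is the standard formula $r_{xx'} = 2r_{hh'} / (1 + r_{hh'})$, where $r_{hh'}$ is the correlation between the two halves. A minimal sketch follows; the function name and the example half-test correlation are illustrative, not taken from the slides.

```python
def spearman_brown_split_half(r_halves):
    """Estimate whole-test reliability from the correlation between two halves.

    Assumes the halves behave as parallel tests and are independent.
    """
    return (2 * r_halves) / (1 + r_halves)


# Hypothetical correlation between scores on the two halves.
r_halves = 0.60
r_whole = spearman_brown_split_half(r_halves)

# The whole-test estimate (about 0.75) is higher than the half-test
# correlation, reflecting the greater length of the full test.
print(round(r_whole, 2))
```

The estimate is always at least as large as the half-test correlation when that correlation is positive, which is consistent with the point that a longer test is generally more reliable.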
INTERNAL CONSISTENCY
 The Guttman split-half estimate
 Another approach to estimating reliability from split-halves is that developed by
Guttman (1945), which does not assume equivalence of the halves, and which does
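not require computing a correlation between them. This split-half reliability
coefficient is based on the ratio of the sum of the variances of the two halves to the
variance of the whole test:
$r_{xx'} = 2\left(1 - \frac{s_{h1}^2 + s_{h2}^2}{s_x^2}\right)$
where $s_{h1}^2$ and $s_{h2}^2$ are the variances of the two halves and $s_x^2$ is the variance of the whole test.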
 Since this formula is based on the variance of the total test, it provides a direct
estimate of the reliability of the whole test.
 Therefore, unlike Spearman-Brown, the Guttman split-half estimate does not
require an additional correction for length.
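The Guttman split-half estimate can be sketched in a few lines. The half-test scores below are invented for illustration, and population variances are used throughout (any variance convention works as long as it is applied consistently to the halves and the total).

```python
def pvariance(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def guttman_split_half(half_1, half_2):
    """Guttman split-half reliability: 2 * (1 - (var_h1 + var_h2) / var_total)."""
    totals = [a + b for a, b in zip(half_1, half_2)]
    return 2 * (1 - (pvariance(half_1) + pvariance(half_2)) / pvariance(totals))


# Hypothetical half-test scores for four test takers.
half_1 = [1, 2, 3, 4]
half_2 = [1, 2, 3, 5]

estimate = guttman_split_half(half_1, half_2)
print(round(estimate, 3))  # 0.972
```

No correlation between the halves is computed and no length correction is applied, because the total test variance already enters the formula.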
INTERNAL CONSISTENCY
 Reliability estimates based on item variances
1. Kuder-Richardson reliability coefficients
 There is a way of estimating the average of all the possible split-half coefficients
on the basis of the statistical characteristics of the test items.
 This approach developed by Kuder and Richardson (1937), involves computing
the means and variances of the items that constitute the test.
 The mean of a dichotomous item (one that is scored as either right or wrong) is
the proportion, symbolized as p, of individuals who answer the item correctly.
 The proportion of individuals who answer the item incorrectly is equal to 1 - p, and is
symbolized as q.
 The variance of a dichotomous item is the product of these two proportions, or pq.
INTERNAL CONSISTENCY
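 The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20),
based on the ratio of the sum of the item variances to the total test score variance, is
as follows:
$r_{xx'} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)$
where $k$ is the number of items, $\sum pq$ is the sum of the item variances, and $s_x^2$ is the total test score variance.
 Assumption: the items are of nearly equal difficulty and independent of each other.
 If the items are of equal difficulty, the reliability coefficient can be computed by
using Kuder-Richardson formula 21 (KR-21):
$r_{xx'} = \frac{k s_x^2 - \bar{x}(k - \bar{x})}{(k-1) s_x^2}$
where $\bar{x}$ is the mean total test score.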
 This formula will generally yield a reliability coefficient that is lower than that
given by KR-20. The Kuder-Richardson formulae are based on total score variance,
and thus they do not require any correction for length.
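Both Kuder-Richardson formulae can be tried on a small matrix of dichotomous item scores. The response data below are invented for illustration; the functions use the standard KR-20 and KR-21 formulae with population variances, so that the item variance pq and the total score variance follow the same convention.

```python
def pvariance(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def kr20(responses):
    """KR-20 from a list of per-person lists of 0/1 item scores."""
    n_persons = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    sum_pq = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / n_persons
        sum_pq += p * (1 - p)  # variance of a dichotomous item
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))


def kr21(responses):
    """KR-21, which needs only the number of items and the total score mean and variance."""
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / len(totals)
    var_total = pvariance(totals)
    return (k * var_total - mean_total * (k - mean_total)) / ((k - 1) * var_total)


# Hypothetical right/wrong scores: five test takers, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(responses), 3))  # 0.8
print(round(kr21(responses), 3))  # 0.667
```

In this example the items differ in difficulty (p ranges from 0.2 to 0.8), and KR-21 comes out lower than KR-20.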
INTERNAL CONSISTENCY
2. Coefficient alpha
 Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate
reliability on the basis of ratios of the variances of test components - halves and
items - to total test score variance.
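 Cronbach (1951) developed a general formula for estimating internal consistency
which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s
alpha’:
$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$
where $k$ is the number of parts (items or halves), $\sum s_i^2$ is the sum of the variances of the parts, and $s_x^2$ is the total test score variance.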
 It can thus be shown that all estimates of reliability based on the analysis of
variance components can be derived from this formula and are special cases of
coefficient alpha.
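The relationship between coefficient alpha and the earlier estimates can be illustrated numerically. In the sketch below, the item responses are invented, and the odd/even split into halves is just one possible split; alpha applied to dichotomous items reproduces KR-20, and alpha applied to two halves reproduces the Guttman split-half estimate.

```python
def pvariance(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def coefficient_alpha(parts):
    """Coefficient alpha from a list of parts, each a list of per-person scores."""
    k = len(parts)
    totals = [sum(person_scores) for person_scores in zip(*parts)]
    sum_part_variances = sum(pvariance(part) for part in parts)
    return (k / (k - 1)) * (1 - sum_part_variances / pvariance(totals))


# Hypothetical right/wrong scores on four items for five test takers
# (each inner list holds one item's scores across persons).
item_1 = [1, 1, 1, 1, 0]
item_2 = [1, 1, 1, 0, 0]
item_3 = [1, 1, 0, 0, 0]
item_4 = [1, 0, 0, 0, 0]

# Parts = items: for dichotomous items this is KR-20.
alpha_items = coefficient_alpha([item_1, item_2, item_3, item_4])

# Parts = two halves (odd and even items): this is the Guttman split-half estimate.
odd_half = [a + b for a, b in zip(item_1, item_3)]
even_half = [a + b for a, b in zip(item_2, item_4)]
alpha_halves = coefficient_alpha([odd_half, even_half])

print(round(alpha_items, 2), round(alpha_halves, 2))  # 0.8 0.88
```

The two values differ because a particular split into halves is only one of many possible splits.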
INTERNAL CONSISTENCY
 In summary, the reliability of a set of test scores can be estimated on the basis of a
single test administration only if certain assumptions about the characteristics of
the parts of the test are satisfied.
[Table: assumptions for internal consistency reliability estimates, and effects of violating assumptions]
RATER CONSISTENCY
 In test scores that are obtained subjectively, such as ratings of compositions or
oral interviews, a source of error is inconsistency in these ratings.
 In the case of a single rater, we need to be concerned about the consistency within
that individual’s ratings, or intra-rater reliability.
 When there are several different raters, we want to examine the consistency across
raters, or inter-rater reliability.
 In both cases, the primary causes of inconsistency will be either the
application of different rating criteria to different samples or
the inconsistent application of the rating criteria to different samples.
RATER CONSISTENCY
 Intra-rater reliability
Factors introducing inconsistency:
 Sequence of paying attention to different kinds of errors
 Sequence of scoring from the 1st to the last person
 We need to obtain at least two independent ratings from a rater for each individual
language sample.
 This is typically accomplished by rating the individual samples once and then re-
rating them at a later time in a different, random order.
 Once the two sets of ratings have been obtained, the reliability between them can
be estimated in two ways.
RATER CONSISTENCY
 One way is to treat the two sets of ratings as scores from parallel tests and
compute the appropriate correlation coefficient (commonly the Spearman rank-
order coefficient) between the two sets of ratings, interpreting this as an estimate
of reliability.
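 Another approach to examining the consistency of multiple ratings is to compute a
coefficient alpha, treating the independent ratings as different parts:
$\alpha = \frac{k}{k-1}\left(1 - \frac{s_{x1}^2 + s_{x2}^2}{s_{x1+x2}^2}\right)$
where $k$ is the number of ratings, $s_{x1}^2$ and $s_{x2}^2$ are the variances of the two sets of ratings, and $s_{x1+x2}^2$ is the variance of the summed ratings.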
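The correlational approach to intra-rater reliability can be sketched with the Spearman rank-order coefficient, computed here as the Pearson correlation between ranks (tied ratings receive the average of their ranks). The two sets of ratings are invented for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1 (lowest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson product-moment correlation between two lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_products / math.sqrt(ss_x * ss_y)


def spearman_rho(first, second):
    """Spearman rank-order correlation between two sets of ratings."""
    return pearson(average_ranks(first), average_ranks(second))


# Hypothetical ratings of five compositions by one rater on two occasions.
first_rating = [4, 3, 5, 2, 1]
second_rating = [5, 3, 4, 2, 1]

print(round(spearman_rho(first_rating, second_rating), 2))  # 0.9
```

The rater swapped the order of the two highest-rated samples on the second occasion, which is the only source of disagreement in this example.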
RATER CONSISTENCY
 Inter-rater reliability
Factors introducing inconsistency:
 Different criteria for rating
 Different interpretations of the same rating criteria
 The reliability can be estimated in two ways:
 We can compute the correlation between two different raters and interpret this as
an estimate of reliability.
 When more than two raters are involved, however, rather than computing
correlations for all different pairs, a preferable approach is that recommended by
Ebel (1979), in which we sum the ratings of the different raters and then estimate
the reliability of these summed ratings by computing a coefficient alpha.
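The summed-ratings approach for more than two raters can be sketched by treating each rater as one part in coefficient alpha. The ratings below are invented, and the function is an illustration of the general alpha formula rather than a reproduction of Ebel's own worked procedure.

```python
def pvariance(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def alpha_for_summed_ratings(ratings_by_rater):
    """Coefficient alpha for summed ratings; each rater is treated as one part."""
    k = len(ratings_by_rater)
    summed = [sum(sample_ratings) for sample_ratings in zip(*ratings_by_rater)]
    sum_rater_variances = sum(pvariance(r) for r in ratings_by_rater)
    return (k / (k - 1)) * (1 - sum_rater_variances / pvariance(summed))


# Hypothetical ratings of four oral interviews by three raters.
rater_a = [2, 3, 4, 5]
rater_b = [3, 3, 4, 5]
rater_c = [2, 4, 4, 5]

alpha = alpha_for_summed_ratings([rater_a, rater_b, rater_c])
print(round(alpha, 3))  # 0.949
```

A single coefficient is obtained for the whole set of raters, instead of a separate correlation for each of the three rater pairs.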
STABILITY (TEST-RETEST RELIABILITY)
 This approach can be used in three cases:
 For tests such as cloze and dictation we cannot appropriately estimate the internal
consistency of the scores because of the interdependence of the parts of the test.
 There are also testing situations in which it may be necessary to administer a test
more than once as part of a time-series design.
 This might also be the concern of a language program evaluator who is interested
in relating changes in language ability to teaching and learning activities in the
program.
 This approach to reliability provides an estimate of the stability of the test scores
over time.
STABILITY (TEST-RETEST RELIABILITY)
 In this approach, we administer the test twice to a group of individuals and then
compute the correlation between the two sets of scores. This correlation can then
be interpreted as an indication of how stable the scores are over time.
 Two sources of inconsistency - differential practice effects and differential
changes in ability - pose a dilemma for the test-retest approach.
 That is, we must assume that both practice and learning (or unlearning) effects
are either uniform across individuals or random.
 Practice effects may occur if certain individuals remember some of the items or
feel more comfortable with the test method, and consequently perform better on
the second administration of the test.
 If, on the other hand, there is a considerable time lapse between test
administrations, some individuals’ language ability may actually improve or
decline more than that of others, causing them to perform differently the second
time.
STABILITY (TEST-RETEST RELIABILITY)
Possible Solution:
There is no single length of time between test
administrations that is best for all situations.
In each situation, the test developer or user must
attempt to determine the extent to which practice and
learning are likely to influence test performance, and
choose the length of time between test and retest so as
to optimize reduction in the effects of both.
EQUIVALENCE (PARALLEL FORMS RELIABILITY)
 Like the test-retest approach, this is an appropriate means of
estimating the reliability of tests for which internal consistency
estimates are either inappropriate or not possible.
 It is of particular interest in testing situations where alternate forms of
the test may be actually used, either for security reasons, or to
minimize the practice effect.
 Assumption: that the different forms of the test are equivalent,
particularly that they are at the same difficulty level and have similar
standard deviations.
EQUIVALENCE (PARALLEL FORMS RELIABILITY)
 To estimate the reliability of alternate forms of a given test, the procedure used is
to administer both forms to a group of individuals.
 One way to minimize the possibility of an ordering effect is to use a
‘counterbalanced’ design, in which half of the individuals take one form first and
the other half take the other form first.
 The means and standard deviations for each of the two forms can then be
computed and compared to determine their equivalence, after which the
correlation between the two sets of scores (Form A and Form B) can be computed.
This correlation is then interpreted as an indicator of the equivalence of the two
tests, or as an estimate of the reliability of either one.
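The descriptive part of the parallel forms procedure can be sketched as follows. The scores on Form A and Form B are invented; the code only computes the means, standard deviations, and correlation, and does not carry out the significance tests that would be needed to judge whether the forms are actually equivalent.

```python
import math


def mean(scores):
    return sum(scores) / len(scores)


def std_dev(scores):
    """Population standard deviation of a list of scores."""
    m = mean(scores)
    return math.sqrt(sum((s - m) ** 2 for s in scores) / len(scores))


def pearson(xs, ys):
    """Pearson product-moment correlation between two lists of scores."""
    mean_x, mean_y = mean(xs), mean(ys)
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sum_products / math.sqrt(ss_x * ss_y)


# Hypothetical scores for the same five test takers on two alternate forms.
form_a = [12, 15, 18, 20, 25]
form_b = [13, 14, 19, 21, 23]

# Step 1: compare the means and standard deviations of the two forms.
print(mean(form_a), mean(form_b))                          # 18.0 18.0
print(round(std_dev(form_a), 2), round(std_dev(form_b), 2))

# Step 2: correlate the two sets of scores.
r_forms = pearson(form_a, form_b)
print(round(r_forms, 2))                                   # 0.96
```

In a counterbalanced design, the scores would first be collected with half of the group taking Form A first and the other half taking Form B first, and then paired by test taker as above.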
Note:
o As a matter of procedure, we generally attempt to estimate the internal
consistency of a test first, since if a test is not reliable in this respect, it is not
likely to be equivalent to other forms or stable over time.
o This is because measurement error is random and therefore not correlated
with anything.
o Thus, the greater the proportion of measurement error in the scores from a
given test, the lower the correlation of those scores with other scores will be.
PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
1. Different sources of error may interact with each other, even when we carefully
design our reliability study. One problem with the CTS model, then, is that it treats
error variance as homogeneous in origin. Each of the estimates of reliability
addresses one specific source of error, and treats other potential sources either as part
of that source, or as true score.
 In the classical model, therefore, different sources of error may be confused, or
confounded with each other and with true score variance, since it is not possible to
examine more than one source of error at a time, even though performance on
any given test may be affected by several different sources of error simultaneously.
PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL
2. A second, related problem is that the CTS model considers all error to be random,
and consequently fails to distinguish systematic error from random error.
 Factors other than the ability being measured that regularly affect the performance
of some individuals and not others can be regarded as sources of systematic error
or test bias.
 Sources of systematic error:
 Test method
 Cultural content
 Psychological task set
 Guessing or what Carroll (1961b) called ‘topastic error’.

Validity & reliability seminarValidity & reliability seminar
Validity & reliability seminar
 
Rep
RepRep
Rep
 
Validity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their TypesValidity, Reliability ,Objective & Their Types
Validity, Reliability ,Objective & Their Types
 
EM&E.pptx
EM&E.pptxEM&E.pptx
EM&E.pptx
 
Meaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptxMeaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptx
 
Quantitative analysis
Quantitative analysisQuantitative analysis
Quantitative analysis
 
Louzel Report - Reliability & validity
Louzel Report - Reliability & validity Louzel Report - Reliability & validity
Louzel Report - Reliability & validity
 
Validity and reliability in assessment.
Validity and reliability in assessment. Validity and reliability in assessment.
Validity and reliability in assessment.
 
Reliability of test
Reliability of testReliability of test
Reliability of test
 
Qualities of a Good Test
Qualities of a Good TestQualities of a Good Test
Qualities of a Good Test
 
Characteristics of a good test
Characteristics of a good testCharacteristics of a good test
Characteristics of a good test
 
Testing in language programs (chapter 8)
Testing in language programs (chapter 8)Testing in language programs (chapter 8)
Testing in language programs (chapter 8)
 
Topic validity
Topic validityTopic validity
Topic validity
 
Reliability and validity of Research Data
Reliability and validity of Research DataReliability and validity of Research Data
Reliability and validity of Research Data
 
Norms[1]
Norms[1]Norms[1]
Norms[1]
 
Research methods 2 operationalization & measurement
Research methods 2   operationalization & measurementResearch methods 2   operationalization & measurement
Research methods 2 operationalization & measurement
 
VALIDITY
VALIDITYVALIDITY
VALIDITY
 

More from ahfameri

Exploring culture
Exploring cultureExploring culture
Exploring cultureahfameri
 
Thesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriThesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriahfameri
 
The role of corrective feedback in second language learning
The role of corrective feedback in second language learningThe role of corrective feedback in second language learning
The role of corrective feedback in second language learningahfameri
 
Test specifications and designs
Test specifications and designs  Test specifications and designs
Test specifications and designs ahfameri
 
Standards based classroom assessments of english proficiency
Standards based classroom  assessments of english proficiencyStandards based classroom  assessments of english proficiency
Standards based classroom assessments of english proficiencyahfameri
 
Reliability and dependability by neil jones
Reliability and dependability by neil jonesReliability and dependability by neil jones
Reliability and dependability by neil jonesahfameri
 
Language testing the social dimension
Language testing  the social dimensionLanguage testing  the social dimension
Language testing the social dimensionahfameri
 
Extroversion introversion
Extroversion introversionExtroversion introversion
Extroversion introversionahfameri
 
Developing a comprehensive empirically based research framework for classroom...
Developing a comprehensive empirically based research framework for classroom...Developing a comprehensive empirically based research framework for classroom...
Developing a comprehensive empirically based research framework for classroom...ahfameri
 
Cognitive approaches to learning piaget
Cognitive approaches to learning   piagetCognitive approaches to learning   piaget
Cognitive approaches to learning piagetahfameri
 
Classroom assessment glenn fulcher
Classroom assessment glenn fulcherClassroom assessment glenn fulcher
Classroom assessment glenn fulcherahfameri
 
Behavioral view of motivation
Behavioral view of motivationBehavioral view of motivation
Behavioral view of motivationahfameri
 

More from ahfameri (12)

Exploring culture
Exploring cultureExploring culture
Exploring culture
 
Thesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameriThesis summary by amir hamid forough ameri
Thesis summary by amir hamid forough ameri
 
The role of corrective feedback in second language learning
The role of corrective feedback in second language learningThe role of corrective feedback in second language learning
The role of corrective feedback in second language learning
 
Test specifications and designs
Test specifications and designs  Test specifications and designs
Test specifications and designs
 
Standards based classroom assessments of english proficiency
Standards based classroom  assessments of english proficiencyStandards based classroom  assessments of english proficiency
Standards based classroom assessments of english proficiency
 
Reliability and dependability by neil jones
Reliability and dependability by neil jonesReliability and dependability by neil jones
Reliability and dependability by neil jones
 
Language testing the social dimension
Language testing  the social dimensionLanguage testing  the social dimension
Language testing the social dimension
 
Extroversion introversion
Extroversion introversionExtroversion introversion
Extroversion introversion
 
Developing a comprehensive empirically based research framework for classroom...
Developing a comprehensive empirically based research framework for classroom...Developing a comprehensive empirically based research framework for classroom...
Developing a comprehensive empirically based research framework for classroom...
 
Cognitive approaches to learning piaget
Cognitive approaches to learning   piagetCognitive approaches to learning   piaget
Cognitive approaches to learning piaget
 
Classroom assessment glenn fulcher
Classroom assessment glenn fulcherClassroom assessment glenn fulcher
Classroom assessment glenn fulcher
 
Behavioral view of motivation
Behavioral view of motivationBehavioral view of motivation
Behavioral view of motivation
 

Recently uploaded

Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 

Recently uploaded (20)

Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 

Reliability bachman 1990 chapter 6

  • 1. RELIABILITY. Based on: Fundamental Considerations in Language Testing, Bachman (1990), Chapter 6. Prepared by: Amirhamid Foroughameri, ahfameri@gmail.com, November 2015
  • 2. INTRODUCTION
    - A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure.
    - We must be concerned about errors of measurement, or unreliability, because test performance is affected by factors other than the abilities we want to measure.
    - Unsystematic (unpredictable) factors: poor health, fatigue, lack of interest or motivation, and test-wiseness.
    - Systematic factors: test method facets.
    - Minimizing the effects of these factors means minimizing measurement error, which in turn maximizes reliability.
  • 3. INTRODUCTION
    - Reliability is a necessary condition for validity: in order for a test score to be valid, it must be reliable.
    - Reliability and validity are not two distinct concepts; they are complementary aspects of a common concern in measurement.
    - Reliability is concerned with the question ‘How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?’ and with minimizing the effects of these factors on test scores.
    - Validity is concerned with the question ‘How much of an individual’s test performance is due to the language abilities we want to measure?’ and with maximizing the effects of these abilities on test scores.
  • 4. INTRODUCTION
    - The investigation of reliability involves both logical analysis and empirical research.
    - We must identify sources of error and estimate the magnitude of their effects on test scores.
    - To identify sources of error, we need to distinguish the effects of the language abilities we want to measure from the effects of other factors. This is a complex problem for two reasons:
      1. The interaction between components of language ability and test method facets makes it difficult to mark a clear ‘boundary’ between the ability being measured and the method facets, for example the particular topic of a conversational interaction in an oral interview.
      2. Performance is also affected by other characteristics of test takers, such as sex, age, cognitive style, and native language.
    - Estimating the magnitude of the effects of different sources of error, once these sources have been identified, is a matter of empirical research, and is a major concern of measurement theory.
  • 5. FACTORS THAT AFFECT LANGUAGE TEST SCORES
    - Thorndike (1951) and Stanley (1971) begin their treatments of reliability with general frameworks for describing the factors that cause test scores to vary from individual to individual: general and specific lasting characteristics, general and specific temporary characteristics, and systematic and chance factors related to test administration and scoring.
  • 6. FACTORS THAT AFFECT LANGUAGE TEST SCORES
    - [Path diagram: factors that affect language test scores (Bachman 1990). The Test Score is affected by Communicative language ability, Test method facets, Personal attributes, and Random factors; the sources are labelled Systematic or Unsystematic.]
    - Note: in a path diagram, rectangles represent observed variables, ovals represent unobserved variables, and straight arrows represent causal relationships.
  • 7. FACTORS THAT AFFECT LANGUAGE TEST SCORES
    - Test method facets are systematic to the extent that they are uniform from one test administration to the next. For example, if the input format facet is multiple-choice, this will not vary whether the test is given in the morning or in the afternoon.
    - Attributes of individuals include individual characteristics, such as cognitive style and knowledge of particular content areas, and group characteristics, such as sex, race, and ethnic background. These are also systematic in the sense that they are likely to affect a given individual’s test performance regularly.
    - An individual’s test score will also be affected to some degree by unsystematic, or random, factors. These include unpredictable and largely temporary conditions, such as mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next or idiosyncratic differences in the way different test administrators carry out their responsibilities.
  • 8. FACTORS THAT AFFECT LANGUAGE TEST SCORES
    - Random factors and test method facets are generally considered to be sources of measurement error, and have thus been the primary concern of approaches to estimating reliability.
    - Personal attributes that are not considered part of the ability tested, such as sex, ethnic background, cognitive style, and prior knowledge of content area, have on the other hand traditionally been discussed as sources of test bias, or test invalidity.
  • 9. CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY
    - This theory consists of a set of assumptions about the relationships between actual, or observed, test scores and the factors that affect these scores.
    - Assumption 1: an observed score on a test comprises two components: a true score, which is due to an individual’s level of ability, and an error score, which is due to factors other than the ability being tested: $x = x_t + x_e$, where $x$ is the observed score, $x_t$ is the true score, and $x_e$ is the error score.
    - Likewise, the variance of a set of test scores consists of two components: $s_x^2 = s_t^2 + s_e^2$, where $s_x^2$ is the observed score variance, $s_t^2$ is the true score variance component, and $s_e^2$ is the error score variance component.
  • 10. CLASSICAL TRUE SCORE (CTS) MEASUREMENT THEORY
    - Assumption 2 concerns the relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores.
    - The CTS model defines measurement error as that variation in a set of test scores that is unsystematic or random.
    - In CTS there are therefore two sources of variance: true score variance, due to differences in ability, and measurement error, which is unsystematic.
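The two assumptions above can be checked on a toy data set. The following is a minimal sketch with hypothetical scores (the numbers are invented for illustration; in practice true and error scores are never observed directly). The error scores are chosen so that they are exactly uncorrelated with the true scores, which is the condition under which the observed variance splits into the two components.

```python
def pvar(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def pcov(a, b):
    """Population covariance of two equal-length score lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / len(a)

# Hypothetical values for four test takers (assumption: illustrative only).
true_scores = [10, 10, 20, 20]    # x_t, due to ability
error_scores = [-1, 1, -1, 1]     # x_e, uncorrelated with x_t by construction
observed = [t + e for t, e in zip(true_scores, error_scores)]  # x = x_t + x_e

s2_t = pvar(true_scores)
s2_e = pvar(error_scores)
s2_x = pvar(observed)

print("cov(x_t, x_e) =", pcov(true_scores, error_scores))
print("s2_x =", s2_x, " s2_t + s2_e =", s2_t + s2_e)
```

If the error scores were correlated with the true scores, a covariance term would appear and the simple two-component decomposition would no longer hold.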
  • 11. PARALLEL TESTS
    - In order for two tests to be considered parallel, we assume that they are measures of the same ability, that is, that an individual’s true score on one test will be the same as his true score on the other.
    - Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal.
    - Operational definition: parallel tests are two tests of the same ability that have the same means and variances and are equally correlated with other tests of that ability.
    - In practice we virtually never have strictly parallel tests. Instead, we treat two tests as if they were parallel if the differences between their means and variances are not statistically significant. Such tests are referred to as equivalent tests or alternate forms.
  • 12. RELIABILITY AS THE CORRELATION BETWEEN PARALLEL TESTS
    - Since we never know what the true or error scores are, we cannot know the reliability of the observed scores. To be able to estimate the reliability of observed scores, we must define reliability operationally in a way that depends only on observed scores.
    - If the observed scores on two parallel tests are highly correlated, this indicates that the effects of the error scores are minimal, and that the observed scores can be considered reliable indicators of the ability being measured.
    - The definition of reliability (the basis for all estimates of reliability within CTS theory): the correlation between the observed scores on two parallel tests, which we can symbolize as $r_{xx'}$.
    - Assumption: the observed scores on the two tests are experimentally independent. That is, an individual's performance on the second test should not depend on how she performs on the first.
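  • 13. CORRELATIONS BETWEEN TRUE AND OBSERVED SCORES ON PARALLEL TESTS
    - [Path diagram: Ability determines the true scores $x_t$ and $x'_t$ on two parallel tests; each observed score ($x$, $x'$) is made up of its true score and an error score ($x_e$, $x'_e$). The correlation between the observed scores $x$ and $x'$ is $r_{xx'}$.]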
  • 14. RELIABILITY AND MEASUREMENT ERROR AS PROPORTIONS OF OBSERVED SCORE VARIANCE
    - If an individual's observed score on a test is composed of a true score and an error score, then the greater the proportion of true score, the smaller the proportion of error score, and thus the more reliable the observed score.
    - Thus, one way of defining reliability is as the proportion of the observed score variance that is true score variance: $r_{xx'} = s_t^2 / s_x^2$.
    - Note: reliability refers to the test scores, and not to the test itself.
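The two definitions of reliability given so far (the correlation between parallel tests, and the proportion of observed variance that is true score variance) can be seen to agree on a small constructed example. This is only a sketch: the scores are hypothetical, and the two error patterns are deliberately built to be uncorrelated with the true scores and with each other, with equal variances, so that the two forms are strictly parallel.

```python
from math import sqrt

def pvar(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical true scores for eight test takers (assumption: illustrative only).
true_scores = [10, 10, 10, 10, 20, 20, 20, 20]
# Error scores on each form: equal variance, uncorrelated with the true
# scores and with each other by construction.
errors_a = [-1, 1, -1, 1, -1, 1, -1, 1]
errors_b = [-1, -1, 1, 1, -1, -1, 1, 1]

form_a = [t + e for t, e in zip(true_scores, errors_a)]  # observed x
form_b = [t + e for t, e in zip(true_scores, errors_b)]  # observed x'

r_parallel = pearson(form_a, form_b)               # r_xx' from observed scores
variance_ratio = pvar(true_scores) / pvar(form_a)  # s_t^2 / s_x^2

print(round(r_parallel, 4), round(variance_ratio, 4))
```

Only `r_parallel` could be computed with real data, since it uses observed scores alone; `variance_ratio` needs the true scores and is shown here purely to illustrate why the correlation serves as the operational definition.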
  • 15. APPROACHES TO ESTIMATING RELIABILITY WITHIN THE CTS
    - Internal consistency: concerned primarily with sources of error from within the test and scoring procedures.
    - Stability: how consistent test scores are over time.
    - Equivalence: an indication of the extent to which scores on alternate forms of a test are equivalent.
    - The estimates of reliability that these approaches yield are called reliability coefficients.
  • 16. INTERNAL CONSISTENCY
    - Internal consistency is concerned with how consistent test takers’ performances on the different parts of the test are with each other.
    - Inconsistencies in performance on different parts of tests can be caused by a number of factors, including the test method facets.
    - SPLIT-HALF RELIABILITY ESTIMATES
      - One approach to examining the internal consistency of a test is the split-half method, in which we divide the test into two halves and then determine the extent to which scores on these two halves are consistent with each other.
      - In so doing, we are treating the halves as parallel tests, and so we must make certain assumptions about the equivalence of the two halves, specifically that they have equal means and variances. In addition, we must assume that the two halves are independent of each other.
  • 17. INTERNAL CONSISTENCY
    - In some cases, where we are not sure that the items are measuring the same ability or that they are independent of each other, the test-retest and parallel forms methods are more appropriate for estimating reliability.
    - The Spearman-Brown split-half estimate
      - Once the test has been split into halves, it is rescored, yielding two scores (one for each half) for each test taker.
      - We then compute the correlation between the two sets of scores. This gives us an estimate of how consistent the halves are with each other; what we are interested in, however, is the reliability of the whole test, so the half-test correlation must be corrected for length.
      - In general, a long test will be more reliable than a short one, assuming that the additional items correlate positively with the other items in the test.
  • 18. INTERNAL CONSISTENCY
    - Two assumptions must be met in order to use this method:
      1. Since we are in effect treating the two halves as parallel tests, we must assume that they have equal means and variances (an assumption we can check).
      2. We must assume that the two halves are experimentally independent of each other, that is, that an individual's performance on one half does not affect how he performs on the other (an assumption that is very difficult to check).
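The Spearman-Brown procedure described above can be sketched as follows. The correction formula used here, $r_{xx'} = 2r_{hh'} / (1 + r_{hh'})$, is the standard Spearman-Brown correction for a test of doubled length; it is not printed in the transcript. The odd-even split and the item responses are assumptions made for illustration, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_odd_even(item_matrix):
    """Rescore each test taker on the odd-numbered and even-numbered items."""
    odd = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, ...
    even = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, ...
    return odd, even

def spearman_brown(r_half):
    """Step the half-test correlation up to an estimate for the whole test."""
    return 2 * r_half / (1 + r_half)

# Hypothetical responses: rows are test takers, columns are items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

odd_half, even_half = split_odd_even(responses)
r_halves = pearson(odd_half, even_half)
r_whole = spearman_brown(r_halves)
print(round(r_halves, 3), round(r_whole, 3))
```

The corrected estimate is always larger than a positive half-test correlation, which reflects the point that a longer test is generally more reliable than a shorter one.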
  • 19. INTERNAL CONSISTENCY
    - The Guttman split-half estimate
      - Another approach to estimating reliability from split halves is that developed by Guttman (1945), which does not assume equivalence of the halves and does not require computing a correlation between them. This split-half reliability coefficient is based on the ratio of the sum of the variances of the two halves to the variance of the whole test: $r_{xx'} = 2\left(1 - \frac{s_{h1}^2 + s_{h2}^2}{s_x^2}\right)$, where $s_{h1}^2$ and $s_{h2}^2$ are the variances of the two halves and $s_x^2$ is the variance of the whole test.
      - Since this formula is based on the variance of the total test, it provides a direct estimate of the reliability of the whole test.
      - Therefore, unlike the Spearman-Brown estimate, the Guttman split-half estimate does not require an additional correction for length.
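A minimal sketch of the Guttman split-half estimate, using the standard form of the coefficient, two times (1 minus the sum of the half variances divided by the total variance). The half scores below are hypothetical values chosen for illustration, and `guttman_split_half` is an illustrative name, not a library function.

```python
def pvar(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def guttman_split_half(half_1, half_2):
    """Guttman split-half reliability from the two sets of half scores."""
    totals = [a + b for a, b in zip(half_1, half_2)]  # whole-test scores
    return 2 * (1 - (pvar(half_1) + pvar(half_2)) / pvar(totals))

# Hypothetical half scores for five test takers (assumption: illustrative only).
half_1 = [2, 2, 2, 0, 0]
half_2 = [2, 1, 0, 1, 0]

r_guttman = guttman_split_half(half_1, half_2)
print(round(r_guttman, 2))
```

No correlation between the halves is computed and no length correction is applied: the whole-test variance in the denominator already reflects the full test. Any consistent variance definition (population or sample) gives the same result, because only a ratio of variances is used.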
  • 20. INTERNAL CONSISTENCY
    - Reliability estimates based on item variances
    - 1. Kuder-Richardson reliability coefficients
      - There is a way of estimating the average of all the possible split-half coefficients on the basis of the statistical characteristics of the test items.
      - This approach, developed by Kuder and Richardson (1937), involves computing the means and variances of the items that constitute the test.
      - The mean of a dichotomous item (one that is scored as either right or wrong) is the proportion, symbolized as $p$, of individuals who answer the item correctly. The proportion of individuals who answer the item incorrectly is equal to $1 - p$, and is symbolized as $q$.
      - The variance of a dichotomous item is the product of these two proportions, or $pq$.
  • 21. INTERNAL CONSISTENCY
    - The reliability coefficient provided by Kuder-Richardson formula 20 (KR-20), based on the ratio of the sum of the item variances to the total test score variance, is as follows: $r_{xx'} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)$, where $k$ is the number of items on the test, $\sum pq$ is the sum of the item variances, and $s_x^2$ is the total test score variance.
    - Assumption: the items are of nearly equal difficulty and independent of each other.
    - If the items are of equal difficulty, the reliability coefficient can be computed by using Kuder-Richardson formula 21 (KR-21): $r_{xx'} = \frac{k s_x^2 - \bar{x}(k - \bar{x})}{(k-1)\, s_x^2}$, where $\bar{x}$ is the mean total score.
    - This formula will generally yield a reliability coefficient that is lower than that given by KR-20. The Kuder-Richardson formulae are based on total score variance, and thus they do not require any correction for length.
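The two Kuder-Richardson coefficients can be sketched as below, using their standard textbook forms. The response matrix is hypothetical, and population variances are used throughout so that the total score variance is on the same footing as the item variances $pq$ (an assumption of this sketch; some treatments use the sample variance instead). `kr20` and `kr21` are illustrative names.

```python
def pvar(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def kr20(item_matrix):
    """KR-20 from dichotomous (0/1) item responses; rows are test takers."""
    n = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n  # proportion correct
        sum_pq += p * (1 - p)                        # item variance pq
    return (k / (k - 1)) * (1 - sum_pq / pvar(totals))

def kr21(item_matrix):
    """KR-21, which needs only the number of items, the mean and the variance."""
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean = sum(totals) / len(totals)
    s2_x = pvar(totals)
    return (k * s2_x - mean * (k - mean)) / ((k - 1) * s2_x)

# Hypothetical responses: rows are test takers, columns are items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(responses), 3), round(kr21(responses), 3))
```

In this example the fourth item is harder than the others, so the equal-difficulty assumption behind KR-21 does not hold and KR-21 comes out lower than KR-20.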
  • 22. INTERNAL CONSISTENCY
    - 2. Coefficient alpha
      - Both the Guttman split-half estimate and the Kuder-Richardson formulae estimate reliability on the basis of ratios of the variances of test components (halves and items) to total test score variance.
      - Cronbach (1951) developed a general formula for estimating internal consistency which he called ‘coefficient alpha’, and which is often referred to as ‘Cronbach’s alpha’: $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$, where $k$ is the number of parts (items) on the test, $\sum s_i^2$ is the sum of the variances of the different parts, and $s_x^2$ is the variance of the total test scores.
      - It can be shown that all estimates of reliability based on the analysis of variance components can be derived from this formula and are special cases of coefficient alpha.
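A short sketch of coefficient alpha in its standard form, k/(k-1) times (1 minus the sum of the part variances divided by the total score variance). The scores are hypothetical, and `coefficient_alpha` is an illustrative name. The second data set is dichotomous, to show the special-case relationship mentioned above: with 0/1 items each part variance equals $pq$, so alpha reduces to KR-20.

```python
def pvar(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def coefficient_alpha(part_matrix):
    """Cronbach's alpha; rows are test takers, columns are parts or items."""
    k = len(part_matrix[0])
    totals = [sum(row) for row in part_matrix]
    part_variances = [pvar([row[j] for row in part_matrix]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(part_variances) / pvar(totals))

# Hypothetical scores on three test parts, each scored on a scale.
part_scores = [
    [2, 3, 4],
    [3, 3, 5],
    [4, 5, 5],
    [5, 5, 6],
]

# Hypothetical dichotomous responses (1 = correct).
item_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

alpha_parts = coefficient_alpha(part_scores)
alpha_items = coefficient_alpha(item_responses)
print(round(alpha_parts, 3), round(alpha_items, 3))
```

Because the function works on any numeric part scores, the same computation applies to halves, items, or rated sections of a test.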
  • 23. INTERNAL CONSISTENCY
    - In summary, the reliability of a set of test scores can be estimated on the basis of a single test administration only if certain assumptions about the characteristics of the parts of the test are satisfied.
    - [Table: Assumptions for internal consistency reliability estimates, and effects of violating assumptions]
  • 24. RATER CONSISTENCY
    - In test scores that are obtained subjectively, such as ratings of compositions or oral interviews, a source of error is inconsistency in these ratings.
    - In the case of a single rater, we need to be concerned about the consistency within that individual’s ratings, or intra-rater reliability.
    - When there are several different raters, we want to examine the consistency across raters, or inter-rater reliability.
    - In both cases, the primary causes of inconsistency will be either the application of different rating criteria to different samples or the inconsistent application of the rating criteria to different samples.
  • 25. RATER CONSISTENCY
    - Intra-rater reliability
    - Factors introducing inconsistency:
      - The sequence in which the rater pays attention to different kinds of errors
      - The sequence of scoring, from the first to the last person
    - We need to obtain at least two independent ratings from a rater for each individual language sample.
    - This is typically accomplished by rating the individual samples once and then re-rating them at a later time in a different, random order.
    - Once the two sets of ratings have been obtained, the reliability between them can be estimated in two ways.
  • 26. RATER CONSISTENCY  One way is to treat the two sets of ratings as scores from parallel tests and compute the appropriate correlation coefficient (commonly the Spearman rank- order coefficient) between the two sets of ratings, interpreting this as an estimate of reliability.  Another approach to examining the consistency of multiple ratings is to compute a coefficient alpha, treating the independent ratings as different parts:
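The two estimates described on this slide can be sketched as follows for one rater who has rated the same six samples twice. The ratings are invented for illustration, the helper names are hypothetical, and tied ratings are given average ranks, which is a common convention but an assumption here.

```python
from statistics import mean, variance


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank-order coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def alpha_two_ratings(r1, r2):
    """Coefficient alpha with the two sets of ratings treated as two parts (k = 2)."""
    summed = [a + b for a, b in zip(r1, r2)]
    return 2 * (1 - (variance(r1) + variance(r2)) / variance(summed))


# Hypothetical ratings of six compositions by the same rater on two occasions.
first_rating = [4, 3, 5, 2, 4, 1]
second_rating = [4, 2, 5, 3, 5, 1]

rho = spearman(first_rating, second_rating)
alpha = alpha_two_ratings(first_rating, second_rating)
print(f"Spearman rho = {rho:.3f}, coefficient alpha = {alpha:.3f}")
```

The same `alpha_two_ratings` idea extends to several raters by summing all raters' ratings and using the general formula with k equal to the number of raters.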
  • 27. RATER CONSISTENCY  Inter-rater reliability Factors introducing inconsistency:  Different criteria for rating  Different interpretations of the same rating criteria  The reliability can be estimated in two ways:  We can compute the correlation between two different raters and interpret this as an estimate of reliability.  When more than two raters are involved, however, rather than computing correlations for all different pairs, a preferable approach is that recommended by Ebel (1979), in which we sum the ratings of the different raters and then estimate the reliability of these summed ratings by computing a coefficient alpha.
  • 28. STABILITY (TEST-RETEST RELIABILITY)  This approach can be used in three cases:  For tests such as cloze and dictation we cannot appropriately estimate the internal consistency of the scores because of the interdependence of the parts of the test.  There are also testing situations in which it may be necessary to administer a test more than once as part of a time-series design.  This might also be the concern of a language program evaluator who is interested in relating changes in language ability to teaching and learning activities in the program.  This approach to reliability provides an estimate of the stability of the test scores over time.
  • 29. STABILITY (TEST-RETEST RELIABILITY)  In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. This correlation can then be interpreted as an indication of how stable the scores are over time.  Two sources of inconsistency - differential practice effects and differential changes in ability - pose a dilemma for the test-retest approach.  That is, we must assume that both practice and learning (or unlearning) effects are either uniform across individuals or random.  Practice effects may occur if certain individuals remember some of the items or feel more comfortable with the test method, and consequently perform better on the second administration of the test.  If, on the other hand, there is a considerable time lapse between test administrations, some individuals’ language ability may actually improve or decline more than that of others, causing them to perform differently the second time.
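A small sketch may help show why the uniformity assumption matters. It correlates a set of first-administration scores with two hypothetical retests: one where every test taker gains the same amount (a uniform practice effect) and one where the gains differ across individuals. All scores are invented for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two sets of scores from the same people."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores on the first administration.
first = [52, 61, 70, 45, 66, 58]

# Uniform practice effect: everyone gains 3 points on the retest.
uniform_retest = [score + 3 for score in first]

# Differential effects: gains vary from person to person.
gains = [8, 0, 1, 9, 0, 5]
differential_retest = [score + gain for score, gain in zip(first, gains)]

r_uniform = pearson(first, uniform_retest)
r_differential = pearson(first, differential_retest)
print(f"uniform gains:      r = {r_uniform:.3f}")
print(f"differential gains: r = {r_differential:.3f}")
```

A constant gain leaves the rank order and spacing of scores unchanged, so the correlation is unaffected; unequal gains lower it even though no random error was added, which is the dilemma the slide describes.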
  • 30. STABILITY (TEST-RETEST RELIABILITY) Possible Solution: There is no single length of time between test administrations that is best for all situations. In each situation, the test developer or user must attempt to determine the extent to which practice and learning are likely to influence test performance, and choose the length of time between test and retest so as to optimize reduction in the effects of both.
  • 31. EQUIVALENCE (PARALLEL FORMS RELIABILITY)  Like the test-retest approach, this is an appropriate means of estimating the reliability of tests for which internal consistency estimates are either inappropriate or not possible.  It is of particular interest in testing situations where alternate forms of the test may be actually used, either for security reasons, or to minimize the practice effect.  Assumption: that the different forms of the test are equivalent, particularly that they are at the same difficulty level and have similar standard deviations.
  • 32. EQUIVALENCE (PARALLEL FORMS RELIABILITY)  To estimate the reliability of alternate forms of a given test, the procedure used is to administer both forms to a group of individuals.  One way to minimize the possibility of an ordering effect is to use a ‘counterbalanced’ design, in which half of the individuals take one form first and the other half take the other form first.  The means and standard deviations for each of the two forms can then be computed and compared to determine their equivalence, after which the correlation between the two sets of scores (Form A and Form B) can be computed. This correlation is then interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one.
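The procedure on this slide could be sketched as below: assign test takers to the two administration orders, then compare the forms' means and standard deviations and correlate the scores. The function names, the fixed random seed, and the scores are all assumptions made for the example.

```python
import random
from statistics import mean, stdev


def counterbalance(ids, seed=0):
    """Randomly split test takers into two order groups of (near) equal size."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A_then_B": shuffled[:half], "B_then_A": shuffled[half:]}


def pearson(x, y):
    """Pearson correlation between two sets of scores from the same people."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / (sxx * syy) ** 0.5


def compare_forms(form_a, form_b):
    """Descriptive comparison of two forms plus their correlation."""
    return {
        "mean_a": mean(form_a),
        "mean_b": mean(form_b),
        "sd_a": stdev(form_a),
        "sd_b": stdev(form_b),
        "r_ab": pearson(form_a, form_b),
    }


# Hypothetical data: eight test takers, each with a score on both forms.
test_takers = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
groups = counterbalance(test_takers)

form_a = [24, 31, 18, 27, 35, 22, 29, 20]
form_b = [25, 30, 19, 26, 34, 23, 28, 21]
summary = compare_forms(form_a, form_b)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Here the means and standard deviations are only printed for inspection; how close they must be for the forms to count as equivalent is a judgment the sketch does not make.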
  • 33. Note: o As a matter of procedure, we generally attempt to estimate the internal consistency of a test first, since if a test is not reliable in this respect, it is not likely to be equivalent to other forms or stable over time. o This is because measurement error is random and therefore not correlated with anything. o Thus, the greater the proportion of measurement error in the scores from a given test, the lower the correlation of those scores with other scores will be.
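The reasoning in this note can be made explicit with the standard attenuation relation from classical test theory. The notation below is an assumption for illustration ($x_t$, $y_t$ for true scores, $x_e$, $y_e$ for error scores, $r_{xx'}$ and $r_{yy'}$ for the reliabilities of the two measures); it is not taken from the slides.

```latex
% Observed scores are the sum of true and error components:
%   x = x_t + x_e,   y = y_t + y_e
% If the error scores are uncorrelated with the true scores and with each other,
% only the true-score components contribute to the covariance:
\[
  \operatorname{cov}(x, y) = \operatorname{cov}(x_t, y_t)
\]
% Dividing by the observed standard deviations and using
% r_{xx'} = s_{x_t}^2 / s_x^2 and r_{yy'} = s_{y_t}^2 / s_y^2 gives
\[
  r_{xy}
  = \frac{\operatorname{cov}(x_t, y_t)}{s_x \, s_y}
  = r_{x_t y_t} \cdot \frac{s_{x_t}}{s_x} \cdot \frac{s_{y_t}}{s_y}
  = r_{x_t y_t} \sqrt{r_{xx'} \, r_{yy'}}
\]
```

Since both reliabilities are at most 1, a larger proportion of error variance in either set of scores lowers the observed correlation, which is why low internal consistency limits how equivalent or stable the scores can appear.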
  • 34. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL 1. Different sources of error may interact with each other, even when we carefully design our reliability study. One problem with the CTS model, then, is that it treats error variance as homogeneous in origin. Each of the estimates of reliability addresses one specific source of error, and treats other potential sources either as part of that source, or as true score.  In the classical model, therefore, different sources of error may be confused, or confounded with each other and with true score variance, since it is not possible to examine more than one source of error at a time, even though performance on any given test may be affected by several different sources of error simultaneously.
  • 35. PROBLEMS WITH THE CLASSICAL TRUE SCORE MODEL 2. A second, related problem is that the CTS model considers all error to be random, and consequently fails to distinguish systematic error from random error.  Factors other than the ability being measured that regularly affect the performance of some individuals and not others can be regarded as sources of systematic error or test bias.  Sources of systematic error:  Test method  Cultural content  Psychological task set  Guessing or what Carroll (1961b) called ‘topastic error’.