Reliability in Language Testing
Introduction
 The aim in this chapter is to identify potential sources of error in a given measure of
communicative language ability and to minimize the effect
of these factors on that measure.
 We speak of errors of measurement (unreliability) because we know
that test performance is affected by factors other than the
abilities we want to measure.
 When we minimize the effects of these various factors, we
minimize measurement error and maximize reliability.
 The guiding question is: ‘How much of an individual’s test performance is due to
measurement error, or to factors other than the language
ability we want to measure?’
Introduction
 This chapter discusses:
 Measurement error in test scores,
 The potential sources of this error,
 The different approaches to estimating the relative
effects of these sources of error on test scores, and
 The considerations to be made in determining which of
these approaches may be appropriate for a given testing
situation.
Factors that affect language test
scores
 The examination of reliability depends upon
distinguishing the effects of the abilities we want to
measure from the effects of other factors.
 If we wish to estimate how reliable our test scores
are:
 we must begin with a set of definitions of the
abilities we want to measure, and of the other
factors that we expect to affect test scores.
Factors that affect language test scores
1. Communicative language ability: specific abilities that determine
how an individual performs on a given test. Example: in a test of
sensitivity to register, the students who perform best and receive
the highest scores would be those with the highest level of
sociolinguistic competence.
2. Test method facets: testing environment, the test rubric, the nature
of input and expected response, the relationship between input and
response
3. Personal attributes: individual characteristics (cognitive style and
knowledge of particular content areas) - group characteristics (sex,
race, and ethnic background)
4. Random factors: unpredictable and largely temporary conditions
(mental alertness or emotional state) - uncontrolled differences in test
method facets (changes in the test environment from one day to the
next or differences in the way different test administrators carry out
their responsibilities)
Factors that affect
language test scores
1. The primary interest in using language tests is to make
inferences about one or more components of an
individual’s communicative language ability.
2. Random factors and test method facets are generally
considered to be sources of measurement error
(unreliability)
3. Personal attributes (i.e. sex, ethnic background,
cognitive style and prior knowledge of content area) are
considered sources of test bias, or test invalidity, and are
therefore discussed in Chapter 7 (validity)
Theories and models of
reliability
 Any factors other than the ability being tested that affect
test scores are potential sources of error that decrease
the reliability of scores.
 It is therefore essential to identify these sources of error and estimate the
magnitude of their effect on test scores.
 The theories and models below differ in how they define the various
influences on test scores.
1. Classical True Score
Measurement Theory
 Classical true score (CTS) measurement theory consists
of a set of assumptions about the relationships between
true or observed test scores and the factors that affect
these scores.
 Reliability is defined in the CTS theory in terms of true
score variance.
 True score: the part of a score due to an individual’s level of ability / Error
score: the part due to factors other than the ability being tested
observed score (actual test score) = true score + error score
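This decomposition can be illustrated with a small numeric sketch (hypothetical scores; in practice true and error scores are unobservable, which is why reliability must be estimated rather than computed this way):

```python
import numpy as np

# Hypothetical true scores and error scores (unobservable in practice).
true = np.array([80.0, 90.0, 100.0, 110.0, 120.0])
error = np.array([1.0, -2.0, 0.0, 2.0, -1.0])  # mean zero, uncorrelated with true here

observed = true + error  # observed score = true score + error score

# Reliability: the proportion of observed-score variance that is
# true-score variance.
reliability = true.var() / observed.var()
print(round(reliability, 4))  # → 0.9901
```

Because the errors are small relative to the differences in ability, almost all of the observed-score variance reflects true-score variance.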
1. Classical True Score
Measurement Theory
 Since we can never know the true scores of individuals, we
can never know what the reliability is, but can estimate it
from the observed scores.
 The basis for all such estimates in the CTS model is the
correlation between parallel tests.
Parallel tests: In order for two tests to be considered parallel,
they must be measures of the same ability
(equivalent, or alternate, forms).
If the observed scores on two parallel tests are highly
correlated, these tests can be considered reliable
indicators of the ability being measured.
1. Classical True Score
Measurement Theory
Within the CTS model there are 3 approaches to estimating
reliability, each of which addresses different sources of error:
a. Internal consistency estimates are concerned with sources of
error such as differences in test-tasks and item formats,
inconsistencies within and among scorers.
b. Stability estimates indicate how consistent test scores are
over time
c. Equivalence estimates provide an indication of the extent to
which scores on alternate forms of a test are equivalent.
The estimates of reliability that these approaches yield are
called reliability coefficients.
1. Classical True Score
Measurement Theory
Internal consistency is concerned with how consistent test
takers’ performances on different parts of the test are with
each other
Two approaches to estimating internal consistency
-an estimate based on correlation between two halves (the
Spearman-Brown split-half estimate)
-estimates which are based on ratios of the variances of
parts of the test – halves or items – to total test score
variance (the Guttman split-half, the Kuder-Richardson
formulae, and coefficient alpha)
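Coefficient alpha, for instance, can be computed directly from item scores via α = k/(k−1) · (1 − Σ item variances / total-score variance). A minimal sketch with hypothetical item data:

```python
import numpy as np

# Hypothetical item scores: rows are test takers, columns are items.
scores = np.array([
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
], dtype=float)

k = scores.shape[1]                   # number of items
item_vars = scores.var(axis=0)        # variance of each item
total_var = scores.sum(axis=1).var()  # variance of total scores

# Coefficient alpha: internal consistency of the k items.
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 4))  # → 0.9583
```

The high value reflects the fact that the three items rank the four test takers in nearly the same order.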
1. Classical True Score
Measurement Theory
Rater consistency: In test scores that are obtained
subjectively (ratings of compositions or oral interviews) a
source of error is inconsistency in these ratings.
Intra-rater reliability: In order to examine the reliability of
ratings of a single rater, at least two independent ratings
from this rater are obtained. This is accomplished by rating
the individual samples once and then re-rating them at a later
time in a different, random order.
Inter-rater reliability: the consistency of ratings across two
different raters. In examining inter-rater consistency, one rating
from each rater is obtained for every sample, and the two sets of
ratings are correlated.
1. Classical True Score
Measurement Theory
 Stability (test-retest reliability): In this approach, we
administer the test twice to a group of individuals and then
compute the correlation between the two sets of scores.
This correlation can then be interpreted as an indication of
how stable the scores are over time.
 Equivalence (parallel forms reliability): In this approach,
we try to estimate the reliability of alternate forms of a
given test by administering both forms to a group of
individuals. The correlation between the two sets of
scores can then be computed.
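Both the stability and equivalence estimates reduce to a correlation between two sets of scores. A sketch with hypothetical scores for the same group on two administrations (or two forms):

```python
import numpy as np

# Hypothetical scores for five test takers on two occasions (or two forms).
form_a = np.array([55, 62, 70, 48, 66], dtype=float)
form_b = np.array([57, 60, 73, 50, 64], dtype=float)

# Pearson correlation, interpreted as a test-retest or
# parallel-forms reliability estimate.
r = np.corrcoef(form_a, form_b)[0, 1]
print(round(r, 3))  # high correlation → consistent scores
```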
2. Generalizability theory
(G-theory)
 Generalizability theory is an extension of the classical model.
 It enables test developers to examine several sources of
variance simultaneously, and to distinguish systematic
from random error.
 First, the test developer designs and conducts a study to
investigate the sources of variance (a generalizability study, or G-study).
 Depending on the outcome of this G-study, the test developer
may revise the test or the procedures for administering it, or if
the results are satisfactory, the test developer proceeds to the
second stage, a decision study (D-study).
2. Generalizability theory
(G-theory)
 In a D-study, the test developer administers the test under
operational conditions, in which the test will be used to make
the decisions for which it is designed. Then, the test developer
uses G-theory procedures to estimate the magnitude of the
variance components.
 Terms related to generalizability theory
 Universe of generalization: the domain of uses or abilities (or
both) to which we want test scores to generalize.
 Universe of measures: the types of test scores we would be
willing to accept as indicators of the ability to be measured.
2. Generalizability theory
(G-theory)
 Terms related to generalizability theory
 Populations of persons: the group about whom we are going
to make decisions or inferences
 Universe score: the mean of a person’s scores on all measures
from the universe of possible measures (similar to CTS-theory
true score)
 This conceptualization of generalizability reveals that a given
estimate of generalizability is limited to the specific universe
of measures and population of persons within which it is
defined, and that a test score that is ‘True’ for all persons,
times, and places simply does not exist.
2. Generalizability theory
(G-theory)
Generalizability Coefficients: The G-theory analog of the
CTS-theory reliability coefficient is the generalizability
coefficient:

generalizability coefficient = universe score variance / observed score variance

Estimation: In order to estimate the relative effect of
different sources of variance on the observed scores, it is
necessary to obtain multiple measures for each person
under the different conditions for each facet
2. Generalizability theory
(G-theory)
Estimation: One statistical procedure that can be used for
estimating the relative effects of different sources of variance on
test scores is analysis of variance (ANOVA)
Example: An oral interview: with different question forms, or sets
of questions, and different interviewer/raters
Using ANOVA, we could obtain estimates for all the variance
components in the design: (1) the main effects for persons,
raters, and forms; (2) the two-way interactions between persons
and raters, persons and forms, and forms and raters, and (3) a
component that contains the three-way interaction among
persons, raters, and forms, as well as for the random variance
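A minimal one-facet version of such a G-study (persons crossed with raters, hypothetical ratings) can be sketched as follows; with a single observation per cell, the person-by-rater interaction is confounded with random error:

```python
import numpy as np

# Hypothetical ratings: rows are persons, columns are raters (fully crossed).
x = np.array([
    [5, 6],
    [3, 5],
    [8, 9],
    [4, 4],
], dtype=float)
n_p, n_r = x.shape

grand = x.mean()
p_means = x.mean(axis=1)  # person means
r_means = x.mean(axis=0)  # rater means

# Mean squares from a two-way ANOVA without replication.
ms_p = n_r * ((p_means - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((r_means - grand) ** 2).sum() / (n_r - 1)
resid = x - p_means[:, None] - r_means[None, :] + grand
ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Estimated variance components.
var_p = (ms_p - ms_pr) / n_r  # persons (universe score variance)
var_r = (ms_r - ms_pr) / n_p  # raters
var_pr = ms_pr                # interaction + random error

# Generalizability coefficient for the mean of n_r ratings.
g = var_p / (var_p + var_pr / n_r)
print(round(g, 3))  # → 0.963
```

Most of the score variance here comes from differences among persons, so the mean of two ratings generalizes well; a larger rater component would lower the coefficient.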
3. Standard Error of Measurement
(SEM)
 The approaches to estimating reliability that have been developed
within both CTS theory and G-theory are based on group
performance, and provide information for test developers and test
users about how consistent the scores of groups are on a given test.
 Reliability and generalizability coefficients provide no direct
information about the accuracy of individual test scores.
 There is thus a need for an indicator of how much we would expect an
individual’s test score to vary.
 The most useful indicator for this purpose is called the standard
error of measurement.
The smaller the standard deviation of errors (the standard error of
measurement, SEM), the more reliable the test.
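The SEM can be estimated from the standard deviation of observed scores and a reliability estimate, SEM = s·√(1 − r). A sketch with hypothetical values:

```python
import math

sd = 10.0           # standard deviation of observed scores (hypothetical)
reliability = 0.91  # reliability estimate (hypothetical)

# Standard error of measurement: the expected spread of an individual's
# observed scores around their true score.
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))  # → 3.0
```

An SEM of about 3 points means an individual's observed score would typically fall within roughly ±3 points of their true score, which is the kind of individual-level information a reliability coefficient alone does not give.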
4. Item-response theory
Because of the limitations in CTS-theory and G-theory,
psychometricians have developed a number of
mathematical models for relating an individual’s test
performance to that individual’s level of ability.
Item response theory presents a more powerful
approach in that it can provide sample-free estimates of
individuals’ true scores, or ability levels, as well as
sample-free estimates of measurement error at each
ability level.
4. Item-response theory
The unidimensionality assumption: Most of the IRT models
make the specific assumption that the items in a test
measure a single, or unidimensional ability or trait, and
that the items form a unidimensional scale of
measurement
Item characteristic curve: Each specific IRT model makes
specific assumptions about the relationship between a
test taker’s ability and their performance on a given item.
These assumptions are explicitly stated in a mathematical
formula known as the item characteristic curve (ICC).
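For example, the two-parameter logistic model (one common IRT model) defines the ICC as P(θ) = 1 / (1 + e^(−a(θ − b))). A sketch with hypothetical item parameters:

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic ICC: probability of a correct response
    for a test taker of ability theta on an item with
    discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: discrimination 1.2, difficulty 0.5.
print(round(icc(0.5, 1.2, 0.5), 2))  # ability equals difficulty → 0.5
print(round(icc(2.0, 1.2, 0.5), 2))  # higher ability → 0.86
```

The curve crosses 0.5 exactly where ability matches item difficulty, and the discrimination parameter controls how steeply it rises.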
4. Item-response theory
Ability score: Recall that neither CTS theory nor G-theory
provides an estimation of an individual’s level of ability. One
of the advantages of IRT is that it provides estimates of
individual test takers’ levels of ability.
Precision of measurement: Precision of measurement is
addressed in the IRT concept of the item information function,
which refers to the amount of information a given item
provides for estimating an individual’s level of ability. The test
information function, on the other hand, is the sum of
the item information functions, each of which contributes
independently to the total, and is a measure of how much
information a test provides at different ability levels.
Reliability of criterion-
referenced test score
 Norm-referenced (NR) test scores are most useful in situations in which
comparative decisions are made, such as the selection of
individuals for a program. Criterion-referenced (CR) test scores, on the other hand,
are more useful when making ‘absolute’ decisions regarding
mastery or nonmastery of the ability domain.
 The concept of reliability applies to two aspects of criterion-
referenced tests:
- the accuracy of the obtained score as an indicator of a ‘domain’
score (J. D. Brown (1989) has derived a formula)
- the consistency of the decisions that are based on CR test
scores (Threshold loss agreement indices - Squared-error loss
agreement indices)
Factors that affect
reliability estimates
 Length of test: long tests are generally more reliable than
short ones
 Difficulty of test and test score variance: the greater the
score variance, the more reliable the tests will tend to be
(Norm-referenced tests)
 Cut-off score: the greater the differences between the
cut-off score and the mean score, the greater will be the
reliability (Criterion-referenced tests).
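The effect of test length can be quantified with the Spearman-Brown prophecy formula, r_new = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened:

```python
def spearman_brown(r, k):
    # Predicted reliability when a test of reliability r is lengthened
    # by a factor of k (with comparable items).
    return k * r / (1 + (k - 1) * r)

# Doubling a test with reliability 0.60:
print(round(spearman_brown(0.60, 2), 2))  # → 0.75
```

The same formula with k = 2 is what steps up a half-test correlation in the Spearman-Brown split-half estimate mentioned earlier.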
Systematic measurement
error
 Systematic error is different from random error.
For example, if every form of a reading comprehension test
contained passages from the area of “economics”, then
the facet ‘passage content’ would be fixed to one
condition - economics.
To the extent that test scores are influenced by individuals’
familiarity with this particular content area, as opposed
to their reading comprehension ability, this facet will be a
source of error in our measurement of reading
comprehension. This is a kind of systematic error.
Systematic measurement
error
The effects of systematic error:
- The general effect of systematic error is constant for all
observations; it affects the scores of all individuals who take the
test.
- The specific effect varies across individuals; it affects different
individuals differentially
The effects of test method
Standardization of test facets can introduce sources of
systematic variance into the test scores. When a single testing
technique is used (e.g. a cloze test), the test might be a better indicator
of individuals’ ability to take cloze tests than of their reading
comprehension ability.
Conclusion
 Any factors other than the ability being tested that
affect test scores are potential sources of error that
decrease the reliability of scores.
 Therefore, it is essential that we be able to identify
these sources of error and estimate the magnitude
of their effect on test scores.

Reliability in Language Testing

  • 1. Reliability in Language Testing
  • 2. Introduction  to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure.  errors of measurement (unreliability) because we know that test performance is affected by factors other than the abilities we want to measure.  When we minimize the effects of these various factors, we minimize measurement error and maximize reliability. ‘How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?’
  • 3. Introduction  In this chapter,  Measurement error in test scores,  The potential sources of this error,  The different approaches to estimating the relative effects of these sources of error on test scores, and  The considerations to be made in determining which of these approaches may be appropriate for a given testing situation.
  • 4. Factors that affect language test scores  The examination of reliability depends upon distinguishing the effects of the abilities we want to measure from the effects of other factors.  If we wish to estimate how reliable our test scores are:  we must begin with a set of definitions of the abilities we want to measure, and of the other factors that we expect to affect test scores.
  • 5. Factors that affect language test scores
  • 6. Factors that affect language test scores 1. Communicative language ability: specific abilities that determine how an individual performs on a given test. Example: In a test of sensitivity to register: the students who perform the best and receive the highest scores would be those with the highest level of sociolinguistic competence. 2. Test method facets: testing environment, the test rubric, the nature of input and expected response, the relationship between input and response 3. Personal attributes: individual characteristics (cognitive style and knowledge of particular content areas) - group characteristics (sex, race, and ethnic background) 4. Random factors: unpredictable and largely temporary conditions (mental alertness or emotional state) - uncontrolled differences in test method facets (changes in the test environment from one day to the next or differences in the way different test administrators carry out their responsibilities)
  • 7. Factors that affect language test scores 1. The primary interest in using language tests is to make inferences about one or more components of an individual’s communicative language ability. 2. Random factors and test method facets are generally considered to be sources of measurement error (unreliability) 3. Personal attributes (i.e. sex, ethnic background, cognitive style and prior knowledge of content area) are discussed as sources of test bias, or test invalidity, and these will therefore be discussed in Chapter 7 (validity)
  • 8. Theories and models of reliability  Any factors other than the ability being tested that affect test scores are potential sources of error that decrease the reliability of scores.  to identify these sources of error and estimate the magnitude of their effect on test scores. how different theories and models define the various influences on test scores
  • 9. 1. Classical True Score Measurement Theory  Classical true score (CTS) measurement theory consists of a set of assumptions about the relationships between observed test scores, true scores, and the factors that affect these scores.  Reliability is defined in CTS theory in terms of true score variance.  True score: due to an individual’s level of ability / Error score: due to factors other than the ability being tested observed score (actual test score) = true score + error score
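The decomposition on the slide above can be illustrated with a small simulation (a sketch with invented figures: a true-score SD of 10 and an error SD of 5); reliability then emerges as the ratio of true score variance to observed score variance.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# A minimal sketch of the CTS model with invented figures: each observed
# score is a hypothetical true score (SD 10) plus a random error (SD 5).
n_examinees = 1000
true_scores = [random.gauss(50, 10) for _ in range(n_examinees)]
observed = [t + random.gauss(0, 5) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability is the proportion of observed score variance that is true
# score variance; the theoretical value here is 100 / (100 + 25) = 0.8.
reliability = variance(true_scores) / variance(observed)
```

Because the error is random, the sample estimate will hover near, but not exactly at, the theoretical 0.8.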
  • 10. 1. Classical True Score Measurement Theory  Since we can never know the true scores of individuals, we can never know what the reliability is, but we can estimate it from the observed scores.  The basis for all such estimates in the CTS model is the correlation between parallel tests. Parallel tests: In order for two tests to be considered parallel, they must be measures of the same ability (equivalent, alternate forms). If the observed scores on two parallel tests are highly correlated, these tests can be considered reliable indicators of the ability being measured.
  • 11. 1. Classical True Score Measurement Theory Within the CTS model there are 3 approaches to estimating reliability, each of which addresses different sources of error: a. Internal consistency estimates are concerned with sources of error such as differences in test-tasks and item formats, inconsistencies within and among scorers. b. Stability estimates indicate how consistent test scores are over time c. Equivalence estimates provide an indication of the extent to which scores on alternate forms of a test are equivalent. The estimates of reliability that these approaches yield are called reliability coefficients.
  • 12. 1. Classical True Score Measurement Theory Internal consistency is concerned with how consistent test takers’ performances on different parts of the test are with each other Two approaches to estimating internal consistency -an estimate based on correlation between two halves (the Spearman-Brown split-half estimate) -estimates which are based on ratios of the variances of parts of the test – halves or items – to total test score variance (the Guttman split-half, the Kuder-Richardson formulae, and coefficient alpha)
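As a sketch of the first approach on the slide above (with invented 0/1 item responses), we can correlate odd- and even-item half scores and apply the Spearman-Brown correction to estimate the reliability of the full-length test:

```python
# Spearman-Brown split-half sketch; the item responses are invented.

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# 0/1 item responses for six examinees on a six-item test (made up)
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
]
odd_half = [sum(r[0::2]) for r in responses]
even_half = [sum(r[1::2]) for r in responses]

r_halves = pearson(odd_half, even_half)
# Step the half-test correlation up to the full test length
split_half = 2 * r_halves / (1 + r_halves)
```

Note that the corrected coefficient is always higher than the raw half-test correlation, reflecting the greater reliability of a longer test.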
  • 13. 1. Classical True Score Measurement Theory Rater consistency: In test scores that are obtained subjectively (ratings of compositions or oral interviews) a source of error is inconsistency in these ratings. Intra-rater reliability: In order to examine the reliability of the ratings of a single rater, at least two independent ratings from this rater are obtained. This is accomplished by rating the individual samples once and then re-rating them at a later time in a different, random order. Inter-rater reliability: two different raters. In examining inter-rater consistency, a rating is obtained from each of the two raters, and the two sets of ratings are correlated.
  • 14. 1. Classical True Score Measurement Theory  Stability (test-retest reliability): In this approach, we administer the test twice to a group of individuals and then compute the correlation between the two sets of scores. This correlation can then be interpreted as an indication of how stable the scores are over time.  Equivalence (parallel forms reliability): In this approach, we try to estimate the reliability of alternate forms of a given test by administering both forms to a group of individuals. The correlation between the two sets of scores can then be computed.
  • 15. 2. Generalizability theory (G-theory)  Generalizability theory is an extension of the classical model  It enables test developers to examine several sources of variance simultaneously, and to distinguish systematic error from random error.  Firstly, the test developer designs and conducts a study to investigate the sources of variance (G-study).  Depending on the outcome of this G-study, the test developer may revise the test or the procedures for administering it, or if the results are satisfactory, the test developer proceeds to the second stage, a decision study (D-study).
  • 16. 2. Generalizability theory (G-theory)  In a D-study, the test developer administers the test under operational conditions, in which the test will be used to make the decisions for which it is designed. Then, the test developer uses G-theory procedures to estimate the magnitude of the variance components.  Terms related to generalizability theory  Universe of generalization: the domain of uses or abilities (or both) to which we want test scores to generalize.  Universe of measures: the types of test scores we would be willing to accept as indicators of the ability to be measured.
  • 17. 2. Generalizability theory (G-theory)  Terms related to generalizability theory  Populations of persons: the group about whom we are going to make decisions or inferences  Universe score: the mean of a person’s scores on all measures from the universe of possible measures (similar to CTS-theory true score)  This conceptualization of generalizability reveals that a given estimate of generalizability is limited to the specific universe of measures and population of persons within which it is defined, and that a test score that is ‘True’ for all persons, times, and places simply does not exist.
  • 18. 2. Generalizability theory (G-theory) Generalizability Coefficients: The G-theory analog of the CTS-theory reliability coefficient is the generalizability coefficient: generalizability coefficient = universe score variance / observed score variance Estimation: In order to estimate the relative effect of different sources of variance on the observed scores, it is necessary to obtain multiple measures for each person under the different conditions for each facet
  • 19. 2. Generalizability theory (G-theory) Estimation: One statistical procedure that can be used for estimating the relative effects of different sources of variance on test scores is analysis of variance (ANOVA) Example: An oral interview: with different question forms, or sets of questions, and different interviewer/raters Using ANOVA, we could obtain estimates for all the variance components in the design: (1) the main effects for persons, raters, and forms; (2) the two-way interactions between persons and raters, persons and forms, and forms and raters; and (3) a component that contains the three-way interaction among persons, raters, and forms, as well as the random variance
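To illustrate the logic, here is a minimal sketch of a one-facet persons × raters G-study (simpler than the three-facet interview example above; the ratings are invented), estimating variance components from the ANOVA mean squares:

```python
# One-facet crossed G-study sketch: every rater rates every person.
ratings = [  # rows = persons, columns = raters (hypothetical data)
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 2, 3],
    [1, 2, 2],
]
n_p, n_r = len(ratings), len(ratings[0])

grand = sum(sum(row) for row in ratings) / (n_p * n_r)
p_means = [sum(row) / n_r for row in ratings]
r_means = [sum(ratings[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

# Sums of squares for persons, raters, and the residual
ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
ss_tot = sum((ratings[p][r] - grand) ** 2
             for p in range(n_p) for r in range(n_r))
ss_res = ss_tot - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

var_p = (ms_p - ms_res) / n_r  # universe score variance component
var_res = ms_res               # residual (error) variance component
# Generalizability coefficient for the mean of n_r ratings
g_coef = var_p / (var_p + var_res / n_r)
```

The persons component is the "universe score" variance we want to generalize; the residual feeds the error term in the coefficient.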
  • 20. 3. Standard Error of Measurement (SEM)  The approaches to estimating reliability that have been developed within both CTS theory and G-theory are based on group performance, and provide information for test developers and test users about how consistent the scores of groups are on a given test.  Reliability and generalizability coefficients provide no direct information about the accuracy of individual test scores.  We therefore need an indicator of how much we would expect an individual’s test score to vary.  The most useful indicator for this purpose is the standard error of measurement: the smaller the standard deviation of errors (the standard error of measurement, or SEM), the more reliable the test.
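In CTS terms the SEM can be computed from the score standard deviation and the reliability estimate, SEM = s · √(1 − r). A sketch with invented figures:

```python
import math

# Invented figures: reliability estimate of .84, score SD of 10,
# giving SEM = 10 * sqrt(1 - .84) = 4.
score_sd = 10.0
reliability = 0.84
sem = score_sd * math.sqrt(1 - reliability)

# A rough 95% band around one examinee's observed score of 62
observed_score = 62
low, high = observed_score - 1.96 * sem, observed_score + 1.96 * sem
```

This is the sense in which the SEM speaks to individual scores: it bounds how far we would expect a given examinee's observed score to stray from the true score.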
  • 21. 4. Item-response theory Because of the limitations in CTS-theory and G-theory, psychometricians have developed a number of mathematical models for relating an individual’s test performance to that individual’s level of ability. Item response theory presents a more powerful approach in that it can provide sample-free estimates of individual's true scores, or ability levels, as well as sample-free estimates of measurement error at each ability level.
  • 22. 4. Item-response theory The unidimensionality assumption: Most IRT models make the specific assumption that the items in a test measure a single, or unidimensional, ability or trait, and that the items form a unidimensional scale of measurement Item characteristic curve: Each specific IRT model makes specific assumptions about the relationship between a test taker’s ability and his or her performance on a given item. These assumptions are explicitly stated in a mathematical formula known as the item characteristic curve (ICC).
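One common ICC is the two-parameter logistic (2PL) model; the sketch below (with arbitrary discrimination and difficulty values) shows how the modeled probability of a correct response rises with ability:

```python
import math

# 2PL item characteristic curve; a (discrimination) and b (difficulty)
# are arbitrary illustrative values.
def icc(theta, a=1.2, b=0.0):
    """Probability of a correct response at ability level theta."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# The probability rises monotonically with ability, reaching .5 at theta = b
probs = [icc(t) for t in (-2, -1, 0, 1, 2)]
```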
  • 23. 4. Item-response theory Ability score: Recall that neither CTS theory nor G-theory provides an estimate of an individual’s level of ability. One of the advantages of IRT is that it provides estimates of individual test takers’ levels of ability. Precision of measurement: Precision of measurement is addressed in the IRT concept of the item information function, which refers to the amount of information a given item provides for estimating an individual’s level of ability. The test information function, on the other hand, is the sum of the item information functions, each of which contributes independently to the total, and is a measure of how much information a test provides at different ability levels.
  • 24. Reliability of criterion-referenced test scores  NR (norm-referenced) test scores are most useful in situations in which comparative decisions are made, such as the selection of individuals for a program. CR (criterion-referenced) test scores, on the other hand, are more useful when making ‘absolute’ decisions regarding mastery or non-mastery of the ability domain.  The concept of reliability applies to two aspects of criterion-referenced tests: - the accuracy of the obtained score as an indicator of a ‘domain’ score (J. D. Brown (1989) has derived a formula) - the consistency of the decisions that are based on CR test scores (threshold loss agreement indices - squared-error loss agreement indices)
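The decision-consistency idea can be sketched as a simple threshold-loss agreement index: the proportion of examinees classified the same way (master/non-master) on two administrations. The cut-off and scores below are invented:

```python
# Hypothetical scores on two forms of a CR test, with a cut-off of 60
cut = 60
form_a = [55, 72, 80, 48, 63, 91, 59, 66]
form_b = [58, 75, 77, 50, 57, 88, 62, 70]

# Count examinees given the same mastery classification on both forms
same = sum((a >= cut) == (b >= cut) for a, b in zip(form_a, form_b))
p_agreement = same / len(form_a)
```

Here six of the eight examinees fall on the same side of the cut-off on both forms, so the agreement index is 0.75.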
  • 25. Factors that affect reliability estimates  Length of test: long tests are generally more reliable than short ones  Difficulty of test and test score variance: the greater the score variance, the more reliable the tests will tend to be (Norm-referenced tests)  Cut-off score: the greater the differences between the cut-off score and the mean score, the greater will be the reliability (Criterion-referenced tests).
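The effect of test length can be sketched with the Spearman-Brown prophecy formula, r′ = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened; the figures below are invented:

```python
def spearman_brown(r, k):
    """Predicted reliability when a test of reliability r is lengthened k-fold."""
    return k * r / (1 + (k - 1) * r)

r_short = 0.60
r_doubled = spearman_brown(r_short, 2)  # 1.2 / 1.6 = 0.75
r_tripled = spearman_brown(r_short, 3)  # 1.8 / 2.2 ~ 0.82
```

The gains diminish as the test grows, which is why simply adding items is not an unlimited route to reliability.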
  • 26. Systematic measurement error  Systematic error is different from random error. For example, if every form of a reading comprehension test contained passages from the area of “economics”, then the facet ‘passage content’ would be fixed to one condition - economics. To the extent that test scores are influenced by individuals’ familiarity with this particular content area, as opposed to their reading comprehension ability, this facet will be a source of error in our measurement of reading comprehension. It is a kind of systematic error.
  • 27. Systematic measurement error The effects of systematic error: - The general effect of systematic error is constant for all observations; it affects the scores of all individuals who take the test. - The specific effect varies across individuals; it affects different individuals differentially The effects of test method Standardization of test facets results in introducing sources of systematic variance into the test scores. When a single testing technique is used (e.g. the cloze test), the test might be a better indicator of individuals’ ability to take cloze tests than of their reading comprehension ability.
  • 28. Conclusion  Any factors other than the ability being tested that affect test scores are potential sources of error that decrease the reliability of scores.  Therefore, it is essential that we be able to identify these sources of error and estimate the magnitude of their effect on test scores.

Editor's Notes

  1. 1. A fundamental concern in the development and use of language tests is … 2. We must be concerned about errors of measurement, or unreliability, 4. The investigation of reliability is concerned with answering the question:
  2. 1. So, you see the outline. In this chapter, we will discuss …
  3. Let’s start with the factors that affect language test scores. 1. Measurement specialists have long recognized that the examination of reliability depends upon distinguishing the effects (on test scores) of the abilities we want to measure from the effects of other factors. 2. That is, if we wish to estimate how reliable our test scores are, we must begin with a set of definitions of the abilities we want to measure, and of the other factors that we expect to affect test scores (Stanley 1971: 362).
  4. The effects of these various factors on a test score can be illustrated as in this figure.
  5. 1. Communicative language ability REFERS TO the specific abilities that determine how an individual performs on a given test. In a test of sensitivity to register, for example, we would expect that the students who perform the best and receive the highest scores would be those with the highest level of sociolinguistic competence. 3. As for personal attributes, we can say that, Attributes of individuals that are not related to language ability include individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race, and ethnic background. 4. Random factors on the other hand refer to the unpredictable and largely temporary conditions, such as mental alertness or emotional state, and uncontrolled differences in test method facets, such as changes in the test environment from one day to the next, or idiosyncratic differences in the way different test administrators carry out their responsibilities.
  6. I would retell that our primary interest in using language tests is to make inferences about one or more components of an individual’s communicative language ability.
  7. 1. A fundamental concern in the development and use of language tests is that … 2. Therefore, it is essential that we be able to identify these sources of error and estimate the magnitude of their effect on test scores. 3. So, now, let’s talk about how…
  8. 4. So, Observed score, in other words, the actual test score, equals to the sum of true score and error score.
  9. This means that if the observed scores on two tests…
  10. 2. Two approaches to estimating internal consistency can be discussed: On the one hand, we have an estimate based on correlation between two halves (the Spearman-Brown split-half estimate), on the other, estimates which are based on ratios of the variances of parts of the test – halves or items – to total test score variance (the Guttman split-half, the Kuder-Richardson formulae, and coefficient alpha).
  11. There are two types of reliability estimates:
  12. Equivalence: several alternate sets of interview questions.
  13. It says ‘component’ because with this method we can estimate more than one effect.
  14. Hypothetically…
  15. When universe score coefficient is divided by observed score coefficient, we get generalizability coefficient. How do we estimate relative effect of different sources variance in this theory?
  16. To illustrate the logic of this procedure, consider our earlier example of an oral interview with different question forms, or sets of questions, and different interviewer/raters
  17. 2. however, 3. For this, we need 5. As an assumption, we can say that the smaller standard deviation of errors
  18. Let’s talk about some specific concepts related to item response theory.
  19. Precision: that is, exactness, sensitivity…
  20. Until now, we have talked about the reliability of norm-referenced test scores.
  21. Thus far, different sources of error have been discussed as the primary factors that affect the reliability of tests. In addition to these sources of error, there are general characteristics of tests and test scores that influence the size of our estimates of reliability.