www.ou.edu.vn
HCMC OPEN UNIVERSITY GRADUATE SCHOOL
MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY OPEN UNIVERSITY
TEST CONSIDERATION: RELIABILITY &
VALIDITY
Presenters : Group 5 Lý Tuấn Phú
Đặng Kiều Anh
Nguyễn Duy Cường
Nguyễn Thị Kim Loan
Mai Xuân Ái
Trần Thị Kim Ngân
July, Hochiminh City, 2013
 I. Reliability
 - Introduction
 - Factors that affect language test scores
 - Classical true score measurement theory
 - Generalizability theory
 - Standard error of measurement: interpreting individual test scores
within classical true score and generalizability theory
 - Item response theory
 - Reliability of criterion-referenced test scores
 - Factors that affect reliability estimates
 - Systematic measurement error
 II. Validation/ Validity
 - Introduction
 - Reliability and validity revisited
 - Validity as a unitary concept
 - The evidential basis of validity
 - Test bias
 - The consequential or ethical basis of validity
 - Postmortem: face validity
 Relationship between reliability & validity:
complementary aspects
(1) (Reliability) to minimize the effects of
measurement error, and
(2) (Validity) to maximize the effects of the language
abilities we want to measure.
 Investigation of reliability: we must identify
sources of error and estimate the degree of their
effects on test scores. This means distinguishing the
effects of the language abilities we want to measure
from the effects of other factors: a complex problem!
[Figure: factors that affect the TEST SCORE: communicative language ability, test method facets, personal attributes, random factors]
 Different factors will affect different individuals
differently.
 In designing and developing language tests, we try to
minimize the effects of these factors on test performance:
◦ test method facets and random factors: sources of measurement error
◦ personal attributes: sources of test bias (test invalidity)
• ‘Mean’ ($\bar{x}$): the average of the scores of a
given group of test takers.
• ‘Variance’ ($s^2$): how much individual scores
vary from the group mean.
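As a small illustration of the two statistics above, the sketch below computes the mean and the variance for a made-up set of scores. The scores and function names are hypothetical, and the variance uses the sample (n - 1) form, which is an assumption because the slide does not say which form is meant.

```python
def mean(scores):
    """Average of the scores of a group of test takers."""
    return sum(scores) / len(scores)


def variance(scores):
    """Sample variance: mean squared deviation from the group mean (n - 1 form)."""
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores) / (len(scores) - 1)


# Hypothetical scores for five test takers (not data from the slides).
scores = [12, 15, 9, 18, 11]
group_mean = mean(scores)
group_variance = variance(scores)
print(group_mean, group_variance)
```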
Classical true score (CTS) measurement theory
consists of a set of assumptions about the
relationships between actual, or observed test scores
and the factors that affect these scores.
Concept 1: True score and error score
1. An observed score on a test comprises 2 factors: a
true score (an individual’s level of ability) & an
error score (factors other than the ability being
tested).
2. The relationship between true and error scores:
error scores are unsystematic, or random, and are
uncorrelated with true scores.
Concept 2: Parallel tests
Two tests are parallel if, for every group of
persons taking both tests, (1) the true score on one
test is equal to the true score on the other, and (2)
the error variances for the two tests are equal.
3. Reliability of observed scores:
a. Reliability as the correlation between parallel tests:
If the observed scores on two parallel tests are
highly correlated, this indicates that effects of the error
scores are minimal, and that they can be considered
reliable indicators of the ability being measured.
b. Reliability and measurement error as proportions of
observed score variance
If an individual's observed score on a test is
composed of a true score and an error score, the greater
the proportion of true score, the less the proportion of
error score, and thus the more reliable the observed
score.
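The two views of reliability above can be checked against each other with a small simulation. This is only a sketch under assumed values: true scores with a standard deviation of 10 and random errors with a standard deviation of 5 are invented for illustration, and the helper names are hypothetical. Under these assumptions the share of observed variance that is true variance is 100 / 125, and the correlation between two parallel forms should land near the same value.

```python
import random


def pvariance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
true_scores = [rng.gauss(50, 10) for _ in range(n)]
# Two parallel tests: same true scores, independent errors with equal variance.
form_a = [t + rng.gauss(0, 5) for t in true_scores]
form_b = [t + rng.gauss(0, 5) for t in true_scores]

variance_ratio = pvariance(true_scores) / pvariance(form_a)
parallel_corr = pearson(form_a, form_b)
print(round(variance_ratio, 2), round(parallel_corr, 2))
```

In real testing the true scores are never observed, which is why reliability has to be estimated from observed scores, for example through the parallel-forms correlation.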
3 approaches to estimating reliability: (p.173 – p.
185)
1. Internal consistency estimates: are concerned
primarily with sources of error from within the test
and scoring procedures.
2. Stability estimates indicate how consistent test
scores are over time.
3. Equivalence estimates provide an indication of the
extent to which scores on alternate forms of a test
are equivalent.
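As one possible illustration of an internal consistency estimate, the sketch below computes coefficient alpha (Cronbach's alpha). The slide does not name a specific coefficient, so the choice of alpha is an assumption, and the response data and function names are hypothetical.

```python
def sample_variance(values):
    """Sample variance (n - 1 form)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(item_scores):
    """item_scores: one row per test taker, one column per item."""
    k = len(item_scores[0])
    item_vars = [sample_variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = sample_variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1 item scores: 5 test takers x 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Stability and equivalence estimates would instead correlate scores from two administrations or from two alternate forms.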
 Problems with the classical true score model:
+ The CTS model treats error variance as
homogeneous in origin.
+ The CTS model considers all error to be random,
and consequently fails to distinguish systematic error
from random error.
https://www.youtube.com/watch?v=CSI-1Zk6oeM
https://www.youtube.com/watch?v=k84ksLUWKKc
 Generalizability theory (G-theory) constitutes a theory
and set of procedures for specifying and estimating the
relative effects of different factors on observed test
scores
=> It provides a means for relating the uses or
interpretations of test scores to the way test
users specify and interpret different factors, or
sources of error.
 A given measure or score is treated as a
sample from a hypothetical universe of
possible measures.
 Interpreting a test score = generalizing from
a single measure to a universe of measures.
(on the basis of an individual’s performance
on a test => generalize to her performance in
other contexts).
 The more reliable the sample of performance,
or test score is, the more generalizable it is.
[Diagram: Reliability = generalizability; the extent of generalizability; defining the universe of measures; the universe of generalization]
 The application of G-theory to test
development and use involves two stages:
a generalizability study (‘G-study’) and a
decision study (‘D-study’). Together they let us:
 specify the different sources of variance,
 estimate the relative importance of these different
sources simultaneously,
 employ these estimates in the interpretation
and use of test scores.
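The steps above can be sketched for the simplest case, a design in which every test taker is rated by every rater. The slides do not describe an estimation method, so the ANOVA-based variance components below are an assumption, as are the ratings and the function names `g_study` and `d_study`.

```python
def g_study(scores):
    """scores[p][r]: rating of person p by rater r (fully crossed design)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square estimates; negative estimates are set to zero.
    var_res = max(ms_res, 0.0)
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    return {"persons": var_p, "raters": var_r, "residual": var_res}


def d_study(components, n_raters):
    """Relative generalizability coefficient for a mean over n_raters."""
    var_p = components["persons"]
    return var_p / (var_p + components["residual"] / n_raters)


# Hypothetical ratings: 4 test takers x 3 raters.
ratings = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [1, 2, 2],
]
components = g_study(ratings)
for k in (1, 2, 3):
    print(k, round(d_study(components, k), 3))
```

The G-study part estimates how much variance comes from persons, raters, and the residual; the D-study part uses those estimates to ask how dependable scores would be with a given number of raters.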
 The universe of generalization is the domain of
uses or abilities to which we want test scores
to generalize.
 The universe of measures comprises the types of test
scores we would be willing to accept as indicators of
the ability to be measured for the purpose
intended.
 The population of persons is the group about whom
we are going to make decisions or inferences.
 The degree of generalizability determines the
way we define the population.
Ex: using test results for making decisions
about 1 group => this group is the population of
persons.
Using a test with more than one group
(entrance or placement tests) => generalizing
beyond a particular group.
 If we could obtain measures for an individual
under all the different conditions specified in
the universe of possible measures, his
average score on these measures might be
considered the best indicator of his ability.
=> The universe score is defined as the mean of a
person’s scores on all measures from the universe of
possible measures.
 The standard error of measurement is the
indicator of how much we would expect an
individual’s test scores to vary, given a
particular level of reliability.
 When investigating the amount of
measurement error in individual test scores,
we are looking at differences b/w test takers’
obtained scores and their true scores.
 The error score is the difference between an
obtained score and the true score.
 The more reliable the test is, the closer the
obtained scores will cluster around the true score
mean => smaller standard deviation of errors.
 The less reliable the test, the greater the
standard deviation.
 Because of the importance in the interpretation
of test scores, the standard deviation of the error
scores has a name: the standard error of
measurement (SEM).
SEM provides a means for applying estimates
of reliability to the interpretation and use of
individuals’ observed test scores
 its primary advantage: makes test users
aware of how much variability in observed
scores to expect as a result of measurement
error.
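A short sketch of how the SEM is commonly computed and used follows. It assumes the standard classical true score estimate, SEM = s_x * sqrt(1 - r_xx'), which the slide does not spell out; the standard deviation, reliability, and observed score are hypothetical values.

```python
import math


def standard_error_of_measurement(sd_observed, reliability):
    """SEM = s_x * sqrt(1 - r_xx'), the usual classical true score estimate."""
    return sd_observed * math.sqrt(1.0 - reliability)


def score_band(observed, sem, z=1.96):
    """Band of +/- z SEMs around an observed score (z = 1.96 for about 95%)."""
    return observed - z * sem, observed + z * sem


# Hypothetical values: SD of 15 points, reliability estimate of 0.84.
sem = standard_error_of_measurement(15.0, 0.84)
low, high = score_band(100.0, sem)
print(round(sem, 2), round(low, 1), round(high, 1))
```

With these invented numbers the SEM is about 6 points, so an observed score of 100 is better read as a band of roughly 88 to 112 than as a single exact value.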
 Norm-referenced (NR) test scores:
maximize inter-individual score
differences or score variance
 Criterion-referenced (CR) test scores:
- provide information about an
individual’s relative ‘mastery’ of an
ability domain
- are developed to be representative of the
criterion ability
- occur in educational programs and
language classrooms
- commonly take the form of achievement tests
consistency, stability, equivalence => CR tests
 well-defined set of tasks or items that
constitute a domain => CR test
development
 “true score” and “universe score” =>
“domain score”
 Length of test
 Difficulty of test and test score variance
 Cut-off score
 Effects of systematic measurement error
-General effect
-Specific effect
 Effects of test method
- systematic error
- random error
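Test length is listed above as one factor that affects reliability estimates. One common way to illustrate this is the Spearman-Brown prophecy formula, which is not named on the slides and is used here as an assumption; it presumes that the added items are parallel to the existing ones. The starting reliability of 0.60 is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical starting point: a short test with reliability 0.60.
for factor in (1, 2, 3):
    print(factor, round(spearman_brown(0.60, factor), 3))
```

The predicted reliability rises as the test gets longer, but with diminishing returns.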
II. 1. Introduction
Definition of validation
 The examination of validity: examining the validity
of a given use of test scores is a complex process that
must involve the examination of both the evidence that
supports that interpretation or use and the ethical values
that provide the basis or justification for that
interpretation or use (Messick 1975, 1980, 1989).
In test validation we are not examining the validity of
the test content or of even the test scores themselves,
but rather the validity of the way we interpret or use
the information gathered through the testing
procedure.
Validity is not simply a function of the content and
procedure of the test itself; we must also consider how
test takers perform.
 Reliability is a requirement for validity
 The investigation of reliability and validity can be viewed as
complementary aspects of identifying, estimating, & interpreting
different sources of variance in test scores.
 The investigation of reliability is concerned with answering the
questions: How much variance in test scores is due to
measurement error? How much variance is due to factors other
than measurement error?
 Validity is concerned with identifying the factors that produce
the reliable variance in test scores.
The question addressed: what specific abilities account for the
reliable variance in test scores?
II. 2. Reliability and validity revisited
The relationship b/w reliability & validity
 Definition of validation.
 The relationship between reliability and validity, viewing the
estimation of reliability as an essential requisite of validation.
 The framework proposed by Messick (1989) for considering
validity as a unitary though multifaceted concept.
 The evidential basis for validity
 Construct validity ( includes content relevance, criterion
relatedness)
 Test bias ( including culture, test content, personality characteristics
of test takers, sex, age).
 The ethical, or consequential basis of test use.
II. Validation/ Validity
Another way to distinguish reliability from validity: to
consider the theoretical frameworks upon which
they depend.
 In estimating reliability we are concerned primarily with
examining variance in test scores themselves.
 In validation, we must consider other sources of variance, and
utilize the theory of abilities that we hypothesize will affect test
performance.
 The process of validation must look beyond reliability and
examine the relationship b/w test performance and factors
outside the test itself.
The relationship b/w reliability & validity (cont)
However, the distinction b/w reliability & validity is
still not clear-cut, because it is difficult to distinguish:
 different test methods from each other
 abilities from test methods
The relationship b/w reliability & validity (cont)
The classic statement of the relationship b/w reliability
& validity by Campbell and Fiske (1959) :
Reliability: agreement b/w similar
measures of the same trait
(for example, correlation b/w
scores on parallel tests)
Validity: agreement b/w different
measures of the same trait
(for example, correlation
b/w scores on a multiple-choice
test of grammar
& ratings of grammar on
an oral interview)
The relationship b/w reliability & validity (cont)
 In many cases, the distinctiveness of the test methods is
not so clear.
->We must carefully consider not only similarities in the
test content, but also similarities in the test methods, in
order to determine whether correlations b/w tests should
be interpreted as estimators of reliability or as evidence
supporting validity.
 Language testing has a very special and complex problem
when it comes to traits and methods -> it is difficult for
language tests to distinguish traits from methods.
Source of justification | Function of outcome of testing: Test interpretation | Function of outcome of testing: Test use
Evidential basis | Construct validity | Construct validity + Relevance/Utility
Consequential basis | Construct validity + Value implications | Construct validity + Relevance/Utility + Social consequences
II. 3. Validity as a unitary concept
 content relevance requires ‘the specification
of the behavioral domain in question and
the attendant specification of the task or
test domain’ (Messick 1980: 1017).
 content coverage, or the extent to which the
tasks required in the test adequately
represent the behavioral domain in question.
The examination of content relevance and
content coverage is a necessary part of the
validation process
 The criterion may be level of ability as defined by group
membership, individuals’ performance on
another test of the ability in question, or their
relative success in performing some task that
involves this ability.
 Information on concurrent criterion relatedness is
undoubtedly the most commonly used in
language testing.
 There are 2 forms:
(1) examining differences in test performance
among groups of individuals at different levels of
language ability.
(2) examining correlations among various
measures of a given ability.
 To examine predictive utility, we need to collect data
showing a relationship between scores on the test and
job or course performance
 In this case we can largely ignore the question of what
abilities are being measured
Construct validity is indeed the unifying concept
that integrates criterion and content considerations
into a common framework for testing rational
hypotheses about theoretically relevant
relationships.
(Messick 1980: 1015)
Construct validation requires both logical
analysis and empirical investigation.
The test developer involved in the process of construct
validation is likely to collect several types of empirical
evidence. These may include any or all of the following:
(1) the examination of patterns of correlations among
item scores and test scores, and between
characteristics of items and tests and scores on items and
tests;
(2) analyses and modeling of the processes underlying
test performance;
(3) studies of group differences;
(4) studies of changes over time;
(5) investigation of the effects of experimental
treatment (Messick 1989)
Correlational evidence is derived from a
family of statistical procedures that examine
the relationships among variables, or
measures.
A correlation is a functional relationship
between two measures.
Correlational approaches to construct
validation may utilize both
exploratory and confirmatory modes.
It is impossible to make clear, unambiguous
inferences regarding the influence of various
factors on test scores on the basis of a single
correlation between two tests.
A commonly used procedure for interpreting a
large number of correlations is factor analysis
 Characteristic of the multitrait-multimethod (MTMM)
design: each measure is considered to
be a combination of trait and method, and
tests are included in the design so as to
combine multiple traits with multiple
methods.
 Advantage: permits the investigator to
examine patterns of both convergence and
discrimination among correlations.
Convergence is essentially what
Analysis of data: many ways
(1) the direct inspection of convergent and
discriminant correlations
(2) the analysis of variance
(3) confirmatory factor analysis
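The first option, direct inspection of convergent and discriminant correlations, can be sketched as follows. The scores are invented for six hypothetical test takers, the trait and method labels are illustrative, and the Pearson correlation is assumed as the measure of agreement.

```python
from itertools import combinations


def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical scores for six test takers, keyed by (trait, method).
measures = {
    ("grammar", "multiple-choice"): [1, 2, 3, 4, 5, 6],
    ("grammar", "oral interview"): [1, 2, 4, 3, 5, 6],
    ("vocabulary", "multiple-choice"): [2, 6, 1, 5, 3, 4],
    ("vocabulary", "oral interview"): [2, 6, 1, 5, 4, 3],
}

convergent, discriminant = [], []
for (key_a, a), (key_b, b) in combinations(measures.items(), 2):
    r = pearson(a, b)
    if key_a[0] == key_b[0]:
        convergent.append(r)    # same trait, different method
    else:
        discriminant.append(r)  # different traits

print("convergent:", [round(r, 2) for r in convergent])
print("discriminant:", [round(r, 2) for r in discriminant])
```

Evidence for construct validity would show convergent correlations that are clearly higher than the discriminant ones, as they are in this invented data set.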
 In a true experimental design, individuals are assigned at
random to two or more groups, each of which is given a
different treatment. At the end of the treatment,
observations are made to investigate differences among
the different groups.
 There are two distinguishing characteristics of a true
experimental design. The first is that of randomization,
which means that (1) a sample of subjects is randomly
selected from a population, and (2) the individuals in
this random sample are then randomly assigned to two
or more groups for comparison.
 The second characteristic is that of experimental
intervention, or treatment. This means that the
different groups of subjects are exposed to distinct
treatments, or sets of circumstances, as part of the
experiment.
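The two randomization steps described above can be sketched with Python's standard `random` module. The population of learner IDs, the sample size, and the group names are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the illustration is repeatable

# Hypothetical population of 100 learner IDs.
population = [f"learner_{i:03d}" for i in range(100)]

# (1) Randomly select a sample of subjects from the population.
sample = rng.sample(population, 20)

# (2) Randomly assign the sampled individuals to two groups.
shuffled = sample[:]
rng.shuffle(shuffled)
group_a = shuffled[:10]
group_b = shuffled[10:]

print(len(group_a), len(group_b))
```

Each group would then receive a distinct treatment, and the groups' test performance would be compared at the end.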
 the process of construct validation is a complex
and continuous undertaking, involving both ( 1)
theoretical, logical analysis leading to empirically
testable hypotheses, and (2) a variety of
appropriate approaches to empirical observation
and analysis
 The result of this process of construct
validation will be a statement regarding the extent
to which the test under consideration provides a
valid basis for making inferences about the given
ability with respect to the types of individuals and
contexts that have provided the setting for the
validation research.
What is test bias?
 Systematic differences in test performance
that result from differences in individual
characteristics
 Example: gender difference in mathematical
ability
A reliable mathematics test is given to representative
groups of males and females.
On average, males have higher scores than
females
=> Tendency to interpret this as: “males have
greater mathematical ability than females”
However, test scores should not be interpreted as reflecting
purely mathematical ability.
The differences b/w test scores may be due to test bias,
NOT to differences in true mathematical ability.
=> Differences in group performance do not in themselves
indicate test bias.
=> If the systematic differences are not logically
related to the ability being tested => the test is
biased
 + misinterpretation of test score
 + sexist / racist content (content validity)
 + unequal prediction of criterion performance
 + unfair content (content validity)
 + inappropriate selection procedures
 + inadequate criterion measures
 + threatening atmosphere
 + conditions of testing
+ Cultural background
 Cultural differences (Brière 1968, 1973; Brière and Brown 1971)
 The problem of cultural content (Plaister 1967; Condon 1975)
 Using item response theory, Chen and Henning (1985) found that
some multiple-choice vocabulary items favored one linguistic and
cultural subgroup
 Aptitude tests: possibly biased toward culturally different
groups (Zeidner 1986)
+ Background knowledge
 Prior knowledge affects test performance (Chacevych et al.
1982)
 In ESP testing, students' performance is affected as much by their
prior knowledge as by their language proficiency
+ Cognitive characteristics
 Cognitive factors influence language acquisition (Brown 1987)
 Cognitive styles/ learning styles:
 + field- dependent/ independent
 a field-independent learning style is defined
by a tendency to separate details from the
surrounding context (cited from
http://www.teachingenglish.org.uk/knowledge-database/field-independent-learners)
 a field-dependent learning style, which is
defined by a relative inability to distinguish
detail from other information around it
 Example
 Field-independent learners tend to rely less on the
teacher or other learners for support.
 => Psychological differences
 Ambiguity tolerance/ intolerance : cognitive
flexibility
 Tolerance of ambiguity: one's acceptance of
confusing situations and a lack of clear lines of
demarcation (Ely 1989)
 One facet of personality characteristics: related to
risk taking. Those who can tolerate ambiguity are
more likely to take risks in language learning, an
essential part of making progress in language
acquisition
 (As cited in Grace, 1997)
 Tests serve the needs of an educational system or of
society
 The use of language tests reflects in microcosm the
role of tests in general as instruments of social policy
 The role of tests can be described via kinds of tests
 + placement
 + diagnosis
 + selection (based on proficiency/ achievement)
 + evaluation
 + making decisions
 The issues involved in the ethics of tests:
 + numerous
 + vary across societies, cultures, testing contexts
=> focus on the rights of individual test takers
:
+ secrecy
+ access to information
+ privacy
+ confidentiality
+ consent
+ the balance b/w individual rights and the
values of the society
 As test developers and test users, people need to
consider:
+ the rights & interests of test takers
+ the responsibilities of institutions for
making decisions based on tests
+ public interest
 These considerations are political, dynamic, and
vary across societies
 These considerations have implications for the
practice of the teaching profession, the kinds of tests to
be developed, and the ways in which test usefulness
is justified.
 We must move out of the comfortable confines of
applied linguistic and psychometric theory and into the
arena of public policy.
 Hulin et al. (1983):
 "It is important to realize that testing and social policy
cannot be totally separated and that questions about
the use of tests cannot be addressed without
considering existing social forces, whatever they are" (p.
285)
 4 areas of consideration in the ethical use and
interpretation of test results (Messick 1980, 1988b):
+ construct validity, i.e. the evidence that supports the
interpretation of test scores
+ value systems that inform test use
+ practical usefulness of the test
+ the consequences to the educational system or
society of using test results for a particular purpose
In short, complete evidence should be
provided:
+ to show that tests are used as
valid indicators of the abilities that
are appropriate to the intended use
+ to determine the use of the test
 Face validity: the appeal or appearance of a
test
 The test appears to measure what it is supposed to measure.
 Test appearance has a considerable effect on
the acceptability of tests to both test takers
and test users.
 It affects whether test takers take the test seriously
enough to try their best, whether they accept the
test, and whether the test is seen as useful.
 => Test takers' reactions influence the validity
and reliability of tests.
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
 
Validity & reliability seminar
Validity & reliability seminarValidity & reliability seminar
Validity & reliability seminar
 
What makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docxWhat makes a good testA test is considered good” if the .docx
What makes a good testA test is considered good” if the .docx
 
Meaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptxMeaning and Methods of Estimating Reliability of Test.pptx
Meaning and Methods of Estimating Reliability of Test.pptx
 
Reliability of test
Reliability of testReliability of test
Reliability of test
 
Characteristics of a good test
Characteristics of a good test Characteristics of a good test
Characteristics of a good test
 
Qualities of good evaluation tool (1)
Qualities of good evaluation  tool (1)Qualities of good evaluation  tool (1)
Qualities of good evaluation tool (1)
 
RELIABILITY AND VALIDITY
RELIABILITY AND VALIDITYRELIABILITY AND VALIDITY
RELIABILITY AND VALIDITY
 
Validity, reliability & practicality
Validity, reliability & practicalityValidity, reliability & practicality
Validity, reliability & practicality
 
Chapter 8 compilation
Chapter 8 compilationChapter 8 compilation
Chapter 8 compilation
 
Test characteristics
Test characteristicsTest characteristics
Test characteristics
 
Week 8 & 9 - Validity and Reliability
Week 8 & 9 - Validity and ReliabilityWeek 8 & 9 - Validity and Reliability
Week 8 & 9 - Validity and Reliability
 
Validity and reliability of questionnaires
Validity and reliability of questionnairesValidity and reliability of questionnaires
Validity and reliability of questionnaires
 
Adapted from Assessment in Special and incl.docx
Adapted from Assessment in Special and incl.docxAdapted from Assessment in Special and incl.docx
Adapted from Assessment in Special and incl.docx
 

Recently uploaded

Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 

Recently uploaded (20)

Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 

Validity and reliability - Language testing

  • 1. www.ou.edu.vn HCMC OPEN UNIVERSITY GRADUATE SCHOOL. MINISTRY OF EDUCATION AND TRAINING, HO CHI MINH CITY OPEN UNIVERSITY. TEST CONSIDERATION: RELIABILITY & VALIDITY. Presenters: Group 5: Lý Tuấn Phú, Đặng Kiều Anh, Nguyễn Duy Cường, Nguyễn Thị Kim Loan, Mai Xuân Ái, Trần Thị Kim Ngân. Ho Chi Minh City, July 2013
  • 2. I. Reliability: Introduction; Factors that affect language test scores; Classical true score measurement theory; Generalizability theory; Standard error of measurement: interpreting individual test scores within classical true score and generalizability theory; Item response theory; Reliability of criterion-referenced test scores; Factors that affect reliability estimates; Systematic measurement error. II. Validation/Validity: Introduction; Reliability and validity revisited; Validity as a unitary concept; The evidential basis of validity; Test bias; The consequential or ethical basis of validity; Postmortem: face validity.
  • 3. Reliability and validity are complementary aspects of test development: (1) reliability: to minimize the effects of measurement error, and (2) validity: to maximize the effects of the language abilities we want to measure. To investigate reliability, we must identify sources of error and estimate the degree of their effects on test scores. This in turn requires distinguishing the effects of the language abilities we want to measure from the effects of other factors, which is a complex problem.
  • 5. Different factors will affect different individuals differently. In designing and developing language tests, we aim to minimize the effects on test performance of: ◦ test method and random factors (sources of measurement error); ◦ personal attributes (sources of test bias, or test invalidity). • 'Mean' (x̄): the average of the scores of a given group of test takers. • 'Variance' (s²): how much individual scores vary from the group mean.
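The two statistics defined on this slide can be illustrated with a minimal sketch. The scores are hypothetical, and the sample variance (dividing by n - 1) is assumed here; the slides do not say which divisor they intend.

```python
from statistics import mean, variance

# Hypothetical scores for a group of five test takers (illustrative only).
scores = [12, 15, 9, 18, 16]

group_mean = mean(scores)          # x-bar: the average score of the group
group_variance = variance(scores)  # s^2: sample variance (divides by n - 1)

print("mean:", group_mean)
print("variance:", group_variance)
```

The squared deviations from the mean are summed and divided by n - 1, so a group whose scores are spread further from the mean has a larger variance.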
  • 6. Classical true score (CTS) measurement theory consists of a set of assumptions about the relationships between actual, or observed test scores and the factors that affect these scores. Concept 1: True score and error score 1. An observed score on a test comprises 2 factors: a true score (an individual’s level of ability) & an error score (factors other than the ability being tested). 2. The relationship between true and error scores: error scores are unsystematic, or random, and are uncorrelated with true scores.
  • 7. Concept 2: Parallel tests Two tests are parallel if, for every group of persons taking both tests, (1) the true score on one test is equal to the true score on the other, and (2) the error variances for the two tests are equal.
  • 8. 3. Reliability of observed scores: a. Reliability as the correlation between parallel tests: if the observed scores on two parallel tests are highly correlated, this indicates that the effects of the error scores are minimal, and that the observed scores can be considered reliable indicators of the ability being measured. b. Reliability and measurement error as proportions of observed score variance: if an individual's observed score on a test is composed of a true score and an error score, then the greater the proportion of true score, the smaller the proportion of error score, and thus the more reliable the observed score.
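A small simulation can make the two views of reliability concrete. This is only a sketch under the CTS assumptions (observed score = true score + random error, equal error variances on both forms). The distributions, sample size, and names such as `form_a` and `form_b` are illustrative assumptions, not taken from the slides.

```python
import random

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def sample_variance(xs):
    """Sample variance (divides by n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Assumed true scores: standard deviation 10, so true score variance is about 100.
true_scores = [rng.gauss(50, 10) for _ in range(n)]

# Two parallel forms: same true score, independent errors with equal variance (about 25).
form_a = [t + rng.gauss(0, 5) for t in true_scores]
form_b = [t + rng.gauss(0, 5) for t in true_scores]

# (a) Reliability as the correlation between parallel tests.
r_parallel = pearson(form_a, form_b)

# (b) Reliability as the proportion of observed score variance that is true score variance.
variance_ratio = sample_variance(true_scores) / sample_variance(form_a)

print(round(r_parallel, 2), round(variance_ratio, 2))
```

With these assumed variances both quantities should come out close to 100 / (100 + 25) = 0.8, which is why the correlation between parallel forms can stand in for the unobservable ratio of true score variance to observed score variance.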
  • 9. Three approaches to estimating reliability (pp. 173-185): 1. Internal consistency estimates are concerned primarily with sources of error from within the test and scoring procedures. 2. Stability estimates indicate how consistent test scores are over time. 3. Equivalence estimates provide an indication of the extent to which scores on alternate forms of a test are equivalent.
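As one illustration of an internal consistency estimate, the sketch below computes Cronbach's alpha. The slide does not name a specific coefficient, so the choice of alpha is an assumption, and the item responses are hypothetical.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a persons-by-items matrix of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores[0])                        # number of items
    items = list(zip(*item_scores))                # one tuple of scores per item
    item_variances = [variance(item) for item in items]
    total_scores = [sum(person) for person in item_scores]
    return k / (k - 1) * (1 - sum(item_variances) / variance(total_scores))

# Hypothetical right/wrong (1/0) responses: five test takers, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Alpha rises when the items vary together, so that the variance of the total scores is large relative to the sum of the individual item variances.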
  • 10. Problems with the classical true score model: (1) the CTS model treats error variance as homogeneous in origin; (2) the CTS model considers all error to be random, and consequently fails to distinguish systematic error from random error. https://www.youtube.com/watch?v=CSI-1Zk6oeM https://www.youtube.com/watch?v=k84ksLUWKKc
  • 11. Generalizability theory (G-theory) constitutes a theory and set of procedures for specifying and estimating the relative effects of different factors on observed test scores. It thus provides a means for relating the uses or interpretations of test scores to the way test users specify and interpret different factors, or sources of error.
  • 12.  A given measure or score is treated as a sample from a hypothetical universe of possible measures.  Interpreting a test score = generalizing from a single measure to a universe of measures. (on the basis of an individual’s performance on a test => generalize to her performance in other contexts).  The more reliable the sample of performance, or test score is, the more generalizable it is.
  • 13. Diagram: Reliability = generalizability; the extent of generalizability; defining the universe of measures; the universe of generalization.
  • 14. The application of G-theory to test development and use involves a generalizability study ('G-study') and a decision study ('D-study'). Together they (1) specify the different sources of variance, (2) estimate the relative importance of these different sources simultaneously, and (3) employ these estimates in the interpretation and use of test scores.
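The G-study and D-study steps can be sketched for the simplest case: a fully crossed design in which every person is scored by every rater. The single rater facet, the ratings, and the function names `g_study` and `d_study` are assumptions made for illustration, since the slides do not specify a design.

```python
def g_study(scores):
    """Estimate variance components for a crossed persons x raters design.

    `scores[p][r]` is the score given to person p by rater r.
    Uses the expected mean squares of a two-way ANOVA without replication.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r  # person-by-rater interaction plus error

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    return {
        "persons": max((ms_p - ms_res) / n_r, 0.0),  # universe score variance
        "raters": max((ms_r - ms_res) / n_p, 0.0),   # rater severity variance
        "residual": ms_res,
    }

def d_study(components, n_raters):
    """Generalizability coefficient for relative decisions with n_raters raters."""
    relative_error = components["residual"] / n_raters
    return components["persons"] / (components["persons"] + relative_error)

# Hypothetical ratings: five test takers, each scored by the same three raters.
ratings = [
    [7, 8, 6],
    [5, 6, 5],
    [9, 9, 8],
    [4, 6, 4],
    [6, 7, 6],
]

components = g_study(ratings)
g_one_rater = d_study(components, 1)
g_three_raters = d_study(components, 3)
print(components, round(g_one_rater, 3), round(g_three_raters, 3))
```

The G-study estimates how much variance each source contributes. The D-study then uses those estimates to ask how dependable scores would be under a particular measurement plan, here with one rater versus three.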
  • 15. The universe of generalization is the domain of uses or abilities to which we want test scores to generalize. The universe of measures comprises the types of test scores we would be willing to accept as indicators of the ability to be measured for the purpose intended.
  • 16. The population of persons consists of those about whom we are going to make decisions or inferences. The degree of generalizability determines the way we define the population. Example: if we use test results to make decisions about one group only, that group is the population of persons. If we use a test with more than one group (e.g. entrance or placement tests), we are generalizing beyond a particular group.
  • 17. If we could obtain measures for an individual under all the different conditions specified in the universe of possible measures, his average score on these measures might be considered the best indicator of his ability. The universe score is therefore defined as the mean of a person's scores on all measures from the universe of possible measures.
  • 18. The standard error of measurement is an indicator of how much we would expect an individual's test scores to vary, given a particular level of reliability. When investigating the amount of measurement error in individual test scores, we are looking at differences between test takers' obtained scores and their true scores.
  • 19. The error score is the difference between an obtained score and the true score. The more reliable the test is, the closer the obtained scores will cluster around the true score mean, and so the smaller the standard deviation of errors. The less reliable the test, the greater the standard deviation. Because of its importance in the interpretation of test scores, the standard deviation of the error scores has a name: the standard error of measurement (SEM).
  • 20. The SEM provides a means for applying estimates of reliability to the interpretation and use of individuals' observed test scores. Its primary advantage is that it makes test users aware of how much variability in observed scores to expect as a result of measurement error.
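The relationship between reliability and the SEM can be sketched with the standard classical formula, SEM = s_x * sqrt(1 - r_xx'), where s_x is the standard deviation of observed scores and r_xx' is the reliability estimate. The score, standard deviation, and reliability below are hypothetical, and the score bands assume normally distributed errors.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = sd * sqrt(1 - reliability), for a reliability between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

def score_band(observed, sem, z=1.0):
    """Band of z SEMs around an observed score (z=1 is roughly 68%, z=1.96 roughly 95%)."""
    return (observed - z * sem, observed + z * sem)

# Hypothetical test: observed-score standard deviation 10, reliability 0.84.
sem = standard_error_of_measurement(sd=10.0, reliability=0.84)

# Hypothetical test taker with an observed score of 70.
band_68 = score_band(70.0, sem)
band_95 = score_band(70.0, sem, z=1.96)

print(round(sem, 2), band_68, band_95)
```

Here the SEM is about 4 points, so a cut-off score of 68 would fall inside the one-SEM band around an observed score of 70. This is the kind of variability the SEM is meant to make test users aware of.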
  • 21.
  • 22. Norm-referenced (NR) test scores are designed to maximize inter-individual score differences, or score variance. Criterion-referenced (CR) test scores: - provide information about an individual's relative 'mastery' of an ability domain; - come from tests developed to be representative of the criterion ability; - occur in educational programs and language classrooms; - commonly come from achievement tests.
  • 24.  well-defined set of tasks or items that constitute a domain  CR test development  “true score” and “universe score”  “domain score”
  • 25.  Length of test  Difficulty of test and test score variance  Cut-off score
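The effect of test length on reliability is commonly illustrated with the Spearman-Brown prophecy formula, r_k = k * r / (1 + (k - 1) * r), where k is the factor by which the test is lengthened. The slide does not name this formula, so its use here is an assumption, as are the example values. The formula itself assumes that added items are comparable to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`.

    length_factor = 2 doubles the test; 0.5 halves it.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical test with an estimated reliability of 0.60.
doubled = spearman_brown(0.60, 2)    # twice as many comparable items
halved = spearman_brown(0.60, 0.5)   # half as many items

print(round(doubled, 3), round(halved, 3))
```

Under these assumptions, doubling the test raises the predicted reliability from 0.60 to 0.75, while halving it lowers the prediction to about 0.43.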
  • 26. Effects of systematic measurement error: - general effect - specific effect. Effects of test method: - systematic error - random error.
  • 28. The examination of validity: examining the validity of a given use of test scores is a complex process that must involve the examination of both the evidence that supports that interpretation or use and the ethical values that provide the basis or justification for that interpretation or use (Messick 1975, 1980, 1989). In test validation we are not examining the validity of the test content or even of the test scores themselves, but rather the validity of the way we interpret or use the information gathered through the testing procedure. Validity is not simply a function of the content and procedures of the test itself; it must also take into account how test takers perform.
  • 29. II.2. Reliability and validity revisited: the relationship between reliability and validity. Reliability is a requirement for validity. The investigation of reliability and validity can be viewed as complementary aspects of identifying, estimating, and interpreting different sources of variance in test scores. The investigation of reliability is concerned with answering the questions: How much variance in test scores is due to measurement error? How much variance is due to factors other than measurement error? Validity is concerned with identifying the factors that produce the reliable variance in test scores. The question addressed is: what specific abilities account for the reliable variance in test scores?
  • 30. II. Validation/Validity. Definition of validation. The relationship between reliability and validity, viewing the estimation of reliability as an essential requisite of validation. The framework proposed by Messick (1989) for considering validity as a unitary though multifaceted concept. The evidential basis for validity. Construct validity (includes content relevance and criterion relatedness). Test bias (including culture, test content, personality characteristics of test takers, sex, age). The ethical, or consequential, basis of test use.
  • 31. The relationship between reliability and validity (cont.). Another way to distinguish reliability from validity is to consider the theoretical frameworks upon which they depend. In estimating reliability we are concerned primarily with examining variance in test scores themselves. In validation, we must consider other sources of variance, and utilize a theory of the abilities that we hypothesize will affect test performance. The process of validation must look beyond reliability and examine the relationship between test performance and factors outside the test itself.
  • 32. The relationship between reliability and validity (cont.). However, the distinction between reliability and validity is still not always clear, because it is difficult to distinguish (1) different test methods from each other and (2) abilities from test methods.
  • 33. The relationship between reliability and validity (cont.). The classic statement of the relationship between reliability and validity by Campbell and Fiske (1959) can be read as a continuum. Reliability: agreement between similar measures of the same trait (for example, correlation between scores on parallel tests). Validity: agreement between different measures of the same trait (for example, correlation between scores on a multiple-choice test of grammar and ratings of grammar on an oral interview).
  • 34. In many cases, the distinctiveness of the test methods is not so clear. We must carefully consider not only similarities in the test content, but also similarities in the test methods, in order to determine whether correlations between tests should be interpreted as estimators of reliability or as evidence supporting validity. Language testing has a very special and complex problem when it comes to traits and methods: in language tests it is difficult to distinguish traits from methods.
  • 35. II.3. Validity as a unitary concept. Table (source of justification by function or outcome of testing). Evidential basis: test interpretation = construct validity; test use = construct validity + relevance/utility. Consequential basis: test interpretation = construct validity + value implications; test use = construct validity + relevance/utility + social consequences.
  • 36.
  • 37.  content relevance requires ‘the specification of the behavioral domain in question and the attendant specification of the task or test domain’ (Messick 1980: 1017).  content coverage, or the extent to which the tasks required in the test adequately represent the behavioral domain in question.
  • 38. The examination of content relevance and content coverage is a necessary part of the validation process.
  • 39. The criterion may be level of ability as defined by group membership, individuals' performance on another test of the ability in question, or their relative success in performing some task that involves this ability.
  • 40.  Information on concurrent criterion relatedness is undoubtedly the most commonly used in language testing.  There are 2 forms: (1) examining differences in test performance among groups of individuals at different levels of language ability. (2) examining correlations among various measures of a given ability.
  • 41. We need to collect data showing a relationship between scores on the test and job or course performance. In doing so we can largely ignore the question of what abilities are being measured.
  • 42. Construct validity is indeed the unifying concept that integrates criterion and content considerations into a common framework for testing rational hypotheses about theoretically relevant relationships (Messick 1980: 1015). Construct validation requires both logical analysis and empirical investigation.
  • 43. The test developer involved in the process of construct validation is likely to collect several types of empirical evidence. These may include any or all of the following: (1) the examination of patterns of correlations among item scores and test scores, and between characteristics of items and tests and scores on items and tests; (2) analyses and modeling of the processes underlying test performance; (3) studies of group differences; (4) studies of changes over time; (5) investigation of the effects of experimental treatment (Messick 1989).
  • 44. Correlational evidence is derived from a family of statistical procedures that examine the relationships among variables, or measures. A correlation is a functional relationship between two measures. Correlational approaches to construct validation may utilize both exploratory and confirmatory modes.
  • 45. It is impossible to make clear, unambiguous inferences regarding the influence of various factors on test scores on the basis of a single correlation between two tests.
  • 46. A commonly used procedure for interpreting a large number of correlations is factor analysis
  • 47.  Characteristic: each measure is considered to be a combination of trait and method, and tests are included in the design so as to combine multiple traits with multiple methods.  Advantage: permits the investigator to examine patterns of both convergence and discrimination among correlations. Convergence is essentially what
  • 48. The data can be analyzed in several ways: (1) the direct inspection of convergent and discriminant correlations; (2) the analysis of variance; (3) confirmatory factor analysis.
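The first option, direct inspection of convergent and discriminant correlations, can be sketched with simulated data. Everything in this example is an assumption made for illustration: two traits (grammar, vocabulary), two methods (multiple choice, interview), independent traits, and the sizes of the trait, method, and error effects.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def average(values):
    return sum(values) / len(values)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
traits = ["grammar", "vocabulary"]
methods = ["multiple_choice", "interview"]

# Each measure is simulated as trait level + method effect + random error.
scores = {(t, m): [] for t in traits for m in methods}
for _ in range(3000):
    trait_level = {t: rng.gauss(0, 1.0) for t in traits}
    method_effect = {m: rng.gauss(0, 0.5) for m in methods}
    for t in traits:
        for m in methods:
            scores[(t, m)].append(trait_level[t] + method_effect[m] + rng.gauss(0, 0.5))

convergent, same_method, neither = [], [], []
for a, b in combinations(scores, 2):
    r = pearson(scores[a], scores[b])
    if a[0] == b[0]:
        convergent.append(r)    # same trait, different methods
    elif a[1] == b[1]:
        same_method.append(r)   # different traits, same method
    else:
        neither.append(r)       # different traits, different methods

print(round(average(convergent), 2),
      round(average(same_method), 2),
      round(average(neither), 2))
```

Evidence for construct validity in this design is a pattern in which the same-trait correlations (convergence) are clearly higher than the correlations between different traits, including those that share a method (discrimination).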
  • 49. In a true experimental design, individuals are assigned at random to two or more groups, each of which is given a different treatment. At the end of the treatment, observations are made to investigate differences among the different groups. There are two distinguishing characteristics of a true experimental design. The first is that of randomization, which means that (1) a sample of subjects is randomly selected from a population, and (2) the individuals in this random sample are then randomly assigned to two or more groups for comparison. The second characteristic is that of experimental intervention, or treatment. This means that the different groups of subjects are exposed to distinct treatments, or sets of circumstances, as part of the experiment.
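The two randomization steps described above can be sketched as follows. The population of 100 numbered candidates, the sample size, and the helper name `randomly_assign` are hypothetical.

```python
import random

def randomly_assign(subjects, n_groups, rng):
    """Shuffle the subjects and deal them out into n_groups groups of near-equal size."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

rng = random.Random(2013)  # fixed seed so the sketch is repeatable

# Step 1: randomly select a sample from a (hypothetical) population of candidates.
population = list(range(1, 101))
sample = rng.sample(population, 20)

# Step 2: randomly assign the sampled individuals to two treatment groups.
group_a, group_b = randomly_assign(sample, 2, rng)

print(sorted(group_a))
print(sorted(group_b))
```

Because both selection and assignment are left to chance, systematic differences between the groups at the end of the experiment are more plausibly attributed to the treatment than to pre-existing differences among individuals.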
  • 50.  the process of construct validation is a complex and continuous undertaking, involving both ( 1) theoretical, logical analysis leading to empirically testable hypotheses, and (2) a variety of appropriate approaches to empirical observation and analysis  The result of this process of construct validation will be a statement regarding the extent to which the test under consideration provides a valid basis for making inferences about the given ability with respect to the types of individuals and contexts that have provided the setting for the validation research.
  • 51. 51 What is test bias?  Systematic differences in test performance , resulted by the differences in individual characteristics  Examples: Gender Difference in Mathematical Ability A reliable mathematics test to a representative groups of males and females. On average, males have higher scores than females => Tendency to interpret that: “males have greater mathematical ability than female”
  • 52. 52 However, test score should not be interpreted to reflect purely mathematical ability. The differences b/w test scores due to test score bias, NOT due to differences in true mathematical abilitt => Differences in group performance do not indicate test bias. => The systematic differences which are not logically related to the ability in the questions/ tests => test is biased
  • 53. 53  + misinterpretation of test score  + sexist / racist content (content validity)  + unequal prediction of criterion performance  + unfair content (content validity)  + inappropriate selection procedures  + inadequate criterion measures  + threatening atmosphere  + conditions of testing
  • 54. 54 + Cultural background  Cultural differences (Britre (1968, 1973: Britre and Brown. 1971)  The problem of cultural content ((Plaister (1967) and Condon (1975))  In item response theory, some items in multiple - choice vocabulary are in favor of one linguistic and cultural subgroups (Chen and Henning (1985))  Aptitude tests: possibly biased toward culturally different groups (Zeidner (1986)) + Background knowledge  Prior knowledge affects test performance (Chacevycn et al. (1982))  In ESP testing, students' performance: affected as much by their prior knowledge as by their language proficiency + Cognitive characteristics  Cognitive factors influence language acquisition (Brown 1987)
  • 55. 55  Cognitive styles/ learning styles:  + field- dependent/ independent  a field-independent learning style is defined by a tendency to separate details from the surrounding context ( cited from http://www.teachingenglish.org.uk/knowledge- database/field-independent-learners)  a field-dependent learning style, which is defined by a relative inability to distinguish detail from other information around it
  • 56. 56  Example  Field-independent learners tend to rely less on the teacher or other learners for support.  => Psychological differences  Ambiguity tolerance/ intolerance : cognitive flexibility  Tolerance of ambiguity: one's acceptance of confusing situation and a lack of clear r line demarcation (Ely (1989)),  One facet of personality characteristics : related to risk taking . Those who can tolerate ambiguity are more likely to take risks in language learning, an essential of making progress on the language acquisition  (As cited in Grace ,1997)
  • 57. 57  Test: serve the need of an educational system or of society  The use of language tests reflect in microcosm the role of test in general as instrument of social policy  The role of tests can be described via kinds of tests  + placement  + diagnosis  + selection (based in the proficiency/ achievement )  + evaluation  + making decisions  The issues involved in the ethics of tests:  + numerous  + vary across societies, cultures, testing contexts
  • 58. 58 => focus on the rights of individual test takers: + secrecy + access to information + privacy + confidentiality + consent + the balance between individual rights and the values of society
  • 59. 59  As test developers and test users, people need to consider: + the rights and interests of test takers + the responsibilities of institutions for making decisions based on tests + the public interest  These considerations are political and dynamic, and they vary across societies  These considerations have implications for the practice of the teaching profession, the kinds of tests to be developed, and the ways in which test use is justified.
  • 60. 60  We must move out of the comfortable confines of applied linguistic and psychometric theory and into the arena of public policy.  Hulin et al. (1983):  "it is important to realize that testing and social policy cannot be totally separated and that questions about the use of tests cannot be addressed without considering existing social forces, whatever they are" (p. 285)  4 areas of consideration in the ethical use and interpretation of test results (Messick 1980, 1988b): + construct validity: the evidence that supports the interpretation of test scores + the value systems that inform test use + the practical usefulness of the test + the consequences to the educational system or society of using test results for a particular purpose
  • 61. 61 In short, complete evidence should be provided: + to show that tests are valid indicators of the abilities that are appropriate to the intended use + to determine the use of the test
  • 62. 62  Face validity: the appeal or appearance of a test  Does the test appear to measure what it is supposed to measure?  Test appearance has a considerable effect on the acceptability of tests to both test takers and test users.  It affects whether test takers take the test seriously enough to try their best, whether the test is accepted, and whether it is seen as useful.  => Test takers' reactions influence the validity and reliability of tests.